Difference between revisions of "WMDE/Wikidata/Growth"

imported>Addshore
(Undo revision 1843411 by Addshore (talk) not supported here apparently)
imported>Jforrester
m (→‎MediaInfo: Add context to 2.86m number)

Revision as of 16:06, 9 November 2019

Last updated March 2019

Data & Writing

Edit rate

Wikidata edit rate (per year)

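2019-20 prediction: no vast increase in rate, with 200 million - 250 million edits over the year.

Data from Hadoop: https://phabricator.wikimedia.org/P8193. Yearly EPM (edits per minute) is calculated as X/365/24/60, where X is the number of edits in the year.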

Past rate:

  • 2012, 2,912,964
  • 2013, 94,323,394 (179 EPM)
  • 2014, 87,411,229 (166 EPM)
  • 2015, 102,362,226 (194 EPM)
  • 2016, 135,511,683 (257 EPM)
  • 2017, 192,353,549 (365 EPM)
  • 2018, 208,944,716 (397 EPM)
  • 2019, (415 EPM) (to the end of October 2019)

Yearly edit rate equivalent sustained EPMs

To put the yearly figures in perspective, the conversion table below maps yearly edit counts to the sustained (average) EPM for the year.

Yearly edits   Sustained EPM
200 million    380
300 million    570
600 million    1141
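
The conversion above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the figures on this page are rounded down and that leap years are ignored, as in the X/365/24/60 formula; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored, as in X/365/24/60


def yearly_edits_to_epm(yearly_edits: int) -> int:
    """Return the sustained edits per minute for a yearly edit count, rounded down."""
    return yearly_edits // MINUTES_PER_YEAR


if __name__ == "__main__":
    # Round-number rows from the conversion table.
    for edits in (200_000_000, 300_000_000, 600_000_000):
        print(f"{edits:>11,} edits/year = {yearly_edits_to_epm(edits)} EPM")
```

Passing a past yearly edit count (for example 208,944,716 for 2018) gives the EPM value listed next to it in the past rate list.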

Revision count

Can be retrieved at any given time by looking at the rev id of the latest new page creation on https://www.wikidata.org/wiki/Special:NewPages

Wikidata

  • March 2019 we are at 881,499,873 revisions.
  • June 2019 we are at 965,310,320 revisions.
  • October 2019 we are at 1,042,114,532 revisions.

This was expected to pass 1 billion by the end of 2019; the October 2019 figure above shows it already has.

In 2018 the yearly edit count was 208,944,716. The revision count is predicted to keep growing by around 200 million - 250 million for 2019-20.

Long term, reaching 4,294,967,295 (bigint rev ids)

Based on what we know now, we predict that we will not need bigints on the revision table until at least 2025, and likely later.

Year (end)   Predicted increase   Total
2019         200-250 million      1.1-1.2 billion
2020         200-300 million      1.3-1.5 billion
2021         200-350 million      1.5-1.85 billion
2022         200-400 million      1.7-2.25 billion
2023         200-450 million      1.9-2.7 billion
2024         200-500 million      2.1-3.2 billion
2025         200-550 million      2.3-3.75 billion
2026         200-600 million      2.5-4.35 billion
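
The totals in this table are cumulative sums of the predicted yearly increases. The sketch below recomputes them and reports the first year-end at which the upper estimate would exceed 4,294,967,295. The starting totals and increase ranges are copied from the table; the function and constant names are illustrative assumptions, and the eighth row is labelled as the year after 2025.

```python
UNSIGNED_INT_MAX = 2**32 - 1  # 4,294,967,295

# Values in millions of revisions, copied from the table.
START_YEAR = 2019
START_TOTAL = (1_100, 1_200)  # (low, high) total at the end of 2019
YEARLY_INCREASE = [  # (low, high) predicted increase for each following year
    (200, 300), (200, 350), (200, 400), (200, 450),
    (200, 500), (200, 550), (200, 600),
]


def project_totals(start_year, start_total, increases):
    """Return a list of (year, low_total, high_total) rows, in millions."""
    low, high = start_total
    rows = [(start_year, low, high)]
    for offset, (inc_low, inc_high) in enumerate(increases, start=1):
        low += inc_low
        high += inc_high
        rows.append((start_year + offset, low, high))
    return rows


def first_year_over_limit(rows, limit=UNSIGNED_INT_MAX):
    """Return the first year whose high estimate exceeds the limit, or None."""
    for year, _low, high in rows:
        if high * 1_000_000 > limit:
            return year
    return None


if __name__ == "__main__":
    rows = project_totals(START_YEAR, START_TOTAL, YEARLY_INCREASE)
    for year, low, high in rows:
        print(f"{year}: {low / 1000:.2f}-{high / 1000:.2f} billion")
    print("Upper estimate first exceeds the limit in:", first_year_over_limit(rows))
```

Working in whole millions keeps the arithmetic exact. The increase ranges are predictions, so the result is only as reliable as those inputs.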

Commons

  • June 2019 we are at 354,280,797 revisions.
  • October 2019 we are at 372,682,216 revisions.

Entity size

Average size

  • Average size of items remains pretty steady, ~18KB in March 2019
  • The 2019-20 prediction is that this will not increase to over ~30KB
  • Lexeme size isn't tracked, but assumed to be much smaller than items.

Max size

  • In 2019 the max size of entities was increased from 2500 to 3000.

Storage in memcached

Currently (March 2019) the size of entities could become an issue for storage in the shared memcached cache when they reach 1MB.

See WMDE/Wikidata/Caching#WikiPageEntityRevisionLookup for more details.

Right now the biggest shared cache entity is less than 200k, meaning the max entity size limit would have to increase to around 15,000 to become an issue[citation needed].

Changes to the way the serialization is stored could accelerate this, though.
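
The figure of around 15,000 appears to come from scaling the current limit proportionally. A minimal sketch of that arithmetic follows. It assumes that the cached size grows linearly with the max entity size limit (the page itself marks this claim as needing a citation) and that 1MB is treated as 1000k. The function name is hypothetical.

```python
def limit_at_cache_threshold(current_limit, largest_cached_kb, threshold_kb):
    """Scale the entity size limit linearly until the largest cached entity
    would reach the cache threshold. Linear scaling is an assumption."""
    return current_limit * threshold_kb / largest_cached_kb


if __name__ == "__main__":
    # Current limit 3000, biggest cached entity just under 200k, issue at 1MB.
    print(limit_at_cache_threshold(3000, 200, 1000))
```

Because the biggest cached entity is described as less than 200k, the result is a lower bound on the limit at which the cache size would become an issue.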

Number of Entities by type

Grafana: https://grafana.wikimedia.org/d/000000167/wikidata-datamodel

Items

2019-20 predicted growth 10 million - 20 million, resulting in no more than 73 million items.

Past growth:

  • 2016-17 5.3 million
  • 2017-18 17.7 million
  • 2018-19 11.3 million

Properties

2019-20 predicted growth: an increase of 1500 - 3000 properties, resulting in no more than 9000 properties.

This takes into account that the rate of creation has increased every year, and that Commons will start using properties in 2019, which may increase property creation.

Past growth:

  • 2016-17, 900
  • 2017-18, 1200
  • 2018-19, 1500

Lexemes

Lexemes were only released to the world in 2018, so their growth is hard to predict.

The last 9 months (to March 2019) have seen an increase from 3509 to 43500.

Unless something drastic happens we would comfortably stay below 1 million lexemes for 2019-2020.

No prediction for Forms or Senses here...

MediaInfo

There is currently no Grafana tracking for MediaInfo entities.

DB query for counting the current number of MediaInfo entities: https://quarry.wmflabs.org/query/34303

  • 2019-03-xx 273,540 entities, out of 52 million files = 0.5% of files
  • 2019-06-20 1,210,558 entities, out of 54.4 million files = 2.2% of files
  • 2019-11-09 2,861,376 entities, out of 57.1 million files ≈ 5.0% of files
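
The percentages in this list are the entity count divided by the number of files. A small sketch to recompute them is shown below; the file counts are the rounded figures from the list, and the function name is illustrative.

```python
def coverage_percent(entities: int, files: float) -> float:
    """Return the share of files that have a MediaInfo entity, in percent (1 decimal)."""
    return round(entities / files * 100, 1)


if __name__ == "__main__":
    snapshots = [
        ("2019-03-xx", 273_540, 52_000_000),
        ("2019-06-20", 1_210_558, 54_400_000),
        ("2019-11-09", 2_861_376, 57_100_000),
    ]
    for date, entities, files in snapshots:
        print(f"{date}: {coverage_percent(entities, files)}% of files")
```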

The number of MediaInfo entities is expected to eventually match the number of files on Commons (>50 million).

"By the end of the calendar year, we expect at least 5 million files to have structured data" - https://phabricator.wikimedia.org/T226093#5268771

DB Tables size

Latest info on auto inc fields running out of space: https://phabricator.wikimedia.org/P8198

wb_terms

wb_terms is VERY big (on disk), and is going to see no further adoption.

It is going to be killed in 2019.

TBA current growth predictions.

text & revisions

These tables will share the same growth pattern in terms of auto inc ids and the need to switch to bigints.

See predicted revision count in WMDE/Wikidata/Growth#Revision_count.

recentchanges & cu_changes

Based on the predicted revision increase rate (WMDE/Wikidata/Growth#Revision_count), we would fill the current auto increment fields between 2022 and 2024.

Data below from March 2019:

        table_schema: wikidatawiki
          table_name: recentchanges
         column_name: rc_id
           data_type: int
         column_type: int(11)
           is_signed: 1
         is_unsigned: 0
           max_value: 2147483647
      auto_increment: 919219099
auto_increment_ratio: 0.4280

        table_schema: wikidatawiki
          table_name: cu_changes
         column_name: cuc_id
           data_type: int
         column_type: int(11)
           is_signed: 1
         is_unsigned: 0
           max_value: 2147483647
      auto_increment: 899023427
auto_increment_ratio: 0.4186
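
The auto_increment_ratio values above are the current auto_increment divided by the maximum value of a signed int. The sketch below recomputes the ratio and gives a rough estimate of the remaining headroom. The ids_per_year argument is an assumed input, not a measured figure: ids in these tables are not necessarily consumed at the same rate as revisions are created. The function names are illustrative.

```python
SIGNED_INT_MAX = 2**31 - 1  # 2,147,483,647, the max_value shown above


def auto_increment_ratio(auto_increment: int, max_value: int = SIGNED_INT_MAX) -> float:
    """Return the fraction of the id space that is already used."""
    return auto_increment / max_value


def years_of_headroom(auto_increment: int, ids_per_year: int,
                      max_value: int = SIGNED_INT_MAX) -> float:
    """Return the years until the id space is full at a constant, assumed yearly rate."""
    return (max_value - auto_increment) / ids_per_year


if __name__ == "__main__":
    for table, current in (("recentchanges", 919_219_099), ("cu_changes", 899_023_427)):
        ratio = auto_increment_ratio(current)
        # 250 million ids per year is a hypothetical rate, used for illustration only.
        years = years_of_headroom(current, 250_000_000)
        print(f"{table}: ratio {ratio:.4f}, ~{years:.1f} years at 250M ids/year")
```

A constant rate is the simplest possible model. If the yearly rate keeps rising, as the revision predictions on this page assume, the fields fill sooner than this estimate.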

Misc storage

WikibaseQualityConstraints check data

TBA (we are going to persistently store this stuff)

Usage & Reading

TBA more stuff?

Wikidata.org / Repo

3rd party federated wikis

At some point we will develop federation for 3rd parties. This will likely result in an increase in requests to Special:EntityData and/or the API. More details to come in the future...

3rd party WDQS updaters

As identified in https://phabricator.wikimedia.org/T217897#5020183, WDQS updaters, both internal to WMF and external, hit Special:EntityData a lot. These requests account for most of the cache misses on wikidata.org.

The PHP processing for these requests is fairly lightweight, but continued uncached requests here translate directly into increased reads from the shared entity revision cache in memcached.

WDQS

Naturally this is predicted to increase but this is mainly for the WMF discovery team to worry about.

There will likely be a growth in internal WMF requests (particularly from Wikibase quality constraints) as the checks are planned to run after every edit. Thus as edit rate increases the number of these checks increases.

Comments

Lydia growth thoughts from early 2019

  • Creation Rate
    • Items
      • Interest in project is growing
      • OTOH some groups are splitting out into own projects
      • Creation rate will not slow down
    • Property
      • Creation rate may slow down a bit, but follow existing trend
    • MediaInfo
      • Huge growth expected, number of M entities to be similar to number of files on Commons
      • Commons is also expected to grow at a high rate
      • Properties for Commons?
        • No significant rise in number expected.
    • Lexemes
      • Early stage of project, significant growth expected
      • Auto generating forms and senses - how much data is actually stored (curatable?) vs generated on the fly (i.e. only materialized when requested)
  • General edit rate growth
    • Editing from clients (Wikipedias)
      • volume of edits currently comparable to bot edit volume
  • Growth in the size of the entity
    • On average each item will have more data
  • Data used on client wikis
  • WDQS
    • WMF should be taking care of this
  • External to wikidata?
    • Non-WMF federated wikis accessing Wikidata data