WMDE/Wikidata/Growth

From Wikitech-static
Revision as of 10:29, 14 March 2019 by imported>Addshore (→‎Max size: fix link)

Last updated March 2019

Data & Writing

Edit rate

Wikidata edit rate (per year)

2019-20 prediction: no vast increase in rate, 200 million - 250 million edits per year.

Data from Hadoop: https://phabricator.wikimedia.org/P8193

Yearly EPM (edits per minute) is calculated as X/365/24/60, where X is the yearly edit count.

Past rate:

  • 2012, 2,912,964
  • 2013, 94,323,394
  • 2014, 87,411,229
  • 2015, 102,362,226 (194 EPM)
  • 2016, 135,511,683 (257 EPM)
  • 2017, 192,353,549 (365 EPM)
  • 2018, 208,944,716 (397 EPM)

Yearly edit rate equivalent sustained EPMs

To put the yearly figures in perspective, the conversion table below maps yearly edits to the sustained (average) EPM for the year.

Yearly edits   Sustained EPM
200 million    380 EPM
300 million    570 EPM
600 million    1141 EPM
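
The X/365/24/60 conversion can be sketched in Python as follows. This is only an illustration: the function name `yearly_edits_to_epm` is not from the source data, a 365-day year is assumed as in the formula, and results are rounded down because that appears to be how the figures on this page were derived.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def yearly_edits_to_epm(yearly_edits: int) -> int:
    """Convert a yearly edit count to sustained edits per minute (rounded down)."""
    return yearly_edits // MINUTES_PER_YEAR


if __name__ == "__main__":
    # Round figures from the conversion table plus the 2018 edit count.
    for edits in (200_000_000, 300_000_000, 600_000_000, 208_944_716):
        print(f"{edits:>13,} edits/year is about {yearly_edits_to_epm(edits)} EPM")
```

Rounding to the nearest integer instead would shift some values by one (for example 381 rather than 380 for 200 million).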

Revision count

As of March 2019 we are at 881,499,873 revisions. This will probably increase to 1 billion by the end of 2019. In 2018 the yearly edit count was 208,944,716. The rate is predicted to continue at around 200 million - 250 million edits per year for 2019-20.

Long term: reaching 4,294,967,295 revisions (the point at which bigint rev ids are needed)

Based on what we know now, we predict that we would not need bigints on the revision table until at least 2025, and likely further in the future.

Year (end)   Yearly increase    Total
2019         200-250 million    1.1-1.2 billion
2020         200-300 million    1.3-1.5 billion
2021         200-350 million    1.5-1.85 billion
2022         200-400 million    1.7-2.25 billion
2023         200-450 million    1.9-2.7 billion
2024         200-500 million    2.1-3.2 billion
2025         200-550 million    2.3-3.75 billion
2026         200-600 million    2.5-4.35 billion
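
As a rough cross-check of the headroom, the sketch below estimates how many years remain before the revision count reaches 4,294,967,295. It assumes a constant yearly rate, which is simpler than the table above (where the upper bound grows every year), and the function name `years_until_limit` is purely illustrative.

```python
UNSIGNED_INT_MAX = 4_294_967_295  # 2**32 - 1, the limit quoted above
REVISIONS_MARCH_2019 = 881_499_873  # revision count quoted above


def years_until_limit(current: int, limit: int, per_year: int) -> float:
    """Years of headroom left if ids are consumed at a constant yearly rate."""
    if per_year <= 0:
        raise ValueError("per_year must be positive")
    return max(limit - current, 0) / per_year


if __name__ == "__main__":
    # 200-250 million is the 2019-20 prediction; 600 million is the largest
    # yearly increase considered in the table.
    for rate in (200_000_000, 250_000_000, 600_000_000):
        years = years_until_limit(REVISIONS_MARCH_2019, UNSIGNED_INT_MAX, rate)
        print(f"{rate:,} revisions/year: about {years:.1f} years of headroom")
```

Under this constant-rate assumption, counted from March 2019, even a sustained 600 million revisions per year leaves more than five years of headroom.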

Entity size

Average size

  • Average size of items remains pretty steady, ~18KB in March 2019
  • The 2019-20 prediction does not see this increasing beyond ~30KB.
  • Lexeme size isn't tracked, but is assumed to be much smaller than items.

Max size

  • In 2019 the max size of entities was increased from 2500 to 3000.

Storage in memcached

Currently (March 2019) the size of entities could become an issue for storage in the shared memcached cache when they reach 1MB.

See WMDE/Wikidata/Caching#WikiPageEntityRevisionLookup for more details.

Right now the biggest shared cache entity is less than 200k, meaning the max entity size limit would have to increase to around 15,000 to become an issue[citation needed].

Changes in the way the serialization is stored could, however, accelerate this.
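
The "around 15,000" figure is a proportional scaling, which can be sketched as below. This assumes that the largest cached entity grows linearly with the max entity size limit, and that "200k" and "1MB" mean 200,000 and 1,000,000 bytes; neither assumption is confirmed here, and the function name is illustrative.

```python
def limit_at_cache_threshold(current_limit: float,
                             largest_cached_bytes: float,
                             cache_threshold_bytes: float) -> float:
    """Scale the max entity size limit until the largest cached entity
    would reach the cache threshold (assumes linear scaling)."""
    if largest_cached_bytes <= 0:
        raise ValueError("largest_cached_bytes must be positive")
    return current_limit * cache_threshold_bytes / largest_cached_bytes


if __name__ == "__main__":
    # Current limit 3000, largest cached entity ~200k, threshold 1MB.
    print(limit_at_cache_threshold(3000, 200_000, 1_000_000))
```

Because the largest cached entity is described as "less than 200k", the value this produces is a lower bound on the limit at which the threshold would be hit.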

Number of Entities by type

Grafana: https://grafana.wikimedia.org/d/000000167/wikidata-datamodel

Items

2019-20 predicted growth: 10 million - 20 million items, resulting in no more than 73 million items.

Past growth:

  • 2016-17 5.3 million
  • 2017-18 17.7 million
  • 2018-19 11.3 million

Properties

2019-20 predicted growth: an increase of 1500 - 3000 properties, resulting in no more than 9000 properties.

This takes into account the fact that the rate of creation has increased every year, and also that Commons will start using properties in 2019, which may lead to an increase in property creation.

Past growth:

  • 2016-17, 900
  • 2017-18, 1200
  • 2018-19, 1500

Lexemes

Lexemes were only released to the world in 2018, so their growth is hard to predict.

The last 9 months (to March 2019) have seen an increase from 3509 to 43500.

Unless something drastic happens we would comfortably stay below 1 million lexemes for 2019-2020.
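
To illustrate the margin, the sketch below extends the observed 9-month increase in a straight line. Linear growth is only an assumption made for this example; an early-stage project may well grow faster, and the function name is illustrative.

```python
def linear_projection(start_count: int, end_count: int,
                      months_observed: int, months_ahead: int) -> float:
    """Extend the average monthly increase seen so far by months_ahead."""
    monthly_increase = (end_count - start_count) / months_observed
    return end_count + monthly_increase * months_ahead


if __name__ == "__main__":
    # 3509 -> 43500 lexemes over the 9 months to March 2019.
    for months in (12, 24):
        projected = linear_projection(3509, 43500, 9, months)
        print(f"+{months} months: about {projected:,.0f} lexemes")
```

Under this linear assumption the count stays well below 1 million over the next two years.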

No prediction for Forms or Senses here...

MediaInfo

There is currently no Grafana tracking for MediaInfo entities.

DB query for counting the current number of MediaInfo entities: https://quarry.wmflabs.org/query/34303

March 2019: 273,540 MediaInfo entities, out of 52 million files.

MediaInfo entities have the potential to match the number of files on Commons (over 50 million).

DB Tables size

Latest info on auto inc fields running out of space: https://phabricator.wikimedia.org/P8198

wb_terms

wb_terms is VERY big (on disk) and is going to see no further adoption.

It is going to be killed in 2019.

TBA current growth predictions.

text & revisions

These tables will share the same growth pattern in terms of auto inc ids and the need to switch to bigints.

See predicted revision count in WMDE/Wikidata/Growth#Revision_count.

recentchanges & cu_changes

Based on the predicted revision increase rate (see WMDE/Wikidata/Growth#Revision_count), we would fill the current auto increment fields between 2022 and 2024.

Data below from March 2019:

        table_schema: wikidatawiki
          table_name: recentchanges
         column_name: rc_id
           data_type: int
         column_type: int(11)
           is_signed: 1
         is_unsigned: 0
           max_value: 2147483647
      auto_increment: 919219099
auto_increment_ratio: 0.4280

        table_schema: wikidatawiki
          table_name: cu_changes
         column_name: cuc_id
           data_type: int
         column_type: int(11)
           is_signed: 1
         is_unsigned: 0
           max_value: 2147483647
      auto_increment: 899023427
auto_increment_ratio: 0.4186
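
The auto_increment_ratio values above are simply auto_increment divided by max_value. A minimal Python sketch of that calculation, using the March 2019 figures (the function names are illustrative and are not part of the original query):

```python
SIGNED_INT_MAX = 2_147_483_647  # max_value of a signed int(11) column


def auto_increment_ratio(auto_increment: int,
                         max_value: int = SIGNED_INT_MAX) -> float:
    """Fraction of the id space already used."""
    return auto_increment / max_value


def remaining_ids(auto_increment: int, max_value: int = SIGNED_INT_MAX) -> int:
    """Number of ids left before the column overflows."""
    return max_value - auto_increment


if __name__ == "__main__":
    tables = {"recentchanges": 919_219_099, "cu_changes": 899_023_427}
    for name, value in tables.items():
        print(f"{name}: ratio {auto_increment_ratio(value):.4f}, "
              f"{remaining_ids(value):,} ids remaining")
```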

Misc storage

WikibaseQualityConstraints check data

TBA (we are going to persistently store this stuff)

Usage & Reading

TBA more stuff?

Wikidata.org / Repo

3rd party federated wikis

At some point we will develop federation for 3rd parties. This will likely result in an increase in requests to Special:EntityData and/or the API. More details to come in the future...

3rd party WDQS updaters

As identified in https://phabricator.wikimedia.org/T217897#5020183 WDQS updaters both internal to WMF and external hit Special:EntityData a lot. These requests account for most of the cache misses on wikidata.org.

The PHP processing for these queries is fairly lightweight, but continued uncached requests here will directly increase reads from the shared entity revision cache in memcached.

WDQS

Naturally this is predicted to increase, but this is mainly for the WMF Discovery team to worry about.

There will likely be a growth in internal WMF requests (particularly from Wikibase quality constraints) as the checks are planned to run after every edit. Thus as edit rate increases the number of these checks increases.

Comments

Lydia's growth thoughts from early 2019

  • Creation Rate
    • Items
      • Interest in project is growing
      • OTOH some groups are splitting out into own projects
      • Creation rate will not slow down
    • Property
      • Creation rate may slow down a bit, but follow existing trend
    • MediaInfo
      • Huge growth expected, number of M entities to be similar to number of files on commons
      • Commons is also expected to grow at a high rate
      • Properties for Commons?
        • No significant rise in number expected.
    • Lexemes
      • Early stage of project, significant growth expected
      • Auto generating forms and senses - how much data is actually stored (curatable?) vs generated on the fly (i.e. only materialized when requested)
  • General edit rate growth
    • Editing from clients (Wikipedias)
      • Volume of edits currently comparable to bot edit volume
  • Growth in the size of the entity
    • On average each item will have more data
  • Data used on client wikis
  • WDQS
    • WMF should be taking care of this
  • External to wikidata?
    • Non-WMF federated wikis accessing Wikidata data