Analytics/Data Lake/Edits/Structured data/Commons entity



The structured_data.commons_entity table (available in Hive) is a conversion of the commonswiki structured-data entity JSON dumps to Parquet. In Wikibase (the underlying data model of structured data), entity information is stored as JSON, so the dumps come in that format. A conversion to Parquet gives a better data format, as the JSON makes extensive use of maps (objects with variable keys), which are awkward to work with in Parquet when the data model is actually well defined.
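To illustrate why the conversion is needed, here is a minimal Python sketch of the general idea: a JSON map keyed by language code is turned into an array of structs, a shape that maps cleanly onto a Parquet array&lt;struct&gt; column. The entity fragment and the flatten_labels helper are hypothetical simplifications, not the actual conversion code.

```python
import json

# Simplified, hypothetical Wikibase entity fragment: "labels" is a JSON map
# keyed by language code, which fits Parquet poorly when the rest of the
# schema is fixed.
entity_json = '''
{
  "id": "M12345",
  "labels": {
    "en": {"language": "en", "value": "Example"},
    "fr": {"language": "fr", "value": "Exemple"}
  }
}
'''

def flatten_labels(entity):
    """Turn the labels map into an array of structs, the shape a Parquet
    array<struct<language,value>> column expects."""
    return {
        "id": entity["id"],
        "labels": [
            {"language": lang, "value": label["value"]}
            for lang, label in sorted(entity["labels"].items())
        ],
    }

converted = flatten_labels(json.loads(entity_json))
print(converted["labels"][0])  # {'language': 'en', 'value': 'Example'}
```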

New full dumps are generated and copied to the Analytics Hadoop cluster every week, then converted to Parquet and added as a new snapshot in the Hive table.

Current Schema[edit | edit source]

$ hive --database structured_data

hive (structured_data)> describe commons_entity;

TODO: Add description when table will be created.

Notice the snapshot field. It is a Hive partition, an explicit mapping to the weekly imports in HDFS. You must include this partition predicate in the where clause of your queries (even if it is just snapshot > '0'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns results. For example, if you are only interested in the 2020-01-20 snapshot, you should add where snapshot = '2020-01-20'. This instructs Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
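A query restricted to one partition might look like the sketch below. The snapshot date and the id field are assumptions for illustration, since the table schema is not yet documented here.

```sql
-- Hypothetical example: '2021-12-06' and the id field are assumptions.
SELECT id
FROM structured_data.commons_entity
WHERE snapshot = '2021-12-06'   -- partition predicate: always required
LIMIT 10;
```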

Sample queries[edit | edit source]

TODO: Add query example

Changes and known problems since 2020-02[edit | edit source]

Date from | Task | Details
2021-12 | task T258834 | Table is created and first imports and conversions are automated.

See also[edit | edit source]