Analytics/Data Lake/Edits/Structured data/Commons entity
The structured_data.commons_entity table (available in Hive) is a conversion to Parquet of the commonswiki structured-data entities JSON dumps. In Wikibase (the data model underlying structured data), entity information is stored as JSON, so the dumps are produced in that format. Converting to Parquet yields a better data format, because the JSON makes extensive use of maps (objects with variable keys), which are awkward to use in Parquet when the data model is actually well defined.
New full dumps are generated and copied to the Analytics Hadoop cluster every week, then converted to Parquet and added as a new snapshot in the Hive table.
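To see which weekly snapshots have been imported so far, you can list the table's partitions. This is a sketch using standard HiveQL; the exact partition values depend on the import schedule:

```sql
-- List available weekly snapshots (each partition corresponds to one import)
SHOW PARTITIONS structured_data.commons_entity;
```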
Current Schema
$ hive --database structured_data
hive (wmf)> describe commons_entity;
TODO: Add a schema description once the table is created.
The snapshot field is a Hive partition: an explicit mapping to the weekly imports in HDFS. You must include a predicate on this partition in the where clause of your queries (even if it is just snapshot > '0'). Partitions allow Hive to reduce the amount of data it must parse and process before returning results. For example, if you are only interested in the 2020-01-20 snapshot, add where snapshot = '2020-01-20'; this instructs Hive to process only the data in partitions matching that predicate. You may use partition fields as you would any normal field, even though their values are not actually stored in the data files.
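As a sketch of the partition predicate in practice, the query below counts the rows of a single weekly snapshot (the snapshot value is the one used as an example above; COUNT(*) is used so no assumptions about the not-yet-documented columns are needed):

```sql
-- Count rows in one weekly snapshot; the predicate prunes all other partitions
SELECT COUNT(*)
FROM structured_data.commons_entity
WHERE snapshot = '2020-01-20';
```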
Sample queries
TODO: Add query example
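Until an official example is added, here is a hypothetical sketch. The id column is assumed (it is typical of Wikibase entity tables but not confirmed by the schema above); the snapshot > '0' predicate is the minimal one described in the previous section:

```sql
-- Hypothetical example: fetch a few entity ids across all snapshots.
-- 'id' is an assumed column name, not confirmed by the published schema.
SELECT id
FROM structured_data.commons_entity
WHERE snapshot > '0'
LIMIT 10;
```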
Changes and known problems since 2020-02
- 2021-12 (task T258834): Table is created and first imports and conversions are automated.
See also
- The code that imports and converts: