
User:AKhatun/Wikidata Basic Analysis

Revision as of 15:47, 29 June 2021 by imported>AKhatun (Fix subject and ref, finish ref.)

The following analysis was done on Wikidata to understand more about Wikidata itself: which subjects, properties, and objects it contains most, what types of triples it contains, how much of it refers to wiki or non-wiki objects, and so on. The analysis was done on Wikidata snapshot 20210614 using Python in a Jupyter Notebook. Other packages used are Spark, RDFLib, SPARQLWrapper, and Pandas. Some parts of the analysis collected data with SPARQL from the WDQS endpoint, which serves the latest data (for example, to easily get labels and the data types of literals). As a result, small differences sometimes appear between the snapshot and the latest data.

The Wikidata prefixes list can be found here: Full_list_of_prefixes
Phabricator Ticket: T282139
Jupyter Notebook: [[]]

Overview

As of 20210614:

  • Total number of triples: 12910066145 (12.9B)
  • Total number of distinct items (context): 97315151 (0.75% of total triples)
  • Total number of distinct predicates:
  • Total number of distinct Wikidata properties:

Items

How many different things does Wikidata talk about? This is a very high-level question, answered here using the context of each triple in the Wikidata dump. For example, all triples under the Q42 context can be found here: Q42 dump. The top 20 items are shown in the table below.

Top 20 Items in Wikidata
Item Item Label Count
wd:Q39790431 BayGenomics: a resource of insertional mutations in mouse embryonic stem cells 41847
wd:Q57661806 Erratum to: Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets, and large missing transverse momentum in √s = 8 TeV pp collisions with the ATLAS detector 34517
wd:Q56836084 40 EASD Annual Meeting of the European Association for the Study of Diabetes : Munich, Germany, 5-9 September 2004 33299
wd:Q64022985 Combinations of single-top-quark production cross-section measurements and fLVVtb determinations at √s = 7 and 8 TeV with the ATLAS and CMS experiments 32078
wd:Q21558717 Combined Measurement of the Higgs Boson Mass in p p Collisions at s = 7 and 8 TeV with the ATLAS and CMS Experiments 31791
wd:Q56754739 Measurements of the Higgs boson production and decay rates and constraints on its couplings from a combined ATLAS and CMS analysis of the LHC pp collision data at √s = 7 and 8 TeV 31653
wd:Q56895655 Combination of inclusive and differential tt̄ charge asymmetry measurements using ATLAS and CMS data at √s = 7 and 8 TeV 31562
wd:Q57920219 35th Annual Meeting of the European Association for the Study of Diabetes 27656
wd:Q56883844 35th Annual Meeting of the European Association for the Study of Diabetes : Brussels, Belgium, 28 September-2 October 1999 27632
wd:Q57735077 ABSTRACTS 27267
wd:Q56489295 Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets, and large missing transverse momentum in [Formula: see text] TeV collisions with the ATLAS detector 24994
wd:Q93740619 XXIV World Allergy Congress 2015: Seoul, Korea. 14-17 October 2015 21491
wd:Q21521425 Charged-particle multiplicities in pp interactions at root s=900 GeV measured with the ATLAS detector at the LHC ATLAS Collaboration 19904
wd:Q56289397 Performance of the ATLAS detector using first collision data 19722
wd:Q57018684 Measurement of the W → ℓν and Z/γ* → ℓℓ production cross sections in proton-proton collisions at √s = 7 TeV with the ATLAS detector 19692
wd:Q57018057 Measurement of inclusive jet and dijet cross sections in proton-proton collisions at 7 TeV centre-of-mass energy with the ATLAS detector 19689
wd:Q56501626 Search for new particles in two-jet final states in 7 TeV proton-proton collisions with the ATLAS detector at the LHC 19640
wd:Q21521423 Search for quark contact interactions in dijet angular distributions in pp collisions at root s=7 TeV measured with the ATLAS detector 19635
wd:Q57016199 Search for heavy vector-like quarks coupling to light quarks in proton–proton collisions at s = 7 TeV with the ATLAS detector 19276
wd:Q57661921 Erratum to: “Search for first generation scalar leptoquarks in pp collisions at s = 7 TeV with the ATLAS detector” [Phys. Lett. B 709 (2012) 158] 19231
  • Total number of distinct items (context): 97315151 (0.75% of total triples)
  • "Top items" are the items that have the most related triples
  • All of the top 50 items appear to be scholarly articles
  • These articles have *lots* of authors, and additional information about each author is stored as statements
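
The difference between counting by item (context) and counting by subject can be sketched as below. This is only an illustration in plain Python, not the notebook's Spark code: it assumes the dump has been reduced to (context, subject, predicate, object) tuples and, as the gap between the item and subject tables suggests, that an item's context also holds triples whose subject is a statement node rather than the item itself. The rows, the identifiers in them, and the helper name `count_by` are all hypothetical.

```python
from collections import Counter

# Positions inside each (context, subject, predicate, object) tuple.
CONTEXT, SUBJECT = 0, 1

def count_by(quads, position):
    """Count triples grouped by the value found at the given tuple position."""
    return Counter(quad[position] for quad in quads)

# Hypothetical toy rows; the identifiers are made up for illustration.
quads = [
    ("wd:Q1", "wd:Q1", "p:P50", "wds:Q1-a"),
    ("wd:Q1", "wds:Q1-a", "ps:P50", "wd:Q2"),
    ("wd:Q1", "wds:Q1-a", "pq:P1545", '"1"'),
    ("wd:Q1", "wd:Q1", "wdt:P50", "wd:Q2"),
    ("wd:Q2", "wd:Q2", "wdt:P31", "wd:Q5"),
]

by_context = count_by(quads, CONTEXT)  # "top items": every triple in the item's context
by_subject = count_by(quads, SUBJECT)  # "top subjects": only triples with that exact subject

print(by_context.most_common(20))
print(by_subject.most_common(20))
```

In this toy data the item wd:Q1 has four triples in its context but only two where it is the subject, which mirrors why the per-item counts are larger than the per-subject counts for the same entity.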

Top Subjects

Once again it seems top subjects are scholarly articles.

Top 20 Subjects in Wikidata
Subject Subject Label Count
wd:Q39790431 BayGenomics: a resource of insertional mutations in mouse embryonic stem cells 16758
wd:Q57661806 Erratum to: Search for supersymmetry in events containing a same-flavour opposite-sign dilepton... 11371
wd:Q56836084 40 EASD Annual Meeting of the European Association for the Study of Diabetes : Munich, Germany,... 11054
wd:Q64022985 Combinations of single-top-quark production cross-section measurements and fLVVtb determinati... 10460
wd:Q21558717 Combined Measurement of the Higgs Boson Mass in p p Collisions at s = 7 and 8 TeV wi... 10351
wd:Q106988069 Combined Measurement of the Higgs Boson Mass in pp Collisions at √s=7 and 8 TeV with the A... 10338
wd:Q56754739 Measurements of the Higgs boson production and decay rates and constraints on its couplings fro... 10285
wd:Q56895655 Combination of inclusive and differential tt̄ c... 10254
wd:Q58231267 Erratum to: 36th International Symposium on Intensive Care and Emergency Medicine 9861
wd:Q57920219 35th Annual Meeting of the European Association for the Study of Diabetes 9191
wd:Q56883844 35th Annual Meeting of the European Association for the Study of Diabetes : Brussels, Belgium, ... 9187
wd:Q57735077 ABSTRACTS 9092
wd:Q56489295 Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets,... 8145
wd:Q21521425 Charged-particle multiplicities in pp interactions at root s=900 GeV measured with the ATLAS de... 6543
wd:Q21521423 Search for quark contact interactions in dijet angular distributions in pp collisions at root s... 6454
wd:Q57018684 Measurement of the W → ℓν and Z/γ* → ℓℓ production cross se... 6446
wd:Q56289397 Performance of the ATLAS detector using first collision data 6435
wd:Q57018057 Measurement of inclusive jet and dijet cross sections in proton-proton collisions at 7 TeV cent... 6426
wd:Q56501626 Search for new particles in two-jet final states in 7 TeV proton-proton collisions with the ATL... 6421
wd:Q27972199 ESG: extended similarity group method for automated protein function prediction 6292

References

Top references (with duplicates removed). Top references as subjects are the ones with the most triples associated with them (i.e., triples in which the reference is the subject). Reference usage is counted by treating the reference as an object (the usage count in the tables). The references with the most triples are not necessarily the most used ones. Top references as objects are the ones that are used the most.
Analysis note: References and values have duplicates in HDFS due to the dumping process. In the real triple store, they are deduplicated. So from here on, all counts for references and values consider distinct triples only.
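
The two counts described above (triple count as subject, usage count as object, both over distinct triples) can be sketched as follows. This is a small plain-Python illustration, not the notebook's Spark code. The toy triples, the hashes `ref:aaa` and `ref:bbb`, the statement identifiers, and the function name `reference_stats` are hypothetical; `prov:wasDerivedFrom` is the predicate that links a statement to its reference in the Wikidata RDF model.

```python
from collections import Counter

def reference_stats(triples):
    """Return (triple_count, usage_count) Counters over distinct triples."""
    distinct = set(triples)  # drop duplicates introduced by the dumping process
    triple_count = Counter(s for s, _p, _o in distinct if s.startswith("ref:"))
    usage_count = Counter(o for _s, _p, o in distinct if o.startswith("ref:"))
    return triple_count, usage_count

# Hypothetical toy triples; hashes and statement IDs are made up.
triples = [
    ("ref:aaa", "pr:P248", "wd:Q1"),
    ("ref:aaa", "pr:P813", '"2020-07-09T00:00:00Z"'),
    ("ref:aaa", "pr:P248", "wd:Q1"),  # duplicate row, as found in HDFS
    ("ref:bbb", "pr:P854", "<https://example.org/source>"),
    ("wds:Q1-a", "prov:wasDerivedFrom", "ref:aaa"),
    ("wds:Q2-b", "prov:wasDerivedFrom", "ref:aaa"),
    ("wds:Q3-c", "prov:wasDerivedFrom", "ref:bbb"),
]

triple_count, usage_count = reference_stats(triples)
print(triple_count.most_common(20))  # references with the most triples as subject
print(usage_count.most_common(20))   # references used most often as object
```

Converting to a set before counting is what keeps the duplicated `ref:aaa` row from inflating its triple count.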

  • Total number of triples related to references (count of triples in reference context): 379164793 (379M)
  • Total number of references: 90062598 (90M)
Top reference triple count (as subject)
Reference triple count usage count
ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b 102 1
ref:68e48339e339a3bda7932cac38f44abe27de1461 35 1
ref:703f0d28768bd798064c25fcdce64ea5dfbd6c5a 33 1
ref:35ff7a307543d079cc224bb7aa75ef02a164049f 28 1
ref:d2658c2ffc4a87017867dffe00c3cccc64f6a131 27 1
ref:c892725170c4c673767355f2581286c675613844 26 1
ref:c7105386906164ed1a2e4ef334b43e9f00c00157 26 1
ref:5cb21fcb42c03830f7125eaa545e577361c2f9ef 25 1
ref:7e1244220f770f53ec309f3dce0845f990959d7d 25 1
ref:872a839ab7777797a4a498442811816c70025da5 24 1
ref:55ee45a8d9f9cc0fad2cae61f5e42aced44261e0 24 1
ref:426796f41cc0666ac881b1f42501cbdb0064e976 24 1
ref:ff4ad8769bd82d915b6c6e5f2004f13b57efc5ff 22 1
ref:a0ea572733723ae44d5d3c10cea8a79e9e67e7da 21 1
ref:b9ca90f1e1de79de773a3a7f3f6f014ade3ca397 21 1
ref:7c4655b9fadcc3751795f4fc610854826e2095a1 20 1
ref:8e260ab6e7cd618239354955d7c86558ea9992aa 20 1
ref:dfeadb7be3fd743c77af182ad62cf834c89587bd 20 1
ref:ce66538f0ea508e1ad69004f962ba53c5b7ed05a 20 1
ref:d8488d862542e7169f3c77b836caa1274c959e8c 20 1
Top reference usage count (as object)
Reference triple count usage count
ref:8ba559d5760a03bedaaacc3c347bbfe4981560bf 1 46222198
ref:b64af6c056b6c5f6a7ea17156dcd718d4744bbf8 1 32783765
ref:fa278ebfc458360e5aed63d5058cca83c46134f1 1 14391465
ref:6b647975ae22e206a4cd711623ecb06abadbdb9e 1 10767806
ref:0723282bb80042897ca697416c050b4bf7fb5428 1 6246037
ref:9a24f7c0208b05d6be97077d855671d1dfdbc0dd 1 5183641
ref:7c4765d26b6b678783fec763a62a05f82ef36291 1 4663919
ref:64141ed6d84b2cf105b1656d0c0f094358a3dd4f 1 4141724
ref:43a0088c51fd85e5a85d1b46412c3a635e6d4edc 1 3756463
ref:288ab581e7d2d02995a26dfa8b091d96e78457fc 1 3047132
ref:6c44b0eb3905101f3d17982ef3fddb8cb2b3e278 1 2972781
ref:0ee3b3ba1c958f4c3dcba7ed8091fe4b57311348 1 2637075
ref:d5847b9b6032aa8b13dae3c2dfd9ed5d114d21b3 1 2595349
ref:3913844e06e055e8cd81608f22bad0e604d89d2d 1 2547282
ref:bd49d3e4f67bc460ce7a06b6ac3027347cf5ee55 1 2397088
ref:d4bd87b862b12d99d26e86472d44f26858dee639 1 2330033
ref:efa0005ffbf7ddad87bc72240c9732b6a01f9f0e 1 1997758
ref:eec9dbd6f74260dc8f8c2ee1b0ecd8c64d973be5 1 1728596
ref:377e4d758ca3aff7d42243bbd9df04682e6b611b 1 1651288
ref:a29a646602abf65105ed0f39a44231c962ece9ee 1 1463936

Let us explore some.

First reference

This section explores the reference with the most triples in which it is the subject: ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b, with 102 triples. Some of the triples are shown below.

predicate object
pr:P3452 wd:Q41555988
pr:P3452 wd:Q42605633
pr:P3452 wd:Q42614357
pr:P3452 wd:Q42612177
pr:P3452 wd:Q42615213
pr:P3452 wd:Q42615740
pr:P3452 wd:Q42613597

This ref seems to be used only in one place.

subject subject label predicate object
wd:Q36502461 Allgemeiner Harz-Berg-Kalender publisher ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b

Second reference

This is the reference with the second most triples in which it is the subject: ref:68e48339e339a3bda7932cac38f44abe27de1461, with 35 triples. Some of the triples are shown below.

predicate object
pr:P854 <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368433>
pr:P248 wd:Q106715485
pr:P854 <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368440>
prv:P813 wdv:e06efec16adfbaad0a72e3b6d9fc28fe
pr:P854 <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368443>
pr:P813 "2021-05-22T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>


This ref also seems to be used only in one place.

subject subject label predicate object
wd:Q3782554 Lyres has works in the collection ref:68e48339e339a3bda7932cac38f44abe27de1461

Third reference

This is a reference for KBpedia ID on a specific date. An example of where it is used: the KBpedia statement in Q2013. Some more places where this reference is used are given below (see more using this SPARQL query).

subject subject label predicate object
Q125 November KBpedia ID ref:9a681f9dd95c90224547c404e11295f4f7dcf54e
Q140 lion
Q144 dog
Q147 kitten
Q148 People's Republic of China
Q155 Brazil
Q177 pizza
Q178 pasta
Q2013 Wikidata
Q23 George Washington
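
A lookup of this kind could be written as sketched below. This is not the SPARQL query linked from the page, whose text is not reproduced here; it is an assumed equivalent that finds statements derived from a given reference and the items they belong to. The function names, the default limit, and the User-Agent string are hypothetical. The block only builds and prints the query; `run_query` is defined but not called, since it needs network access to the WDQS endpoint.

```python
import json
import re
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
REFERENCE_NS = "http://www.wikidata.org/reference/"

def build_usage_query(ref_hash, limit=10):
    """Build a SPARQL query listing items and statements that use a reference."""
    if not re.fullmatch(r"[0-9a-f]{40}", ref_hash):
        raise ValueError("expected a 40-character lowercase hex reference hash")
    return (
        "PREFIX prov: <http://www.w3.org/ns/prov#>\n"
        "SELECT ?item ?statement WHERE {\n"
        f"  ?statement prov:wasDerivedFrom <{REFERENCE_NS}{ref_hash}> .\n"
        "  ?item ?claim ?statement .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )

def run_query(query, user_agent="wikidata-basic-analysis-example/0.1"):
    """Send the query to WDQS and return the JSON result bindings (needs network)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]

query = build_usage_query("9a681f9dd95c90224547c404e11295f4f7dcf54e")
print(query)
```

Because the query runs against live data, its results can differ slightly from the 20210614 snapshot.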

This ref has 3 triples in which it is the subject. The triples of this reference are shown below.

predicate object
prv:P813 <http://www.wikidata.org/value/664bae4effccc18fd4ad1ae188fab025>
pr:P813 "2020-07-09T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
pr:P248 <http://www.wikidata.org/entity/Q64139102>

Fourth reference

This is a reference for 'taxon common name' on a specific date. Some places where it is used are given below. Notice that the same item can use this reference multiple times. The table includes a few different items that use this reference.

subject subject label predicate object
Q17970 Jabiru mycteria taxon common name ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7
Q17970 Jabiru mycteria
Q18836 Common Buttonquail
Q18836 Common Buttonquail
Q26490 Common Kestrel
Q26620 Common Redstart
Q26657 Goldcrest
Q26685 Atlantic Puffin

This ref has 3 triples in which it is the subject. The triples of this reference are shown below.

predicate object
prv:P813 <http://www.wikidata.org/value/055167878d6ea2b50690069f330bb773>
pr:P813 "2016-10-16T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
pr:P248 <http://www.wikidata.org/entity/Q27042747>
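
The tables above mix full URIs (for example `<http://www.wikidata.org/value/...>`) with prefixed names (for example `wdv:...`). A small helper like the one below can normalize them to the prefixed form used elsewhere on this page. It is a sketch only: the function name `shorten` is hypothetical, the mapping covers just the five namespaces that appear in these tables, and `ref:` is the prefix label used on this page for the reference namespace.

```python
# Namespace URIs for the prefixes that appear in the reference tables.
PREFIXES = {
    "prv": "http://www.wikidata.org/prop/reference/value/",
    "pr": "http://www.wikidata.org/prop/reference/",
    "wdv": "http://www.wikidata.org/value/",
    "wd": "http://www.wikidata.org/entity/",
    "ref": "http://www.wikidata.org/reference/",
}

def shorten(term):
    """Turn '<full-uri>' into 'prefix:local' when the namespace is known."""
    if not (term.startswith("<") and term.endswith(">")):
        return term  # literals and already-prefixed names are left untouched
    iri = term[1:-1]
    # Try the longest namespace first so prv: is matched before pr:.
    for prefix, namespace in sorted(PREFIXES.items(), key=lambda kv: -len(kv[1])):
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return term

print(shorten("<http://www.wikidata.org/value/055167878d6ea2b50690069f330bb773>"))
print(shorten("<http://www.wikidata.org/entity/Q27042747>"))
```

URIs outside these namespaces, such as the Harvard archive links in the second reference, are returned unchanged.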