
Difference between revisions of "User:AKhatun/Wikidata Basic Analysis"

imported>AKhatun
(Fix subject and ref, finish ref.)
imported>Nettrom
(→‎Overview: let's wrap those numbers in formatnum)
 
(3 intermediate revisions by one other user not shown)
Line 3: Line 3:
The Wikidata prefixes list can be found here: [[mw:Wikibase/Indexing/RDF_Dump_Format#Full_list_of_prefixes|Full_list_of_prefixes]]<br>
The Wikidata prefixes list can be found here: [[mw:Wikibase/Indexing/RDF_Dump_Format#Full_list_of_prefixes|Full_list_of_prefixes]]<br>
Phabricator Ticket: [[phab:T282139|T282139]]<br>
Phabricator Ticket: [[phab:T282139|T282139]]<br>
Jupyter Notebook: [[]]
Jupyter Notebook: [[Media:WikidataAnalysis General.pdf|Wikidata Analysis Notebook]]


== Overview ==
== Overview ==


As of <code>20210614</code>:
As of <code>20210614</code>:
* Total number of triples: 12910066145 (12.9B)
* Total number of triples: {{formatnum:12910066145}} (12.9B)
* Total number of distinct items (context): 97315151 (0.75% of total triples)
* Total number of distinct items (context): {{formatnum:97315151}}
* Total number of distinct predicates:
* Total number of distinct predicates: {{formatnum:41117}}
* Total number of distinct wikidata propertes:
* Total number of triples related to references: {{formatnum:379164793}} (379M, 2.9%)
* Total number of references: {{formatnum:90062598}} (90M)
* Total triples with value node as subject: {{formatnum:279313267}} (279M, 2.2%)
* Total distinct value nodes: {{formatnum:61518273}} (61M)
* Number of triples with Wiki objects: {{formatnum:6828025880}} (6.8B, 52.9%)
* Number of triples with Non-Wiki objects: {{formatnum:5843911006}} (5.8B, 45.3%)


== Items ==
== Items ==
Line 217: Line 222:
Let us explore some.
Let us explore some.


=== First reference ===
=== Reference #1 ===
This section explores the reference with the most triples where it is a subject. <code>ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b</code> with 102 triples. Some of the triples are shown below.
This section explores the reference with the most triples where it is a subject. <code>ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b</code> with 102 triples. Some of the triples are shown below.


Line 246: Line 251:
|}
|}


=== Second reference ===
=== Reference #2 ===
This is the reference with the second most triples where it is a subject. <code>ref:68e48339e339a3bda7932cac38f44abe27de1461</code> with 35 triples. Some of the triples are shown below.
This is the reference with the second most triples where it is a subject. <code>ref:68e48339e339a3bda7932cac38f44abe27de1461</code> with 35 triples. Some of the triples are shown below.


Line 274: Line 279:
|}
|}


=== Third reference ===
=== Reference #3 ===
It is a reference for KBpedia ID on a specific date. Example of where it is used: The KBpedia statement in [[wikidata:Q2013|Q2013]]. Some more places this reference is used is given below (See more using [https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3FpropLabel%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20%3Fpred%20%5Bprov%3AwasDerivedFrom%20wdref%3A9a681f9dd95c90224547c404e11295f4f7dcf54e%5D.%0A%20%20BIND%28REPLACE%28STR%28%3Fpred%29%2C%20%22http%3A%2F%2Fwww.wikidata.org%2F.%2a%2F%28.%2a%29%22%2C%20%22%241%22%29%20as%20%3Fcode%29.%0A%20%20BIND%28URI%28CONCAT%28%22http%3A%2F%2Fwww.wikidata.org%2Fentity%2F%22%2C%3Fcode%29%29%20as%20%3Fprop%29.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D%0ALIMIT%2010 this SPARQL query]).
It is a reference for KBpedia ID on a specific date. Example of where it is used: The KBpedia statement in [[wikidata:Q2013|Q2013]]. Some more places this reference is used is given below (See more using [https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3FpropLabel%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20%3Fpred%20%5Bprov%3AwasDerivedFrom%20wdref%3A9a681f9dd95c90224547c404e11295f4f7dcf54e%5D.%0A%20%20BIND%28REPLACE%28STR%28%3Fpred%29%2C%20%22http%3A%2F%2Fwww.wikidata.org%2F.%2a%2F%28.%2a%29%22%2C%20%22%241%22%29%20as%20%3Fcode%29.%0A%20%20BIND%28URI%28CONCAT%28%22http%3A%2F%2Fwww.wikidata.org%2Fentity%2F%22%2C%3Fcode%29%29%20as%20%3Fprop%29.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D%0ALIMIT%2010 this SPARQL query]).


Line 315: Line 320:
|}
|}


=== Fourth reference ===
=== Reference #4 ===
This is a reference for 'taxon common name' on a specific date. Some places where it is used are given below. Notice that the same item uses this reference multiple times. I tried to include a couple of different items that use this reference.
This is a reference for 'taxon common name' on a specific date. Some places where it is used are given below. Notice that the same item uses this reference multiple times. I tried to include a couple of different items that use this reference.


Line 350: Line 355:
|-
|-
| pr:P248      || <http://www.wikidata.org/entity/Q27042747>
| pr:P248      || <http://www.wikidata.org/entity/Q27042747>
|}
== Values ==
Value nodes hold structured values such as times or quantities, together with details like precision and time zone. References, for example, carry the direct values plus value nodes that hold more information about those values. For more, see [[mw:Wikibase/Indexing/RDF_Dump_Format#Value_representation|Value_representation]].<br>
Analysis note: references and values have duplicates in HDFS because of the dumping process; in the real triple store they are deduplicated. For references and values, only distinct triples are considered.
* Total triples with value node as subject: 279313267 (279M)
* Total distinct value nodes: 61518273 (61M)
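
The deduplication and the two counts used in the tables below (triples with a value node as subject, and usages of a value node as object) can be sketched as follows. This is a minimal illustration in plain Python, not the Spark code from the notebook; the sample triples, the <code>v:</code> prefix check, and the function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample of dumped triples (subject, predicate, object).
# The repeated triple mimics the duplicates produced by the dumping process.
triples = [
    ("wds:Q1-a", "psv:P2216", "v:abc"),
    ("v:abc", "rdf:type", "wikibase:QuantityValue"),
    ("v:abc", "wikibase:quantityAmount", '"+84.5"'),
    ("v:abc", "wikibase:quantityAmount", '"+84.5"'),  # duplicate from the dump
    ("wds:Q2-b", "psv:P2216", "v:abc"),
]


def is_value_node(term):
    """Assumption: value nodes are written with the 'v:' prefix."""
    return term.startswith("v:")


def value_node_stats(triples):
    """Return (triple count per value node, usage count per value node)."""
    distinct = set(triples)  # keep only distinct triples
    triple_count = Counter(s for s, _, _ in distinct if is_value_node(s))
    usage_count = Counter(o for _, _, o in distinct if is_value_node(o))
    return triple_count, usage_count


triple_count, usage_count = value_node_stats(triples)
print("distinct value nodes:", len(triple_count))
print("triples with value node as subject:", sum(triple_count.values()))
print("usage of v:abc as object:", usage_count["v:abc"])
```

The same pattern applies to references by swapping the prefix check.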
{| class="wikitable sortable"
|+ Types of Value nodes
!Type !! count 
|-
| <http://wikiba.se/ontology#QuantityValue>        || 52564412
|-
| <http://wikiba.se/ontology#GlobecoordinateValue> || 8603170
|-
| <http://wikiba.se/ontology#TimeValue>            || 350691 
|-
| <http://wikiba.se/ontology#GeoAutoPrecision>    || 93780 
|}
{| class="wikitable"
|+ Top triples of values (as subject)
!value !! triple count !! usage count
|-
| v:e2bd8f07b10701c92eacf58a0329b127 || 6 || 1
|-
| v:0bee50ecdf15d0e640ac9d69a68cdc76 || 6 || 1
|-
| v:c23b7a0348b297277984fde34e6a51ab || 6 || 3
|-
| v:596fec2ac8604b8607ecb4b5b83f468c || 6 || 3
|-
| v:383f9a273c8ea452d57902f936263369 || 6 || 1
|-
| v:c1a3dbbac3f8b4a37ffb91daa2e86317 || 6 || 24
|-
| v:1ab662093fe7836e96f3a5a780247c4a || 6 || 1
|-
| v:2c03608955c6e8eeba4e2e55805c4979 || 6 || 2
|-
| v:211d8f31b9a28f79dd20477e31e1c5ae || 6 || 1
|-
| v:f86ef017ad15adc68b55ada2df7f248a || 6 || 10
|-
| v:fb485962a17c8c5be3ad4c894a281c65 || 6 || 1
|-
| v:80356c41f851dcbbdb594e43ac82369d || 6 || 3
|-
| v:f2b33b065b12f1668cf13b3562cf19db || 6 || 1
|-
| v:bfc2ab241dafc94425fba4a642e6009d || 6 || 2
|-
| v:fec78667c7a021215ca59728258be16c || 6 || 1
|-
| v:40ea08aa3e1d6edc307d961fc9dc0b69 || 6 || 23
|-
| v:357e66846e15f89837bdc31792df2e6a || 6 || 4
|-
| v:748476f9b6f3daf4d9a6818798263d6c || 6 || 1
|-
| v:65113fbe866cec6af2d05e3aaa2bac0e || 6 || 2
|-
| v:e395a45e7e42cd416e2269fbdd1ab8f9 || 6 || 2
|}
{| class="wikitable"
|+ Top usage of values (as object)
!value !! triple count !! usage count
|-
| v:c610c7d0abbfe361e367744369f5d33d || 6 || 10121290
|-
| v:4e601d1880d647664093f1b20b24dacf || 5 || 4542586
|-
| v:d0931b31b1c31ffa1325777f65b723db || 5 || 3277169
|-
| v:7e281616976c7de150357c18e76abfd1 || 5 || 1452465
|-
| v:d40cb13acf8001d779efbb0c45cb42f0 || 5 || 1412411
|-
| v:8c2bdc5006a93f73a2b03849218b4e7b || 5 || 1217459
|-
| v:b3795d3425e0bbdd474f3138cad4a069 || 5 || 940261
|-
| v:202de63fcb2e0943a9b5d0cebf189569 || 5 || 685430
|-
| v:a843a14d6be3111e93a253fd623f18cf || 5 || 580686
|-
| v:dafb9cf711b15afe91ec0aa7158e57a6 || 5 || 573003
|-
| v:1c9e02c1631d3fd5ec9a9fe9aa1fde65 || 3 || 562672
|-
| v:67a17c05603b9e75b6c25826fa747705 || 3 || 486942
|-
| v:5a2515dd8960847405b294e0a7999403 || 5 || 466608
|-
| v:5a166c540a59253c92144230db78cb8a || 5 || 428426
|-
| v:8c0d739994215f213311d29254302049 || 5 || 422030
|-
| v:39c2e70b9990c3ed7be32f8e34015853 || 5 || 420452
|-
| v:b441aa14f32ad7a9e6fe04eb80002b4c || 5 || 411528
|-
| v:216f4f19c804fc50c737da9ae87494a9 || 5 || 387676
|-
| v:1ce0c285c67a65d1e2e620ace3f6c897 || 5 || 369781
|-
| v:ddd311e198ef615dbfaaa3f42aeec7b4 || 5 || 364341
|}
I explore a few values below.
==== Value #1 ====
Value node <code>v:e2bd8f07b10701c92eacf58a0329b127</code> is of type <code>QuantityValue</code>. The places where it is used and the triples it contains are shown below. The usage table lists each predicate for which the value is the object, along with how many times the value is used with that predicate.
{|class="wikitable"
|+ Usage of value
!predicate !!predicate label !! count
|-
|P2216 || radial velocity || 1
|}
{|class="wikitable"
|+ Triples of value
!predicate !!object
|-
| <http://wikiba.se/ontology#quantityLowerBound>    || "+83.6"^^<http://www.w3.org/2001/XMLSchema#decimal>           
|-
| <http://wikiba.se/ontology#quantityAmount>        || "+84.5"^^<http://www.w3.org/2001/XMLSchema#decimal>           
|-
| <http://wikiba.se/ontology#quantityUnit>          || <http://www.wikidata.org/entity/Q3674704>                     
|-
| <http://wikiba.se/ontology#quantityUpperBound>    || "+85.4"^^<http://www.w3.org/2001/XMLSchema#decimal>           
|-
| <http://wikiba.se/ontology#quantityNormalized>    || <http://www.wikidata.org/value/f0b95aa63660347cc5d9af28fe74c7d8>
|-
| <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> || <http://wikiba.se/ontology#QuantityValue>                     
|}
==== Value #2 ====
Value node <code>v:e2ce06d70b150e202e4c81681e5334e5</code> is <code>TimeValue</code> type.
{|class="wikitable"
|+ Usage of value
!predicate !!predicate label !! count
|-
| P577 || publication date || 35774
|-
| P580 || start time || 800
|-
| P582 || end time || 233
|-
| P570 || date of death || 43
|-
| P571 || inception || 36
|-
| P576 || dissolved, abolished or demolished date || 27
|-
| P7588 || effective date || 21
|-
| P585 || point in time || 16
|-
| P1619 || date of official opening || 5
|-
| P569 || date of birth || 2
|-
| P1191 || date of first performance || 1
|-
| P2960 || archive date || 1
|-
| P575 || time of discovery or invention || 1
|-
| P620 || time of spacecraft landing || 1
|-
| P729 || service entry || 1
|}
{|class="wikitable"
|+ Triples of value
!predicate !!object
|-
| <http://wikiba.se/ontology#timeValue>            || "2009-12-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
|-
| <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> || <http://wikiba.se/ontology#TimeValue>                             
|-
| <http://wikiba.se/ontology#timeCalendarModel>    || <http://www.wikidata.org/entity/Q1985727>                         
|-
| <http://wikiba.se/ontology#timeTimezone>          || "0"^^<http://www.w3.org/2001/XMLSchema#integer>                   
|-
| <http://wikiba.se/ontology#timePrecision>        || "11"^^<http://www.w3.org/2001/XMLSchema#integer>                                     
|}
==== Value #3 ====
Value node <code>v:e2ce017b6638b9684082390db9ce311f</code> is <code>TimeValue</code> type.
{|class="wikitable"
|+ Usage of value
!predicate !!predicate label !! count
|-
| P813 || retrieved || 44722
|-
| P577 || publication date|| 1048
|-
| P5017 || last update|| 195
|-
| P585 || point in time|| 64
|-
| P570 || date of death|| 59
|-
| P580 || start time|| 57
|-
| P582 || end time|| 56
|-
| P2960 || archive date|| 9
|-
| P6949 || announcement date|| 2
|-
| P1319 || earliest date|| 1
|-
| P2031 || work period (start)|| 1
|-
| P571 || inception || 1           
|}
{|class="wikitable"
|+ Triples of value
!predicate !!object
|-
| <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>|| <http://wikiba.se/ontology#TimeValue>                             
|-
| <http://wikiba.se/ontology#timeCalendarModel>    || <http://www.wikidata.org/entity/Q1985727>                         
|-
| <http://wikiba.se/ontology#timeValue>            || "2020-01-12T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
|-
| <http://wikiba.se/ontology#timeTimezone>        || "0"^^<http://www.w3.org/2001/XMLSchema#integer>                   
|-
| <http://wikiba.se/ontology#timePrecision>        || "11"^^<http://www.w3.org/2001/XMLSchema#integer> 
|}
== Top Predicates ==
Number of distinct predicates: <code>41117</code>
{| class="wikitable sortable"
|+ Top 20 predicates
! predicate !! count
|-
| schema:description || 2462371590
|-
| rdf:type || 1406804171
|-
| wikibase:rank || 1288379756
|-
| prov:wasDerivedFrom || 1003288418
|-
| rdfs:label || 495214648
|-
| p:P2860 || 247929898
|-
| ps:P2860 || 247929861
|-
| wdt:P2860 || 247928515
|-
| pq:P1545 || 157196523
|-
| p:P2093 || 135492184
|-
| ps:P2093 || 135492121
|-
| wdt:P2093 || 135397718
|-
| skos:altLabel || 102185811
|-
| p:P31 || 99091730
|-
| ps:P31 || 99091721
|-
| schema:dateModified || 97315730
|-
| schema:version || 97315148
|-
| wdt:P31 || 95759651
|-
| wikibase:statements || 93970378
|-
| wikibase:sitelinks || 93464760
|-
| wikibase:identifiers || 93464760
|}
The assumption here is that all Wikidata-related predicates are prefixed with <code>wikidata.org</code> or <code>wikiba.se</code>. Predicates with any other prefix are considered non-wiki predicates.<br>
Both wiki and non-wiki predicates can have wiki or non-wiki objects.
* wiki predicate, non-wiki object: wd:Q30 wdt:P2250 "78.69024"
* non-wiki predicate, wiki object: data:P31 schema:about wd:P31
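
The wiki versus non-wiki rule described above can be sketched as a small classifier. This is an illustration under stated assumptions rather than the notebook's actual code: it treats a URI as "wiki" when its host ends with <code>wikidata.org</code> or <code>wikiba.se</code>, and the function names and the <code>obj_is_literal</code> flag are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: a URI counts as "wiki" when its host ends with one of these.
WIKI_HOSTS = ("wikidata.org", "wikiba.se")


def is_wiki_uri(uri):
    """True if the URI's host is a Wikidata or Wikibase ontology host."""
    host = urlparse(uri).netloc
    return host.endswith(WIKI_HOSTS)


def classify_triple(predicate, obj, obj_is_literal=False):
    """Return ('wiki' | 'non-wiki') for the predicate and for the object."""
    pred_kind = "wiki" if is_wiki_uri(predicate) else "non-wiki"
    # Literals are always non-wiki objects, whatever text they contain.
    obj_is_wiki = (not obj_is_literal) and is_wiki_uri(obj)
    obj_kind = "wiki" if obj_is_wiki else "non-wiki"
    return pred_kind, obj_kind


# The two examples from the list above, written with full URIs.
print(classify_triple("http://www.wikidata.org/prop/direct/P2250",
                      "78.69024", obj_is_literal=True))
print(classify_triple("http://schema.org/about",
                      "http://www.wikidata.org/entity/P31"))
```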
{| class="wikitable sortable"
|+ Top 20 non-wiki predicates
! predicate !! count
|-
| schema:description || 2462371590
|-
| rdf:type || 1406804171
|-
| prov:wasDerivedFrom || 1003288418
|-
| rdfs:label || 495214648
|-
| skos:altLabel || 102185811
|-
| schema:dateModified || 97315730
|-
| schema:version || 97315148
|-
| schema:about || 78517643
|-
| schema:inLanguage || 78516557
|-
| schema:isPartOf || 78516557
|-
| schema:name || 78516557
|-
| ontolex:representation || 9957440
|-
| ontolex:lexicalForm || 8730481
|-
| owl:sameAs || 3344770
|-
| dct:language || 496678
|-
| skos:definition || 142252
|-
| ontolex:sense || 128287
|-
| owl:onProperty || 8940
|-
| owl:complementOf || 8940
|-
| owl:someValuesFrom || 8940
|-
| owl:imports || 1
|-
| cc:license || 1
|-
| schema:softwareVersion || 1
|}
{| class="wikitable sortable"
|+ Top 50 wiki predicates
! predicate !! predicate label !! count
|-
| wikibase:rank || || 1288379756
|-
| p:P2860 || cites work || 247929898
|-
| ps:P2860 || cites work || 247929861
|-
| wdt:P2860 || cites work || 247928515
|-
| pq:P1545 || series ordinal || 157196523
|-
| p:P2093 || author name string || 135492184
|-
| ps:P2093 || author name string || 135492121
|-
| wdt:P2093 || author name string || 135397718
|-
| p:P31 || instance of || 99091730
|-
| ps:P31 || instance of || 99091721
|-
| wdt:P31 || instance of || 95759651
|-
| wikibase:statements || || 93970378
|-
| wikibase:sitelinks || || 93464760
|-
| wikibase:identifiers || || 93464760
|-
| pr:P248 || stated in || 76940160
|-
| pr:P813 || retrieved || 74812531
|-
| prv:P813 || retrieved || 74812525
|-
| pr:P854 || reference URL || 56837807
|-
| wikibase:quantityAmount || || 52564412
|-
| wikibase:quantityUnit || || 52564412
|-
| p:P1476 || title || 40923677
|-
| ps:P1476 || title || 40921787
|-
| wdt:P1476 || title || 40916525
|-
| p:P577 || publication date || 39645641
|-
| ps:P577 || publication date || 39645462
|-
| psv:P577 || publication date || 39645061
|-
| wdt:P577 || publication date || 39633371
|-
| p:P1433 || published in || 37349691
|-
| ps:P1433 || published in || 37349680
|-
| wdt:P1433 || published in || 37348812
|-
| wikibase:quantityNormalized || || 35418096
|-
| p:P304 || page(s) || 34875085
|-
| ps:P304 || page(s) || 34875081
|-
| wdt:P304 || page(s) || 34875052
|-
| p:P478 || volume || 34665675
|-
| ps:P478 || volume || 34665668
|-
| wdt:P478 || volume || 34665659
|-
| psv:P1215 || apparent magnitude || 33123781
|-
| ps:P1215 || apparent magnitude || 33123781
|-
| p:P1215 || apparent magnitude || 33123781
|-
| pq:P1227 || astronomical filter || 33123753
|-
| p:P698 || PubMed ID || 31983819
|-
| ps:P698 || PubMed ID || 31983819
|-
| wdt:P698 || PubMed ID || 31948158
|-
| p:P433 || issue || 31751756
|-
| ps:P433 || issue || 31751753
|-
| wdt:P433 || issue || 31751748
|-
| p:P528 || catalog code || 28701730
|-
| ps:P528 || catalog code || 28701676
|-
| wdt:P528 || catalog code || 28698943
|}
== Object ==
Objects have too many distinct values, especially because of literals (strings, numerals, times, dates, etc.), so a list of top objects would not serve any useful purpose. Instead, I looked at the types of objects in Wikidata.<br>
Broadly speaking, objects can be:
# Literals: Literals can have datatypes
# URIs: May or may not have <type> predicate specified.
## Wiki URI
## Non-wiki URI
Of the objects that are URIs and have a <type> predicate in Wikidata, let us find the top types of object. Note that this is not the count of object usage or occurrence, but just the number of distinct objects with that <type>. More on the distribution of the number of triples with each kind of object can be found in the [[User:AKhatun/Wikidata_Basic_Analysis#Wiki vs Non-wiki Triples|Wiki vs Non-wiki Triples]] section.
{| class="wikitable sortable"
|+ Top types of URI objects
! Type of object URI !! count
|-
| wikibase:BestRank || 1256923711
|-
| wikibase:QuantityValue || 52564412
|-
| ontolex:Form || 8730481
|-
| wikibase:GlobecoordinateValue || 8603170
|-
| wikibase:TimeValue || 350691
|-
| ontolex:LexicalSense || 128287
|-
| wikibase:GeoAutoPrecision || 93780
|-
| owl:ObjectProperty || 68987
|-
| wdno:P364 || 60594
|-
| wdno:P17 || 39925
|-
| schema:Article || 33268
|-
| wdno:P155 || 32010
|-
| owl:DatatypeProperty || 29805
|-
| ontolex:LexicalEntry || 18917
|-
| owl:Class || 8940
|-
| owl:Restriction || 8940
|-
| wdno:P814 || 8713
|-
| wikibase:Property || 8605
|-
| wdno:P156 || 8457
|-
| wdno:P162 || 8191
|}
Of the objects that are literals, we can find the datatype of the literals. The table below counts the number of objects with a specific datatype.
{|class="wikitable sortable"
|+ Top datatypes of literal objects
!dtype !! count !! percentage
|-
| <http://www.w3.org/2001/XMLSchema#integer> || 397670762 || 36.22
|-
| <http://www.w3.org/2001/XMLSchema#decimal> || 344236061 || 31.35
|-
| <http://www.w3.org/2001/XMLSchema#dateTime> || 312169343 || 28.43
|-
| <http://www.w3.org/2001/XMLSchema#double> || 26333634 || 2.39
|-
| <http://www.opengis.net/ont/geosparql#wktLiteral> || 17557277 || 1.60
|-
| <http://www.w3.org/1998/Math/MathML> || 45058 || 0.004
|}
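
As a worked check of the percentage column, the sketch below recomputes each share from the counts in the table. It assumes the percentages are relative to the sum of the listed datatypes, which reproduces the table's values approximately; the short datatype names are abbreviations introduced for the example.

```python
# Counts copied from the "Top datatypes of literal objects" table.
datatype_counts = {
    "xsd:integer": 397670762,
    "xsd:decimal": 344236061,
    "xsd:dateTime": 312169343,
    "xsd:double": 26333634,
    "geo:wktLiteral": 17557277,
    "MathML": 45058,
}


def percentage_shares(counts):
    """Share of each entry as a percentage of the sum of all entries."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


shares = percentage_shares(datatype_counts)
for name, share in shares.items():
    print(f"{name}: {share:.3f}%")
```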
== Wiki vs Non-wiki Triples ==
Triples can have objects that are later expanded within Wikidata. These objects have to be wiki objects, that is, objects that start with the prefix <code>wikidata.org</code> or <code>wikiba.se</code>. However, not all wiki objects 'expand' within Wikidata. Expansion is measured by finding the number of wiki objects that also occur as subjects: if an object occurs as a subject, it has 'expanded' within Wikidata.<br>
Triples can also have non-wiki objects, which cannot be expanded in Wikidata. Non-wiki objects do not start with the <code>wikidata.org</code> or <code>wikiba.se</code> prefixes and can be URIs or literals. Since these objects do not expand within Wikidata, they are leaves in the graph, so they can be modelled as properties of the associated node in property graphs.
The table below shows the distribution of the number of triples containing different types of objects. All percentages are relative to the total number of triples.
{|class="wikitable"
!Object type !! # Triples !! % Total triples
!Object type !! # Triples !! % Total triples
!Object type !! # Triples !! % Total triples
|-
|rowspan="4" | Wiki object
|rowspan="4" | 6828025880
|rowspan="4" | 52.9
|rowspan="2" | Wikidata object
|rowspan="2" | 4221092432
|rowspan="2" | 32.7
| Object also subject || 4220239722 || 32.7
|-
| Object not subject || 852710 || 0.0066
|-
|rowspan="2" | Wikiba.se object
|rowspan="2" | 2606933448
|rowspan="2" | 20.2
| Object also subject || 0 || 0
|-
| Object not subject || 2606933448 || 20.2
|-
|rowspan="2" | Non-wiki object
|rowspan="2" | 5843911006
|rowspan="2" | 45.3
| URI object || 391262595 || 3
|-
| Literal object || 5452648411 || 42.2
|}
If the table is too much to digest, here is a simple diagram with the same information but in a more colorful and beautiful way.
[[File:wikidata_triples.jpg|800px]]
== Expanded vs Unexpanded objects ==
This section does some analysis mainly to verify that wikidata objects do expand within wikidata. This is a premise to the idea that non-wiki objects don't expand within wikidata and are leaves in the graph.
* Triples with wiki objects that also appear as subjects (therefore do expand in wikidata): 4220239722 (32.7%)
** Number of triples with <code>wikidata.org</code> object that are also subject: 4220239722 (32.7%)
** Number of triples with <code>wikiba.se</code> object that are also subject: 0
* Total wikidata-objects that are not expanded in wikidata (not subjects): 852710
* Distinct wikidata-objects that are not expanded in wikidata (not subjects): 852319 (Most are distinct)
Ideally, we expect every Wikidata entry that is used as an object to also appear as a subject with some relevant information in Wikidata. But 852710 triples have objects that do not have corresponding subjects. The Wikidata objects that do not expand in Wikidata, and the count of their occurrences as objects, are shown below. Upon manual inspection, the Q-items appear to be deleted entries, which would explain why they are not available as subjects in Wikidata. Nevertheless, the triples that use them as objects still persist.
{| class="wikitable"
|+ Unexpanded wikidata objects
! object !! count
|-
| wd:Q68637652 || 148
|-
| <http://www.wikidata.org/.well-known/genid/abf15f7aee46e00705c700147bd53518> || 32
|-
| wd:Q28968053 || 24
|-
| <http://www.wikidata.org/.well-known/genid/3d27111745b4e92877aa4c1ad765e5ca> || 15
|-
| wd:L229411-S1 || 14
|-
| wd:Q35104224 || 14
|-
| wd:Q58331113 || 12
|-
| wd:undefined || 10
|-
| wd:Q107006232 || 10
|-
| <http://www.wikidata.org/.well-known/genid/fa575516a51320c3beb70f7719f72a99> || 10
|-
| <http://www.wikidata.org/.well-known/genid/170fd20cd570feefeba7cbc7be44c853> || 9
|}
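
The expanded versus unexpanded split described in this section can be sketched as a membership test against the set of subjects. This is a simplified in-memory illustration (the real analysis ran on Spark); the sample triples, the <code>wd:</code> prefix check, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical sample: wd:Q999 is used as an object but never as a subject,
# which is what a deleted item would look like.
triples = [
    ("wd:Q1", "wdt:P31", "wd:Q2"),
    ("wd:Q2", "rdfs:label", '"two"'),
    ("wd:Q1", "wdt:P361", "wd:Q999"),
    ("wd:Q3", "wdt:P361", "wd:Q999"),
]


def is_wikidata_object(term):
    """Assumption: Wikidata entities are written with the 'wd:' prefix."""
    return term.startswith("wd:")


def split_expanded(triples):
    """Split triples with Wikidata objects by whether the object is a subject."""
    subjects = {s for s, _, _ in triples}
    expanded, unexpanded = [], []
    for triple in triples:
        obj = triple[2]
        if not is_wikidata_object(obj):
            continue  # literals and non-wiki URIs are not considered here
        (expanded if obj in subjects else unexpanded).append(triple)
    return expanded, unexpanded


expanded, unexpanded = split_expanded(triples)
unexpanded_counts = Counter(o for _, _, o in unexpanded)
print("triples with expanded objects:", len(expanded))
print("triples with unexpanded objects:", len(unexpanded))
print("distinct unexpanded objects:", len(unexpanded_counts))
```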
== Objects Per Item ==
While it is great to know how many non-wiki objects we have, and that we can treat them as properties in a property graph, we still don't know how spread out they are. Are most of the non-wiki objects concentrated in a few items, or do most items have many non-wiki objects? Finding the number of non-wiki triples per item helps answer these questions.
{| class="wikitable"
|+ Objects per subject
! Object type !! max !! min !! avg !! std
|-
| Non-wiki || 5311 || 1 || 5.59 || 19.86
|-
| Literal || 5309 || 1 || 5.32 || 19.94
|-
| Wiki || 16667 || 1 || 4.21 || 8.7
|}
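
The per-subject statistics in the table above can be sketched as a group-and-count followed by summary statistics. This is an illustration only: the sample triples and the literal check are hypothetical, and using the population standard deviation is an assumption, since the source does not say which variant was used.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical sample; literals are written with a leading double quote.
triples = [
    ("wd:Q1", "rdfs:label", '"one"'),
    ("wd:Q1", "schema:description", '"first"'),
    ("wd:Q1", "skos:altLabel", '"uno"'),
    ("wd:Q1", "wdt:P31", "wd:Q5"),
    ("wd:Q2", "rdfs:label", '"two"'),
]


def is_literal(term):
    return term.startswith('"')


def objects_per_subject(triples, keep):
    """Summary statistics of matching-object counts per subject."""
    counts = Counter(s for s, _, o in triples if keep(o))
    values = list(counts.values())
    return {
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "std": pstdev(values),  # assumption: population standard deviation
    }


stats = objects_per_subject(triples, is_literal)
print(stats)
```

Subjects with no matching object never enter the counter, which is consistent with the minimum of 1 in every row of the table.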
{| class="wikitable"
|+ Top objects per subject
|-
! colspan="2" | Top Non-wiki objects per subject
! colspan="2" | Top literals per subject
! colspan="2" | Top Wiki objects per subject
|-
! subject !! count !! subject !! count !! subject !! count
|-
| wd:Q56836084 || 5311 || wd:Q56836084 || 5309 || wd:Q39790431 || 16667
|-
| wd:Q106988069 || 5125 || wd:Q106988069 || 5123 || wd:Q27972199 || 6208
|-
| wd:Q64022985 || 4492 || wd:Q64022985 || 4491 || wd:Q57661806 || 6086
|-
| wd:Q56883844 || 4489 || wd:Q56883844 || 4489 || wd:Q21558717 || 6018
|-
| wd:Q57920219 || 4484 || wd:Q57920219 || 4482 || wd:Q56754739 || 5986
|-
| wd:Q21558717 || 4328 || wd:Q21558717 || 4327 || wd:Q56895655 || 5970
|-
| wd:Q56754739 || 4238 || wd:Q56754739 || 4236 || wd:Q63409374 || 5964
|-
| wd:Q56895655 || 4218 || wd:Q56895655 || 4216 || wd:Q64022985 || 5894
|-
| wd:Q174565 || 4049 || wd:Q174565 || 4043 || wd:Q56836084 || 5652
|-
| wd:Q58231267 || 3812 || wd:Q58231267 || 3810 || wd:Q106988069 || 5165
|-
| wd:Q467925 || 3277 || wd:Q467925 || 3272 || wd:Q58231267 || 4920
|-
| wd:Q104369389 || 2983 || wd:Q104369389 || 2982 || wd:Q56489295 || 4876
|-
| wd:Q100507117 || 2974 || wd:Q100507117 || 2973 || wd:Q33928881 || 4817
|-
| wd:Q104798012 || 2930 || wd:Q104798012 || 2929 || wd:Q28388335 || 4763
|-
| wd:Q98468691 || 2917 || wd:Q98468691 || 2916 || wd:Q57920219 || 4695
|-
| wd:Q98730204 || 2914 || wd:Q98730204 || 2913 || wd:Q56883844 || 4682
|-
| wd:Q96613392 || 2881 || wd:Q96613392 || 2880 || wd:Q57735077 || 4559
|-
| wd:Q104467608 || 2877 || wd:Q104467608 || 2876 || wd:Q35952737 || 4435
|-
| wd:Q21521425 || 2868 || wd:Q21521425 || 2867 || wd:Q35202929 || 4380
|-
| wd:Q21521423 || 2827 || wd:Q21521423 || 2826 || wd:Q30486707 || 4129
|}
|}

Latest revision as of 14:42, 26 July 2021

The following analysis was done on Wikidata to understand more about Wikidata itself: which kinds of subjects, properties, and objects it contains the most, the types of triples it contains, how much of it refers to wiki or non-wiki objects, and so on. The analysis was done on Wikidata snapshot 20210614 using Python in a Jupyter Notebook. Other packages used are Spark, RDFLib, SPARQLWrapper, and Pandas. Some parts of the analysis collected data with SPARQL from the WDQS endpoint, which returns the latest data (for example, to easily get labels and the datatypes of literals). As a result, small differences are sometimes found between the snapshot and the latest data.

The Wikidata prefixes list can be found here: Full_list_of_prefixes
Phabricator Ticket: T282139
Jupyter Notebook: Wikidata Analysis Notebook

Overview

As of 20210614:

  • Total number of triples: 12,910,066,145 (12.9B)
  • Total number of distinct items (context): 97,315,151
  • Total number of distinct predicates: 41,117
  • Total number of triples related to references: 379,164,793 (379M, 2.9%)
  • Total number of references: 90,062,598 (90M)
  • Total triples with value node as subject: 279,313,267 (279M, 2.2%)
  • Total distinct value nodes: 61,518,273 (61M)
  • Number of triples with Wiki objects: 6,828,025,880 (6.8B, 52.9%)
  • Number of triples with Non-Wiki objects: 5,843,911,006 (5.8B, 45.3%)

Items

How many different things does Wikidata talk about? This is a very high-level overview question, answered here based on the context of each triple in Wikidata. For example, all triples under the Q42 context can be found here: Q42 dump. The top 20 items are shown in the table below.

Top 20 Items in Wikidata
Item Item Label Count
wd:Q39790431 BayGenomics: a resource of insertional mutations in mouse embryonic stem cells 41847
wd:Q57661806 Erratum to: Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets, and large missing transverse momentum in √s = 8 TeV pp collisions with the ATLAS detector 34517
wd:Q56836084 40 EASD Annual Meeting of the European Association for the Study of Diabetes : Munich, Germany, 5-9 September 2004 33299
wd:Q64022985 Combinations of single-top-quark production cross-section measurements and fLVVtb determinations at √s = 7 and 8 TeV with the ATLAS and CMS experiments 32078
wd:Q21558717 Combined Measurement of the Higgs Boson Mass in p p Collisions at s = 7 and 8 TeV with the ATLAS and CMS Experiments 31791
wd:Q56754739 Measurements of the Higgs boson production and decay rates and constraints on its couplings from a combined ATLAS and CMS analysis of the LHC pp collision data at √s = 7 and 8 TeV 31653
wd:Q56895655 Combination of inclusive and differential tt̄ charge asymmetry measurements using ATLAS and CMS data at √s = 7 and 8 TeV 31562
wd:Q57920219 35th Annual Meeting of the European Association for the Study of Diabetes 27656
wd:Q56883844 35th Annual Meeting of the European Association for the Study of Diabetes : Brussels, Belgium, 28 September-2 October 1999 27632
wd:Q57735077 ABSTRACTS 27267
wd:Q56489295 Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets, and large missing transverse momentum in [Formula: see text] TeV collisions with the ATLAS detector 24994
wd:Q93740619 XXIV World Allergy Congress 2015: Seoul, Korea. 14-17 October 2015 21491
wd:Q21521425 Charged-particle multiplicities in pp interactions at root s=900 GeV measured with the ATLAS detector at the LHC ATLAS Collaboration 19904
wd:Q56289397 Performance of the ATLAS detector using first collision data 19722
wd:Q57018684 Measurement of the W → ℓν and Z/γ * → ℓℓ production cross sections in proton-proton collisions at √s = 7 TeV with the ATLAS detector 19692
wd:Q57018057 Measurement of inclusive jet and dijet cross sections in proton-proton collisions at 7 TeV centre-of-mass energy with the ATLAS detector 19689
wd:Q56501626 Search for new particles in two-jet final states in 7 TeV proton-proton collisions with the ATLAS detector at the LHC 19640
wd:Q21521423 Search for quark contact interactions in dijet angular distributions in pp collisions at root s=7 TeV measured with the ATLAS detector 19635
wd:Q57016199 Search for heavy vector-like quarks coupling to light quarks in proton–proton collisions at s = 7 TeV with the ATLAS detector 19276
wd:Q57661921 Erratum to: “Search for first generation scalar leptoquarks in pp collisions at s = 7 TeV with the ATLAS detector” [Phys. Lett. B 709 (2012) 158] 19231
  • Total number of distinct items (context): 97315151 (0.75% of total triples)
  • Top 50 item means items that have the most related triples
  • All of the top 50 seem to be scholarly articles
  • These have *lots* of authors and more related information about the authors as statements

Top Subjects

Once again it seems top subjects are scholarly articles.

Top 20 Subjects in Wikidata
Subject Subject Label Count
wd:Q39790431 BayGenomics: a resource of insertional mutations in mouse embryonic stem cells 16758
wd:Q57661806 Erratum to: Search for supersymmetry in events containing a same-flavour opposite-sign dilepton... 11371
wd:Q56836084 40 EASD Annual Meeting of the European Association for the Study of Diabetes : Munich, Germany,... 11054
wd:Q64022985 Combinations of single-top-quark production cross-section measurements and fLVVtb determinati... 10460
wd:Q21558717 Combined Measurement of the Higgs Boson Mass in p p Collisions at s = 7 and 8 TeV wi... 10351
wd:Q106988069 Combined Measurement of the Higgs Boson Mass in pp Collisions at √s=7 and 8 TeV with the A... 10338
wd:Q56754739 Measurements of the Higgs boson production and decay rates and constraints on its couplings fro... 10285
wd:Q56895655 Combination of inclusive and differential tt̄ c... 10254
wd:Q58231267 Erratum to: 36th International Symposium on Intensive Care and Emergency Medicine 9861
wd:Q57920219 35th Annual Meeting of the European Association for the Study of Diabetes 9191
wd:Q56883844 35th Annual Meeting of the European Association for the Study of Diabetes : Brussels, Belgium, ... 9187
wd:Q57735077 ABSTRACTS 9092
wd:Q56489295 Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets,... 8145
wd:Q21521425 Charged-particle multiplicities in pp interactions at root s=900 GeV measured with the ATLAS de... 6543
wd:Q21521423 Search for quark contact interactions in dijet angular distributions in pp collisions at root s... 6454
wd:Q57018684 Measurement of the W → ℓν and Z/γ * → ℓℓ production cross se... 6446
wd:Q56289397 Performance of the ATLAS detector using first collision data 6435
wd:Q57018057 Measurement of inclusive jet and dijet cross sections in proton-proton collisions at 7 TeV cent... 6426
wd:Q56501626 Search for new particles in two-jet final states in 7 TeV proton-proton collisions with the ATL... 6421
wd:Q27972199 ESG: extended similarity group method for automated protein function prediction 6292

References

Top references, after removing duplicates. Top references as subjects are those with the most triples in which the reference is the subject (triple count in the tables). Usage is counted as the number of triples in which the reference is the object (usage count in the tables), so top references as objects are those that are used the most. The references with the most triples are not necessarily the most used ones.
Analysis Note: References and values have duplicates in HDFS due to the dumping process. In the real triple store they are deduplicated. So from here on, for references and values, only distinct triples are considered.

  • Total number of triples related to references (count of triples in reference context): 379164793 (379M)
  • Total number of references: 90062598 (90M)
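The two counts used in the reference tables can be sketched as follows. This is a minimal in-memory illustration, not the notebook's actual code: the sample triples, the `ref:abc` and `wds:` names and the function name `reference_counts` are made up for the example, and the real analysis ran over the full dump.

```python
from collections import Counter

def reference_counts(triples):
    """Return (triple_count, usage_count) per reference node.

    triples: iterable of (subject, predicate, object) strings.
    Duplicates are dropped first, mirroring the deduplication note above.
    """
    distinct = set(triples)
    # Triples where the reference is the subject.
    triple_count = Counter(s for s, _, _ in distinct if s.startswith("ref:"))
    # Triples where the reference is the object.
    usage_count = Counter(o for _, _, o in distinct if o.startswith("ref:"))
    return triple_count, usage_count

# Hypothetical sample: one reference with two distinct triples, used twice.
sample = [
    ("ref:abc", "pr:P248", "wd:Q1"),
    ("ref:abc", "pr:P813", '"2020-07-09T00:00:00Z"'),
    ("ref:abc", "pr:P248", "wd:Q1"),  # duplicate left over from the dump
    ("wds:stmt1", "prov:wasDerivedFrom", "ref:abc"),
    ("wds:stmt2", "prov:wasDerivedFrom", "ref:abc"),
]
triple_count, usage_count = reference_counts(sample)
print(triple_count["ref:abc"], usage_count["ref:abc"])
```

Sorting either counter with `most_common(20)` would give the shape of the two top-20 tables.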
Top reference triple count (as subject)
Reference triple count usage count
ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b 102 1
ref:68e48339e339a3bda7932cac38f44abe27de1461 35 1
ref:703f0d28768bd798064c25fcdce64ea5dfbd6c5a 33 1
ref:35ff7a307543d079cc224bb7aa75ef02a164049f 28 1
ref:d2658c2ffc4a87017867dffe00c3cccc64f6a131 27 1
ref:c892725170c4c673767355f2581286c675613844 26 1
ref:c7105386906164ed1a2e4ef334b43e9f00c00157 26 1
ref:5cb21fcb42c03830f7125eaa545e577361c2f9ef 25 1
ref:7e1244220f770f53ec309f3dce0845f990959d7d 25 1
ref:872a839ab7777797a4a498442811816c70025da5 24 1
ref:55ee45a8d9f9cc0fad2cae61f5e42aced44261e0 24 1
ref:426796f41cc0666ac881b1f42501cbdb0064e976 24 1
ref:ff4ad8769bd82d915b6c6e5f2004f13b57efc5ff 22 1
ref:a0ea572733723ae44d5d3c10cea8a79e9e67e7da 21 1
ref:b9ca90f1e1de79de773a3a7f3f6f014ade3ca397 21 1
ref:7c4655b9fadcc3751795f4fc610854826e2095a1 20 1
ref:8e260ab6e7cd618239354955d7c86558ea9992aa 20 1
ref:dfeadb7be3fd743c77af182ad62cf834c89587bd 20 1
ref:ce66538f0ea508e1ad69004f962ba53c5b7ed05a 20 1
ref:d8488d862542e7169f3c77b836caa1274c959e8c 20 1
Top reference usage count (as object)
Reference triple count usage count
ref:8ba559d5760a03bedaaacc3c347bbfe4981560bf 1 46222198
ref:b64af6c056b6c5f6a7ea17156dcd718d4744bbf8 1 32783765
ref:fa278ebfc458360e5aed63d5058cca83c46134f1 1 14391465
ref:6b647975ae22e206a4cd711623ecb06abadbdb9e 1 10767806
ref:0723282bb80042897ca697416c050b4bf7fb5428 1 6246037
ref:9a24f7c0208b05d6be97077d855671d1dfdbc0dd 1 5183641
ref:7c4765d26b6b678783fec763a62a05f82ef36291 1 4663919
ref:64141ed6d84b2cf105b1656d0c0f094358a3dd4f 1 4141724
ref:43a0088c51fd85e5a85d1b46412c3a635e6d4edc 1 3756463
ref:288ab581e7d2d02995a26dfa8b091d96e78457fc 1 3047132
ref:6c44b0eb3905101f3d17982ef3fddb8cb2b3e278 1 2972781
ref:0ee3b3ba1c958f4c3dcba7ed8091fe4b57311348 1 2637075
ref:d5847b9b6032aa8b13dae3c2dfd9ed5d114d21b3 1 2595349
ref:3913844e06e055e8cd81608f22bad0e604d89d2d 1 2547282
ref:bd49d3e4f67bc460ce7a06b6ac3027347cf5ee55 1 2397088
ref:d4bd87b862b12d99d26e86472d44f26858dee639 1 2330033
ref:efa0005ffbf7ddad87bc72240c9732b6a01f9f0e 1 1997758
ref:eec9dbd6f74260dc8f8c2ee1b0ecd8c64d973be5 1 1728596
ref:377e4d758ca3aff7d42243bbd9df04682e6b611b 1 1651288
ref:a29a646602abf65105ed0f39a44231c962ece9ee 1 1463936

Let us explore some.

Reference #1

This section explores the reference with the most triples where it is the subject: ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b, with 102 triples. Some of the triples are shown below.

predicate object
pr:P3452 wd:Q41555988
pr:P3452 wd:Q42605633
pr:P3452 wd:Q42614357
pr:P3452 wd:Q42612177
pr:P3452 wd:Q42615213
pr:P3452 wd:Q42615740
pr:P3452 wd:Q42613597

This ref seems to be used only in one place.

subject subject label predicate object
wd:Q36502461 Allgemeiner Harz-Berg-Kalender publisher ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b

Reference #2

This is the reference with the second most triples where it is the subject: ref:68e48339e339a3bda7932cac38f44abe27de1461, with 35 triples. Some of the triples are shown below.

predicate object
pr:P854 <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368433>
pr:P248 wd:Q106715485
pr:P854 <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368440>
prv:P813 wdv:e06efec16adfbaad0a72e3b6d9fc28fe
pr:P854 <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368443>
pr:P813 "2021-05-22T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>


This ref also seems to be used only in one place.

subject subject label predicate object
wd:Q3782554 Lyres has works in the collection ref:68e48339e339a3bda7932cac38f44abe27de1461

Reference #3

It is a reference for KBpedia ID on a specific date. An example of where it is used: the KBpedia ID statement in Q2013. Some more places where this reference is used are given below (see more using this SPARQL query).

subject subject label predicate object
Q125 November KBpedia ID ref:9a681f9dd95c90224547c404e11295f4f7dcf54e
Q140 lion
Q144 dog
Q147 kitten
Q148 People's Republic of China
Q155 Brazil
Q177 pizza
Q178 pasta
Q2013 Wikidata
Q23 George Washington
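The linked query is not reproduced here. A query of roughly this shape would list the items whose statements cite a given reference, assuming the standard Wikidata RDF mapping in which statement nodes point to their references through prov:wasDerivedFrom. The helper name `reference_usage_query` and the `limit` parameter are my own choices for the sketch.

```python
def reference_usage_query(ref_hash, limit=10):
    """Build a SPARQL query listing items whose statements cite one reference."""
    return f"""PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX wdref: <http://www.wikidata.org/reference/>
SELECT ?item ?property ?statement WHERE {{
  ?item ?property ?statement .
  ?statement prov:wasDerivedFrom wdref:{ref_hash} .
}}
LIMIT {limit}"""

print(reference_usage_query("9a681f9dd95c90224547c404e11295f4f7dcf54e"))
```

The printed text can be pasted into a SPARQL endpoint that holds the Wikidata triples.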

This ref has 3 triples where it is the subject. The triples of this reference are shown below.

predicate object
prv:P813 <http://www.wikidata.org/value/664bae4effccc18fd4ad1ae188fab025>
pr:P813 "2020-07-09T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
pr:P248 <http://www.wikidata.org/entity/Q64139102>

Reference #4

This is a reference for 'taxon common name' on a specific date. Some places where it is used are given below. Notice that the same item can use this reference multiple times. I tried to include a few different items that use this reference.

subject subject label predicate object
Q17970 Jabiru mycteria taxon common name ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7
Q17970 Jabiru mycteria
Q18836 Common Buttonquail
Q18836 Common Buttonquail
Q26490 Common Kestrel
Q26620 Common Redstart
Q26657 Goldcrest
Q26685 Atlantic Puffin

This ref has 3 triples where it is the subject. The triples of this reference are shown below.

predicate object
prv:P813 <http://www.wikidata.org/value/055167878d6ea2b50690069f330bb773>
pr:P813 "2016-10-16T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
pr:P248 <http://www.wikidata.org/entity/Q27042747>

Values

Values are nodes that hold a value such as a time or a quantity, along with details like precision, time zone, etc. References, for example, have the direct values plus value nodes that hold more information about those values. For more, see Value_representation.
Analysis Note: References and values have duplicates in HDFS due to the dumping process. In the real triple store they are deduplicated. For references and values, only distinct triples are considered.

  • Total triples with value node as subject: 279313267 (279M)
  • Total distinct value nodes: 61518273 (61M)
Types of Value nodes
Type count
<http://wikiba.se/ontology#QuantityValue> 52564412
<http://wikiba.se/ontology#GlobecoordinateValue> 8603170
<http://wikiba.se/ontology#TimeValue> 350691
<http://wikiba.se/ontology#GeoAutoPrecision> 93780
Top triples of values (as subject)
value triple count usage count
v:e2bd8f07b10701c92eacf58a0329b127 6 1
v:0bee50ecdf15d0e640ac9d69a68cdc76 6 1
v:c23b7a0348b297277984fde34e6a51ab 6 3
v:596fec2ac8604b8607ecb4b5b83f468c 6 3
v:383f9a273c8ea452d57902f936263369 6 1
v:c1a3dbbac3f8b4a37ffb91daa2e86317 6 24
v:1ab662093fe7836e96f3a5a780247c4a 6 1
v:2c03608955c6e8eeba4e2e55805c4979 6 2
v:211d8f31b9a28f79dd20477e31e1c5ae 6 1
v:f86ef017ad15adc68b55ada2df7f248a 6 10
v:fb485962a17c8c5be3ad4c894a281c65 6 1
v:80356c41f851dcbbdb594e43ac82369d 6 3
v:f2b33b065b12f1668cf13b3562cf19db 6 1
v:bfc2ab241dafc94425fba4a642e6009d 6 2
v:fec78667c7a021215ca59728258be16c 6 1
v:40ea08aa3e1d6edc307d961fc9dc0b69 6 23
v:357e66846e15f89837bdc31792df2e6a 6 4
v:748476f9b6f3daf4d9a6818798263d6c 6 1
v:65113fbe866cec6af2d05e3aaa2bac0e 6 2
v:e395a45e7e42cd416e2269fbdd1ab8f9 6 2
Top usage of values (as object)
value triple count usage count
v:c610c7d0abbfe361e367744369f5d33d 6 10121290
v:4e601d1880d647664093f1b20b24dacf 5 4542586
v:d0931b31b1c31ffa1325777f65b723db 5 3277169
v:7e281616976c7de150357c18e76abfd1 5 1452465
v:d40cb13acf8001d779efbb0c45cb42f0 5 1412411
v:8c2bdc5006a93f73a2b03849218b4e7b 5 1217459
v:b3795d3425e0bbdd474f3138cad4a069 5 940261
v:202de63fcb2e0943a9b5d0cebf189569 5 685430
v:a843a14d6be3111e93a253fd623f18cf 5 580686
v:dafb9cf711b15afe91ec0aa7158e57a6 5 573003
v:1c9e02c1631d3fd5ec9a9fe9aa1fde65 3 562672
v:67a17c05603b9e75b6c25826fa747705 3 486942
v:5a2515dd8960847405b294e0a7999403 5 466608
v:5a166c540a59253c92144230db78cb8a 5 428426
v:8c0d739994215f213311d29254302049 5 422030
v:39c2e70b9990c3ed7be32f8e34015853 5 420452
v:b441aa14f32ad7a9e6fe04eb80002b4c 5 411528
v:216f4f19c804fc50c737da9ae87494a9 5 387676
v:1ce0c285c67a65d1e2e620ace3f6c897 5 369781
v:ddd311e198ef615dbfaaa3f42aeec7b4 5 364341

I explore a few values below.

Value #1

Value node v:e2bd8f07b10701c92eacf58a0329b127 is of type QuantityValue. The places where it is used and the triples it contains are shown below. The usage table shows what kind of value it is used as (i.e. the predicate of which the value is an object) and how many times it is used with that predicate.

Usage of value
predicate predicate label count
P2216 radial velocity 1
Triples of value
predicate object
<http://wikiba.se/ontology#quantityLowerBound> "+83.6"^^<http://www.w3.org/2001/XMLSchema#decimal>
<http://wikiba.se/ontology#quantityAmount> "+84.5"^^<http://www.w3.org/2001/XMLSchema#decimal>
<http://wikiba.se/ontology#quantityUnit> <http://www.wikidata.org/entity/Q3674704>
<http://wikiba.se/ontology#quantityUpperBound> "+85.4"^^<http://www.w3.org/2001/XMLSchema#decimal>
<http://wikiba.se/ontology#quantityNormalized> <http://www.wikidata.org/value/f0b95aa63660347cc5d9af28fe74c7d8>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://wikiba.se/ontology#QuantityValue>
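To make the layout of a quantity value node concrete, here is a small sketch that reads the triples above into a record. The class and function names are mine, and the objects are passed as plain strings with the datatype suffix already stripped, which is a simplification.

```python
from dataclasses import dataclass
from decimal import Decimal

WIKIBASE = "http://wikiba.se/ontology#"

@dataclass
class QuantityValue:
    amount: Decimal
    lower_bound: Decimal
    upper_bound: Decimal
    unit: str

def quantity_from_triples(pairs):
    """Build a QuantityValue from the (predicate, object) pairs of one value node."""
    fields = dict(pairs)
    return QuantityValue(
        amount=Decimal(fields[WIKIBASE + "quantityAmount"]),
        lower_bound=Decimal(fields[WIKIBASE + "quantityLowerBound"]),
        upper_bound=Decimal(fields[WIKIBASE + "quantityUpperBound"]),
        unit=fields[WIKIBASE + "quantityUnit"],
    )

# The radial velocity value shown in the table above.
value = quantity_from_triples([
    (WIKIBASE + "quantityLowerBound", "+83.6"),
    (WIKIBASE + "quantityAmount", "+84.5"),
    (WIKIBASE + "quantityUnit", "http://www.wikidata.org/entity/Q3674704"),
    (WIKIBASE + "quantityUpperBound", "+85.4"),
])
print(value.lower_bound <= value.amount <= value.upper_bound)
```

Decimal is used instead of float so that the bounds compare exactly as written in the dump.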

Value #2

Value node v:e2ce06d70b150e202e4c81681e5334e5 is of type TimeValue.

Usage of value
predicate predicate label count
P577 publication date 35774
P580 start time 800
P582 end time 233
P570 date of death 43
P571 inception 36
P576 dissolved, abolished or demolished date 27
P7588 effective date 21
P585 point in time 16
P1619 date of official opening 5
P569 date of birth 2
P1191 date of first performance 1
P2960 archive date 1
P575 time of discovery or invention 1
P620 time of spacecraft landing 1
P729 service entry 1
Triples of value
predicate object
<http://wikiba.se/ontology#timeValue> "2009-12-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://wikiba.se/ontology#TimeValue>
<http://wikiba.se/ontology#timeCalendarModel> <http://www.wikidata.org/entity/Q1985727>
<http://wikiba.se/ontology#timeTimezone> "0"^^<http://www.w3.org/2001/XMLSchema#integer>
<http://wikiba.se/ontology#timePrecision> "11"^^<http://www.w3.org/2001/XMLSchema#integer>
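The timePrecision of 11 above means the value is precise to the day. As a rough sketch, the timeValue literal can be rendered at the granularity its precision allows. Only the year (9), month (10) and day (11) codes of the Wikibase data model are handled here, and the function name is hypothetical.

```python
from datetime import datetime

# Wikibase time precision codes for calendar dates: 9 = year, 10 = month, 11 = day.
PRECISION_FORMAT = {9: "%Y", 10: "%Y-%m", 11: "%Y-%m-%d"}

def format_time_value(time_value, precision):
    """Render a timeValue literal at the granularity given by timePrecision."""
    parsed = datetime.strptime(time_value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime(PRECISION_FORMAT[precision])

print(format_time_value("2009-12-01T00:00:00Z", 11))
```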

Value #3

Value node v:e2ce017b6638b9684082390db9ce311f is of type TimeValue.

Usage of value
predicate predicate label count
P813 retrieved 44722
P577 publication date 1048
P5017 last update 195
P585 point in time 64
P570 date of death 59
P580 start time 57
P582 end time 56
P2960 archive date 9
P6949 announcement date 2
P1319 earliest date 1
P2031 work period (start) 1
P571 inception 1
Triples of value
predicate object
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://wikiba.se/ontology#TimeValue>
<http://wikiba.se/ontology#timeCalendarModel> <http://www.wikidata.org/entity/Q1985727>
<http://wikiba.se/ontology#timeValue> "2020-01-12T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
<http://wikiba.se/ontology#timeTimezone> "0"^^<http://www.w3.org/2001/XMLSchema#integer>
<http://wikiba.se/ontology#timePrecision> "11"^^<http://www.w3.org/2001/XMLSchema#integer>

Top Predicates

Number of distinct predicates: 41117

Top 20 predicates
predicate count
schema:description 2462371590
rdf:type 1406804171
wikibase:rank 1288379756
prov:wasDerivedFrom 1003288418
rdfs:label 495214648
p:P2860 247929898
ps:P2860 247929861
wdt:P2860 247928515
pq:P1545 157196523
p:P2093 135492184
ps:P2093 135492121
wdt:P2093 135397718
skos:altLabel 102185811
p:P31 99091730
ps:P31 99091721
schema:dateModified 97315730
schema:version 97315148
wdt:P31 95759651
wikibase:statements 93970378
wikibase:sitelinks 93464760
wikibase:identifiers 93464760

I assume that all Wikidata-related predicates are prefixed with wikidata.org or wikiba.se. Anything without one of these prefixes is therefore considered a non-wiki predicate.
Both wiki and non-wiki predicates can have wiki or non-wiki objects.

  • wiki predicate, non-wiki object: wd:Q30 wdt:P2250 "78.69024"
  • non-wiki predicate, wiki object: data:P31 schema:about wd:P31
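The wiki / non-wiki split described above comes down to a check on the full URI. A minimal sketch follows. Using a substring test is my assumption, and a stricter match on the host part would also work.

```python
WIKI_MARKERS = ("wikidata.org", "wikiba.se")

def is_wiki_uri(uri):
    """True if the URI belongs to a Wikidata or Wikibase namespace."""
    return any(marker in uri for marker in WIKI_MARKERS)

predicates = [
    "http://www.wikidata.org/prop/direct/P2250",  # wdt:P2250
    "http://wikiba.se/ontology#rank",             # wikibase:rank
    "http://schema.org/about",                    # schema:about
]
for p in predicates:
    print(p, "wiki" if is_wiki_uri(p) else "non-wiki")
```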
Top 20 non-wiki predicates
predicate count
schema:description 2462371590
rdf:type 1406804171
prov:wasDerivedFrom 1003288418
rdfs:label 495214648
skos:altLabel 102185811
schema:dateModified 97315730
schema:version 97315148
schema:about 78517643
schema:inLanguage 78516557
schema:isPartOf 78516557
schema:name 78516557
ontolex:representation 9957440
ontolex:lexicalForm 8730481
owl:sameAs 3344770
dct:language 496678
skos:definition 142252
ontolex:sense 128287
owl:onProperty 8940
owl:complementOf 8940
owl:someValuesFrom 8940
owl:imports 1
cc:license 1
schema:softwareVersion 1
Top 50 wiki predicates
predicate predicate label count
wikibase:rank 1288379756
p:P2860 cites work 247929898
ps:P2860 cites work 247929861
wdt:P2860 cites work 247928515
pq:P1545 series ordinal 157196523
p:P2093 author name string 135492184
ps:P2093 author name string 135492121
wdt:P2093 author name string 135397718
p:P31 instance of 99091730
ps:P31 instance of 99091721
wdt:P31 instance of 95759651
wikibase:statements 93970378
wikibase:sitelinks 93464760
wikibase:identifiers 93464760
pr:P248 stated in 76940160
pr:P813 retrieved 74812531
prv:P813 retrieved 74812525
pr:P854 reference URL 56837807
wikibase:quantityAmount 52564412
wikibase:quantityUnit 52564412
p:P1476 title 40923677
ps:P1476 title 40921787
wdt:P1476 title 40916525
p:P577 publication date 39645641
ps:P577 publication date 39645462
psv:P577 publication date 39645061
wdt:P577 publication date 39633371
p:P1433 published in 37349691
ps:P1433 published in 37349680
wdt:P1433 published in 37348812
wikibase:quantityNormalized 35418096
p:P304 page(s) 34875085
ps:P304 page(s) 34875081
wdt:P304 page(s) 34875052
p:P478 volume 34665675
ps:P478 volume 34665668
wdt:P478 volume 34665659
psv:P1215 apparent magnitude 33123781
ps:P1215 apparent magnitude 33123781
p:P1215 apparent magnitude 33123781
pq:P1227 astronomical filter 33123753
p:P698 PubMed ID 31983819
ps:P698 PubMed ID 31983819
wdt:P698 PubMed ID 31948158
p:P433 issue 31751756
ps:P433 issue 31751753
wdt:P433 issue 31751748
p:P528 catalog code 28701730
ps:P528 catalog code 28701676
wdt:P528 catalog code 28698943

Object

Objects have too many distinct values, especially because of literals (strings, numerals, times, dates, etc.), so a list of top objects would not serve any useful purpose. Instead, I looked at the types of objects in Wikidata.
Broadly speaking, objects can be:

  1. Literals: Literals can have datatypes
  2. URIs: May or may not have <type> predicate specified.
    1. Wiki URI
    2. Non-wiki URI
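The breakdown above can be sketched as a small classifier over object terms written in N-Triples style (literals in double quotes, URIs in angle brackets). The function name and the returned labels are made up for the illustration.

```python
import re

# Matches a typed literal such as "..."^^<datatype> and captures the datatype.
LITERAL_DATATYPE = re.compile(r'^".*"\^\^<([^>]+)>$', re.DOTALL)

def classify_object(term):
    """Classify an object term as a literal, a wiki URI or a non-wiki URI.

    Returns (kind, datatype), where datatype is only set for typed literals.
    """
    if term.startswith('"'):
        match = LITERAL_DATATYPE.match(term)
        return ("literal", match.group(1) if match else None)
    uri = term.strip("<>")
    if "wikidata.org" in uri or "wikiba.se" in uri:
        return ("wiki_uri", None)
    return ("non_wiki_uri", None)

print(classify_object('"2021-05-22T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>'))
print(classify_object("<http://www.wikidata.org/entity/Q64139102>"))
```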

Of the objects that are URIs and have a <type> predicate in Wikidata, let us find the top types of object. Note that this is not the count of object usage or occurrence, but just the number of distinct objects with that <type>. More on the distribution of the number of triples with each kind of object can be found in the Wiki vs Non-wiki Triples section.

Top types of URI objects
Type of object URI count
wikibase:BestRank 1256923711
wikibase:QuantityValue 52564412
ontolex:Form 8730481
wikibase:GlobecoordinateValue 8603170
wikibase:TimeValue 350691
ontolex:LexicalSense 128287
wikibase:GeoAutoPrecision 93780
owl:ObjectProperty 68987
wdno:P364 60594
wdno:P17 39925
schema:Article 33268
wdno:P155 32010
owl:DatatypeProperty 29805
ontolex:LexicalEntry 18917
owl:Class 8940
owl:Restriction 8940
wdno:P814 8713
wikibase:Property 8605
wdno:P156 8457
wdno:P162 8191

Of the objects that are literals, we can find the datatype of the literals. The table below counts the number of objects with each datatype.

Top datatypes of literal objects
dtype count percentage
<http://www.w3.org/2001/XMLSchema#integer> 397670762 36.22
<http://www.w3.org/2001/XMLSchema#decimal> 344236061 31.35
<http://www.w3.org/2001/XMLSchema#dateTime> 312169343 28.43
<http://www.w3.org/2001/XMLSchema#double> 26333634 2.39
<http://www.opengis.net/ont/geosparql#wktLiteral> 17557277 1.60
<http://www.w3.org/1998/Math/MathML> 45058 0.004
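The percentage column can be re-derived from the counts, assuming the six datatypes listed make up the whole total. The short datatype names used as dictionary keys are abbreviations of my own.

```python
datatype_counts = {
    "xsd:integer": 397670762,
    "xsd:decimal": 344236061,
    "xsd:dateTime": 312169343,
    "xsd:double": 26333634,
    "geo:wktLiteral": 17557277,
    "MathML": 45058,
}
total = sum(datatype_counts.values())
# Share of each datatype as a percentage of all typed literals in the table.
shares = {name: 100 * count / total for name, count in datatype_counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.3f}%")
```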

Wiki vs Non-wiki Triples

Triples can have objects that are later expanded within Wikidata. These objects have to be wiki objects. The objects that start with the prefix wikidata.org or wikiba.se are considered wiki objects. But not all wiki objects 'expand' within Wikidata. This is calculated by finding the number of wiki objects that also occur as subjects. If they do occur as subjects, then they have 'expanded' within Wikidata.
Triples can also have non-wiki objects, and non-wiki objects cannot be expanded in Wikidata. Non-wiki objects do not start with the wikidata.org or wikiba.se prefixes. Non-wiki objects can be URIs or literals. The idea is that since these objects do not expand within Wikidata, they are leaves in the graph, and so they can be modelled as properties of the associated node in property graphs.
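The property graph idea can be illustrated with a short sketch: triples with leaf objects become properties on the subject node, and triples with wiki objects become edges. The function names and the prefix list in `looks_wiki` are assumptions for the example, and it works on prefixed names rather than full URIs to keep it short.

```python
from collections import defaultdict

def looks_wiki(term):
    """Hypothetical check on prefixed names, not on full URIs."""
    return term.startswith(("wd:", "wds:", "wdv:", "wdref:", "wikibase:"))

def to_property_graph(triples, is_wiki_object=looks_wiki):
    """Split triples into node properties (leaf objects) and edges (wiki objects)."""
    properties = defaultdict(dict)
    edges = []
    for s, p, o in triples:
        if is_wiki_object(o):
            edges.append((s, p, o))
        else:
            # A predicate may repeat on a subject, so keep a list of values.
            properties[s].setdefault(p, []).append(o)
    return dict(properties), edges

props, edges = to_property_graph([
    ("wd:Q30", "wdt:P2250", '"78.69024"'),
    ("data:P31", "schema:about", "wd:P31"),
])
print(props)
print(edges)
```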

The table below shows the distribution of the number of triples over the different types of objects. All percentages are percentages of the total number of triples.

Object type # Triples % Total triples
Wiki object 6828025880 52.9
Wiki object / Wikidata object 4221092432 32.7
Wiki object / Wikidata object / Object also subject 4220239722 32.7
Wiki object / Wikidata object / Object not subject 852710 0.0066
Wiki object / Wikiba.se object 2606933448 20.2
Wiki object / Wikiba.se object / Object also subject 0 0
Wiki object / Wikiba.se object / Object not subject 2606933448 20.2
Non-wiki object 5843911006 45.3
Non-wiki object / URI object 391262595 3
Non-wiki object / Literal object 5452648411 42.2

If the table is too much to digest, here is a simple diagram with the same information but in a more colorful and beautiful way.
[Diagram: distribution of triples by object type]

Expanded vs Unexpanded objects

This section mainly verifies that Wikidata objects do expand within Wikidata. This is a premise of the idea that non-wiki objects don't expand within Wikidata and are leaves in the graph.

  • Triples with wiki objects that also appear as subjects (therefore do expand in wikidata): 4220239722 (32.7%)
    • Number of triples with wikidata.org object that are also subject: 4220239722 (32.7%)
    • Number of triples with wikiba.se object that are also subject: 0
  • Total wikidata-objects that are not expanded in wikidata (not subjects): 852710
  • Distinct wikidata-objects that are not expanded in wikidata (not subjects): 852319 (Most are distinct)
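A minimal sketch of how the expanded and unexpanded counts above can be obtained: collect all subjects, then check each wiki object against that set. The sample triples and the `wd:` prefix test are made up for the example.

```python
from collections import Counter

def expansion_counts(triples, is_wiki_object):
    """Count triples whose wiki object is also a subject, and tally the rest."""
    subjects = {s for s, _, _ in triples}
    expanded = 0
    unexpanded = Counter()
    for _, _, o in triples:
        if not is_wiki_object(o):
            continue
        if o in subjects:
            expanded += 1
        else:
            unexpanded[o] += 1
    return expanded, unexpanded

# Hypothetical sample: wd:Q999 is used as an object but never appears as a subject.
sample = [
    ("wd:Q1", "wdt:P31", "wd:Q2"),
    ("wd:Q2", "rdfs:label", '"two"'),
    ("wd:Q1", "wdt:P361", "wd:Q999"),
    ("wd:Q3", "wdt:P361", "wd:Q999"),
]
expanded, unexpanded = expansion_counts(sample, lambda o: o.startswith("wd:"))
print(expanded, sum(unexpanded.values()), len(unexpanded))
```

Here `sum(unexpanded.values())` corresponds to the total count of unexpanded objects and `len(unexpanded)` to the distinct count.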

Ideally, every Wikidata entity that is used as an object should also appear as a subject, i.e. have some relevant information of its own in Wikidata. But 852710 triples have objects that do not have corresponding subjects. The Wikidata objects that do not expand in Wikidata and the counts of their occurrence as objects are shown below. Upon manual inspection, the Q-items seem to be deleted entries. This can explain why they are not available as subjects in Wikidata. Nevertheless, the triples that use them as objects still persist.

Unexpanded wikidata objects
object count
wd:Q68637652 148
<http://www.wikidata.org/.well-known/genid/abf15f7aee46e00705c700147bd53518> 32
wd:Q28968053 24
<http://www.wikidata.org/.well-known/genid/3d27111745b4e92877aa4c1ad765e5ca> 15
wd:L229411-S1 14
wd:Q35104224 14
wd:Q58331113 12
wd:undefined 10
wd:Q107006232 10
<http://www.wikidata.org/.well-known/genid/fa575516a51320c3beb70f7719f72a99> 10
<http://www.wikidata.org/.well-known/genid/170fd20cd570feefeba7cbc7be44c853> 9

Objects Per Item

While it's great that we know how many non-wiki objects there are and that we can treat them as properties in a property graph, we still don't know how spread out they are. Are most of the non-wiki objects concentrated in a few items, or do most items have a lot of non-wiki objects? Finding the number of non-wiki triples per item helps answer these questions.
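The per-subject statistics can be sketched like this. Subjects with no matching object are left out, which is consistent with a minimum of 1. Whether the original numbers use the population or the sample standard deviation is not stated, so the population form used here is an assumption, as are the function name and the sample data.

```python
from collections import Counter
from statistics import mean, pstdev

def per_subject_stats(triples, keep_object):
    """Max, min, average and std of matching objects per subject."""
    counts = Counter(s for s, _, o in triples if keep_object(o))
    values = list(counts.values())
    return {
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "std": pstdev(values),
    }

# Hypothetical sample: wd:Q1 has three literal objects, wd:Q2 has one.
sample = [
    ("wd:Q1", "rdfs:label", '"a"'),
    ("wd:Q1", "schema:description", '"b"'),
    ("wd:Q1", "skos:altLabel", '"c"'),
    ("wd:Q1", "wdt:P31", "wd:Q5"),
    ("wd:Q2", "rdfs:label", '"d"'),
]
stats = per_subject_stats(sample, lambda o: o.startswith('"'))
print(stats)
```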

Objects per subject
Object type max min avg std
Non-wiki 5311 1 5.59 19.86
Literal 5309 1 5.32 19.94
Wiki 16667 1 4.21 8.7
Top objects per subject
Top Non-wiki objects per subject Top literals per subject Top Wiki objects per subject
subject count subject count subject count
wd:Q56836084 5311 wd:Q56836084 5309 wd:Q39790431 16667
wd:Q106988069 5125 wd:Q106988069 5123 wd:Q27972199 6208
wd:Q64022985 4492 wd:Q64022985 4491 wd:Q57661806 6086
wd:Q56883844 4489 wd:Q56883844 4489 wd:Q21558717 6018
wd:Q57920219 4484 wd:Q57920219 4482 wd:Q56754739 5986
wd:Q21558717 4328 wd:Q21558717 4327 wd:Q56895655 5970
wd:Q56754739 4238 wd:Q56754739 4236 wd:Q63409374 5964
wd:Q56895655 4218 wd:Q56895655 4216 wd:Q64022985 5894
wd:Q174565 4049 wd:Q174565 4043 wd:Q56836084 5652
wd:Q58231267 3812 wd:Q58231267 3810 wd:Q106988069 5165
wd:Q467925 3277 wd:Q467925 3272 wd:Q58231267 4920
wd:Q104369389 2983 wd:Q104369389 2982 wd:Q56489295 4876
wd:Q100507117 2974 wd:Q100507117 2973 wd:Q33928881 4817
wd:Q104798012 2930 wd:Q104798012 2929 wd:Q28388335 4763
wd:Q98468691 2917 wd:Q98468691 2916 wd:Q57920219 4695
wd:Q98730204 2914 wd:Q98730204 2913 wd:Q56883844 4682
wd:Q96613392 2881 wd:Q96613392 2880 wd:Q57735077 4559
wd:Q104467608 2877 wd:Q104467608 2876 wd:Q35952737 4435
wd:Q21521425 2868 wd:Q21521425 2867 wd:Q35202929 4380
wd:Q21521423 2827 wd:Q21521423 2826 wd:Q30486707 4129