
User:AKhatun/Wikidata Scholarly Articles Subgraph Analysis

Revision as of 16:31, 10 August 2021 by imported>AKhatun (Add clarification for count of external id)

It has long been established that scholarly articles form a large portion of Wikidata. This page presents the baseline analysis of scholarly articles in Wikidata. The aim is to identify not only what portion of Wikidata consists of scholarly articles, but also how connected this subgraph is to the rest of Wikidata, how many users query it, and what percentage of all queries those queries represent. The analysis is therefore divided into two parts:

  • Analysis on Wikidata:
    • What are scholarly articles
    • Number and percentage of Wikidata entities that are scholarly articles
    • How many entities connect to scholarly articles
    • How many entities do scholarly articles connect to
    • Rate of growth of scholarly articles
    • Number of authors that would be isolated from Wikidata if scholarly articles were removed (that is, author entities that were probably created only for the articles they wrote)
  • Analysis of the WDQS SPARQL Queries:
    • What defines a query to be associated with scholarly articles
    • The number and percentage of queries associated with scholarly articles
    • The number and percentage of queries that require entities other than scholarly articles vs. those that require only scholarly articles

The Wikidata dump of 20210719 was used for the following analysis.
Ticket: T281854

Definition of Scholarly Articles

Research papers or books can be considered scholarly articles. This includes journal articles, conference papers, books, etc. Wikidata has several entity types to cover these. More technically speaking, the entities involved are:

Most of these overlap with scholarly articles and with one another. Scholarly articles have by far the largest count (37M), while everything else combined amounts to only about 130K (excluding entities that are also scholarly articles); the analysis therefore focuses on scholarly articles. The 130K figure could be even lower if the other article types overlap among themselves. This is not an exhaustive list, but it covers most of what can be called scholarly articles, papers, or journal articles. A detailed tree diagram of how these entities are related can be found in Scholarly Articles' Tree in Wikidata.

Scholarly Articles

Number of Scholarly Articles

  • Total entity count: 94M [1]
  • Number of entities that are instance of scholarly article: 37308158 (37.3M). About 40% of all Wikidata entities are therefore scholarly articles.

Number of triples related to Scholarly Articles

  • Total triple count: 12819818340 (12.8B)
  • Number of triples included within the scholarly articles: 6399256630 (6.4B). 50% of all triples are directly related to scholarly articles.
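The two percentages above follow directly from the quoted counts. A minimal sketch of the arithmetic, assuming the rounded total of 94M entities (the variable and function names are illustrative, not taken from the original analysis):

```python
# Figures quoted in this section (Wikidata dump of 20210719).
TOTAL_ENTITIES = 94_000_000        # rounded total entity count
SCHOLARLY_ARTICLES = 37_308_158    # instances of scholarly article
TOTAL_TRIPLES = 12_819_818_340
SCHOLARLY_TRIPLES = 6_399_256_630  # triples in the context of scholarly articles


def share(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


entity_share = share(SCHOLARLY_ARTICLES, TOTAL_ENTITIES)
triple_share = share(SCHOLARLY_TRIPLES, TOTAL_TRIPLES)
print(f"{entity_share:.1f}% of entities, {triple_share:.1f}% of triples")
```

Both values land just under the rounded 40% and 50% reported in the text.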

Technically, the triples related to scholarly articles are those that fall under the 'context' of items that are scholarly articles. An example of an item's context is the Q39790431 dump. The context count includes:

  • + all triples in which the item is a subject
  • + the statement triples that arise from these triples
  • + triples that define the item itself
  • - the refs and vals, as those are re-usable in other items.
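The context rule above can be sketched on a small in-memory list of triples. This is only an illustration under stated assumptions, not the pipeline used for the analysis: it assumes the standard Wikidata RDF namespaces for statement, reference, and value nodes, and it assumes that the "triples that define the item" are those attached to the data node that points at the item via schema:about. The function name and the sample triples are hypothetical.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Standard Wikidata RDF namespaces.
WDS = "http://www.wikidata.org/entity/statement/"
WDREF = "http://www.wikidata.org/reference/"
WDV = "http://www.wikidata.org/value/"
SCHEMA_ABOUT = "http://schema.org/about"


def context_triples(item: str, triples: Iterable[Triple]) -> List[Triple]:
    """Collect the triples that count towards the context of `item`."""
    triples = list(triples)
    # Statement nodes are the wds: objects of triples whose subject is the item.
    statement_nodes = {o for s, _, o in triples if s == item and o.startswith(WDS)}
    # Assumption: the item is "defined" by the data node that is schema:about it.
    data_nodes = {s for s, p, o in triples if p == SCHEMA_ABOUT and o == item}
    context = []
    for triple in triples:
        subject = triple[0]
        if subject.startswith((WDREF, WDV)):
            continue  # refs and vals are shared across items, so they are skipped
        if subject == item or subject in statement_nodes or subject in data_nodes:
            context.append(triple)
    return context


# Hypothetical sample data for one item.
ITEM = "http://www.wikidata.org/entity/Q39790431"
STMT = WDS + "Q39790431-example"
REF = WDREF + "example"
DATA = "http://example.org/data/Q39790431"
ARTICLE_CLASS = "http://www.wikidata.org/entity/Q13442814"
SAMPLE = [
    (ITEM, "http://www.wikidata.org/prop/direct/P31", ARTICLE_CLASS),
    (ITEM, "http://www.wikidata.org/prop/P31", STMT),
    (STMT, "http://www.wikidata.org/prop/statement/P31", ARTICLE_CLASS),
    (STMT, "http://www.w3.org/ns/prov#wasDerivedFrom", REF),
    (REF, "http://www.wikidata.org/prop/reference/P248", "http://www.wikidata.org/entity/Q1"),
    (DATA, SCHEMA_ABOUT, ITEM),
    ("http://www.wikidata.org/entity/Q2", "http://www.wikidata.org/prop/direct/P31", "http://www.wikidata.org/entity/Q1"),
]
print(len(context_triples(ITEM, SAMPLE)))
```

In this sketch the link from a statement to its reference (prov:wasDerivedFrom) is counted, while the triples of the reference node itself are not.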

Triples per article

i.e. triples related to the articles.

  • Average triples per article: 171
  • Minimum triples per article: 10
  • Maximum triples per article: 41847

See examples of articles with high number of triples in Wikidata_Basic_Analysis#Items

Direct triples per article

i.e. triples where a scholarly article is the subject.

  • Average direct triples per article: 84
  • Minimum direct triples per article: 7
  • Maximum direct triples per article: 16758

See examples of articles with high number of direct triples in Wikidata_Basic_Analysis#Top_Subjects

This raises the question: Are only a few articles responsible for this huge number of triples, or are the triples distributed evenly among the articles? So we look into the distribution of triples.

Distribution of triples per article

Distribution of the count of triples per article
Number of triples | Count of articles
10 to 100 | 10583118
100 to 1k | 26626656
1k to 10k | 96563
more than 40k | 1821

File:Related triples per scholarly article.png

Number of days to recovery

If 6.4B triples were removed from Wikidata, how long would it take, at the current rate of growth, for Wikidata to return to its original size?
The growth rate of triples is not constant, but if the growth shown in the Grafana dashboard is approximated as a straight line, Wikidata grows at a rate of 4.77M triples per day. This rate was calculated from the number of triples at the start and end of a roughly 90-day interval (11 March 2021 to 6 June 2021). The actual rate could be somewhat faster or slower than this.

  • Wikidata would take about 1300 days = 44 months = approx. 3.6 years to get back to its original size. This is a rough approximation, since the growth rate of Wikidata is not constant.
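The recovery estimate is a single division under the linear-growth assumption. A minimal sketch, using the two figures quoted above (the variable names are illustrative, and the month length is taken as an average calendar month):

```python
SCHOLARLY_TRIPLES = 6_399_256_630  # triples that would be removed
GROWTH_PER_DAY = 4_770_000         # triples added per day, assumed constant

DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12

days = SCHOLARLY_TRIPLES / GROWTH_PER_DAY
months = days / DAYS_PER_MONTH
years = days / DAYS_PER_YEAR
print(f"{days:.0f} days, {months:.1f} months, {years:.2f} years")
```

The unrounded quotient is a little above 1340 days, which is consistent with the rounded figures of 44 months and roughly 3.6 years.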

Entities connected to scholarly articles

Entities scholarly articles connect to

Rate of growth of scholarly articles

Predicates of Scholarly Articles

  • Total distinct predicates: Scholarly articles use 2107 distinct predicates.
  • Non-wikidata predicates: 17 of these predicates are non-Wikidata predicates, i.e. unlike P31, P279, etc., they do not start with the prefix wikidata.org. These predicates form 60% of the scholarly article triples.
  • Non-wiki predicates: 14 of these predicates start with neither wikidata.org nor wikiba.se. These form 46% of the scholarly article triples.
  • Descriptions: Descriptions of scholarly articles form 20.65% of the scholarly article triples. This is about 10% of all Wikidata triples. Recall that the descriptions of all items together form 19.5% of all of Wikidata.
  • External IDs: There are 1000 external IDs associated with scholarly articles, which form ~50% of the distinct predicates of scholarly articles. External IDs form 4% of scholarly article triples and 2% of all triples.
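The split into Wikidata, wikiba.se, and other predicates can be sketched as a check on the host part of each predicate URI. This is an illustrative approximation of the prefix test described above, not the query used for the analysis; the function name and the three category labels are assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def classify_predicate(uri: str) -> str:
    """Classify a predicate URI by the host it belongs to."""
    host = urlparse(uri).netloc.lower()
    if host == "wikidata.org" or host.endswith(".wikidata.org"):
        return "wikidata"
    if host == "wikiba.se" or host.endswith(".wikiba.se"):
        return "wikibase"
    return "other"


# A few predicates that appear among the top predicates of scholarly articles.
SAMPLE_PREDICATES = [
    "http://schema.org/description",
    "http://wikiba.se/ontology#rank",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/ns/prov#wasDerivedFrom",
    "http://www.wikidata.org/prop/P2860",
    "http://www.w3.org/2000/01/rdf-schema#label",
]
category_counts = Counter(classify_predicate(u) for u in SAMPLE_PREDICATES)
print(category_counts)
```

Under this scheme the "non-wikidata" group is the union of the "wikibase" and "other" categories, and the "non-wiki" group is "other" alone.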

Top Predicates

Top predicates of scholarly articles
Predicate | Predicate label | # of Triples | % of Scholarly Article Triples | % of all Triples
http://schema.org/description | | 1321691671 | 20.654 | 10.252
http://wikiba.se/ontology#rank | | 773820830 | 12.092 | 6.002
http://www.w3.org/1999/02/22-rdf-syntax-ns#type | | 773788620 | 12.092 | 6.002
http://www.w3.org/ns/prov#wasDerivedFrom | | 691773237 | 10.810 | 5.366
http://www.wikidata.org/prop/P2860 | cites work | 263097842 | 4.111 | 2.041
http://www.wikidata.org/prop/statement/P2860 | cites work | 263097837 | 4.111 | 2.041
http://www.wikidata.org/prop/direct/P2860 | cites work | 263004896 | 4.110 | 2.040
http://www.wikidata.org/prop/qualifier/P1545 | series ordinal | 154088914 | 2.408 | 1.195
http://www.wikidata.org/prop/P2093 | author name string | 134315644 | 2.099 | 1.042
http://www.wikidata.org/prop/statement/P2093 | author name string | 134315587 | 2.099 | 1.042
http://www.wikidata.org/prop/direct/P2093 | author name string | 134227496 | 2.098 | 1.041
http://www.w3.org/2000/01/rdf-schema#label | | 74437634 | 1.163 | 0.577
http://www.wikidata.org/prop/statement/P31 | instance of | 40319296 | 0.630 | 0.313
http://www.wikidata.org/prop/P31 | instance of | 40319296 | 0.630 | 0.313
http://www.wikidata.org/prop/direct/P31 | instance of | 40319268 | 0.630 | 0.313
http://www.wikidata.org/prop/P1476 | title | 37524904 | 0.586 | 0.291
http://www.wikidata.org/prop/statement/P1476 | title | 37524903 | 0.586 | 0.291
http://www.wikidata.org/prop/direct/P1476 | title | 37523899 | 0.586 | 0.291
http://www.wikidata.org/prop/P577 | publication date | 37309627 | 0.583 | 0.289
http://www.wikidata.org/prop/statement/P577 | publication date | 37309626 | 0.583 | 0.289
http://www.wikidata.org/prop/statement/value/P577 | publication date | 37309625 | 0.583 | 0.289

Top Properties

Considering only Wikidata predicates, i.e. mainly properties of items, and grouping them by property label, we get the following counts for the top properties. The counts include the p, s, ps, psv, wdt, wdtn, pq, pqv, and pqn forms where applicable. They do not include triples that expand from the statements.

Top properties of scholarly articles
Property name | # of Triples | % of Scholarly Article Triples | % of all Triples
cites work | 789200575 | 12.332 | 6.122
author name string | 402859029 | 6.296 | 3.125
series ordinal | 154089316 | 2.408 | 1.195
publication date | 149227733 | 2.332 | 1.156
DOI | 134464365 | 2.100 | 1.045
instance of | 120957868 | 1.890 | 0.939
title | 112574265 | 1.758 | 0.873
published in | 109147370 | 1.707 | 0.846
page(s) | 104199512 | 1.629 | 0.807
volume | 103612864 | 1.620 | 0.804
PubMed ID | 95893499 | 1.499 | 0.744
issue | 95083032 | 1.485 | 0.738
author | 59439111 | 0.929 | 0.461
main subject | 41515900 | 0.648 | 0.321
language of work or name | 34801291 | 0.543 | 0.270
PMCID | 19059585 | 0.297 | 0.147
ResearchGate publication ID | 13734703 | 0.216 | 0.108
stated as | 8609378 | 0.135 | 0.067
exact match | 8162235 | 0.129 | 0.063
Dimensions Publication ID | 4617733 | 0.072 | 0.036
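The grouping behind this table can be sketched by collapsing every namespaced form of a predicate onto its property. The sketch below groups by property ID rather than by label (an assumption made for simplicity), and the regular expression is an illustrative way to strip the prop/, prop/statement/, prop/direct/, prop/qualifier/ and similar path segments. The input rows are copied from the top-predicates table.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches http://www.wikidata.org/prop/<optional segments>/P<number>.
PROPERTY_URI = re.compile(r"^http://www\.wikidata\.org/prop/(?:[a-z-]+/)*(P\d+)$")


def group_by_property(rows: Iterable[Tuple[str, int]]) -> Counter:
    """Sum triple counts over all namespaced forms of each Wikidata property."""
    totals: Counter = Counter()
    for uri, count in rows:
        match = PROPERTY_URI.match(uri)
        if match:  # non-Wikidata predicates are skipped
            totals[match.group(1)] += count
    return totals


ROWS = [
    ("http://schema.org/description", 1321691671),
    ("http://www.wikidata.org/prop/P2860", 263097842),
    ("http://www.wikidata.org/prop/statement/P2860", 263097837),
    ("http://www.wikidata.org/prop/direct/P2860", 263004896),
    ("http://www.wikidata.org/prop/qualifier/P1545", 154088914),
    ("http://www.wikidata.org/prop/P577", 37309627),
    ("http://www.wikidata.org/prop/statement/P577", 37309626),
    ("http://www.wikidata.org/prop/statement/value/P577", 37309625),
]
totals = group_by_property(ROWS)
print(totals.most_common())
```

With only these rows, the P2860 total equals the "cites work" count in the table above. Other properties come out lower here because only some of their forms appear among the top predicates.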

Top External IDs

Grouping by the property label of the external IDs, we get the following counts for the top external IDs. The counts include the p, s, ps, psv, wdt, wdtn, pq, pqv, and pqn forms where applicable. They do not include triples that expand from the statements.

Top external IDs of scholarly articles
External ID | # of Triples | % of Scholarly Article Triples | % of all Triples
DOI | 134464365 | 2.100 | 1.045
PubMed ID | 95893499 | 1.499 | 0.744
PMCID | 19059585 | 0.297 | 0.147
ResearchGate publication ID | 13734703 | 0.216 | 0.108
Dimensions Publication ID | 4617733 | 0.072 | 0.036
CJFD journal article ID | 2915210 | 0.045 | 0.024
DBLP publication ID | 2080695 | 0.035 | 0.015
ADS bibcode | 1186590 | 0.018 | 0.009
OpenCitations bibliographic resource ID | 1161475 | 0.020 | 0.010
arXiv ID | 1040311 | 0.015 | 0.009

Distribution of predicates

The distribution of distinct predicates per article is given below:

  • Average distinct predicates per article: 29
  • Maximum distinct predicates per article: 102
  • Minimum distinct predicates per article: 7
Distribution of distinct predicates per scholarly article
Number of distinct predicates | Number of articles
less than 10 | 277
10 to 20 | 152995
20 to 30 | 18146794
30 to 40 | 18294391
40 to 50 | 711176
50 to 60 | 2493
60 to 70 | 26
70 to 80 | 4
80 to 90 | 1
90 to 100 | 0
more than 100 | 1
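A distribution like this one can be produced by mapping each article's distinct-predicate count to a bin label and counting the labels. The sketch below is illustrative only. It assumes half-open bins (a count of exactly 20 falls into "20 to 30") and places a count of exactly 100 in the top bin, neither of which is stated in the source; the sample counts are made up.

```python
from collections import Counter


def bucket_label(n: int, width: int = 10, cap: int = 100) -> str:
    """Map a distinct-predicate count to its bin label."""
    if n < width:
        return f"less than {width}"
    if n >= cap:
        # Assumption: exactly `cap` is placed in the top bin.
        return f"more than {cap}"
    low = (n // width) * width
    return f"{low} to {low + width}"


# Hypothetical per-article counts of distinct predicates.
sample_counts = [7, 12, 29, 35, 102]
histogram = Counter(bucket_label(n) for n in sample_counts)
print(histogram)
```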

File:Distinct predicates per scholarly article.png

Scholarly Articles' Author

Number of authors that would be isolated from Wikidata if scholarly articles were removed (that is, author entities that were probably created only for the articles they wrote).

References