User:AKhatun/Wikidata Scholarly Articles Subgraph Analysis
Revision as of 13:47, 11 August 2021
Scholarly articles form a large portion of Wikidata, as has been established for some time. This page presents the baseline analysis done on scholarly articles in Wikidata. The aim is to identify not only what portion of Wikidata consists of scholarly articles, but also how connected they are to other parts of Wikidata, how many users query this subgraph, and what percentage of all queries those are. The analysis is therefore divided into two parts:
- Analysis on Wikidata:
- What are scholarly articles
- Number and percentage of Wikidata entities that are scholarly articles
- How many entities connect to scholarly articles
- How many entities do scholarly articles connect to
- Rate of growth of scholarly articles
- Number of authors that would be isolated from Wikidata if Scholarly articles were removed (meaning, these author entities were created probably only for the purposes of the articles they wrote.)
- Analysis of the WDQS SPARQL Queries:
- What defines a query to be associated with scholarly articles
- The number and percentage of queries associated with scholarly articles
- The number and percentage of queries that require entities other than scholarly articles vs those that require only scholarly articles
The Wikidata dump of 20210719 was used for the following analysis.
Ticket: T281854
Definition of Scholarly Articles
Research papers or books can be considered scholarly articles. This includes journal articles, conference papers, books, etc. Wikidata has several types of entities to cover these. More technically speaking, the entities involved are:
- academic journal article: Q18918145
- scholarly article: Q13442814
- scientific journal: Q5633421
- scholarly conference abstract: Q58632367
- conference paper: Q23927052
- scientific conference paper: Q10885494
- And their subclasses
Most of these overlap with scholarly articles and with each other. Scholarly articles have by far the largest count (37M), while all other types combined add only about 130K entities (excluding those that are also scholarly articles). The analysis therefore focuses on scholarly articles. The 130K figure could be even lower if the other article types overlap among themselves. This list is not exhaustive, but it covers most of what we can call scholarly articles, papers, or journal articles. A detailed tree diagram of how these entities are related can be found in Scholarly Articles' Tree in Wikidata.
Scholarly Articles
File:Scholarly articles example.png
Number of Scholarly Articles
- Total entity count: 94M [1]
- Number of entities that are instance of scholarly article: 37308158 (37.3M). About 40% of all Wikidata entities are therefore scholarly articles.
- Total triple count: 12819818340 (12.8B)
- Number of triples included within the scholarly articles: 6399256630 (6.4B). 50% of all triples are directly related to scholarly articles.
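The two percentages above follow directly from the quoted counts. A minimal sketch of the arithmetic, assuming the rounded 94M entity total from Special:Statistics (the constant and function names are illustrative, not taken from the analysis code):

```python
# Counts quoted in the list above (Wikidata dump of 20210719).
TOTAL_ENTITIES = 94_000_000        # rounded figure from Special:Statistics
SCHOLARLY_ARTICLES = 37_308_158
TOTAL_TRIPLES = 12_819_818_340
SCHOLARLY_TRIPLES = 6_399_256_630


def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


entity_share = share(SCHOLARLY_ARTICLES, TOTAL_ENTITIES)
triple_share = share(SCHOLARLY_TRIPLES, TOTAL_TRIPLES)
print(f"{entity_share:.1f}% of entities, {triple_share:.1f}% of triples")
```

Because the entity total is itself rounded, the entity share should be read as approximate.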
Technically, the triples related to scholarly articles are those that fall under the 'context' of items that are scholarly articles. An example of an item's context is the Q39790431 dump. The context count includes:
- + all triples in which the item is a subject
- + the statement triples that rise from these triples
- + triples that define the item itself
- - the refs and vals, as those are re-usable in other items.
Triples per article
i.e. triples related to the articles.
- Average triple per article: 171
- Minimum triple per article: 10
- Maximum triple per article: 41847
See examples of articles with high number of triples in Wikidata_Basic_Analysis#Items
Direct triples per article
i.e. triples where a scholarly article is the subject.
- Average direct triple per article: 84
- Minimum direct triple per article: 7
- Maximum direct triple per article: 16758
See examples of articles with high number of direct triples in Wikidata_Basic_Analysis#Top_Subjects
This raises the question: Are only a few articles responsible for this huge number of triples, or are the triples distributed evenly among the articles? So we look into the distribution of triples.
Distribution of triples per article
Number of days to recovery
If 6.4B triples were to be removed from Wikidata, how long would it take for Wikidata to get back to its original size at the current rate of growth?
The growth rate of triples is not constant, but if we approximate the growth shown on the Grafana dashboard as a straight line, Wikidata grows at a rate of 4.77M triples per day. This rate was calculated from the number of triples at the start and end of a roughly 90-day interval (11/3/21 to 6/6/21). The actual rate could be somewhat faster or slower.
- Wikidata would take about 1300 days = 44 months = approx. 3.6 years to get back to its original size. This is a rough approximation, since the growth rate of Wikidata is not constant.
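The recovery estimate is a single division under the linear-growth assumption. A minimal sketch using the two figures quoted above; the average month and year lengths (30.44 and 365.25 days) are assumptions made here for the unit conversion:

```python
TRIPLES_REMOVED = 6_399_256_630   # triples in the scholarly article subgraph
GROWTH_PER_DAY = 4_770_000        # linear approximation from the Grafana dashboard

# Assumed average calendar lengths, used only for unit conversion.
DAYS_PER_MONTH = 30.44
DAYS_PER_YEAR = 365.25

days = TRIPLES_REMOVED / GROWTH_PER_DAY
months = days / DAYS_PER_MONTH
years = days / DAYS_PER_YEAR
print(f"{days:.0f} days, {months:.0f} months, {years:.1f} years")
```

The unrounded result is a little over 1340 days, which is consistent with the rounded figure of about 1300 days given the uncertainty in the growth rate.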
Entities connected to scholarly articles
- Links to scholarly articles: Number of triples that have scholarly articles as object is 529702818 (530M)
- Outside links to scholarly articles: Number of triples that have scholarly articles as object, but subject is not another scholarly article is 266708031 (266M)
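The two counts differ only in whether the subject of the triple is itself a scholarly article. A minimal sketch of that distinction on toy data; the item identifiers, the triples, and the function name are hypothetical and this is not the query used for the analysis:

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def count_links_to(triples: Iterable[Triple], articles: Set[str]) -> Tuple[int, int]:
    """Return (all links to articles, links whose subject is not an article)."""
    total = 0
    outside = 0
    for subject, _predicate, obj in triples:
        if obj in articles:
            total += 1
            if subject not in articles:
                outside += 1
    return total, outside


# Toy data: A* are scholarly articles, X* are other items.
articles = {"A1", "A2"}
triples = [
    ("A1", "cites work", "A2"),            # article -> article
    ("X1", "described by source", "A1"),   # outside -> article
    ("A2", "main subject", "X2"),          # article -> outside (not counted)
    ("X3", "described by source", "A2"),   # outside -> article
]
total_links, outside_links = count_links_to(triples, articles)

# Share of outside links in the counts reported above.
reported_outside_share = 100.0 * 266_708_031 / 529_702_818
```

With the reported counts, roughly half of all links to scholarly articles come from items that are not scholarly articles.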
Entities scholarly articles connect to
Queries that directly find links from scholarly articles to items that are not scholarly articles run into timeouts. Another way to estimate this is to look at predicates that tend to point towards non-scholarly-article items in Wikidata.
The predicates considered were:
- main subject
- language of work or name
- stated as
- on focus list of Wikimedia project
- describes a project that uses
- determination method
- sponsor
- genre
- object has role
Together these account for 85851140 (85M) triples, so about 85M triples link directly from scholarly articles to items that are possibly not scholarly articles.
Note that this does not include the triples contained in the statements of these triples, if any; those should not add many more triples.
Rate of growth of scholarly articles
Scholarly Article Counts Summary
File:Scholarly articles count tree.png
Predicates of Scholarly Articles
- Total distinct predicates: Scholarly articles use 2107 distinct predicates in total.
- Non-wikidata predicates: 17 of these predicates are non-Wikidata predicates, i.e. unlike P31/P279 etc., they do not start with the prefix wikidata.org. These predicates form 60% of the scholarly article triples.
- Non-wiki predicates: 14 of these predicates don't start with wikidata.org or wikiba.se. These are 46% of the scholarly article triples.
- Descriptions: Descriptions of scholarly articles form 20.5% of the scholarly article triples. This is 10% of all Wikidata triples. Recall that all descriptions (of all items) together form 19.5% of all Wikidata triples.
- External IDs: There are 1000 external IDs associated with scholarly articles, which form ~50% of the distinct predicates of scholarly articles. External IDs form 4% of scholarly article triples, 2% of all triples.
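The non-wikidata and non-wiki groupings above amount to classifying each predicate URI by its host. A minimal sketch of such a classification, assuming predicates are available as full URIs; the function name and the sample list are illustrative and not the code used for the analysis:

```python
from collections import Counter
from urllib.parse import urlparse


def classify_predicate(uri: str) -> str:
    """Classify a predicate URI by host: 'wikidata', 'wikibase' or 'other'."""
    host = urlparse(uri).netloc
    if host.endswith("wikidata.org"):
        return "wikidata"
    if host.endswith("wikiba.se"):
        return "wikibase"
    return "other"


# Illustrative sample of predicate URIs (not the full list of 2107).
predicates = [
    "http://www.wikidata.org/prop/direct/P31",
    "http://www.wikidata.org/prop/P2860",
    "http://wikiba.se/ontology#statements",
    "http://schema.org/description",
    "http://www.w3.org/2000/01/rdf-schema#label",
]
counts = Counter(classify_predicate(p) for p in predicates)

non_wikidata = counts["wikibase"] + counts["other"]  # not under wikidata.org
non_wiki = counts["other"]                           # neither wikidata.org nor wikiba.se
```

In this toy sample, `non_wikidata` corresponds to the group of 17 predicates and `non_wiki` to the group of 14 in the real data.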
Top Predicates
Top Properties
Considering only Wikidata predicates, i.e. mainly properties of items, and grouping by property label, we get the following counts for the top predicates. The counts include p, s, ps, psv, wdt, wdtn, pq, pqv, pqn where applicable. They do not include triples that expand from the statements.
Property name | # of Triples | % of Scholarly Article Triples | % of all Triples |
---|---|---|---|
cites work | 789200575 | 12.332 | 6.122 |
author name string | 402859029 | 6.296 | 3.125 |
series ordinal | 154089316 | 2.408 | 1.195 |
publication date | 149227733 | 2.332 | 1.156 |
DOI | 134464365 | 2.100 | 1.045 |
instance of | 120957868 | 1.890 | 0.939 |
title | 112574265 | 1.758 | 0.873 |
published in | 109147370 | 1.707 | 0.846 |
page(s) | 104199512 | 1.629 | 0.807 |
volume | 103612864 | 1.620 | 0.804 |
PubMed ID | 95893499 | 1.499 | 0.744 |
issue | 95083032 | 1.485 | 0.738 |
author | 59439111 | 0.929 | 0.461 |
main subject | 41515900 | 0.648 | 0.321 |
language of work or name | 34801291 | 0.543 | 0.270 |
PMCID | 19059585 | 0.297 | 0.147 |
ResearchGate publication ID | 13734703 | 0.216 | 0.108 |
stated as | 8609378 | 0.135 | 0.067 |
exact match | 8162235 | 0.129 | 0.063 |
Dimensions Publication ID | 4617733 | 0.072 | 0.036 |
Top External IDs
Grouping by property label of the external IDs, we get the following counts for the top external IDs. The counts include p, s, ps, psv, wdt, wdtn, pq, pqv, pqn where applicable. They do not include triples that expand from the statements.
External ID | # of Triples | % of Scholarly Article Triples | % of all Triples |
---|---|---|---|
DOI | 134464365 | 2.100 | 1.045 |
PubMed ID | 95893499 | 1.499 | 0.744 |
PMCID | 19059585 | 0.297 | 0.147 |
ResearchGate publication ID | 13734703 | 0.216 | 0.108 |
Dimensions Publication ID | 4617733 | 0.072 | 0.036 |
CJFD journal article ID | 2915210 | 0.045 | 0.024 |
DBLP publication ID | 2080695 | 0.035 | 0.015 |
ADS bibcode | 1186590 | 0.018 | 0.009 |
OpenCitations bibliographic resource ID | 1161475 | 0.020 | 0.010 |
arXiv ID | 1040311 | 0.015 | 0.009 |
Distribution of predicates
The distribution of distinct predicates per article is given below:
- Average distinct predicate per article: 29
- Maximum distinct predicate per article: 102
- Minimum distinct predicate per article: 7
Scholarly Articles' Author
The following analysis is done with the predicate P50, which links to an author item in Wikidata. Other author data include 'author name string', but it is not considered here as these are literals and do not link to other Wikidata items.
- Number of authors (direct links to authors) in scholarly articles: 19756767 (19.7M)
- There are ~28M more triples where these authors are used as object (therefore linked to), but from items other than scholarly articles.
- Number of distinct authors in scholarly articles: 1801971 (1.8M)
- Number of distinct authors being referred to ONLY in scholarly articles: ## (##M)
- Naturally, number of distinct authors being referred to both inside and outside of scholarly articles: 1.8M - ##M = ##M
- Average author per article: ~2
- Maximum author per article: 1070
- Minimum author per article: 1
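The author statistics above can all be derived from the set of (article, author) pairs given by P50. A minimal sketch on toy data; the identifiers, the `linked_from_elsewhere` set, and the variable names are hypothetical:

```python
from collections import Counter

# Toy (article, author) pairs, one per P50 statement.
pairs = [
    ("A1", "Q1"), ("A1", "Q2"),
    ("A2", "Q1"),
    ("A3", "Q3"), ("A3", "Q1"), ("A3", "Q2"),
]
# Hypothetical set of authors that are also linked from non-article items.
linked_from_elsewhere = {"Q1"}

authors_per_article = Counter(article for article, _author in pairs)

total_author_links = len(pairs)
distinct_authors = {author for _article, author in pairs}
only_in_articles = distinct_authors - linked_from_elsewhere

average_authors = total_author_links / len(authors_per_article)
max_authors = max(authors_per_article.values())
min_authors = min(authors_per_article.values())
```

Articles without any P50 statement never appear in `pairs`, so the minimum computed this way cannot fall below 1. The set difference `only_in_articles` shows one way the count of authors referred to only in scholarly articles could be obtained.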