User:AKhatun/Wikidata Scholarly Articles Subgraph Analysis
Scholarly articles form a large portion of Wikidata, as has been established for some time. This page presents the baseline analysis done on scholarly articles in Wikidata. The aim is to identify not only what portion of Wikidata consists of scholarly articles, but also how connected this subgraph is to other parts of Wikidata, how many users query it, and what percentage of all queries those queries represent. The analysis is therefore divided into two parts:
- Analysis on Wikidata:
- What are scholarly articles
- Number and percentage of Wikidata entities that are scholarly articles
- How many entities connect to scholarly articles
- How many entities do scholarly articles connect to
- Rate of growth of scholarly articles
- Number of authors that would be isolated from Wikidata if scholarly articles were removed (meaning, these author entities were created probably only for the purposes of the articles they wrote.)
- Analysis of the WDQS SPARQL Queries:
- What defines a query to be associated with scholarly articles
- The number and percentage of queries associated with scholarly articles
- The number and percentage of queries that require entities other than scholarly articles vs those that require only scholarly articles
The Wikidata dump of 20210719 was used for the following analysis.
Definition of Scholarly Articles
Research papers or books can be considered scholarly articles. This includes journal articles, conference papers, books, etc. Wikidata has several types of entities to cover these. More technically speaking, the entities involved are:
- academic journal article: Q18918145
- scholarly article: Q13442814
- scientific journal: Q5633421
- scholarly conference abstract: Q58632367
- conference paper: Q23927052
- scientific conference paper: Q10885494
- And their subclasses
Most of these overlap with scholarly articles, and with each other. Scholarly articles have by far the largest count (37M), while everything else combined amounts to only ~130K (excluding those that are already counted as scholarly articles), so the analysis focuses more on scholarly articles than on the others. The 130K could be even lower if the other article types overlap among themselves. This is not an exhaustive list, but it covers most of what we can call scholarly articles, papers, or journal articles. A detailed tree diagram of how these entities are related can be found in Scholarly Articles' Tree in Wikidata.
Scholarly Article Stats
Number of Scholarly Articles
- Total entity count: 94M 
- Number of entities that are instance of scholarly article: 37308158 (37.3M). 40% of all Wikidata entities are therefore scholarly articles.
- Total triple count: 12819818340 (12.8B)
- Number of triples included within the scholarly articles: 6399256630 (6.4B). 50% of all triples are directly related to scholarly articles.
Technically, the triples related to scholarly articles come under the 'context' of items that are scholarly articles. An example of an item's context is the Q39790431 dump. The context count includes:
- + all triples in which the item is a subject
- + the statement triples that rise from these triples
- + triples that define the item itself
- - the refs and vals, as those are re-usable in other items.
Triples per article
i.e., triples related to the articles.
- Average triple per article: 171
- Minimum triple per article: 10
- Maximum triple per article: 41847
See examples of articles with high number of triples in Wikidata_Basic_Analysis#Items
Direct triples per article
i.e., triples where a scholarly article is the subject.
- Average direct triple per article: 84
- Minimum direct triple per article: 7
- Maximum direct triple per article: 16758
See examples of articles with high number of direct triples in Wikidata_Basic_Analysis#Top_Subjects
This raises the question: Are only a few articles responsible for this huge number of triples, or are the triples distributed evenly among the articles? So we look into the distribution of triples.
Distribution of triples per article
Number of days to recovery
If 6.4B triples were removed from Wikidata, how long would it take, at the current rate of growth, for Wikidata to get back to its original size?
The growth rate of triples is not constant, but if the growth is approximated as a straight line, the Grafana dashboard shows Wikidata growing at a rate of 4.77M triples per day. This rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). The actual rate could be somewhat faster or slower than this.
- Wikidata would take about 1300 days = 44 months = approx. 3.6 years to get back to its original size. This is a rough approximation, since the growth rate of Wikidata is not constant.
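The estimate above is a simple division of the triple count by the daily growth rate. A minimal sketch of that arithmetic, assuming linear growth and using only the two figures quoted in the text (the function name and the average month and year lengths are choices made for this illustration):

```python
# Figures quoted in the text above.
TRIPLES_REMOVED = 6_399_256_630   # triples in the scholarly article subgraph
GROWTH_PER_DAY = 4_770_000        # approximate triples added per day

def recovery_time(triples_removed, growth_per_day):
    """Return the recovery time in days, months and years, assuming linear growth."""
    days = triples_removed / growth_per_day
    return {
        "days": days,
        "months": days / 30.4375,  # average month length in days (assumption)
        "years": days / 365.25,    # average year length in days (assumption)
    }

estimate = recovery_time(TRIPLES_REMOVED, GROWTH_PER_DAY)
print({unit: round(value, 1) for unit, value in estimate.items()})
```

The unrounded result is about 1,340 days, which the text rounds down to 1300 days.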
Entities connected to scholarly articles
- Links to scholarly articles: Number of triples that have scholarly articles as object is 529702818 (530M)
- Outside links to scholarly articles: Number of triples that have scholarly articles as object, but subject is not another scholarly article is 266708031 (266M)
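The difference between the two counts can be illustrated with a small sketch. It is not the query used for the analysis; the triples and item names below are toy values invented for the example, and the function name is hypothetical.

```python
def inbound_link_counts(triples, scholarly_articles):
    """Count triples whose object is a scholarly article.

    Returns (links, outside_links), where outside_links only counts
    triples whose subject is not itself a scholarly article.
    """
    links = 0
    outside_links = 0
    for subject, _predicate, obj in triples:
        if obj in scholarly_articles:
            links += 1
            if subject not in scholarly_articles:
                outside_links += 1
    return links, outside_links

# Toy data: A1 and A2 stand for scholarly articles, X1 and X2 for other items.
articles = {"A1", "A2"}
triples = [
    ("A1", "P2860", "A2"),  # article cites article: a link, but not an outside link
    ("X1", "P2860", "A1"),  # other item points to an article: an outside link
    ("X1", "P31", "X2"),    # unrelated triple: not counted
]
print(inbound_link_counts(triples, articles))
```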
Entities scholarly articles connect to
Queries that directly find the links from scholarly articles to non-scholarly-article items run into timeouts. Another way to estimate this is to look at predicates that tend to point towards non-scholarly-article items in Wikidata.
The predicates considered were:
- main subject
- language of work or name
- stated as
- on focus list of Wikimedia project
- describes a project that uses
- determination method
- object has role
Together these account for 85851140 (85M) triples; therefore about 85M triples link directly from scholarly articles to items that are probably not scholarly articles.
Note that this does not include triples contained in the statements of these triples, if any, but these should not add many more triples.
Rate of growth of scholarly articles
The growth of:
- the number of scholarly articles
- the number of triples related to scholarly articles
is shown in the figure below. They do not appear to grow quickly, but the data spans only 4 weeks.
More trend data can be found at wikicite.org/statistics, but it is reported for all publications, a much larger category than scholarly articles.
Scholarly Article Counts Summary
Predicates of Scholarly Articles
- Total distinct predicates: Scholarly articles use 2107 distinct predicates.
- Non-wikidata predicates: 17 of these predicates are non-Wikidata predicates, i.e., unlike P31, P279, etc., they do not start with the prefix wikidata.org. These predicates form 60% of the scholarly article triples.
- Non-wiki predicates: 14 of these predicates don't start with wikiba.se. These are 46% of the scholarly article triples.
- Descriptions: Descriptions of scholarly articles form 20.5% of the scholarly article triples. This is 10% of all Wikidata triples. Recall that all descriptions (of all items) together form 19.5% of the entire Wikidata.
- External IDs: There are 1000 external IDs associated with scholarly articles, which form ~50% of the distinct predicates of scholarly articles. External IDs form 4% of scholarly article triples, 2% of all triples.
Considering only Wikidata predicates, i.e., mainly properties of items, and grouping by the label of the property, we get the following counts for the top predicates. The counts include p, s, ps, psv, wdt, wdtn, pq, pqv, pqn where applicable. They do not include the triples that expand from the statements.
|Property name||# of Triples||% of Scholarly Article Triples||% of all Triples|
|author name string||402859029||6.296||3.125|
|language of work or name||34801291||0.543||0.270|
|ResearchGate publication ID||13734703||0.216||0.108|
|Dimensions Publication ID||4617733||0.072||0.036|
Top External IDs
Grouping by the property label of the external IDs, we get the following counts for the top external IDs. The counts include p, s, ps, psv, wdt, wdtn, pq, pqv, pqn where applicable. They do not include the triples that expand from the statements.
|External ID||# of Triples||% of Scholarly Article Triples||% of all Triples|
|ResearchGate publication ID||13734703||0.216||0.108|
|Dimensions Publication ID||4617733||0.072||0.036|
|CJFD journal article ID||2915210||0.045||0.024|
|DBLP publication ID||2080695||0.035||0.015|
|OpenCitations bibliographic resource ID||1161475||0.020||0.010|
Distribution of predicates
The distribution of distinct predicates per article is given below:
- Average distinct predicate per article: 29
- Maximum distinct predicate per article: 102
- Minimum distinct predicate per article: 7
Scholarly Articles' Author
The following analysis is done with the predicate P50, which links to an author item in Wikidata. Other author data include 'author name string', but it is not considered here as these are literals and do not link to other Wikidata items.
- Total number of distinct authors (through P50) in wikidata: 1.9M
- Number of distinct authors who wrote scholarly articles: 1.8M
- Number of distinct authors who wrote ONLY scholarly articles: 1.44M
- Number of distinct authors who wrote other kinds of articles: 0.5M
- Number of distinct authors who did not write any scholarly article: 0.13M
- Number of authors who wrote both scholarly and non-scholarly articles: 0.36M
- Number of direct links to authors in scholarly articles: 19756767 (19.7M)
- There are ~28M triples where items (not scholarly articles) link to authors.
- Average author per article: ~2
- Maximum author per article: 1070
- Minimum author per article: 1
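The author categories above are related by simple set operations: 1.44M + 0.36M gives the 1.8M authors of scholarly articles, and 0.13M + 0.36M gives roughly the 0.5M authors of other kinds of articles. A minimal sketch of that breakdown, using toy author IDs invented for the example (the function and key names are hypothetical, not taken from the analysis code):

```python
def author_breakdown(scholarly_authors, other_authors):
    """Split authors into the categories used in the list above."""
    return {
        "all": len(scholarly_authors | other_authors),
        "scholarly": len(scholarly_authors),
        "only_scholarly": len(scholarly_authors - other_authors),
        "other": len(other_authors),
        "no_scholarly": len(other_authors - scholarly_authors),
        "both": len(scholarly_authors & other_authors),
    }

# Toy data: authors reached through P50 from scholarly articles
# and from other kinds of works.
scholarly = {"a1", "a2", "a3", "a4"}
other = {"a4", "a5"}

stats = author_breakdown(scholarly, other)
print(stats)
```

Authors in the "only_scholarly" group are the ones that would become isolated if the scholarly articles were removed.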
This section aims to find statistics about the queries that touch on the scholarly articles subgraph. Given the inter-connected nature of Wikidata, it is very hard to find every query that somehow relates to scholarly articles, but some approximations are possible. To this end, the possible types of queries are:
- The query directly mentions some scholarly article(s) (e.g. list the authors of an article)
- The query asks for scholarly articles in the results (e.g. list articles published on a specific date)
- The query asks for information that has to pass through the scholarly articles (e.g. list authors of articles published on a specific date)
To approximate the number of queries that relate to scholarly articles certain assumptions were made. These should practically cover all such queries.
- Query containing the QID of scholarly article: Q13442814
- The query will contain the QID of the articles themselves
- The query will contain objects, subjects, or predicates that are used most often in the scholarly article subgraph in Wikidata (e.g. author, publication date, etc.). This set of queries should approximately cover the 2nd and 3rd categories of queries.
- Query containing predicates that are used mostly in scholarly articles subgraph
- Query containing subject/object URIs that are used mostly in scholarly articles subgraph
- Query containing literals that are used mostly in scholarly articles subgraph
Q: What do you mean by subject/predicate/object mostly in scholarly articles subgraph?
A: Some items occur almost always in relation to scholarly articles. Typical examples include: the author property, certain author items, publication date, cites, DOI, and many other external IDs. One way to approximate this set of items is to compute, for each item, the percentage of its uses that fall within scholarly articles compared with the whole of Wikidata. The distribution of this percentage and further analysis are given below.
The following analysis uses the Wikidata dump of 20210816 and the WDQS public SPARQL queries of 08/2021. All query related values below are
|Category||Count||% of all queries|
|Query contains scholarly article QID Q13442814||70K||0.04|
|Query contains scholarly article instance QID||730K||0.4|
|Query contains properties mostly relevant to scholarly articles||2.7M||1.4|
|Query contains subject or object URIs mostly relevant to scholarly articles||750K||0.4|
|Query contains literals mostly relevant to scholarly articles||825K (max 2.2M)||0.4 (max 1.2)|
|Total scholarly article related queries||3.7M (max 4.7M)||1.96 (max 2.5)|
Queries with scholarly article QID
Although there are other kinds of articles (see Definition of Scholarly Articles for details), since scholarly articles outnumber the others significantly, the first step was to find queries that specify the QID of scholarly articles (Q13442814) directly. This includes queries that ask for a list of scholarly articles under other conditions. For instance: list the scholarly articles published on a specific date or by a specific author.
- The number of queries that contained the QID of scholarly articles came out to be ~70K, which is 0.04% of monthly queries.
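Checking a query for the scholarly article QID can be sketched as a text match. This is an illustration only, assuming the raw query string is available; the function name is hypothetical, and the actual analysis may have matched on parsed queries instead.

```python
import re

SCHOLARLY_ARTICLE_QID = "Q13442814"
# Word boundaries keep Q13442814 from matching inside a longer ID such as Q134428140.
QID_PATTERN = re.compile(rf"\b{SCHOLARLY_ARTICLE_QID}\b")

def mentions_scholarly_article_class(query):
    """Return True if the query text contains the scholarly article QID."""
    return QID_PATTERN.search(query) is not None

query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q13442814 ; wdt:P50 ?author . }"
print(mentions_scholarly_article_class(query))
```

The same check works for the prefixed form (wd:Q13442814) and for the full entity URI, since both end in the bare QID.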
Queries with scholarly article instance
Another almost direct way to identify queries related to scholarly articles is to find the queries that mention any scholarly article. For this, the queries are checked for the presence of items that are instance of scholarly article. For example: list the authors of an article.
- The number of such queries was ~730K, which is 0.4% of monthly queries.
Queries with properties mostly relevant to scholarly articles
Some properties are used almost always for scholarly articles, such as author (P50), author name string (P2093), cites work (P2860), IDs like P7710, P5875, P818, etc. If these properties were to be used in a query, we can assume the query has to pass through the scholarly articles subgraph.
There are around 2000 distinct properties used in the scholarly articles subgraph (refer to Predicates of Scholarly Articles for details on properties). To get a list of properties most concerned with scholarly articles, the following steps were taken:
- The usage count of these properties was counted in the entirety of Wikidata
- The usage count of properties was counted within the scholarly articles subgraph
- Then the percentage of usage within the scholarly article subgraph was calculated
- All properties with >=99% usage solely in scholarly articles subgraph were considered properties most concerned with scholarly articles.
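The steps above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name, the dictionary inputs, and the toy counts are all assumptions made for the example.

```python
def select_subgraph_properties(total_counts, subgraph_counts, threshold=99.0):
    """Return {property: usage percentage} for properties whose usage
    falls within the subgraph at least `threshold` percent of the time."""
    selected = {}
    for prop, in_subgraph in subgraph_counts.items():
        total = total_counts.get(prop, 0)
        if total == 0:
            continue  # no usage recorded in the full dump; skip to avoid dividing by zero
        percentage = 100.0 * in_subgraph / total
        if percentage >= threshold:
            selected[prop] = percentage
    return selected

# Toy counts, invented for illustration.
total_counts = {"P2093": 1000, "P31": 940, "P2860": 2000}
subgraph_counts = {"P2093": 995, "P31": 373, "P2860": 2000}

print(select_subgraph_properties(total_counts, subgraph_counts))
```

With these toy counts, P2093 and P2860 pass the 99% threshold, while P31 does not because it is used widely outside the subgraph.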
A histogram of the distribution of usage percentage for all predicates used in the scholarly article subgraph is given below. Note that the predicates were grouped with their P values. That is, wd:P50, wdt:P50 would be considered simply P50.
- The number of predicates used >=99% in scholarly article subgraph is 40. If each predicate was considered separately, i.e wd, wdt, wdtn etc were considered separately, then 128 predicates are counted to be used >=99% in scholarly articles.
- The number of queries that use these 40 predicates is 2.7M, which is 1.4% of monthly queries.
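Grouping predicates by their P value, as described above, amounts to stripping the namespace from each predicate URI. A minimal sketch, assuming predicates are available as full URIs with usage counts; the function names and toy counts are hypothetical.

```python
import re
from collections import Counter

# Matches a trailing Wikidata property ID such as ".../P50".
PROPERTY_ID = re.compile(r"/(P\d+)$")

def property_id(predicate_uri):
    """Return the P value of a Wikidata predicate URI, or None for other predicates."""
    match = PROPERTY_ID.search(predicate_uri)
    return match.group(1) if match else None

def group_by_property(predicate_counts):
    """Sum the usage counts of all predicate forms that share the same P value."""
    grouped = Counter()
    for uri, count in predicate_counts.items():
        pid = property_id(uri)
        if pid is not None:
            grouped[pid] += count
    return grouped

# Toy counts, invented for illustration.
counts = {
    "http://www.wikidata.org/prop/direct/P50": 10,     # wdt:P50
    "http://www.wikidata.org/prop/statement/P50": 10,  # ps:P50
    "http://www.wikidata.org/prop/P50": 10,            # p:P50
    "http://schema.org/description": 5,                # not a Wikidata property
}
print(group_by_property(counts))
```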
Queries with sub/obj URIs mostly relevant to scholarly articles
Just like properties, some items occur more often in the scholarly articles subgraph than elsewhere in Wikidata. Following a similar procedure, the usage percentage of subject and object URIs was calculated, and those with >=99% usage were considered more related to scholarly articles. The queries were then searched for the presence of these more relevant items.
A histogram of the distribution of usage percentage for all subjects and object URIs used in the scholarly article subgraph is given below.
- The number of queries that use items more relevant to scholarly articles is 750K, which is 0.4% of monthly queries.
- Note that number of queries containing instances of scholarly articles was 730K. Therefore, most of the former number actually includes queries that directly mention the scholarly articles.
- The figure shows that quite a large number of items are used mostly in scholarly articles (>=99% usage), which were then used to sift through queries.
Queries with literals mostly relevant to scholarly articles
Literals were analyzed separately. They always occur as objects, with or without additional language or datatype tags ("label"@en, "2021"^^xsd:integer). A user can construct queries containing literals in the following ways:
- match the whole literal string, e.g. "labelstring". Additionally, LANG/DATATYPE can be used for further filtering, but the literal itself appears as a plain string.
- match the literal with language or datatype tags, e.g. "labelstring"@en
- match part of the literal, e.g. using regex(?g, "matchThisSubstring") or more complex expressions.
For the purposes of this analysis, only the first two ways were considered when finding queries, since matching substrings is rather complicated and less reliable. We suspect, however, that if a query contains literals related to scholarly articles, it should also contain some predicates or other URIs that are mostly related to scholarly articles. As before, literals that were used >=99% of the time in the scholarly articles subgraph were used to find queries related to scholarly articles.
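Finding whole literals in a query, with or without tags, can be sketched with a regular expression. This is an illustration under simplifying assumptions: it only handles double-quoted literals in the raw query text, the function names are hypothetical, and the actual analysis may have extracted literals from parsed queries.

```python
import re

# A double-quoted literal, optionally followed by @lang or ^^datatype.
LITERAL = re.compile(
    r'"((?:[^"\\]|\\.)*)"'                   # literal body, allowing escaped characters
    r'(?:@([A-Za-z]+(?:-[A-Za-z0-9]+)*)'     # optional language tag, e.g. @en
    r'|\^\^(<[^>]+>|[A-Za-z][\w-]*:\w+))?'   # or optional datatype, e.g. ^^xsd:integer
)

def literals_in_query(query):
    """Return (value, language, datatype) for each double-quoted literal in the query."""
    return [(m.group(1), m.group(2), m.group(3)) for m in LITERAL.finditer(query)]

def mentions_relevant_literal(query, relevant_values):
    """Return True if any whole literal in the query is in the set of relevant values."""
    return any(value in relevant_values for value, _lang, _dtype in literals_in_query(query))

query = 'SELECT ?a WHERE { ?a rdfs:label "Some title"@en ; wdt:P577 "2021"^^xsd:integer . }'
print(literals_in_query(query))
print(mentions_relevant_literal(query, {"Some title"}))
```

A string passed to regex() is also picked up as a literal here, but it only counts as a match when it equals a whole relevant literal, which is consistent with leaving substring matching out of scope.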
A histogram of the distribution of usage percentage for all literals used in the scholarly article subgraph is given below.
- The number of such queries was ~825K, which is 0.4% of monthly queries. Note that this number was obtained from July. In August, it was ~600K.
Removing literals used in references and values
References and values are not considered part of the scholarly article subgraph, since they may be used in other places as well. But this causes the usage percentage of certain items clearly related to scholarly articles to fall below 99%, so they are not included in the query counting. Therefore, query counting was later repeated after removing all references and values. For URIs, removing refs and vals does not give a significantly different count of scholarly article related queries. But some differences are seen when literals are matched to count queries.
- The number of queries with literal usage >=99% in the scholarly article subgraph, excluding any occurrences in references or values, is 2.2M for July, which is 1.2% of monthly queries. It is 1.6M for August, which is 0.84% of monthly queries.
- The total number of queries related to scholarly articles therefore becomes 4.7M, forming 2.5% of monthly queries. These values are marked as max in the summary table.
Queries with labels and descriptions mostly relevant to scholarly articles
While literals already cover labels and descriptions, a separate analysis was done on them. The process of finding scholarly article related queries remains the same, except this time we only look for labels and descriptions of scholarly articles in the queries.
- The number of queries asking for labels or descriptions of scholarly articles was ~300K, which is 0.16% of monthly queries.
This section analyses the queries that were extracted following the above methods as being related to scholarly articles. A total of ~3.7M queries (out of 190M) were identified as such.
- Number of distinct user agents: 3,138
|User Agent||Count||Percentage of scholarly article queries|
|UA # 1||154249||4.2|
|Toolforge - mix-n-match||110080||2.99|
|Toolforge - legacy code||109578||2.97|
|UA # 2||105035||2.9|
|UA # 3||62926||1.7|
|PyPoli University Matching||42906||1.2|
|Toolforge - wikidata-terminator||32116||0.87|
|UA # 4||22480||0.6|
N.B: Unknown or non-bot user agents were marked UA # x
|query time class||count|
The following table shows the top subject, predicates, and objects used in queries that were identified as being related to scholarly articles. Top items are the top wikidata items or properties used anywhere within a query. These can occur as part of triples (subject/predicate/object) or outside (within VALUES).
The following table shows the top paths used in queries related to scholarly articles. Ordinary properties are not considered as ps. The list contains not only the paths, but also their breakdown into component paths (as done by Jena ARQ while parsing SPARQL queries). For instance:
(p:P31/ps:P31)/(wdt:P279)* is recorded as: