Difference between revisions of "User:Joal/JanusGraph"

From Wikitech
imported>Joal
(Page creation. Headline only)
 
imported>Joal
(Add links section)
Line 1: Line 1:
This page documents my work-log in playing with JanusGraph.
+
This page documents my work-log in playing with [https://janusgraph.org/ JanusGraph].
 +
 
 +
== Links ==
 +
 
 +
=== WDQS ===
 +
 
 +
* [[Wikidata query service]]
 +
* [[Wikidata query service/ScalingStrategy]]
 +
* [[Talk:Wikidata query service/ScalingStrategy]]
 +
 
 +
=== Wikidata ===
 +
 
 +
* https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
 +
 
 +
=== Janus/Gremlin/Tinkerpop ===
 +
 
 +
* https://docs.janusgraph.org/
 +
* https://tinkerpop.apache.org/docs/3.4.1/reference/#_tinkerpop_documentation
 +
* https://github.com/LITMUS-Benchmark-Suite/sparql-to-gremlin
 +
 
 +
== 2019-09-06 - Install and tests on Cloud VPS ==
 +
I already installed JanusGraph on Cloud VPS once, but that was almost a year ago at All-Hands. Starting fresh :)
 +
 
 +
I'm using OpenJDK 8 (JanusGraph needs Java 1.8) and [https://github.com/JanusGraph/janusgraph/releases/tag/v0.4.0 JanusGraph 0.4.0] (latest as of 2019-09-06).
 +
 
 +
=== Install and test ===
 +
 
 +
* I created the <code>janus1-1</code> large instance using <code>Debian 9.9 Stretch</code> (java 8 needed) in the cloud-VPS analytics project with [https://horizon.wikimedia.org Horizon]
 +
* I followed the introduction section of https://docs.janusgraph.org/, changing ElasticSearch index-backend to Lucene (single node test).
 +
 
 +
==== Install ====
 +
<syntaxhighlight lang="bash">
 +
ssh janus1-1.analytics.eqiad.wmflabs
 +
 
 +
sudo apt-get install unzip openjdk-8-jre
 +
 
 +
wget https://github.com/JanusGraph/janusgraph/releases/download/v0.4.0/janusgraph-0.4.0-hadoop2.zip
 +
 
 +
unzip janusgraph-0.4.0-hadoop2.zip
 +
 
 +
cd janusgraph-0.4.0-hadoop2
 +
 
 +
./bin/gremlin.sh
 +
 
 +
</syntaxhighlight>
 +
 
 +
==== Test ====
 +
<syntaxhighlight lang="groovy">
 +
/**********************************************
 +
  Configure and load graph
 +
**********************************************/
 +
 
 +
// Create graph with updated configuration (Lucene instead of ES)
 +
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje-lucene.properties')
 +
 
 +
// Load graph example
 +
GraphOfTheGodsFactory.load(graph)
 +
 
 +
// Create graph traversal object
 +
g = graph.traversal()
 +
 
 +
/**********************************************
 +
  Test graph traversal
 +
**********************************************/
 +
 
 +
// Create a pointer to the Saturn node using index on name
 +
saturn = g.V().has('name', 'saturn').next()
 +
 
 +
// Show the Saturn node pointer values ([name:[saturn], age:[10000]])
 +
g.V(saturn).valueMap()
 +
 
 +
// Use the Saturn node pointer to find Saturn grand-child name (hercules)
 +
g.V(saturn).in('father').in('father').values('name')
 +
==>hercules
 +
 
 +
// Use geo index to find edges having a place property within 50km of Athens (2 results)
 +
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50)))
 +
 
 +
// Find nodes connected to the edges found by geo-index query and show their names (2 results)
 +
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50))).
 +
  as('source').inV().as('god2').
 +
  select('source').outV().as('god1').
 +
  select('god1', 'god2').by('name')
 +
</syntaxhighlight>
 +
 
 +
=== Load a subset of wikidata ===
 +
 
 +
==== Filtering a truthy N-Triples Wikidata dump with Spark ====
 +
<syntaxhighlight lang="scala">
 +
import org.apache.spark.sql.functions._
 +
 
 +
val dump_path = "/user/joal/wmf/data/raw/mediawiki/wikidata/truthy_ntdumps/20190904"
 +
val df = spark.read.format("csv").
 +
  option("mode", "FAILFAST").
 +
  option("delimiter", " ").
 +
  load(dump_path).
 +
  withColumnRenamed("_c0", "origin").
 +
  withColumnRenamed("_c1", "link").
 +
  withColumnRenamed("_c2", "dest").
 +
  drop("_c3").
 +
  cache()
 +
 
 +
df.count()
 +
// 4139056936 - Wow!!!
 +
df.where("origin is null or link is null or dest is null").count()
 +
// 0 - \o/ well-formed data
 +
 
 +
df.select("origin").distinct().count()
 +
// 124151595
 +
 
 +
df.select("dest").distinct().count()
 +
// 685067856
 +
 
 +
df.select("link").distinct().count()
 +
// 6516                                                             
 +
 
 +
df.groupBy("link").count.sort(desc("count")).limit(20).show(20, false)
 +
/*
 +
+-------------------------------------------------+----------+
 +
|link                                             |count     |
 +
+-------------------------------------------------+----------+
 +
|<http://schema.org/description>                  |2014877520|
 +
|<http://schema.org/name>                         |322876582 |
 +
|<http://www.w3.org/2004/02/skos/core#prefLabel>  |322876582 |
 +
|<http://www.w3.org/2000/01/rdf-schema#label>     |322876582 |
 +
|<http://www.wikidata.org/prop/direct/P2860>      |174268418 |
 +
|<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>|121788917 |
 +
|<http://www.wikidata.org/prop/direct/P2093>      |95052532  |
 +
|<http://www.w3.org/2004/02/skos/core#altLabel>   |67929447  |
 +
|<http://schema.org/dateModified>                 |62033634  |
 +
|<http://schema.org/about>                        |62033306  |
 +
|<http://schema.org/version>                      |62033306  |
 +
|<http://www.wikidata.org/prop/direct/P31>        |57574607  |
 +
|<http://www.wikidata.org/prop/direct/P1476>      |24229359  |
 +
|<http://www.wikidata.org/prop/direct/P577>       |23635113  |
 +
|<http://www.wikidata.org/prop/direct/P1433>      |22480553  |
 +
|<http://www.wikidata.org/prop/direct/P304>       |20782592  |
 +
|<http://www.wikidata.org/prop/direct/P478>       |20710316  |
 +
|<http://www.wikidata.org/prop/direct/P433>       |19305486  |
 +
|<http://www.wikidata.org/prop/direct/P698>       |18216932  |
 +
|<http://www.wikidata.org/prop/direct/P356>       |17315770  |
 +
+-------------------------------------------------+----------+
 +
*/
 +
 
 +
val fdf = df.where("""
 +
      link NOT IN (
 +
        '<http://schema.org/description>',
 +
        '<http://www.w3.org/2004/02/skos/core#prefLabel>',
 +
        '<http://www.w3.org/2000/01/rdf-schema#label>',
 +
        '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
 +
        '<http://www.w3.org/2004/02/skos/core#altLabel>',
 +
        '<http://schema.org/dateModified>',
 +
        '<http://schema.org/about>',
 +
        '<http://schema.org/version>'
 +
      ) AND (
 +
        link != '<http://schema.org/name>' OR dest LIKE '%@en'
 +
      )
 +
  """).cache()
 +
 
 +
fdf.count()
 +
// 825749515 -- Better - Need some naming effort
 +
 
 +
// Check http://www.wikidata.org/prop links
 +
df.where("link like '<http://www.wikidata.org/prop%'").select("link").distinct.count
 +
// 6486                                                             
 +
df.where("link like '<http://www.wikidata.org/prop/direct/%'").select("link").distinct.count
 +
// 6351                                                             
 +
df.where("link like '<http://www.wikidata.org/prop/direct-normalized/%'").select("link").distinct.count
 +
// 135 -- 6351 + 135 = 6486, so every prop link is direct or direct-normalized - GOOD :)
 +
 
 +
 
 +
fdf.where("link not like '<http://www.wikidata.org/prop%'").groupBy("link").count.sort(desc("count")).limit(100).show(100, false)
 +
/*
 +
+----------------------------------------------------+--------+
 +
|link                                                |count   |
 +
+----------------------------------------------------+--------+
 +
|<http://schema.org/name>                            |46018455|
 +
|<http://www.w3.org/2002/07/owl#sameAs>              |2464024 |
 +
|<http://wikiba.se/ontology#claim>                   |6595    |
 +
|<http://wikiba.se/ontology#statementProperty>       |6595    |
 +
|<http://wikiba.se/ontology#qualifier>               |6595    |
 +
|<http://wikiba.se/ontology#directClaim>             |6595    |
 +
|<http://wikiba.se/ontology#statementValue>          |6595    |
 +
|<http://www.w3.org/2002/07/owl#complementOf>        |6595    |
 +
|<http://wikiba.se/ontology#qualifierValue>          |6595    |
 +
|<http://wikiba.se/ontology#reference>               |6595    |
 +
|<http://www.w3.org/2002/07/owl#someValuesFrom>      |6595    |
 +
|<http://wikiba.se/ontology#propertyType>            |6595    |
 +
|<http://www.w3.org/2002/07/owl#onProperty>          |6595    |
 +
|<http://wikiba.se/ontology#referenceValue>          |6595    |
 +
|<http://wikiba.se/ontology#novalue>                 |6595    |
 +
|<http://wikiba.se/ontology#statementValueNormalized>|4758    |
 +
|<http://wikiba.se/ontology#referenceValueNormalized>|4758    |
 +
|<http://wikiba.se/ontology#directClaimNormalized>   |4758    |
 +
|<http://wikiba.se/ontology#qualifierValueNormalized>|4758    |
 +
|<http://www.w3.org/2002/07/owl#imports>             |328     |
 +
|<http://creativecommons.org/ns#license>             |328     |
 +
|<http://schema.org/softwareVersion>                 |328     |
 +
+----------------------------------------------------+--------+
 +
*/
 +
 
 +
 
 +
// Looking for origin renaming scheme
 +
fdf.where("origin not like '<http://www.wikidata.org/entity/%>' and origin not like '_:genid%' and origin not like '<http://www.wikidata.org/prop/novalue/P%'").select("origin").distinct.show(20, false)
 +
/*
 +
+--------------------------------+
 +
|origin                          |
 +
+--------------------------------+
 +
|<http://wikiba.se/ontology#Dump>|
 +
+--------------------------------+
 +
*/
 +
 
 +
 
 +
// Looking for dest renaming scheme
 +
fdf.where("dest like '<http://www.wikidata.org/entity/%'").count
 +
// 390170853
 +
 
 +
fdf.where("dest like '%^^<%'").count
 +
// 50029908
 +
 
 +
// --> <http://wikiba.se/ontology#propertyType> will be very helpful to interpret property values
 +
 
 +
 
 +
</syntaxhighlight>

Revision as of 17:22, 12 September 2019

This page documents my work log of experiments with JanusGraph.

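Links

WDQS

  • Wikidata query service
  • Wikidata query service/ScalingStrategy
  • Talk:Wikidata query service/ScalingStrategy

Wikidata

  • https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format

Janus/Gremlin/Tinkerpop

  • https://docs.janusgraph.org/
  • https://tinkerpop.apache.org/docs/3.4.1/reference/#_tinkerpop_documentation
  • https://github.com/LITMUS-Benchmark-Suite/sparql-to-gremlin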

2019-09-06 - Install and tests on Cloud VPS

I already installed JanusGraph on Cloud VPS once, but that was almost a year ago at All-Hands. Starting fresh :)

I'm using OpenJDK 8 (JanusGraph needs Java 1.8) and JanusGraph 0.4.0 (latest as of 2019-09-06).

Install and test

  • I created the janus1-1 large instance using Debian 9.9 Stretch (java 8 needed) in the cloud-VPS analytics project with Horizon
  • I followed the introduction section of https://docs.janusgraph.org/, changing ElasticSearch index-backend to Lucene (single node test).

Install

ssh janus1-1.analytics.eqiad.wmflabs

sudo apt-get install unzip openjdk-8-jre

wget https://github.com/JanusGraph/janusgraph/releases/download/v0.4.0/janusgraph-0.4.0-hadoop2.zip

unzip janusgraph-0.4.0-hadoop2.zip

cd janusgraph-0.4.0-hadoop2

./bin/gremlin.sh

Test

/**********************************************
  Configure and load graph
**********************************************/

// Create graph with updated configuration (Lucene instead of ES)
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje-lucene.properties')

// Load graph example
GraphOfTheGodsFactory.load(graph)

// Create graph traversal object
g = graph.traversal()

/**********************************************
  Test graph traversal
**********************************************/

// Create a pointer to the Saturn node using index on name
saturn = g.V().has('name', 'saturn').next()

// Show the Saturn node pointer values ([name:[saturn], age:[10000]])
g.V(saturn).valueMap()

// Use the Saturn node pointer to find Saturn grand-child name (hercules)
g.V(saturn).in('father').in('father').values('name')
==>hercules

// Use geo index to find edges having a place property within 50km of Athens (2 results)
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50)))

// Find nodes connected to the edges found by geo-index query and show their names (2 results)
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50))).
  as('source').inV().as('god2').
  select('source').outV().as('god1').
  select('god1', 'god2').by('name')

Load a subset of wikidata

Filtering a truthy N-Triples Wikidata dump with Spark

import org.apache.spark.sql.functions._

val dump_path = "/user/joal/wmf/data/raw/mediawiki/wikidata/truthy_ntdumps/20190904"
val df = spark.read.format("csv").
  option("mode", "FAILFAST").
  option("delimiter", " ").
  load(dump_path).
  withColumnRenamed("_c0", "origin").
  withColumnRenamed("_c1", "link").
  withColumnRenamed("_c2", "dest").
  drop("_c3").
  cache()
  
df.count()
// 4139056936 - Wow!!!
df.where("origin is null or link is null or dest is null").count()
// 0 - \o/ well-formed data

df.select("origin").distinct().count()
// 124151595

df.select("dest").distinct().count()
// 685067856

df.select("link").distinct().count()
// 6516                                                              

df.groupBy("link").count.sort(desc("count")).limit(20).show(20, false)
/*
+-------------------------------------------------+----------+                  
|link                                             |count     |
+-------------------------------------------------+----------+
|<http://schema.org/description>                  |2014877520|
|<http://schema.org/name>                         |322876582 |
|<http://www.w3.org/2004/02/skos/core#prefLabel>  |322876582 |
|<http://www.w3.org/2000/01/rdf-schema#label>     |322876582 |
|<http://www.wikidata.org/prop/direct/P2860>      |174268418 |
|<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>|121788917 |
|<http://www.wikidata.org/prop/direct/P2093>      |95052532  |
|<http://www.w3.org/2004/02/skos/core#altLabel>   |67929447  |
|<http://schema.org/dateModified>                 |62033634  |
|<http://schema.org/about>                        |62033306  |
|<http://schema.org/version>                      |62033306  |
|<http://www.wikidata.org/prop/direct/P31>        |57574607  |
|<http://www.wikidata.org/prop/direct/P1476>      |24229359  |
|<http://www.wikidata.org/prop/direct/P577>       |23635113  |
|<http://www.wikidata.org/prop/direct/P1433>      |22480553  |
|<http://www.wikidata.org/prop/direct/P304>       |20782592  |
|<http://www.wikidata.org/prop/direct/P478>       |20710316  |
|<http://www.wikidata.org/prop/direct/P433>       |19305486  |
|<http://www.wikidata.org/prop/direct/P698>       |18216932  |
|<http://www.wikidata.org/prop/direct/P356>       |17315770  |
+-------------------------------------------------+----------+
*/

val fdf = df.where("""
      link NOT IN (
        '<http://schema.org/description>',
        '<http://www.w3.org/2004/02/skos/core#prefLabel>',
        '<http://www.w3.org/2000/01/rdf-schema#label>',
        '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
        '<http://www.w3.org/2004/02/skos/core#altLabel>',
        '<http://schema.org/dateModified>',
        '<http://schema.org/about>',
        '<http://schema.org/version>'
      ) AND (
        link != '<http://schema.org/name>' OR dest LIKE '%@en'
      )
  """).cache()
  
fdf.count()
// 825749515 -- Better - Need some naming effort
  
// Check http://www.wikidata.org/prop links
df.where("link like '<http://www.wikidata.org/prop%'").select("link").distinct.count
// 6486                                                              
df.where("link like '<http://www.wikidata.org/prop/direct/%'").select("link").distinct.count
// 6351                                                              
df.where("link like '<http://www.wikidata.org/prop/direct-normalized/%'").select("link").distinct.count
// 135 -- 6351 + 135 = 6486, so every prop link is direct or direct-normalized - GOOD :)


fdf.where("link not like '<http://www.wikidata.org/prop%'").groupBy("link").count.sort(desc("count")).limit(100).show(100, false)
/*
+----------------------------------------------------+--------+                 
|link                                                |count   |
+----------------------------------------------------+--------+
|<http://schema.org/name>                            |46018455|
|<http://www.w3.org/2002/07/owl#sameAs>              |2464024 |
|<http://wikiba.se/ontology#claim>                   |6595    |
|<http://wikiba.se/ontology#statementProperty>       |6595    |
|<http://wikiba.se/ontology#qualifier>               |6595    |
|<http://wikiba.se/ontology#directClaim>             |6595    |
|<http://wikiba.se/ontology#statementValue>          |6595    |
|<http://www.w3.org/2002/07/owl#complementOf>        |6595    |
|<http://wikiba.se/ontology#qualifierValue>          |6595    |
|<http://wikiba.se/ontology#reference>               |6595    |
|<http://www.w3.org/2002/07/owl#someValuesFrom>      |6595    |
|<http://wikiba.se/ontology#propertyType>            |6595    |
|<http://www.w3.org/2002/07/owl#onProperty>          |6595    |
|<http://wikiba.se/ontology#referenceValue>          |6595    |
|<http://wikiba.se/ontology#novalue>                 |6595    |
|<http://wikiba.se/ontology#statementValueNormalized>|4758    |
|<http://wikiba.se/ontology#referenceValueNormalized>|4758    |
|<http://wikiba.se/ontology#directClaimNormalized>   |4758    |
|<http://wikiba.se/ontology#qualifierValueNormalized>|4758    |
|<http://www.w3.org/2002/07/owl#imports>             |328     |
|<http://creativecommons.org/ns#license>             |328     |
|<http://schema.org/softwareVersion>                 |328     |
+----------------------------------------------------+--------+
*/


// Looking for origin renaming scheme
fdf.where("origin not like '<http://www.wikidata.org/entity/%>' and origin not like '_:genid%' and origin not like '<http://www.wikidata.org/prop/novalue/P%'").select("origin").distinct.show(20, false)
/*
+--------------------------------+
|origin                          |
+--------------------------------+
|<http://wikiba.se/ontology#Dump>|
+--------------------------------+
*/


// Looking for dest renaming scheme
fdf.where("dest like '<http://www.wikidata.org/entity/%'").count
// 390170853

fdf.where("dest like '%^^<%'").count
// 50029908

// --> <http://wikiba.se/ontology#propertyType> will be very helpful to interpret property values
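
To make the filtering rule and the dest checks above easier to try without a cluster, here is a minimal plain-Scala sketch (no Spark). It is not the code that was run for this log: the names Triple, TripleFilter, parse, keep and classifyDest are hypothetical, the sample lines in main are made up for illustration, and parse is a simplified N-Triples split that keeps the quotes around literals, which may differ from what the Spark CSV reader produces.

```scala
// Hypothetical helper types and names, for illustration only.
final case class Triple(origin: String, link: String, dest: String)

object TripleFilter {
  // Links dropped entirely, copied from the NOT IN list of the `fdf` filter.
  val droppedLinks: Set[String] = Set(
    "<http://schema.org/description>",
    "<http://www.w3.org/2004/02/skos/core#prefLabel>",
    "<http://www.w3.org/2000/01/rdf-schema#label>",
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
    "<http://www.w3.org/2004/02/skos/core#altLabel>",
    "<http://schema.org/dateModified>",
    "<http://schema.org/about>",
    "<http://schema.org/version>"
  )

  val nameLink = "<http://schema.org/name>"

  // Simplified split (assumption): origin and link contain no spaces, and
  // dest is everything up to the trailing " ." terminator.
  def parse(line: String): Option[Triple] = {
    val trimmed = line.trim
    if (!trimmed.endsWith(" .")) None
    else trimmed.dropRight(2).split(" ", 3) match {
      case Array(o, l, d) => Some(Triple(o, l, d.trim))
      case _              => None
    }
  }

  // Same logic as the SQL: link NOT IN (...) AND (link != name OR dest LIKE '%@en').
  def keep(t: Triple): Boolean =
    !droppedLinks.contains(t.link) &&
      (t.link != nameLink || t.dest.endsWith("@en"))

  // Mirrors the two LIKE patterns used when looking for a dest renaming scheme.
  def classifyDest(dest: String): String =
    if (dest.startsWith("<http://www.wikidata.org/entity/")) "entity"
    else if (dest.contains("^^<")) "typed-literal"
    else "other"

  def main(args: Array[String]): Unit = {
    // Made-up sample lines in truthy N-Triples shape.
    val sample = Seq(
      "<http://www.wikidata.org/entity/Q42> <http://schema.org/name> \"Douglas Adams\"@en .",
      "<http://www.wikidata.org/entity/Q42> <http://schema.org/name> \"Douglas Adams\"@fr .",
      "<http://www.wikidata.org/entity/Q42> <http://schema.org/description> \"English writer\"@en .",
      "<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> ."
    )
    sample.flatMap(parse).filter(keep).foreach { t =>
      println(s"${classifyDest(t.dest)}\t${t.origin} ${t.link} ${t.dest}")
    }
  }
}
```

With these sample lines only the English name and the P31 statement pass the filter; the French name and the description are dropped, as in the Spark filter.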