sparkwdsub

Spark processing of wikidata subsets using Shape Expressions.

This repo contains an example script that processes Wikidata subsets using Shape Expressions.

The algorithm used to validate schemas in parallel has been ported to its own repository: pschema.

Building and running

The system requires sbt to be built. Once you have downloaded and installed it, you only need to run:

sbt assembly

which will generate a fat jar with all dependencies, called sparkwdsub.jar, in the folder target/scala-2.12/.
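
For a fresh setup, the whole build can be done as follows (the repository URL is inferred from the project name and may differ):

    # Clone the repository and build the fat jar with sbt-assembly
    git clone https://github.com/weso/sparkwdsub.git
    cd sparkwdsub
    sbt assembly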

Using a local cluster

In order to run the system locally, you need to download and install Apache Spark and have the spark-submit executable accessible in your PATH.

Once Spark is installed and sparkwdsub.jar has been generated, you can run the following example:

spark-submit --class "es.weso.wdsub.spark.Main" --master local[4] target/scala-2.12/sparkwdsub.jar -d examples/sample-dump1.json.gz  -m cluster -n testCities -s examples/cities.shex -k -o target/cities

which will generate a folder target/cities with information about the extracted dump.
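
For larger dumps you will likely want to give Spark more cores and memory. Below is a hedged sketch using standard spark-submit options; the dump path, subset name, memory setting, and output folder are illustrative assumptions, not values from this repo:

    # Use all local cores and more driver memory for a bigger dump (paths are placeholders)
    spark-submit --class "es.weso.wdsub.spark.Main" \
      --master "local[*]" \
      --driver-memory 8g \
      target/scala-2.12/sparkwdsub.jar \
      -d /path/to/wikidata-dump.json.gz \
      -m cluster -n mySubset \
      -s examples/cities.shex \
      -k -o target/mySubset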

Using AWS

Detailed instructions for running on AWS are pending.
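
As a rough, non-authoritative sketch of one possible approach (the bucket name, EMR release label, and instance types below are placeholder assumptions, not taken from this repo): upload the jar and inputs to S3, create an EMR cluster with Spark, and submit the job as a Spark step.

    # Upload the assembled jar and input files to S3 (bucket name is a placeholder)
    aws s3 cp target/scala-2.12/sparkwdsub.jar s3://my-bucket/jars/sparkwdsub.jar
    aws s3 cp examples/sample-dump1.json.gz s3://my-bucket/dumps/sample-dump1.json.gz
    aws s3 cp examples/cities.shex s3://my-bucket/schemas/cities.shex

    # Create an EMR cluster with Spark (release label and instance types are illustrative)
    aws emr create-cluster --name sparkwdsub --release-label emr-6.5.0 \
      --applications Name=Spark --instance-type m5.xlarge --instance-count 3 \
      --use-default-roles

    # Submit the job as a Spark step (replace <cluster-id> with the id returned above)
    aws emr add-steps --cluster-id <cluster-id> --steps \
      'Type=Spark,Name=sparkwdsub,ActionOnFailure=CONTINUE,Args=[--class,es.weso.wdsub.spark.Main,s3://my-bucket/jars/sparkwdsub.jar,-d,s3://my-bucket/dumps/sample-dump1.json.gz,-m,cluster,-n,testCities,-s,s3://my-bucket/schemas/cities.shex,-k,-o,s3://my-bucket/out/cities]'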

Using Google Cloud

Instructions pending...

Command line

Usage: sparkwdsub dump --schema <file> [--out <file>] [--site <string>] [--maxIterations <integer>] [--verbose] [--loggingLevel <string>] <dumpFile>
Process example dump file.
Options and flags:
    --help
        Display this help text.
    --schema <file>, -s <file>
        schema path
    --out <file>, -o <file>
        output path
    --site <string>
        Base url, default = http://www.wikidata.org/entity
    --maxIterations <integer>
        Max iterations for Pregel algorithm, default = 20
    --verbose
        Verbose mode
    --loggingLevel <string>
        Logging level (ERROR, WARN, INFO), default = ERROR

Example:

sparkwdsub dump -s examples/cities.shex examples/6lines.json
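
As a further illustration, the documented options can be combined, for example to write the extracted subset to a folder and raise the logging level (the output folder name and iteration limit below are arbitrary choices for illustration):

    sparkwdsub dump -s examples/cities.shex -o target/cities6lines --maxIterations 10 --loggingLevel INFO examples/6lines.json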