YAGO is a large semantic knowledge base, derived from Wikipedia, WordNet, WikiData, GeoNames, and other data sources. Currently, YAGO knows more than 17 million entities (like persons, organizations, cities, etc.) and contains more than 150 million facts about these entities.
YAGO is special in several ways:
YAGO is jointly developed at the DBWeb group at Télécom ParisTech University, the Databases and Information Systems group at the Max Planck Institute for Informatics, and Ambiverse.
(*) Not every version of YAGO is manually evaluated. Most notably, the version generated by this code may not be the one that we evaluated! Check the versions on the YAGO download page.
If you are just interested in the data of YAGO, there is no need to use the present code repository. You can download data of YAGO from the YAGO homepage.
If you are interested in using the source code of YAGO, or in contributing to it, read on. The source code of YAGO is a Java project that extracts facts from Wikipedia and the other data sources, and stores these facts in files. These files make up the YAGO knowledge base.
If you run the code yourself, you can define (a) what Wikipedia languages to cover, and (b) which specific Wikipedia, Wikidata, and Wikimedia Commons snapshots should be used during the build.
The following Java projects belong to YAGO
To run YAGO, you need the following:
requests (you can use pip install requests to install this module)
YAGO is configured with a configuration file. Use this template to generate your own copy of that file. It should contain the following lines:
reuse = true|false
: Specifies whether a new run of YAGO should overwrite or re-use the facts that have already been generated in a previous run.
yagoFolder = ...
: Specifies the folder where the YAGO facts shall be stored.
languages = en, de, fr, nl, it, es, ro, pl, ar, fa
: Specifies the Wikipedia languages from which YAGO shall extract the facts. Use ISO 639-1 language codes.
extractors
: List of extractors to run. By default, just use the list from the template.
subgraphClasses
: Specify a single class (e.g. ...). See also subgraphEntities.
subgraphEntities
: Specify a single entity (e.g. ...). See also subgraphClasses.
YAGO needs the following data sources:
Wikipedia dumps (pages-articles).
Wikidata dump (wikidata-DATE-all-ttl).
If you want to download the latest versions of the data sources automatically, add the following line to your YAGO configuration file:
dumpsFolder = ...
: Points to a folder where the data sources live.
Then run the following code (works on Linux or Mac):
python scripts/dumps/downloadDumps.py -y <PATH_TO_CONFIGURATION_FILE>
This script creates a new configuration file, which you must use in all following steps.
Alternatively, you can download the required data sources manually. Then add the following lines to your configuration file:
wikipedias = ...
: A comma-separated list of the Wikipedia dumps, in the order of the languages specified with the languages parameter.
wikidata = ...
: Points to the Wikidata file.
commons_wiki = ...
: Points to the Wikimedia Commons file.
geonames = ...
: Points to the folder where GeoNames is stored.
Once the configuration file has been prepared and all required resources have been downloaded, a YAGO build can be started like this:
cd <PATH_TO_YAGO3>
export MAVEN_OPTS=-Xmx220G
mvn clean verify exec:java -Dexec.args=<PATH_TO_CONFIGURATION_FILE>
Make sure to use the new configuration file if you used the Python script to download the data sources. Allocating 220G of main memory to YAGO is a reasonable estimate that typically works fine, but the actual requirement depends heavily on the number of languages you run the build for. Increase this value if necessary.
Once processing has finished, all output can be found in the directory given by the yagoFolder parameter in your configuration file.
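Before starting a long build, it can help to sanity-check the configuration file. The following is a minimal sketch in Python, not part of YAGO itself. It assumes the file consists of plain "key = value" lines as shown above, and it assumes that lines starting with # are comments. The paths in the example are hypothetical. It checks that yagoFolder and languages are present and that a manually specified wikipedias list has one dump per language.

```python
def parse_config(text):
    """Parse 'key = value' lines into a dict.

    Assumption: blank lines, lines starting with '#', and lines
    without '=' carry no parameter and are skipped.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


def split_list(value):
    """Split a comma-separated value into a list of trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_config(config):
    """Return a list of problems found (empty if none)."""
    problems = []
    for key in ("yagoFolder", "languages"):
        if not config.get(key):
            problems.append(f"missing parameter: {key}")
    languages = split_list(config.get("languages", ""))
    if "wikipedias" in config:
        # The dumps must be listed in the order of the languages,
        # so at the very least the two lists need the same length.
        dumps = split_list(config["wikipedias"])
        if len(dumps) != len(languages):
            problems.append(
                f"{len(languages)} languages but {len(dumps)} Wikipedia dumps"
            )
    return problems


# Hypothetical example configuration; all paths are made up.
EXAMPLE = """\
reuse = false
yagoFolder = /data/yago/output
languages = en, de
wikipedias = /data/dumps/enwiki-pages-articles.xml, /data/dumps/dewiki-pages-articles.xml
"""

if __name__ == "__main__":
    config = parse_config(EXAMPLE)
    print(split_list(config["languages"]))
    print(check_config(config))
```

The check only compares list lengths. Whether each dump really belongs to the language at the same position still has to be verified by hand.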
The overall goal of the YAGO architecture is to enable cooperation of several contributors, facilitate debugging and maintenance, and allow users to download only particular pieces of YAGO ("YAGO a la carte"). In short: YAGO is modular, both in code and in data.
The current architecture pursues the goal of modularity at the expense of longer running times and inefficiency. The rationale is that we do not care if the extraction runs a few hours longer, if we can save a few hours of human work in return.
The YAGO data is split into "themes". Each theme corresponds to a file on disk. A theme contains facts (either in RDF or in TSV, see the section on data formats below). Themes can overlap, but should not. The class basics.Theme implements a theme.
Themes that are free of duplicates and ready for export are called "final themes". These live in the same folder as the other themes, but their names start with the prefix yago. The final themes make up the YAGO knowledge base.
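Since final themes are distinguished only by their name, they can be picked out of the output folder with a simple prefix check. The sketch below is an illustration in Python, not part of the YAGO code base. The file names in the demo are hypothetical, and the only rule taken from the text is that final themes start with "yago".

```python
import tempfile
from pathlib import Path

# Final themes start with "yago", as described above.
FINAL_PREFIX = "yago"


def split_themes(yago_folder):
    """Return (final_themes, other_themes) as sorted lists of file names."""
    final, other = [], []
    for path in sorted(Path(yago_folder).iterdir()):
        if not path.is_file():
            continue
        target = final if path.name.startswith(FINAL_PREFIX) else other
        target.append(path.name)
    return final, other


def demo():
    """Build a throwaway folder with two hypothetical theme files."""
    with tempfile.TemporaryDirectory() as folder:
        for name in ("yagoExampleFacts.tsv", "exampleIntermediate.tsv"):
            (Path(folder) / name).touch()
        return split_themes(folder)


if __name__ == "__main__":
    print(demo())
```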
An extractor is a unit of Java code that takes as input (1) one or more themes and/or (2) a raw data file, and that produces as output one or more themes.
Extractors implement extractors.Extractor. Common postprocessing steps (such as translating entities) implement the class FollowUpExtractor. The input and output themes of the extractors define a dependency graph. Extractors are scheduled in the right order and called by main.ParallelCaller.
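The scheduling idea can be illustrated with a topological sort: an extractor may only run after every extractor that produces one of its input themes. The following Python sketch (Python 3.9 or later, for graphlib) shows the principle only. It is not how main.ParallelCaller is implemented, and all extractor and theme names are hypothetical.

```python
from graphlib import TopologicalSorter


def schedule(extractors):
    """Order extractors so that producers run before consumers.

    `extractors` maps an extractor name to a pair
    (input_themes, output_themes). Themes that no extractor
    produces are treated as raw inputs that are already available.
    """
    # Map each theme to the extractor that produces it.
    producer = {}
    for name, (_inputs, outputs) in extractors.items():
        for theme in outputs:
            producer[theme] = name

    # Each extractor depends on the producers of its input themes.
    graph = {}
    for name, (inputs, _outputs) in extractors.items():
        graph[name] = {producer[t] for t in inputs if t in producer}

    # static_order() yields every node after all of its predecessors.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical extractors and themes, for illustration only.
EXAMPLE = {
    "ExampleTranslator": (["exampleTypes"], ["exampleTranslatedTypes"]),
    "ExampleTypeExtractor": (["exampleCategories"], ["exampleTypes"]),
    "ExampleCategoryExtractor": ([], ["exampleCategories"]),
}

if __name__ == "__main__":
    for name in schedule(EXAMPLE):
        print(name)
```

A cyclic dependency would make TopologicalSorter raise graphlib.CycleError, which is a convenient way to detect a misconfigured set of extractors.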
Facts can have a meta-fact extractionSource. This meta-fact can have a meta-fact extractionTechnique. There should be a finite set of techniques that does not grow with the data.
Facts that do not have such an annotation are assumed to be trivially clean.
Extractors are split into the following packages:
In YAGO (as in RDF), each fact consists of a subject, a predicate, and an object. Every fact can have a fact id. This allows facts to talk about other facts. The fact id is simply computed as a hash from the subject, predicate, and object of the fact. An example fact is
<id_abcd> <Elvis_Presley> <marriedTo> <Priscilla_Presley>
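To illustrate how a hash-based fact id lets one fact talk about another, here is a small Python sketch. The concrete hash function used by YAGO is not stated here, so the choice of a truncated SHA-1 digest is purely an assumption for illustration, as are the helper names and the source URL.

```python
import hashlib


def fact_id(subject, predicate, obj):
    """Derive a deterministic id from subject, predicate, and object.

    Assumption: a truncated SHA-1 digest stands in for whatever
    hash YAGO actually uses.
    """
    key = "\t".join((subject, predicate, obj)).encode("utf-8")
    return "<id_" + hashlib.sha1(key).hexdigest()[:10] + ">"


def fact_with_id(subject, predicate, obj):
    """Return the fact as a tuple (id, subject, predicate, object)."""
    return (fact_id(subject, predicate, obj), subject, predicate, obj)


if __name__ == "__main__":
    marriage = fact_with_id(
        "<Elvis_Presley>", "<marriedTo>", "<Priscilla_Presley>"
    )
    # A meta-fact uses the id of another fact as its subject.
    # The source URL is a made-up placeholder.
    source = fact_with_id(
        marriage[0], "<extractionSource>", "<http://example.org/source>"
    )
    print(" ".join(marriage))
    print(" ".join(source))
```

Because the id depends only on the three components, the same fact always receives the same id, no matter which extractor produced it.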
Entity names follow the RDF/Turtle convention. Turtle leaves some design choices open. We use the following conventions:
Entities are written in angle brackets, as in <Albert_Einstein>. This is because qnames may not contain certain characters.
URLs are written as <http://...>.
\uXXXX encodings are avoided wherever possible.
We use predefined RDF entities wherever possible, in particular
rdfs:domain, skos:prefLabel, rdfs:range, rdfs:label, rdfs:subClassOf,
rdfs:subPropertyOf, rdf:type, rdf:Resource, xsd:boolean, rdf:Class,
xsd:date, xsd:duration, rdf:Statement, xsd:integer, xsd:nonNegativeInteger,
xsd:decimal, rdf:Property, xsd:string, xsd:gYear, owl:Thing
We integrate XML types into the YAGO literal type hierarchy. We use YAGO literal types as literal types in RDF. The root of the taxonomy of individuals (formerly "entity") is owl:Thing.
Currently, all facts are in this format even inside the running program (i.e., inside Fact, FactCollection, etc.). To work with real (16-bit) Java strings, the implementation has to convert explicitly to a Java String. This is done by FactComponent.asJavaString().
All frequent YAGO and RDFS string constants are declared in basics.YAGO and basics.RDFS, respectively.
YAGO can store facts either in TSV or in RDF/Turtle. All modules can deal with both formats, but we typically use TSV, because it is faster.
The TSV (Tab-Separated Values) format of YAGO contains 5 columns:
The RDF/Turtle format follows the standard Turtle conventions. To say that a fact ABC has a fact id ID, we use a comment in the line before the fact:
#@ ID
ABC
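A reader for this convention only needs to remember the most recent "#@" comment and attach it to the next fact. The Python sketch below illustrates this. It is not the parser used by YAGO, and it makes simplifying assumptions: each fact sits on a single line, ordinary "#" comments are ignored, and Turtle directives such as @prefix are skipped. The second example fact is hypothetical.

```python
def read_facts(turtle_text):
    """Return a list of (fact_id, fact_line) pairs.

    fact_id is None for facts that have no preceding '#@' comment.
    Assumption: every fact is written on exactly one line.
    """
    facts = []
    pending_id = None
    for raw in turtle_text.splitlines():
        line = raw.strip()
        if line.startswith("#@"):
            # Fact id comment: applies to the next fact line.
            pending_id = line[2:].strip()
        elif not line or line.startswith("#") or line.startswith("@"):
            # Blank line, ordinary comment, or Turtle directive.
            continue
        else:
            facts.append((pending_id, line))
            pending_id = None
    return facts


EXAMPLE = """\
#@ <id_abcd>
<Elvis_Presley> <marriedTo> <Priscilla_Presley> .
<Elvis_Presley> rdf:type <exampleClass> .
"""

if __name__ == "__main__":
    for fid, fact in read_facts(EXAMPLE):
        print(fid, "|", fact)
```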
The source code of YAGO is licensed under the GNU General Public License, version 3 or later.
The files generated by YAGO are licensed under a Creative Commons Attribution License.
The YAGO development is led by (in alphabetical order):
Contributors include (in alphabetical order):
If you use the data of YAGO in your research, please cite:
@inproceedings{yago,
author = {Fabian M. Suchanek and Gjergji Kasneci and Gerhard Weikum},
title = {{Yago: A Core of Semantic Knowledge}},
booktitle = {16th International Conference on the World Wide Web},
pages = {697--706},
year = {2007}
}
If you use the code of YAGO in your research, please cite:
@inproceedings{YAGO2016,
author = {Thomas Rebele and
Fabian M. Suchanek and
Johannes Hoffart and
Joanna Biega and
Erdal Kuzey and
Gerhard Weikum},
title = {{YAGO:} {A} Multilingual Knowledge Base from Wikipedia, Wordnet, and
Geonames},
booktitle = {The Semantic Web - {ISWC} 2016 - 15th International Semantic Web Conference,
Kobe, Japan, October 17-21, 2016, Proceedings, Part {II}},
pages = {177--185},
year = {2016},
url = {https://doi.org/10.1007/978-3-319-46547-0_19},
doi = {10.1007/978-3-319-46547-0_19},
}