Code related to the Researchers and Publications track from the Hercules challenge.
In order to run the code from this repository, Python 3.7 or greater is required. Experiments were executed in Python 3.7.8, and that is the preferred version for the execution of the models.
Instructions to install Python 3.7.8 are available at the official website.
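Before creating the environment, it can be handy to confirm which interpreter is on the path. The following is a minimal sketch of such a check; the helper names `meets_minimum` and `is_preferred` are illustrative and not part of this repository.

```python
import sys

MINIMUM_VERSION = (3, 7)        # minimum version stated in this README
PREFERRED_VERSION = (3, 7, 8)   # version the experiments were executed with


def meets_minimum(version_info=sys.version_info):
    """Return True if the given version is Python 3.7 or greater."""
    return tuple(version_info[:2]) >= MINIMUM_VERSION


def is_preferred(version_info=sys.version_info):
    """Return True if the given version is exactly Python 3.7.8."""
    return tuple(version_info[:3]) == PREFERRED_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if not meets_minimum():
        print(f"Python 3.7 or greater is required, found {current}")
    elif not is_preferred():
        print(f"Python {current} found; the experiments were run with 3.7.8")
    else:
        print(f"Python {current} matches the preferred version")
```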
Once Python has been installed, it is preferable to create a virtual environment before installing the dependencies. To create a new Python environment, the following command can be used:
python -m venv edma_env
This environment can then be activated with the following command:
source edma_env/bin/activate
Finally, we can install the dependencies of the system with pip:
pip install -r requirements.txt
In the notebooks directory we provide a series of Jupyter notebooks that can be executed to explore how the systems were created, get more information about them, or fine-tune their hyperparameters. In this section we explain how to run those notebooks and provide some advice on how they should be executed.
If you have followed the steps from the previous section to install Python and the project dependencies, the jupyter package should already be installed. In order to run the Jupyter client, go to the notebooks directory and run the jupyter notebook command:
cd notebooks
jupyter notebook
This will open a new tab in your browser with the Jupyter explorer, where the different files can be browsed.
If the browser was not automatically opened, you can connect to the Jupyter client through localhost on port 8888 (localhost:8888).
Now you can click on any of the notebooks to explore its content or even rerun and modify the cells. Instructions on how to do this are provided in the official Notebook docs.
It is recommended to run the notebooks sequentially, in the order indicated by the number at the start of their filename (e.g. notebook _1_datafetching.ipynb should be run before _2_DataExploration.ipynb and so on). Non-sequential execution should be avoided, since a notebook may depend on outputs produced by the previous ones.
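The ordering rule above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and it assumes each notebook filename starts with its sequence number. Sorting numerically rather than alphabetically matters once there are ten or more notebooks.

```python
import re
from pathlib import Path


def notebook_order(name):
    """Return the leading number of a notebook filename (names without one sort last)."""
    match = re.match(r"\d+", name)
    return int(match.group()) if match else float("inf")


def sorted_notebooks(names):
    """Sort notebook filenames by their leading number, then by name."""
    return sorted(names, key=lambda name: (notebook_order(name), name))


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    found = [path.name for path in Path("notebooks").glob("*.ipynb")]
    for name in sorted_notebooks(found):
        print(name)
```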
Note: In order to run the systems you must first obtain the models used to perform the topic extraction. Due to size constraints, they are not included in this repository. There are two alternatives to obtain the models: the first is to execute every notebook to retrain and build the systems from scratch, which may take some time; the second is to go to the complete_system directory and follow the instructions to download the trained models.
Several scripts are provided in the scripts folder to execute the systems and reproduce the results obtained for this track. In the following sections we explain the main functionality of each script and how it can be executed.
The script run_track_predictions.py can be used to obtain, in a single run, the topics assigned to every publication in the dataset. The following parameters can be passed to the script:
| Name | Description | Compulsory | Allowed Values |
|---|---|---|---|
| -f --format | Output format of the results. If no output format is specified, results are returned in JSON by default. | No | One of csv, json, jsonld, n3, rdf/xml or ttl |
| -o --output | Name of the file where the results will be saved. If no output file is specified, results will be written to the console instead. | No | Any valid filename. |
For additional information about how to run the script, you can execute the following command:
python scripts/run_track_predictions.py -h
In the following example, we run the script twice. The first execution prints the results to the console in JSON format (the default values). The second one saves the results to the file results.ttl in Turtle format:
python scripts/run_track_predictions.py
python scripts/run_track_predictions.py -o results.ttl -f ttl
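If the script needs to be driven from other Python code, the same two invocations can be assembled programmatically. The sketch below is an assumption about how one might wrap the command line: `build_command` and `run_track_predictions` are illustrative helpers, not functions provided by this repository, and the list of formats simply mirrors the table above.

```python
import subprocess
import sys

SCRIPT = "scripts/run_track_predictions.py"
# Formats listed in the parameter table of this README.
ALLOWED_FORMATS = {"csv", "json", "jsonld", "n3", "rdf/xml", "ttl"}


def build_command(output=None, fmt=None):
    """Build the argument list for the script; omitted options use the script defaults."""
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        raise ValueError(f"Unsupported format: {fmt}")
    command = [sys.executable, SCRIPT]
    if output is not None:
        command += ["-o", output]
    if fmt is not None:
        command += ["-f", fmt]
    return command


def run_track_predictions(output=None, fmt=None):
    """Run the script from the repository root and return whatever it printed."""
    completed = subprocess.run(
        build_command(output, fmt), check=True, capture_output=True, text=True
    )
    return completed.stdout
```

Calling `run_track_predictions()` corresponds to the first command above, and `run_track_predictions("results.ttl", "ttl")` to the second.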
The script predict_article_topics.py can be used to obtain the topics for a given article or list of articles. The following parameters can be passed to the script:
| Name | Description | Compulsory | Allowed Values |
|---|---|---|---|
| input | ID of the PMC article to extract the topics from (e.g. PMC3310815). If the --isFile flag is set, path to a file with the ids of the publications. | Yes | Any PMC id or file. |
| --isFile | If present, this flag indicates that the input passed to the script is a file with the ids of each publication delimited by newlines. | No | True or False |
| -f --format | Output format of the results. If no output format is specified, results are returned in JSON by default. | No | One of csv, json, jsonld, n3 or ttl |
| -o --output | Name of the file where the results will be saved. If no output file is specified, results will be written to the console instead. | No | Any valid filename. |
For additional information about how to run the script, you can execute the following command:
python scripts/predict_article_topics.py -h
In the following example, we run the script twice. The first execution prints the topics of a single article to the console in JSON format (the default values). The second one uses the list of article ids from the data directory to predict the topics for those articles, which is equivalent to running the run_track_predictions.py script, and saves the results to the file results.ttl in Turtle format:
python scripts/predict_article_topics.py PMC3310815
python scripts/predict_article_topics.py data/agriculture/pmc_ids.txt --isFile -o results.ttl -f ttl
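When working with a custom selection of articles, the input file for --isFile has to contain one id per line. A minimal sketch for producing and re-reading such a file follows; the helper names and the output filename are hypothetical, and blank entries are dropped as a convenience rather than a documented requirement of the script.

```python
from pathlib import Path


def write_id_file(pmc_ids, path):
    """Write one PMC id per line and return the ids that were written."""
    cleaned = [pmc_id.strip() for pmc_id in pmc_ids if pmc_id.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def read_id_file(path):
    """Read a newline-delimited id file, skipping empty lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # "my_ids.txt" is an illustrative filename; PMC3310815 is the id used in this README.
    write_id_file(["PMC3310815"], "my_ids.txt")
    print(read_id_file("my_ids.txt"))
```

The resulting file can then be passed to the script together with the --isFile flag, as in the second command above.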
The script obtain_track_author_topics.py can be used to obtain the topics assigned to each author from the dataset. The script can be run with the following command:
python scripts/obtain_track_author_topics.py -h
An API has been deployed at http://edma-challenge.compute.weso.network/ where the different functionality of the system can be tested out without needing to manually run the scripts with Python.
For the publications track, we provide the api/publication/topics GET endpoint to predict the topics of a given publication. The following parameters can be sent in the JSON body:
| Name | Description | Compulsory | Allowed Values |
|---|---|---|---|
| input | ID of the PMC article to extract the topics from (e.g. PMC3310815) | Yes | Any PMC id. |
| format | Output format of the results. If no output format is specified, results are returned in JSON by default. | No | One of json, jsonld, n3 or ttl |
An example body passed to the API could be as follows:
{
"input": "PMC3310815",
"format": "json"
}
The response will be as follows:
{
"task_id": "YOUR_TASK_ID"
}
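The request described above could be issued from Python roughly as follows. This is a sketch under assumptions: it uses only the standard library, takes the base URL and endpoint path from this README, and sets a `Content-Type: application/json` header, which the README does not mention. The helper names are illustrative, and the behavior of the deployed server has not been verified here.

```python
import json
import urllib.request

BASE_URL = "http://edma-challenge.compute.weso.network/"
ENDPOINT = "api/publication/topics"


def build_topics_request(pmc_id, fmt="json"):
    """Build a GET request that carries the parameters in a JSON body."""
    body = json.dumps({"input": pmc_id, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="GET",
    )


def submit_topics_task(pmc_id, fmt="json", timeout=30):
    """Send the request and return the task identifier from the response."""
    request = build_topics_request(pmc_id, fmt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["task_id"]
```

Building the request separately from sending it makes it possible to inspect the URL and body without contacting the server.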
A task identifier will be returned. We can query the __api/prediction/
The results obtained for the track dataset can be found in the script_results folder. These results are provided in multiple formats (.csv, .json, .jsonld, and .ttl).