wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

REDO End to end evaluation pipeline tasks #592

Closed lizgzil closed 3 years ago

lizgzil commented 3 years ago

Something horrible happened in the previous PR, so this is a redo of https://github.com/wellcometrust/reach/pull/516.

There is now scraper data for an organisation called 'evaluation'. So this would be the starting point for the evaluation flow, which would look like:

ORG = evaluation
make run-parser
make run-extracter
make run-indexer-fulltexts
make run-indexer-epmc
make run-fuzzymatcher
make run-evaluator

Not sure how the Argo flow would actually be coded though! But this is how it would work running locally.
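As a rough illustration only, the local run above could be chained with a small wrapper like the one below. This is a sketch, not the Argo flow: the target names come from the list above, but passing ORG as a make command-line variable and the use of subprocess are assumptions.

import subprocess

# Illustrative only: run the make targets above in order with ORG=evaluation,
# stopping if any step fails. Passing ORG on the make command line is an assumption.
MAKE_TARGETS = [
    "run-parser",
    "run-extracter",
    "run-indexer-fulltexts",
    "run-indexer-epmc",
    "run-fuzzymatcher",
    "run-evaluator",
]

def run_evaluation_flow(org="evaluation"):
    for target in MAKE_TARGETS:
        # check=True raises CalledProcessError if a step exits non-zero
        subprocess.run(["make", target, f"ORG={org}"], check=True)

if __name__ == "__main__":
    run_evaluation_flow()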

Questions

Description

Incorporating the end-to-end evaluation (https://github.com/wellcometrust/reach/issues/315). The old method was written for Airflow, so it'll be adapted for the new architecture.

This involves a new task, which is only run when org=evaluation:

  1. evaluator_task.py - uses the fuzzy match results and the gold standard data to evaluate the quality of Reach's results (see the sketch below)
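To give a feel for the kind of comparison the evaluator performs, here is a minimal sketch (not the actual evaluator_task.py code; the set-based matching and the way references are keyed are assumptions) of how the reference-level metrics reported below could be computed:

def reference_metrics(gold_refs, reach_refs):
    # Sketch only: compare the set of gold-standard references with the set of
    # references Reach matched, and report the counts/metrics used in the results below.
    gold, reach = set(gold_refs), set(reach_refs)
    found = len(gold & reach)        # gold references that Reach also matched
    missed = len(gold - reach)       # gold references Reach did not match
    spurious = len(reach - gold)     # Reach matches that are not in the gold set
    possible = found + missed        # size of the gold set
    actual = found + spurious        # number of matches Reach reported
    recall = found / possible if possible else 0.0
    precision = found / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "possible": possible, "actual": actual, "found": found,
        "missed": missed, "spurious": spurious,
        "recall": round(recall, 2), "precision": round(precision, 2),
        "f1": round(f1, 2),
    }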


How Has This Been Tested?

Running make run-evaluator when org=who_iris worked and created the output file: s3://datalabs-dev/evaluator/evaluator_2020-05-22.json.gz

Ideally I would test running make run-evaluator when org=evaluation, but I don't have the fuzzymatch file for the evaluation organisation yet (since I can't run the ES tasks locally).
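For anyone who wants to inspect that output file, something like the snippet below works. This is illustrative only: it assumes boto3, read access to the datalabs-dev bucket, and that the output structure matches the results shown further down.

import gzip
import json

import boto3

# Illustrative only: download and decode the gzipped JSON evaluator output mentioned above.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="datalabs-dev",
                    Key="evaluator/evaluator_2020-05-22.json.gz")
results = json.loads(gzip.decompress(obj["Body"].read()))
print(results["doc_metrics"])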

Results

These results are from combining all the fuzzymatch results together and running the evaluation task on them. This wouldn't be done in practice, but I couldn't run the fuzzymatch task on just the extracted references from the evaluation 'org' since I don't have the ES credentials.

Using s3://datalabs-data/reach_evaluation/data/sync/2019.10.8_gold_matched_references_snake.jsonl

{"doc_metrics": {"found": 7, "missed": 4, "accuracy": 0.64, "total": 11},

"ref_metrics": {"possible": 85, "actual": 107, "found": 37, "missed": 48,
"spurious": 70, "recall": 0.44, "precision": 0.35, "f1": 0.39},

"found_docs_ref_metrics": {"possible": 74, "actual": 107, "found": 37,
"missed": 37, "spurious": 70, "recall": 0.5, "precision": 0.35, "f1": 0.41},

"gold_refs": "s3://datalabs-data/reach_evaluation/data/sync/2019.10.8_gold_matched_references_snake.jsonl", 

"reach_refs": "s3://datalabs-dev/combinedmatches/combinedmatches.json.gz", "reach_params": null}

Using s3://datalabs-data/reach_evaluation/data/sync/2019.10.8-fuzzy-matched-gold-refs-manually-verified.jsonl

{"doc_metrics": {"found": 4, "missed": 2, "accuracy": 0.67, "total": 6},
 "ref_metrics": {"possible": 37, "actual": 46, "found": 28, "missed": 9,
                 "spurious": 18, "recall": 0.76, "precision": 0.61, "f1": 0.67},
 "found_docs_ref_metrics": {"possible": 33, "actual": 46, "found": 28,
                            "missed": 5, "spurious": 18, "recall": 0.85, "precision": 0.61, "f1": 0.71},
 "gold_refs": "s3://datalabs-data/reach_evaluation/data/sync/2019.10.8-fuzzy-matched-gold-refs-manually-verified.jsonl",
 "reach_refs": "s3://datalabs-dev/combinedmatches/combinedmatches.json.gz",
 "reach_params": null}

And a reminder about what these mean:

  1. doc_metrics - evaluates Reach's success in finding the documents in the gold set.
  2. ref_metrics - precision/recall/F1 for all references in the gold set.
  3. found_docs_ref_metrics - precision/recall/F1 for only those references from the gold set that are present in documents found by Reach.
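As a sanity check (illustrative, using the standard precision/recall/F1 definitions), the ref_metrics in the first result block can be recomputed from its raw counts:

# Counts taken from the first result block above.
found, missed, spurious = 37, 48, 70
possible = found + missed                             # 85 references in the gold set
actual = found + spurious                             # 107 references Reach matched
recall = found / possible                             # 37 / 85  ~ 0.44
precision = found / actual                            # 37 / 107 ~ 0.35
f1 = 2 * precision * recall / (precision + recall)    # ~ 0.39
print(round(recall, 2), round(precision, 2), round(f1, 2))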

Checklist: