wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

REDO End to end evaluation pipeline tasks #592

Closed lizgzil closed 3 years ago

lizgzil commented 3 years ago

Something horrible happened in the previous PR, so this is a redo of https://github.com/wellcometrust/reach/pull/516.

There is now scraper data for an organisation called 'evaluation'. So this would be the starting point for the evaluation flow, which would look like:

ORG = evaluation
make run-parser
make run-extracter
make run-indexer-fulltexts
make run-indexer-epmc
make run-fuzzymatcher
make run-evaluator

Not sure how the Argo flow would actually be coded though! But this is how it would work running locally.
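As a rough illustration only, the local run above could be chained with a small wrapper like the one below. This is a sketch, not the Argo flow: the target names come from the list above, but passing ORG as a make command-line variable and the use of subprocess are assumptions.

import subprocess

# Illustrative only: run the make targets above in order with ORG=evaluation,
# stopping if any step fails. Passing ORG on the make command line is an assumption.
MAKE_TARGETS = [
    "run-parser",
    "run-extracter",
    "run-indexer-fulltexts",
    "run-indexer-epmc",
    "run-fuzzymatcher",
    "run-evaluator",
]

def run_evaluation_flow(org="evaluation"):
    for target in MAKE_TARGETS:
        # check=True raises CalledProcessError if a step exits non-zero
        subprocess.run(["make", target, f"ORG={org}"], check=True)

if __name__ == "__main__":
    run_evaluation_flow()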

Questions

Description

Incorporating the end-to-end evaluation (https://github.com/wellcometrust/reach/issues/315). The old method was written for Airflow, so it'll be adapted for the new architecture.

This involves a new task, which is only run when org=evaluation:

  1. evaluator_task.py - uses the fuzzy match results and the gold standard data to evaluate the quality of Reach's results (see the sketch below)
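To give a feel for the kind of comparison the evaluator performs, here is a minimal sketch (not the actual evaluator_task.py code; the set-based matching and the way references are keyed are assumptions) of how the reference-level metrics reported below could be computed:

def reference_metrics(gold_refs, reach_refs):
    # Sketch only: compare the set of gold-standard references with the set of
    # references Reach matched, and report the counts/metrics used in the results below.
    gold, reach = set(gold_refs), set(reach_refs)
    found = len(gold & reach)        # gold references that Reach also matched
    missed = len(gold - reach)       # gold references Reach did not match
    spurious = len(reach - gold)     # Reach matches that are not in the gold set
    possible = found + missed        # size of the gold set
    actual = found + spurious        # number of matches Reach reported
    recall = found / possible if possible else 0.0
    precision = found / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "possible": possible, "actual": actual, "found": found,
        "missed": missed, "spurious": spurious,
        "recall": round(recall, 2), "precision": round(precision, 2),
        "f1": round(f1, 2),
    }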


How Has This Been Tested?

Running make run-evaluator when org=who_iris worked and created the output file: s3://datalabs-dev/evaluator/evaluator_2020-05-22.json.gz

Ideally I would test running make run-evaluator when org=evaluation, but I don't have the fuzzymatch file for the evaluation organisation yet (since I can't run the ES tasks locally).
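For anyone who wants to inspect that output file, something like the snippet below works. This is illustrative only: it assumes boto3, read access to the datalabs-dev bucket, and that the output structure matches the results shown further down.

import gzip
import json

import boto3

# Illustrative only: download and decode the gzipped JSON evaluator output mentioned above.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="datalabs-dev",
                    Key="evaluator/evaluator_2020-05-22.json.gz")
results = json.loads(gzip.decompress(obj["Body"].read()))
print(results["doc_metrics"])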

Results

These results are from combining all the fuzzymatch results together and running the evaluation task on them. This wouldn't be done in practice, but I couldn't run the fuzzymatch task on just the extracted references from the evaluation 'org' since I don't have the ES credentials.

Using s3://datalabs-data/reach_evaluation/data/sync/2019.10.8_gold_matched_references_snake.jsonl

{"doc_metrics": {"found": 7, "missed": 4, "accuracy": 0.64, "total": 11},

"ref_metrics": {"possible": 85, "actual": 107, "found": 37, "missed": 48,
"spurious": 70, "recall": 0.44, "precision": 0.35, "f1": 0.39},

"found_docs_ref_metrics": {"possible": 74, "actual": 107, "found": 37,
"missed": 37, "spurious": 70, "recall": 0.5, "precision": 0.35, "f1": 0.41},

"gold_refs": "s3://datalabs-data/reach_evaluation/data/sync/2019.10.8_gold_matched_references_snake.jsonl", 

"reach_refs": "s3://datalabs-dev/combinedmatches/combinedmatches.json.gz", "reach_params": null}

Using s3://datalabs-data/reach_evaluation/data/sync/2019.10.8-fuzzy-matched-gold-refs-manually-verified.jsonl

{"doc_metrics": {"found": 4, "missed": 2, "accuracy": 0.67, "total": 6},
 "ref_metrics": {"possible": 37, "actual": 46, "found": 28, "missed": 9,
                 "spurious": 18, "recall": 0.76, "precision": 0.61, "f1": 0.67},
 "found_docs_ref_metrics": {"possible": 33, "actual": 46, "found": 28,
                            "missed": 5, "spurious": 18, "recall": 0.85, "precision": 0.61, "f1": 0.71},
 "gold_refs": "s3://datalabs-data/reach_evaluation/data/sync/2019.10.8-fuzzy-matched-gold-refs-manually-verified.jsonl",
 "reach_refs": "s3://datalabs-dev/combinedmatches/combinedmatches.json.gz",
 "reach_params": null}

And a reminder about what these mean:

  1. doc_metrics - evaluates Reach's success in finding the documents in the gold set.
  2. ref_metrics - precision/recall/F1 for all references in the gold set.
  3. found_docs_ref_metrics - precision/recall/F1 for only those references from the gold set that are present in documents found by Reach.
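As a sanity check (illustrative, using the standard precision/recall/F1 definitions), the ref_metrics in the first result block can be recomputed from its raw counts:

# Counts taken from the first result block above.
found, missed, spurious = 37, 48, 70
possible = found + missed                             # 85 references in the gold set
actual = found + spurious                             # 107 references Reach matched
recall = found / possible                             # 37 / 85  ~ 0.44
precision = found / actual                            # 37 / 107 ~ 0.35
f1 = 2 * precision * recall / (precision + recall)    # ~ 0.39
print(round(recall, 2), round(precision, 2), round(f1, 2))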

Checklist: