wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

End to end evaluation pipeline tasks #516

Closed lizgzil closed 4 years ago

lizgzil commented 4 years ago

There is now scraper data for an organisation called 'evaluation'. So this would be the starting point for the evaluation flow, which would look like:

ORG = evaluation
make run-parser
make run-extracter
make run-indexer-fulltexts
make run-indexer-epmc
make run-fuzzymatcher
make run-evaluator

I'm not sure how the Argo flow would actually be coded, but this is how it would work running locally.
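A minimal sketch of how that chain could be driven locally (target names are from the list above; passing ORG through the environment is an assumption about how the Makefile reads it, and this says nothing about the Argo wiring):

```python
import os
import subprocess

# Hypothetical local driver for the flow above: run each make target in
# order with ORG=evaluation, stopping the chain if any task fails.
env = {**os.environ, "ORG": "evaluation"}
targets = [
    "run-parser",
    "run-extracter",
    "run-indexer-fulltexts",
    "run-indexer-epmc",
    "run-fuzzymatcher",
    "run-evaluator",
]
for target in targets:
    subprocess.run(["make", target], env=env, check=True)
```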


Description

This incorporates the end-to-end evaluation (https://github.com/wellcometrust/reach/issues/315). The old method was written for Airflow, so it will be adapted for the new architecture.

This involves a new task, which is only to be run when org=evaluation:

  1. evaluator_task.py - uses the fuzzy match results and the gold standard data to evaluate the quality of Reach's results
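Roughly, the shape of what evaluator_task.py does is sketched below. This is only illustrative: the field name (gold_id) and the file layouts are made up for the example, not the real Reach schema.

```python
import gzip
import json

def evaluate(fuzzymatch_path, gold_path):
    # Gold references that Reach managed to match, keyed on a hypothetical gold_id.
    with gzip.open(fuzzymatch_path, "rt") as f:
        matched_ids = {json.loads(line)["gold_id"] for line in f}

    # Walk the gold-standard JSONL and count which references were found.
    found, missed = 0, 0
    with open(gold_path) as f:
        for line in f:
            gold_ref = json.loads(line)
            if gold_ref["gold_id"] in matched_ids:
                found += 1
            else:
                missed += 1

    total = found + missed
    return {
        "found": found,
        "missed": missed,
        "accuracy": round(found / total, 2) if total else None,
        "total": total,
    }
```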


How Has This Been Tested?

Running make run-evaluator when org=who_iris worked and created the output file: s3://datalabs-dev/evaluator/evaluator_2020-05-22.json.gz

Ideally I would test running make run-evaluator when org=evaluation, but I don't have the fuzzymatch file for the evaluation organisation yet (since I can't run the ES tasks locally).
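For checking the output, something like the following works (the bucket and key are split from the path above; using boto3 this way is just one option, and the field names are taken from the results below):

```python
import gzip
import json

import boto3

# Fetch and decode the evaluator output written to S3 by the test run above.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="datalabs-dev",
                    Key="evaluator/evaluator_2020-05-22.json.gz")
results = json.loads(gzip.decompress(obj["Body"].read()))
print(results["doc_metrics"], results["ref_metrics"])
```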

Results

These were from when I combined all the fuzzymatch results together and ran the evaluation task on them. This wouldn't be done in practice, but I couldn't run the fuzzymatch task on just the extracted references from the evaluation 'org' since I don't have the ES credentials.

Using s3://datalabs-data/reach_evaluation/data/sync/2019.10.8_gold_matched_references_snake.jsonl

{"doc_metrics": {"found": 7, "missed": 4, "accuracy": 0.64, "total": 11},

"ref_metrics": {"possible": 85, "actual": 107, "found": 37, "missed": 48,
"spurious": 70, "recall": 0.44, "precision": 0.35, "f1": 0.39},

"found_docs_ref_metrics": {"possible": 74, "actual": 107, "found": 37,
"missed": 37, "spurious": 70, "recall": 0.5, "precision": 0.35, "f1": 0.41},

"gold_refs": "s3://datalabs-data/reach_evaluation/data/sync/2019.10.8_gold_matched_references_snake.jsonl", 

"reach_refs": "s3://datalabs-dev/combinedmatches/combinedmatches.json.gz", "reach_params": null}

Using s3://datalabs-data/reach_evaluation/data/sync/2019.10.8-fuzzy-matched-gold-refs-manually-verified.jsonl

{
  "doc_metrics": {"found": 4, "missed": 2, "accuracy": 0.67, "total": 6},
  "ref_metrics": {"possible": 37, "actual": 46, "found": 28, "missed": 9,
                  "spurious": 18, "recall": 0.76, "precision": 0.61, "f1": 0.67},
  "found_docs_ref_metrics": {"possible": 33, "actual": 46, "found": 28,
                             "missed": 5, "spurious": 18, "recall": 0.85,
                             "precision": 0.61, "f1": 0.71},
  "gold_refs": "s3://datalabs-data/reach_evaluation/data/sync/2019.10.8-fuzzy-matched-gold-refs-manually-verified.jsonl",
  "reach_refs": "s3://datalabs-dev/combinedmatches/combinedmatches.json.gz",
  "reach_params": null
}

As a reminder of what these mean:

1) doc_metrics - how well Reach did at finding the documents in the gold set.
2) ref_metrics - precision/recall/F1 for all references in the gold set.
3) found_docs_ref_metrics - precision/recall/F1 for only those references from the gold set which are present in documents that Reach found.
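For context, here is how the reference-level numbers fit together, checked against the first result above (these are the standard precision/recall/F1 definitions; the exact rounding done in evaluator_task.py is an assumption):

```python
# Worked check of the ref_metrics block from the first result above.
possible, actual, found = 85, 107, 37   # gold refs, Reach refs, correct matches

missed = possible - found               # 48: gold refs Reach never matched
spurious = actual - found               # 70: Reach matches not in the gold set
recall = found / possible               # 37 / 85  ~= 0.44
precision = found / actual              # 37 / 107 ~= 0.35
f1 = 2 * precision * recall / (precision + recall)  # ~= 0.39

print(round(recall, 2), round(precision, 2), round(f1, 2))
```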


jdu commented 4 years ago

@lizgzil Will try to look at this later in the week, as we want to get this into the pipelines soon.

lizgzil commented 4 years ago

This has been replaced by https://github.com/wellcometrust/reach/pull/592/files#. @jdu, this is OK to close now, right?

jdu commented 4 years ago

Closing this, as @lizgzil has provided a cleaner PR. We're not sure what went wrong with this branch, but it seems to have diverged from master in some way and needed a fresh PR in #592.