Something horrible happened in the previous PR, so this is a redo of https://github.com/wellcometrust/reach/pull/516.

Description

Incorporating the end-to-end evaluation - https://github.com/wellcometrust/reach/issues/315. The old method was written for Airflow, so it'll be adapted for the new architecture.

There is now scraper data for an organisation called 'evaluation', so this would be the starting point for the evaluation flow, which would look like:
ORG = evaluation
make run-parser
make run-extracter
make run-indexer-fulltexts
make run-indexer-epmc
make run-fuzzymatcher
make run-evaluator
I'm not sure how the Argo flow would actually be coded, but this is how it would work running locally.
Questions
These changes work and give reasonable output files when I run things locally, but I worry I haven't edited the other files needed for them to run in production. Can you comment on that?
This involves a new task which is only to be run when org=evaluation (a rough sketch of the idea is below):
evaluator_task.py - uses the fuzzy match results and the gold standard data to evaluate the quality of reach's results
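To make the intent concrete, here is a minimal sketch of the kind of comparison evaluator_task.py performs, assuming the fuzzy match results and gold standard are gzipped, line-delimited JSON. The field name reference_id and the function names are hypothetical, not the actual task's API:

```python
import gzip
import json


def read_jsonl_gz(path):
    """Read a gzipped, line-delimited JSON file into a list of dicts."""
    with gzip.open(path, "rt") as f:
        return [json.loads(line) for line in f if line.strip()]


def run_evaluator(fuzzy_match_path, gold_path, output_path):
    """Compare the fuzzy match output against the gold standard and
    write a small metrics summary as gzipped JSON (field names are
    illustrative)."""
    matches = read_jsonl_gz(fuzzy_match_path)
    gold = read_jsonl_gz(gold_path)

    # Assumes each record carries a stable reference identifier to compare on.
    matched_ids = {m["reference_id"] for m in matches}
    gold_ids = {g["reference_id"] for g in gold}

    metrics = {
        "n_matched": len(matched_ids),
        "n_gold": len(gold_ids),
        "n_matched_in_gold": len(matched_ids & gold_ids),
    }

    with gzip.open(output_path, "wt") as f:
        json.dump(metrics, f)
```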
Type of change
Please delete options that are not relevant.
[ ] :bug: Bug fix (Add Fix #(issue) to your PR)
[x] :sparkles: New feature
[ ] :fire: Breaking change
[ ] :memo: Documentation update
How Has This Been Tested?
Running make run-evaluator when org=who_iris worked and created the output file:
s3://datalabs-dev/evaluator/evaluator_2020-05-22.json.gz
Ideally I would test running make run-evaluator when org=evaluation, but I don't have the fuzzymatch file for the evaluation organisation yet (since I can't run the ES tasks locally).
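For anyone who wants to inspect that output, a minimal sketch (assuming the file is a single gzipped JSON document downloaded locally; if it turns out to be line-delimited, each line would need its own json.loads):

```python
import gzip
import json

# Assumes the file has already been copied down from S3, e.g. with the AWS CLI.
with gzip.open("evaluator_2020-05-22.json.gz", "rt") as f:
    results = json.load(f)

print(json.dumps(results, indent=2))
```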
Results
These were from when I combined all the fuzzymatch results together and ran the evaluation task on them. This method wouldn't be done in practice, but I couldn't run the fuzzymatch task just on the extracted references from the evaluation 'org' since I don't have the ES credentials.
Using s3://datalabs-data/reach_evaluation/data/sync/2019.10.8_gold_matched_references_snake.jsonl
Using s3://datalabs-data/reach_evaluation/data/sync/2019.10.8-fuzzy-matched-gold-refs-manually-verified.jsonl
and a reminder about what these mean (a rough sketch of how they could be computed follows the list):
1) doc_metrics - evaluates the success of Reach in finding docs in the gold set.
2) ref_metrics - Precision/Recall/F1 for all references in the gold set.
3) found_docs_ref_metrics - Precision/Recall/F1 for only those references from the gold set which are present in documents that were found by Reach.
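A rough sketch of how these three metric groups could be computed, assuming each gold reference can be keyed to its parent document. All identifiers and variable names below are illustrative rather than taken from the actual evaluator output:

```python
def precision_recall_f1(predicted, gold):
    """Generic precision/recall/F1 over two sets of identifiers."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy identifiers, purely illustrative.
gold_doc_ids = {"doc1", "doc2", "doc3"}
found_doc_ids = {"doc1", "doc3", "doc4"}
gold_ref_to_doc = {"ref1": "doc1", "ref2": "doc2", "ref3": "doc3"}
gold_ref_ids = set(gold_ref_to_doc)
matched_ref_ids = {"ref1", "ref3"}

# 1) doc_metrics: how well Reach found the gold documents.
doc_metrics = precision_recall_f1(found_doc_ids, gold_doc_ids)

# 2) ref_metrics: reference matching over the whole gold set.
ref_metrics = precision_recall_f1(matched_ref_ids, gold_ref_ids)

# 3) found_docs_ref_metrics: reference matching restricted to gold
#    references whose parent document Reach actually found.
refs_in_found_docs = {r for r, d in gold_ref_to_doc.items() if d in found_doc_ids}
found_docs_ref_metrics = precision_recall_f1(
    matched_ref_ids & refs_in_found_docs, refs_in_found_docs)
```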
Checklist:
[ ] My code follows the style guidelines of this project (pep8 AND pyflakes)
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] If needed, I changed related parts of the documentation
[ ] I included tests in my PR
[ ] New and existing unit tests pass locally with my changes
[ ] Any dependent changes have been merged and published in downstream modules
[ ] If my PR aims to fix an issue, I referenced it using #(issue)