wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning

spider_years argument in the policy DAG #420

Closed lizgzil closed 4 years ago

lizgzil commented 4 years ago

(in create_dag_fuzzy_match of airflow/dags/policy.py)

I have a few issues with this parameter:

  1. Looking at where it is actually used in spider_operator.py, it seems to only affect the years scraped from WHO. If so, does it need a clearer name?
  2. Shouldn't we scrape from all years and not have this parameter anyway?
  3. If we are to keep it, can we log it somehow? For example, could the policy DAG run date be associated with a log of the parameters that went into the run?
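
As a rough illustration of point 3, here is a minimal sketch of recording the run date together with the parameters that went into a run, using only the standard library. The function name `log_run_params` and the logger name `policy_dag` are hypothetical and not taken from the reach codebase; how this would hook into the actual DAG is left open.

```python
import json
import logging
from datetime import datetime

# Hypothetical logger name; the real project may configure logging differently.
logger = logging.getLogger("policy_dag")


def log_run_params(run_date, **params):
    """Log the DAG run date with the parameters used for that run.

    Returns the JSON string that was logged so it can also be stored elsewhere.
    """
    message = json.dumps(
        {"run_date": run_date.isoformat(), "params": params},
        sort_keys=True,
        default=str,  # fall back to str() for values JSON cannot encode
    )
    logger.info("policy DAG run parameters: %s", message)
    return message


# Example: record the spider_years value used for a given run.
logged = log_run_params(datetime(2020, 1, 1), spider_years=(2018, 2019))
```

Serialising to JSON keeps the record easy to search later; a tuple such as `spider_years` is written out as a JSON list.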

Please let me know if I've misunderstood something about this.

jdu commented 4 years ago

I think there's a naming convention issue there, as some of the spiders were copied from each other. The value IS applicable to all spiders; it was just named after WHO when it was first implemented and never updated to a more generic name.

Its intent is only for the test policy DAG, to limit the test to a specific number of documents so that we can run the DAG in minimal time and validate that it all runs correctly. For the full policy DAG there shouldn't be any limits set (i.e. it should be None, None).
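
To make the (None, None) convention concrete, here is a minimal sketch of how an optional pair of bounds can behave. It assumes the pair means (first year, last year), inclusive; the helper `year_in_range` and the example constants are hypothetical and are not the code in spider_operator.py.

```python
from typing import Optional, Tuple

# Assumption: spider_years is a (first_year, last_year) pair, either of which may be None.
YearRange = Tuple[Optional[int], Optional[int]]


def year_in_range(year: int, spider_years: YearRange = (None, None)) -> bool:
    """Return True if `year` passes the optional lower and upper bounds."""
    start, end = spider_years
    if start is not None and year < start:
        return False
    if end is not None and year > end:
        return False
    return True


# Hypothetical values: a narrow window for the test DAG, no limits for the full DAG.
TEST_SPIDER_YEARS = (2018, 2018)
FULL_SPIDER_YEARS = (None, None)
```

With `FULL_SPIDER_YEARS`, every year passes the check, which matches the idea that the full policy DAG should not be limited.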

lizgzil commented 4 years ago

After chatting with @jdu, it seems that WHO_IRIS_YEARS isn't used just for scraping data from a specific date range, but instead for filtering the query params passed to the WHO site.

I think lines 33-37 here are the ones where it is used. @SamDepardieu could you confirm this?

In this case I think this issue will just be to change the docstring for create_dag_all_match in airflow/dags/policy.py, which is currently ("spider_years: years to scrape").