wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

Fairness review on deep reference parser algorithm #360

Open aoifespenge opened 4 years ago

aoifespenge commented 4 years ago

Tasks:

The deep reference parser needs to undergo a fairness review. Before this can happen we need to answer the following questions:

How do we define fairness in this case?

How does our definition translate into testing for fairness in the algorithm?

nsorros commented 4 years ago

I think the analysis should be the same as we have done so far, just replicated using the new model. We should aim to do it end to end, but given the limited data we might find that we need to annotate more before we can. In that case we might decide to postpone the end-to-end part and simply replicate the analysis on the existing data.

ivyleavedtoadflax commented 4 years ago

Why would we need to label more data @nsorros? By the way, I said to @aoifespenge today that I envisage this being another Airflow task completed at the end of a DAG, just like the end-to-end evaluation for the more usual metrics. Is that what you had in mind?
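A minimal sketch of what such a task could look like, assuming Airflow 2.x; the DAG id, task id, and `evaluate_fairness` callable are illustrative, not the actual Reach pipeline code.

```python
# Hypothetical sketch: a fairness-evaluation task appended at the end of a DAG,
# mirroring how the end-to-end evaluation task runs. All names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def evaluate_fairness():
    """Placeholder: load the latest gold data, run the deep reference parser,
    and report metrics broken down by subgroup (e.g. discipline, language)."""
    ...


with DAG(
    dag_id="policy_pipeline",  # illustrative DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # ... upstream scraping, parsing, and evaluation tasks would go here ...

    fairness_review = PythonOperator(
        task_id="fairness_review",
        python_callable=evaluate_fairness,
    )
```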

nsorros commented 4 years ago

> Why would we need to label more data @nsorros? By the way, I said to @aoifespenge today that I envisage this being another Airflow task completed at the end of a DAG, just like the end-to-end evaluation for the more usual metrics. Is that what you had in mind?

Not a bad idea; we can definitely have it in Airflow as well, even though the analysis would only change when a new model is deployed.

We would need to label more data because how would you quantify whether the algorithm is biased towards sociology or non-English publications if none of your data contains either?
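A rough sketch of the kind of breakdown this implies, assuming a gold-data table with a subgroup column (e.g. discipline or publication language) and a flag for whether the reference was matched correctly end to end; the column names and the simple match-rate metric are assumptions, not the actual Reach evaluation code.

```python
# Illustrative only: compare a match-rate metric across subgroups to surface
# potential bias (e.g. sociology vs. other fields, non-English vs. English).
import pandas as pd

# Hypothetical gold data: one row per gold reference, with the subgroup it
# belongs to and whether the parser/matcher got it right end to end.
gold = pd.DataFrame(
    {
        "discipline": ["biomedicine", "biomedicine", "sociology", "sociology"],
        "language": ["en", "en", "en", "fr"],
        "matched_correctly": [True, True, False, False],
    }
)

# Match rate per subgroup; large gaps between groups would flag the model
# (or the pipeline as a whole) for closer inspection.
by_discipline = gold.groupby("discipline")["matched_correctly"].mean()
by_language = gold.groupby("language")["matched_correctly"].mean()

print(by_discipline)
print(by_language)
```

Without sociology or non-English rows in the gold data, the corresponding groups simply never appear in this breakdown, which is why more annotation may be needed first.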

ivyleavedtoadflax commented 4 years ago

Sorry, I guess my question was more: which data should we label more of?

I think it would make sense to run the ethical assessment after each update to Reach, not just after each update to the model, because an improvement to the scraper, or adding a new provider, could equally have an impact on the fairness of the whole pipeline.

nsorros commented 4 years ago

Good point, and why not. The data that might need more annotating is the gold data: more titles that are matched to PubMed IDs and have the necessary metadata.
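For illustration, a rough sketch of looking up a reference title in PubMed via the NCBI E-utilities `esearch` endpoint to recover a PMID; this is not the matcher Reach actually uses, and the query form and helper name are assumptions.

```python
# Hypothetical sketch: find the PMID for a parsed reference title, which could
# then be used to pull the metadata needed for the fairness analysis.
from typing import Optional

import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def find_pmid(title: str) -> Optional[str]:
    """Return the first PMID whose title matches `title`, or None."""
    params = {
        "db": "pubmed",
        "term": f"{title}[Title]",  # restrict the search to the title field
        "retmode": "json",
        "retmax": 1,
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    id_list = response.json()["esearchresult"]["idlist"]
    return id_list[0] if id_list else None


print(find_pmid("The structure of scientific revolutions"))
```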

ivyleavedtoadflax commented 4 years ago

Agree. More data is good data.

ivyleavedtoadflax commented 4 years ago

This is currently blocked by https://github.com/wellcometrust/reach/issues/48