wri-dssg-omdena / policy-data-analyzer

Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.

Added data loader and model evaluator #24

Closed thefirebanks closed 3 years ago

thefirebanks commented 3 years ago

Main changes:

Bonus:

jordiplanescutxi commented 3 years ago

You have done magnificent work!! Some things that I need to clarify:

  1. If we had the example file "input/sample_model_output.json", it would be easier to execute the code in the notebook.
  2. If I understand it right, you assume that for each document we will have two files with labelled sentences: the sample_dataset and the sample_model_output. They will contain the same sentences in the same order, but with different labels. I'm afraid we may have some mess if we do not check that the sentences are actually the same and at the same positions.
  3. When we go through sample_dataset.json, as we do in the function "labeled_sentences_from_dataset", we assume that all sentences will fall into one of the categories 0 to 5, while most of them will fall into a -1 category, which is "no_incentive". We will talk about it.
thefirebanks commented 3 years ago

> You have done magnificent work!! Some things that I need to clarify:
>
>   1. If we had the example file "input/sample_model_output.json", it would be easier to execute the code in the notebook.
>   2. If I understand it right, you assume that for each document we will have two files with labelled sentences: the sample_dataset and the sample_model_output. They will contain the same sentences in the same order, but with different labels. I'm afraid we may have some mess if we do not check that the sentences are actually the same and at the same positions.
>   3. When we go through sample_dataset.json, as we do in the function "labeled_sentences_from_dataset", we assume that all sentences will fall into one of the categories 0 to 5, while most of them will fall into a -1 category, which is "no_incentive". We will talk about it.

Hi Jordi, thank you for the feedback! Here are my responses:

  1. You are absolutely right, I had completely forgotten that the input folders don't get versioned, so I uploaded the input folder to our google drive (left the link in Slack).
  2. Indeed! To solve this, maybe we can create a unique ID for each sentence and that way it is easier to check for equality. This can be added in the script/process that creates the json files in the first place, and I can add a check to confirm that they are in the same order in the data loader. I will add this once we confirm the mechanism to identify the unique sentence.
  3. Good point, I was actually thinking to make 0 be the "no incentive" label and then 1-6 be the distinct types of incentives. I will correct that now!
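The unique-ID mechanism from point 2 could look something like the sketch below: derive a deterministic ID from the document ID plus the sentence text, then check that the dataset and model-output records line up one-to-one. The function names, field names (`doc_id`, `sentence`), and record shape here are assumptions for illustration, not the repo's actual schema.

```python
import hashlib


def sentence_id(doc_id: str, sentence: str) -> str:
    """Deterministic short ID from the document ID and sentence text."""
    return hashlib.sha1(f"{doc_id}::{sentence}".encode("utf-8")).hexdigest()[:12]


def check_alignment(dataset: list, model_output: list) -> bool:
    """True if both record lists hold the same sentences in the same order.

    Each record is assumed to be a dict with "doc_id" and "sentence" keys
    (labels may differ between the two files and are ignored here).
    """
    dataset_ids = [sentence_id(r["doc_id"], r["sentence"]) for r in dataset]
    output_ids = [sentence_id(r["doc_id"], r["sentence"]) for r in model_output]
    return dataset_ids == output_ids
```

A check like this could run inside the data loader right after both JSON files are read, so a mismatch fails loudly instead of silently pairing wrong labels.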
thefirebanks commented 3 years ago

Updated the input files in the google drive folder, will update the data loader tomorrow before midnight EST!

thefirebanks commented 3 years ago

Tried loading sentences from ElSalvador.json and they loaded successfully!
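A minimal sketch of what that loading step might look like, assuming a JSON layout that maps each sentence to an integer label with 0 as "no incentive" (per the label change agreed above); the repo's real `labeled_sentences_from_dataset` may read a different structure:

```python
import json

# Assumed label scheme from the discussion above:
# 0 = no incentive, 1-6 = distinct incentive types.
NO_INCENTIVE = 0


def labeled_sentences_from_dataset(path: str) -> list:
    """Return (sentence, label) pairs from a dataset JSON file.

    Assumes the file is a JSON object mapping sentence text to an
    integer label; the actual file format in the repo may differ.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [(sentence, int(label)) for sentence, label in data.items()]
```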

[Screenshot (2020-12-10): sentences loaded from ElSalvador.json]