wenhuchen / HybridQA

Dataset and code for the EMNLP 2020 paper "HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data"
MIT License

Some questions about the dataset and the code #3

Closed · daiyongya closed 4 years ago

daiyongya commented 4 years ago

I appreciate your nice work, but I have some questions about the dataset and the code.

Questions about the dataset in your released data:

  1. 'answer-node': these answer nodes are generated by string-match-based heuristics, so they are not guaranteed to be correct. How are they determined? Are they extracted by 'preprocessing.py'? In the released data, these nodes are already present.
  2. 'question-postag': these tags are used to detect maximum/minimum questions; how are they defined? Are they annotated by the annotators?
  3. 'where': does this mean the final answer location? Is it annotated as a gold label, unlike 'answer-node', which is generated by string-match-based heuristics?

Questions about the codes:

  1. In the 'CELL' code of stage12, triggers = ['JJR', 'JJS', 'RBR', 'RBS']: do these indicate question types like minimum/maximum? How are they defined?
  2. In 'prepare_stage2_data(d)', regarding if d['type'] in ['medium', 'easy']: if evidence cells exist in the same row as the answer nodes, is the corresponding question categorized as 'easy' or 'medium'? Are the other examples discarded because they cannot be processed by this pipeline?
  3. Why do you normalize the confidence, e.g. (1,0,1,0,0) to (0.5,0,0.5,0,0)?
wenhuchen commented 4 years ago

1. Answer nodes: I'm separating the annotated answers and the answer-node tracing code into two separate files now; I will upload them soon.
2. The question-postag is detected by NLTK; max/min is decided using the JJR/JJS tags (see the sketch below).
3. 'where' is also based on heuristics, so it's not guaranteed. The only thing guaranteed is the answer text itself.
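
For concreteness, here is a minimal sketch of how POS-based max/min trigger detection can work with NLTK; the exact logic in stage12 may differ, and `has_minmax_trigger` is an illustrative name, not a function from the repo.

```python
# Minimal sketch of max/min trigger detection via NLTK POS tags.
# Illustrative only; the actual stage12 logic may differ.
import nltk

# Required once: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

TRIGGERS = {'JJR', 'JJS', 'RBR', 'RBS'}  # comparative/superlative adjectives and adverbs

def has_minmax_trigger(question: str) -> bool:
    """Return True if the question contains a comparative/superlative POS tag."""
    tokens = nltk.word_tokenize(question)
    return any(tag in TRIGGERS for _, tag in nltk.pos_tag(tokens))

# has_minmax_trigger("Which country has the largest population?") -> True ('largest' is JJS)
```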
wenhuchen commented 4 years ago

Hi, sorry for the late reply; I was previously super busy with the ICLR deadline. The code was recently changed dramatically for the EMNLP camera-ready version. The files in released_data are now split into train/dev/test.json, and the tracing is done by trace_answer.py in the root folder. There was also an issue with the table crawling, which is now fixed.
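
As a rough illustration of what string-match-based tracing means (trace_answer.py is the authoritative implementation; `trace_answer_nodes` below is only a hypothetical sketch):

```python
# Hypothetical sketch of string-match-based answer-node tracing.
# See trace_answer.py in the repo root for the real implementation.
def trace_answer_nodes(answer, table):
    """Return (row, col) coordinates of table cells whose text contains the answer string."""
    answer_norm = answer.strip().lower()
    hits = []
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            if answer_norm in cell.strip().lower():
                hits.append((i, j))
    return hits  # multiple hits are why traced nodes are not guaranteed correct
```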

For your question 2: yes, you are right. For your question 3: the normalization is a standard procedure; we didn't compare against the unnormalized version.
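
In other words, the (1,0,1,0,0) to (0.5,0,0.5,0,0) step is plain L1 normalization, turning binary labels into a distribution over candidates (a minimal, illustrative sketch):

```python
# L1-normalize a binary label vector into a target distribution.
def normalize(labels):
    total = sum(labels)
    return [x / total for x in labels] if total else labels

normalize([1, 0, 1, 0, 0])  # -> [0.5, 0.0, 0.5, 0.0, 0.0]
```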

daiyongya commented 4 years ago

@wenhuchen Thank you so much. Best wishes for your ICLR paper.