zbmed-semtec / hybrid-pre-doc2vec-doc-relevance

Hybrid approach combining dictionary-based NER and doc2vec
GNU General Public License v3.0

Testing the reproducibility of the hybrid approach #2

Open Soudeh-Jahanshahi opened 11 months ago

Soudeh-Jahanshahi commented 11 months ago

The code "xml_translate.py" has a bug for processing 32 annotated xml-files!

defect_list = [27817193, 28240519, 28244787, 28438127, 28670879, 28707850, 28749127, 28749635, 28843255, 29095577, 29099159, 29116736, 29132205, 29172291, 29206099, 29220461, 29235983, 29283531, 29373899, 29374411, 29388757, 29451968, 29481028, 29533587, 29616530, 29630142, 29644823, 29688353, 29688370, 29693981, 29716180, 29801411]

For the PMIDs in this list, the single TSV file is not generated correctly: the code splits their titles and abstracts across multiple lines.
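A minimal sketch for spotting the affected records, assuming the output file name ("output.tsv") and the "pmid" column name, which are not taken from the repository:

import pandas as pd

# Hypothetical file and column names, used only for illustration.
defect_list = [27817193, 28240519]  # ...extend with the 32 PMIDs listed above

# on_bad_lines="warn" keeps parsing when a spilled title/abstract fragment
# produces a row with the wrong number of fields.
df = pd.read_csv("output.tsv", sep="\t", quotechar="`", on_bad_lines="warn")

# Fragments that spill into the PMID column no longer parse as numbers.
pmid_numeric = pd.to_numeric(df["pmid"], errors="coerce")
print(df[pmid_numeric.isna()])

# Cross-check against the PMIDs reported above.
present = set(pmid_numeric.dropna().astype(int))
print([p for p in defect_list if p not in present])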

rohitharavinder commented 11 months ago

The formatting of the TSV seems to be disrupted by certain special characters present in the XML files listed above. To handle these characters within the TSV file, we used the "quotechar" parameter.

The relevant code is at line 335 of the xml_translate.py script, where the XML is converted to TSV format using the following call:

publications_df.to_csv(output_file, sep="\t", index=False, quotechar="`")

Certain special characters, such as Î, ±, ≥, and %, appear in the text. The '%' symbol is normally harmless, but it can cause a formatting issue when there is no whitespace between the number and the symbol, e.g., 20% vs. 20 %.
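As a standalone illustration of the quotechar behaviour (the example data below is made up, not taken from the repository): fields containing the tab separator, a newline, or the backtick itself are wrapped in backticks on write, and reading the file back with the same separator and quotechar restores the original records.

import io
import pandas as pd

df = pd.DataFrame({
    "pmid": [12345],
    "title": ["Effect of drug X (\u2265 20%) on outcome Y"],
    "abstract": ["First sentence.\nSecond sentence with a tab\tand a backtick `."],
})

buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False, quotechar="`")
print(buf.getvalue())  # fields with tabs or newlines are wrapped in backticks

# Reading back with the same separator and quotechar restores the records;
# a plain TSV reader would split the row at the embedded newline, which
# matches the symptom reported in this issue.
buf.seek(0)
restored = pd.read_csv(buf, sep="\t", quotechar="`")
print(restored.equals(df))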

Nothing to be fixed.

ljgarcia commented 7 months ago

@Soudeh-Jahanshahi this is marked as nothing to fix, but how does the bug that originated this issue affect the approaches you are working on? Does it have an effect, or was it just something you observed that caught your attention? Please clarify, thanks.

Soudeh-Jahanshahi commented 7 months ago

@ljgarcia: These 32 annotated XML files do not contribute to the post-processing approach. Specifically, if they are part of the input data, their tokens still contribute to training the Word2Vec model, but during post-annotation the presence of MeSH terms in the corresponding documents is neglected. However, compared to the size of the entire dataset, ignoring these documents in post-processing has only a negligible impact on the final evaluation results.
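A rough way to gauge that impact would be to count how many relevance pairs involve the 32 PMIDs; the file name "relevance_matrix.tsv" and the column names "pmid1"/"pmid2" below are hypothetical, not taken from the repository.

import pandas as pd

# Hypothetical file and column names, used only to sketch the check.
defect_list = {27817193, 28240519}  # ...extend with the 32 PMIDs listed above

pairs = pd.read_csv("relevance_matrix.tsv", sep="\t")
affected = pairs[pairs["pmid1"].isin(defect_list) | pairs["pmid2"].isin(defect_list)]
share = 100 * len(affected) / len(pairs)
print(f"{len(affected)} of {len(pairs)} relevance pairs ({share:.2f}%) involve a defective PMID")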