Open Soudeh-Jahanshahi opened 11 months ago
The formatting of the TSV seems to be disrupted due to certain special characters present in the XML files listed above. To handle these characters within the TSV file, we utilized the "quotechar" parameter.
The relevant code can be found at line 335 in the xml_translate.py script, where the XML is converted into a TSV format using the following functionality:
publications_df.to_csv(output_file, sep="\t", index=False, quotechar="`")
Certain special characters, such as Î, ±, ≥, %, and a few more, are part of the text. The '%' symbol, while generally a regular symbol, can cause a formatting issue in specific cases where there is no whitespace between the number and the '%' symbol, e.g., 20% vs. 20 %.
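To illustrate why the quote character matters, here is a minimal sketch using Python's standard-library `csv` module rather than pandas (chosen only so the example runs without dependencies). The sample row is invented for illustration and is not taken from the dataset. A field containing a tab or a line break is wrapped in the quote character on write, so a reader configured with the same delimiter and quote character gets it back as one field, while anything that splits on physical lines does not.

```python
import csv
import io

# Hypothetical sample row: the abstract contains a tab and a line break.
row = ["12345", "A title with 20% and ≥ signs", "First line\tof abstract\nsecond line"]

# Write a header and one record as TSV, using a backtick as the quote character.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quotechar="`", lineterminator="\n")
writer.writerow(["pmid", "title", "abstract"])
writer.writerow(row)
text = buffer.getvalue()

# Reading with the same delimiter and quote character restores one record.
records = list(csv.reader(io.StringIO(text), delimiter="\t", quotechar="`"))

# Splitting on physical lines instead breaks the quoted record apart.
physical_lines = text.splitlines()

print(len(records), len(physical_lines))
```

Under these assumptions, the file only looks broken to consumers that ignore the quote character, so any downstream reader would need to be given the same `sep` and `quotechar` values used when writing.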
Nothing to be fixed.
@Soudeh-Jahanshahi this is marked as nothing to fix, but how does the bug that originated this issue affect the approaches you are working on? Does it have an effect, or was it just something you observed that caught your attention? Please clarify, thanks.
@ljgarcia: These 32 annotated XML files do not contribute to the post-processing approach. Specifically, if they are part of the input data, their tokens only contribute to building the Word2Vec model; during post-annotation, the presence of MeSH terms in the corresponding documents is ignored. However, compared to the size of the entire dataset, ignoring these documents in post-processing would have only a negligible impact on the final evaluation results.
The script "xml_translate.py" has a bug when processing 32 annotated XML files!
defect_list = [27817193, 28240519, 28244787, 28438127, 28670879, 28707850, 28749127, 28749635, 28843255, 29095577, 29099159, 29116736, 29132205, 29172291, 29206099, 29220461, 29235983, 29283531, 29373899, 29374411, 29388757, 29451968, 29481028, 29533587, 29616530, 29630142, 29644823, 29688353, 29688370, 29693981, 29716180, 29801411]
For PMIDs in this list, the single tsv file is not generated correctly: the code splits their title and abstract between different lines.
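The thread does not confirm what causes the split. If it comes from tabs or line breaks embedded in the title or abstract (an assumption), a sketch like the following could help locate the affected rows and normalize the text before writing. Both helper names are hypothetical and are not part of xml_translate.py.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (tabs, line breaks, spaces) with one space."""
    return re.sub(r"\s+", " ", text).strip()


def find_split_rows(tsv_text, expected_fields):
    """Return 1-based physical line numbers whose tab-separated field count is off."""
    return [
        number
        for number, line in enumerate(tsv_text.split("\n"), start=1)
        if line and len(line.split("\t")) != expected_fields
    ]


# Hypothetical TSV content in which the first record's title is split over two lines.
broken = (
    "pmid\ttitle\tabstract\n"
    "1\tTitle part\n"
    "rest of title\tAbstract\n"
    "2\tOk title\tOk abstract\n"
)

bad_lines = find_split_rows(broken, expected_fields=3)
cleaned_title = collapse_whitespace("Title part\nrest of\ttitle ")
print(bad_lines, cleaned_title)
```

Applying `collapse_whitespace` to the title and abstract columns before calling `to_csv` would keep each record on one physical line, at the cost of losing any original paragraph breaks in the abstract.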