wikipathways / wikipathways-development

Roadmap planning, developer documentation, contribution guidelines

Parsing WPx-datanodes.tsv files #62

Closed ariutta closed 2 years ago

ariutta commented 2 years ago

These aren't always being serialized/parsed consistently. Examples:

WP5064-datanodes.tsv

ChEH    Protein P34913  Uniprot-TrEMBL  AKA cholesterol epoxide hydrolase (ChEH); EC: 3.3.2.11</br>"ChEH is a dimer of 7-dehydrocholesterol reductase (DHCR7) and 3β-hydroxysteroid-Δ8-Δ7-isomerase (D8D7I)"

WP5114-datanodes.tsv

XPA GeneProduct ENSG00000136936 Ensembl "Furthermore, the DDB2 complex-mediated ubiquitylation plays a role in recruiting XPA to damaged sites".

WP5166-datanodes.tsv

LDH Protein 1.1.1.27    BRENDA  In book it says "(R)-lactate dehydrogenase" - investigate</br>L-Lactate dehydrogenase
...
1P5CDH  Protein 1.2.1.88    BRENDA  1-pyrroline-5-carboxylate dehydrogenase</br>Add "EC" with EC identifier?

~~This may be due to the change I made here for the parameters for parsing quoted fields: https://github.com/wikipathways/wikipathways-database/commit/288ce733e1607777c6ede0b290c4639382da3e29#diff-f3a0c6cf05b703b1241cd9c83a3c9efad10e6e84ce8f1246f0ffd25e1f2b8fcbL82~~ Update: the jekyll site is parsing the TSV files independently of my script.

The annotations CSV files do use quotes to indicate fields, so I changed all the parsing to support quoted fields. @mkutmon, do you want to change the serialization of WPx-datanodes.tsv files to support quoted fields, or should we use quoting=csv.QUOTE_NONE here when parsing WPx-datanodes.tsv files?
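To make the trade-off concrete, here is a minimal sketch of how Python's `csv` module treats the WP5114 row under both settings. It assumes the columns are tab-separated, and `parse` is a hypothetical helper for illustration, not the function in my script.

```python
import csv
import io

# WP5114 row from above; the tab separators are an assumption.
ROW = (
    "XPA\tGeneProduct\tENSG00000136936\tEnsembl\t"
    '"Furthermore, the DDB2 complex-mediated ubiquitylation plays a role '
    'in recruiting XPA to damaged sites".\n'
)


def parse(text, **reader_kwargs):
    """Hypothetical helper: parse TSV text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", **reader_kwargs))


# Default dialect: a leading double quote opens a quoted field, so the
# quote characters are consumed instead of being kept as data.
quoted = parse(ROW)[0][4]

# QUOTE_NONE: double quotes are ordinary characters and survive as written.
literal = parse(ROW, quoting=csv.QUOTE_NONE)[0][4]

print(quoted)
print(literal)
```

With the default settings the comment comes back without its quotation marks, so the parsed value no longer matches what was serialized. With `quoting=csv.QUOTE_NONE` the text is preserved, but then a comment containing a literal tab could no longer be represented.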

AlexanderPico commented 2 years ago

I identified 32 busted tsvs and removed them so the site is building again. You can find them here: https://github.com/wikipathways/wikipathways.github.io/tree/main/bad%20md%20and%20tsv%20files

mkutmon commented 2 years ago

ariutta commented 2 years ago

Tina, using a TSV/CSV library that handles escaping special characters might be the most reliable solution.
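As a rough illustration of what I mean, here is a sketch using Python's standard `csv` module. It is only an example of the idea, not the actual serializer; the row is the WP5166 one from above with the column split assumed.

```python
import csv
import io

# WP5166 row from above; the split into five columns is an assumption.
rows = [
    [
        "LDH",
        "Protein",
        "1.1.1.27",
        "BRENDA",
        'In book it says "(R)-lactate dehydrogenase" - investigate</br>'
        "L-Lactate dehydrogenase",
    ],
]

# Serialize: the writer quotes any field containing a double quote (or a
# tab or newline) and doubles the embedded quotes.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
serialized = buf.getvalue()

# Parse with the matching default dialect: the original text comes back.
parsed = list(csv.reader(io.StringIO(serialized), delimiter="\t"))

print(serialized, end="")
print(parsed == rows)
```

The point is that writer and reader agree on one escaping convention, so comments with embedded quotes round-trip unchanged. Any library with equivalent quoting rules should behave the same way.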

ariutta commented 2 years ago

This quoting=csv.QUOTE_NONE change fixed the parsing for my script, but it doesn't fix the parsing for the jekyll site.

Let's remove quoting=csv.QUOTE_NONE when you update the serialization of the datanode TSVs.

ariutta commented 2 years ago

I temporarily disabled the copying of datanode TSV files over to the jekyll site: https://github.com/wikipathways/wikipathways-database/blob/7ae256201c2e6972d271e77145244de0e26fcc70/.github/workflows/on_gpml_change.yml#L226-L228

Once the TSV file serialization is working, we can re-enable that line.

mkutmon commented 2 years ago

@ariutta can you check if the new data nodes comment format solves the issue? I still need to fix the ordering but wanted to fix the comments problem first.

AlexanderPico commented 2 years ago

@ariutta @mkutmon What's the status of this issue? Looks like we are not copying over datanode.tsv files to the jekyll site until this is resolved.

ariutta commented 2 years ago

I guess we'd need to give it a try -- just copy some over and see whether Jekyll correctly parses them. I haven't checked lately, but Tina's change may have worked.

ariutta commented 2 years ago

@AlexanderPico, @mkutmon, I re-enabled datanodes.tsv: https://github.com/wikipathways/wikipathways-database/blob/834874d96bc2b4d8054da070cbe0e1d7480d8266/.github/workflows/on_gpml_change.yml#L350

We'll have to keep an eye on whether Jekyll can parse the files with the updated formatting.

AlexanderPico commented 2 years ago

I tested the following changes by editing the GPML files in wp-db and seeing the results on the new site:

So, looks like @mkutmon's checklist is too active. I'll comment those lines out again.

AlexanderPico commented 2 years ago

Replacing all double quotes with single quotes resolves the TSV parsing issues for all the "bad" datanode.tsv cases, e.g., WP1763:

Ctr9    GeneProduct 22083   Entrez Gene full name: Ctr9 (Alco called "SH2 domain-binding protein 1")

AlexanderPico commented 2 years ago

Made a new release of meta-data-action jar attempting to fix double quotes on comments: https://github.com/wikipathways/meta-data-action/releases/tag/v0.0.2
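For reference, the double-to-single-quote replacement described above amounts to something like the following Python sketch. This is not the jar's actual code; `sanitize_comment` is a hypothetical name, and the column split of the WP1763 row is assumed.

```python
def sanitize_comment(comment: str) -> str:
    """Hypothetical helper: swap double quotes for single quotes."""
    return comment.replace('"', "'")


# WP1763 row from the earlier comment; the column boundaries are an assumption.
fields = [
    "Ctr9",
    "GeneProduct",
    "22083",
    "Entrez Gene",
    'full name: Ctr9 (Alco called "SH2 domain-binding protein 1")',
]

# Only the free-text comment column (last) is rewritten before joining.
line = "\t".join(fields[:-1] + [sanitize_comment(fields[-1])])
print(line)
```

Since the output contains no double quotes, a parser has nothing to interpret as field quoting, at the cost of slightly altering the comment text.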

AlexanderPico commented 2 years ago

It worked! Now, I'm uncommenting the cp datanodes.tsv line as it should work for all the files previously marked "bad."