wikipathways / pathway-figure-ocr

Extracting gene sets from published pathway figures
Apache License 2.0

matches missing from figures__xrefs view #5

Open AlexanderPico opened 6 years ago

AlexanderPico commented 6 years ago

In this example, Cyclin E/A is successfully matched, added to success.txt and match_attempts, but it's missing from figures__xrefs. Here are the results from a query against match_attempts:

pfocr=# select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=769 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |   word    | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+-----------+---------------------+-------------------------------------+--------+------------------
                6 |       769 | p16       |              475262 | -n stop                             | 475262 | p16
                6 |       769 | INK4      |              473787 | -n stop                             | 473787 | INK4
                6 |       769 | Mol       |              475266 | -n stop                             | 475266 | MOL
                6 |       769 | CDK       |              463989 | -n stop                             | 463989 | CDK
                6 |       769 | SCF       |              464414 | -n stop                             | 464414 | SCF
                6 |       769 | CDK2      |              464337 | -n stop                             | 464337 | CDK2
                6 |       769 | Suv39H1   |              475294 | -n stop                             | 475294 | SUV39H1
                6 |       769 | SIN3A     |              475295 | -n stop                             | 475295 | SIN3A
                6 |       769 | CyclinE/A |              475305 | -n stop -n nfkc -n deburr -m expand | 475305 | CYCLINA
                6 |       769 | CyclinE/A |              464335 | -n stop -n nfkc -n deburr -m expand | 464335 | CYCLINE
                6 |       769 | E2F/1/2/3 |              463979 | -n stop -n nfkc -n deburr -m expand | 463979 | E2F
                6 |       769 | DHFR      |              475308 | -n stop                             | 475308 | DHFR
                6 |       769 | PCNA      |              475309 | -n stop                             | 475309 | PCNA
                6 |       769 | H2A       |              475310 | -n stop                             | 475310 | H2A

Everything is pulled into the view just fine except for the two CyclinE/A rows. I'm guessing some sort of uniqueness criterion is being applied to the word column when the view is constructed? Though it's odd that it excludes both rows and not just one, right?

AlexanderPico commented 6 years ago

Somehow, in contrast to the example above, this case with two identical words behaves just fine: the individual hits, AKT1 and AKT2, are properly included in figures__xrefs, so it's not simply a matter of excluding non-unique words...

pfocr=# select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=566 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |  word  | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+--------+---------------------+-------------------------------------+--------+------------------
                6 |       566 | PI3K   |              462515 | -n stop                             | 462515 | PI3K
                6 |       566 | Akt1/2 |              464705 | -n stop -n nfkc -n deburr -m expand | 464705 | AKT1
                6 |       566 | Akt1/2 |              465819 | -n stop -n nfkc -n deburr -m expand | 465819 | AKT2
                6 |       566 | JNK2   |              465589 | -n stop                             | 465589 | JNK2
                6 |       566 | CIDEA  |              472387 | -n stop                             | 472387 | CIDEA
                6 |       566 | CIDEC  |              472388 | -n stop                             | 472388 | CIDEC

AlexanderPico commented 6 years ago

Another case where a match did NOT get pulled into the figures__xrefs view, "CyclinD1":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=2026 and transformed_word not like 'dummy%' limit 100;

 ocr_processor_id | figure_id |   word   | transformed_word_id |                          transforms_applied                          |   id   | transformed_word 
------------------+-----------+----------+---------------------+----------------------------------------------------------------------+--------+------------------
                6 |      2026 | CB1      |              476682 | -n stop                                                              | 476682 | CB1
                6 |      2026 | PI3K     |              462515 | -n stop                                                              | 462515 | PI3K
                6 |      2026 | GSK-3β   |              462915 | -n stop -n nfkc -n deburr -m expand -m root -n swaps -n alphanumeric | 462915 | GSK3
                6 |      2026 | D1       |              463085 | -n stop                                                              | 463085 | D1
                6 |      2026 | CyclinD1 |              464644 | -n stop                                                              | 464644 | CYCLIND1

AlexanderPico commented 6 years ago

And another, "NF-KB":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=1875 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |      word       | transformed_word_id |                          transforms_applied                          |   id   | transformed_word 
------------------+-----------+-----------------+---------------------+----------------------------------------------------------------------+--------+------------------
                6 |       958 | PI3K/AKTpathway |              462515 | -n stop -n nfkc -n deburr -m expand                                  | 462515 | PI3K
                6 |       958 | PI3K/AKT        |              462522 | -n stop -n nfkc -n deburr -m expand                                  | 462522 | AKT
                6 |       958 | p38             |              462651 | -n stop                                                              | 462651 | p38
                6 |       958 | JNK             |              462633 | -n stop                                                              | 462633 | JNK
                6 |       958 | ERK             |              462776 | -n stop                                                              | 462776 | ERK
                6 |       958 | ROS             |              463928 | -n stop                                                              | 463928 | ROS
                6 |       958 | mTOR            |              463184 | -n stop                                                              | 463184 | MTOR
                6 |       958 | NF-KB           |              462632 | -n stop                                                              | 462632 | NF-KB
                6 |       958 | XIAP            |              463990 | -n stop                                                              | 463990 | XIAP
                6 |       958 | -(PTEN          |              463396 | -n stop -n nfkc -n deburr -m expand -m root -n swaps -n alphanumeric | 463396 | PTEN

AlexanderPico commented 6 years ago

Another case with "NF-KB":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=3247 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |  word  | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+--------+---------------------+-------------------------------------+--------+------------------
                6 |      3247 | RXFP2  |              505535 | -n stop                             | 505535 | RXFP2
                6 |      3247 | Akt    |              462522 | -n stop                             | 462522 | AKT
                6 |      3247 | PYK2   |              469751 | -n stop                             | 469751 | PYK2
                6 |      3247 | AC     |              462893 | -n stop                             | 462893 | AC
                6 |      3247 | CRAF   |              470309 | -n stop                             | 470309 | CRAF
                6 |      3247 | PKA    |              463347 | -n stop                             | 463347 | PKA
                6 |      3247 | IkBa   |              467857 | -n stop                             | 467857 | IKBA
                6 |      3247 | PKC    |              463219 | -n stop                             | 463219 | PKC
                6 |      3247 | NF-KB  |              462632 | -n stop                             | 462632 | NF-KB
                6 |      3247 | MEK1/2 |              463892 | -n stop -n nfkc -n deburr -m expand | 463892 | MEK1
                6 |      3247 | MEK1/2 |              463893 | -n stop -n nfkc -n deburr -m expand | 463893 | MEK2
                6 |      3247 | ERK1/2 |              462520 | -n stop -n nfkc -n deburr -m expand | 462520 | ERK1
                6 |      3247 | ERK1/2 |              462521 | -n stop -n nfkc -n deburr -m expand | 462521 | ERK2

...but why does it match before the hyphen has been removed? The lexicon only contains "NFKB".

ariutta commented 6 years ago

The symbols table doesn't contain anything starting with "CYCLIN":

SELECT * FROM symbols WHERE symbol LIKE 'CYC%';

(edit: but it does have items starting with "Cyclin")

ariutta commented 6 years ago

Turns out it was the non-alphanumeric characters like dashes.
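
For anyone landing here later, this is a minimal sketch of how non-alphanumeric characters can make a row drop out of a join. It is not the actual figures__xrefs definition, which isn't shown in this thread. The two-column tables, the symbols ids, and the `alnum_key` helper are all assumptions made for illustration; only the transformed_words ids and words are taken from the output above. The idea: if one side of the join holds "NF-KB" and the other "NFKB", an exact comparison loses the row, while comparing on a key with non-alphanumerics stripped keeps it.

```python
import sqlite3


def alnum_key(text: str) -> str:
    """Hypothetical helper: uppercase and drop every non-alphanumeric character."""
    return "".join(ch for ch in text if ch.isalnum()).upper()


conn = sqlite3.connect(":memory:")
# Expose the helper to SQL so it can be used inside a join condition.
conn.create_function("alnum_key", 1, alnum_key)

# Simplified, assumed schemas; the real pfocr tables have more columns.
conn.executescript(
    """
    CREATE TABLE transformed_words (id INTEGER PRIMARY KEY, transformed_word TEXT);
    CREATE TABLE symbols (id INTEGER PRIMARY KEY, symbol TEXT);
    INSERT INTO transformed_words VALUES (462632, 'NF-KB'), (475308, 'DHFR');
    INSERT INTO symbols VALUES (1, 'NFKB'), (2, 'DHFR');
    """
)

# Exact string join: 'NF-KB' has no identical symbol, so it silently drops out.
exact = conn.execute(
    """
    SELECT t.transformed_word
    FROM transformed_words t
    JOIN symbols s ON s.symbol = t.transformed_word
    ORDER BY t.id
    """
).fetchall()

# Join on the normalized key: both sides reduce to 'NFKB', so the row survives.
normalized = conn.execute(
    """
    SELECT t.transformed_word
    FROM transformed_words t
    JOIN symbols s ON alnum_key(s.symbol) = alnum_key(t.transformed_word)
    ORDER BY t.id
    """
).fetchall()

print("exact join:     ", exact)
print("normalized join:", normalized)
```

In PostgreSQL the same comparison could presumably be written with `regexp_replace` on both sides instead of a Python helper. Which side of the real view carries the dashes would still need to be checked against the actual view definition.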