Open AlexanderPico opened 6 years ago
Somehow, in contrast to the example above, this case, where two rows share an identical word, behaves just fine: the individual hits, AKT1 and AKT2, are properly included in figures__xrefs, so it's not simply a matter of excluding non-unique words...
pfocr=# select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=566 and transformed_word not like 'dummy%' limit 100;
ocr_processor_id | figure_id | word | transformed_word_id | transforms_applied | id | transformed_word
------------------+-----------+--------+---------------------+-------------------------------------+--------+------------------
6 | 566 | PI3K | 462515 | -n stop | 462515 | PI3K
6 | 566 | Akt1/2 | 464705 | -n stop -n nfkc -n deburr -m expand | 464705 | AKT1
6 | 566 | Akt1/2 | 465819 | -n stop -n nfkc -n deburr -m expand | 465819 | AKT2
6 | 566 | JNK2 | 465589 | -n stop | 465589 | JNK2
6 | 566 | CIDEA | 472387 | -n stop | 472387 | CIDEA
6 | 566 | CIDEC | 472388 | -n stop | 472388 | CIDEC
Another case where a match did NOT get pulled into the figures__xrefs view, "CyclinD1":
select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=2026 and transformed_word not like 'dummy%' limit 100;
ocr_processor_id | figure_id | word | transformed_word_id | transforms_applied | id | transformed_word
------------------+-----------+----------+---------------------+----------------------------------------------------------------------+--------+------------------
6 | 2026 | CB1 | 476682 | -n stop | 476682 | CB1
6 | 2026 | PI3K | 462515 | -n stop | 462515 | PI3K
6 | 2026 | GSK-3β | 462915 | -n stop -n nfkc -n deburr -m expand -m root -n swaps -n alphanumeric | 462915 | GSK3
6 | 2026 | D1 | 463085 | -n stop | 463085 | D1
6 | 2026 | CyclinD1 | 464644 | -n stop | 464644 | CYCLIND1
And another, "NF-KB":
select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=1875 and transformed_word not like 'dummy%' limit 100;
ocr_processor_id | figure_id | word | transformed_word_id | transforms_applied | id | transformed_word
------------------+-----------+-----------------+---------------------+----------------------------------------------------------------------+--------+------------------
6 | 958 | PI3K/AKTpathway | 462515 | -n stop -n nfkc -n deburr -m expand | 462515 | PI3K
6 | 958 | PI3K/AKT | 462522 | -n stop -n nfkc -n deburr -m expand | 462522 | AKT
6 | 958 | p38 | 462651 | -n stop | 462651 | p38
6 | 958 | JNK | 462633 | -n stop | 462633 | JNK
6 | 958 | ERK | 462776 | -n stop | 462776 | ERK
6 | 958 | ROS | 463928 | -n stop | 463928 | ROS
6 | 958 | mTOR | 463184 | -n stop | 463184 | MTOR
6 | 958 | NF-KB | 462632 | -n stop | 462632 | NF-KB
6 | 958 | XIAP | 463990 | -n stop | 463990 | XIAP
6 | 958 | -(PTEN | 463396 | -n stop -n nfkc -n deburr -m expand -m root -n swaps -n alphanumeric | 463396 | PTEN
Another case with "NF-KB":
select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=3247 and transformed_word not like 'dummy%' limit 100;
ocr_processor_id | figure_id | word | transformed_word_id | transforms_applied | id | transformed_word
------------------+-----------+--------+---------------------+-------------------------------------+--------+------------------
6 | 3247 | RXFP2 | 505535 | -n stop | 505535 | RXFP2
6 | 3247 | Akt | 462522 | -n stop | 462522 | AKT
6 | 3247 | PYK2 | 469751 | -n stop | 469751 | PYK2
6 | 3247 | AC | 462893 | -n stop | 462893 | AC
6 | 3247 | CRAF | 470309 | -n stop | 470309 | CRAF
6 | 3247 | PKA | 463347 | -n stop | 463347 | PKA
6 | 3247 | IkBa | 467857 | -n stop | 467857 | IKBA
6 | 3247 | PKC | 463219 | -n stop | 463219 | PKC
6 | 3247 | NF-KB | 462632 | -n stop | 462632 | NF-KB
6 | 3247 | MEK1/2 | 463892 | -n stop -n nfkc -n deburr -m expand | 463892 | MEK1
6 | 3247 | MEK1/2 | 463893 | -n stop -n nfkc -n deburr -m expand | 463893 | MEK2
6 | 3247 | ERK1/2 | 462520 | -n stop -n nfkc -n deburr -m expand | 462520 | ERK1
6 | 3247 | ERK1/2 | 462521 | -n stop -n nfkc -n deburr -m expand | 462521 | ERK2
...but why is it matching before the hyphen has been removed? The lexicon only contains "NFKB".
The symbols table doesn't contain anything starting with "CYCLIN":
SELECT * FROM symbols WHERE symbol LIKE 'CYC%';
(edit: but it does have items starting with "Cyclin"; LIKE is case-sensitive in PostgreSQL, so 'CYC%' does not match them)
Turns out it was the non-alphanumeric characters like dashes.
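To illustrate why the dashes matter, here is a minimal Python sketch. It assumes a normalization step that uppercases a word and strips every non-alphanumeric character; the function name and the tiny lexicon are hypothetical and are not taken from the pipeline, whose actual "-n alphanumeric" transform is not shown in this thread.

```python
import re


def strip_non_alphanumeric(word: str) -> str:
    """Uppercase the word, then drop every character that is not A-Z or 0-9."""
    return re.sub(r"[^A-Z0-9]", "", word.upper())


# Hypothetical lexicon; the thread only states that it contains "NFKB".
lexicon = {"NFKB", "PTEN"}

for raw in ["NF-KB", "-(PTEN"]:
    normalized = strip_non_alphanumeric(raw)
    print(
        raw, "->", normalized,
        "| raw form in lexicon:", raw.upper() in lexicon,
        "| normalized form in lexicon:", normalized in lexicon,
    )
```

Under these assumptions, "NF-KB" only matches the lexicon entry "NFKB" once the hyphen is gone, so a row that records "-n stop" alone would not be expected to produce a hit.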
In this example, Cyclin E/A is successfully matched, added to success.txt and match_attempts, but it's missing from figures__xrefs. Here are the results from a query against match_attempts:
Everything is pulled into the view just fine except for the two CyclinE/A rows. I'm guessing there is some sort of uniqueness criterion being applied to the word column in the construction of the view?? Though it's odd that it's excluding both and not just one, right?
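One way to check the uniqueness guess is to list the words that expand to more than one transformed word, then compare that list with what actually shows up in figures__xrefs. The sketch below does the grouping in Python on (word, transformed_word) pairs copied from the figure 566 output elsewhere in this thread; the helper name is hypothetical, and the definition of the view itself is not shown here.

```python
from collections import defaultdict

# (word, transformed_word) pairs copied from the match_attempts rows for figure 566.
rows = [
    ("PI3K", "PI3K"),
    ("Akt1/2", "AKT1"),
    ("Akt1/2", "AKT2"),
    ("JNK2", "JNK2"),
]


def words_with_multiple_hits(pairs):
    """Return each word that maps to more than one distinct transformed word."""
    grouped = defaultdict(set)
    for word, transformed in pairs:
        grouped[word].add(transformed)
    return {w: sorted(t) for w, t in grouped.items() if len(t) > 1}


print(words_with_multiple_hits(rows))
```

If every word returned here were missing from the view, that would support the uniqueness guess; a word like Akt1/2 appearing in both lists would argue against it.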