zwdzwd / transvar

TransVar - multiway annotator for precision medicine

Performance issue in faidx.py fetch_sequence #31

Closed: mmoisse closed this issue 4 years ago

mmoisse commented 5 years ago

I noticed a 10x performance drop for some variants compared to my older transvar version (https://bitbucket.org/wanding/transvar/commits/8a7a774618174bd591e8821b9c7c7fd5c03ce8c4). I traced the drop to the addition of decode() in the fetch_sequence function, which converts seq from str to unicode; apparently concatenating unicode is much slower than concatenating str: https://github.com/zwdzwd/transvar/blob/28a725dfb30acbd5c5cde7a7c8015ffdcbb1826b/transvar/faidx.py#L81-L83

I suggest either concatenating the unicode only once, at the end of the loop, or removing the decode() call.
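The first option in the suggestion above could look roughly like the sketch below. This is not the actual faidx.py code: the function name `fetch_sequence_sketch`, its parameters, and the line-by-line reading loop are hypothetical, chosen only to illustrate collecting raw byte chunks inside the loop and decoding a single time after joining them.

```python
import io


def fetch_sequence_sketch(handle, offset, length):
    """Hypothetical helper: read `length` bases starting at byte `offset`.

    `handle` is assumed to be a binary file-like object holding
    newline-wrapped sequence data.
    """
    handle.seek(offset)
    chunks = []            # raw bytes pieces, not decoded inside the loop
    remaining = length
    while remaining > 0:
        line = handle.readline()
        if not line:       # end of file reached before `length` bases
            break
        piece = line.rstrip(b"\r\n")[:remaining]
        chunks.append(piece)
        remaining -= len(piece)
    # Join the bytes once and decode once, instead of decoding and
    # concatenating unicode strings on every iteration.
    return b"".join(chunks).decode("ascii")


if __name__ == "__main__":
    demo = io.BytesIO(b"ACGT\nTTGA\nCC\n")
    print(fetch_sequence_sketch(demo, 0, 6))
```

Appending to a list and joining once keeps the per-iteration work constant, and the single decode at the end still returns a text string to callers.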

test.vcf.gz

transvar ganno --vcf test.vcf.gz --refversion hg19 --ccds 

Current version: 46.5366 s
Version without decode(): 5.30291 s
Version without one join(): 9.85117 s

zmiimz commented 4 years ago

I confirm that this patch significantly improves performance of transvar. Thanks!

zwdzwd commented 4 years ago

Thanks for the suggestion and confirmation. Sorry for having missed this. Will merge and integrate soon.