wdecoster / surpyvor

A python wrapper around SURVIVOR
MIT License
19 stars 1 forks source link

surpyvor upset #3

Closed Nazeeefa closed 4 years ago

Nazeeefa commented 5 years ago

Hi Wouter,

Hope you are doing well. I have been trying to get UpSet plot, I used the following command:

surpyvor upset --variants file1.vcf file2.vcf truth_set.vcf --plotout upset.png

and received the following error:

Executing SURVIVOR...DONE Writing to /tmp/bcftools-sort.kEAm7H [W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.) [E::bcf_write] Unchecked error (1), exiting Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.7/bin/surpyvor", line 11, in load_entry_point('surpyvor==0.5.0', 'console_scripts', 'surpyvor')() File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/surpyvor-0.5.0-py3.7.egg/surpyvor/surpyvor.py", line 42, in main File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/surpyvor-0.5.0-py3.7.egg/surpyvor/surpyvor.py", line 118, in upset File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/surpyvor-0.5.0-py3.7.egg/surpyvor/utils.py", line 74, in make_sets File "cyvcf2/cyvcf2.pyx", line 196, in cyvcf2.cyvcf2.VCF.init OSError: b'/var/folders/3m/f0595mlj0nb_dl8v6yjhddym0000gq/T/tmp8exnpd39' if not valid bcf or vcf [1]+ Exit 1 surpyvor upset --variants file1.vcf file2.vcf truth_set.vcf --plotout upset.png

I couldn't figure from the error which file is missing header information, but I checked that it was truth set vcf so I tried to index using tabix. Then, I got the following error from surpyvor:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Any pointers on how to solve this? The truth set I am using is the one available here.

I appreciate your advice and time.

Thank you, Nazeefa

wdecoster commented 5 years ago

Hi Nazeefa,

I believe this is a bug in SURVIVOR which I also came across last week, but the fix probably hasn't made it to a release yet. It surfaces when the chromosomes in files you are comparing/merging are not fully the same, e.g. one contains additional contigs. The result is that the merged file (which is a temporary file in surpyvor) contains an incomplete header, tripping up the next step.

Could you try if the following filtering helps:

bcftools view file1.vcf.gz $(grep -v '^#' truth_set.vcf | cut -f1 | sort -u | tr '\n' ' ') -o file1-only-normal-chromosomes.vcf.gz -O z

And the same for file2.vcf

You should have a look what's different between the files, but in my case, it were the decoy and unplaced contigs.

The second error you get suggests that the file is compressed, which on its own is not a problem, but is the file extension .gz? That might be necessary for it to be recognized. I should probably not trust file extensions and check if the file is really compressed (and how) by checking the first bytes, but that's not how it currently works.

Please let me know if you have more feedback or suggestions!

Cheers, Wouter

Nazeeefa commented 5 years ago

Hi Wouter,

I tried with filtering approach you mentioned, however I received the same error. Yes, the extension for the truth set file was in .gz, I compressed it and indexed it with tabix before using.

It turned out that truth set vcf file does not have same chromosome notation as subject vcf files i.e. chr. I fixed that with awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' and ran analyses with SURVIVOR and surpyvor.

Both seem to work, however surpyvor upset plot doesn't seem to show intersection between truth and my callsets. upset1

Thanks for a quick response and help, I really appreciate your time!

Cheers, Nazeefa

wdecoster commented 5 years ago

Hi Nazeefa,

Good to hear the error is gone now! Let me know if you have other examples in which you get errors, I would like to get rid of those of course.

I'm a bit confused by the large number of variants which are absent from all sets (the first bar). Do you vcf files contain a lot of 0/0 or ./. positions? Having this high bar for the "uncalled" positions may lead to too low resolution to show the bars for the triple intersection, but that's just speculation.

You could try filtering your vcfs. to remove those uncalled positions. How many variants are in your truth set?

Cheers, Wouter

Nazeeefa commented 5 years ago

Hi Wouter,

Sure! & Yes, the callsets (for my data) include lots of 0/0 positions - >700 but <1000 in each callset.

In general, when analysing SVs, do you think it's a good idea to remove uncalled positions? because these could be due to low coverage, for example. I do not know. Just wondering.

I removed the uncalled positions and ran surpyvor again. Sadly, it gives same output. upset_2

The truth set contains 192,352 variants.

Thank you again, I appreciate your time!

Cheers, Nazeefa

wdecoster commented 5 years ago

That's remarkable. Could you perhaps share the files you are using? That would make it easier to take a look at what's going on.

Nazeeefa commented 5 years ago

Hi Wouter,

I can only share with you the small test files that include some uncalled 0/0 positions (hope that's fine), and the truth set.

VCF File 1 VCF File 2

Truth Set (original file is 60 MB)

I noticed that, since test files contain very few (<250) SVs, surpyvor doesn't show their (not even individual) count in a plot - as compared to previous two plots I shared. nstd152_test

Cheers, Nazeefa

wdecoster commented 5 years ago

Hi Nazeefa,

I'll look into this deeper when I find the time in the next few days, but I'm quite sure the cause is that your truth set does not contain genotypes, but is a "position vcf". I suppose you could fix/fool the issue by adding GT 0/1 genotypes to a FORMAT column and SAMPLE column. Some sed would probably do the job, or any other scripting language. For the upset plot, the zygosity of the variant is not taken into account anyway.

I'll think about a better solution. It would also be good if surpyvor gave you a warning about this probably.

Cheers,
Wouter

Nazeeefa commented 5 years ago

Hi Wouter,

Thanks! It did not give any warning. I was wondering if it is possible to get stats of the amounts displayed in upset plot. Also, I don't understand the last bar/bit.

surpyvor_upset

Cheers, Nazeefa

wdecoster commented 5 years ago

You mean you want the number of SVs in each category printed out? That should be possible to add. I am currently on paternity leave, but can look into this more towards the end of the month, but I also welcome pull requests so feel free to try :)

The last bar seems to correspond to variants which were found in neither of the vcf files, but I assume you filtered for that, so that's indeed a bit strange. surpyvor upset has a --keepmerged option, which will save the merged vcf file which was used for the upset plot. Maybe you can have a look in that vcf file to see if you can find an explanation.