Closed Nazeeefa closed 4 years ago
Hi Nazeefa,
I believe this is a bug in SURVIVOR which I also came across last week, but the fix probably hasn't made it to a release yet. It surfaces when the chromosomes in files you are comparing/merging are not fully the same, e.g. one contains additional contigs. The result is that the merged file (which is a temporary file in surpyvor) contains an incomplete header, tripping up the next step.
Could you try if the following filtering helps:
bcftools view file1.vcf.gz $(grep -v '^#' truth_set.vcf | cut -f1 | sort -u | tr '\n' ' ') -o file1-only-normal-chromosomes.vcf.gz -O z
And the same for file2.vcf
You should have a look what's different between the files, but in my case, it were the decoy and unplaced contigs.
The second error you get suggests that the file is compressed, which on its own is not a problem, but is the file extension .gz? That might be necessary for it to be recognized. I should probably not trust file extensions and check if the file is really compressed (and how) by checking the first bytes, but that's not how it currently works.
Please let me know if you have more feedback or suggestions!
Cheers, Wouter
Hi Wouter,
I tried with filtering approach you mentioned, however I received the same error. Yes, the extension for the truth set file was in .gz, I compressed it and indexed it with tabix before using.
It turned out that truth set vcf file does not have same chromosome notation as subject vcf files i.e. chr. I fixed that with awk '{if($0 !~ /^#/) print "chr"$0; else print $0}'
and ran analyses with SURVIVOR and surpyvor.
Both seem to work, however surpyvor upset plot doesn't seem to show intersection between truth and my callsets.
Thanks for a quick response and help, I really appreciate your time!
Cheers, Nazeefa
Hi Nazeefa,
Good to hear the error is gone now! Let me know if you have other examples in which you get errors, I would like to get rid of those of course.
I'm a bit confused by the large number of variants which are absent from all sets (the first bar). Do you vcf files contain a lot of 0/0 or ./. positions? Having this high bar for the "uncalled" positions may lead to too low resolution to show the bars for the triple intersection, but that's just speculation.
You could try filtering your vcfs. to remove those uncalled positions. How many variants are in your truth set?
Cheers, Wouter
Hi Wouter,
Sure! & Yes, the callsets (for my data) include lots of 0/0 positions - >700 but <1000 in each callset.
In general, when analysing SVs, do you think it's a good idea to remove uncalled positions? because these could be due to low coverage, for example. I do not know. Just wondering.
I removed the uncalled positions and ran surpyvor again. Sadly, it gives same output.
The truth set contains 192,352 variants.
Thank you again, I appreciate your time!
Cheers, Nazeefa
That's remarkable. Could you perhaps share the files you are using? That would make it easier to take a look at what's going on.
Hi Wouter,
I can only share with you the small test files that include some uncalled 0/0 positions (hope that's fine), and the truth set.
Truth Set (original file is 60 MB)
I noticed that, since test files contain very few (<250) SVs, surpyvor doesn't show their (not even individual) count in a plot - as compared to previous two plots I shared.
Cheers, Nazeefa
Hi Nazeefa,
I'll look into this deeper when I find the time in the next few days, but I'm quite sure the cause is that your truth set does not contain genotypes, but is a "position vcf". I suppose you could fix/fool the issue by adding GT 0/1
genotypes to a FORMAT column and SAMPLE column. Some sed
would probably do the job, or any other scripting language. For the upset plot, the zygosity of the variant is not taken into account anyway.
I'll think about a better solution. It would also be good if surpyvor gave you a warning about this probably.
Cheers,
Wouter
Hi Wouter,
Thanks! It did not give any warning. I was wondering if it is possible to get stats of the amounts displayed in upset plot. Also, I don't understand the last bar/bit.
Cheers, Nazeefa
You mean you want the number of SVs in each category printed out? That should be possible to add. I am currently on paternity leave, but can look into this more towards the end of the month, but I also welcome pull requests so feel free to try :)
The last bar seems to correspond to variants which were found in neither of the vcf files, but I assume you filtered for that, so that's indeed a bit strange. surpyvor upset has a --keepmerged
option, which will save the merged vcf file which was used for the upset plot. Maybe you can have a look in that vcf file to see if you can find an explanation.
Hi Wouter,
Hope you are doing well. I have been trying to get UpSet plot, I used the following command:
surpyvor upset --variants file1.vcf file2.vcf truth_set.vcf --plotout upset.png
and received the following error:
I couldn't figure from the error which file is missing header information, but I checked that it was truth set vcf so I tried to index using tabix. Then, I got the following error from surpyvor:
Any pointers on how to solve this? The truth set I am using is the one available here.
I appreciate your advice and time.
Thank you, Nazeefa