Closed lauramason326 closed 2 years ago
Hi Laura,
That's a new one. My best guess is that when you dedupe, you are left with a fasta file where multiple contigs have the same name, and that confuses things - coverm thinks it is a mapping to the shorter contig when it is a mapping to the larger one, and so throws that indexing error.
I'd suggest making the names unique somehow and then retrying.
One thing I don't understand though - if the above hypothesis is true, you have a contig which is 802 bp long, when you said that you removed everything <1kb. Maybe the thresholding didn't quite work or I'm misunderstanding something?
Hi Ben,
So I ran this same command with all contigs (not filtered for >1kb), and that is the error I pasted above - sorry about that. The error from the filtered >1kb file is:
An interesting thing, though: I looked at the dereplicated and non-dereplicated files and checked the dereplicated file for duplicates. It seems that the files are structured differently, despite both being .fa.
Also, it does not seem like the contigs have the same names, but that sections of contigs are duplicates:
Do you think that maybe coverM is not recognizing the lines in the dereplicated file to be from the same contig and is reading them separately?
Thanks Laura
Hmm, it is just the first "word" of the contig name that matters. What does this give?
grep '>' contigs.fa |sed 's/ .*//' |sort |uniq |sort -rn |head
Hm - So I tried that and this was the result:
(base) bash-4.2$ grep '>' derep_COA1R_075_combo_1k.final_modified.contigs.fa |sed 's/ .*//' |sort |uniq |sort -rn |head
>k141_999995
>k141_999973
>k141_999964
>k141_999935
>k141_999929
>k141_999922
>k141_999906
>k141_999903
>k141_999897
>k141_999895
I made the dereplicated file into a single-line fasta file, just to see if that would help. Does this mean that these are the repeated contig names?
Oh sorry, a mistake - I meant:
grep '>' contigs.fa |sed 's/ .*//' |sort |uniq -c |sort -rn |head
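(For context: the `-c` flag is what makes the repeats visible. Plain `uniq` silently collapses adjacent duplicates, while `uniq -c` prefixes each name with its count, so after a reverse numeric sort any duplicated header rises to the top. A toy header list shows the difference:)

```shell
# uniq alone would print each name once with no hint of repetition;
# uniq -c prefixes each line with its count, and sort -rn puts the
# most-duplicated names first.
printf '>k141_1\n>k141_1\n>k141_2\n' | sort | uniq -c | sort -rn | head
# the duplicated name appears first, with a count of 2
```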
No worries. Here's the results:
(base) bash-4.2$ grep '>' derep_COA1R_075_combo_1k.final_modified.contigs.fa |sed 's/ .*//' |sort |uniq -c |sort -rn |head
2 >k141_999865
2 >k141_999742
2 >k141_999587
2 >k141_999450
2 >k141_999228
2 >k141_999182
2 >k141_999097
2 >k141_998935
2 >k141_998742
2 >k141_998682
Does this mean there are 2 copies of these contigs?
Yes, it does. Try changing the names so that isn't the case.
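One way to do the renaming (a sketch, not anything built into CoverM - the filenames and `_2`/`_3` suffix scheme here are just examples) is an awk pass that appends a running suffix to any header name it has already seen:

```shell
# Demo input: two contigs that share the header name k141_1.
printf '>k141_1 flag=1\nACGT\n>k141_1 flag=2\nGGTT\n>k141_2\nTTTT\n' > demo.fa

# Append "_2", "_3", ... to the first word of any repeated header,
# leaving the first occurrence (and the rest of the header line) intact.
awk '/^>/ { name = $1; if (seen[name]++) sub(/^>[^ \t]*/, name "_" seen[name]) } { print }' \
    demo.fa > demo_unique.fa

grep '>' demo_unique.fa   # the second copy is now >k141_1_2 flag=2
```

Only the first whitespace-delimited word of each header needs to be unique, since that is what mappers report. If you have seqkit installed, its `rename` subcommand is meant for exactly this kind of duplicated-ID cleanup.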
-------------- Ben Woodcroft Group leader, Centre for Microbiome Research, QUT
Hi, I am trying to run CoverM 0.6.1 on a contig file made from a set of assemblies plus a set of assemblies that were made from subsampled reads. All assemblies in this case come from the same sample. The files were concatenated, dereplicated with dedupe.sh, and then all contigs <1kb in length were removed. I am using CoverM to determine the % of the trimmed reads from this sample that map to this concatenated contig file. However, I get this error or a similar one on each sample I run:
Here is my job script:
Any idea what's going on? Is the contig file just too big? There are 3.6M contigs in the contig file in the above example.
Thanks in advance!
Laura