Closed lauramason326 closed 2 years ago
Hi Laura,
That's a new one. My best guess is that when you dedupe, you are left with a fasta file where multiple contigs have the same name, and that confuses things - coverm thinks it is a mapping to the shorter contig when it is a mapping to the larger one, and so throws that indexing error.
I'd suggest making the names unique somehow and then retrying.
One thing I don't understand though - if the above hypothesis is true, you have a contig which is 802 bp long, when you said that you removed everything <1kb. Maybe the thresholding didn't quite work or I'm misunderstanding something?
Hi Ben,
So I ran this same command with all contigs (not filtered for >1kb), and that is the error I pasted above - sorry about that. The error from the filtered >1kb file is:
An interesting thing, though: I looked at the dereplicated and non-dereplicated files and checked the dereplicated file for duplicates. It seems that the files are structured differently, despite both being .fa.
Also, it does not seem like the contigs have the same names, but that sections of contigs are duplicates:
Do you think that maybe coverM is not recognizing the lines in the dereplicated file to be from the same contig and is reading them separately?
Thanks Laura
Hmm, it is just the first "word" of the contig name that matters. What does this give?
grep '>' contigs.fa |sed 's/ .*//' |sort |uniq |sort -rn |head
Hm - So I tried that and this was the result:
(base) bash-4.2$ grep '>' derep_COA1R_075_combo_1k.final_modified.contigs.fa |sed 's/ .*//' |sort |uniq |sort -rn |head
>k141_999995
>k141_999973
>k141_999964
>k141_999935
>k141_999929
>k141_999922
>k141_999906
>k141_999903
>k141_999897
>k141_999895
I made the dereplicated file into a single-line fasta file, just to see if that would help. Does this mean that these are the repeated contig names?
Oh sorry, a mistake - I meant:
grep '>' contigs.fa |sed 's/ .*//' |sort |uniq -c |sort -rn |head
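(For context: the `-c` flag is what makes the repeats visible. Plain `uniq` silently collapses adjacent duplicates, while `uniq -c` prefixes each name with its count, so after a reverse numeric sort any duplicated header rises to the top. A toy header list shows the difference:)

```shell
# uniq alone would print each name once with no hint of repetition;
# uniq -c prefixes each line with its count, and sort -rn puts the
# most-duplicated names first.
printf '>k141_1\n>k141_1\n>k141_2\n' | sort | uniq -c | sort -rn | head
# the duplicated name appears first, with a count of 2
```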
No worries. Here's the results:
(base) bash-4.2$ grep '>' derep_COA1R_075_combo_1k.final_modified.contigs.fa |sed 's/ .*//' |sort |uniq -c |sort -rn |head
2 >k141_999865
2 >k141_999742
2 >k141_999587
2 >k141_999450
2 >k141_999228
2 >k141_999182
2 >k141_999097
2 >k141_998935
2 >k141_998742
2 >k141_998682
Does this mean there are 2 copies of these contigs?
Yes, it does. Try changing the names so that isn't the case.
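One way to do the renaming (a sketch, not anything built into CoverM - the filenames and `_2`/`_3` suffix scheme here are just examples) is an awk pass that appends a running suffix to any header name it has already seen:

```shell
# Demo input: two contigs that share the header name k141_1.
printf '>k141_1 flag=1\nACGT\n>k141_1 flag=2\nGGTT\n>k141_2\nTTTT\n' > demo.fa

# Append "_2", "_3", ... to the first word of any repeated header,
# leaving the first occurrence (and the rest of the header line) intact.
awk '/^>/ { name = $1; if (seen[name]++) sub(/^>[^ \t]*/, name "_" seen[name]) } { print }' \
    demo.fa > demo_unique.fa

grep '>' demo_unique.fa   # the second copy is now >k141_1_2 flag=2
```

Only the first whitespace-delimited word of each header needs to be unique, since that is what mappers report. If you have seqkit installed, its `rename` subcommand is meant for exactly this kind of duplicated-ID cleanup.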
-------------- Ben Woodcroft Group leader, Centre for Microbiome Research, QUT
Hi, I am trying to run CoverM 0.6.1 on a contig file made from a set of assemblies plus a set of assemblies that were made from subsampled reads. All assemblies in this case come from the same sample. The files were concatenated, dereplicated with dedupe.sh, and then all contigs <1kb in length were removed. I am using CoverM to determine the % of the trimmed reads from this sample that map to this concatenated contig file. However, I get this error or a similar one on each sample I run:
Here is my job script:
Any idea what's going on? Is the contig file just too big? There are 3.6M contigs in the contig file in the above example.
Thanks in advance!
Laura