zanfona734 / zanfona

A genome assembly finishing tool
MIT License
4 stars 0 forks source link

Duplicate contigs in finished assemblies #1

Open jphruska opened 1 month ago

jphruska commented 1 month ago

Hello --

Thank you so much for making an accessible genome finishing tool. I've greatly enjoyed running the pipeline, and it was wonderfully easy to install and run. I did want to ask about one issue I've run into, that could entirely be user error. In case it wasn't, I was wondering if you might be able to help me out with an issue regarding potentially duplicate contigs in the polished assemblies.

I ran the pipeline successfully, with no issues or fatal error messages. However, when trying to convert my assemblies to 2bit I ran into an error message referencing "Duplicate sequence names" in the assemblies. This happened for four different individuals, from the same species, and all blasted to the same reference genome. After inspection, I discovered that not only did the assemblies contain contigs with duplicate names, the contigs seemed to be duplicated themselves. For example, one of my individuals had two contigs named "NODE_59443_length_307_cov_1.210317" that appear to be identical in sequence:

grep -A 2 "NODE_59443_length" contigs_2.fasta

NODE_59443_length_307_cov_1.210317 GCCTGATGTATTTATAGATTTTCAATTACATATATCATAGCCTGCAATACATTCCTAATACTACTTTGAA ACTTAAATCACTGCCTTAGTTTTTTTCCTTATCTGTTTCAAGTCTGAGCACTTTTTAAAAAAGAAGAAAA

NODE_59443_length_307_cov_1.210317 GCCTGATGTATTTATAGATTTTCAATTACATATATCATAGCCTGCAATACATTCCTAATACTACTTTGAA ACTTAAATCACTGCCTTAGTTTTTTTCCTTATCTGTTTCAAGTCTGAGCACTTTTTAAAAAAGAAGAAAA

I backtracked and it appears this error might have to do with two joins involving the same assembly contig and two different contigs of the reference genome (JACDOX010001555.1 and JACDOX010001572.1)?

grep "NODE_59443" assembly_38_joins.csv JACDOX010001555.1,NODE_59443_length_307_cov_1.210317,forward,NODE_210648_length_227_cov_1.162791,forward,79 JACDOX010001572.1,NODE_59443_length_307_cov_1.210317,reverse,NODE_210648_length_227_cov_1.162791,reverse,-201

Do we anticipate the finished assemblies would contain duplicate sequences?

Any assistance on the matter would be greatly appreciated.

[I see from faToTwoBit that duplicates can be ignored, so this is likely not critical, but I just wanted to make sure this was not indicative of a larger issue.]

Thank you.

Jack

zanfona734 commented 1 month ago

Hi Jack, Thanks for your email, and apologies for the delay in response. The developer believes that the cause of this is a repeat region in the reference genome. If you blast this region against the reference, do you find multiple places where it occurs? If so, this is the cause, and not necessarily an error. Let me know if this seems to be the case. Kind regards, Stacy On Friday, July 19, 2024 at 06:45:20 PM EDT, Jack Hruska @.***> wrote:

Hello --

Thank you so much for making an accessible genome finishing tool. I've greatly enjoyed running the pipeline, and it was wonderfully easy to install and run. I did want to ask about one issue I've run into, that could entirely be user error. In case it wasn't, I was wondering if you might be able to help me out with an issue regarding potentially duplicate contigs in the polished assemblies.

I ran the pipeline successfully, with no issues or fatal error messages. However, when trying to convert my assemblies to 2bit I ran into an error message referencing "Duplicate sequence names" in the assemblies. This happened for four different individuals, from the same species, and all blasted to the same reference genome. After inspection, I discovered that not only did the assemblies contain contigs with duplicate names, the contigs seemed to be duplicated themselves. For example, one of my individuals had two contigs named "NODE_59443_length_307_cov_1.210317" that appear to be identical in sequence:

grep -A 2 "NODE_59443_length" contigs_2.fasta

NODE_59443_length_307_cov_1.210317 GCCTGATGTATTTATAGATTTTCAATTACATATATCATAGCCTGCAATACATTCCTAATACTACTTTGAA ACTTAAATCACTGCCTTAGTTTTTTTCCTTATCTGTTTCAAGTCTGAGCACTTTTTAAAAAAGAAGAAAA

NODE_59443_length_307_cov_1.210317 GCCTGATGTATTTATAGATTTTCAATTACATATATCATAGCCTGCAATACATTCCTAATACTACTTTGAA ACTTAAATCACTGCCTTAGTTTTTTTCCTTATCTGTTTCAAGTCTGAGCACTTTTTAAAAAAGAAGAAAA

I backtracked and it appears this error might have to do with two joins involving the same assembly contig and two different contigs of the reference genome (JACDOX010001555.1 and JACDOX010001572.1)?

grep "NODE_59443" assembly_38_joins.csv JACDOX010001555.1,NODE_59443_length_307_cov_1.210317,forward,NODE_210648_length_227_cov_1.162791,forward,79 JACDOX010001572.1,NODE_59443_length_307_cov_1.210317,reverse,NODE_210648_length_227_cov_1.162791,reverse,-201

Do we anticipate the finished assemblies would contain duplicate sequences?

Any assistance on the matter would be greatly appreciated.

[I see from faToTwoBit that duplicates can be ignored, so this is likely not critical, but I just wanted to make sure this was not indicative of a larger issue.]

Thank you.

Jack

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>

jphruska commented 1 month ago

Hi Stacy --

Thank you for the reply.

I've isolated the two contigs with duplicated names, and blasted each to the reference genome. I am now also aware that they are not duplicates, but do share a high degree of sequence similarity (identical for the first 307 bp).

Contig A:

NODE_59443_length_307_cov_1.210317 GCCTGATGTATTTATAGATTTTCAATTACATATATCATAGCCTGCAATACATTCCTAATACTACTTTGAA ACTTAAATCACTGCCTTAGTTTTTTTCCTTATCTGTTTCAAGTCTGAGCACTTTTTAAAAAAGAAGAAAA AGAGAGAGAGAGAGAGAGAGACTAGGTAATGGTGTGTCCAGAAGAAGCTTGAATTGATGAAAAAATGAAA GTACACAGATATTAAGCAGAAAAATTATAAACTTATGTTTCTATAGTTGTAAGTAAGAATATGCCTTAAG TAAGAAACTGTATTGGGTAAGATGGTG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTAGAAGTATACAGACATTAAGCAGAAAAATTATAGACTTATGTTTCTATAGTTGTAAGTAAGAATATGCC TTAAGTAAGAAACTTAGTATTAAGTAAGAAACTTAAGTATTGGGTAAGATGGTGAAGGGTTTTTAGTGTA GTAATGTAATTGCACAGAGTAAGATCAGAGTTTAGAGGTTATAACAGAAGCAAGTGTGTGTATGAATATC

Contig B:

NODE_59443_length_307_cov_1.210317 GCCTGATGTATTTATAGATTTTCAATTACATATATCATAGCCTGCAATACATTCCTAATACTACTTTGAA ACTTAAATCACTGCCTTAGTTTTTTTCCTTATCTGTTTCAAGTCTGAGCACTTTTTAAAAAAGAAGAAAA AGAGAGAGAGAGAGAGAGAGACTAGGTAATGGTGTGTCCAGAAGAAGCTTGAATTGATGAAAAAATGAAA GTACACAGATATTAAGCAGAAAAATTATAAACTTATGTTTCTATAGTTGTAAGTAAGAATATGCCTTAAG TAAGAAACTGTATTGGGTAAGATGGTG ATGAATATCCAGAGCAGCCCATTGGA

Here are the blast results for each:

Contig A:

BLASTN 2.9.0+ Query: NODE_59443_length_307_cov_1.210317 Database: assembly_38 Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score 6 hits found NODE_59443_length_307_cov_1.210317 NODE_59443_length_307_cov_1.210317 100.000 307 0 0 1 307 307 1 8.45e-161 568 NODE_59443_length_307_cov_1.210317 NODE_59443_length_307_cov_1.210317 92.708 96 5 1 390 485 99 6 2.82e-31 137 NODE_59443_length_307_cov_1.210317 NODE_210648_length_227_cov_1.162791 100.000 227 0 0 387 613 227 1 2.50e-116 420 NODE_59443_length_307_cov_1.210317 NODE_210648_length_227_cov_1.162791 92.708 96 5 1 209 302 224 129 2.82e-31 137 NODE_59443_length_307_cov_1.210317 NODE_224269_length_223_cov_1.196429 92.248 129 7 3 181 307 153 26 4.62e-44 180 NODE_59443_length_307_cov_1.210317 NODE_224269_length_223_cov_1.196429 93.182 88 4 1 390 477 124 39 1.70e-28 128

Contig B:

BLASTN 2.9.0+ Query: NODE_59443_length_307_cov_1.210317 Database: assembly_38 Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score 3 hits found NODE_59443_length_307_cov_1.210317 NODE_59443_length_307_cov_1.210317 100.000 307 0 0 1 307 307 1 4.49e-161 568 NODE_59443_length_307_cov_1.210317 NODE_224269_length_223_cov_1.196429 92.308 130 7 3 181 308 153 25 6.84e-45 182 NODE_59443_length_307_cov_1.210317 NODE_210648_length_227_cov_1.162791 92.708 96 5 1 209 302 224 129 1.50e-31 137

It does appear each contig is matching more than one region of the reference genome, in particular the first contig, which mappes with 100 % identity to two distinct contigs. Given this information is it therefore to be expected that zanfona would retain the same name for these two distinct sequences?

Thanks again for your help.

Jack

zanfona734 commented 1 month ago

Hi Jack, I don't think the same name was intended, but probably didn't come up when it was being developed and tested. Having said that, duplicate names might mess up some downstream applications. I suggest you run this script afterwards:

cat filename1.fasta | awk '/^>/{$0=">"++i}1' > filename2.fasta

That will renumber everything and save you some potential trouble later.

Kind regards, Stacy On Thursday, August 1, 2024 at 05:17:32 PM EDT, Jack Hruska @.***> wrote:

Hi Stacy --

Thank you for the reply.

I've isolated the two contigs with duplicated names, and blasted each to the reference genome. I am now also aware that they are not duplicates, but do share a high degree of sequence similarity (identical for the first 307 bp).

Contig A:

NODE_59443_length_307_cov_1.210317 GCCTGATGTATTTATAGATTTTCAATTACATATATCATAGCCTGCAATACATTCCTAATACTACTTTGAA ACTTAAATCACTGCCTTAGTTTTTTTCCTTATCTGTTTCAAGTCTGAGCACTTTTTAAAAAAGAAGAAAA AGAGAGAGAGAGAGAGAGAGACTAGGTAATGGTGTGTCCAGAAGAAGCTTGAATTGATGAAAAAATGAAA GTACACAGATATTAAGCAGAAAAATTATAAACTTATGTTTCTATAGTTGTAAGTAAGAATATGCCTTAAG TAAGAAACTGTATTGGGTAAGATGGTG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTAGAAGTATACAGACATTAAGCAGAAAAATTATAGACTTATGTTTCTATAGTTGTAAGTAAGAATATGCC TTAAGTAAGAAACTTAGTATTAAGTAAGAAACTTAAGTATTGGGTAAGATGGTGAAGGGTTTTTAGTGTA GTAATGTAATTGCACAGAGTAAGATCAGAGTTTAGAGGTTATAACAGAAGCAAGTGTGTGTATGAATATC

Contig B:

NODE_59443_length_307_cov_1.210317 GCCTGATGTATTTATAGATTTTCAATTACATATATCATAGCCTGCAATACATTCCTAATACTACTTTGAA ACTTAAATCACTGCCTTAGTTTTTTTCCTTATCTGTTTCAAGTCTGAGCACTTTTTAAAAAAGAAGAAAA AGAGAGAGAGAGAGAGAGAGACTAGGTAATGGTGTGTCCAGAAGAAGCTTGAATTGATGAAAAAATGAAA GTACACAGATATTAAGCAGAAAAATTATAAACTTATGTTTCTATAGTTGTAAGTAAGAATATGCCTTAAG TAAGAAACTGTATTGGGTAAGATGGTG ATGAATATCCAGAGCAGCCCATTGGA

Here are the blast results for each:

Contig A:

BLASTN 2.9.0+

Query: NODE_59443_length_307_cov_1.210317

Database: assembly_38

Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score

6 hits found

NODE_59443_length_307_cov_1.210317 NODE_59443_length_307_cov_1.210317 100.000 307 0 0 1 307 307 1 8.45e-161 568 NODE_59443_length_307_cov_1.210317 NODE_59443_length_307_cov_1.210317 92.708 96 5 1 390 485 99 6 2.82e-31 137 NODE_59443_length_307_cov_1.210317 NODE_210648_length_227_cov_1.162791 100.000 227 0 0 387 613 227 1 2.50e-116 420 NODE_59443_length_307_cov_1.210317 NODE_210648_length_227_cov_1.162791 92.708 96 5 1 209 302 224 129 2.82e-31 137 NODE_59443_length_307_cov_1.210317 NODE_224269_length_223_cov_1.196429 92.248 129 7 3 181 307 153 26 4.62e-44 180 NODE_59443_length_307_cov_1.210317 NODE_224269_length_223_cov_1.196429 93.182 88 4 1 390 477 124 39 1.70e-28 128

Contig B:

BLASTN 2.9.0+

Query: NODE_59443_length_307_cov_1.210317

Database: assembly_38

Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score

3 hits found

NODE_59443_length_307_cov_1.210317 NODE_59443_length_307_cov_1.210317 100.000 307 0 0 1 307 307 1 4.49e-161 568 NODE_59443_length_307_cov_1.210317 NODE_224269_length_223_cov_1.196429 92.308 130 7 3 181 308 153 25 6.84e-45 182 NODE_59443_length_307_cov_1.210317 NODE_210648_length_227_cov_1.162791 92.708 96 5 1 209 302 224 129 1.50e-31 137

It does appear each contig is matching more than one region of the reference genome, in particular the first contig, which mappes with 100 % identity to two distinct contigs. Given this information is it therefore to be expected that zanfona would retain the same name for these two distinct sequences?

Thanks again for your help.

Jack

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

jphruska commented 1 month ago

Hi Stacy --

Right. I was worried about potential downstream effects and did consider renaming the headers (given that the sequences were unique). Thank you for the one-liner, and I appreciate your feedback.

Best Jack