wwood / finishm

genome improvement and finishing without further sequencing effort
MIT License

Scaffolding can contain excessive numbers of ambiguous bases #27

Closed donovan-h-parks closed 9 years ago

donovan-h-parks commented 9 years ago

Ran finishm roundup with default settings. This resulted in 15 of 126 scaffolds being joined together. Excellent! However, this also increased the number of contigs (as determined by breaking scaffolds at 10 successive N's) from 202 to 253 and the number of ambiguous bases from 851 to 1534. This seems like a lot of additional ambiguous bases (many of which extend for at least 10 base pairs).

Two thoughts: 1) Is scaffolding where there are so many ambiguous bases reliable? 2) If the scaffolding is to be trusted, perhaps it is best to put a longer stretch of N's instead of having so many short sections between ambiguous sections. This is sort of silly I know, but people don't like the number of contigs in their draft genomes to jump up. Just my $0.02.

I'd be very curious to hear your thoughts on the reliability of the scaffolding.
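For anyone wanting to reproduce the two numbers quoted above, here is a minimal sketch of the counting as described: scaffolds are broken into contigs at runs of 10 or more N's, and ambiguous bases are tallied separately. It is not the script that produced the figures in this thread. The function names, the `min_gap` parameter, and the choice to treat every non-ACGT character as ambiguous are assumptions made for illustration.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def scaffold_stats(sequences, min_gap=10):
    """Return (contig_count, ambiguous_base_count) for scaffold sequences.

    A scaffold is split wherever min_gap or more N's occur in a row.
    Assumption: any character other than A, C, G or T counts as ambiguous.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs = 0
    ambiguous = 0
    for seq in sequences:
        # Empty pieces appear when a scaffold starts or ends with a gap.
        contigs += sum(1 for piece in gap.split(seq) if piece)
        ambiguous += sum(1 for base in seq if base not in "ACGTacgt")
    return contigs, ambiguous


# One scaffold with a 10-N gap (splits it) and a 3-N run (does not).
example = ["ACGT" + "N" * 10 + "ACGT" + "NNN" + "AC"]
print(scaffold_stats(example))  # (2, 13)
```

To run it on a file, pass the sequences from `read_fasta(open(path))` to `scaffold_stats`. Note that short N runs below `min_gap` raise the ambiguous-base count without raising the contig count, so the two statistics can move independently.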

wwood commented 9 years ago

Hmm, good question. How high is the coverage of the genome? Higher coverage genomes in general can be trusted more, I theorise, but better validation is required before saying anything with certainty.

Re the number of Ns, both the scaffolding and gap filling parts of the algorithm can make extra stretches of Ns. I'm quite a fan of this so not keen to give it up so easily, because the genome is arguably more 'complete' regardless of those statistics about stretches of Ns. I'd like to have a look at the input and output data, but I suspect maybe the stretches of Ns are clustered in particular places in the genome, and maybe I should be reporting a modified statistic that somehow takes this into account to allay user concerns / do more explaining.
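One way to check whether the stretches of Ns really are clustered is to list where each run sits and merge runs that lie close together. The sketch below is purely illustrative and is not part of FinishM; the function names and the `max_distance` threshold of 100 bases are hypothetical choices.

```python
import re


def n_runs(seq):
    """Return (start, length) for every run of N's in seq (0-based start)."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[Nn]+", seq)]


def cluster_runs(runs, max_distance=100):
    """Merge N runs separated by at most max_distance bases.

    Returns (cluster_start, cluster_end, run_count) tuples, where
    cluster_end is exclusive. max_distance is an arbitrary threshold.
    """
    clusters = []
    for start, length in runs:
        if clusters and start - clusters[-1][1] <= max_distance:
            clusters[-1][1] = start + length  # extend the current cluster
            clusters[-1][2] += 1
        else:
            clusters.append([start, start + length, 1])
    return [tuple(c) for c in clusters]


# Two nearby N runs followed by a distant one.
seq = "ACGT" + "NN" + "AC" + "NNN" + "A" * 200 + "N"
print(cluster_runs(n_runs(seq)))  # [(4, 11, 2), (211, 212, 1)]
```

Counting clusters rather than individual runs would be one candidate for a modified statistic, since several short N stretches in the same hard region would then count once.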

Thanks for the tip

On 12 March 2015 at 02:45, Donovan Parks notifications@github.com wrote:


Ben Woodcroft http://ecogenomic.org/users/ben-woodcroft http://www.ecogenomic.org/

donovan-h-parks commented 9 years ago

The genome is around 30x. I agree with your statement regarding Ns. Perhaps this could be handled with an upfront explanation (bubbles can occur, which result in ambiguous bases but still resolvable scaffolding) and a modified statistic (# contigs excluding those introduced by the scaffolding, which are known 'hard' areas of the genome).

donovan-h-parks commented 9 years ago

My bad! Seems my counting is buggy. While the number of N's has increased (no big deal and almost certainly more accurate), the scaffolding looks great (i.e., it is a single stretch of N's between the two contigs). So yeah! I'm only playing with two bins, but one went from 126 scaffolds down to 111 and the other didn't have any scaffolding performed, but some gapfilling was still done.

I'm very satisfied with my FinishM experience. What do I cite?