singlem incorrectly parses sample name with paired-end data

wwood / singlem

Novelty-inclusive microbial community profiling of shotgun metagenomes

http://wwood.github.io/singlem/

GNU General Public License v3.0


singlem incorrectly parses sample name with paired-end data #151

Open fplazaonate opened 10 months ago

fplazaonate commented 10 months ago

Hi @wwood ,

Many thanks for developing singlem. This is a great tool that deserves more attention.

It seems singlem incorrectly parses the sample name with paired-end data, as it only removes the file extension: https://github.com/wwood/singlem/blob/4a6803db95ddb2424a79e7bb52457f08335c30dc/singlem/singlem.py#L35

Could you fix this?

Best, Florian

wwood commented 10 months ago

Hi,

Thanks for the kind words.

Can you be a bit more specific? Do you mean it doesn't remove, e.g., the .1 or _1 bit?

fplazaonate commented 10 months ago

Yes, that's it. In the output file, the sample name is 'sample_1' instead of 'sample'

wwood commented 10 months ago

Ah right. I made the decision not to wade into parsing the different possibilities there. Is there some general solution?

fplazaonate commented 10 months ago

You could add an option where the user explicitly provides the sample name. The alternative is to find a shared substring between the forward and reverse file names. EDIT: the first option is probably the best, as the user may provide several fastq files from different sequencing runs.
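
To make the two options concrete, here is a minimal Python sketch. It is not singlem's code: the function names, the `explicit_name` parameter, and the list of extensions are all hypothetical. An explicitly provided name always wins; otherwise the name is the common prefix of the forward and reverse file names, with any dangling separator (and an optional `R`) trimmed off.

```python
import os
import re

# Hypothetical list of read-file extensions to strip; not taken from singlem.
READ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def strip_extension(path):
    """Return the file's base name without a known read-file extension."""
    name = os.path.basename(path)
    for ext in READ_EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


def infer_sample_name(forward, reverse=None, explicit_name=None):
    """Pick a sample name for a single-end file or a paired-end file pair."""
    # Option 1: a user-supplied name takes precedence over any guessing.
    if explicit_name:
        return explicit_name

    fwd = strip_extension(forward)
    if reverse is None:
        # Single-end input: only the extension is removed.
        return fwd

    # Option 2: keep what the forward and reverse names share from the start.
    shared = os.path.commonprefix([fwd, strip_extension(reverse)])
    # Trim a dangling separator such as "_" or "_R" left before the read number.
    shared = re.sub(r"[._-]+[Rr]?$", "", shared)
    # Fall back to the forward name if the two names share nothing usable.
    return shared or fwd


print(infer_sample_name("sample_1.fastq.gz", "sample_2.fastq.gz"))
print(infer_sample_name("run1_R1.fq", "run1_R2.fq", explicit_name="gut"))
```

The prefix heuristic only covers naming schemes where the read number comes last, and it cannot tell that files from different sequencing runs belong to the same sample, which is why the explicit name remains the more robust of the two options.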

adityabandla commented 7 months ago

Looking for the same feature as I have samples sequenced across multiple runs