zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Support fastq comments/attributes/metadata #287

Closed d-cameron closed 1 month ago

d-cameron commented 1 month ago

Currently the noodles fastq API only supports returning the full Name/header/Sequence identifier line for each record. It would be very useful if the noodles API supported extracting the read name from the header line. That is, the ASCII from immediately after the @ till the first whitespace.

Comments/metadata/additional information after the read name are extremely common. Examples include:

Illumina basespace https://help.basespace.illumina.com/files-used-by-basespace/fastq-files

SRA

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

SPAdes (error correction)

@chr1_MATERNAL-28-497631.498403.F BH:changed:1

BFC (uses a TAB instead of SPACE)

@HISEQ1:22:H9UJNADXX:1:2115:3225:19584	ec:Z:0_72:1_0_0:0_0

zaeleus commented 1 month ago

noodles-fastq does already support definitions with descriptions separated by a space ( ), e.g.,

https://github.com/zaeleus/noodles/blob/c3f569749fef05ba06f98dcef708e5bb5d02a387/noodles-fastq/src/io/reader/record/definition.rs#L98-L103

> That is, the ASCII from immediately after the @ till the first whitespace.

I updated the reader to split on either a space or horizontal tab (\t) in fcc9767878e308cb76b95814ad7600f494095285. Is this sufficient, or do you expect the separator to be any locale-specific whitespace (described by isspace(3))?
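For readers following along, the splitting rule discussed here (name runs from just after the `@` to the first space or horizontal tab, and the remainder is the description) can be sketched with the standard library alone. This is a minimal illustration, not the noodles-fastq implementation; the function name `split_definition` is hypothetical and not part of the noodles API.

```rust
/// Splits a FASTQ definition line into `(name, description)`.
///
/// Hypothetical helper for illustration only; this is NOT the
/// noodles-fastq implementation or API.
fn split_definition(line: &str) -> Option<(&str, &str)> {
    // A definition line must start with '@'; otherwise return None.
    let rest = line.strip_prefix('@')?;

    // Split at the first space or horizontal tab, whichever comes first.
    match rest.find(|c: char| c == ' ' || c == '\t') {
        // Both separators are one byte wide, so `i + 1` skips exactly the separator.
        Some(i) => Some((&rest[..i], &rest[i + 1..])),
        // No separator: the whole line is the name and the description is empty.
        None => Some((rest, "")),
    }
}
```

Under this sketch, the SRA example from the issue yields the name `SRR001666.1`, and the tab-separated BFC example yields the name `HISEQ1:22:H9UJNADXX:1:2115:3225:19584`. Only the first separator is consumed, so any later spaces stay in the description.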

The writer now supports custom definition separators as well (08f0a3fc0d65b5b5f2005028a1de7db132787822).
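To illustrate what a configurable definition separator means on the output side, here is a standard-library-only sketch. The function `write_definition` and its parameters are hypothetical names chosen for this example; the actual noodles-fastq writer API is defined in the commit referenced above and may differ.

```rust
use std::io::{self, Write};

/// Writes a FASTQ definition line as `@name<separator>description\n`.
///
/// Hypothetical helper for illustration only; this is NOT the
/// noodles-fastq writer API.
fn write_definition<W: Write>(
    writer: &mut W,
    name: &[u8],
    description: &[u8],
    separator: u8,
) -> io::Result<()> {
    writer.write_all(b"@")?;
    writer.write_all(name)?;

    // Emit the separator only when there is a description to separate.
    if !description.is_empty() {
        writer.write_all(&[separator])?;
        writer.write_all(description)?;
    }

    writer.write_all(b"\n")
}
```

Passing `b'\t'` as the separator reproduces the BFC-style header, and `b' '` reproduces the SRA and SPAdes style.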

d-cameron commented 1 month ago

> I updated the reader to split on either a space or horizontal tab (\t) in fcc9767. Is this sufficient, or do you expect the separator to be any locale-specific whitespace (described by isspace(3))?

That is sufficient. I haven't encountered any separator other than tab or space.

zaeleus commented 1 month ago

This behavior is now available in noodles 0.79.0 / noodles-fastq 0.14.0.