zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Unable to parse BED files which have non-integer score values #175

Closed J-Wall closed 1 year ago

J-Wall commented 1 year ago

I have BED files with non-integer values in the score field. noodles tries to parse these into u16 and fails with Error: Custom { kind: InvalidData, error: InvalidScore(Parse(ParseIntError { kind: InvalidDigit })) }.
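The `InvalidDigit` kind in that message can be reproduced with the standard library alone. A minimal sketch, assuming the score is parsed with `str::parse::<u16>()` (inferred from the error text, not confirmed from the noodles source); `parse_integer_score` is a hypothetical helper, not a noodles function:

```rust
use std::num::{IntErrorKind, ParseIntError};

/// Hypothetical helper (not part of noodles): parses a score field the
/// way an integer-only parser would.
fn parse_integer_score(s: &str) -> Result<u16, ParseIntError> {
    s.parse::<u16>()
}

fn main() {
    // A decimal score such as "21.0" is rejected because '.' is not a digit.
    let err = parse_integer_score("21.0").unwrap_err();
    assert_eq!(err.kind(), &IntErrorKind::InvalidDigit);

    // An integer score parses without error.
    assert_eq!(parse_integer_score("21"), Ok(21));
}
```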

I am wondering whether it would be better to use a floating point representation of the scores. As far as I know there is no formal spec for BAM files, and the closest thing to one doesn't explicitly say they should be integers.

brainstorm commented 1 year ago

as far as I know there is no formal spec for BAM [sic] files (...)

Yes, there has been a fairly official BED spec since 2022: https://github.com/samtools/hts-specs/blob/master/BEDv1.pdf. Section 1.5 specifies that the score field should be an integer between 0 and 1000:

[Screenshot of section 1.5 of the BED specification]

J-Wall commented 1 year ago

OK, thanks. Given that most software in existence was written before 2022, I'm not sure there is strong justification for strictly adhering to it, but it's not my library... This overall issue is related to #111

zaeleus commented 1 year ago

noodles-bed follows hts-specs' description, which, as above, defines the score values to be integers. I recommend opening an issue in samtools/hts-specs if the value type is incorrect or fails to cover common usages.

One workaround is to parse the data as BED4+ records instead of BED5+ and above. This will parse the first set of standard fields and put the custom fields in Record::<4>::optional_fields, e.g.,

use noodles_bed as bed;

fn main() -> Result<(), bed::record::ParseError> {
    const DATA: &str = "sq0\t8\t13\t.\t21.0";

    assert!(DATA.parse::<bed::Record<5>>().is_err());

    let record: bed::Record<4> = DATA.parse()?;
    dbg!(record.reference_sequence_name()); // => "sq0"
    dbg!(record.start_position()); // => 9
    dbg!(record.end_position()); // => 13
    dbg!(record.name()); // => None
    dbg!(record.optional_fields().get(0)); // => Some("21.0")

    Ok(())
}
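
With this workaround the score stays an unparsed string, so converting it to a number is left to the caller. A minimal sketch of that last step using only the standard library: it reads the fifth tab-separated column straight from the line instead of going through noodles, and `float_score` is a hypothetical helper, not a noodles function. The same `str::parse::<f64>()` call should apply to the string taken from the optional fields.

```rust
/// Hypothetical helper (not part of noodles): returns the fifth
/// tab-separated column of a BED line parsed as a float, if present.
fn float_score(line: &str) -> Option<f64> {
    line.split('\t').nth(4)?.trim().parse().ok()
}

fn main() {
    assert_eq!(float_score("sq0\t8\t13\t.\t21.0"), Some(21.0));

    // Lines without a fifth column, or with a non-numeric one, yield None.
    assert_eq!(float_score("sq0\t8\t13"), None);
    assert_eq!(float_score("sq0\t8\t13\t.\tn/a"), None);
}
```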

J-Wall commented 1 year ago

Thanks heaps, that's very helpful
