zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

BED name column limit #271

Closed (ghuls closed this 5 months ago)

ghuls commented 5 months ago

Would it be possible to optionally disable the BED name column limit of 255 bytes?

$ rg MAX_LENGTH -C 5 noodles-bed/src/record/name.rs
1-//! BED record name.
2-
3-use std::{error, fmt, ops::Deref, str::FromStr};
4-
5:const MAX_LENGTH: usize = 255;
6-
7-/// A BED record name.
8-#[derive(Clone, Debug, Eq, PartialEq)]
9-pub struct Name(String);
10-
--
59-fn is_valid_name_char(c: char) -> bool {
60-    matches!(c, ' '..='~')
61-}
62-
63-fn is_valid_name(s: &str) -> bool {
64:    s.len() <= MAX_LENGTH && s.chars().all(is_valid_name_char)
65-}
66-
67-#[cfg(test)]
68-mod tests {
69-    use super::*;
--
80-        assert_eq!(" ~".parse(), Ok(Name(String::from(" ~"))));
81-
82-        assert_eq!("".parse::<Name>(), Err(ParseError::Empty));
83-        assert_eq!("🍜".parse::<Name>(), Err(ParseError::Invalid));
84-
85:        let s = "n".repeat(MAX_LENGTH + 1);
86-        assert_eq!(s.parse::<Name>(), Err(ParseError::Invalid));
87-    }
88-}

When trying to load a BED file with long names (concatenated peak names before merging peaks), I hit this limit in biobear: https://github.com/wheretrue/biobear/issues/145

zaeleus commented 5 months ago

This validation comes from The Browser Extensible Data (BED) format § 1.5 "BED fields" (2022-01-05), which requires names/descriptions (column 4) to be <= 255 characters in length:

Col | BED Field | Type   | Regex or range     | Brief description
4   | name      | String | [\x20-\x7e]{1,255} | Feature description

I plan to split the format and buffer records for BED, but until then, I'll remove name validation on parse and apply it at serialization.
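
As a rough illustration of moving the check from parse time to serialization time, here is a minimal sketch. It is not the actual noodles change: `parse_lenient`, `is_spec_compliant`, and `write_name` are hypothetical names, and rejecting only empty input on parse is an assumption. Only `MAX_LENGTH` and the `' '..='~'` character range come from the code quoted above.

```rust
use std::io::{self, Write};

/// Maximum name length, as in the quoted `name.rs` and the BED spec (column 4).
const MAX_LENGTH: usize = 255;

/// A BED record name that is NOT validated against the spec when parsed.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct Name(String);

impl Name {
    /// Hypothetical lenient parse: only empty input is rejected (an assumption),
    /// so long names such as concatenated peak names are accepted.
    pub fn parse_lenient(s: &str) -> Option<Self> {
        if s.is_empty() {
            None
        } else {
            Some(Self(s.to_string()))
        }
    }

    /// Strict check mirroring the spec pattern `[\x20-\x7e]{1,255}`.
    pub fn is_spec_compliant(&self) -> bool {
        let s = &self.0;
        !s.is_empty() && s.len() <= MAX_LENGTH && s.chars().all(|c| matches!(c, ' '..='~'))
    }
}

/// Hypothetical writer: validation happens here, at serialization,
/// so nothing is written for a name that violates the spec.
pub fn write_name<W: Write>(writer: &mut W, name: &Name) -> io::Result<()> {
    if !name.is_spec_compliant() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "BED name violates [\\x20-\\x7e]{1,255}",
        ));
    }

    writer.write_all(name.0.as_bytes())
}
```

With this split, a 256-byte name can be read and held in memory, and the error only surfaces if that name is written back out as BED.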