zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Error parsing vcf data from ncbi #302

Closed puli1027 closed 2 months ago

puli1027 commented 2 months ago

error: number too large to fit in target type

Data: Info("RS=2148352434;dbSNPBuildID=156;SSR=0;GENEINFO=GPR153:387509;VC=SNV;INT;GNO;FREQ=1000Genomes:0.9998,0.0001562")

I found that Value is defined as follows (https://docs.rs/noodles/0.82.0/noodles/vcf/variant/record_buf/info/field/value/enum.Value.html):

pub enum Value {
    Integer(i32),
    Float(f32),
    Flag,
    Character(char),
    String(String),
    Array(Array),
}

i32 is too small to hold 2148352434.

puli1027 commented 2 months ago

The data source is https://ftp.ncbi.nih.gov/snp/latest_release/VCF/

zaeleus commented 2 months ago

Thanks for the report and example!

This is likely an invalid INFO field value. While the range is undefined in VCF 4.2, VCF 4.3 clarifies that integers are 32-bit signed integers (§ 1.3 "Data types" (2022-11-27)):

Data types supported by VCF are: Integer (32-bit, signed)...

See also § 7.2 "Changes between VCFv4.2 and VCFv4.3" (2022-11-27):

In order for VCF and BCF to have the same expressive power, we state explicitly that Integers and Floats are 32-bit numbers. Integers are signed.
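
As a quick check of the range, the following standard-library-only sketch shows that the reported RS value does not fit in a 32-bit signed integer but does fit in a 64-bit one. It does not use noodles; `fits_in_i32` and `VCF_INT_MAX` are hypothetical names introduced only for illustration.

```rust
/// Upper bound of a 32-bit signed integer, widened to i64 for arithmetic.
/// (Illustrative constant; not part of noodles.)
const VCF_INT_MAX: i64 = i32::MAX as i64;

/// Hypothetical helper: returns true if `s` parses as a 32-bit signed integer.
fn fits_in_i32(s: &str) -> bool {
    s.parse::<i32>().is_ok()
}

fn main() {
    // The RS value from the report.
    let rs = "2148352434";

    // Parsing as i32 fails with a positive-overflow error.
    match rs.parse::<i32>() {
        Ok(n) => println!("i32: {n}"),
        Err(e) => println!("i32 error: {e}"),
    }

    // Parsing as i64 succeeds.
    let n: i64 = rs.parse().expect("value fits in i64");
    println!("i64: {n}, above i32::MAX by {}", n - VCF_INT_MAX);
    println!("fits in i32: {}", fits_in_i32(rs));
}
```

The i32 parse error here carries the same message as the one in the original report.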

By default, htslib can't read this value and silently discards the data:

$ bcftools --version
bcftools 1.21
Using htslib 1.21
$ bcftools view --no-header 302.vcf
[W::vcf_parse_info] Extreme INFO/RS value encountered and set to missing at sq0:1
sq0 1   .   A   .   .   .   RS=.

In noodles, I recommend redefining the RS type as a string. If it needs to be used, parse it manually as a larger integer type, e.g.,

use noodles_vcf::{
    self as vcf, header::record::value::map::info::Type, variant::record_buf::info::field::Value,
};

const DATA: &[u8] = br#"##fileformat=VCFv4.2
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
sq0	1	.	A	.	.	.	RS=2148352434
"#;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = vcf::io::Reader::new(DATA);
    let mut header = reader.read_header()?;

    if let Some(rs) = header.infos_mut().get_mut("RS") {
        *rs.type_mut() = Type::String;
    }

    for result in reader.record_bufs(&header) {
        let record = result?;
        let info = record.info();

        if let Some(Some(Value::String(value))) = info.get("RS") {
            dbg!(value.parse::<i64>())?;
        }
    }

    Ok(())
}
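
When only the raw INFO text is at hand (as in the Data line of the report), the same "parse it manually as a larger integer type" idea can be sketched without any VCF library. This is a minimal illustration under that assumption, not the noodles API; `parse_info_i64` is a hypothetical helper, and it does not handle percent-encoded values or validate against the header.

```rust
/// Hypothetical helper (not part of noodles): looks up `key` in a raw,
/// semicolon-delimited INFO string and parses its value as an i64.
/// Returns None if the key is absent, is a flag, or is not an integer.
fn parse_info_i64(info: &str, key: &str) -> Option<i64> {
    info.split(';')
        // Flags such as "INT" have no '=' and are skipped here.
        .filter_map(|field| field.split_once('='))
        .find(|(k, _)| *k == key)
        .and_then(|(_, v)| v.parse().ok())
}

fn main() {
    // INFO string taken from the report above.
    let info = "RS=2148352434;dbSNPBuildID=156;SSR=0;GENEINFO=GPR153:387509;\
                VC=SNV;INT;GNO;FREQ=1000Genomes:0.9998,0.0001562";

    match parse_info_i64(info, "RS") {
        Some(rs) => println!("RS = {rs}"),
        None => println!("RS missing or not an integer"),
    }
}
```

Using i64 here mirrors the example above; u32 would also hold this particular value, but that choice is a judgment call rather than something the VCF specification prescribes.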

puli1027 commented 2 months ago

@zaeleus I understand now. Thank you for your reply.