wheretrue / biobear

Work with bioinformatic files using Arrow, Polars, and/or DuckDB
https://www.wheretrue.dev/docs/exon/biobear/
MIT License

name column in BED file is limited to 255 bytes. #145

Closed ghuls closed 5 months ago

ghuls commented 5 months ago
In [80]: %time  b = session.read_bed_file("name_256bytes.one.bed")
CPU times: user 87 µs, sys: 13 µs, total: 100 µs
Wall time: 103 µs

In [81]: %time c = b.to_polars()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
File <timed exec>:1

File ~/software/anaconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/table.pxi:4755, in pyarrow.lib.Table.from_batches()

File ~/software/anaconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/ipc.pxi:666, in pyarrow.lib.RecordBatchReader.__next__()

File ~/software/anaconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/ipc.pxi:700, in pyarrow.lib.RecordBatchReader.read_next_batch()

File ~/software/anaconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: C Data interface error: External error: Arrow error: External error: Io error: invalid record: invalid name

Reading a BED file whose name column is longer than 255 bytes gives the error above. An example file:

$ cat name_256bytes.one.bed
chr1    1000160 1000660 PURK_peak_11,INH_SST_peak_18b,INH_SST_peak_18c,GC_peak_40a,GC_peak_40b,GC_peak_40c,OPC_peak_30b,OPC_peak_30c,MOL_peak_32c,INH_VIP_peak_18a,NFOL_peak_32e,NFOL_peak_32f,AST_CER_peak_48e,AST_CER_peak_48f,ENDO_peak_7,AST_peak_17c,GP_peak_20,ASTP_peak_19e,INH_V    500 .
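
For anyone who wants to reproduce this without the original file, here is a minimal sketch that writes a one-record BED file with a name of a chosen byte length and scans a BED file for names over the 255-byte limit. The helper names (`write_long_name_bed`, `find_long_names`) are hypothetical and not part of biobear or noodles, and the sketch assumes a tab-separated file with the name in the fourth column.

```python
from pathlib import Path

NAME_LIMIT = 255  # byte limit reported in this issue


def write_long_name_bed(path, name_bytes=256):
    """Write a one-record BED file whose name column is `name_bytes` bytes long."""
    # ASCII only, so one character is one byte.
    name = ("peak_" * name_bytes)[:name_bytes]
    fields = ["chr1", "1000160", "1000660", name, "500", "."]
    Path(path).write_text("\t".join(fields) + "\n")
    return name


def find_long_names(path, limit=NAME_LIMIT):
    """Return (line_number, byte_length) for each record whose name exceeds `limit`."""
    hits = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            # Skip header-style lines and records without a name column.
            if line.startswith(("#", "track", "browser")):
                continue
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 4:
                continue
            size = len(columns[3].encode("utf-8"))
            if size > limit:
                hits.append((line_number, size))
    return hits


if __name__ == "__main__":
    write_long_name_bed("name_256bytes.one.bed")
    print(find_long_names("name_256bytes.one.bed"))
    # The generated file can then be passed to session.read_bed_file(...)
    # followed by .to_polars(), as in the traceback above.
```

Running the scan over an existing BED file before loading it shows which records would hit the validation in affected versions.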

tshauck commented 5 months ago

Thanks for making this issue. I'll keep an eye on the ticket in noodles you filed, and make downstream updates as needed.

ghuls commented 5 months ago

The BED name length validation on reading has been removed in noodles.

tshauck commented 5 months ago

Thanks -- @zaeleus would you be willing to do a release at some point soon so I can pull in the update?

zaeleus commented 5 months ago

The change is now available in noodles 0.77.0/noodles-bed 0.15.0.

tshauck commented 5 months ago

Thanks! -- https://github.com/wheretrue/exon is being released now with the update, and once it's done I'll do a biobear release. Will follow up here when it's on pypi.

tshauck commented 5 months ago

@ghuls This update is in biobear 0.22.1 on pypi. I'm going to close the ticket, but please reopen if you still have issues. Thanks!
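
To check whether an environment already has the release mentioned above, a small sketch along these lines may help. It only compares the installed biobear version against 0.22.1 using the standard library; the function names are hypothetical, and the version parsing is a simplification that treats a pre-release such as `0.22.1rc1` the same as `0.22.1`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (0, 22, 1)  # biobear release named in this thread


def parse_release(text):
    """Turn a version string such as '0.22.1' into the tuple (0, 22, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(group) for group in match.groups())


def has_name_length_fix(installed):
    """Return True if `installed` is at least the release named in this thread."""
    return parse_release(installed) >= FIXED_IN


def installed_biobear_has_fix():
    """Check the biobear installed in this environment; None if it is not installed."""
    try:
        return has_name_length_fix(version("biobear"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_biobear_has_fix())
```

If this reports `False`, upgrading biobear from PyPI should pull in the noodles change described earlier in the thread.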