wheretrue / biobear

Work with bioinformatic files using Arrow, Polars, and/or DuckDB
https://www.wheretrue.dev/docs/exon/biobear/
MIT License
163 stars 9 forks

Support reading BED files with fewer than 12 columns. #144

Closed ghuls closed 5 months ago

ghuls commented 5 months ago

It would be nice if BED files with fewer than 12 columns could be read.

For example, BEDReadOptions could let you specify how many of the BED columns follow the spec. Additional columns could be read as String columns.

This is similar to UCSC bigBed: BED3 or -type=bedN[+[P]], where N is an integer between 3 and 12 and the optional +[P] parameter specifies the number of extra fields (not required, but preferred). See http://genome.ucsc.edu/goldenPath/help/bigBed.html
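
To make the bedN[+[P]] convention concrete, here is a minimal Python sketch of how such a type string could be split into its parts. It is not taken from biobear or the Kent tools: the names parse_bed_type and BedType are hypothetical, and the 3 to 12 range for N is simply the one stated above.

```python
import re
from typing import NamedTuple, Optional


class BedType(NamedTuple):
    """Parsed form of a bedN[+[P]] type string (hypothetical helper type)."""
    n_standard: int          # N: number of leading columns that follow the BED spec
    has_extra: bool          # True when a "+" suffix is present
    n_extra: Optional[int]   # P: number of extra columns, None when not given


# "bed", then N, then an optional "+" that may itself be followed by P.
_BED_TYPE = re.compile(r"bed(\d+)(\+(\d*))?")


def parse_bed_type(spec: str) -> BedType:
    """Parse a string such as "bed3", "bed6+" or "bed6+4"."""
    match = _BED_TYPE.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a bedN[+[P]] type string: {spec!r}")

    n_standard = int(match.group(1))
    # Range assumed from the description in this issue.
    if not 3 <= n_standard <= 12:
        raise ValueError(f"N must be between 3 and 12, got {n_standard}")

    has_extra = match.group(2) is not None
    extra_digits = match.group(3)  # None (no "+") or "" ("+" without P)
    n_extra = int(extra_digits) if extra_digits else None
    return BedType(n_standard, has_extra, n_extra)


print(parse_bed_type("bed3"))
print(parse_bed_type("bed6+4"))
```

With a parsed type like this, a reader would know that the first n_standard columns get the BED spec's types and any remaining columns can be treated as strings.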

tshauck commented 5 months ago

Thank you for raising this, let me have a look today and I'll follow up. At first blush it makes sense to me.

Also, FWIW, I was thinking about adding specific bigbed support along with indexed search (I see your link is bigbed related not just bed). Is that something you would/could use?

tshauck commented 5 months ago

@ghuls I think this should be doable now if you update biobear. BEDReadOptions now takes an n_fields param, e.g.:

In [5]: session.read_bed_file('./test-three.bed', options=bb.BEDReadOptions(n_fields=3)).to_polars()
Out[5]: 
shape: (10, 3)
┌─────────────────────────┬───────┬───────┐
│ reference_sequence_name ┆ start ┆ end   │
│ ---                     ┆ ---   ┆ ---   │
│ str                     ┆ i64   ┆ i64   │
╞═════════════════════════╪═══════╪═══════╡
│ chr1                    ┆ 11874 ┆ 12227 │
│ chr1                    ┆ 12613 ┆ 12721 │
│ chr1                    ┆ 13221 ┆ 14409 │
│ chr1                    ┆ 14362 ┆ 14829 │
│ chr1                    ┆ 14970 ┆ 15038 │
│ chr1                    ┆ 15796 ┆ 15947 │
│ chr1                    ┆ 16607 ┆ 16765 │
│ chr1                    ┆ 16858 ┆ 17055 │
│ chr1                    ┆ 17233 ┆ 17368 │
│ chr1                    ┆ 17606 ┆ 17742 │
└─────────────────────────┴───────┴───────┘

Technically, things shouldn't fail anymore if you don't specify the number of fields and the BED file has fewer than the full complement of fields; it just fills the additional cols with null.

In [7]: session.read_bed_file('./test-three.bed').to_polars()
Out[7]: 
shape: (10, 12)
┌─────────────────────────┬───────┬───────┬──────┬───┬───────┬─────────────┬─────────────┬──────────────┐
│ reference_sequence_name ┆ start ┆ end   ┆ name ┆ … ┆ color ┆ block_count ┆ block_sizes ┆ block_starts │
│ ---                     ┆ ---   ┆ ---   ┆ ---  ┆   ┆ ---   ┆ ---         ┆ ---         ┆ ---          │
│ str                     ┆ i64   ┆ i64   ┆ str  ┆   ┆ str   ┆ i64         ┆ str         ┆ str          │
╞═════════════════════════╪═══════╪═══════╪══════╪═══╪═══════╪═════════════╪═════════════╪══════════════╡
│ chr1                    ┆ 11874 ┆ 12227 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 12613 ┆ 12721 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 13221 ┆ 14409 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 14362 ┆ 14829 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 14970 ┆ 15038 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 15796 ┆ 15947 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 16607 ┆ 16765 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 16858 ┆ 17055 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 17233 ┆ 17368 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 17606 ┆ 17742 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
└─────────────────────────┴───────┴───────┴──────┴───┴───────┴─────────────┴─────────────┴──────────────┘
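
For anyone curious what the null-filling amounts to, here is a rough pure-Python sketch of the idea. It is not biobear's implementation: pad_bed_line is a hypothetical helper, the field names are the ones from the UCSC BED spec rather than biobear's column names, and values are left as strings with no type conversion.

```python
from typing import List, Optional

# Field names from the UCSC BED spec; biobear's own column names differ.
BED12_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
]


def pad_bed_line(line: str, width: int = 12) -> List[Optional[str]]:
    """Split one tab-separated BED line and pad missing fields with None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart, chromEnd")
    if len(fields) > width:
        raise ValueError(f"expected at most {width} fields, got {len(fields)}")
    # Columns that are absent from the file become None (null).
    return fields + [None] * (width - len(fields))


row = pad_bed_line("chr1\t11874\t12227\n")
record = dict(zip(BED12_FIELDS, row))
print(record["chrom"], record["chromStart"], record["chromEnd"], record["name"])
```

Passing n_fields explicitly avoids carrying the all-null columns around when you already know the file is BED3.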

I'm gonna close this task, but please reopen if it remains an issue. Thanks!

ghuls commented 5 months ago

> Thank you for raising this, let me have a look today and I'll follow up. At first blush it makes sense to me.
>
> Also, FWIW, I was thinking about adding specific bigbed support along with indexed search (I see your link is bigbed related not just bed). Is that something you would/could use?

I didn't find the bedN[+[P]] documentation for just the BED format on UCSC. So far we only use bigBed when we create UCSC sessions. The main problem with bigBed at the moment is that you can only create or manipulate the files with Kent tools, and not many other tools support the format (pyBigWig and pybigtools do). Support for it in biobear could change this.

tshauck commented 5 months ago

Cool, thanks for the context.