Closed ghuls closed 5 months ago
Thank you for raising this, let me have a look today and I'll follow up. At first blush it makes sense to me.
Also, FWIW, I was thinking about adding specific bigBed support along with indexed search (I see your link is about bigBed, not just BED). Is that something you would/could use?
@ghuls I think this should be doable now if you update biobear. BEDReadOptions now takes an n_fields param, e.g.:
In [5]: session.read_bed_file('./test-three.bed', options=bb.BEDReadOptions(n_fields=3)).to_polars()
Out[5]:
shape: (10, 3)
┌─────────────────────────┬───────┬───────┐
│ reference_sequence_name ┆ start ┆ end   │
│ ---                     ┆ ---   ┆ ---   │
│ str                     ┆ i64   ┆ i64   │
╞═════════════════════════╪═══════╪═══════╡
│ chr1                    ┆ 11874 ┆ 12227 │
│ chr1                    ┆ 12613 ┆ 12721 │
│ chr1                    ┆ 13221 ┆ 14409 │
│ chr1                    ┆ 14362 ┆ 14829 │
│ chr1                    ┆ 14970 ┆ 15038 │
│ chr1                    ┆ 15796 ┆ 15947 │
│ chr1                    ┆ 16607 ┆ 16765 │
│ chr1                    ┆ 16858 ┆ 17055 │
│ chr1                    ┆ 17233 ┆ 17368 │
│ chr1                    ┆ 17606 ┆ 17742 │
└─────────────────────────┴───────┴───────┘
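If you are not sure how many columns a file has, one option is to count the fields on the first data line before choosing n_fields. This is a minimal sketch using only the standard library; count_bed_fields is a hypothetical helper, not part of biobear, and it assumes the file is tab-delimited.

```python
import tempfile
from pathlib import Path


def count_bed_fields(path):
    """Return the number of tab-separated fields on the first data line.

    Hypothetical helper (not a biobear function); assumes tab-delimited BED.
    """
    with open(path) as handle:
        for line in handle:
            stripped = line.rstrip("\n")
            # Skip blank lines and the header-style lines a BED file may carry.
            if not stripped or stripped.startswith(("#", "track", "browser")):
                continue
            return len(stripped.split("\t"))
    raise ValueError(f"no data lines found in {path}")


# Small demonstration on a throwaway three-column file.
demo = Path(tempfile.mkdtemp()) / "demo.bed"
demo.write_text("chr1\t11874\t12227\nchr1\t12613\t12721\n")
n_fields = count_bed_fields(demo)
print(n_fields)
```

The resulting value could then be passed along as bb.BEDReadOptions(n_fields=n_fields), as in the call above.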
Technically, things shouldn't fail anymore if you don't specify the number of fields and the BED file has fewer than the full complement of fields; it just fills the additional columns with null.
In [7]: session.read_bed_file('./test-three.bed').to_polars()
Out[7]:
shape: (10, 12)
┌─────────────────────────┬───────┬───────┬──────┬───┬───────┬─────────────┬─────────────┬──────────────┐
│ reference_sequence_name ┆ start ┆ end   ┆ name ┆ … ┆ color ┆ block_count ┆ block_sizes ┆ block_starts │
│ ---                     ┆ ---   ┆ ---   ┆ ---  ┆   ┆ ---   ┆ ---         ┆ ---         ┆ ---          │
│ str                     ┆ i64   ┆ i64   ┆ str  ┆   ┆ str   ┆ i64         ┆ str         ┆ str          │
╞═════════════════════════╪═══════╪═══════╪══════╪═══╪═══════╪═════════════╪═════════════╪══════════════╡
│ chr1                    ┆ 11874 ┆ 12227 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 12613 ┆ 12721 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 13221 ┆ 14409 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 14362 ┆ 14829 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 14970 ┆ 15038 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 15796 ┆ 15947 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 16607 ┆ 16765 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 16858 ┆ 17055 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 17233 ┆ 17368 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 17606 ┆ 17742 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
└─────────────────────────┴───────┴───────┴──────┴───┴───────┴─────────────┴─────────────┴──────────────┘
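The null-filling idea can be pictured in plain Python. This is only an illustration of the behaviour shown above, not biobear's implementation: pad_record is a hypothetical helper, and the column names are the UCSC BED field names rather than the ones biobear uses.

```python
# UCSC names for the twelve standard BED fields (an assumption for this
# sketch; biobear's own column names differ, as the output above shows).
BED12_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
]


def pad_record(line):
    """Split one tab-delimited BED line and pad missing fields with None."""
    values = line.rstrip("\n").split("\t")
    if len(values) > len(BED12_FIELDS):
        raise ValueError("more than 12 fields; not a plain BED12 record")
    # Fill whatever is missing with None, the analogue of null above.
    padded = values + [None] * (len(BED12_FIELDS) - len(values))
    return dict(zip(BED12_FIELDS, padded))


record = pad_record("chr1\t11874\t12227")
print(record)
```

Values stay as strings here; a real reader would also cast the numeric fields.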
I'm gonna close this task, but please reopen if it remains an issue. Thanks!
I didn't find the bedN[+[P]] documentation for just the BED format on UCSC. So far we only use bigBed when we create UCSC sessions. The main problem with bigBed at the moment is that you can only create or manipulate the files with the Kent tools, and few other tools support the format (pyBigWig and pybigtools do). Support for it in biobear could change this.
Cool, thanks for the context.
It would be nice if BED files with fewer than 12 columns could be read.
For example, BEDReadOptions could let you specify how many of the BED columns follow the spec. Additional columns could be read as String columns.
Similarly to UCSC bigBed:
BED3 or -type=bedN[+[P]], where N is an integer between 3 and 12 and the optional +[P] parameter specifies the number of extra fields, not required, but preferred
http://genome.ucsc.edu/goldenPath/help/bigBed.html
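To make the bedN[+[P]] notation concrete, here is a rough sketch of how such a type string could be parsed and used to separate the spec columns from the extra ones. Both parse_bed_type and split_record are hypothetical names for illustration, not biobear or UCSC APIs, and the sketch assumes tab-delimited input.

```python
import re

# "bed", then N, then an optional "+" with an optional count P.
_BED_TYPE = re.compile(r"bed(\d+)(?:\+(\d*))?")


def parse_bed_type(spec):
    """Parse 'bedN[+[P]]' into (n_standard, n_extra).

    n_extra is 0 when there is no '+', None when '+' carries no count,
    and P otherwise.
    """
    match = _BED_TYPE.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a bedN[+[P]] type: {spec!r}")
    n_standard = int(match.group(1))
    if not 3 <= n_standard <= 12:
        raise ValueError("N must be between 3 and 12")
    extra = match.group(2)
    if extra is None:
        n_extra = 0
    elif extra == "":
        n_extra = None
    else:
        n_extra = int(extra)
    return n_standard, n_extra


def split_record(line, n_standard):
    """Return (spec columns, extra columns kept as strings)."""
    values = line.rstrip("\n").split("\t")
    return values[:n_standard], values[n_standard:]


print(parse_bed_type("bed6+3"))
print(split_record("chr1\t11874\t12227\tgeneA\textra", 3))
```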