wheretrue / exon

Exon is an OLAP query engine specifically for biology and life science applications.
https://www.wheretrue.dev/docs/exon/

feat: support bed files with commented lines #539

Closed ghuls closed 2 months ago

ghuls commented 2 months ago

For example, fragments files created by CellRangerATAC/CellRangerARC have 52 comment lines at the start of the file:

$ zcat "fragments.tsv.gz" | tail -n +50 | head
# primary_contig=GL456216.1
# primary_contig=JH584292.1
# primary_contig=JH584295.1
chr1    3000052 3000361 GGACATAAGGGCCACT-1  1
chr1    3000092 3000324 CGCGATTCAGGAACAT-1  1
chr1    3000132 3000184 ATATGGTGTTAGACCA-1  1
chr1    3000134 3000251 ACCTTCATCTTTGAGA-1  1
chr1    3000142 3000174 GACCGTTCAGCAAATA-1  1
chr1    3000227 3000395 TGTTATGAGAGAGGAG-1  1
chr1    3000281 3000485 TGTTACTTCCTTAGGG-1  1
In [1]: import biobear as bb

In [2]: session = bb.new_session()

In [3]: fragments_with_comments = session.read_bed_file("fragments.tsv.gz", bb.BEDReadOptions(file_compression_type=bb.FileCompressionType("gzip"), n_fields=5, file_extension="tsv.gz"))

In [4]: %time fragments_with_comments_df = fragments_with_comments.to_polars()
CPU times: user 58.4 s, sys: 6.43 s, total: 1min 4s
Wall time: 60 s

In [6]: fragments_with_comments_df
Out[6]: 
shape: (102_753_538, 5)
┌─────────────────────────┬─────────┬─────────┬────────────────────┬───────┐
│ reference_sequence_name ┆ start   ┆ end     ┆ name               ┆ score │
│ ---                     ┆ ---     ┆ ---     ┆ ---                ┆ ---   │
│ str                     ┆ i64     ┆ i64     ┆ str                ┆ i64   │
╞═════════════════════════╪═════════╪═════════╪════════════════════╪═══════╡
│ chr1                    ┆ 3000053 ┆ 3000361 ┆ GGACATAAGGGCCACT-1 ┆ 1     │
│ chr1                    ┆ 3000093 ┆ 3000324 ┆ CGCGATTCAGGAACAT-1 ┆ 1     │
│ chr1                    ┆ 3000133 ┆ 3000184 ┆ ATATGGTGTTAGACCA-1 ┆ 1     │
│ chr1                    ┆ 3000135 ┆ 3000251 ┆ ACCTTCATCTTTGAGA-1 ┆ 1     │
│ chr1                    ┆ 3000143 ┆ 3000174 ┆ GACCGTTCAGCAAATA-1 ┆ 1     │
│ …                       ┆ …       ┆ …       ┆ …                  ┆ …     │
│ JH584295.1              ┆ 1947    ┆ 1976    ┆ TCCTTGCAGCTCAATA-1 ┆ 1     │
│ JH584295.1              ┆ 1947    ┆ 1976    ┆ TTCCACGGTTAACACG-1 ┆ 1     │
│ JH584295.1              ┆ 1956    ┆ 1976    ┆ CTGTATTTCCGCTAGA-1 ┆ 2     │
│ JH584295.1              ┆ 1956    ┆ 1976    ┆ CTTTAGTTCCACCCTG-1 ┆ 3     │
│ JH584295.1              ┆ 1956    ┆ 1976    ┆ GACCTAGTCCTCAGCT-1 ┆ 1     │
└─────────────────────────┴─────────┴─────────┴────────────────────┴───────┘

# Before this patch, I had to make a fragments.tsv.gz file without comment lines to be able to read it:
In [7]: fragments = session.read_bed_file("fragments.no_comments.tsv.gz", bb.BEDReadOptions(file_compression_type=bb.FileCompressionType("gzip"), n_fields=5, file_extension="tsv.gz"))

In [8]: %time fragments_df = fragments.to_polars()
CPU times: user 1min 1s, sys: 6.86 s, total: 1min 8s
Wall time: 1min 2s

In [9]: fragments_df
Out[9]: 
shape: (102_753_538, 5)
┌─────────────────────────┬─────────┬─────────┬────────────────────┬───────┐
│ reference_sequence_name ┆ start   ┆ end     ┆ name               ┆ score │
│ ---                     ┆ ---     ┆ ---     ┆ ---                ┆ ---   │
│ str                     ┆ i64     ┆ i64     ┆ str                ┆ i64   │
╞═════════════════════════╪═════════╪═════════╪════════════════════╪═══════╡
│ chr1                    ┆ 3000053 ┆ 3000361 ┆ GGACATAAGGGCCACT-1 ┆ 1     │
│ chr1                    ┆ 3000093 ┆ 3000324 ┆ CGCGATTCAGGAACAT-1 ┆ 1     │
│ chr1                    ┆ 3000133 ┆ 3000184 ┆ ATATGGTGTTAGACCA-1 ┆ 1     │
│ chr1                    ┆ 3000135 ┆ 3000251 ┆ ACCTTCATCTTTGAGA-1 ┆ 1     │
│ chr1                    ┆ 3000143 ┆ 3000174 ┆ GACCGTTCAGCAAATA-1 ┆ 1     │
│ …                       ┆ …       ┆ …       ┆ …                  ┆ …     │
│ JH584295.1              ┆ 1947    ┆ 1976    ┆ TCCTTGCAGCTCAATA-1 ┆ 1     │
│ JH584295.1              ┆ 1947    ┆ 1976    ┆ TTCCACGGTTAACACG-1 ┆ 1     │
│ JH584295.1              ┆ 1956    ┆ 1976    ┆ CTGTATTTCCGCTAGA-1 ┆ 2     │
│ JH584295.1              ┆ 1956    ┆ 1976    ┆ CTTTAGTTCCACCCTG-1 ┆ 3     │
│ JH584295.1              ┆ 1956    ┆ 1976    ┆ GACCTAGTCCTCAGCT-1 ┆ 1     │
└─────────────────────────┴─────────┴─────────┴────────────────────┴───────┘
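
For anyone on a biobear version without this patch, the workaround mentioned above (writing a copy of the fragments file without comment lines) could look roughly like the sketch below. It uses only the Python standard library. The helper name `strip_comment_lines` and its parameters are hypothetical and chosen for illustration, and it assumes every comment line starts with `#` in the first column, as in the CellRanger output shown above.

```python
import gzip


def strip_comment_lines(src_path, dst_path, comment_prefix="#"):
    """Copy a gzipped text file to dst_path, dropping comment lines.

    Hypothetical helper for illustration: a line is treated as a comment
    only if it starts with comment_prefix in the first column.
    Returns the number of lines written to dst_path.
    """
    kept = 0
    # Stream line by line so large fragments files are never held in memory.
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            if line.startswith(comment_prefix):
                continue  # skip header/comment lines
            dst.write(line)
            kept += 1
    return kept
```

Calling `strip_comment_lines("fragments.tsv.gz", "fragments.no_comments.tsv.gz")` should then produce a file that can be read as in `In [7]` above. Note that the output is written with plain gzip, not bgzip, so any existing tabix index would not apply to the copy.
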
tshauck commented 2 months ago

Thanks!

tshauck commented 2 months ago

Thanks again! This is available in biobear v0.22.2.