Closed ghuls closed 2 months ago
For example, fragments files created by CellRangerATAC/CellRangerARC have 52 comment lines at the start of the file:
    $ zcat "fragments.tsv.gz" | tail -n +50 | head
    # primary_contig=GL456216.1
    # primary_contig=JH584292.1
    # primary_contig=JH584295.1
    chr1	3000052	3000361	GGACATAAGGGCCACT-1	1
    chr1	3000092	3000324	CGCGATTCAGGAACAT-1	1
    chr1	3000132	3000184	ATATGGTGTTAGACCA-1	1
    chr1	3000134	3000251	ACCTTCATCTTTGAGA-1	1
    chr1	3000142	3000174	GACCGTTCAGCAAATA-1	1
    chr1	3000227	3000395	TGTTATGAGAGAGGAG-1	1
    chr1	3000281	3000485	TGTTACTTCCTTAGGG-1	1
    In [1]: import biobear as bb

    In [2]: session = bb.new_session()

    In [3]: fragments_with_comments = session.read_bed_file(
       ...:     "fragments.tsv.gz",
       ...:     bb.BEDReadOptions(
       ...:         file_compression_type=bb.FileCompressionType("gzip"),
       ...:         n_fields=5,
       ...:         file_extension="tsv.gz",
       ...:     ),
       ...: )

    In [4]: %time fragments_with_comments_df = fragments_with_comments.to_polars()
    CPU times: user 58.4 s, sys: 6.43 s, total: 1min 4s
    Wall time: 60 s

    In [6]: fragments_with_comments_df
    Out[6]:
    shape: (102_753_538, 5)
    ┌─────────────────────────┬─────────┬─────────┬────────────────────┬───────┐
    │ reference_sequence_name ┆ start   ┆ end     ┆ name               ┆ score │
    │ ---                     ┆ ---     ┆ ---     ┆ ---                ┆ ---   │
    │ str                     ┆ i64     ┆ i64     ┆ str                ┆ i64   │
    ╞═════════════════════════╪═════════╪═════════╪════════════════════╪═══════╡
    │ chr1                    ┆ 3000053 ┆ 3000361 ┆ GGACATAAGGGCCACT-1 ┆ 1     │
    │ chr1                    ┆ 3000093 ┆ 3000324 ┆ CGCGATTCAGGAACAT-1 ┆ 1     │
    │ chr1                    ┆ 3000133 ┆ 3000184 ┆ ATATGGTGTTAGACCA-1 ┆ 1     │
    │ chr1                    ┆ 3000135 ┆ 3000251 ┆ ACCTTCATCTTTGAGA-1 ┆ 1     │
    │ chr1                    ┆ 3000143 ┆ 3000174 ┆ GACCGTTCAGCAAATA-1 ┆ 1     │
    │ …                       ┆ …       ┆ …       ┆ …                  ┆ …     │
    │ JH584295.1              ┆ 1947    ┆ 1976    ┆ TCCTTGCAGCTCAATA-1 ┆ 1     │
    │ JH584295.1              ┆ 1947    ┆ 1976    ┆ TTCCACGGTTAACACG-1 ┆ 1     │
    │ JH584295.1              ┆ 1956    ┆ 1976    ┆ CTGTATTTCCGCTAGA-1 ┆ 2     │
    │ JH584295.1              ┆ 1956    ┆ 1976    ┆ CTTTAGTTCCACCCTG-1 ┆ 3     │
    │ JH584295.1              ┆ 1956    ┆ 1976    ┆ GACCTAGTCCTCAGCT-1 ┆ 1     │
    └─────────────────────────┴─────────┴─────────┴────────────────────┴───────┘

    # Before this patch, I had to make a fragments.tsv.gz file without comment lines to be able to read it:

    In [7]: fragments = session.read_bed_file(
       ...:     "fragments.no_comments.tsv.gz",
       ...:     bb.BEDReadOptions(
       ...:         file_compression_type=bb.FileCompressionType("gzip"),
       ...:         n_fields=5,
       ...:         file_extension="tsv.gz",
       ...:     ),
       ...: )

    In [8]: %time fragments_df = fragments.to_polars()
    CPU times: user 1min 1s, sys: 6.86 s, total: 1min 8s
    Wall time: 1min 2s

    In [9]: fragments_df
    Out[9]:
    shape: (102_753_538, 5)
    ┌─────────────────────────┬─────────┬─────────┬────────────────────┬───────┐
    │ reference_sequence_name ┆ start   ┆ end     ┆ name               ┆ score │
    │ ---                     ┆ ---     ┆ ---     ┆ ---                ┆ ---   │
    │ str                     ┆ i64     ┆ i64     ┆ str                ┆ i64   │
    ╞═════════════════════════╪═════════╪═════════╪════════════════════╪═══════╡
    │ chr1                    ┆ 3000053 ┆ 3000361 ┆ GGACATAAGGGCCACT-1 ┆ 1     │
    │ chr1                    ┆ 3000093 ┆ 3000324 ┆ CGCGATTCAGGAACAT-1 ┆ 1     │
    │ chr1                    ┆ 3000133 ┆ 3000184 ┆ ATATGGTGTTAGACCA-1 ┆ 1     │
    │ chr1                    ┆ 3000135 ┆ 3000251 ┆ ACCTTCATCTTTGAGA-1 ┆ 1     │
    │ chr1                    ┆ 3000143 ┆ 3000174 ┆ GACCGTTCAGCAAATA-1 ┆ 1     │
    │ …                       ┆ …       ┆ …       ┆ …                  ┆ …     │
    │ JH584295.1              ┆ 1947    ┆ 1976    ┆ TCCTTGCAGCTCAATA-1 ┆ 1     │
    │ JH584295.1              ┆ 1947    ┆ 1976    ┆ TTCCACGGTTAACACG-1 ┆ 1     │
    │ JH584295.1              ┆ 1956    ┆ 1976    ┆ CTGTATTTCCGCTAGA-1 ┆ 2     │
    │ JH584295.1              ┆ 1956    ┆ 1976    ┆ CTTTAGTTCCACCCTG-1 ┆ 3     │
    │ JH584295.1              ┆ 1956    ┆ 1976    ┆ GACCTAGTCCTCAGCT-1 ┆ 1     │
    └─────────────────────────┴─────────┴─────────┴────────────────────┴───────┘
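For anyone stuck on a biobear version without this patch, the workaround mentioned in the session above (making a copy of the fragments file without comment lines) can be roughly sketched with the Python standard library alone. This is only an illustration, not part of biobear: `strip_comment_lines` is a hypothetical helper name, and it assumes every line starting with `#` is a comment and that plain gzip output (not BGZF) is acceptable.

```python
import gzip


def strip_comment_lines(src: str, dst: str, prefix: bytes = b"#") -> int:
    """Copy a gzipped TSV from src to dst, dropping lines that start with prefix.

    Hypothetical helper (not a biobear API). Returns the number of dropped lines.
    """
    dropped = 0
    # Stream line by line so large fragments files never sit fully in memory.
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        for line in fin:
            if line.startswith(prefix):
                dropped += 1  # comment line: skip it
            else:
                fout.write(line)  # data line: copy unchanged
    return dropped
```

Calling `strip_comment_lines("fragments.tsv.gz", "fragments.no_comments.tsv.gz")` on the file from the description should report 52 dropped lines if all of its comment lines start with `#`.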
Thanks!
Thanks again, this is available in biobear v0.22.2