wjwei-handsome / wgatools

Whole Genome Alignment Tools
MIT License
119 stars 7 forks source link

Missing some fields in minimap2 PAF #7

Open baozg opened 8 months ago

baozg commented 8 months ago

Hi, Weijie

I found the wgatools will stop after raise an error with minimap2 PAF minimap2 -x asm20 --eqx ref.fa query.fa > query.paf.

RROR CSV deserialize error by: CSV error: record 193 (line: 194, byte: 1339969): found record with 23 fields, but the previous record has 24 fields

Then I checked the PAF, only 2276/4195 records's 24 column start from cg:Z. According to https://github.com/lh3/miniasm/blob/master/PAF.md, only the first 12 columns is the requirement, so the aligner may choose random output of some tags.

wjwei-handsome commented 8 months ago

Ah-oh...

It seems that I am very partial to CIGAR 😆

I will fix it in a few days

wjwei-handsome commented 8 months ago

Hi @baozg

I revisited my code. Then I found out that I didn't actually make a mandatory requirement for the fields after 12 columns.

That is, if you generate paf for each row that keeps only 12 columns, this won't raise error.

The reason for the error is that some rows have 12 columns and some rows have 13 columns. I think this may require some formatting. Is this paf file produced by minimap2?

baozg commented 8 months ago

That's what I mean since the PAF don't require a real tsv, then the different rows may have different columns. My PAF was aligned by minimap2 2.27, you can test with yeast data in the wfmash repo https://github.com/waveygang/wfmash/tree/main/data

seqkit grep -r -p "SGD" scerevisiae8.fa.gz |bgzip -@ 12 -c > SGDref.fa.gz
minimap2 -x asm5 -c --eqx SGDref.fa.gz scerevisiae8.fa.gz > test.paf
cut -f24 test.paf|grep "cg:Z"|wc -l
cut -f23 test.paf|grep "cg:Z"|wc -l
johnstonmj commented 7 months ago

I just had the same experience as @baozg

PAFs produced by minimap2 -cx asm10 can have different numbers of columns. In my case, three alignments were generated, but only one of them had the tag zd:i:1

And I also encounter the error: ERROR CSV deserialize error by: CSV error: record 2 (line: 3, byte: 576): found record with 24 fields, but the previous record has 23 fields

wjwei-handsome commented 7 months ago

@johnstonmj @baozg

Hey, It seems that this is a common problem. I will rethinking the logic of file reading.

Please give me a few days :)

baozg commented 7 months ago

@johnstonmj My ugly solution is only to keep the first 12 columns and the column with CIGAR. You can use simple code to filter the minimap2 paf: perl -lane 'print join("\t", @F[0..11], grep(/^cg:Z/, @F[12..$#F]))' your_file.paf

wjwei-handsome commented 7 months ago

HAHA!

@baozg @johnstonmj

I fixed this bug and only added one line of code! LOL

https://github.com/wjwei-handsome/wgatools/commit/da0e5b5766d95d8fa2c3df75e7d8390088c6e1b6

Please clone the newest and try again.

Let me know if any problems. ❤️

johnstonmj commented 7 months ago

@johnstonmj My ugly solution is only to keep the first 12 columns and the column with CIGAR. You can use simple code to filter the minimap2 paf: perl -lane 'print join("\t", @F[0..11], grep(/^cg:Z/, @F[12..$#F]))' your_file.paf

@baozg Thanks for sharing this workaround! I didn't have a chance to try it yesterday, but I guess I won't need to because...

@wjwei-handsome The latest update works. paf2maf worked for me this time. No errors thrown and the output MAF seems good. Many thanks!

baozg commented 6 months ago

wgatools filter -f paf -a 50000 still not work with minimap2 paf

wjwei-handsome commented 6 months ago

wgatools filter -f paf -a 50000 still not work with minimap2 paf

It's OK in latest ✅