yupenghe / methylpy

WGBS/NOMe-seq Data Processing & Differential Methylation Analysis
Apache License 2.0
135 stars 48 forks source link

bam filter #37

Closed KaushikPanda1 closed 5 years ago

KaushikPanda1 commented 5 years ago

Hi yupenghe,

Can you look into the bam-quality-filter program? It does not function as its supposed to. I am trying to use it to filter reads with very high non-CG methylation >90%. The program is able to filter a subset of these reads successfully but not all of the highly methylated reads. In comparison, bismark's filte program is effectively able to remove all reads, but it needs initial mapping to be done in bismark and therefore I cannot use it for bam file generated through methylpy.

Thank you.

yupenghe commented 5 years ago

Hi, the function filters out reads if there are 3 non-CG sites and more than 70% of them are methylated (by default), it could be different than the threshold used by bismark. For the reads that you think should be filtered out, are there any reads that meet this criterion? If so, it could be a bug and please ping me to fix it.

Yupeng

KaushikPanda1 commented 5 years ago

Hi, I provided parameters such that there should be minimum of 10 non-CG sites and reads with >90% methylation to be filtered out. As I mentioned earlier, many of such reads are successfully filtered out, but a good amount (~20-30% of such highly methylated reads) are NOT filtered out and so I think there might be a bug. Bismark programs works a different way, but the idea is the same to filter out highly methylated (close to 100% methylated) reads because they are likely due to inefficiency of bisulfite conversion.

Let me know if you need test files to look into this bug.

yupenghe commented 5 years ago

Thanks for classification. Yes, some test files will be super helpful.

KaushikPanda1 commented 5 years ago

I have attached an alignment bam file and a reference file that can be used to run the bam quality filter program of methylpy. I fetch the sequence from the filtered bam file and use kismeth (http://katahdin.mssm.edu/kismeth/revpage.pl) to check if the filtering of highly methylated reads actually worked successfully or not.

Thank you. test_files.zip

yupenghe commented 5 years ago

Thanks. It was due to some issues on handling reads with indels. I have updated Can you try it with the latest version of methylpy (1.4.0)? Is your data paired-end or single-end? Note that this function is only able to handle single-end data.

Yupeng

KaushikPanda1 commented 5 years ago

My data is single-end. I installed the new version using PyPi. I tried the bam-quality-filter and still some 100% methylated reads are passing the filter (--min-num-ch 10 --max-mch-level 0.8 or 0.9). I am comparing unfiltered and filtered reads in kismeth.

Thanks Yupeng, but I don't think the issue is fixed yet.

yupenghe commented 5 years ago

ok. Can you send me a few examples of the unfiltered reads that are marked as 100% methylated? It will be helpful for figuring out what is missing.

Yupeng

KaushikPanda1 commented 5 years ago

I will try, but it is not trivial. Out of roughly 4000 sequences, there are about 20 fully methylated reads. In Kismeth, using dotblots of the pool, we can clearly see the fully methylated reads, but from the display it is difficult to identify which exact reads they are. If I were able to specifically capture these fully methylated reads, I would have been able to filter it. I will try to grab a few examples.

Thanks, Kaushik.

yupenghe commented 5 years ago

Thanks. I tried Kismeth but did not have much luck. What did you use as input to Kismeth?

Yupeng

KaushikPanda1 commented 5 years ago

In Kismeth you need 2 inputs. 1st is the reference file and the 2nd is the read files in fasta format, so I generally fetch that from the alignment bam file (try the following parameters: Remove PCR duplicates, Min. fraction of positive matches - 0.8 and Min. fraction of length - 0.25).

I did some crude fetching yesterday to make my case. I have attached 3 files here - an unfiltered highly methylated reads fasta file, and 2 filtered highly methylated reads with different parameters settin (0.8 or 0.9 for max mCH methylation). In this case, out of the 29 highly methylated reads, 25 are removed but 4 fully methylated reads pass through the filter.

Have a naive question - for filtering the reads, does the program calculate methylation percent of each read that pass the minimum 10 CH sites filter?

Kaushik. Archive.zip

yupenghe commented 5 years ago

Thanks. This is super helpful. For your question, yes, it calculates the fraction of CH sites that are methylated for all reads that pass the minimum CH site filter, and then apply a threshold on that fraction.

yupenghe commented 5 years ago

The latest methylpy (1.4.1) should be able to filter out the leftover 4 highly methylated reads. Do you mind to give it a try on your end?

I also attached the test data I ran (example.tar.gz). The command is methylpy bam-quality-filter --input-file SL1403_processed_reads.high_mch.bam --ref-fasta Reference.fa --output-file test_output.bam

Yupeng

KaushikPanda1 commented 5 years ago

Thanks Yupeng. It works now!!! Can you share what was the last edit you did? Something regarding the edges of the reads?

Kaushik.

yupenghe commented 5 years ago

Cool. Glad it works. Methylpy failed to filter out the highly methylated reads because it was unable to deal with indel within the reads. If there was any indel in the read, the methylation level of mCH sites was not calculated correctly.