Closed KaushikPanda1 closed 5 years ago
Hi, the function filters out reads if there are 3 non-CG sites and more than 70% of them are methylated (by default), it could be different than the threshold used by bismark. For the reads that you think should be filtered out, are there any reads that meet this criterion? If so, it could be a bug and please ping me to fix it.
Yupeng
Hi, I provided parameters such that there should be minimum of 10 non-CG sites and reads with >90% methylation to be filtered out. As I mentioned earlier, many of such reads are successfully filtered out, but a good amount (~20-30% of such highly methylated reads) are NOT filtered out and so I think there might be a bug. Bismark programs works a different way, but the idea is the same to filter out highly methylated (close to 100% methylated) reads because they are likely due to inefficiency of bisulfite conversion.
Let me know if you need test files to look into this bug.
Thanks for classification. Yes, some test files will be super helpful.
I have attached an alignment bam file and a reference file that can be used to run the bam quality filter program of methylpy. I fetch the sequence from the filtered bam file and use kismeth (http://katahdin.mssm.edu/kismeth/revpage.pl) to check if the filtering of highly methylated reads actually worked successfully or not.
Thank you. test_files.zip
Thanks. It was due to some issues on handling reads with indels. I have updated Can you try it with the latest version of methylpy (1.4.0)? Is your data paired-end or single-end? Note that this function is only able to handle single-end data.
Yupeng
My data is single-end. I installed the new version using PyPi. I tried the bam-quality-filter and still some 100% methylated reads are passing the filter (--min-num-ch 10 --max-mch-level 0.8 or 0.9). I am comparing unfiltered and filtered reads in kismeth.
Thanks Yupeng, but I don't think the issue is fixed yet.
ok. Can you send me a few examples of the unfiltered reads that are marked as 100% methylated? It will be helpful for figuring out what is missing.
Yupeng
I will try, but it is not trivial. Out of roughly 4000 sequences, there are about 20 fully methylated reads. In Kismeth, using dotblots of the pool, we can clearly see the fully methylated reads, but from the display it is difficult to identify which exact reads they are. If I were able to specifically capture these fully methylated reads, I would have been able to filter it. I will try to grab a few examples.
Thanks, Kaushik.
Thanks. I tried Kismeth but did not have much luck. What did you use as input to Kismeth?
Yupeng
In Kismeth you need 2 inputs. 1st is the reference file and the 2nd is the read files in fasta format, so I generally fetch that from the alignment bam file (try the following parameters: Remove PCR duplicates, Min. fraction of positive matches - 0.8 and Min. fraction of length - 0.25).
I did some crude fetching yesterday to make my case. I have attached 3 files here - an unfiltered highly methylated reads fasta file, and 2 filtered highly methylated reads with different parameters settin (0.8 or 0.9 for max mCH methylation). In this case, out of the 29 highly methylated reads, 25 are removed but 4 fully methylated reads pass through the filter.
Have a naive question - for filtering the reads, does the program calculate methylation percent of each read that pass the minimum 10 CH sites filter?
Kaushik. Archive.zip
Thanks. This is super helpful. For your question, yes, it calculates the fraction of CH sites that are methylated for all reads that pass the minimum CH site filter, and then apply a threshold on that fraction.
The latest methylpy (1.4.1) should be able to filter out the leftover 4 highly methylated reads. Do you mind to give it a try on your end?
I also attached the test data I ran (example.tar.gz). The command is
methylpy bam-quality-filter --input-file SL1403_processed_reads.high_mch.bam --ref-fasta Reference.fa --output-file test_output.bam
Yupeng
Thanks Yupeng. It works now!!! Can you share what was the last edit you did? Something regarding the edges of the reads?
Kaushik.
Cool. Glad it works. Methylpy failed to filter out the highly methylated reads because it was unable to deal with indel within the reads. If there was any indel in the read, the methylation level of mCH sites was not calculated correctly.
Hi yupenghe,
Can you look into the bam-quality-filter program? It does not function as its supposed to. I am trying to use it to filter reads with very high non-CG methylation >90%. The program is able to filter a subset of these reads successfully but not all of the highly methylated reads. In comparison, bismark's filte program is effectively able to remove all reads, but it needs initial mapping to be done in bismark and therefore I cannot use it for bam file generated through methylpy.
Thank you.