yupenghe / methylpy

WGBS/NOMe-seq Data Processing & Differential Methylation Analysis
Apache License 2.0
135 stars 47 forks source link

methylpy call-methylation-state #87

Open MagdalenaWinklhofer opened 1 year ago

MagdalenaWinklhofer commented 1 year ago

Hi,

I am working with WGBS data on a non-model organism. I used Bismark (with Bowtie) to align my reads and would like to use Methylpy to identify DMRs. I installed methylpy in a conda environment on our HPC. Since we have a lot of modules available there I did not install other dependencies so far. I wanted to use the "methylpy call-methylation-state" command to get allc files from the alignment .bam files (from Birmark). I used the alignment files and not the deduplicated files (is that correct?). Since I got the error that my genome was not indexed, I indexed it with the following command:

methylpy build-reference \ --input-files .../working.genome.masked.262contigs.fasta \ --output-prefix genome_index \

This script finished and gave me the following files:

genome_index_f.1.bt2
genome_index_f.3.bt2
genome_index_f.rev.1.bt2
genome_index_r.1.bt2
genome_index_r.3.bt2
genome_index_r.rev.1.bt2 genome_index_f.2.bt2 genome_index_f.4.bt2
genome_index_f.rev.2.bt2
genome_index_r.2.bt2
genome_index_r.4.bt2
genome_index_r.rev.2.bt2

After that I tried it again:

methylpy call-methylation-state \ --input-file A1_val_1_bismark_bt2_pe.bam \ --paired-end True \ --sample A1 \ --ref-fasta .../working.genome.masked.262contigs.fasta \ --num-procs 2 \ --path-to-output .../11_DML/ \ --path-to-samtools /cluster/software/SAMtools/1.16.1-GCC-11.3.0 \

But I get multiple errors:

Begin flipping the strand of read 2 Thu Sep 21 16:13:23 2023

Input not indexed. Indexing... Thu Sep 21 16:14:06 2023

[E::hts_idx_push] Unsorted positions on sequence #49: 1104419 followed by 1104238 samtools index: failed to create index for "A1-D17-1_R1_val_1_bismark_bt2_pe.bam.read2flipped.bam" Traceback (most recent call last): File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/call_mc_se.py", line 1483, in call_methylated_sites open(inputf+".bai",'r') FileNotFoundError: [Errno 2] No such file or directory: 'A1-D17-1_R1_val_1_bismark_bt2_pe.bam.read2flipped.bam.bai'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File ".../methylpy_env/bin/methylpy", line 5, in parse_args() File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/parser.py", line 221, in parse_args keep_temp_files=args.keep_temp_files) File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/call_mc_pe.py", line 1373, in call_methylated_sites_pe keep_temp_files = keep_temp_files) File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/call_mc_se.py", line 1486, in call_methylated_sites subprocess.check_call(shlex.split(path_to_samtools+"samtools index "+inputf)) File ".../conda_environments/methylpy_env/lib/python3.7/subprocess.py", line 363, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['samtools', 'index', 'A1-D17-1_R1_val_1_bismark_bt2_pe.bam.read2flipped.bam']' returned non-zero exit status 1.

I am sure that the .bam file exists, and they are even in the same directory. I don't know what is meant by the "'['samtools', 'index'," output, since I state the path to SAMtools (is a module on our HPC and I made sure that it was loaded) in the command and I indexed the genome (.fasta file is also stated in the command).

I removed sensitive information in the code above!

Thank you very much for your help!!!

yupenghe commented 1 year ago

Hey, sorry that the error messages are not super clear.

  1. For genome indexing, it is about indexing the fasta file. Please checkout this link: http://www.htslib.org/doc/samtools-faidx.html
  2. For the error related to bam file, my guess is that the error is caused by the input bam file being unsorted. Could you try to sort the bam file with samtools sort (link: http://www.htslib.org/doc/samtools-sort.html) and then run methylpy?

In general I would suggest trying the above two fixes with a subset of the input files to know whether they work sooner.

MagdalenaWinklhofer commented 1 year ago

Hi, thank you for the very helpful reply. I got it working! :) I had the genome already indexed, but my bam files were unsorted as you suspected. I ran "samtools sort" and "samtools index," after that, the DMRfind worked as usual.

I performed a test run so far and was surprised that you were always talking about differentially methylated regions (DMR), but in the "result" (output) file, I get positions that are 1-based (column: pos). Wouldn't that be a differentially methylated location (DML). Is there a way to get both outputs, one for DML and one for DMR? I am also struggling a bit to understand all the columns in the output file. I particularly looked at the "results.tsv" file. Do you write anywhere about what all those columns mean? I am particularly interested in understanding the abbreviations "h" and"uc".

Thank you for your help!

PS: Sorry for my late reply. I had a course the last weeks and no time to work on my project.

yupenghe commented 1 year ago

Hey, what is your command to run the DMR caller? DMR caller identifies DML or DMS (S = site) first and then merges them to DMRs. The output should be in the *.DMR.bed file. For the results.tsv file, m, uc and h means the # methylated reads, # unmethylated reads and the total # of reads.

MagdalenaWinklhofer commented 1 year ago

Hi, thank you for the information regarding the abbreviations, that really helped.

So far, I have used the "sorted.bam" files for each sample to extract the methylation state with the "methylpy call-methylation-state" command. The outputs I received were "allc.tsv_" files. After that, I merged all 4 biological replicates (a1,a2,a3,a4) files into one condition file (a = anoxia). I used that merged "_allcanoxia.tsv.gz" file in combination with the "_allcnormoxia.tsv.gz" file as input for the "DMR find" script.

methylpy DMRfind \ --allc-files allc_normoxia.tsv.gz allc_anoxia.tsv.gz \ --samples normoxia anoxia \ --mc-type "CG" \ --num-procs 2 \ --output-prefix DMR_normxia_vs_anoxia \

For that script, I get four files:

I read in your documentation that you can also use a space-separated list for different input samples, but if I tried the script like this:

methylpy DMRfind \ --allc-files allc_N1.tsv.gz,allc_N2.tsv.gz,allc_N3.tsv.gz,allc_N7.tsv.gz allc_A1.tsv.gz,allc_A2.tsv.gz,allc_A4.tsv.gz,allc_A7.tsv.gz \ --samples N1,N2,N3,N4 A1,A2,A4,A7 \ --mc-type "CG" \ --num-procs 2 \ --output-prefix DMR_normxia_vs_anoxia_test \

I got the following error: Traceback (most recent call last): File ".../conda_environments/methylpy_env/bin/methylpy", line 5, in parse_args() File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/parser.py", line 71, in parse_args seed=args.seed) File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/DMRfind.py", line 166, in DMRfind chrom_pointer[sample] = read_allc_index(allc_file) File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/utilities.py", line 724, in read_allc_index index_allc_file(allc_file,reindex=False) File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/utilities.py", line 686, in index_allc_file f = open_allc_file(allc_file) File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/utilities.py", line 1133, in open_allc_file f = open(allc_file,'r') FileNotFoundError: [Errno 2] No such file or directory: 'DMR_normxia_vs_anoxia_test_filtered_allc_N1,N2,N3,N4.tsv'

Is there another possibility to keep the biological samples unmerged and list them in the DMR find script? I would like to compare four biological replicates in the normoxia condition against the four biological replicates in the anoxia condition and receive both DML and DMR between both conditions.

Thank you for your help!!!

yupenghe commented 1 year ago

The error is due to the separator being comma --samples N1,N2,N3,N4 A1,A2,A4,A7. Switching to space separator should work.

Is there another possibility to keep the biological samples unmerged and list them in the DMR find script? I would like to compare four biological replicates in the normoxia condition against the four biological replicates in the anoxia condition and receive both DML and DMR between both conditions.

Can you explain more about "keep the biological samples unmerged"? DMRfind by default won't merge biological samples. But if you would like the replicates to be grouped during DMR finding process, you can use the sample category option.

MagdalenaWinklhofer commented 1 year ago

Hi,

thank you for that quick reply. I have implemented the space-separated list and the sample category. That also solved the "merging" issue I had before.

Now I get four output files:

I understand that the DMS.bed file contains information about the differentially methylated sites, and the DMR.bed file contains information about the differentially methylated regions. However, I could not find any information about the output format of both files.

As far as I understood, the DMS file contains the following columns: 1) chromosome 2) start position 3) stop position 4) sequence context 5) strand 6)??? (always the value 0.0003)

The DMR file contains only the following columns: 1) chromosome 2) start position 3) stop position 4) ??? (numerical values between 1 and 3)

Could you verify that or tell me where I can find such information?

Why are the regions in the DMR file only one base pair long? Shouldn't it be longer when we are talking about regions?

MagdalenaWinklhofer commented 1 year ago

Could you please answer my inquiry or refer me to a webpage where I could read up on the topic? That would really help my project.

yupenghe commented 1 year ago

Sorry for the late reply. For the DMS file, the 6th column is the p-value from the differential methylation test. You can basically ignore it. For the DMR file, the 4th column is the number of DMSs within the DMR. If there is only one DMS in DMR, the DMR will be one base pair long. I would recommend applying a filter on the number of DMSs. For example, it is proper to only consider DMRs with at least 4 DMSs.

MagdalenaWinklhofer commented 1 year ago

Thank you very much for the helpful information!! :)