zwdzwd / sesame

🍪 SEnsible Step-wise Analysis of DNA MEthylation BeadChips

Having trouble with background subtraction using oob on human samples. #116

Open arualemsti opened 1 year ago

arualemsti commented 1 year ago

Hello,

I'm new to bioinformatics and R, and I would appreciate some guidance on why my samples look odd after background subtraction using OOB (out-of-band) signal. I am processing human iPSC, PBMC, and sperm samples, all run on the Illumina EPIC v2 array. Below is example data from one sample in particular.

The QC prep I do is QCDPB. These are my stats before QC prep:

sdfs = openSesame(IDATprefixes, BPPARAM = BiocParallel::SnowParam(16), func = NULL)
qcs = openSesame(sdfs, prep="", func=sesameQC_calcStats)
head(qcs)
$iPSC_101_1_R01C01

| Detection

N. Probes w/ Missing Raw Intensity : 2 (num_dtna)
% Probes w/ Missing Raw Intensity : 0.0 % (frac_dtna)
N. Probes w/ Detection Success : 927987 (num_dt)
% Detection Success : 99.0 % (frac_dt)
N. Detection Succ. (after masking) : 896247 (num_dt_mk)
% Detection Succ. (after masking) : 100.0 % (frac_dt_mk)
N. Probes w/ Detection Success (cg) : 924326 (num_dt_cg)
% Detection Success (cg) : 99.0 % (frac_dt_cg)
N. Probes w/ Detection Success (ch) : 2802 (num_dt_ch)
% Detection Success (ch) : 96.2 % (frac_dt_ch)
N. Probes w/ Detection Success (rs) : 65 (num_dt_rs)
% Detection Success (rs) : 100.0 % (frac_dt_rs)

| Signal Intensity

Mean sig. intensity : 4546.43 (mean_intensity)
Mean sig. intensity (M+U) : 9105.88 (mean_intensity_MU)
Mean sig. intensity (Inf.II) : 4390.82 (mean_ii)
Mean sig. intens.(I.Grn IB) : 5571.43 (mean_inb_grn)
Mean sig. intens.(I.Red IB) : 5578.12 (mean_inb_red)
Mean sig. intens.(I.Grn OOB) : 351.15 (mean_oob_grn)
Mean sig. intens.(I.Red OOB) : 314.58 (mean_oob_red)
N. NA in M (all probes) : 2 (na_intensity_M)
N. NA in U (all probes) : 2 (na_intensity_U)
N. NA in raw intensity (IG) : 0 (na_intensity_ig)
N. NA in raw intensity (IR) : 0 (na_intensity_ir)
N. NA in raw intensity (II) : 4 (na_intensity_ii)

| Number of Probes

N. Probes : 937690 (num_probes)
N. Inf.-II Probes : 809395 (num_probes_II)
N. Inf.-I (Red) : 82651 (num_probes_IR)
N. Inf.-I (Grn) : 45644 (num_probes_IG)
N. Probes (CG) : 933252 (num_probes_cg)
N. Probes (CH) : 2914 (num_probes_ch)
N. Probes (RS) : 65 (num_probes_rs)

| Color Channel

N. Inf.I Probes Red -> Red : 82632 (InfI_switch_R2R)
N. Inf.I Probes Grn -> Grn : 45632 (InfI_switch_G2G)
N. Inf.I Probes Red -> Grn : 19 (InfI_switch_R2G)
N. Inf.I Probes Grn -> Red : 12 (InfI_switch_G2R)

| Dye Bias

Median Inf.I Intens. Red : 11216.40 (medR)
Median Inf.I Intens. Grn : 11171.40 (medG)
Median of Top 20 Inf.I Intens. Red : 43301.14 (topR)
Median of Top 20 Inf.I Intens. Grn : 34737.52 (topG)
Ratio of Red-to-Grn median Intens. : 1.00 (RGratio)
Ratio of Top vs. Global R/G Ratios : 1.24 (RGdistort)

| Beta Value

Mean Beta : 0.69 (mean_beta)
Median Beta : 0.92 (median_beta)
% Beta < 0.3 : 23.9 % (frac_unmeth)
% Beta > 0.7 : 68.3 % (frac_meth)
N. is.na(Beta) : 41443 (num_na)
% is.na(Beta) : 4.4 % (frac_na)
Mean Beta (CG) : 0.69 (mean_beta_cg)
Median Beta (CG) : 0.92 (median_beta_cg)
% Beta < 0.3 (CG) : 23.8 % (frac_unmeth_cg)
% Beta > 0.7 (CG) : 68.5 % (frac_meth_cg)
N. is.na(Beta) (CG) : 40513 (num_na_cg)
% is.na(Beta) (CG) : 4.3 % (frac_na_cg)
Mean Beta (CH) : 0.41 (mean_beta_ch)
Median Beta (CH) : 0.39 (median_beta_ch)
% Beta < 0.3 (CH) : 27.0 % (frac_unmeth_ch)
% Beta > 0.7 (CH) : 7.1 % (frac_meth_ch)
N. is.na(Beta) (CH) : 265 (num_na_ch)
% is.na(Beta) (CH) : 9.1 % (frac_na_ch)
Mean Beta (RS) : 0.51 (mean_beta_rs)
Median Beta (RS) : 0.49 (median_beta_rs)
% Beta < 0.3 (RS) : 26.2 % (frac_unmeth_rs)
% Beta > 0.7 (RS) : 29.2 % (frac_meth_rs)
N. is.na(Beta) (RS) : 0 (num_na_rs)
% is.na(Beta) (RS) : 0.0 % (frac_na_rs)
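
For readers unfamiliar with the QCDPB prep code mentioned above: as I understand the sesame documentation, each letter names one preprocessing function, and the steps are applied left to right. The lookup below is a minimal base-R sketch covering only these five letters. `prep_steps` and `decode_prep` are hypothetical names used for illustration and are not part of sesame.

```r
# Letter-to-function mapping for the five steps in "QCDPB".
# The function names on the right are sesame functions; the lookup
# table and helper are hypothetical illustrations, not sesame API.
prep_steps <- c(
  Q = "qualityMask",            # mask probes flagged as low quality
  C = "inferInfiniumIChannel",  # re-infer the Infinium-I color channel
  D = "dyeBiasNL",              # nonlinear dye bias correction
  P = "pOOBAH",                 # detection p-value masking based on OOB signal
  B = "noob"                    # background subtraction based on OOB signal
)

# Split a prep code into letters and look up each step in order.
decode_prep <- function(code) {
  letters_in_code <- strsplit(code, "")[[1]]
  unname(prep_steps[letters_in_code])
}

print(decode_prep("QCDPB"))
```

In sesame itself the same sequence should be what `prepSesame(sdf, "QCDPB")` or `openSesame(..., prep = "QCDPB")` runs, and `prepSesameList()` prints the full table of codes. So the noob step discussed in this thread is the final "B".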

These are my stats after QC prep:

prepped_qcs = openSesame(IDATprefixes, BPPARAM = BiocParallel::SnowParam(8), prep="", func=sesameQC_calcStats)
head(prepped_qcs)
$iPSC_101_1_R01C01

| Detection

N. Probes w/ Missing Raw Intensity : 2 (num_dtna)
% Probes w/ Missing Raw Intensity : 0.0 % (frac_dtna)
N. Probes w/ Detection Success : 927952 (num_dt)
% Detection Success : 99.0 % (frac_dt)
N. Detection Succ. (after masking) : 927952 (num_dt_mk)
% Detection Succ. (after masking) : 99.0 % (frac_dt_mk)
N. Probes w/ Detection Success (cg) : 924294 (num_dt_cg)
% Detection Success (cg) : 99.0 % (frac_dt_cg)
N. Probes w/ Detection Success (ch) : 2800 (num_dt_ch)
% Detection Success (ch) : 96.1 % (frac_dt_ch)
N. Probes w/ Detection Success (rs) : 65 (num_dt_rs)
% Detection Success (rs) : 100.0 % (frac_dt_rs)

| Signal Intensity

Mean sig. intensity : 4978.39 (mean_intensity)
Mean sig. intensity (M+U) : 9956.78 (mean_intensity_MU)
Mean sig. intensity (Inf.II) : 4833.01 (mean_ii)
Mean sig. intens.(I.Grn IB) : 5746.53 (mean_inb_grn)
Mean sig. intens.(I.Red IB) : 5978.02 (mean_inb_red)
Mean sig. intens.(I.Grn OOB) : 368.88 (mean_oob_grn)
Mean sig. intens.(I.Red OOB) : 625.86 (mean_oob_red)
N. NA in M (all probes) : 2 (na_intensity_M)
N. NA in U (all probes) : 2 (na_intensity_U)
N. NA in raw intensity (IG) : 0 (na_intensity_ig)
N. NA in raw intensity (IR) : 0 (na_intensity_ir)
N. NA in raw intensity (II) : 4 (na_intensity_ii)

| Number of Probes

N. Probes : 937690 (num_probes)
N. Inf.-II Probes : 809395 (num_probes_II)
N. Inf.-I (Red) : 82610 (num_probes_IR)
N. Inf.-I (Grn) : 45685 (num_probes_IG)
N. Probes (CG) : 933252 (num_probes_cg)
N. Probes (CH) : 2914 (num_probes_ch)
N. Probes (RS) : 65 (num_probes_rs)

| Color Channel

N. Inf.I Probes Red -> Red : 82510 (InfI_switch_R2R)
N. Inf.I Probes Grn -> Grn : 45544 (InfI_switch_G2G)
N. Inf.I Probes Red -> Grn : 100 (InfI_switch_R2G)
N. Inf.I Probes Grn -> Red : 141 (InfI_switch_G2R)

| Dye Bias

Median Inf.I Intens. Red : 11913.00 (medR)
Median Inf.I Intens. Grn : 11589 (medG)
Median of Top 20 Inf.I Intens. Red : 47776.50 (topR)
Median of Top 20 Inf.I Intens. Grn : 33636.00 (topG)
Ratio of Red-to-Grn median Intens. : 1.03 (RGratio)
Ratio of Top vs. Global R/G Ratios : 1.38 (RGdistort)

| Beta Value

Mean Beta : 0.68 (mean_beta)
Median Beta : 0.89 (median_beta)
% Beta < 0.3 : 23.4 % (frac_unmeth)
% Beta > 0.7 : 67.1 % (frac_meth)
N. is.na(Beta) : 9738 (num_na)
% is.na(Beta) : 1.0 % (frac_na)
Mean Beta (CG) : 0.68 (mean_beta_cg)
Median Beta (CG) : 0.89 (median_beta_cg)
% Beta < 0.3 (CG) : 23.3 % (frac_unmeth_cg)
% Beta > 0.7 (CG) : 67.3 % (frac_meth_cg)
N. is.na(Beta) (CG) : 8958 (num_na_cg)
% is.na(Beta) (CG) : 1.0 % (frac_na_cg)
Mean Beta (CH) : 0.40 (mean_beta_ch)
Median Beta (CH) : 0.38 (median_beta_ch)
% Beta < 0.3 (CH) : 25.6 % (frac_unmeth_ch)
% Beta > 0.7 (CH) : 4.4 % (frac_meth_ch)
N. is.na(Beta) (CH) : 114 (num_na_ch)
% is.na(Beta) (CH) : 3.9 % (frac_na_ch)
Mean Beta (RS) : 0.52 (mean_beta_rs)
Median Beta (RS) : 0.49 (median_beta_rs)
% Beta < 0.3 (RS) : 24.6 % (frac_unmeth_rs)
% Beta > 0.7 (RS) : 29.2 % (frac_meth_rs)
N. is.na(Beta) (RS) : 0 (num_na_rs)
% is.na(Beta) (RS) : 0.0 % (frac_na_rs)

I think everything looks okay initially, but when I get to the individual background subtraction step using noob, I get some really weird-looking graphs. This is my code:

sdf_1 <- readIDATpair('/mnt/e/Human_EPIC/Chavez-UnivTexas_MethylationEPIC_20230823/Chavez-UnivTexas_MethylationEPIC_20230823/all_IDATs/iPSC_101_1_R01C01', platform = "EPIC")
sdf_1.InfICorrected = inferInfiniumIChannel(sdf_1, verbose=TRUE)
[2023-10-12 13:13:21.817447] Infinium-I color channel reset: R>R:91103;G>G:49589;R>G:1089;G>R:343
par(mfrow=c(1,2))
sesameQC_plotRedGrnQQ(sdf_1, main="Before")
sesameQC_plotRedGrnQQ(dyeBiasNL(sdf_1.InfICorrected), main="After") # nonlinear correction

[image: Red/Grn QQ plots, "Before" and "After"]

sdf_1.InfICorrected <- dyeBiasNL(sdf_1.InfICorrected)
par(mfrow=c(2,1), mar=c(3,3,2,1))
sesameQC_plotBetaByDesign(sdf_1.InfICorrected, main="Before", xlab="\beta")
sesameQC_plotBetaByDesign(noob(sdf_1.InfICorrected), main="After", xlab="Beta")

[image: beta value distributions by probe design, "Before" and "After"]

I'm not sure how to proceed. I'm worried that my DMR/DML analysis will be unreliable, and on top of that I am not able to run inferEthnicity. I have reached out to Illumina and provided my scan folders so that they can check the data quality.

Any help and guidance on this would be much appreciated!

Thank you!

zwdzwd commented 1 year ago

I am not 100% sure that's the cause, but you mentioned that this is an EPICv2 array. If that's the case, you need platform = "EPICv2":

sdf_1 <- readIDATpair('/mnt/e/Human_EPIC/Chavez-UnivTexas_MethylationEPIC_20230823/Chavez-UnivTexas_MethylationEPIC_20230823/all_IDATs/iPSC_101_1_R01C01', platform = "EPICv2")

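
One way to sanity-check this from the numbers already posted in this thread: the QC stats from openSesame and the inferInfiniumIChannel log from the platform = "EPIC" call report different Infinium-I probe totals, which would be consistent with two different manifests being used. A small base-R sketch of that arithmetic follows; the variable names are illustrative, and the counts are copied from the output earlier in the thread.

```r
# Infinium-I probe counts reported in the QC stats (before prep)
qc_inf1 <- c(red = 82651, grn = 45644)

# Counts from the inferInfiniumIChannel log after
# readIDATpair(..., platform = "EPIC")
log_inf1 <- c(R2R = 91103, G2G = 49589, R2G = 1089, G2R = 343)

qc_total  <- sum(qc_inf1)
log_total <- sum(log_inf1)

cat("Inf-I probes in QC stats:", qc_total, "\n")
cat("Inf-I probes in EPIC log:", log_total, "\n")
cat("Difference:", log_total - qc_total, "\n")
```

If both calls had used the same manifest, the two totals would be expected to agree. A gap of this size suggests that the single-sample call was matching the IDATs against a different probe layout than the batch call, which fits the platform mismatch described above.
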
arualemsti commented 12 months ago

Thank you! I think this pretty much fixed the problem. Getting better green and red overlap.