zwdzwd / sesame

🍪 SEnsible Step-wise Analysis of DNA MEthylation BeadChips
Other
58 stars 31 forks source link

New Horvath age computation gives higher values than deprecated function predictAgeHorvath353 for EPIC v1 samples #106

Open oliemery opened 1 year ago

oliemery commented 1 year ago

In a previous pipeline I used age.pred<-predictAgeHorvath353(betas) to compute the biological age of samples based on EPIC v1 epigenetic data following the Horvath method. Now the predictAgeHorvath353 function is deprecated. So I replaced it with the predictAge function using the downloaded clock model Anno/HM450/Clock_Horvath353.rds from this github using the following code:

Horvath353 <- readRDS("C:/Users/ol5363/workingDir/Code_pipeline/clock_models/Clock_Horvath353.rds")
age.pred<-predictAge(betas, Horvath353)

Strangely the new function consistently gives 2.2 years more (2.1 for one sample out of 16) to the value I previously obtained with the now deprecated function. I tested this on 16 samples and compared the results between the deprecated and new functions, always a 2.1-2.2 year difference. Hence I don't know if the clock model is wrong or if it was the deprecated function which was giving ages 2.1-2.2 years younger. Could you help me figure out where this difference comes from? Thanks.

zwdzwd commented 1 year ago

I wonder if that could be due to the use of any preprocessing that generates beta values. I checked the old and the new version they seem to generate the same result.

https://zhou-lab.github.io/sesame/v1.14/inferences.html#Human

image

image

oliemery commented 1 year ago

Thank you for your reply. I went from sesame version 1.8.2 to 1.19.5 and only adapted the code for the Horvath computation. I checked normalized beta values from 16 samples at 66 CpG sites (those I use for epigenetic signature computations and still have access to the v1.8.2 computation results) and they are all identical (with 9 decimals precisions) between my 1.8.2 and 1.19.5 pipelines. However I cannot completely rule out at this point that beta values probes used for the Horvath age computation have been modified following my update to v1.19.5. I use EPIC v1 data, could you please test both old/new methods with an EPIC v1 sample just to see if you get the same results? Alternatively, could you indicate me how I could try again the predicAgeHorvath353 function on my samples now treated/normalized with v1.19.5 (since the function is deprecated and uses the predictAge function which has also been updated since v1.8.2 I have troubles trying to use the old Horvath computation with the latest sesame version)? Like this I could check if indeed both methods give the same results or not, or if preprocessing changes due to my update to v.1.19.5 might be involved. Thank you again for your help.

oliemery commented 1 year ago

Although I was not able to run the deprecated method on my computer which has sesame version 1.19.5, I did find a method from the wateRmelon package called agep which computes Horvath age predictions. Using it with the example sample you provided, it gives the same result as predictAge using the Horvath353 model:

betas <- sesameDataGet('HM450.1.TCGA.PAAD')$betas
# Load previously downloaded Horvath 353 clock model
Horvath353_model <- readRDS("~/Downloads/Clock_Horvath353.rds")
predictAge(betas,Horvath353_model)
# [1] 84.13913

 library(wateRmelon)
 betas <- as.matrix(betas)     # conversion to matrix to make it compatible with agep
 # Horvath is the default method for agep
agep(betas)
#  horvath.age horvath.n_missing
#1    84.13913                 0

However if I use another input sample ('EPIC.imputationDefault') which I presume to contain EPIC v1 data, the results are different between the two methods:

# Build betas from 'EPIC.imputationDefault'
betas <- as.numeric(sesameDataGet('EPIC.imputationDefault')$Blood$mean)
names(betas) <- sesameDataGet('EPIC.imputationDefault')$Blood$Probe_ID
predictAge(betas,Horvath353_model)
#[1] 34.97708

# Compute Horvath age prediction with agep
library(wateRmelon)
betas <- as.matrix(betas) # conversion to make it compatible
agep(betas)
#  horvath.age horvath.n_missing
#1    33.55048                19

which is not as much as I had previously seen on my other samples (2.1-2.2 year difference, here we're closer to 1.5 years) but still rather different for an identical sample (with the same beta values, so no difference in normalized values). Could you please test this sample with predictAgeHorvath353 to confirm that the result is indeed different compared to predictAge? agep did give the same results as what I had computed with predictAgeHorvath353 previously for about 1000 samples (normalized beta values imported from a R object file so no changes due to normalization changes between my two pipeline versions in this case), while predictAge had for most of the time a difference of 1.4266026 (for some samples a bit less from 0.8350705 to 1.3459709 while one had a 16 year difference between methods). The fact that some probes are indicated as missing by agep may play a role (perhaps this was not considered or accounted for differently in predictAge ?). Interestingly the difference in the example sample above is also 34.97708-33.55048=1.4266 years older with predictAge than with agep. From this investigation I'm under the impression that agep is giving correct values corresponding to those of the now deprecated predictAgeHorvath353 while predictAge with the Horvath clock model gives higher values (assuming predictAgeHorvath353 and agep give the same results for all samples). Although I haven't tested with other HM450 samples, the issue seems to be specific to EPIC v1 samples.

I've looked at the missing sites in the Horvath age computation with the EPIC file above:

> agep(betas, missing_probes = TRUE)
  horvath.age horvath.n_missing
1    33.55048                19
                                                                                                                                                                                            horvath.missing_probes
1 cg02654291;cg02972551;cg09785172;cg09869858;cg13682722;cg14329157;cg16494477;cg17408647;cg19167673;cg19273182;cg19945840;cg27319898;cg04431054;cg05590257;cg06117855;cg19046959;cg19569684;cg24471894;cg27016307

Most of them (15/19) were present in HM450 but removed in EPIC v1 (i.e. they can be found in the file listing missing legacy CpG sites between HM450 and EPIC v1 platforms from the illumina website here)

zwdzwd commented 1 year ago

Thanks for providing this info. I tested and realized that it was due to a change in the way sesame handle missing data on the EPIC platform. Previous version (same as WaterMelon), we simply limit the model to the remaining probes. The new implementation (1.19+) has a fallback value for missing measurement. I switched this behavior back predictAge(na_fallback=FALSE) by default in 1.19.7. The new code provides both the fallback mechanism and intersection method.

betas <- as.numeric(sesameDataGet('EPIC.imputationDefault')$Blood$mean)
names(betas) <- sesameDataGet('EPIC.imputationDefault')$Blood$Probe_ID
Horvath353_model <- readRDS("~/Downloads/Clock_Horvath353.rds")
predictAge(betas,Horvath353_model)
[1] 33.55048

> sesame_checkVersion()
SeSAMe requires matched versions of R, sesame, sesameData and ExperimentHub.
Here is the current versions installed:
R: 4.3.0
Bioconductor: 3.17
sesame: 1.19.7
sesameData: 1.19.0
ExperimentHub: 2.8.1

Hope this settles the puzzle.

oliemery commented 1 year ago

Thank you very much for the update. I was wondering how are those fallback values determined and which option (i.e. with or without fallback) is supposed to be the most accurate? I updated sesame to 1.19.7 and re-ran Horvath computations with the predictAge function and the corresponding clock model, and this time the age values were the same as my previous computations with predictAgeHorvath353 for all the 1000 samples, so it looks like you found out what was happening, fixed it and this explains the differences I was observing previously. Thanks again

Greetings from Switzerland