Longitudinal biomarker analysis in matched case-control pairs using R

Introdution

Linear mixed-effects models (LMMs) that account for nested random effects can be used to analyze longitudinal changes in a biomarker between matched case-control pairs. For a detailed tutorial on how to account for random effects according to your study design, check https://ourcodingclub.github.io/tutorials/mixed-models/.

Fig 1. Example of longitudinal study design with matched case-control pairs from our study (in review). Here it is clear that case-control pairs are matched by age, and they are also matched by sex.

We used LMMs to study changes in epigenetic biomarkers over time in Alzheimer’s disease (AD) cases matched with healthy controls by sex and age.

Here we present a mock dataset with fake matched cases and controls:

Table. Mock dataset exemplifying a longitudinal study design with AD case-control pairs matched by sex and age.

First, we used Generalized additive mixed models (GAMMs) to depict the longitudinal profile of each epigenetic biomarker, in order to check for potential non-linear associations in the biomarker. Unadjusted GAMMs were performed using the gamm4 function of the gamm4 R package.

library(gamm4)
library(splines)

epi_var <- "biomarker_1"

# Unadjusted GAMM model
gamm_model <- gamm4(biomarker_1 ~ 1 + s(age_blood_sampling), data=data, random= ~(1 | subject_number))

# Plot the model
plot(gamm_model$gam,shade=TRUE,shift=gamm_model$gam$coefficients[1],ylab=epi_var,residuals=TRUE,pch=20, pages=1)

Fig 2. Example of GAMM plots of a biomarker with slight non-linear association (Biomarker 0) and clear linear association over time (Biomarker 1). Here the time-scale is the age at blood sampling (age_blood_sampling), but other time-scales can be used, as you can see in the analysis below.

As we observed linear associations, our subsequent analyses used LMMs.

Linear mixed-effects models (LMMs)

In our data, longitudinal changes in the biomarkers were assessed employing linear mixed-effects models (LMMs). As we intended to access differential longitudinal changes between AD cases and controls, all LMMs include an interaction term between AD status as a binary indicator variable and time AD * Years_to_or_after_AD.

The models were adjusted for the covariates APOEe4_carrier, Ever_smoked, and Granulocyte_proportion that are relevant in an epigenetics and AD study. As the AD cases and controls are matched by age and sex, these covariates were not included in the models.

To be able to interpret the LMM estimates of the covariates in the output, you need to know the reference value of your variables. The model will use 0 (zero) as the reference. This also means that the model’s intercept represent the estimate when all variables = 0 (zero). Thus, when we set the categorical variable AD as controls = 0 and AD cases = 1 (see mock data), the LMM will estimate the change in the biomaker in AD cases, when compared with the controls. Also, by setting the age of AD onset = 0 in the Years_to_or_after_AD time-scale (see Fig 1 and mock data), we were able to get the estimates of the biomarker for AD at the age of AD onset.

For the same reason, when continuous variables are Z-transformed using scale the estimates are interpreted as the effect of one standard deviation’s (SD=1) from the mean (mean=0) increase in the estimate.

# Continuous covariates were z-transformed
data$Granulocyte_proportion <- scale(data$Granulocyte_proportion, scale=TRUE)

# Categorical covatiates were set as factors
data$APOEe4_carrier <- as.factor(data$APOEe4_carrier)
data$Ever_smoked <- as.factor(data$Ever_smoked) 
data$AD <- as.factor(data$AD)
data$subject_number <- as.factor(data$subject_number) 
data$matched_pair_number <- as.factor(data$matched_pair_number)

The longitudinal measures from the same subject (1 | subject_number) and the sex- and age-matched pairs (1 | matched_pair_number) were modeled as nested random effects to account for variability within these blocking variables. In particular, subjects were nested within matched pairs such that each matched pair of subjects are unique to that pair (see mock data).

LMMs were performed using the lmer function of the lme4 R package.

library(lme4)

LMM_model <- lmer(biomarker_1 ~ 1 
                  + APOEe4_carrier
                  + Ever_smoked
                  + Granulocyte_proportion
                  + AD * Years_to_or_after_AD
                  + (1 | subject_number) 
                  + (1 | matched_pair_number) 
                  , data = data)
summary(LMM_model)

Now we can estimate the outcome variable biomarker_1 in AD at the age of AD onset, of -0.0134672. This means that at AD onset (i.e., Years_to_or_after_AD = 0), AD cases have 0.0134672 lower estimates of the biomarker compared with controls. As the controls are the reference (i.e., intercept), at AD onset controls have 1.0409992 of the biomarker, thus AD cases have 1.0275320. However, the difference between AD cases and controls is not statistically significant (p-value = 0.31996).

The interpretation of the estimate for a continuous variable is a bit different. As the Granulocyte_proportion was z-transformed, the increase in = 1 standard deviation from the mean (= 0 after z-transformed) in the Granulocyte_proportion is significantly (p-value < 0.000001) associated with a increase of 0.0340426 in the outcome variable biomarker_1.

The interpretation of a LMM’s output can be tricky, and plotting the model’s prediction will aid your understanding.

library(ggplot2)
library(ggeffects)

mydf <- ggpredict(LMM_model, terms = c("Years_to_or_after_AD[all]", "AD"), type = "fixed")

 plot <- plot(mydf) 
(plot <- plot + labs(x = "Years to/after AD", y = "Biomarker 1")
        + theme_minimal()  
        + theme(legend.position = "none")
        + theme(legend.title = element_blank()) 
        + theme(strip.background = element_blank(), strip.text = element_blank())
        + theme(text = element_text(size =50)) 
        + theme(axis.line = element_line(color='black', size=1)))

Fig 3. Plot of the above LMM, in which biomarker 1 is the outcome variable. Red line depicts the prediction of biomarker 1 for Alzheimer’s disease cases (AD = 1), and blue the prediction for matched controls (AD = 0).

Longitudinal AD panel of DNA methylation sites

We used a similar longitudinal LMM approach to find DNA methylation (DNAm) sites (CpGs) that could differentiate AD cases from controls.

For that, CpGs were selected when having a significant (p-value < 0.001) difference in the model estimate of at least 0.05 in AD (≥|0.05|). As the CpGs’ β-values range from 0 to 1, an estimate = 0.05 represents a methylation difference between AD cases and controls of 5%.

We were also interested in CpGs with persistent difference over time, i.e. with absence of a cross-over interaction between AD status and time (AD * Years_to_or_after_AD).

library(lme4)

LMM_model_CpG <- lmer(CpG ~ 1 
                  + APOEe4_carrier
                  + Granulocyte_proportion
                  + AD * Years_to_or_after_AD
                  + (1 | subject_number) 
                  + (1 | matched_pair_number) 
                  , data = data)
summary(LMM_model_CpG)

At AD onset, when the time-scale Years_to_or_after_AD = 0, the estimate of the CpG β-values in AD cases (AD= 1) is - 0.1048967 (i.e., ~10% lower β-values). This means that compared with controls that have estimated CpG’s β-value of 0.5949299 (i.e., intercept), AD cases have estimated CpG’s β-value of 0.4900332. This difference found between AD cases and controls is statistically significant (p-value = 0.00000356).

It is easier to see this in a plot:

library(ggplot2)
library(ggeffects)

mydf <- ggpredict(LMM_model_CpG, terms = c("Years_to_or_after_AD[all]", "AD"), type = "fixed")
mydf

plot_cpg <- plot(mydf) 
(plot_cpg <- plot_cpg 
  + labs(x = "Years to/after AD", y = "CpG β-values") 
  + theme_minimal()
  + theme(legend.title = element_blank()) 
  + theme(text = element_text(size =50)) 
  + theme(axis.line = element_line(color='black', size=1)))

Fig 4. Plot of the above LMM, in which CpG β-values (ranging from 0 to 1) are the outcome variable. Red line depicts the prediction of the CpGs β-values for Alzheimer’s disease cases (AD = 1), and the blue line is the prediction for matched controls (AD = 0).

Longitudinal biomarker analysis in matched case-control pairs using R

Fernanda Schäfer Hackenhaar et al

3/7/2023

Introdution

Linear mixed-effects models (LMMs)

Longitudinal AD panel of DNA methylation sites