Linear mixed-effects models (LMMs) that account for nested random effects can be used to analyze longitudinal changes in a biomarker between matched case-control pairs. For a detailed tutorial on how to account for random effects according to your study design, check https://ourcodingclub.github.io/tutorials/mixed-models/.
We used LMMs to study changes in epigenetic biomarkers over time in Alzheimer’s disease (AD) cases matched with healthy controls by sex and age.
Here we present a mock dataset with fake matched cases and controls:
First, we used Generalized additive mixed models (GAMMs) to depict
the longitudinal profile of each epigenetic biomarker, in order to check
for potential non-linear associations in the biomarker. Unadjusted GAMMs
were performed using the gamm4
function of the
gamm4
R package.
library(gamm4)
library(splines)
epi_var <- "biomarker_1"
# Unadjusted GAMM model
gamm_model <- gamm4(biomarker_1 ~ 1 + s(age_blood_sampling), data=data, random= ~(1 | subject_number))
# Plot the model
plot(gamm_model$gam,shade=TRUE,shift=gamm_model$gam$coefficients[1],ylab=epi_var,residuals=TRUE,pch=20, pages=1)
As we observed linear associations, our subsequent analyses used LMMs.
In our data, longitudinal changes in the biomarkers were assessed
employing linear mixed-effects models (LMMs). As we intended to access
differential longitudinal changes between AD cases and controls, all
LMMs include an interaction term between AD status as a binary indicator
variable and time AD * Years_to_or_after_AD
.
The models were adjusted for the covariates
APOEe4_carrier
, Ever_smoked
, and
Granulocyte_proportion
that are relevant in an epigenetics
and AD study. As the AD cases and controls are matched by age and sex,
these covariates were not included in the models.
To be able to interpret the LMM estimates of the covariates in the
output, you need to know the reference value of your variables. The
model will use 0 (zero) as the reference. This also means that the
model’s intercept represent the estimate when all variables = 0 (zero).
Thus, when we set the categorical variable AD
as controls =
0 and AD cases = 1 (see mock data), the LMM will estimate the change in
the biomaker in AD cases, when compared with the controls. Also, by
setting the age of AD onset = 0 in the Years_to_or_after_AD
time-scale (see Fig 1 and mock data), we were able to get the estimates
of the biomarker for AD at the age of AD onset.
For the same reason, when continuous variables are Z-transformed
using scale
the estimates are interpreted as the effect of
one standard deviation’s (SD=1) from the mean (mean=0) increase in the
estimate.
# Continuous covariates were z-transformed
data$Granulocyte_proportion <- scale(data$Granulocyte_proportion, scale=TRUE)
# Categorical covatiates were set as factors
data$APOEe4_carrier <- as.factor(data$APOEe4_carrier)
data$Ever_smoked <- as.factor(data$Ever_smoked)
data$AD <- as.factor(data$AD)
data$subject_number <- as.factor(data$subject_number)
data$matched_pair_number <- as.factor(data$matched_pair_number)
The longitudinal measures from the same subject
(1 | subject_number)
and the sex- and age-matched pairs
(1 | matched_pair_number)
were modeled as nested random
effects to account for variability within these blocking variables. In
particular, subjects were nested within matched pairs such that each
matched pair of subjects are unique to that pair (see mock data).
LMMs were performed using the lmer
function of the
lme4
R package.
library(lme4)
LMM_model <- lmer(biomarker_1 ~ 1
+ APOEe4_carrier
+ Ever_smoked
+ Granulocyte_proportion
+ AD * Years_to_or_after_AD
+ (1 | subject_number)
+ (1 | matched_pair_number)
, data = data)
summary(LMM_model)
Now we can estimate the outcome variable biomarker_1
in
AD at the age of AD onset, of -0.0134672. This means that at AD onset
(i.e., Years_to_or_after_AD
= 0), AD cases have 0.0134672
lower estimates of the biomarker compared with controls. As the controls
are the reference (i.e., intercept), at AD onset controls have 1.0409992
of the biomarker, thus AD cases have 1.0275320. However, the difference
between AD cases and controls is not statistically significant (p-value
= 0.31996).
The interpretation of the estimate for a continuous variable is a bit
different. As the Granulocyte_proportion
was z-transformed,
the increase in = 1 standard deviation from the mean (= 0 after
z-transformed) in the Granulocyte_proportion
is
significantly (p-value < 0.000001) associated with a increase of
0.0340426 in the outcome variable biomarker_1
.
The interpretation of a LMM’s output can be tricky, and plotting the model’s prediction will aid your understanding.
library(ggplot2)
library(ggeffects)
mydf <- ggpredict(LMM_model, terms = c("Years_to_or_after_AD[all]", "AD"), type = "fixed")
plot <- plot(mydf)
(plot <- plot + labs(x = "Years to/after AD", y = "Biomarker 1")
+ theme_minimal()
+ theme(legend.position = "none")
+ theme(legend.title = element_blank())
+ theme(strip.background = element_blank(), strip.text = element_blank())
+ theme(text = element_text(size =50))
+ theme(axis.line = element_line(color='black', size=1)))
We used a similar longitudinal LMM approach to find DNA methylation (DNAm) sites (CpGs) that could differentiate AD cases from controls.
For that, CpGs were selected when having a significant (p-value <
0.001) difference in the model estimate of at least 0.05 in
AD
(≥|0.05|). As the CpGs’ β-values range from 0 to 1, an
estimate = 0.05 represents a methylation difference between AD cases and
controls of 5%.
We were also interested in CpGs with persistent difference over time,
i.e. with absence of a cross-over interaction between AD status and time
(AD * Years_to_or_after_AD
).
library(lme4)
LMM_model_CpG <- lmer(CpG ~ 1
+ APOEe4_carrier
+ Granulocyte_proportion
+ AD * Years_to_or_after_AD
+ (1 | subject_number)
+ (1 | matched_pair_number)
, data = data)
summary(LMM_model_CpG)
At AD onset, when the time-scale Years_to_or_after_AD
=
0, the estimate of the CpG β-values in AD cases (AD
= 1) is
- 0.1048967 (i.e., ~10% lower β-values). This means that compared with
controls that have estimated CpG’s β-value of 0.5949299 (i.e.,
intercept), AD cases have estimated CpG’s β-value of 0.4900332. This
difference found between AD cases and controls is statistically
significant (p-value = 0.00000356).
It is easier to see this in a plot:
library(ggplot2)
library(ggeffects)
mydf <- ggpredict(LMM_model_CpG, terms = c("Years_to_or_after_AD[all]", "AD"), type = "fixed")
mydf
plot_cpg <- plot(mydf)
(plot_cpg <- plot_cpg
+ labs(x = "Years to/after AD", y = "CpG β-values")
+ theme_minimal()
+ theme(legend.title = element_blank())
+ theme(text = element_text(size =50))
+ theme(axis.line = element_line(color='black', size=1)))