This is a demonstration of random effects in glmer() in R.
Set random seed, to ensure that we get the same results every time, and load any necessary packages
set.seed(123)
require(lattice) || install.packages("lattice")
## Loading required package: lattice
## [1] TRUE
require(lattice)
require(lme4) || install.packages("lme4")
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: Rcpp
## [1] TRUE
require(lme4)
Imaginary and slightly silly data:
Make a data frame with the fake data:
num_of_obs <- 1000 #how much total data I want
my_sd <- 50 #standard deviation for random noise
#make a data frame with speaker, each speaker's deviation from the baseline intercept, and a random sampling of times
f0_by_time <- data.frame(speaker=rep_len(x=c("Anna","Bianca","Carmen","Danielle","Evita"),length.out=num_of_obs), rand_intercept=rep_len(x=c(0,20,-40,40,-20),length.out=num_of_obs), hours_from_noon=runif(n=num_of_obs,min=-6,max=6))
#add f0 as a function of speaker's intercept and time (plus random noise)
f0_by_time$f0 <- (160 + f0_by_time$rand_intercept) + 10*f0_by_time$hours_from_noon + rnorm(n=num_of_obs,mean=0,sd=my_sd)
Plot it:
xyplot(f0_by_time$f0 ~ f0_by_time$hours_from_noon, groups=f0_by_time$speaker, auto.key=list(title="Speaker", corner=c(0.05, 1)), xlab="hours since noon", ylab="f0 (Hz)") #auto.key generates the legend
Here’s a model with no awareness of speaker (it thinks all the observations are independent):
f0.lm <- lm(f0 ~ hours_from_noon, data=f0_by_time)
summary(f0.lm)
##
## Call:
## lm(formula = f0 ~ hours_from_noon, data = f0_by_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -154.757 -36.769 -0.066 39.122 209.572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 160.5931 1.7868 89.88 <2e-16 ***
## hours_from_noon 9.8903 0.5182 19.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.5 on 998 degrees of freedom
## Multiple R-squared: 0.2674, Adjusted R-squared: 0.2667
## F-statistic: 364.3 on 1 and 998 DF, p-value: < 2.2e-16
Here’s a model that treats speaker as a fixed effect. The coefficient for each speaker is as compared to the baseline, Anna, whose intercept is the same as the overall intercept:
f0.lm.speaker_fixed <- lm(f0 ~ hours_from_noon + speaker, data=f0_by_time)
summary(f0.lm.speaker_fixed)
##
## Call:
## lm(formula = f0 ~ hours_from_noon + speaker, data = f0_by_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.900 -35.080 1.276 33.095 171.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164.2360 3.5485 46.283 < 2e-16 ***
## hours_from_noon 9.7348 0.4608 21.124 < 2e-16 ***
## speakerBianca 10.7086 5.0149 2.135 0.033 *
## speakerCarmen -40.5623 5.0231 -8.075 1.94e-15 ***
## speakerDanielle 35.0047 5.0170 6.977 5.49e-12 ***
## speakerEvita -23.3912 5.0151 -4.664 3.52e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.08 on 994 degrees of freedom
## Multiple R-squared: 0.4267, Adjusted R-squared: 0.4239
## F-statistic: 148 on 5 and 994 DF, p-value: < 2.2e-16
And one that treats speaker as a random effect:
f0.lmer.speaker_random <- lmer(f0 ~ hours_from_noon + (1|speaker), data=f0_by_time)
summary(f0.lmer.speaker_random)
## Linear mixed model fit by REML ['lmerMod']
## Formula: f0 ~ hours_from_noon + (1 | speaker)
## Data: f0_by_time
##
## REML criterion at convergence: 10677.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9120 -0.6978 0.0209 0.6574 3.4421
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker (Intercept) 853.9 29.22
## Residual 2508.1 50.08
## Number of obs: 1000, groups: speaker, 5
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 160.5881 13.1643 12.20
## hours_from_noon 9.7371 0.4608 21.13
##
## Correlation of Fixed Effects:
## (Intr)
## hors_frm_nn 0.001
ranef(f0.lmer.speaker_random)
## $speaker
## (Intercept)
## Anna 3.594084
## Bianca 14.148923
## Carmen -36.379264
## Danielle 38.093592
## Evita -19.457336
The numbers look pretty close.
Two more speakers are added, Francesca and Gaelle, but with just 20 observations each, and with extreme intercepts (-100 and 100.
#same procedure as before for making the fake data, just with fewer observations and two new speakers
num_of_obs_small <- 10
small <- data.frame(speaker=rep_len(x=c("Francesca","Gaelle"), length.out=num_of_obs_small), rand_intercept=rep_len(x=c(-100,100),length.out=num_of_obs_small), hours_from_noon=runif(n=num_of_obs_small,min=-6,max=6))
small$f0 <- (160 + small$rand_intercept) + 10*small$hours_from_noon + rnorm(n=num_of_obs_small,mean=0,sd=my_sd)
f0_by_time <- rbind(f0_by_time,small) #add the new data on to the old data frame
xyplot(f0_by_time$f0 ~ f0_by_time$hours_from_noon, groups=f0_by_time$speaker, auto.key=list(title="Speaker", corner=c(0.05, 1)), xlab="hours since noon", ylab="f0 (Hz)")
Let’s compare speaker coefficients as a fixed effect to speaker’s random intercepts.
Again, a model that treats speaker as a fixed effect.
f0.lm.speaker_fixed2 <- lm(f0 ~ hours_from_noon + speaker, data=f0_by_time)
summary(f0.lm.speaker_fixed2)
##
## Call:
## lm(formula = f0 ~ hours_from_noon + speaker, data = f0_by_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146.116 -34.998 1.229 33.165 172.106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164.258 3.553 46.233 < 2e-16 ***
## hours_from_noon 9.691 0.460 21.069 < 2e-16 ***
## speakerBianca 10.684 5.021 2.128 0.033592 *
## speakerCarmen -40.599 5.029 -8.073 1.96e-15 ***
## speakerDanielle 34.976 5.023 6.963 6.01e-12 ***
## speakerEvita -23.416 5.021 -4.664 3.53e-06 ***
## speakerFrancesca -87.068 22.706 -3.835 0.000134 ***
## speakerGaelle 112.711 22.722 4.960 8.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.14 on 1002 degrees of freedom
## Multiple R-squared: 0.4412, Adjusted R-squared: 0.4373
## F-statistic: 113 on 7 and 1002 DF, p-value: < 2.2e-16
And one that treats speaker as a random effect:
f0.lmer.speaker_random2 <- lmer(f0 ~ hours_from_noon + (1|speaker), data=f0_by_time)
summary(f0.lmer.speaker_random2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: f0 ~ hours_from_noon + (1 | speaker)
## Data: f0_by_time
##
## REML criterion at convergence: 10796.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9116 -0.6985 0.0243 0.6603 3.4320
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker (Intercept) 3255 57.05
## Residual 2515 50.15
## Number of obs: 1010, groups: speaker, 7
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 164.838 22.020 7.486
## hours_from_noon 9.709 0.460 21.108
##
## Correlation of Fixed Effects:
## (Intr)
## hors_frm_nn -0.005
ranef(f0.lmer.speaker_random2)
## $speaker
## (Intercept)
## Anna -0.5866296
## Bianca 10.0662803
## Carmen -41.0142812
## Danielle 34.2667318
## Evita -23.9025714
## Francesca -75.9094692
## Gaelle 97.0799392
These two new speakers’ fitted numbers are closer to zero when treated as random intercepts than when treated as fixed-effect coefficients. Let’s plot that:
xyplot(f0 ~ speaker, data=f0_by_time, panel=function(x,y) {
panel.xyplot(x,y)
panel.abline(h=mean(f0_by_time$f0), lty=2) #dotted line at the overall mean (should be about 160)
panel.segments(x0=c(1:7)-0.3,y0=f0.lmer.speaker_random2@beta[1]+ranef(f0.lmer.speaker_random2)$speaker[,1],x1=c(1:7)+0.3,y1=f0.lmer.speaker_random2@beta[1]+ranef(f0.lmer.speaker_random2)$speaker[,1], col="green", lwd=2) #a short green line at each speaker's random intercept (plus the overall intercept), given in terms of starting and ending x and y coordinates on the plot
panel.segments(x0=c(1:7)-0.3,y0=f0.lm.speaker_fixed2$coefficients[1]+c(0,f0.lm.speaker_fixed2$coefficients[3:9]),x1=c(1:7)+0.3,y1=f0.lm.speaker_fixed2$coefficients[1]+c(0,f0.lm.speaker_fixed2$coefficients[3:9]),col="red",lwd=2) # a short red line at each speaker's coefficient in the speaker-as-fixed-effect model (plus the overall intercept)
})
Red lines show intercept+coefficients in fixed-effects model. Green lines show intercept+random intercepts in mixed-effects model (not visible for first five speakers, because red line is on top of green line)
So how come the numbers are not exactly the same for Francesca and Gaelle? Why are they closer to the overall intercept?
One of the best tutorials I’ve seen on this is by USC’s Sandrah Eckel: (http://www-hsc.usc.edu/~eckel/talk_CDrandomInterceptFeb2010.pdf). I copied her format for showing the fixed-effects coefficients vs. random intercepts.