A few days ago I started working on hypotheses tests and confidence intervals for my project mixed_models, and I got pretty surprised by certain things.

Methods

There does not seem to be an agreement on a method to compute p-values (or whether to compute them at all) and confidence intervals for (generalized) linear mixed models in the scientific community. See for example the multitude of discussions on Cross Validated ((1), (2), (3), (4) among others), or the longish statement on the topic by the creator of lme4 and nlme Douglas Bates.

There are many ways to perform hypothesis tests and to compute confidence intervals for the fixed effects coefficients of a linear mixed model. For a list see for example this entry from the wiki of the r-sig-mixed-models mailing list. Unfortunately, the more accurate and universally applicable among the methods are computationally expensive and difficult to implement within Ruby’s current infrastructure of gems.

The method that is most convenient to compute is the Wald Z-test. However, its validity is often questionable. The wiki of the r-sig-mixed-models mailing list names the following reasons:

[Wald Z-statistics] are asymptotic approximations, assuming both that (1) the sampling distributions of the parameters are multivariate normal (or equivalently that the log-likelihood surface is quadratic) and that (2) the sampling distribution of the log-likelihood is (proportional to) \(\chi^2\). The second approximation is discussed further under “Degrees of freedom”. The first assumption usually requires an even greater leap of faith, and is known to cause problems in some contexts (for binomial models failures of this assumption are called the Hauck-Donner effect), especially with extreme-valued parameters.

Nevertheless, for now I decided to implement the Wald method only. It is still useful as a computationally light method for the initial data analysis, before falling back on the heavy weaponry. The LMM class provides methods to access all parameter estimates and information required in order to implement other methods to compute p-values or confidence intervals by the user, applicable to her specific situation.

For future extensibility I have included an argument :method in all of the methods.

Implementation and usage

Example data

For purposes of illustration, I use the same data as in my previous blog post. The simulated data set contains two numeric variables Age and Aggression, and two categorical variables Location and Species. These data are available for 100 individuals.

> alien_species.head
=> 
#<Daru::DataFrame:70197332524760 @name = 1cd9d732-526b-49ae-8cb1-35cd69541c87 @size = 10>
                  Age Aggression   Location    Species 
         0     204.95 877.542420     Asylum      Dalek 
         1      39.88 852.528392  OodSphere WeepingAng 
         2     107.34 388.791416     Asylum      Human 
         3     210.01 170.010124  OodSphere        Ood 
         4     270.22 1078.31219  OodSphere      Dalek 
         5     157.65 164.924992  OodSphere        Ood 
         6     136.15 865.838374  OodSphere WeepingAng 
         7     241.31 1052.36035      Earth      Dalek 
         8      86.84 -8.5725199     Asylum        Ood 
         9      206.7 1070.71900  OodSphere      Dalek 

We model the Aggression level of an individual as a linear function of the Age (Aggression decreases with Age), with a different constant added for each Species (i.e. each species has a different base level of aggression). Moreover, we assume that there is a random fluctuation in Aggression due to the Location that an individual is at. Additionally, there is a random fluctuation in how Age affects Aggression at each different Location.

We fit this model in Ruby using mixed_models with:

require 'mixed_models'
alien_species = Daru::DataFrame.from_csv './data/alien_species.csv'
model_fit = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)", 
                             data: alien_species)

Test statistics

The Wald Z test statistics for the fixed effects coefficients can be computed with:

model_fit.fix_ef_z

# => {:intercept=>16.882603431075875, :Age=>-0.7266646548258817, 
:Species_lvl_Human=>-1862.7747813759402, :Species_lvl_Ood=>-3196.2289922406044, 
:Species_lvl_WeepingAngel=>-723.7158917283754}

We see that the variable Species seems to have a huge influence on Aggression, while Age not so much.

P-values

Based on the above test statistics, we can carry out hypotheses tests for each fixed effects term \(\beta\subscript{i}\), testing the null

\[H\subscript{0} : \beta\subscript{i} = 0\]

against the alternative

\[H\subscript{a} : \beta\subscript{i} \neq 0.\]

The corresponding (approximate) p-values are obtained with:

model_fit.fix_ef_p(method: :wald)

# => {:intercept=>0.0, :Age=>0.4674314106158888, 
:Species_lvl_Human=>0.0, :Species_lvl_Ood=>0.0, 
:Species_lvl_WeepingAngel=>0.0}

We see that indeed the aggression level of each species is highly significantly different from the base level (which is the species Dalek in this model), while statistically we don’t have enough evidence to conclude that the age of an individual is a good predictor of his/her/its aggression level.

I have specified method: :wald above for illustration purposes only, because the Wald method is currently the default and the only available method. In the future I might implement other methods which are more reliable and more computationally difficult at the same time.

Confidence intervals

We can use the Wald method for confidence intervals as well. For example 90% confidence intervals for each fixed effects coefficient estimate can be computed as follows.

model_fit.fix_ef_conf_int(level: 0.9, method: :wald)

# => {:intercept=>[917.2710135369496, 1115.302428002405],
 :Age=>[-0.2131635992213468, 0.08253129235199347],
 :Species_lvl_Human=>[-500.13493113101106, -499.25245944940696],
 :Species_lvl_Ood=>[-900.0322606117458, -899.1063820954081],
 :Species_lvl_WeepingAngel=>[-200.04258166587766, -199.13533441813757]}

As for the p-values, the Wald method is currently the only option and the default.