Article Text

## Abstract

**Objective:** The use of multilevel modelling with data from population-based surveys is often limited by the small number of cases per level-2 unit, prompting many researchers to use single-level techniques such as ordinary least squares regression.

**Design:** Monte Carlo simulations are used to investigate the effects of data sparseness on the validity of parameter estimates in two-level versus single-level models.

**Setting:** Both linear and non-linear hierarchical models are simulated in order to examine potential differences in the effects of small group size across continuous and discrete outcomes. Results are then compared with those obtained using disaggregated techniques (ordinary least squares and logistic regression).

**Main results:** At the extremes of data sparseness (two observations per group), the group level variance components are overestimated in the two-level models. But with an average of only five observations per group, valid and reliable estimates of all parameters can be obtained when using a two-level model with either a continuous or a discrete outcome. In contrast, researchers run the risk of Type I error (standard errors biased downwards) when using single-level models even when there are as few as two observations per group on average. Bias is magnified when modelling discrete outcomes.

**Conclusions:** Multilevel models can be reliably estimated with an average of only five observations per group. Disaggregated techniques carry an increased risk of Type I error, even in situations where there is only limited clustering in the data.


Population health outcomes are shaped by complex interactions between individuals and the diverse social and environmental contexts (eg neighbourhoods, schools and regions) in which they are situated over the life course.1–5 The recent increase in the use of multilevel models to examine associations between group level characteristics and a wide range of individual health indicators6–9 attests to their value as a statistical method for analysing grouped or clustered data. But the use of multilevel modelling with data from population-based surveys is often limited by *data sparseness*: a small number of cases per level-2 unit.

Although large-scale surveys make it relatively easy to achieve a large number of groups, there are often very few individuals per group. For example, the National Longitudinal Study of Adolescent Health (Add Health) is a school-based survey of health behaviours in American adolescents.10 However, because schools (not neighbourhoods) were the sampling frame, there is considerable sparseness in the data for those interested in examining “neighbourhood” effects on health. In the first wave of the survey, 16 683 students are nested in 2276 census tracts (a typical geographic area used to approximate neighbourhoods), yielding an average of 7.33 subjects per tract. Owing to subject attrition and residential mobility, sparseness only increases across waves of longitudinal follow-up.

When there is a high level of clustering within groups (ie large group size), it is well known that disaggregation of the data by using ordinary least squares (OLS) regression analysis leads to an elevated risk of Type I error (concluding that there are significant effects when in fact these effects may have occurred by chance).11 Because the observations are treated as if they were independent, the standard errors of the regression coefficients are biased downwards, generating artificially narrow CIs. Multilevel models appropriately partition within-group and between-group effects so that a high level of clustering within groups is statistically accounted for. But little is known about the *lower* threshold at which data sparseness renders multilevel models unreliable or even unnecessary. There are various rules of thumb stated in the literature, often in the range of 15 to 30 per group.12–14 But the wide range of numbers given in these recommendations reflects the fact that very little research has been done to explicitly test the minimum level of clustering necessary for valid and reliable estimates in multilevel models.

Simulations designed to assess the data sparseness problem are beginning to appear in the literature, and results suggest that the number of groups is more important for unbiased and efficient estimates than the number of observations per group.11 15–18 This is reassuring, since a shortage of groups is rarely a problem in population-based data. Yet researchers continue to be concerned with small group sizes when examining contextual effects on health, and have adopted various strategies to deal with data sparseness. A common strategy is to ignore the hierarchical structure of the data altogether by using OLS regression techniques.19–21 A related strategy involves the use of generalised estimating equations (GEEs),7 22 which essentially bypasses the sparseness problem by treating the group level variance as a “nuisance factor” that is adjusted for in the analysis but not explicitly investigated.

However, without quantifying the random components at level 2, both GEE and OLS approaches essentially decontextualize the importance of the fixed effect parameters.9 14 23 24 It is well recognised that it is possible to find large significant fixed effects (at level 2) in conjunction with trivial between-group variation.9 23 25 By ignoring the group level random effects the pitfalls of using OLS or GEEs are considerable, particularly from a health policy standpoint. For example, a study may show a significant mortality risk for those living in socioeconomically disadvantaged neighbourhoods. But an intervention targeted towards disadvantaged neighbourhoods may be misplaced if trivial between-group variance in mortality indicates that they are not dissimilar from other apparently advantaged neighbourhoods.

The neighbourhood literature has also witnessed a recent trend where cluster analysis techniques are used to reduce data sparseness.8 26–29 Respondents are grouped together into larger “synthetic” neighbourhoods, yielding a larger number of cases per level-2 unit. Yet, the effects of such clustering techniques on the accuracy of model parameters have not been widely considered. In recent work30 we showed that clustering strategies introduce artificial within-group heterogeneity by grouping together individuals who may differ on a host of unobserved characteristics. As a consequence, the balance of within- to between-group variance is upset, and clustering has the ironic consequence of reducing the variance between neighbourhoods.29 30

Although concerns about data sparseness abound, researchers appear to be unconcerned with the consequences of using alternative statistical techniques that ignore the hierarchical structure of the data, disregard the variance components or artificially manipulate the contextual grouping in the data. In earlier work30 we examined the effect of various group sizes (gs = 2, 5, 10, 20), number of groups (ng = 50, 100, 200) and strength of group-level dependency (intraclass correlation coefficient (ICC) = 0.1, 0.2, 0.3) on bias and efficiency in two-level linear models. We found no evidence of bias in the estimates of the fixed effects across conditions, but sampling variability increased as group size decreased. We did, however, find evidence of upwards bias in the group level variance components in conditions with small group size (gs⩽2) and a small number of groups (ng = 50). This bias disappeared when the number of groups reached 200, even with only two observations per group.

The purpose of this article is to expand on this work by using Monte Carlo simulations to compare the validity and efficiency of estimates from two-level and single-level linear and non-linear models when there is minimal clustering in groups. The results show that more sparseness can be tolerated in multilevel models than is generally assumed. Disaggregated techniques carry an increased risk of Type I error, even in situations when there is only limited clustering in the data.

## THE HIERARCHICAL GENERALISED LINEAR MODEL

In general, the two-level model can be conceptualised as a hierarchical system of regression equations within *J* contextual groups, with *N_{j}* individuals in each level-2 group.31 At the individual level (level 1) there are separate regression equations for each group. For linear models the identity link function regresses the dependent variable *Y_{ij}* on a linear predictor set of one (or more) independent variables *X_{ij}*, with normally distributed residuals (*e_{ij}*) having a mean of 0 and variance *σ*^{2}:

Y_{ij} = β_{0j}+β_{1j}X_{ij}+e_{ij}.(1.1)

For non-linear models, various link functions linearise an underlying non-linear predictor component. For the case of a binary outcome with a binomial error distribution, the logit link function is used to regress the log odds of the response probability, or proportion *π_{ij}*, on a linear predictor set of independent variables:

logit(π_{ij}) = β_{0j}+β_{1j}X_{ij}.(1.2)

For both the linear and non-linear models these level-1 coefficients can then be modelled by explanatory variables at the contextual level-2 (eg neighbourhood poverty):

β_{0j} = γ_{00}+γ_{01}Z_{j}+u_{0j}(2)

β_{1j} = γ_{10}+γ_{11}Z_{j}+u_{1j}(3)

By substituting equations (2) and (3) into equations (1.1) and (1.2) and rearranging terms, we get the full two-level linear model:

Y_{ij} = γ_{00}+γ_{10}X_{ij}+γ_{01}Z_{j}+γ_{11}Z_{j}X_{ij}+u_{0j}+u_{1j}X_{ij}+e_{ij}(4.1)

and the full two-level logistic model:

logit(π_{ij}) = γ_{00}+γ_{10}X_{ij}+γ_{01}Z_{j}+γ_{11}Z_{j}X_{ij}+u_{0j}+u_{1j}X_{ij}.(4.2)

In both models, *u*_{0j} represents group level variability around the intercept, which is assumed to be normally distributed with a mean of 0 and variance *τ*_{00}, and *u*_{1j} represents group level variability around the regression slope, which is assumed to be normally distributed with a mean of 0 and variance *τ*_{11}. The covariance between the group level variance terms *u*_{0j} and *u*_{1j} is *τ*_{01}, which may be positive or negative. All residual errors at the group level are assumed to be independent of the individual level within-group residuals (*e_{ij}*). For the binomial error distribution (equation 4.2), the level-1 error variance is a function of the population proportion (σ^{2} = π_{ij}(1−π_{ij})) and is not estimated separately (a scale factor of 1 is used).32

If there are no explanatory variables at levels 1 or 2, equations (4.1) and (4.2) reduce to:

Y_{ij} = γ_{00}+u_{0j}+e_{ij}(5.1)

logit(π_{ij}) = γ_{00}+u_{0j},(5.2)

which are the fully unconditional, or one-way ANOVA, models for the linear and logistic case, respectively. Partitioning the variance components yields a useful statistic, the intraclass correlation coefficient (ICC), which measures the proportion of variance in the outcome that is accounted for by the group level.31 For the linear model the ICC is defined as:

ICC = τ_{00}/(τ_{00}+σ^{2}).(6.1)

Since the binomial distribution for the logistic link function with a scale factor of 1 implies a level-1 variance of π^{2}/3,32 the ICC for the non-linear model is defined as:

ICC = τ_{00}/(τ_{00}+π^{2}/3).(6.2)

For full models (with covariates), a conditional ICC can be calculated based on an adjusted value of *τ*_{00}, representing the degree of dependence among observations within groups at a given value on the covariates.31
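As an illustration of how the unconditional ICC can be recovered from grouped data in practice, the following is a minimal numpy sketch of the classical one-way ANOVA (moment) estimator for the linear case. The function name and the simulation values in the usage note are illustrative, not from the article (whose analyses were run in Mplus).

```python
import numpy as np

def icc_oneway(y, groups):
    """Estimate the ICC from the fully unconditional (one-way ANOVA) model.

    Uses the classical mean-squares estimator for balanced data:
        ICC = (MSB - MSW) / (MSB + (m - 1) * MSW)
    where m is the common group size, MSB is the between-group mean
    square and MSW is the within-group mean square.
    """
    labels = np.unique(groups)
    n, k = len(y), len(labels)
    m = n / k  # common (average) group size
    grand_mean = y.mean()
    group_means = np.array([y[groups == g].mean() for g in labels])
    # Between-group mean square
    msb = m * np.sum((group_means - grand_mean) ** 2) / (k - 1)
    # Within-group mean square
    msw = sum(np.sum((y[groups == g] - y[groups == g].mean()) ** 2)
              for g in labels) / (n - k)
    return (msb - msw) / (msb + (m - 1) * msw)
```

For example, simulating data from equation (5.1) with τ_{00} = σ^{2} = 1 (true ICC = 0.5) and 500 groups of 20 returns an estimate close to 0.5.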

## METHODS

### Simulation procedure

A Monte Carlo simulation33 is conducted with a two-level model. For parsimony, the number of groups is held constant at 200, and group size is varied at 2, 5, 10 and 20 observations per group. The number of groups is chosen to represent the large number of groups typically found in population-based survey data, and the group sizes capture the extremes of data sparseness as well as the larger group sizes typically tested in simulations.16 18 Following the simulation procedures used in the existing literature,16 1000 simulated data sets are generated for each of the four conditions, for both the linear and non-linear models.

The parameters for the simulation were set according to the full two-level models (equations 4.1 and 4.2). The intercept (*γ*_{00}) is set to 1.00, and the fixed effect coefficients (*γ*_{10}, *γ*_{01}, *γ*_{11}) to 0.3, representing a medium effect size.34 A set of *X* and *Z* values are randomly generated from a multivariate normal distribution. Following Maas and Hox16 the residual variance at level 1 (*σ*^{2}) is fixed to 0.5 in the linear model. (The level-1 error variance in the non-linear model is not estimated separately when using a scale factor of 1.)

The population values of the level-2 variance components are derived according to the formula for a conditional ICC value of 0.1 (capturing the lower values of group level clustering typically seen with population-based survey data adjusting for covariates35 36). Thus, the population value of the level-2 intercept variance (*τ*_{00}) is set to 0.056 for the linear model and 0.366 for the non-linear model, based on equations 6.1 and 6.2, respectively. For simplicity, *τ*_{00} and *τ*_{11} are constrained to be equal (following Maas and Hox16). The level-2 covariance (*τ*_{01}) is initially set to zero in the simulations, and then a negative covariance (*τ*_{01} = −0.02) is introduced to examine the effect on the parameter estimates. The value of the covariance was chosen to capture the degree of covariance between random intercept and slope parameters often found in contextual analyses with large-scale survey data.31 (No difference was found in the results between a negative and a positive covariance, so I report the results with the negative covariance only.)
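The quoted level-2 variance values follow from inverting equations 6.1 and 6.2: τ_{00} = ICC/(1 − ICC) × (level-1 variance). A quick arithmetic check, in Python (the helper name is illustrative):

```python
import math

def tau00_for_icc(icc, level1_var):
    # Invert ICC = tau00 / (tau00 + level1_var):
    #   tau00 = icc / (1 - icc) * level1_var
    return icc / (1.0 - icc) * level1_var

tau_linear = tau00_for_icc(0.1, 0.5)                 # linear model, sigma^2 = 0.5
tau_logistic = tau00_for_icc(0.1, math.pi ** 2 / 3)  # logistic model, pi^2/3

print(round(tau_linear, 3))    # 0.056
print(round(tau_logistic, 3))  # 0.366
```

Both values match those used in the simulations, confirming a conditional ICC of 0.1 in each model.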

Based on these parameters, values of *Y_{ij}* are generated for each of the simulated data sets (*Y_{ij}* is continuous for the linear model, whereas for the non-linear model *Y_{ij}* can take values of 0 or 1), and the effects of the four different conditions (group size = 2, 5, 10, 20) on the estimated parameter values are examined for the linear and non-linear models, and also in the face of a negative covariance. Reproducible streams of random numbers were generated in the simulations in order to maintain comparability across models. The effects of unbalanced data are also investigated by randomly sampling 60% of the observations from the simulated data to create inconsistent group sizes. For the unbalanced data the average group sizes are 1.4 (range 1–2 per group), 2.9 (range 1–5 per group), 5.9 (range 2–9 per group) and 12.0 (range 6–18 per group).
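The data-generating step for the linear case can be sketched as follows: one balanced replicate is drawn from the full two-level linear model (equation 4.1) under the zero-covariance condition, using the population values given above. The function name is illustrative; the article's simulations were run in Mplus, so this is a sketch of the logic, not the actual implementation.

```python
import numpy as np

def simulate_linear_replicate(ng=200, gs=5, seed=0):
    """Draw one balanced data set from equation (4.1).

    Population values follow the text: gamma_00 = 1.0, the remaining
    fixed effects = 0.3, sigma^2 = 0.5, tau00 = tau11 = 0.056, tau01 = 0.
    """
    rng = np.random.default_rng(seed)
    g00, g10, g01, g11 = 1.0, 0.3, 0.3, 0.3
    groups = np.repeat(np.arange(ng), gs)          # group index per observation
    x = rng.normal(size=ng * gs)                   # level-1 covariate X_ij
    z = rng.normal(size=ng)                        # level-2 covariate Z_j
    u0 = rng.normal(0.0, np.sqrt(0.056), ng)       # random intercepts u_0j
    u1 = rng.normal(0.0, np.sqrt(0.056), ng)       # random slopes u_1j
    e = rng.normal(0.0, np.sqrt(0.5), ng * gs)     # level-1 residuals e_ij
    y = (g00 + g10 * x + g01 * z[groups] + g11 * z[groups] * x
         + u0[groups] + u1[groups] * x + e)
    return y, x, z, groups
```

Repeating this draw 1000 times per condition, fitting the model to each replicate, and collecting the estimates reproduces the structure of the simulation design described above.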

All simulations are conducted using Mplus Version 4.21.37 Models are estimated using pseudo-maximum likelihood estimation for general multilevel modelling, and the standard errors are computed using a robust sandwich estimator. Non-linear models are estimated using a numerical integration algorithm. Standard OLS regression was used to estimate disaggregated linear models, and logistic regression was used to estimate disaggregated binary models.

### Statistical analysis

The effects of small group size are examined in terms of bias for all parameter estimates and their standard errors. Bias is assessed by examining whether the mean of the sampling distribution of estimates under each condition centres on the true value. If *θ* ∧ is the sample estimate of the population parameter *θ*, then bias = (E(*θ* ∧)–*θ )/θ.* Bias exceeding 10% for any parameter is generally considered to be meaningful.38 The precision or efficiency of the estimates is ascertained by examining the sampling distribution of standard errors for each parameter under each simulated condition. The standard deviation of the parameter estimates in the simulations is an indicator of the population standard error when the number of replications is large.38 This is compared with the average of the estimated standard errors for each parameter estimate in the simulations, and large standard errors (wide CIs) indicate decreased efficiency.
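The bias criterion above is a one-line calculation; the sketch below shows it applied to a case sitting exactly at the conventional 10% cut-off (the numbers are illustrative, not taken from the article's tables).

```python
def relative_bias(estimates, true_value):
    """Relative bias of a sampling distribution: (E[theta_hat] - theta) / theta."""
    mean_est = sum(estimates) / len(estimates)
    return (mean_est - true_value) / true_value

# A parameter averaging 0.33 over replications when the true value is
# 0.30 exhibits 10% upward bias -- the threshold generally considered
# meaningful in the text.
print(round(relative_bias([0.32, 0.33, 0.34], 0.30), 2))  # 0.1
```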

## RESULTS

All two-level models for each of the 1000 simulated data sets under each of the four conditions converged successfully when there were at least five observations per group and balanced data. However, model performance was compromised in the face of very small group size and unequal group size (unbalanced data). For the linear two-level model with group sizes of two (balanced data), only 86.7% of the models successfully converged. With unbalanced data and very small group sizes (N⩽2) only 69.3% converged. Model performance was not problematic with the non-linear two-level models in the face of sparse data.

Table 1 presents the fixed effects and associated standard errors for the linear model estimated both with maximum likelihood (linear two-level model), as well as with OLS in a disaggregated single-level regression. Results are presented for both balanced and unbalanced data. The true parameter values are indicated in parentheses in table 1. For each of the four simulation conditions, the fixed effects and the standard errors for the two-level linear model are estimated without bias. Even at the extremes of data sparseness (group size ⩽2), bias in the fixed effect parameter estimates is trivial (less than 1%). Although the fixed-effect coefficients are estimated with decreased precision in the face of unbalanced data (wider confidence intervals), there is no evidence of bias in the standard errors.

Parameter estimates from OLS models are also unbiased (table 1). However, the standard errors of these parameter estimates are consistently underestimated with OLS. Even when clustering is marginal (⩽2 observations per group), standard errors of the regression parameters are underestimated by 10–15%. As group size increases, bias only gets worse. With five observations per group, OLS standard errors are biased downwards by 15% on average, and standard errors are underestimated by as much as 40% with 20 observations per group.
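The broad magnitude of this standard-error underestimation can be anticipated with the textbook Kish design effect, DEFF = 1 + (m − 1)·ICC for groups of size m: ignoring clustering understates the standard error of a group-constant predictor by roughly a factor of 1/√DEFF. This is a classical approximation for random intercepts only, not the article's simulation (which also includes random slopes, and accordingly shows somewhat larger bias at group size two). A sketch:

```python
import math

def se_understatement(m, icc):
    """Approximate proportional SE underestimation when clustering is ignored.

    Based on the Kish design effect DEFF = 1 + (m - 1) * icc: the naive
    (OLS) standard error is roughly 1/sqrt(DEFF) times the correct one,
    so the shortfall is 1 - 1/sqrt(DEFF).
    """
    deff = 1.0 + (m - 1) * icc
    return 1.0 - 1.0 / math.sqrt(deff)

for m in (2, 5, 20):
    print(m, round(se_understatement(m, 0.1), 2))
# 2 0.05
# 5 0.15
# 20 0.41
```

With ICC = 0.1, the approximation gives roughly 15% at five observations per group and roughly 41% at 20, in line with the simulation results reported above.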

Table 2 presents the fixed effects and standard errors for the binary outcome estimated with a two-level non-linear model and a single-level logistic regression. For the non-linear two-level model the fixed effects and standard errors are generally estimated without bias across all conditions. The only exception is in the case of unbalanced data with a very small group size (⩽2), where the fixed effect coefficients are biased upwards by as much as 16%. When disaggregating the model to a single-level logistic regression, downward bias in the standard errors is present across all simulation conditions. Even with only very limited clustering in the data (⩽2 observations per group), the standard errors of the logistic regression coefficients are biased downwards by 25% on average. There is also evidence that the fixed effect for the group level parameter (*γ*_{01}) is consistently underestimated by about 20% across all group sizes when using single-level logistic regression. The level-1 parameter estimate (*γ*_{10}) is consistently overestimated in the presence of unbalanced data.

Table 3 presents the variance components and standard errors for both the linear and logistic two-level models across the four simulation conditions. With very small group size (group size ⩽2), the group level variance components are overestimated by over 30% in the non-linear two-level model, with upwards bias most pronounced in the face of unbalanced data. There is no evidence of bias in the two-level linear model when the group sizes are equal, but with unbalanced data and very small group size (⩽2) the group level variance components are also overestimated by as much as 35%. There is also decreased precision (larger standard errors) in the estimate of both linear and logistic random effect coefficients with marginal group sizes. When group size equals two with balanced data the standard error for the group level intercept variance is biased upwards by over 100% in the linear model, and the standard error for the group level slope variance is overestimated by 32% in the non-linear two-level model. As a result, the power to detect significant between-group variance in these effects falls below 0.3. Upwards bias in the standard errors only increases with unbalanced data and very small group size. But when average group size increases to five or more, there is no evidence of bias in the random effects or their standard errors for either the linear or logistic two-level models.

When introducing a negative covariance between the random intercept and slope parameters in the two-level models (with balanced data), there was no evidence of bias in the fixed effects or their standard errors (not shown). However, the group level variance components are overestimated for both the continuous and binary outcomes at very small group sizes (table 4). This upwards bias is corrected with group sizes of five or more. There was also decreased precision (larger standard errors) when estimating random coefficients in the presence of a covariance, particularly for the non-linear two-level model. Again, though, efficiency increases once there are at least five observations per group.

## DISCUSSION

Using Monte Carlo simulations, this paper aimed to empirically define the limits of data sparseness in two-level models so that informed analytic decisions can be made when working with sparsely clustered data. Consistent with existing research in this area,15–18 two-level models generate unbiased estimates of the fixed effects and their standard errors, even at the extremes of data sparseness. This holds for both continuous and discrete outcomes. The only exception is for the non-linear model with unbalanced data and very small group size (⩽2), where the fixed effect coefficients are biased upwards.

Although earlier work30 with two-level linear models found evidence of upwards bias in the group level variance components estimated with sparse data and a small number of groups (N = 50 groups), we found no evidence of bias with a large number of groups (N = 200 groups) and balanced data. However, the group level random effects are overestimated when using two-level models with unbalanced data at the extremes of data sparseness, and the upwards bias is exacerbated when modelling binary outcomes. Moreover, the standard errors of these random effects are overestimated with small group sizes, for both continuous and discrete outcomes. As a result, the decreased precision in these estimates reduces the power to detect significant between-group variance when group sizes are small. When the level of clustering increases to at least five observations per group, there was no evidence of bias in the two-level models for either the fixed or random effects or their standard errors.

If one were to ignore the clustering in the data and conduct OLS or logistic regression analysis, unbiased estimates of the fixed effects would generally be obtained. (The exception is for non-linear models where the contextual effects are consistently underestimated by about 20% for all group sizes.) However, the standard errors of all fixed effect coefficients are biased downwards when estimated with OLS or logistic regression. This is the case even when the level of clustering is marginal (group size = 2). As a result, researchers using disaggregated models with clustered data increase the risk of Type I statistical error. This underestimation is more problematic for non-linear models than for linear models.

These results highlight potential problems that are likely to occur when working with sparsely clustered data. If one were to disregard the importance of the random effects, the disaggregated approach results in artificially narrow CIs and raises the risk of Type I statistical error. On the other hand, a two-level analysis with sparsely clustered data generates excess sampling variability, decreasing the power to detect between-group variance. However, the multilevel approach still generates valid fixed parameter estimates and standard errors with two or more observations per group. But perhaps most importantly, these simulation results have demonstrated that more sparseness can be tolerated than is generally assumed. Even with an average of only five observations per group, valid and reliable estimates of all parameters can be obtained when using a two-level model with either a continuous or a discrete outcome. In contrast, researchers run the risk of Type I error when using disaggregated techniques even when there are as few as two observations per group on average.

In summary, alternative analytic strategies that ignore the level of clustering in the data are misdirected and not empirically based. Too many analytic decisions are incorrectly based on the assumption that multilevel models are only appropriate for densely clustered data.19 21 The results of these simulations have proven otherwise. Multilevel models are a robust analytic tool, generating valid and reliable estimates of the fixed effect parameters, even with very limited clustering in the data. Researchers should be cognizant of the pros and cons of using disaggregated techniques when working with clustered data.

#### What is already known on this subject

The use of multilevel modelling with data from population-based surveys is often limited by data sparseness: a small number of cases per level-2 unit.

Very little research has been done to test the general threshold at which data sparseness becomes problematic for unbiased and efficient estimates in multilevel models.

#### What this study adds

Results show that with an average of only five observations per group, valid and reliable estimates of all parameters can be obtained when using two-level models with either a continuous or a discrete outcome.

Researchers run the risk of Type I error (standard errors biased downwards) when using single level models even when there are as few as two observations per group on average.

## Acknowledgments

This research was supported by the Canadian Institutes of Health Research Strategic Initiative: Population and Public Health Research Methods and Tools. Additional support for the analyses came from the University of Michigan Center for Social Inequality Mind and Body (NIH/NICHHD grant R24HD04786, Life Course Core). I thank Barb Strane for administrative support on this project.


## Footnotes

**Competing interests:** None.
