Background
Metabolic syndrome (MetS) has been shown to be a risk factor for many chronic diseases, but the components of MetS are still controversial. In recent literature, exploratory and confirmatory factor analyses have been used to test the latent structure amongst MetS components, and regression modelling has been used to test the relation between chronic diseases and MetS components. The MetS components, such as BMI, blood pressure and lipids, are in general correlated and clustered, which poses a challenge for statistical modelling. Collinearity amongst these components can have serious implications in regression analysis if not identified and treated with care. Whilst some exploratory analyses, such as principal component analysis (PCA), can provide effective insight into the structure of the data, the results are often difficult for the non-statistician to interpret and lack the descriptive detail needed to explain the clustering of the variables effectively.
Methods
The approach we propose draws on a number of ideas in combinatorial mathematics and cluster analysis to generate a group of dependent subsets. The group is transformed into a matroid to ensure that the subsets adhere to the basic axioms of linear dependence. This allows the structure to be displayed in clear hierarchical form and provides an immediate interpretation of the clustering of the MetS components. We consider data from a paper by Shen (2003) in the American Journal of Epidemiology.
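As an illustration of the first step only (generating dependent subsets), the following is a minimal sketch and not the procedure used in this study. It assumes that a subset of variables counts as "dependent" when the smallest eigenvalue of its correlation submatrix falls below a chosen collinearity threshold, and it keeps only minimal such subsets, which play the role of circuits in matroid terminology. The function name, the threshold value, the size limit and the synthetic data are all hypothetical, and the sketch does not enforce the matroid axioms on the resulting collection.

```python
from itertools import combinations

import numpy as np


def minimal_dependent_subsets(X, names, threshold, max_size=4):
    """Return minimal variable subsets that are near-linearly dependent.

    Assumption (not from the paper): a subset is 'dependent' when the
    smallest eigenvalue of its correlation submatrix is below `threshold`.
    """
    corr = np.corrcoef(X, rowvar=False)
    found = []
    for size in range(2, max_size + 1):
        for idx in combinations(range(len(names)), size):
            # Skip supersets of subsets already found, so only minimal
            # dependent subsets are kept.
            if any(set(c) <= set(idx) for c in found):
                continue
            sub = corr[np.ix_(idx, idx)]
            # eigvalsh returns eigenvalues in ascending order.
            if np.linalg.eigvalsh(sub)[0] < threshold:
                found.append(idx)
    return [tuple(names[i] for i in c) for c in found]


# Synthetic data for illustration only (not the Shen data):
# b is almost a copy of a, and e is almost c + d.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + 0.05 * rng.normal(size=n)
c = rng.normal(size=n)
d = rng.normal(size=n)
e = c + d + 0.05 * rng.normal(size=n)

X = np.column_stack([a, b, c, d, e])
names = ["a", "b", "c", "d", "e"]

result = minimal_dependent_subsets(X, names, threshold=0.05)
print(result)
```

Rerunning the function over a range of threshold values would give one collection of subsets per threshold, which is one way to examine dependencies at several levels of collinearity.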
Results
The matroid technique identifies groups of dependent components similar to those found by the factor analysis approach in the paper. These include: (1) glucose, PC glucose, insulin, PC insulin; (2) BMI, waist/hip ratio; (3) systolic BP, diastolic BP; and (4) HDL, triglycerides. However, the matroid method additionally illustrates the dependencies at multiple collinearity thresholds. It shows the strengths of these dependencies and others, along with their location in the overall structure of the components of MetS. This gives a concise depiction of the dependencies that the original factor analysis could not provide.
Conclusions
Presenting linear dependencies as subsets rather than latent variables gives the practitioner a greater choice of which variables to remove for regression analysis. This allows an informed decision about the inclusion of variables to be made using clinical knowledge as well as statistical reasoning, limiting the effects of collinearity in regression.