Introduction

Clustering is the task of grouping a set of data points into subgroups such that the points within a subgroup (a cluster) are similar to each other and different from the points in other subgroups. However, clustering algorithms such as K-means do not necessarily take the temporal aspect of the data into account.

Growth mixture modelling (GMM) and latent class growth analysis (LCGA) are two longitudinal modelling techniques that identify homogeneous subpopulations based on growth trajectories. In medical research, these tools can be applied to study the different developmental trajectories within a population.

A useful framework for beginning to understand latent class analysis and growth mixture modelling is the distinction between person-centred and variable-centred approaches. Variable-centred approaches such as regression […] focus on describing the relationships among variables. The goal is to identify significant predictors of outcomes and to describe how dependent and independent variables are related. Person-centred approaches, on the other hand, include methods such as cluster analysis, latent class analysis, and finite mixture modelling. The focus is on the relationships among individuals, and the goal is to classify individuals into distinct groups or categories based on individual response patterns [1].

One example I find particularly interesting is the paper by Bandoli et al. [2], in which the authors examined whether patterns of prenatal alcohol exposure differentially affect dysmorphic features in infants.

Here, using longitudinal modelling, the authors found five distinct trajectories, which corresponded to high sustained, moderate/high, low/moderate sustained, low/moderate and minimal/no prenatal alcohol exposure. A dysmorphology score was then calculated and examined for association with the trajectory of prenatal alcohol exposure.

Example

The following two tutorials explain clearly how to carry out LCGA and GMM in R [3,4]. The tutorial in [5] can also be used to generate some simulated data.

Suppose you have data for 100 cases, each with 5 equally spaced repeated measures on a continuous outcome scale (500 data points in total).
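A dataset of this shape can be simulated in base R. The sketch below is only one possible setup: the two-class structure, the intercepts, slopes and noise level are all assumptions made for illustration, and the column names (ID, time, y) are chosen to match the model call used later in this post.

```r
# Simulate 100 cases x 5 equally spaced time points (500 rows, long format).
# The two-class structure and all numeric values are illustrative assumptions.
set.seed(42)

n_cases <- 100
n_times <- 5

# Hypothetical latent class for each case: half of the cases in each class
true_class <- rep(1:2, each = n_cases / 2)

# Class-specific intercepts and slopes (assumed values)
intercepts <- c(10, 20)
slopes     <- c(2, -1)

mydata <- data.frame(
  ID   = rep(seq_len(n_cases), each = n_times),
  time = rep(0:(n_times - 1), times = n_cases)
)

# Outcome = class-specific straight line + Gaussian noise with common variance
case_class <- true_class[mydata$ID]
mydata$y <- intercepts[case_class] +
  slopes[case_class] * mydata$time +
  rnorm(nrow(mydata), mean = 0, sd = 1)

head(mydata)
```

Keeping the data in long format (one row per case and time point) is what lets the grouping term in the model formula tie the repeated measures of a case together.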

Suppose we are now interested in separating this dataset into subpopulations, where individuals within the same group have very similar trajectories.

In R, both LCGA and GMM can be accomplished with the flexmix package. Below is an example with LCGA:

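library(flexmix)

# Fit LCGA models with 1 to 5 latent classes; rows are grouped by case (ID)
lcga_fit <- stepFlexmix(. ~ . | ID,
                        k = 1:5,      # candidate numbers of classes
                        nrep = 100,   # random restarts per value of k
                        model = FLXMRglmfix(y ~ time, varFix = TRUE),
                        data = mydata,
                        control = list(iter.max = 500, minprior = 0))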

This code fits the model once for each candidate number of latent classes, which allows us to choose the best-fitting model. The following parameters are of interest to us:

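k -> the numbers of latent subpopulations we want the model to test (here, 1 to 5).
nrep -> the number of random initialisations for each k. Any single run can converge to a local optimum that is not the best solution, so the fit is repeated from different starting points and the best run is kept.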

In all model configurations, y is modelled as a linear function of time within each class. Setting varFix = TRUE constrains the error variance to be the same in all classes.
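Written out, the model described above can be sketched as follows. The notation is my own and not taken from the flexmix documentation: i indexes cases, t indexes measurement occasions, c_i is the latent class of case i, and K is the number of classes.

```latex
y_{it} \mid (c_i = k) = \beta_{0k} + \beta_{1k}\,\mathrm{time}_{it} + \varepsilon_{it},
\qquad \varepsilon_{it} \sim \mathcal{N}(0, \sigma^2),
\qquad P(c_i = k) = \pi_k, \quad \sum_{k=1}^{K} \pi_k = 1
```

Each class k has its own intercept and slope, while the single \sigma^2 without a class index reflects the shared error variance.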

Example output:

iter	converged	k	Integrated Completed Likelihood
2	True	1	2311.291
5	True	2	1788.269
10	True	3	1784.632
34	True	4	1776.328
29	True	5	1776.853

This table indicates that the model converged for all five values of k. Using a model fit measure such as the integrated completed likelihood (ICL) or the Akaike information criterion, we can decide which model we want (here, the lower the score the better). For example, here we can examine the models with two and four latent subpopulations: the drop in ICL between k=1 and k=2 is the most dramatic, k=4 has the lowest ICL, and there is not much difference between k=4 and k=5.
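The comparison can be made explicit with a few lines of base R. This is a minimal sketch that re-enters the ICL values from the example output by hand; the names fit_table, delta and best_k are my own.

```r
# ICL values copied from the example output above
fit_table <- data.frame(
  k   = 1:5,
  ICL = c(2311.291, 1788.269, 1784.632, 1776.328, 1776.853)
)

# Change in ICL when one more class is added (negative = improvement)
fit_table$delta <- c(NA, diff(fit_table$ICL))

# Number of classes with the lowest ICL
best_k <- fit_table$k[which.min(fit_table$ICL)]

print(fit_table)
best_k
```

With the fitted object itself, getModel(lcga_fit, which = "ICL") should return the lowest-ICL model directly, and getModel(lcga_fit, which = "2") the two-class model, assuming lcga_fit was created by the stepFlexmix call above.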

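The package also allows us to check the posterior probabilities of the cluster assignments. The first table below is for the model with two latent subpopulations, and the second is for the model with four.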

	prior	size	post>0	ratio
Comp.1	0.5	250	265	0.943
Comp.2	0.5	250	255	0.980

	prior	size	post>0	ratio
Comp.1	0.318	160	250	0.640
Comp.2	0.234	120	255	0.471
Comp.3	0.266	130	245	0.531
Comp.4	0.181	90	230	0.391

Focusing on the first table, the model with two latent subpopulations assigned an equal number of observations to cluster 1 and cluster 2. Furthermore, the high ratio in both components indicates high confidence in the memberships: the ratio is size divided by post>0, so a value close to 1 means that few observations have a non-negligible posterior probability for a component they were not assigned to. This is, however, not the case for the four-class model in the second table. Although the model fit score is lower in this configuration, the low ratios and the rootograms indicate that, compared to the first model, this model cannot reliably differentiate between the four clusters.
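To make the columns of these tables concrete, the sketch below recomputes size, post>0 and ratio from a small posterior matrix in base R. The matrix is made up for illustration, and the threshold eps is an assumption about what counts as a non-negligible posterior probability.

```r
# Toy posterior matrix: 6 observations (rows) x 2 components (columns).
# These probabilities are invented for illustration only.
post <- rbind(
  c(1.00, 0.00),
  c(0.99, 0.01),
  c(0.60, 0.40),
  c(0.00, 1.00),
  c(0.30, 0.70),
  c(0.00, 1.00)
)
colnames(post) <- c("Comp.1", "Comp.2")

eps <- 1e-4  # assumed threshold for a non-negligible posterior probability

# size: number of observations assigned to each component (highest posterior)
assigned <- max.col(post, ties.method = "first")
size <- tabulate(assigned, nbins = ncol(post))

# post>0: number of observations with posterior above eps for each component
post_gt0 <- colSums(post > eps)

# ratio: close to 1 when the components are well separated
ratio <- size / post_gt0

data.frame(size = size, post_gt0 = post_gt0, ratio = ratio,
           row.names = colnames(post))
```

In this toy example several observations carry weight in both components, so the ratios fall well below 1, which is the same pattern seen in the four-class table.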

This is more evident when we plot the cluster memberships for each observation.

This suggests that the data are better described by a configuration with two latent subpopulations. Next, we can examine whether cluster membership is associated with any other covariates of interest, such as age, gender or demographic information.
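As a rough sketch of that next step, the example below cross-tabulates cluster membership against one categorical covariate and runs a chi-squared test of independence in base R. The data frame cases and all of its values are hypothetical; in practice the cluster column would hold the per-case assignments from the chosen model.

```r
# Hypothetical per-case data: assigned cluster plus one covariate.
# All values are invented for illustration.
set.seed(1)
cases <- data.frame(
  cluster = factor(rep(c(1, 2), each = 50)),
  gender  = factor(sample(c("F", "M"), size = 100, replace = TRUE))
)

# Cross-tabulate cluster membership against the covariate
tab <- table(cases$cluster, cases$gender)
tab

# Chi-squared test of independence between cluster and gender
chisq.test(tab)
```

For a continuous covariate such as age, a comparison of group means (for example with t.test) would be the analogous check.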

References

[1] Jung and Wickrama. An Introduction to Latent Class Growth Analysis and Growth Mixture Modeling.

[2] Bandoli et al. (2020). Patterns of Prenatal Alcohol Exposure and Alcohol-Related Dysmorphic Features.

[3] https://www.youtube.com/watch?v=cqnpN1k1mPk&ab_channel=RegorzStatistik

[4] https://www.youtube.com/watch?v=sQfIeOh3rJQ&ab_channel=RegorzStatistik

[5] Wardenaar, K. (2020). Latent Class Growth Analysis and Growth Mixture Modeling using R: A tutorial for two R-packages and a comparison with Mplus
