Introduction
Over the past 10 years, latent class (LC) modeling has rapidly grown in use across a wide range of disciplines. As more and more applications are discovered, it is no longer known only as a method of clustering individuals based on categorical variables, but rather as a general modeling tool for accounting for heterogeneity in data. Vermunt and Magidson (2004) defined it more generally as virtually any statistical model where “some of the parameters … differ across unobserved subgroups” (Vermunt, J.K. and Magidson, J., 2004. Latent class analysis. In: M.S. Lewis-Beck, A. Bryman, and T.F. Liao (eds.), The Sage Encyclopedia of Social Science Research Methods, 549-553. Thousand Oaks, CA: Sage Publications).
Today’s fast computers together with the efficient algorithms used in Latent GOLD® make it possible to estimate LC models with many cases, many observed responses (indicators), and many explanatory variables. Extensions and variants of the basic model have been developed to include:
- response variables of mixed scale types, such as nominal, ordinal, (censored/truncated) continuous, and (truncated) counts
- several ordered categorical latent variables called discrete factors (DFactors)
- discrete and continuous covariates predicting class membership
- predictors of a repeatedly observed response variable
- provisions to relax the local independence assumption
- tools for dealing with sparse tables (bootstrap p values), boundary solutions (Bayes constants), local maxima (multiple start sets), and other problems.
- LC regression, LC growth, LC latent Markov (latent transition) models and many further extensions of the basic LC model.
What are latent classes or latent segments?
Latent classes are unobservable (latent) subgroups or segments. Cases within the same latent class are homogeneous on certain criteria, while cases in different latent classes are dissimilar from each other in certain important ways. Formally, latent classes are represented by K distinct categories of a nominal latent variable X.
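The formal definition above can be written out for the basic LC cluster model. The following is a sketch under the local independence assumption, not a statement of any particular program's parameterization. K and X come from the text; the response vector y_i, the item index t, and the number of indicators T are notation introduced here for illustration only.

```latex
% Basic latent class model: the observed response pattern of case i is a
% mixture over the K categories of the nominal latent variable X.
% Local independence: indicators are mutually independent within a class.
P(\mathbf{y}_i) \;=\; \sum_{x=1}^{K} P(X = x) \prod_{t=1}^{T} P(y_{it} \mid X = x),
\qquad \sum_{x=1}^{K} P(X = x) = 1 .
```

Here P(X = x) is the size of latent class x, and P(y_it | X = x) is the class-specific response distribution for indicator t.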
How do latent class models differ from other latent variable models?
Since the latent variable is categorical, LC modeling differs from more traditional latent variable approaches such as factor analysis, structural equation models, and random-effects regression models that are based on continuous latent variables.
Why is latent class modeling important?
Latent class (LC) modeling, also known as finite mixture modeling, provides a powerful way of identifying latent segments (types) for which parameters in a specified model differ. Latent GOLD®, the most Windows-friendly program for latent class modeling, focuses on the three most important kinds of statistical models used in practice: cluster, factor, and regression.
How does LC analysis, as implemented by Latent GOLD®, compare with traditional procedures for cluster analysis?
Traditional clustering procedures (K-Means, hierarchical clustering) are not model-based and therefore quite limited. LC clustering consistently recovers true structural groups where the traditional algorithms fail. See Bacher (2004).
1. Adequate Assumptions
K-Means makes assumptions, such as local independence or equal within-class variance, that often conflict with the real world. Latent GOLD® can be used to test these assumptions and relax them if they are found to be invalid. In practice, this typically yields a simpler segmentation (fewer segments) that is easier to interpret. See Magidson and Vermunt (2002a and 2002b).
2. Different Scale Types
Latent GOLD® allows for variables to be nominal, ordinal, continuous, count or any mixture of these, any of which may contain missing values. Different scale types are handled by automatically specifying the appropriate distribution. Moreover, additional scale types such as ranks, partial ranks, and discrete choice data can be analyzed using the Choice add-on.
3. Covariate-Based Profiling
After doing a traditional clustering, discriminant analysis or cross-tabs are often used to describe the resulting clusters, an approach confounded by misclassification and other errors. Latent GOLD® allows the inclusion of covariates for simultaneous parameter estimation (based on indicators) and descriptive profiling based on covariates.
Covariate-based prediction/classification is now available, so that new cases for which indicators are not present may be classified based solely on the covariates. Covariates can be continuous as well as categorical. In addition, SI-CHAID® links directly to Latent GOLD® for improved profiling capabilities. See Magidson and Vermunt (2005).
4. Optimal determination of number of clusters
In traditional clustering procedures, rules of thumb and ad-hoc guess-work are used to determine the number of clusters. Since LC is based on a statistical model, statistics are available to help determine the number of clusters. Latent GOLD® includes formal statistical assessment of the improvement resulting from an additional latent class.
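As a rough illustration of how a statistical model lets fit statistics guide the choice of the number of classes, the sketch below fits 1-, 2-, and 3-class models to a small made-up data set of five dichotomous items and compares them by BIC. It is not Latent GOLD®'s implementation. The EM routine, the choice of BIC as the criterion, the toy data, and all function names (`make_toy_data`, `fit_lc`, `bic`) are assumptions made for this example.

```python
import math
import random


def make_toy_data():
    """Hypothetical data: 200 cases, 5 binary items, two obvious response styles."""
    rows = []
    for base in (1, 0):
        rows += [[base] * 5 for _ in range(60)]        # uniform patterns
        for j in range(5):
            row = [base] * 5
            row[j] = 1 - base                           # one deviating item
            rows += [row[:] for _ in range(8)]
    return rows


def fit_lc(data, n_classes, n_iter=100, seed=0):
    """EM for a latent class model with binary items under local independence.

    Returns (loglik, sizes, probs); loglik is evaluated at the parameters
    that entered the final iteration.
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    sizes = [1.0 / n_classes] * n_classes
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    loglik = float("-inf")
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities for every case
        loglik, post = 0.0, []
        for y in data:
            joint = []
            for k in range(n_classes):
                p = sizes[k]
                for j, yj in enumerate(y):
                    p *= probs[k][j] if yj else 1.0 - probs[k][j]
                joint.append(p)
            total = sum(joint)
            loglik += math.log(total)
            post.append([p / total for p in joint])
        # M-step: update class sizes and class-specific item probabilities
        for k in range(n_classes):
            w = max(sum(r[k] for r in post), 1e-12)     # guard against empty class
            sizes[k] = w / len(data)
            for j in range(n_items):
                pj = sum(r[k] * y[j] for r, y in zip(post, data)) / w
                probs[k][j] = min(max(pj, 1e-6), 1.0 - 1e-6)  # keep off the boundary
    return loglik, sizes, probs


def bic(loglik, n_classes, n_items, n_cases):
    """BIC = -2 logL + (number of free parameters) * log(N); lower is better."""
    n_params = (n_classes - 1) + n_classes * n_items
    return -2.0 * loglik + n_params * math.log(n_cases)


data = make_toy_data()
for k in (1, 2, 3):
    # Several random start sets reduce the risk of reporting a local maximum.
    best = max(fit_lc(data, k, seed=s)[0] for s in range(5))
    print(k, "classes: BIC =", round(bic(best, k, 5, len(data)), 1))
```

The model with the lowest BIC would be preferred under this criterion. Other statistics, such as a bootstrap p value for the improvement in fit, could be used in the same comparison.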
How does LC analysis, as implemented by Latent GOLD®, compare with traditional procedures for regression analysis?
Traditional regression assumes homogeneity across an entire population, which does not allow for the existence of different segments. LC or mixture regression involves estimating a regression model under the assumption that the regression coefficients differ across unobserved (latent) segments, yielding improved predictions.
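The idea of class-specific regression coefficients can be sketched in formula form. This is a generic mixture regression written under stated assumptions, not necessarily the exact parameterization used by the program. The predictor vector z_i, the coefficient symbols, and the link function g are notation introduced here for illustration.

```latex
% LC (mixture) regression: the conditional density of the response is a
% mixture of K class-specific regression models.
f(y_i \mid \mathbf{z}_i) \;=\; \sum_{x=1}^{K} P(X = x)\, f(y_i \mid \mathbf{z}_i, X = x),
\qquad
g\!\left( E[\, y_i \mid \mathbf{z}_i, X = x \,] \right) \;=\; \beta_{0x} + \sum_{q=1}^{Q} \beta_{qx}\, z_{iq} .
```

Setting K = 1 gives a single set of coefficients for the whole population, which is the homogeneity assumption of traditional regression.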
1. Accounts for Heterogeneity
Traditional regression programs assume that the model holds true for the entire population. Latent GOLD® explores whether model heterogeneity can be explained by unobserved latent segments. Latent GOLD Advanced/Syntax also allows for continuous heterogeneity (CFactors). See Popper, Kroll, and Magidson (2014).
2. Allows differing dependent variable scale types
Latent GOLD®'s mixture regression module is in the Generalized Linear Models (GLM) framework. It allows for dependent variables that are dichotomous, nominal, ordinal, continuous, or count. Just select the scale type and the appropriate model is used (logit, multinomial logit, ordinal logit, normal, Poisson, or binomial count).
3. Repeated measures structure
Repeated measures structure allows for latent class growth models, latent class conjoint models, Rasch-type IRT models, survival models, and many other repeated-measures applications. Latent GOLD® uses a non-parametric random-coefficient model: the random effects are not assumed to come from a multivariate normal distribution. Besides making less restrictive assumptions, the LC regression model has the advantage of being extremely fast compared to parametric random-coefficient models when the outcome variable is non-normal. Several special outputs are available for this: the mean and standard deviation of coefficients, as well as individual effects. Latent GOLD® Advanced/Syntax also allows the estimation of parametric random-coefficient models.
4. Complex Sample Design
Latent GOLD® Advanced/Syntax allows the use of sampling weights, strata, etc., and estimates the design effect.
5. Multilevel Regression
For global segmentation, Latent GOLD® Advanced/Syntax provides for a simultaneous segmentation at both the individual and country level.
See Bijmolt, Paas, and Vermunt (2003), and Vermunt and Magidson (2005).
How was latent class modeling developed?
Latent class (LC) analysis was originally introduced by Lazarsfeld (1950) as a way of explaining respondent heterogeneity in survey response patterns involving dichotomous items. During the 1970s, LC methodology was formalized and extended to nominal variables by Goodman (1974a, 1974b) who also developed the maximum likelihood algorithm that serves as the basis for the Latent GOLD program.
Are latent class models the same as finite mixture models?
Over the same period that latent class models evolved, the related field of finite mixture (FM) models for multivariate normal distributions began to emerge through the work of Day (1969), Wolfe (1965, 1967, 1970), and others. FM models seek to separate out or ‘un-mix’ data that are assumed to arise as a mixture from a finite number of distinctly different populations. In recent years, the fields of LC and FM modeling have come together, and the terms LC model and FM model have become interchangeable. An LC model now refers to any statistical model in which some of the parameters differ across unobserved subgroups (Vermunt and Magidson, 2003a).