LATENT CLASS FACTOR AND CLUSTER MODELS, BI-PLOTS AND TRI-PLOTS
We propose an alternative method of conducting exploratory latent class analysis that utilizes latent class factor models, and compare it to the more traditional approach based on latent class cluster models. We show that when formulated in terms of R mutually independent, dichotomous latent factors, the LC factor model has the same number of distinct parameters as an LC cluster model with R+1 clusters. Analyses over several data sets suggest that LC factor models typically fit data better and provide results that are easier to interpret than the corresponding LC cluster models. We also introduce a new graphical "bi-plot" display for LC factor models and compare it to similar plots used in correspondence analysis and to a "tri-plot" display for LC cluster models. New results on identification of LC models are also presented. We conclude by describing various model extensions and an approach for eliminating boundary solutions in identified and unidentified LC models, that we have implemented in a new computer program.
The authors wish to thank Jeremy F. Magland, Leo A Goodman and Peter G.M. van der Heijden for helpful comments.
Latent class (LC) analysis is becoming one of the standard data analysis tools in social, biomedical, and marketing research. While the traditional LC model described by Lazarsfeld and Henry (1968) and Goodman (1974a, 1974b) contains only nominal indicator variables, variants have been proposed for ordinal (Clogg 1988; Uebersax 1993; Heinen 1996) and continuous indicators (Wolfe 1970; McLachlan and Basford 1988; Fraley and Raftery 1998), as well as for combinations of variables of different scale types (Lawrence and Krzanowski 1996; Moustaki 1996; Hunt and Jorgensen 1999; Vermunt and Magidson 2001). This paper concentrates on exploratory LC analysis with nominal and ordinal indicators.
In an exploratory LC analysis, the usual approach is to begin by fitting a 1-class (independence) model to the data, followed by a 2-class model, a 3-class model, etc., and continuing until a model is found that provides an adequate fit (Goodman 1974a, 1974b; McCutcheon 1987). We refer to such models as LC cluster models since the T nominal categories of the latent variable serve the same function as the T clusters desired in cluster analysis (McLachlan and Basford 1988; Hunt and Jorgensen 1999; Vermunt and Magidson 2001).
Van der Ark and Van der Heijden (1998) and Van der Heijden, Gilula and Van der Ark (1999) showed that exploratory LC analysis can be used to determine the number of dimensions underlying the responses on a set of nominal items. An LC model with three classes, for example, can be seen as a two-dimensional model similar to a two-dimensional joint correspondence analysis (JCA). However, within the context of LC analysis, a more natural manner of specifying the existence of two underlying dimensions for a set of items is to specify a model containing two latent variables.
Goodman (1974b), Haberman (1979), and Hagenaars (1990, 1993) proposed restricted 4-class LC models yielding confirmatory LC models with two latent variables. Their approach is confirmatory since, as in confirmatory factor analysis, it requires a priori knowledge on which items are related to which latent variables. In exploratory data analysis settings, we do not know beforehand which items load on the same latent variable. Hence, in exploratory analyses with several latent variables, this approach has limited practical applicability.
In this paper, we propose combining the exploratory model fitting strategy of the traditional latent class model with the possibility of increasing the number of latent variables to study the dimensionality of a set of items. Our alternative model fitting sequence involves increasing the number of latent variables (factors) rather than the number of classes (clusters). We call this alternative sequence the LC factor approach because of the natural analogy to standard factor analysis. The basic LC factor model contains R mutually independent, dichotomous latent variables. To exclude higher-order interactions, logit models are specified on the response probabilities. An interesting feature of the basic R-factor model is that it has exactly the same number of parameters as an LC cluster model with T = R+1 clusters. In section 2, we describe the two types of exploratory LC models using the log-linear formulation introduced by Haberman (1979).
Section 3 compares the use of LC cluster and factor models in several examples and describes various graphical displays that facilitate the interpretation of the results obtained from these models. In particular, we consider some variations (called "tri-plots") of the ternary diagram originally proposed by Van der Ark and Van der Heijden (1998) for LC cluster models, and introduce a new display (called a "bi-plot") for LC factor models to represent various kinds of information in a 2-dimensional factor space. These two graphs are compared to each other and to similar displays used in correspondence analysis.
Section 4 describes some important extensions of the basic LC factor model, such as various model modifications needed for a more confirmatory analysis and for the inclusion of covariates. In section 5, we discuss identification issues. The paper ends with some final remarks regarding the applicability of these models.
In this section we describe and compare two competing alternative approaches for exploratory LC analysis. The traditional approach utilizes LC cluster models, while the alternative is based on LC factor models. For the sake of simplicity of exposition, below we use the log-linear formulation of LC models introduced by Haberman (1979). In Appendix A, we give the alternative probability formulation of the two types of LC models, as well as the relationship between the two formulations.
For concreteness, consider 4 nominal variables denoted A, B, C, and D. Let X represent a nominal latent variable with T categories. The log-linear representation of the LC cluster model with T classes is:
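$$\ln F_{ijklt} = \lambda + \lambda^{X}_{t} + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AX}_{it} + \lambda^{BX}_{jt} + \lambda^{CX}_{kt} + \lambda^{DX}_{lt} \qquad (1)$$
where i = 1, 2, ..., I; j = 1, 2, ..., J; k = 1, 2, ..., K; l = 1, 2, ..., L; and t = 1, 2, ..., T.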
For convenience in counting distinct parameters and without loss of generality, we choose the following "dummy coding" restrictions to identify the parameters:
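$$\lambda^{X}_{1} = \lambda^{A}_{1} = \lambda^{B}_{1} = \lambda^{C}_{1} = \lambda^{D}_{1} = 0, \qquad \lambda^{AX}_{i1} = \lambda^{BX}_{j1} = \lambda^{CX}_{k1} = \lambda^{DX}_{l1} = 0$$
for i = 1, 2, ..., I; j = 1, 2, ..., J; k = 1, 2, ..., K; l = 1, 2, ..., L; and
$$\lambda^{AX}_{1t} = \lambda^{BX}_{1t} = \lambda^{CX}_{1t} = \lambda^{DX}_{1t} = 0$$
for t = 2, 3, ..., T.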
As can be seen, the LC model described in equation (1) has the form of a log-linear model for the five-way frequency table cross-classifying the 4 observed variables and the latent variable; that is, the table with cell entries $F_{ijklt}$. The assumed model contains one-variable terms ("main effects") associated with the latent variable X and the four observed indicators A, B, C, and D, as well as all two-variable "interaction" terms that involve X, which pertain to the association between X and each of the observed indicators. The one-variable effects are included because we do not wish to impose constraints on the univariate marginal distributions. The assumption that the observed responses to A, B, C, and D are mutually independent given X = t ("local independence") is imposed by the omission of all interaction terms pertaining to the associations between the indicators. As shown in Appendix A, this set of conditional independence assumptions can also be formulated in another way, yielding the probability formulation for the LC model.
Note that for the 1-class model, since T=1, the model described in equation (1) reduces to the usual log-linear model of mutual independence between the 4 observed variables:
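$$\ln F_{ijkl} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l}. \qquad (2)$$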
More generally, for models with any number of variables, we will denote the model of mutual independence as H0, and use it as a baseline to assess the improvement in fit to the data of various LC models. The number of distinct parameters in the model of independence as described in equation (2) is:
NPAR(indep) = (I-1) + (J-1) + (K-1) + (L-1)
Expressing the number of distinct parameters in the model described in equation (1) as a function of NPAR(indep), yields:
NPAR(T) = (T-1) + NPAR(indep) x [1 + (T-1)]
= (T-1) + NPAR(indep) x T
The number of degrees of freedom (DF) associated with the test of model fit is directly related to the number of distinct parameters in the model tested.
DF(T) = IJKL - NPAR(T) - 1
= IJKL - [1 + NPAR(indep)] x T
Beginning with this baseline model (T=1), each time the number of latent classes (T) is incremented by 1 the number of distinct parameters increases by 1 + NPAR(indep), and, as a consequence, the degrees of freedom are reduced by 1 + NPAR(indep). The first additional parameter is the main effect for the additional latent class, and the NPAR(indep) further parameters correspond to the effects of each observed (manifest) variable on this additional latent class.
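The counting rules above can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function names and the example set of indicators (two trichotomous and two dichotomous variables) are assumptions made for illustration.

```python
from math import prod


def npar_indep(levels):
    """NPAR(indep): one parameter per non-reference category of each indicator."""
    return sum(n - 1 for n in levels)


def npar_cluster(levels, T):
    """NPAR(T) = (T-1) + NPAR(indep) x T for a T-class LC cluster model."""
    return (T - 1) + npar_indep(levels) * T


def df_cluster(levels, T):
    """DF(T) = (number of cells in the observed table) - NPAR(T) - 1."""
    return prod(levels) - npar_cluster(levels, T) - 1


# Hypothetical example: indicators with 3, 2, 3 and 2 categories (36 cells).
levels = [3, 2, 3, 2]
for T in (1, 2, 3):
    print(T, npar_cluster(levels, T), df_cluster(levels, T))
```

Each additional class adds 1 + NPAR(indep) parameters (here 7) and removes the same number of degrees of freedom.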
Certain LC models can be interpreted in terms of 2 or more component latent variables by treating those components as a joint variable (Goodman 1974b; McCutcheon 1987; Hagenaars 1990). For example, a 4-category latent variable X = {1, 2, 3, 4} can be re-expressed in terms of 2 dichotomous latent variables V = {1,2} and W = {1, 2} using the following correspondence:
| | W=1 | W=2 |
| V=1 | X = 1 | X = 2 |
| V=2 | X = 3 | X = 4 |
Thus, X=1 corresponds with V=1 and W=1, X=2 with V=1 and W=2, X=3 with V=2 and W=1, and X=4 with V=2 and W=2.
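The same correspondence extends to any number of dichotomous component variables. The sketch below is illustrative only; the helper name `joint_levels` is hypothetical, and the ordering convention (last component varying fastest) is an assumption chosen to match the table above.

```python
from itertools import product


def joint_levels(R):
    """Map each category of the joint latent variable X to levels of R dichotomous components.

    Categories are numbered 1..2**R, with the last component varying fastest,
    which reproduces the V-by-W layout for R = 2.
    """
    return {x: combo for x, combo in enumerate(product((1, 2), repeat=R), start=1)}


print(joint_levels(2))
```

For R = 2 this returns the pairs (V, W) for X = 1, ..., 4; for R = 3 the joint variable has 8 categories.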
The LC cluster model given in (1) with T = 4 classes can be re-parameterized as an unrestricted LC factor model with two dichotomous latent variables V and W as follows:
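$$\ln F_{ijklrs} = \lambda + \lambda^{V}_{r} + \lambda^{W}_{s} + \lambda^{VW}_{rs} + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AV}_{ir} + \lambda^{AW}_{is} + \lambda^{AVW}_{irs} + \lambda^{BV}_{jr} + \lambda^{BW}_{js} + \lambda^{BVW}_{jrs} + \lambda^{CV}_{kr} + \lambda^{CW}_{ks} + \lambda^{CVW}_{krs} + \lambda^{DV}_{lr} + \lambda^{DW}_{ls} + \lambda^{DVW}_{lrs} \qquad (3)$$
The correspondence between the two representations is that the one-variable terms pertaining to X are now written as $\lambda^{X}_{t} = \lambda^{V}_{r} + \lambda^{W}_{s} + \lambda^{VW}_{rs}$, and the two-variable terms involving X as $\lambda^{AX}_{it} = \lambda^{AV}_{ir} + \lambda^{AW}_{is} + \lambda^{AVW}_{irs}$, $\lambda^{BX}_{jt} = \lambda^{BV}_{jr} + \lambda^{BW}_{js} + \lambda^{BVW}_{jrs}$, etc. It is easy to verify that this re-parameterization does not alter the number of distinct parameters in the model.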
We define the basic R-factor LC model as a restricted factor model that contains R mutually independent, dichotomous latent variables, together with parameters ("factor loadings") that measure the association between each latent variable and each indicator. Specifically, the basic R-factor model is defined by placing two sets of restrictions on the unrestricted LC factor model. For R = 2, the resulting basic 2-factor LC model is a restricted form of the 4-class LC cluster model; without these restrictions, the 2-factor model would be unconstrained and equivalent to a 4-cluster model.
The first set of restrictions sets to zero each of the 3-way and higher-order interaction terms. For the basic 2-factor model, we have
$$\lambda^{AVW}_{irs} = \lambda^{BVW}_{jrs} = \lambda^{CVW}_{krs} = \lambda^{DVW}_{lrs} = 0.$$
After imposing these restrictions, the 2-variable terms in the basic 2-factor model become $\lambda^{AX}_{it} = \lambda^{AV}_{ir} + \lambda^{AW}_{is}$, $\lambda^{BX}_{jt} = \lambda^{BV}_{jr} + \lambda^{BW}_{js}$, etc. For variable A, $\lambda^{AV}_{ir}$ represents the loading of A on factor V and $\lambda^{AW}_{is}$ denotes the loading of A on factor W, etc. The second set of restrictions imposes mutual independence between the latent variables. For the 2-factor model, this latter restriction imposes independence in the 2-way table <VW>.
Although the basic R-factor model is a special case of an LC cluster model containing $2^R$ classes, we show in Appendix A that because of the restrictions of the type given above, the basic R-factor LC model is actually comparable to an LC cluster model with only T = R+1 clusters in terms of parsimony. This large reduction in the number of parameters will be sufficient to achieve model identification in many situations. That is, in practice, it will frequently be the case that the basic R-factor model will be identified when the LC cluster model with $2^R$ classes is not.
TABLE 1: Equivalency Relationship between LC Cluster and Basic LC Factor Models
(Example with 5 Dichotomous Variables)
| LC Cluster Models | | | Basic LC Factor Models | | |
| Number of Latent Classes | Number of Parameters | Degrees of Freedom | Number of Factors | Number of Parameters | Degrees of Freedom |
| 1 | 5 | 26 | 0 | 5 | 26 |
| 2 | 11 | 20 | 1 | 11 | 20 |
| 3 | 17 | 14 | 2 | 17 | 14 |
| 4 | 23 | 8 | 3 | 23 | 8 |
| 5 | 29 | 2 | 4 | 29 | 2 |
Table 1 verifies the equivalence in number of parameters (and the associated degrees of freedom) between the various identified LC cluster models and the corresponding basic LC factor models in the case of 5 dichotomous indicator variables. From this table we can also calculate, for example, that the basic LC 2-factor model requires 23 - 17 = 6 fewer parameters than the 4-class LC cluster model. This reduction corresponds to the 5 restrictions that set to zero the three-variable term linking each indicator to both V and W ($\lambda^{AVW}_{irs} = 0$, and likewise for the other four indicators), plus the restriction that V and W are independent. (See Appendix A for a simple formula for calculating the number of such restrictions in the more general case.)
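The parameter counts in Table 1 can be reproduced with a few lines of code. This is a sketch under stated assumptions: the decomposition used in `npar_basic_factor` (R factor main effects, the indicator main effects, and one set of loadings per factor) is our reading of the log-linear description for dichotomous factors and nominal indicators, and the function names are hypothetical.

```python
from math import prod


def npar_indep(levels):
    """Distinct parameters of the independence model."""
    return sum(n - 1 for n in levels)


def npar_cluster(levels, T):
    """Distinct parameters of the T-class LC cluster model."""
    return (T - 1) + npar_indep(levels) * T


def npar_basic_factor(levels, R):
    """Distinct parameters of the basic R-factor model (assumed decomposition).

    R main effects for the dichotomous factors (no factor-factor terms, since the
    factors are mutually independent), NPAR(indep) indicator main effects, and
    NPAR(indep) loadings for each of the R factors.
    """
    return R + npar_indep(levels) * (1 + R)


levels = [2] * 5  # 5 dichotomous indicators, as in Table 1
for R in range(5):
    npar = npar_basic_factor(levels, R)
    assert npar == npar_cluster(levels, R + 1)  # same count as the (R+1)-class model
    print(R + 1, R, npar, prod(levels) - npar - 1)

# Parameters saved by the basic 2-factor model relative to the 4-class model.
saving = npar_cluster(levels, 2 ** 2) - npar_basic_factor(levels, 2)
print(saving)
```

The final line compares the basic 2-factor model with the cluster model that has 2^2 = 4 classes, which is the comparison made in the text.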
We conclude this section by noting an important difference between our LC factor model and the LC models with several latent variables proposed by Goodman (1974b), Haberman (1979), McCutcheon (1987), and Hagenaars (1990, 1993). The basic LC factor model described above includes all factor loadings between the latent variables and the indicators. This means that no assumptions need be made about which indicators are related to which latent variables. This makes this LC factor model better suited for exploratory data analysis than the LC models with several latent variables described in the literature.
Thus far we have described two alternative approaches for exploratory LC analysis, one involving the fitting of LC cluster models, the other fitting basic LC factor models. In the next section we consider some examples to illustrate and compare their performance on real data and introduce graphical displays that facilitate the interpretation of the obtained results.
Comparison of the two approaches for exploratory LC analysis across several data sets found that the factor approach resulted in a more parsimonious and easier to interpret model almost every time. Since our selection of data sets was not random, we do not present those results here. Rather, for purposes of illustration, this section considers the analysis of two data sets where a basic 2-factor model fits the data. In the first example, the comparable cluster model also provides an acceptable (but not as good) fit to the data; in the second example, the comparable cluster model provides a much worse fit, one that is not acceptable for these data.
This section also introduces graphical displays useful in displaying results from LC cluster and factor models. Details on the computation of the conditional probabilities appearing in the plots are given in Appendix B.
Our first example, taken from McCutcheon (1987) and reanalyzed by Van der Heijden, Gilula, and Van der Ark (1999), involves four categorical variables from the 1982 General Social Survey. Two items are evaluations of surveys by white respondents and the other two are evaluations of these respondents by the interviewer (see Table 2). A summary of various LC models fit to these data is given in Table 3.
TABLE 2: Cross-tabulation of Observed Variables for White Respondents to the 1982 General Social Survey
| | | | (A) COOPERATION | | |
| (C) PURPOSE | (D) ACCURACY | (B) UNDERSTANDING | Interested | Cooperative | Impatient/Hostile |
| Good | Mostly true | Good | 419 | 35 | 2 |
| Good | Mostly true | Fair, poor | 71 | 25 | 5 |
| Good | Not true | Good | 270 | 25 | 4 |
| Good | Not true | Fair, poor | 42 | 16 | 5 |
| Depends | Mostly true | Good | 23 | 4 | 1 |
| Depends | Mostly true | Fair, poor | 6 | 2 | 0 |
| Depends | Not true | Good | 43 | 9 | 2 |
| Depends | Not true | Fair, poor | 9 | 3 | 2 |
| Waste | Mostly true | Good | 26 | 3 | 0 |
| Waste | Mostly true | Fair, poor | 1 | 2 | 0 |
| Waste | Not true | Good | 85 | 23 | 6 |
| Waste | Not true | Fair, poor | 13 | 12 | 8 |
TABLE 3: Results from Various LC Models Fit to Data in Table 2
| Model | Model Description | BIC | L² | DF | p-value | % Reduction in L²(H0) |
| H0 | 1-class | 51.6 | 257.26 | 29 | 2.0×10^-38 | 0% |
| H1 | 2-class | -76.7 | 79.34 | 22 | 2.1×10^-8 | 69.1% |
| H2C | 3-class | -98.7 | 21.89 | 15+2 | 0.19 | 91.5% |
| H2F | basic 2-factor | -109.6 | 10.93 | 15+2 | 0.86 | 95.7% |
| H3 | 4-class | -72.0 | 6.04 | 8+3 | 0.87 | 97.7% |
| HR2F | restricted 2-factor | -140.9 | 22.17 | 22+1 | 0.51 | 91.4% |
| H1F3 | 1-factor (3 levels) | -71.7 | 77.25 | 21 | 2.3×10^-8 | 70.0% |
Model H0 is the baseline model given in equation (2), which specifies mutual independence between all four variables. Model H0 is a 1-class LC model (a 1-cluster model) which can also be interpreted as the equivalent 0-factor LC model. Since L² = 257.26 with DF = 29, this model is rejected. Next, consider the 2-class model (H1), which can be interpreted as either a 2-cluster model or the equivalent 1-factor model where the factor is dichotomous. The L² is now reduced to 79.34, a 69.1% reduction from the baseline model, but too high to be acceptable with DF = 22.
Next, consider the two 15-DF models: H2C, the 3-cluster model, and H2F, the basic 2-factor model. Each of these models provides an adequate fit to the data, although the factor model fits better, the L² being half that of the comparable cluster model. For comparison, Table 3 also provides results for the 4-cluster model (H3). Among the first 5 models listed in Table 3, H2F is preferred according to the BIC criterion. The last 2 models in Table 3 are extended models that will be discussed in the next section.
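The BIC comparison can be sketched in a few lines. The text does not state the BIC formula, so the code below assumes the common form BIC = L² - DF·ln(N), which is consistent with the tabled values when N = 1,202 (the total of the cell counts in Table 2) and when the "+2" and "+3" entries in the DF column of Table 3 are added to the base degrees of freedom. Both the formula and that reading of the DF column are assumptions.

```python
from math import log


def bic(l2, df, n):
    """BIC in the form assumed here: L² - DF * ln(N)."""
    return l2 - df * log(n)


def pct_reduction(l2, l2_baseline):
    """Percentage reduction in L² relative to the baseline model H0."""
    return 100.0 * (1.0 - l2 / l2_baseline)


N = 1202  # total of the cell counts in Table 2

# (L², DF) for the first five models in Table 3; 15+2 and 8+3 are entered as 17 and 11.
models = {
    "H0": (257.26, 29),
    "H1": (79.34, 22),
    "H2C": (21.89, 17),
    "H2F": (10.93, 17),
    "H3": (6.04, 11),
}

for name, (l2, df) in models.items():
    print(name, round(bic(l2, df, N), 1), round(pct_reduction(l2, 257.26), 1))

# The model with the smallest BIC is the preferred one.
best = min(models, key=lambda m: bic(*models[m], N))
print(best)
```

Under these assumptions the smallest BIC among the five models belongs to H2F, in line with the conclusion in the text.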
TABLE 4: Comparison of Results from the 3-Cluster Model with the Basic 2-Factor Model
Conditional Membership Probability of being in Cluster j = 1,2,3 (for Model H2C) or level 1 of Factor k = 1,2 (for Model H2F)
| | Model H2C | | | Model H2F | |
| Indicators | Cluster 1 | Cluster 2 | Cluster 3 | Factor1(1) | Factor2(1) |
| PURPOSE | | | | | |
| Good | 0.72 | 0.25 | 0.03 | 0.83 | 0.71 |
| Depends | 0.38 | 0.17 | 0.45 | 0.65 | 0.28 |
| Waste | 0.24 | 0.02 | 0.73 | 0.59 | 0 |
| ACCURACY | | | | | |
| Mostly true | 0.73 | 0.26 | 0.01 | 0.83 | 0.83 |
| Not true | 0.50 | 0.15 | 0.35 | 0.71 | 0.28 |
| UNDERSTAND | | | | | |
| Good | 0.76 | 0.08 | 0.16 | 0.89 | 0.53 |
| Fair, poor | 0 | 0.77 | 0.23 | 0.28 | 0.71 |
| COOPERATE | | | | | |
| Interested | 0.70 | 0.17 | 0.13 | 0.86 | 0.58 |
| Cooperative | 0.27 | 0.40 | 0.33 | 0.38 | 0.51 |
| Impatient/Hostile | 0 | 0.39 | 0.61 | 0 | 0.35 |
| Overall Probability | 0.62 | 0.21 | 0.17 | 0.78 | 0.57 |
indicates a boundary solution
Table 4 compares results obtained from the 3-cluster model (H2C) with those from the basic 2-factor model (H2F). The cell entries in the left-most columns are "rescaled parameter estimates" suggested by Van der Heijden, Gilula, and Van der Ark (1999), and represent the estimated conditional probabilities of being a member of one of the three clusters. The right-most columns contain corresponding quantities for the basic 2-factor model, representing the estimated probabilities of being at level 1 for each of the 2 factors. Unconditional membership probabilities for the clusters and for level 1 of the factors are given in the last row of the table.
Graphical displays of the conditional probabilities reported in Table 4 are useful in comparing results between the two models. For the 3-cluster model H2, Van der Heijden, Gilula, and Van der Ark (1999, Figure 4) present a ternary diagram for visualizing the results and show the close relationship to 2-dimensional plots produced by joint correspondence analysis (JCA). A slightly modified graphic, referred to as the "tri-plot" display by Vermunt and Magidson (2000) is given in Figure 1 for the 3-cluster model H2C. The shaded triangle in Figure 1 with lines emanating to the sides represents the overall sample which is plotted at the point corresponding to the unconditional membership probabilities for the clusters.
FIGURE 1. Tri-plot of Results Reported in Table 4 for Model H2C
A different display for LC factor models called the "bi-plot" (Vermunt and Magidson, 2000) is given in Figure 2 for the 2-factor model H2F. For comparability to the tri-plot where cluster 1 is assigned to the top vertex, we take factor 1 to be the vertical axis and factor 2 the horizontal. By comparing these plots we can see the large degree of similarity between the models, the primary difference being the relative positioning of COOPERATION = Impatient/Hostile and UNDERSTANDING = Fair, poor.
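As a rough illustration of how such displays can be drawn, the sketch below converts the conditional probabilities of Table 4 into plotting coordinates. Only two conventions come from the text: cluster 1 sits at the top vertex of the tri-plot, and factor 1 is the vertical axis of the bi-plot. Placing cluster 2 at the bottom-left vertex and cluster 3 at the bottom-right, and the function names themselves, are assumptions.

```python
from math import sqrt


def tri_plot_xy(p1, p2, p3):
    """Barycentric-to-Cartesian coordinates in an equilateral triangle of side 1.

    Vertices (assumed layout): cluster 1 at the top (0.5, sqrt(3)/2),
    cluster 2 at bottom-left (0, 0), cluster 3 at bottom-right (1, 0).
    """
    total = p1 + p2 + p3  # renormalize to absorb rounding in the tabled values
    p1, p2, p3 = p1 / total, p2 / total, p3 / total
    x = p3 + 0.5 * p1
    y = (sqrt(3) / 2) * p1
    return x, y


def bi_plot_xy(factor1_level1, factor2_level1):
    """Bi-plot point: factor 2 on the horizontal axis, factor 1 on the vertical axis."""
    return factor2_level1, factor1_level1


# PURPOSE = Good, using the probabilities reported in Table 4.
print(tri_plot_xy(0.72, 0.25, 0.03))
print(bi_plot_xy(0.83, 0.71))
```

A category that belongs almost entirely to one cluster lands near that cluster's vertex, while the overall sample point is obtained by passing the unconditional probabilities to the same functions.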
Lines connecting the categories of a variable can make it easier to see to which factor the variables are most related. For example, Figure 3 shows that separation between the categories of the two respondent evaluation variables, PURPOSE and ACCURACY, occurs primarily along Factor 2 (the horizontal axis in Figure 3), while for the two interviewer evaluation variables, UNDERSTANDING and COOPERATION, separation occurs primarily along Factor 1 (the vertical axis). This makes clear that Factor 1 pertains primarily to the interviewer evaluation while Factor 2 pertains primarily to the respondent evaluation. These two factors are not only distinct (i.e., the 1-factor model H1 does not fit these data) but, according to model H2F, they are mutually independent.
Since our models yield estimated membership probabilities for each individual case, both displays can easily be extended to include points for individual cases and covariate levels as well as any other desired groupings of the cases (see Appendix B). Our methodology is unified in the sense that the same methods and models that yield our tri-plots for LC cluster models also yield the bi-plots for the LC factor models. Our tri-plot display can be more easily extended in this manner than the methods proposed by Van der Heijden, Gilula, and Van der Ark (1999) with the ternary diagram. In our next example we will illustrate the inclusion in our plots of cases by including specific cases with selected response patterns. Then in section 4, we show how the display of all response patterns can be used to identify a natural ordering between the classes (when such an ordering exists), and we describe two different approaches for overlaying covariate values (levels) onto the displays.
The bi-plots offer several advantages over the related plots produced in correspondence analysis (CA) even when the data justifies a 2-dimensional CA solution. That is because the 2-dimensional CA solution is closely related to the 3-cluster solution (Gilula and Haberman 1986; De Leeuw and Van der Heijden, 1991) which we have found typically does not fit the data as well as the 2-factor solution. As suggested in this paper, the LC factor models generally provide simpler explanations of data than LC cluster models and the related canonical models used in CA and principal components analysis.
Our LC factor model is more closely related to traditional factor analysis than to CA. Advantages over traditional factor analysis include 1) the variables can include different scale types (nominal, ordinal, continuous, and/or counts), 2) solutions are typically uniquely identified and interpretable without the need for a rotation (there is no rotational indeterminacy), and 3) factor scores can be obtained for each case without the need for additional assumptions. Like traditional factor analysis, LC factor analysis can be used as a first step in a more confirmatory analysis. Later in this paper (section 4) we describe a more confirmatory analysis of the data analyzed above.
Our second example consists of 5 dichotomous responses obtained from a mail survey regarding various musculo-skeletal symptoms (see Table 5). Specifically, persons were asked whether they had any of the following symptoms today: back pain, neck pain, joint pain, joint swelling, and joint stiffness. For further details see Wasmus, et al. (1989).
Table 5: Rheumatoid Arthritis Mail Survey Data
| BACK | NECK | JOINT | SWELL | STIFF | Frequency |
| no | no | no | no | no | 3,634 |
| no | no | no | no | yes | 73 |
| no | no | no | yes | no | 87 |
| no | no | no | yes | yes | 10 |
| no | no | yes | no | no | 440 |
| no | no | yes | no | yes | 89 |
| no | no | yes | yes | no | 106 |
| no | no | yes | yes | yes | 75 |
| no | yes | no | no | no | 295 |
| no | yes | no | no | yes | 25 |
| no | yes | no | yes | no | 15 |
| no | yes | no | yes | yes | 5 |
| no | yes | yes | no | no | 137 |
| no | yes | yes | no | yes | 42 |
| no | yes | yes | yes | no | 35 |
| no | yes | yes | yes | yes | 39 |
| yes | no | no | no | no | 489 |
| yes | no | no | no | yes | 37 |
| yes | no | no | yes | no | 23 |
| yes | no | no | yes | yes | 7 |
| yes | no | yes | no | no | 255 |
| yes | no | yes | no | yes | 116 |
| yes | no | yes | yes | no | 71 |
| yes | no | yes | yes | yes | 65 |
| yes | yes | no | no | no | 306 |
| yes | yes | no | no | yes | 48 |
| yes | yes | no | yes | no | 16 |
| yes | yes | no | yes | yes | 11 |
| yes | yes | yes | no | no | 229 |
| yes | yes | yes | no | yes | 162 |
| yes | yes | yes | yes | no | 44 |
| yes | yes | yes | yes | yes | 176 |
| Total | 7,162 |
The traditional LC cluster approach, as applied by Kohlmann and Formann (1997) to these data, rejects the 1-, 2-, and 3-class models in favor of the 4-class model, which provides an acceptable fit to the data (L² = 8.4 with 8 degrees of freedom; p = .39). The BIC statistic also selects the 4-class model as the one to be preferred among the LC cluster models listed in Table 6.
TABLE 6: Results from Various LC Models Fit to Data in Table 5
| Model | Model Description | BIC | L² | DF | p-value | % Reduction in L²(H0) |
| H0 | 1-class | 4592.8 | 4823.6 | 26 | 3.0×10^-101 | 0% |
| H1 | 2-class | 376.6 | 554.2 | 20 | 1.3×10^-104 | 88.5% |
| H2C | 3-class | 38.2 | 162.4 | 14 | 2.3×10^-27 | 96.6% |
| H2F | basic 2-factor | -110.5 | 13.7 | 14 | 0.5 | 99.7% |
| H3C | 4-class | -62.6 | 8.4 | 8 | 0.4 | 99.8% |
| H3F | basic 3-factor | -85.1 | 3.7 | 8+2 | 1.0 | 99.9% |
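For readers who want to see how a baseline statistic of this kind can be obtained, the following sketch fits the independence model (H0) to the counts of Table 5. It assumes that L² denotes the usual likelihood-ratio chi-squared statistic, 2 Σ n ln(n / m), with m the expected count under the model; the text does not spell out this definition, and the variable names are hypothetical.

```python
from itertools import product
from math import log

# Counts from Table 5, in table order: BACK, NECK, JOINT, SWELL, STIFF,
# with the last variable (STIFF) varying fastest; 0 = no, 1 = yes.
counts = [
    3634, 73, 87, 10, 440, 89, 106, 75,
    295, 25, 15, 5, 137, 42, 35, 39,
    489, 37, 23, 7, 255, 116, 71, 65,
    306, 48, 16, 11, 229, 162, 44, 176,
]
patterns = list(product((0, 1), repeat=5))
n = sum(counts)

# Maximum likelihood estimates under independence: the observed "yes" proportions.
p_yes = [
    sum(c for pat, c in zip(patterns, counts) if pat[v] == 1) / n
    for v in range(5)
]


def expected(pattern):
    """Expected count of a response pattern under mutual independence."""
    m = n
    for v, response in enumerate(pattern):
        m *= p_yes[v] if response else 1.0 - p_yes[v]
    return m


# Likelihood-ratio statistic (assumed definition of L²) and its degrees of freedom.
l2 = 2.0 * sum(c * log(c / expected(pat)) for pat, c in zip(patterns, counts) if c > 0)
df = 2 ** 5 - 1 - 5  # cells minus 1, minus one parameter per dichotomous item

print(n, df, round(l2, 1))
```

Fitting the LC cluster and factor models themselves requires an iterative algorithm such as EM and is beyond this sketch.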
The close relationship between the latent class cluster model and the canonical model (Gilula and Haberman 1986; De Leeuw and Van der Heijden, 1991) justifies a 2-dimensional display such as that produced in joint correspondence analysis (JCA) when the 3-cluster model is true (Van der Heijden, Gilula, and Van der Ark 1999). On the other hand, when the 3-class model must be rejected as not providing an adequate fit to data, as in the present example, the 2-dimensional JCA display can not provide a complete description of these data because a third dimension is also needed. However, as we show below, a different 2-dimensional display obtained from the LC factor model does provide a complete description of these data.
TABLE 7: Comparison of Results obtained under Models H2C and H3C
Conditional Membership Probabilities
| | 3-Class Solution (H2C) | | | 4-Class Solution (H3C) | | | |
| Variables | Class 1 | Class 2 | Class 3 | Class 1 | Class 2 | Class 3 | Class 4 |
| BACK | | | | | | | |
| No | 0.94 | 0.32 | 0.37 | 0.93 | 0.31 | 0.60 | 0.09 |
| Yes | 0.06 | 0.68 | 0.63 | 0.07 | 0.69 | 0.40 | 0.91 |
| NECK | | | | | | | |
| No | 0.96 | 0.48 | 0.50 | 0.96 | 0.44 | 0.77 | 0.15 |
| Yes | 0.04 | 0.52 | 0.50 | 0.04 | 0.56 | 0.23 | 0.85 |
| JOINT | | | | | | | |
| No | 0.91 | 0.63 | 0.07 | 0.93 | 0.60 | 0.10 | 0.05 |
| Yes | 0.09 | 0.37 | 0.93 | 0.07 | 0.40 | 0.90 | 0.95 |
| SWELL | | | | | | | |
| No | 0.97 | 0.96 | 0.49 | 0.98 | 0.96 | 0.55 | 0.44 |
| Yes | 0.03 | 0.04 | 0.51 | 0.02 | 0.04 | 0.45 | 0.56 |
| STIFF | | | | | | | |
| No | 0.98 | 0.89 | 0.39 | 0.99 | 0.88 | 0.58 | 0.08 |
| Yes | 0.02 | 0.11 | 0.61 | 0.01 | 0.12 | 0.42 | 0.92 |
| Overall Probabilities | 0.62 | 0.21 | 0.17 | 0.61 | 0.21 | 0.12 | 0.06 |
Table 7 provides a closer look at the differences between the 3- and 4-class solutions to these data. We see that for the most part, the 4-class solution maintains classes 1 and 2 from the 3-class solution, but splits class 3 into two separate clusters. One way to visualize the close relationship between these two solutions is to combine classes 3 and 4 of the 4-class solution and compare the resulting tri-plot (displayed in Figure 5) with the original tri-plot from the 3-cluster model (Figure 4). As can be seen, these plots are almost identical, adding visual support to our conclusion (based on inspection of Table 7) that the primary difference between the two solutions is the splitting of class 3 into separate clusters. However, these plots do not describe the significant differences that exist between clusters 3 and 4 of the 4-cluster solution.
Results from fitting various basic factor models to these data are also included in Table 6. In particular, we see that despite the fact that the 3-cluster model H2C does not provide an adequate fit to these data, the comparable LC factor model H2F which posits two dichotomous factors, provides an excellent fit. While the traditional exploratory approach yields the 4-class LC cluster model H3C, this model requires 3 dimensions for a display of the results. On the other hand, the alternative approach yields factor model H2F, which justifies a valid 2-dimensional display without the necessity of collapsing or otherwise reducing the variables in the model. The resulting bi-plot presented in Figure 6 shows that JOINT, SWELL and STIFF are more strongly related to factor 1 (the arthritis factor), and BACK and NECK to factor 2 (the pain factor).
In most cases where models suggest that at least 2 dimensions are needed to provide an adequate fit to the data, it seems reasonable to expect there to be 2 underlying factors and hence at least 4 different classes to take into account both the low and high levels of each factor, i.e., (low, low), (high, low), (low, high) and (high, high). If this speculation is true, it would explain why the factor approach typically provides a better fit to real data. Closer inspection of the 4-cluster model parameters reported in Table 7 shows that the 4-cluster model in fact also suggests a two-dimensional solution: the 4 clusters correspond to the (low, low), (high, low), (low, high) and (high, high) combinations of the same two dimensions encountered in the 2-factor model.
TABLE 8: Comparison between Models H2C, H3C, and H2F
Observed vs. Expected Frequencies for 4 Response Patterns
| Response Pattern | Back | Neck | Joint | Swell | Stiff | Observed | Expected (H2C) | Expected (H3C) | Expected (H2F) |
| 1 | No | No | No | No | No | 3,634 | 3,621.4 | 3,633.8 | 3,630.2 |
| 2 | Yes | Yes | No | No | No | 306 | 304.5 | 304.8 | 307.6 |
| 3 | No | No | Yes | Yes | Yes | 75 | 65.4 | 70.8 | 73.0 |
| 4 | Yes | Yes | Yes | Yes | Yes | 176 | 112.0* | 173.7 | 174.9 |
Using BACK and NECK as proxies for factor 2 and the other variables for factor 1, we selected 4 response patterns as proxies for the 4 classes. Table 8 compares the estimates of the expected frequency counts obtained from models H2C, H3C, and H2F for these 4 selected response patterns. We see that the 3-class cluster model fails to provide a good estimate for respondents who reported having all 5 pain symptoms, i.e., the (high, high) group.
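One simple way to quantify the discrepancies in Table 8 is a Pearson-type residual, (observed - expected) / sqrt(expected). The text does not say which diagnostic was used to judge the differences, so this residual is an illustrative choice rather than the method actually applied; the dictionary layout is likewise hypothetical.

```python
from math import sqrt

# Observed and expected counts for the 4 selected response patterns (Table 8).
observed = {1: 3634, 2: 306, 3: 75, 4: 176}
expected = {
    "H2C": {1: 3621.4, 2: 304.5, 3: 65.4, 4: 112.0},
    "H3C": {1: 3633.8, 2: 304.8, 3: 70.8, 4: 173.7},
    "H2F": {1: 3630.2, 2: 307.6, 3: 73.0, 4: 174.9},
}


def pearson_residual(obs, exp):
    """Standardized difference between an observed and an expected count."""
    return (obs - exp) / sqrt(exp)


residuals = {
    model: {pat: pearson_residual(observed[pat], exp) for pat, exp in cells.items()}
    for model, cells in expected.items()
}

for model, res in residuals.items():
    print(model, {pat: round(r, 2) for pat, r in res.items()})
```

With these numbers, the residual for response pattern 4 under H2C is far larger in absolute value than any residual under H3C or H2F, which matches the verbal description of the lack of fit.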
Overall, the expected frequencies estimated under the 3-cluster model differ significantly from the observed frequencies for 7 of the 32 response patterns, while the other 2 models provide good estimates for all response patterns. The 4 selected response patterns (or cases) are plotted in Figures 4 and 6, each with its own symbol. The symbol for response pattern 4 appears in reverse shading in Figure 4 to indicate the lack of fit. Figure 6 shows that these 4 response patterns appear in the 4 corners of the bi-plot, suggesting that they are in fact good indicators of the (low, low) through (high, high) levels of the joint factor. Figure 4, on the other hand, shows that 3 clusters are inadequate to separate cases with response patterns 3 and 4, and indicates that the estimate of the expected count for response pattern 4 is poor.
In this section we consider some modifications and extensions of the basic LC factor model that may be of interest in certain situations. First, although in example 1 we treated the trichotomous variables COOPERATE (A) and PURPOSE (C) as nominal, they can be treated as ordinal in several different ways. The most straightforward approach is to assume the middle category to be equidistant from the others and modify the log-linear model described in equation (3) by using the uniform scores
score for category i of A = {0 if i = 1, 0.5 if i = 2, 1 if i = 3}
score for category k of C = {0 if k = 1, 0.5 if k = 2, 1 if k = 3}
for the categories of variables A and C. Secondly, analogous to confirmatory factor analysis, we may wish to allow the two factors V and W to be correlated (with association parameter $\lambda^{VW}_{rs}$) and restrict the variables COOPERATION (A) and UNDERSTANDING (B) to load only on factor 1 and PURPOSE (C) and ACCURACY (D) to load only on factor 2. The log-linear representation for a confirmatory model of this type as compared to the basic 2-factor model in Appendix A is as follows:
where i,k = 1,2,3; j,l,r,s = 1,2;
The results of fitting this restricted 2-factor model (HR2F) are reported in Table 3. These suggest that this model fits the data very well (L² = 22.17, DF = 23, p = .51). The corresponding bi-plot is shown in Figure 7.
Our examples thus far utilized only dichotomous factors. To extend the factor model so that any factor may contain more than 2 ordered levels, we assign equidistant numeric scores between 0 and 1 to the levels of the factor. Clogg (1988) and Heinen (1996) used the same strategy for defining LC models that are similar to certain latent trait models. The use of fixed scores for the factor levels in the various two-way interaction terms guarantees that each factor captures a single dimension. For factors with more than two levels, in the bi-plot we display conditional means rather than conditional probabilities (see Appendix B). Note that if we assign the score of 0 to the first level and 1 to the last level (or vice versa), for dichotomous factors the conditional mean equals the conditional probability of being at level 2 (or level 1).
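The scoring rule just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the input to `conditional_mean` is assumed to be a probability distribution over the ordered levels of one factor (for example, the distribution of the factor given a particular indicator category).

```python
def equidistant_scores(n_levels):
    """Equally spaced scores from 0 (first level) to 1 (last level)."""
    if n_levels < 2:
        raise ValueError("a factor needs at least 2 levels")
    return [m / (n_levels - 1) for m in range(n_levels)]


def conditional_mean(level_probs):
    """Mean factor score for a distribution over the ordered factor levels."""
    scores = equidistant_scores(len(level_probs))
    return sum(score * prob for score, prob in zip(scores, level_probs))


# A 3-level factor gets the scores 0, 0.5, 1.
print(equidistant_scores(3))

# For a dichotomous factor the scores are 0 and 1, so the conditional mean
# is simply the probability of being at level 2.
print(conditional_mean([0.3, 0.7]))
```

The second call shows why displaying conditional means in the bi-plot is a direct generalization: with two levels nothing changes, and with more levels each point still lies between 0 and 1 on the factor's axis.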
Finally, the extension to include covariates in a log-linear LC model is straightforward. To illustrate the use of covariates and the extension to a 3-level factor, we will use the depression scale data for white respondents from the "Problems of Everyday Life" study conducted in 1972 by Pearlin (Pearlin and Johnson 1977), as reported separately for males and females (Schaeffer 1988). Persons who reported having the symptom during the previous week were coded 1, all others 0. The symptoms measured were lack of enthusiasm, low energy, sleeping problem, poor appetite, and feeling hopeless.
Gender was included in the model as an active covariate (see the discussion in Appendix B on active vs. inactive covariates). Note that in the case of a single covariate, the log-linear LC model is identical whether GENDER is treated as a covariate or as another indicator (Clogg 1981; Hagenaars 1990).
TABLE 9: Results from Various LC Models Fit to the Depression Data
| Model | Model Description | BIC | L² | DF | p-value | % Reduction in L²(H0) |
|---|---|---|---|---|---|---|
| H0 | 1-class | 672.8 | 1097.1 | 57 | 2.3 × 10^-192 | 0 |
| H1 | 2-class | -233.7 | 138.5 | 50 | 3.1 × 10^-10 | 87.4% |
| H2C | 3-class | -260.5 | 59.6 | 43 | 0.05 | 94.6% |
| H2F | basic 2-factor | -274.6 | 45.5 | 43+1 | 0.37 | 95.9% |
| H1F3 | 1-factor (3-levels) | -297.8 | 67.0 | 49 | 0.05 | 93.9% |
df is increased by these boundary solutions
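The last column of Table 9 can be reproduced from the L² column alone, as the sketch below shows. The helper names are hypothetical. The BIC helper uses the L²-based form BIC = L² - DF·ln(N) that is common in LC analysis; this section does not state the formula or the sample size, so both the formula and the argument `n` are assumptions here.

```python
import math

# L² values copied from Table 9.
l2 = {"H0": 1097.1, "H1": 138.5, "H2C": 59.6, "H2F": 45.5, "H1F3": 67.0}


def pct_reduction_l2(l2_model, l2_baseline):
    """Percent reduction in L² relative to the 1-class baseline model."""
    return 100.0 * (1.0 - l2_model / l2_baseline)


def bic_from_l2(l2_value, df, n):
    """L²-based BIC (assumed form): L² - DF * ln(N), N = sample size."""
    return l2_value - df * math.log(n)


reductions = {
    name: round(pct_reduction_l2(value, l2["H0"]), 1)
    for name, value in l2.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% reduction in L2 relative to H0")
```

Because BIC subtracts a penalty proportional to DF, a model with more degrees of freedom (fewer parameters) can have a lower BIC even when its L² is higher, which is the pattern seen for H1F3 in Table 9.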
TABLE 10: Conditional Probabilities Estimated under the 3-Cluster Model and the 1-Factor 3-Level Model
| | 3-Cluster Model | | | 1-Factor 3-level Model | | |
|---|---|---|---|---|---|---|
| | Cluster1 | Cluster2 | Cluster3 | Level1 | Level2 | Level3 |
| Cluster Size | 0.46 | 0.44 | 0.10 | 0.45 | 0.45 | 0.10 |
| ENTHUS | | | | | | |
| Lack of enthusiasm | 0.26 | 0.82 | 0.96 | 0.26 | 0.81 | 0.98 |
| No | 0.74 | 0.18 | 0.04 | 0.74 | 0.19 | 0.02 |
| ENERGY | | | | | | |
| Low energy | 0.03 | 0.63 | 0.95 | 0.03 | 0.61 | 0.99 |
| No | 0.97 | 0.37 | 0.05 | 0.97 | 0.39 | 0.01 |
| SLEEP | | | | | | |
| Sleeping problem | 0.10 | 0.37 | 0.78 | 0.09 | 0.38 | 0.79 |
| No | 0.90 | 0.63 | 0.22 | 0.91 | 0.62 | 0.21 |
| APPETITE | | | | | | |
| Poor appetite | 0.04 | 0.22 | 0.73 | 0.04 | 0.24 | 0.72 |
| No | 0.96 | 0.78 | 0.27 | 0.96 | 0.76 | 0.28 |
| HOPELESS | | | | | | |
| Hopeless | 0.03 | 0.10 | 0.67 | 0.02 | 0.13 | 0.61 |
| No | 0.97 | 0.90 | 0.33 | 0.98 | 0.87 | 0.39 |
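Because an LC model is a mixture, the marginal probability of reporting a symptom is the size-weighted average of the class-conditional probabilities in Table 10. The sketch below applies that identity to both solutions. Variable and function names are hypothetical, and the inputs are the rounded values printed in Table 10, so the results are approximate.

```python
# Class (or factor-level) sizes from Table 10.
cluster_sizes = [0.46, 0.44, 0.10]   # 3-cluster model
level_sizes = [0.45, 0.45, 0.10]     # 1-factor 3-level model

# P(symptom reported | class), copied from Table 10.
p_cluster = {
    "ENTHUS": [0.26, 0.82, 0.96],
    "ENERGY": [0.03, 0.63, 0.95],
    "SLEEP": [0.10, 0.37, 0.78],
    "APPETITE": [0.04, 0.22, 0.73],
    "HOPELESS": [0.03, 0.10, 0.67],
}
p_level = {
    "ENTHUS": [0.26, 0.81, 0.98],
    "ENERGY": [0.03, 0.61, 0.99],
    "SLEEP": [0.09, 0.38, 0.79],
    "APPETITE": [0.04, 0.24, 0.72],
    "HOPELESS": [0.02, 0.13, 0.61],
}


def marginal(sizes, cond_probs):
    """Mixture marginal: sum over classes of P(class) * P(symptom | class)."""
    return sum(size * prob for size, prob in zip(sizes, cond_probs))


for item in p_cluster:
    m_cluster = marginal(cluster_sizes, p_cluster[item])
    m_level = marginal(level_sizes, p_level[item])
    print(f"{item}: 3-cluster {m_cluster:.3f}, 1-factor 3-level {m_level:.3f}")
```

With these rounded inputs the two solutions imply nearly identical symptom prevalences, which is consistent with how similar the two halves of Table 10 look.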
Table 9 shows the results from fitting various LC models to these data. The traditional strategy required 3 classes, as neither the 1-class nor the 2-class model provided an adequate fit. We see again that the basic 2-factor model fits the data better than the comparable 3-cluster model. The results