  Latent GOLD® 4.0: Frequently Asked Questions

General

What resources are available to learn about Latent GOLD® and latent class modeling?

What data file formats can Latent GOLD® handle?

How can I use Latent GOLD® with SAS data sets?

How many records and variables can I use? How much time will it take to run?

How does Latent GOLD® differ from the LEM Program?

Do you have any tutorials for event history analysis?

LC Cluster Analysis

How does latent class cluster analysis compare with the traditional clustering procedures in SAS and SPSS?

How does Latent GOLD® classify cases into latent classes?

When the 'Include Missing' option is selected, does Latent Gold do some kind of imputation?

How are Latent Class (LC) clustering techniques related to Fuzzy Clustering Techniques?

How can I tell if my latent class cluster model contains local dependencies?

How can I handle local dependencies in my LC cluster model?

In LC cluster models containing continuous indicators, how can I determine whether a model should contain a within class correlation between 2 or more of these variables?

LC Factor Analysis

How does latent class factor analysis compare with traditional factor analysis?

Latent class factor analysis running time.

LC Regression Analysis

How does LC regression analysis compare with traditional regression modeling?

I need a mixture modeling program that can handle dependent variables that are dichotomous as well as continuous. Does Latent GOLD® handle this?

I have a binary dependent variable and five categorical independent variables. I am using Latent GOLD® to find 3 segments among the respondents. The Parameters output shows separate estimates for each segment. However, there appears to be both intercepts as well as betas for dependent variable. I am confused about how to use both of them in terms of predicting.

Is there any "stepwise" inclusion feature in the LC regression module?

Can Latent GOLD® perform multinomial LC regression models? Can it be used with repeated measures such as those obtained in conjoint and discrete choice studies?

For LC Regression models, there are several R square statistics reported in the Latent Gold output. When there are 2 or more latent segments (latent classes), do these still measure the overall strength of the predictors to predict the dependent variable?

I understand that the covariates are used to predict membership in a class based upon the probabilities derived from a multinomial logit model. The classification errors, reduction errors, entropy R square, etc. are associated with this estimation. Correct?

Questions on Tutorial #3: LC Regression with Repeated Measures

After I run the model, say on a binary response, and get two latent classes with their set of parameters, I'd like to predict the response of a new observation, with a given set of predictors and with or without covariates, but unknown response. How can I get this from Latent Gold?

Technical Questions from Latent GOLD® Users

I scored my data file using the 'classification output file' option and found that the percentage of each class is different than the class sizes given in the profile output.

I have included several ordinal variables with many values in my model and the program takes a very long time to run.  Can I do anything to speed it up?

The output listings in the manual for the IRIS data contain some errors in the statistics. What are the correct results?

Latent GOLD Advanced Questions

What additional functionalities are gained with the advanced version of Latent Gold?

I have the Advanced version of Latent GOLD 4.0. Is it possible to estimate IRT models such as the Rasch model, Rost's Rasch mixture model, partial credit model and rating scale models that can be estimated in the WINMIRA program? If so, how do the parameterizations differ?

 

General

Q. What resources are available to learn about Latent GOLD® and latent class modeling?

A. Before purchasing the program, you can try the free demo version, which gives access to all program features with the sample data files. Tutorials take you step by step through several analyses of these sample files. The tutorials, along with several articles, are available on our website. Upon purchase, users can download a 200-page User's Guide that covers a wide range of topics on latent class analysis and Latent GOLD®. We also offer a once-a-year training program (Statistical Modeling Week), which includes a 2-day course on latent class analysis, as well as online courses.

Q. What data file formats can Latent GOLD® handle?

A. The standard version of Latent GOLD® can handle ASCII Text data formats as well as SPSS files. The DBMS copy version of the program can handle over 80 different file formats, including spreadsheets and SAS files.

Return to List of Questions

Q. How can I use Latent GOLD® with SAS data sets?

A. SAS Export can create an SPSS .sav file, which can be opened by Latent GOLD®. The SAS documentation illustrates the Export function; see the relevant Export page for instructions.

Q. How many records and variables can I use? How much time will it take to run?

A. There is no limit on the number of records. The running time depends on several factors, including the number of variables and records, the speed of your machine, and the requested output. For many models, Latent GOLD® runs 20 or more times faster than other latent class programs. We suggest trying the demo program to see how fast Latent GOLD® works on your machine.

Q. How does Latent GOLD differ from the LEM program?

A. Latent GOLD® implements the 3 most important types of latent class (LC) models. It was designed to be extremely easy to use and to make it possible for people without a strong statistical background to apply LC analysis to their own data in a safe and easy way. LEM is a command-language research tool that Prof. Jeroen Vermunt developed for applied researchers with a strong statistical background who want to apply nonstandard log-linear and latent class models to their categorical data. With LEM you can specify more probability structures with many more kinds of restrictions (if you know how to do it), but it is not designed to be Windows friendly, requires strict data and input formats, and does not provide error checks.

With Latent GOLD, continuous and count variables can be included in the model, and special LC output not available in LEM is provided, such as various graphs, classification statistics, and bivariate residuals. Latent GOLD® also has faster (full Newton-Raphson) and safer (sets of starting values, Bayes constants) estimation methods for LC models than LEM. Both programs give information on nonidentifiability and boundary solutions, but Latent GOLD® , unlike LEM, can prevent boundary solutions through the use of Bayes constants.

Q. Do you have any tutorials for event history analysis?

A. The set of example data files on our website contains various event history analysis examples. Tutorials are not yet available for these. However, to get you started, you might look at the data file land.sav, the full reference for which is "Land, K.C., Nagin, D.S., and McCall (2001). Discrete-time hazard regression models with hidden heterogeneity: the semi-parametric mixed Poisson approach. Sociological Methods and Research, 29, 342-373." Another good example is jobchange.dat.

Land.sav contains information on 411 males from a working-class area of London who were followed from ages 10 through 31. The dependent variable is "first serious delinquency". As can be seen, there is one record for each time point, which is called a person-period data format. The dependent "first" is zero for all records of a person, except for the last record if the person experienced the event of interest at that age.

The variables age and age_sq are the duration variables. These can also be seen as time-varying predictors. The variable "tot" is a time-constant covariate/predictor (a composite risk factor). Of course the ID should be used as Case ID to indicate which records belong to the same case.
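As a rough illustration of the person-period layout described above, the following Python sketch expands one person into one record per age. The variable names mirror those mentioned for land.sav (id, age, age_sq, tot, first), but the helper function and the example values are hypothetical and are not taken from the actual data file.

```python
def to_person_period(person_id, start_age, end_age, event_occurred, tot):
    """Expand one person into one record per age (person-period format).

    'first' is 0 on every record except the last one, where it is 1
    only if the person experienced the event at that age.
    """
    rows = []
    for age in range(start_age, end_age + 1):
        is_last = age == end_age
        rows.append({
            "id": person_id,          # Case ID linking records of one person
            "age": age,               # duration variable
            "age_sq": age * age,      # squared duration variable
            "tot": tot,               # time-constant risk factor
            "first": 1 if (is_last and event_occurred) else 0,
        })
    return rows


# Hypothetical person 1: event at age 14, so follow-up stops there.
records = to_person_period(1, 10, 14, event_occurred=True, tot=2)
# Hypothetical person 2: followed to age 31 without the event.
records += to_person_period(2, 10, 31, event_occurred=False, tot=0)

for row in records[:5]:
    print(row)
```

Each person contributes as many records as the number of ages at which he was at risk, which is why the ID must be declared as the Case ID.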

The dependent "first" can be treated as a Poisson count or as a binomial count. The former option yields a piecewise-constant log-linear hazard model, the latter a discrete-time logit model. If it is treated as a Poisson count, it is best to set the exposure to one half (exp_half: the event occurs in the middle of the interval) for the time point at which the event occurs. With a binomial count, the exposure should be one at all time points (the default). Age and age_sq should be used as class-dependent predictors. You will identify two groups with clearly different age patterns in the rate of first delinquency. The variable "tot" can be used as a class-independent predictor, but it is more interesting to use it as a covariate: does the risk factor determine the type of delinquency trajectory?

This example can be modified or extended in many ways:
- You can include time-varying predictors other than the time variables. These can be assumed to have the same or different effects across classes.
- You can include information on another event. In that case, your classes describe the pattern in multiple events.
- You can include as many covariates as you want (these will usually be demographics, but can also be a treatment).
- You can model the time dependence as nominal, yielding a Cox-like model.

A general reference on event history combined with LC analysis is Vermunt (1997), Log-linear event history analysis. Sage Publications.


LC Cluster Analysis

Q. How does latent class cluster analysis compare with the traditional clustering procedures in SAS and SPSS?

A. LC clustering is model-based in contrast to traditional approaches that are based on ad-hoc distance measures. The general probability model underlying LC clustering more readily accommodates reality by allowing for unequal variances in each cluster, use of variables with mixed scale types, and formal statistical procedures for determining the number of clusters, among many other improvements. For a detailed comparison showing how LC cluster outperforms SPSS K-means clustering and SAS FASTCLUS procedures, see Latent Class Modeling as a Probabilistic Extension of K-means Clustering.

Q. How does Latent GOLD® classify cases into latent classes?

A. Cases are assigned to the latent class having the highest posterior membership probability. Covariates can be added to the model for improved description and prediction of the latent classes.


Q. When the 'Include Missing' option is selected, does Latent Gold do some kind of imputation?

A. No, imputation is not necessary. Classification with missing values works exactly the same as classification without missing values. It is simply based on the variables that are observed for the case concerned. There is no imputation of missing values for indicators. One of the nice things about LC analysis is that imputation is not necessary.

In the User's Guide, we give the general form of the density with missing values. The crucial thing is the delta, which is 0 if an indicator is missing. If that occurs the term cancels (it is equal to 1 irrespective of the value of y).

Thus with 4 indicators y1, y2, y3, and y4, two clusters and y2 missing

P(x|y1,y3,y4) = P(x) P(y1|x) P(y3|x) P(y4|x) / P(y1,y3,y4)

where

P(y1,y3,y4) = P(1) P(y1|1) P(y3|1) P(y4|1) + P(2) P(y1|2) P(y3|2) P(y4|2)
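The formula above can be sketched in Python as follows. The class sizes and conditional response probabilities are made-up illustrative values, not Latent GOLD output, and the function name is hypothetical. A missing indicator is coded as None and simply contributes a factor of 1.

```python
def posterior(class_sizes, cond_probs, observed):
    """Posterior class-membership probabilities P(x | observed indicators).

    class_sizes: list with P(x) for each class
    cond_probs:  cond_probs[x][name][value] = P(y_name = value | x)
    observed:    dict name -> observed value, or None when missing
    """
    joint = []
    for x, p_x in enumerate(class_sizes):
        term = p_x
        for name, value in observed.items():
            if value is None:
                continue  # delta = 0: the term equals 1 and drops out
            term *= cond_probs[x][name][value]
        joint.append(term)
    total = sum(joint)  # P(y1, y3, y4), summed over the classes
    return [j / total for j in joint]


# Hypothetical 2-cluster model with four dichotomous indicators (values 0/1).
class_sizes = [0.6, 0.4]
cond_probs = [
    {"y1": [0.8, 0.2], "y2": [0.7, 0.3], "y3": [0.9, 0.1], "y4": [0.6, 0.4]},
    {"y1": [0.3, 0.7], "y2": [0.2, 0.8], "y3": [0.4, 0.6], "y4": [0.1, 0.9]},
]

# y2 is missing for this case, so only y1, y3 and y4 are used.
case = {"y1": 1, "y2": None, "y3": 1, "y4": 1}
post = posterior(class_sizes, cond_probs, case)
print(post)
```

Because the y2 term is skipped, the result does not depend on the y2 parameters at all, which is why no imputation is needed.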


Q. How are Latent Class (LC) clustering techniques related to Fuzzy Clustering Techniques?

A. In fuzzy clustering, a case has grades of membership which are the "parameters" to be estimated (Kaufman and Rousseeuw, 1990). In contrast, in LC clustering an individual's posterior class-membership probabilities are computed from the estimated model parameters and the observed scores. The advantage of the LC approach is that it is possible to use the LC model to classify other cases (outside the sample used to estimate the model) which belong to the population. This is not possible with standard fuzzy clustering techniques.

Kaufman, L. and Rousseeuw, P.J. 1990. Finding groups in data: An introduction to cluster analysis, New York: John Wiley and Sons.


Q. How can I tell if my latent class cluster model contains local dependencies?

A. Local dependence for a K-class model exists if the model does NOT fit the data. One such measure of model fit is given by the bivariate residuals (BVRs) associated with each pair of model indicators. Large BVRs (values over 2) can be viewed as evidence of local dependence associated with that pair of model indicators (see the residuals output of Latent GOLD for these statistics).


Q. How can I handle local dependencies in my LC cluster model?

A. Local dependence can be accounted for by simply adding latent classes or by maintaining the current number of classes and modifying the model in other ways such as adding direct effects associated with 2 variables that have large bivariate residuals. See the LG 4.0 manual for details of how to add direct effects. See also section 3 in http://www.statisticalinnovations.com/articles/sage11.pdf for further details of the different approaches for dealing with local dependence.


Q. In LC cluster models containing continuous indicators, how can I determine whether a model should contain a within class correlation between 2 or more of these variables?

A. You can estimate several models and select the one that fits best according to BIC. For example, six types of LC cluster models are reported in Table 1 of the Latent Class Cluster Analysis article. These models differ with respect to a) the specification of class-dependent vs. class-independent error variances and b) the 'direct effects' included in the LC cluster model estimated by Latent GOLD. The 3-class type-5 model is best according to the BIC statistic. Various parameter estimates and standard errors from this 'final' model are obtained from the Profile and Parameters Output.
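To illustrate the selection rule, here is a minimal Python sketch that ranks candidate models by BIC computed from the log-likelihood, BIC = -2 LL + (number of parameters) x ln(N). The model labels, log-likelihoods, parameter counts, and sample size below are invented for illustration only; they are not the values from Table 1.

```python
import math


def bic(log_likelihood, n_params, n_cases):
    """BIC based on the log-likelihood; lower values indicate a better model."""
    return -2.0 * log_likelihood + n_params * math.log(n_cases)


# Hypothetical candidates: (log-likelihood, number of parameters).
candidates = {
    "model A": (-1650.0, 14),
    "model B": (-1600.0, 20),
    "model C": (-1595.0, 26),
}
n_cases = 200  # hypothetical sample size

scores = {name: bic(ll, k, n_cases) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, value in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: BIC = {value:.2f}")
print("Preferred model:", best)
```

In this made-up example, model C has the highest log-likelihood, but its extra parameters are penalized enough that model B is preferred.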

Click on dataset #29 and download the data and the model setup file diabetes.lgf containing the specifications for each of the 6 types of 3-class cluster models described in Table 1.


LC Factor Analysis

Q. How does latent class factor analysis compare with traditional factor analysis?

A. The LC factor model assumes that each factor contains 2 or more ordered categories, as opposed to traditional factor analysis, which assumes that the factors (as well as the variables) are continuous (interval scaled). The variables in LC factor analysis need not be continuous. They may be of mixed scale types (nominal, ordinal, continuous, counts, or combinations of these). LC Factor also has a close relationship to cluster analysis. For an introduction to LC factor analysis, and to see how it relates to LC cluster analysis, see Magidson and Vermunt, Sociological Methodology, 2001. For a comparison with traditional factor analysis in datamining, see Traditional vs. Latent Class Factor Analysis for Datamining.

Q. I was looking at the lifestyle data set (tutorial #1) and was trying to run a factor model on all lifestyle indicator variables (from Tennis to Military). I have requested an 8-factor model and it has been running for 30 minutes. Am I doing something wrong?

A. A 2-factor model on a 650 MHz computer took less than 2 minutes to estimate, and a 3-factor model took 4 minutes. As the number of factors increases, the estimation time increases exponentially. From an exploratory perspective, you may well find that a 2- or 3-factor solution is already quite informative -- 3 dichotomous factors will segment the sample into 8 distinct clusters! On the other hand, 8 dichotomous factors correspond to 2 to the power 8 = 256 clusters. To see the relationship between factors and clusters, see Magidson and Vermunt, Sociological Methodology, 2001.

Traditional factor analysis (FA) is faster because it makes a simplifying assumption that all variables are continuous and that they follow a multivariate normal (MVN) distribution. When these assumptions are true, only the second order moments (the correlations between the variables) are needed to estimate the model. For these data, the FA assumptions are not justified.

Latent GOLD® does not assume MVN and hence is much more general. It utilizes information from all higher order associations (more than means and correlations) in the estimation of parameters. The resulting solution will be directly interpretable and unique, unlike the traditional FA solution which requires a rotation for interpretability. Traditional vs. Latent Class Factor Analysis for Datamining is an article by the developers of LG that will appear in a book on datamining. It shows why LG factor analysis often provides insights into data that are missed by traditional FA.


LC Regression Analysis

Q. How does latent class regression analysis compare with traditional regression modeling?

A. There are 2 primary kinds of differences. First, the particular regression model is determined automatically by the scale type of the dependent variable: linear regression for continuous, logistic regression for dichotomous, the baseline/adjacent category logit extension for ordinal, multinomial logit for nominal, and Poisson regression for count dependent variables. Second, LC Regression is a mixture model and hence is more general than traditional regression. The special case of 1 class corresponds to the homogeneous population assumption made in traditional regression. In LC regression, separate regressions are estimated simultaneously for each latent class.

Q. I need a mixture modeling program that can handle dependent variables that are dichotomous as well as continuous. Does Latent GOLD® handle this?

A. Yes. Mixture modeling and latent class modeling are synonymous.


Q. Is there any "stepwise" inclusion feature in the LC regression module?

A. No. Since the latent classes may be highly dependent on the predictors that are included, stepwise features have not been implemented in the latent class regression module.

Q. I have a binary dependent variable and five categorical independent variables. I am using Latent GOLD® to find 3 segments among the respondents. The Parameters output shows separate estimates for each segment. However, there appears to be both intercepts as well as betas for dependent variable. I am confused about how to use both of them in terms of predicting.

A. The 'gamma' parameters labeled ‘Intercept’ (and other gamma parameters that would appear if you have covariates) refer to the model that predicts the latent classes as a function of the covariates. If no covariates are included in the model, only the Intercept appears under the label ‘(gamma)’. Beneath the gamma parameters, the parameters labeled 'beta' appear. These refer to the model that predicts the dependent variable (including the regression intercept for the dependent variable). This output has been rearranged in Latent GOLD® 4.0 to provide better separation of the parameters from these two different models. See Tutorial 3 (PDF) for an example.

Latent GOLD® 4.0 also has many additional features useful for prediction, such as the automatic generation of predicted values, the ability to restrict the regression coefficients in many ways, and R-square statistics. See the User's Guide (PDF) and Technical Guide(PDF) for further details on these new features.

Q. Can Latent GOLD® perform multinomial LC regression models? Can it be used with repeated measures such as those obtained in conjoint and discrete choice studies?

A. Multinomial LC regression models are estimated simply by specifying the dependent variable to be nominal. In the case of repeated measures (multiple time points, multiple ratings by the same respondent, etc.), an ID variable can be used to identify the records associated with the same case. (See tutorial #2 for an example of a repeated measures conjoint study.) Latent GOLD® cannot currently estimate conditional logit models of the kind used in discrete choice studies, although such capability will be incorporated in Latent GOLD Choice, an add-on to Latent GOLD® that is now under development.

Q. For LC Regression models, there are several R square statistics reported in the Latent GOLD output. When there are 2 or more latent segments (latent classes), do these still measure the overall strength of the predictors to predict the dependent variable?

A. Yes. One important additional aspect is that estimated class-membership also improves overall prediction and contributes to the magnitude of R square.

Q. I understand that the covariates are used to predict membership in a class based upon the probabilities derived from a multinomial logit model. The classification errors, reduction errors, entropy R square, etc. are associated with this estimation. Correct?

A. This is not fully correct: These measures indeed indicate how well we can predict class membership. But, the covariates alone do not determine classification -- the regression model itself plays a major role in predicting class membership. This prediction/classification is based on a person’s responses on the dependent variable (given predictor values). If you look at the formulas, you can see that the posterior membership probabilities do not only depend on P(x|z), but also on P(y|x,z). Even without any covariates (z), these models usually predict class membership quite well.

Intuitively, one determines which class-specific regression model fits best to the responses of a certain case. The better that a regression model associated with a particular class fits, the higher the probability of belonging to that class. Price sensitive people are assigned to the class for which the regression shows higher price effects, etc.

In Latent GOLD 4.0, we also report a separate R-squared for the prediction of class membership based on covariates only.


Q. I need additional information on Tutorial # 3: LC Regression with Repeated Measures. Specifically, I would like to know how preference ratings and probabilities for different preference levels for a given profile are computed in this example. In other words, how do I use the estimated beta coefficients to compute the probabilities of choosing Ratings 1 thru 5 for a given profile, say [Fashion=Traditional, Quality=High, Price=Lower]? Also, what exactly are the gamma coefficients as distinguished from those labeled betas in the parameters output?

A. The 'ordinal' dependent variable specification is used in this example, which causes the baseline category logit model to be used. The beta coefficients listed in the column of the parameters output file corresponding to a particular latent class are the b-coefficients in the following model:

f( j | Z1, Z2, Z3) = b0(j) + b1*Z1*y(j) + b2*Z2*y(j) + b3*Z3*y(j).

The b0 estimates are the betas associated with each rating category j of the dependent variable RATING.

The y(j), j=1,2,3,4,5 are the fixed scores used for the dependent variable (1, 2, 3, 4, and 5 in this example).

The desired probabilities are thus computed as:

Prob(Rating = j | Z1, Z2, Z3) = exp[f(j)] / [exp(f(1))+exp(f(2))+exp(f(3))+exp(f(4))+exp(f(5))] , j = 1,…,5
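As a sketch of this calculation for one latent class, the Python function below evaluates f(j) for each rating category and normalizes the exponentials. The coefficient values and the 0/1 coding of the profile are hypothetical placeholders; in practice you would substitute the class-specific betas from the parameters output and the coding actually used for FASHION, QUALITY, and PRICE.

```python
import math


def rating_probabilities(b0, slopes, z, scores):
    """Baseline category logit probabilities for one latent class.

    b0:     intercepts b0(j), one per rating category
    slopes: (b1, b2, b3), the predictor effects
    z:      (Z1, Z2, Z3), the predictor values for the profile
    scores: fixed category scores y(j)
    """
    linear = sum(b * zk for b, zk in zip(slopes, z))
    f = [b0_j + linear * y_j for b0_j, y_j in zip(b0, scores)]
    f_max = max(f)  # subtract the maximum for numerical stability
    exp_f = [math.exp(value - f_max) for value in f]
    total = sum(exp_f)
    return [e / total for e in exp_f]


# Hypothetical coefficients for one class (not taken from Tutorial 3).
b0 = [-0.5, 0.2, 0.6, 0.1, -0.4]
slopes = (0.3, 0.8, -0.4)
profile = (0, 1, 1)  # assumed coding of one profile on Z1, Z2, Z3
scores = [1, 2, 3, 4, 5]

probs = rating_probabilities(b0, slopes, profile, scores)
for rating, p in zip(scores, probs):
    print(f"P(Rating = {rating}) = {p:.4f}")
```

Repeating the calculation with each class's coefficients gives the class-specific rating probabilities for the same profile.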

(For additional technical information on this model, see the following Magidson references.)

"Maximum Likelihood Assessment of Clinical Trials Based on an Ordered Categorical Response."  Drug Information Journal, Maple Glen, PA: Drug Information Association, Vol. 30, No. 1, 1996.

"Multivariate Statistical Models for Categorical Data," chapter 3 in Bagozzi, Richard, Advanced Methods of Marketing Research, Blackwell, 1994.

Additional coefficients, labeled gammas (as opposed to betas) pertaining to the multinomial logit model for predicting the latent variable as a function of the covariates (SEX and AGE for this example) are listed at the bottom of the parameters output file (in Latent GOLD® 4.0). In the model containing no covariates, the gamma coefficients (labeled 'intercepts') relate to the size of the classes which are always ordered from largest (the first latent class) to smallest (the last class).


Q. After I run the model, say on a binary response, and get two latent classes with their set of parameters, I'd like to predict the response of a new observation, with a given set of predictors and with or without covariates, but unknown response. How can I get this from Latent Gold?

A. With active covariates, posterior membership probabilities are computed for cases with missing responses (whether or not they are 'new' cases), based on their covariate values, as shown in Latent GOLD's 'covariate classification' output. These probabilities are used as weights applied to the predictions for each latent class, using the predictors for such cases and the regression coefficients associated with that class to get the appropriate prediction for each class. Without active covariates, the posterior membership probabilities are taken to be the overall class sizes.

In practice, include the new cases in the data file with a case weight close to 0 (say 1E-49), give all other cases a weight of 1, and use the 'include missing' option. The new cases will then not be used to estimate the model parameters (so the same solution would be obtained without them), but if you request that Predictions be output to a file, predictions for ALL cases, including the new cases, will be written to the file.
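The weighting described above can be sketched in Python for a binary response. This is an illustration only, under the assumption that each class has a logistic regression with an intercept and one beta per predictor; the function names and all numeric values are hypothetical, not Latent GOLD output.

```python
import math


def logistic(value):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-value))


def predict_new_case(class_probs, class_coefs, predictors):
    """Weighted prediction of P(response = 1) for a case with unknown response.

    class_probs: membership probabilities for the case (from the covariates,
                 or the overall class sizes when there are no active covariates)
    class_coefs: one (intercept, betas) pair per class
    predictors:  predictor values for the new case
    """
    prediction = 0.0
    for weight, (intercept, betas) in zip(class_probs, class_coefs):
        eta = intercept + sum(b * x for b, x in zip(betas, predictors))
        prediction += weight * logistic(eta)  # class prediction times its weight
    return prediction


# Hypothetical 2-class solution with two predictors.
class_coefs = [(-1.0, [0.8, 0.5]), (0.5, [-0.3, 1.2])]
class_sizes = [0.65, 0.35]  # used as weights when there are no covariates
new_case = [1.0, 0.0]

print(predict_new_case(class_sizes, class_coefs, new_case))
```

With active covariates, the only change is that class_probs would hold the covariate-based membership probabilities for the new case instead of the overall class sizes.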


Technical Questions from Latent GOLD® Users

Q. I scored my data file using the 'classification output file' option and found that the percentage of each class is different than the class sizes given in the profile output. What am I (or Latent GOLD® ) doing wrong?

A. Nothing is wrong. What you are observing is the effects of misclassification errors associated with assignment to a latent class based upon the modal (highest) class probability. For example, in a 3-class model if the posterior membership probabilities for cases having a given response pattern are 0.2 (for class 1), 0.7 (for class 2), and 0.1 (for class 3), the modal probability is 0.7. Assignment based on the modal probability means that all such cases will be assigned to class 2. However, such assignment is expected to be correct for only 70% of these cases, since 20% truly belong to class 1 and the remaining 10% belong to class 3. The expected misclassification rate for these cases will be 20% + 10% = 30%. For cases with other response patterns, the misclassification rate may be 7%, or 2%, etc. The modal assignment rule minimizes the overall expected misclassification rate (the overall expected misclassification rate is given in the output). To the extent the misclassification rate is greater than 0, the observed frequency distribution of class memberships will reflect the effects of such misclassification. The marginal distributions in the classification table show how the marginal distribution changes when using modal class assignment.
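The effect can be reproduced with a small Python sketch. The first response pattern uses the posterior probabilities from the example above (0.2, 0.7, 0.1); the second pattern and the case counts are hypothetical additions so that the two distributions can be compared.

```python
def modal_assignment(posteriors):
    """Return (index of the modal class, expected misclassification rate)."""
    modal = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return modal, 1.0 - posteriors[modal]


def summarize(patterns):
    """patterns: list of (number of cases, posterior probabilities)."""
    n_total = sum(n for n, _ in patterns)
    n_classes = len(patterns[0][1])
    modal_share = [0.0] * n_classes     # class proportions after modal assignment
    expected_share = [0.0] * n_classes  # class sizes implied by the posteriors
    error = 0.0                         # overall expected misclassification rate
    for n, post in patterns:
        modal, err = modal_assignment(post)
        modal_share[modal] += n / n_total
        error += err * n / n_total
        for k, p in enumerate(post):
            expected_share[k] += p * n / n_total
    return modal_share, expected_share, error


patterns = [
    (100, [0.2, 0.7, 0.1]),  # pattern from the example: all assigned to class 2
    (50, [0.6, 0.3, 0.1]),   # hypothetical second pattern: all assigned to class 1
]
modal_share, expected_share, error = summarize(patterns)
print("Modal assignment shares:", [round(s, 3) for s in modal_share])
print("Expected class sizes:   ", [round(s, 3) for s in expected_share])
print("Expected misclassification rate:", round(error, 3))
```

In this toy example no case is assigned to class 3 even though its expected size is 10%, which is the kind of discrepancy between scored class percentages and profile class sizes described in the question.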


Q. I have included several ordinal variables with many values in my model and the program takes a very long time to run. Can I do anything to speed it up?

A. Substantial improvement in speed can be accomplished by specifying the variables to be continuous. Alternatively, the grouping option can be used to reduce the number of levels in the ordinal variables (to say 10 or 20).

Q. The output listings in the manual for the IRIS data contain some errors in the statistics. What are the correct results?

A. Yes, the output listings in the manual for the IRIS data contain some errors in the statistics. To obtain the correct results, you can download the correct specification for each model (iris.lgf) and the data (iris.dat).


Latent GOLD Advanced Questions

Q. What additional functionalities are gained with the advanced version of Latent Gold?

A. The advanced version of Latent GOLD 4.0 consists of an advanced module containing the ability to 1) estimate multi-level latent class models, 2) incorporate complex sampling designs, and 3) include random effects with continuous factors (CFactors). An overview of these capabilities is provided in section 8 of the Technical Guide, followed by detailed documentation for each of these 3 advanced features in sections 9, 10 and 11, as well as output produced by these advanced features in section 12. You may download a demo version that contains the Advanced module and use it with any of our demo data sets.

Q. I have the Advanced version of Latent GOLD 4.0. Is it possible to estimate IRT models such as the Rasch model, Rost's Rasch mixture model, partial credit model and rating scale models that can be estimated in the WINMIRA program? If so, how do the parameterizations differ?

A. Yes, Latent GOLD Advanced (LGA) can be used to estimate a wide variety of IRT and IRT mixture models. This .pdf describes the connections between various LGA and standard IRT models. The associated .lgf and data files illustrate examples that can be run with our demo data sets. (Note that we set the Bayes constant to 0 in these runs.)



E-mail Contact: will@statisticalinnovations.com
Address: Statistical Innovations, 375 Concord Avenue, Belmont, MA 02478-3084
Phone: +1.617.489.4490
Fax: +1.617.489.4499