This note presents an analysis of a famous data example in GOLDMineR® and shows unique
insights into the effects present in the data. This note also shows how you can use an
odds framework to understand effects in the data.
In his 1972 article "A Modified Multiple Regression Approach to the Analysis of
Dichotomous Variables," Leo Goodman presents a version of data reported by Stouffer
et al. in their study "The American Soldier." Stouffer's study was published in
1949 and was based on a survey of American soldiers in World War II. The data should be
understood in the context of the segregated American military and society of the time. The
data have been analyzed many times, and Goodman's approach is both very fruitful for
understanding what is going on in the data and superior to other approaches used
in the past. Goodman's analysis is replicable in SPSS Genlog or in GOLDMineR®.
Goodman presents a cross-classification of soldiers with respect to four dichotomized
variables:
- Race: Black (b) or White (w).
- Region of origin: North (on) or South (os).
- Present camp location: North (pn) or South (ps).
- Preference as to camp location: North (rn) or South (rs).
Goodman fits a logit model in which Preference is a response variable and Race, Region,
and Present camp location are predictors.
First, examine the two-way marginal tables relating each predictor to Preference. Here
is the two-way marginal table relating Race to Preference for camp location.
You can understand the data in terms of simple percentage differences. 52.8% of the
Blacks prefer residing in the South rather than the North, while only 45.9% of the Whites
do. Therefore, Blacks are about 7 percentage points more likely than Whites to prefer
the South.
You can also understand the data in terms of odds. Blacks prefer the South to the North
by odds of 2268 / 2027 or 1.119 to 1. Whites prefer the South to the North by odds of 1717
/ 2024 or 0.848 to 1. Therefore, the odds of preferring the South are 1.119 / 0.848 or
1.32 times as high for Blacks as for Whites.
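The percentage and odds comparisons for Race can be reproduced from the four counts quoted above. The following is a minimal Python sketch; the variable and function names are illustrative and are not part of GOLDMineR®.

```python
# Counts from the Race-by-Preference marginal table quoted in the text.
counts = {
    "Black": {"south": 2268, "north": 2027},
    "White": {"south": 1717, "north": 2024},
}


def share_south(row):
    """Proportion of the row that prefers the South."""
    return row["south"] / (row["south"] + row["north"])


def odds_south(row):
    """Odds of preferring the South rather than the North."""
    return row["south"] / row["north"]


# Difference between the two groups on the percentage scale.
pct_diff = 100 * (share_south(counts["Black"]) - share_south(counts["White"]))

# Ratio of the two groups' odds (Black relative to White).
odds_ratio = odds_south(counts["Black"]) / odds_south(counts["White"])

for race, row in counts.items():
    print(f"{race}: {100 * share_south(row):.1f}% South, odds {odds_south(row):.3f}")
print(f"Difference: {pct_diff:.1f} percentage points; odds ratio {odds_ratio:.2f}")
```

The percentage difference and the odds ratio summarize the same table on two different scales, which is why the note reports both.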
Here is the two-way marginal table relating Region of Origin to Preference for camp
location.
In percentage terms, 23.7% of those from the North preferred the South, while 75.9% of
those from the South preferred the South, for a large difference of 52.2 percentage
points. In odds terms, those originating in the North prefer the South by odds of
958 / 3092 or 0.31 to 1, while those originating in the South prefer the South by odds of
3027 / 959 or 3.156 to 1. Therefore, the odds of preferring the South are 3.156 / 0.31 or
10.186 times as high for those originating in the South as for those originating in the
North.
Here is the two-way marginal table relating Present camp location to Preference for
camp location.
In percentage terms, 26% of those presently in the North prefer the South, while 60.1%
of those presently in the South prefer the South, for a difference of about 34
percentage points. In odds terms, those presently in the North prefer the South by odds
of 644 / 1829 or 0.352 to 1, while those presently in the South prefer the South by odds
of 3341 / 2222 or 1.504 to 1. Therefore, the odds of preferring the South are
1.504 / 0.352 or about 4.27 times as high for those presently in the South as for those
presently in the North.
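The three marginal comparisons can be collected in one place and ranked. The sketch below recomputes each odds ratio directly from the counts quoted in the text, so its results may differ slightly from figures built on rounded odds. The dictionary layout and function name are illustrative assumptions.

```python
def odds_ratio(south_a, north_a, south_b, north_b):
    """Odds of preferring the South in group A relative to group B."""
    return (south_a / north_a) / (south_b / north_b)


# (prefer South, prefer North) counts for the comparison group and the
# baseline group, taken from the three marginal tables described in the text.
marginal_tables = {
    "Race (Black vs. White)": ((2268, 2027), (1717, 2024)),
    "Origin (South vs. North)": ((3027, 959), (958, 3092)),
    "Present (South vs. North)": ((3341, 2222), (644, 1829)),
}

ratios = {
    name: odds_ratio(*group_a, *group_b)
    for name, (group_a, group_b) in marginal_tables.items()
}

# Sort from the strongest to the weakest marginal association.
ranking = sorted(ratios, key=ratios.get, reverse=True)
for name in ranking:
    print(f"{name}: odds ratio {ratios[name]:.2f}")
```

Ranking by the size of the odds ratio works here because all three ratios exceed 1; with ratios below 1, comparing the absolute value of the log odds ratio would be the safer choice.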
To sum up so far: the two-way marginal tables rank the predictors, by strength of
association with Preference, as Origin, Present, and Race. However, each two-way table
shows the effect of a predictor as if it were the only one. As in multiple regression,
the question is what effect each predictor has given the others.
One way to get a sense of that is to examine the four-way cross-classification of all
variables. This is shown in the next figure.
Reading the above table, for example, for a Black Northerner in a Northern camp, the
odds are 36 to 387 (0.093 or about 1 to 10.75) that he will prefer a Southern camp or 387
to 36 (10.75 to 1) that he will prefer a Northern camp. Examining the table, you can see
how the observed odds vary as the predictor variables vary.
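Odds can be read in either direction and converted to a probability. A small Python sketch using the one cell quoted above (36 versus 387); the helper name is illustrative.

```python
from fractions import Fraction

# Cell counts quoted in the text for Black Northerners in a Northern camp.
prefer_south, prefer_north = 36, 387

odds_south = Fraction(prefer_south, prefer_north)  # 36 to 387
odds_north = 1 / odds_south                        # 387 to 36


def odds_to_probability(odds):
    """Convert the odds of an event into the probability of that event."""
    return odds / (1 + odds)


print(f"odds of preferring the South: {float(odds_south):.3f}")
print(f"odds of preferring the North: {float(odds_north):.2f}")
print(f"probability of preferring the South: "
      f"{float(odds_to_probability(odds_south)):.3f}")
```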
Using GOLDMineR®, you can:
- Develop a model that describes quantitatively how the odds in the above table are
affected by the predictor variables.
- Test whether the model fits the data.
- Measure how well the model fits the data.
- Assess the statistical significance of the model parameters.
- Estimate the main effects of the predictor variables and their effect on the odds.
- Obtain expected frequencies given the model.
The GOLDMineR® Define Model window appears in Figure 4 below.
Specify Prefer as the dependent variable, and specify Race, Origin, and Present camp
location as predictors. The model fitted is a main effects model.
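One common way to write a main effects logit model for these data is shown below. The notation is an assumption made for illustration and is not taken from GOLDMineR®: the x terms are 0/1 indicators for the non-baseline category of Race, Origin, and Present camp location, and alpha is the baseline log-odds.

```latex
\log\frac{P(\text{South}\mid r,o,p)}{P(\text{North}\mid r,o,p)}
  = \alpha + \beta_R\,x_r + \beta_O\,x_o + \beta_P\,x_p
\qquad\Longleftrightarrow\qquad
\frac{P(\text{South}\mid r,o,p)}{P(\text{North}\mid r,o,p)}
  = e^{\alpha}\,\bigl(e^{\beta_R}\bigr)^{x_r}\bigl(e^{\beta_O}\bigr)^{x_o}\bigl(e^{\beta_P}\bigr)^{x_p}
```

In this form each predictor multiplies the odds by its own factor regardless of the values of the other predictors, which is what "main effects" means on the odds scale.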
Figure 5 (below) shows the Association Summary.
The Total L² (likelihood ratio chi-square) shown, 3111.47, is the same as
that reported above in the test of joint independence. The main effects model L²
is 3086.51, which is 99.2% of the total. The residual L² of 24.96 on 4 degrees
of freedom is statistically significant but relatively small. The model R-squared is
0.345. Note that for dichotomous and ordered response variables, an R-squared of 1 is not
always mathematically attainable, so you should interpret this coefficient with that in
mind. The phi coefficient of association is 0.7551. These numbers are both sizable under
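The arithmetic behind the Association Summary can be checked by hand. The sketch below uses only the figures quoted above and a closed-form chi-square tail probability that holds for even degrees of freedom; the function name is illustrative, and this is not how GOLDMineR® itself computes the test.

```python
import math

total_l2 = 3111.47   # total L-squared reported in the Association Summary
model_l2 = 3086.51   # L-squared accounted for by the main effects model
residual_df = 4      # residual degrees of freedom reported in the text

residual_l2 = total_l2 - model_l2
explained_share = model_l2 / total_l2


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


p_value = chi2_sf_even_df(residual_l2, residual_df)
print(f"residual L2 = {residual_l2:.2f}, explained share = {explained_share:.1%}")
print(f"p-value for the residual on {residual_df} df = {p_value:.1e}")
```

A very small p-value alongside a 99.2% explained share is the pattern the text describes: the lack of fit is statistically detectable because the sample is large, yet small relative to the total association.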
Here are likelihood ratio tests for the individual effects in the model.
All terms are highly significant, with predictor importance order being Origin,
Present, and Race. The exp(Beta) column shows coefficients that have an odds
interpretation:
- Net of other terms in the model, the odds that Whites prefer the South to the North
are 2.14 times those for Blacks.
- Net of other terms in the model, the odds that those of Southern origin prefer the
South to the North are 13.64 times those for soldiers of Northern origin.
- Net of other terms in the model, the odds that those presently in the South prefer the
South to the North are 4.71 times those for soldiers presently in the North.
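Because the model contains main effects only, the exp(Beta) multipliers combine by multiplication, which is the same as adding the Beta coefficients on the log-odds scale. A minimal Python sketch using the three values quoted above; the dictionary keys are illustrative labels.

```python
import math

# exp(Beta) values quoted in the text for the main effects model.
exp_beta = {
    "Race: White vs. Black": 2.14,
    "Origin: South vs. North": 13.64,
    "Present: South vs. North": 4.71,
}

# Each logit coefficient is the natural log of its odds multiplier.
beta = {name: math.log(value) for name, value in exp_beta.items()}

# Under a main effects model the multipliers combine by multiplication.
combined = math.prod(exp_beta.values())

for name, value in exp_beta.items():
    print(f"{name}: exp(Beta) = {value:.2f}, Beta = {beta[name]:.3f}")
print("White, Southern origin, Southern camp vs. "
      f"Black, Northern origin, Northern camp: odds ratio {combined:.1f}")
```

The combined figure is what the main effects model implies for the two extreme groups; it is a model-based quantity, not an observed odds ratio from the table.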
You can show these effects graphically in GOLDMineR®'s partial X plot. Here are the
partial X plots for each of the three predictors.
Consider the Race plot. The Partial X plot is so named because it places the categories
of a given predictor on the horizontal axis. The baseline for Race is Black,
so the odds comparison is White versus Black. The baseline for Prefer is "Prefer the
North," so the odds comparison is "Prefer the South" versus "Prefer
the North." The diagonal solid line is the expected odds ratio line, while the
diamonds are the corresponding odds ratios observed in the table for preferring
the South to the North. The expected odds ratio, which is the slope of the line, is 2.14.
Another useful GOLDMineR® plot is the Partial regression plot. With the response
variable coded 0,1, this plot shows the probability of a "1" response, here
"Prefers the South," as a function of a given predictor's values conditional on
values of the other predictors. One way to scale the plot is to use weighted average
scoring for all variables. That is, with this scoring scheme, when viewing a particular
partial regression plot, keep in mind that all other predictors are set to their mean
value. This practice is analogous to what is sometimes done in logistic regression. To
understand the impact of a given predictor, you plug in its low and high values while
setting the other predictors to their means. Of course, you are free to set the other
predictors to other values of interest.
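The idea of plugging in a predictor's low and high values while holding the others fixed can be sketched with the logistic transform. In the Python example below, the Race coefficient comes from the exp(Beta) value of 2.14 reported for the main effects model, but the baseline log-odds is a hypothetical placeholder: the source does not report the value that corresponds to the other predictors at their means.

```python
import math


def probability(log_odds):
    """Logistic transform from log-odds to probability."""
    return 1 / (1 + math.exp(-log_odds))


race_beta = math.log(2.14)  # from the exp(Beta) reported for Race

# HYPOTHETICAL: log-odds of preferring the South for the baseline Race
# category (Black) with the other predictors held at their means.
# The source does not report this value; 0.0 is a placeholder.
baseline_log_odds = 0.0

for label, x in (("Black", 0), ("White", 1)):
    p = probability(baseline_log_odds + race_beta * x)
    print(f"{label}: P(prefers South) = {p:.3f}")
```

Changing `baseline_log_odds` shifts both probabilities but leaves the odds ratio between them at 2.14, which is why the choice of values for the other predictors affects the plotted probabilities and not the estimated effect.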
Note that the above three plots are on the same vertical scale.
A related GOLDMineR® plot that puts the multiple predictor information together in one
plot is the joint regression plot. This plot shows categories of a "joint X"
variable formed by crossing predictor categories on the horizontal axis.
This plot shows the probability of Preferring the South relative to the North on the
vertical axis, with each joint-X category on the horizontal axis. The connected diamonds
are the fitted model, while the circles are the probabilities for the observed categories
of the joint-X variable. Note the extremes. At the left, Blacks originally from the North
who are presently located in the North have the lowest probability of preferring the South
to the North, while Whites originally from the South who are presently located in the
South have the highest probability of preferring the South. The lack of fit indicated in
the Association Summary above manifests itself in departures of the 8 observed points from
the fitted curve. While the departures appear relatively slight, they are based on some
sizable category frequencies. The circle size indicates the relative category size. Five
of the eight points shown are whited out, indicating that they have statistically
significant adjusted residuals. In order to see the value of the adjusted residual, click
on the point in question. Or, you can open a Table Window in GOLDMineR® and view the
adjusted residual values. Specify
Here is a portion of that table.
Note that the pattern of signs of the adjusted residuals is + + - - - - + +. This
suggests that an added interaction term involving Origin and Present camp location might
improve the fit.
Indeed, adding an interaction term to the model reduces the residual L² to
1.45 on 3 degrees of freedom, with a p value of 0.69.
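Both the fit of the revised model and the improvement over the main effects model can be checked with chi-square tail probabilities. The sketch below uses closed forms for 1 and 3 degrees of freedom and treats the drop in residual L-squared as a likelihood ratio test of the added term, which is the standard comparison for nested models. The function name is illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability for df = 1 or df = 3 (closed forms)."""
    tail = math.erfc(math.sqrt(x / 2))  # exact upper tail for df = 1
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("this sketch only handles df = 1 or df = 3")


main_effects_l2, main_effects_df = 24.96, 4  # residual before the interaction
interaction_l2, interaction_df = 1.45, 3     # residual after the interaction

# Goodness of fit of the revised model.
p_fit = chi2_sf(interaction_l2, interaction_df)

# Likelihood ratio test of the added interaction term (nested models).
drop = main_effects_l2 - interaction_l2
p_drop = chi2_sf(drop, main_effects_df - interaction_df)

print(f"revised model: L2 = {interaction_l2} on {interaction_df} df, p = {p_fit:.2f}")
print(f"improvement: L2 drop = {drop:.2f} on 1 df, p = {p_drop:.1e}")
```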
Here is the joint regression plot with this revised model.
Note the better fit of the fitted curve to the observed points.
Here are the individual coefficients with the revised model.
Of course, the added interaction term complicates things a bit, but results in better