# CORExpress®

CORExpress® 1.1 implements Correlated Component Regression (CCR) and focuses on regression analysis (linear regression, logistic regression, linear discriminant analysis and survival models) with a large number of correlated predictors *P*, which may exceed the sample size *n* (see the product description below for more details).

CORExpress® is licensed on an annual basis. Discounts for multiple users are automatically applied to your order (2-4 licenses – 15%, 5-9 licenses – 20%, 10+ licenses – 25%).

## Description

CORExpress® develops improved regression and classification models for linear regression, logistic regression, linear discriminant analysis, and survival models (Cox regression). It also handles multicollinearity due to correlated predictors effectively, even with high-dimensional data (more variables than cases).

CORExpress® improves:

- Interpretation of regression coefficients (see Tutorial 1)
- Out-of-sample prediction (see Tutorial 2)
- Classification (see Tutorial 3)
- Variable selection (see Tutorials 1-3)

CORExpress® develops regression models using Correlated Component Regression (CCR) methods. CCR was developed by Dr. Jay Magidson for simultaneously estimating regression models and selecting predictors from a potentially large number of candidate predictors. Reliable models are obtained using a fast algorithm that incorporates M-fold cross-validation to optimize the tuning parameters (the amount of regularization, K, and the number of predictor variables, P). Final models may even include more predictors than cases, which is impossible with traditional regression methods!

Program features include:

Correlated Component Regression (CCR) is a general ensemble approach for the development of a K-component regression model, developed by Dr. Jay Magidson. The 1st component is an average (ensemble) of all 1-predictor direct effects. Thus, the 1-component CCR model is equivalent to Naïve Bayes and to natural generalizations of the Naïve Bayes approach. Naïve Bayes has been shown to outperform traditional regression methods in situations involving high-dimensional data, both in theory (see Bickel and Levina, 2004) and in practice with real data (Dudoit et al., 2002).

The 2nd component, correlated with the 1st, improves prediction over the 1-component model, primarily by capturing the effects of suppressor variables, and each additional component improves prediction further. Suppressor variables are typically among the most important predictors and are commonly available in practice. Unfortunately, suppressor variables are automatically excluded by most screening methods in current use, a misguided practice akin to ‘throwing out the baby with the bath water’ according to Magidson and Wassmann (2010). In addition, the effects of suppressor variables are severely understated in Naïve Bayes models. Therefore, for high-dimensional data, CCR models containing 2-4 components typically outperform both Naïve Bayes and traditional models.
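To make the component construction concrete, here is a minimal NumPy sketch of a linear CCR fit. It assumes the component definition suggested by the description above: component k averages, over all P predictors, the coefficient each predictor receives when the outcome is regressed on that predictor plus the k-1 earlier components. The function names `ols_coefs` and `ccr_linear` are hypothetical, and the sketch is an illustration rather than the CORExpress® implementation.

```python
import numpy as np


def ols_coefs(columns, y):
    """Least-squares coefficients for y ~ 1 + columns; the intercept comes first."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(design, y, rcond=None)[0]


def ccr_linear(X, y, K):
    """Fit a K-component model; return (intercept, coefficients on the original predictors)."""
    P = X.shape[1]
    components, loadings = [], []
    for _ in range(K):
        # Effect of each predictor taken one at a time, controlling for earlier components.
        lam = np.array([ols_coefs(components + [X[:, g]], y)[-1] for g in range(P)])
        loadings.append(lam / P)          # average (ensemble) of the P single-predictor effects
        components.append(X @ (lam / P))  # component scores S_k for every case
    beta = ols_coefs(components, y)       # final regression of y on S_1..S_K
    coefs = sum(b * w for b, w in zip(beta[1:], loadings))
    return beta[0], coefs


# Simulated data (an assumption for illustration): x2 carries only the nuisance part of x1,
# so it is nearly uncorrelated with y yet useful as a suppressor.
rng = np.random.default_rng(0)
signal, nuisance = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([signal + nuisance, nuisance + 0.1 * rng.normal(size=200)])
y = signal + 0.1 * rng.normal(size=200)
for K in (1, 2):
    intercept, coefs = ccr_linear(X, y, K)
    print(K, coefs)
```

In this simulated setup the 1-component fit gives the suppressor `x2` a coefficient near zero, because its one-predictor effect is close to zero, while the 2-component fit assigns it a sizable negative weight.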

CCR methods also include an optional predictor selection procedure, which has been shown to be effective in excluding irrelevant and extraneous predictors, thus improving the predictive performance of the models on new data. For a given K-component model, the unique step-down procedure eliminates the least important variable, where importance is quantified as the absolute value of the variable’s standardized coefficient. CORExpress® also allows the elimination of more than 1 predictor at a time. By default, 1% of the predictors are eliminated at each step until 100 predictors are left, at which point the procedure returns to the normal step-down of 1 predictor at a time. This feature is a useful time-saving device when you have a very large number of predictors. See Tutorial 3 for an example where the best 10 predictors from over 3,000 candidate predictors are selected in only a few seconds.
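The default elimination schedule described above can be sketched in a few lines of Python. The rounding rule (round the 1% up) and the rule that the fast phase never drops below the 100-predictor mark are assumptions made for this illustration; the function names are hypothetical and the program's exact behavior may differ.

```python
import math


def step_down_schedule(p_start, p_min=1, percent=1, threshold=100):
    """Return the sequence of predictor counts visited by the step-down procedure."""
    counts = [p_start]
    p = p_start
    while p > p_min:
        if p > threshold:
            # Fast phase: drop `percent`% of the remaining predictors (rounded up),
            # without overshooting the threshold (assumption).
            n_drop = min(math.ceil(p * percent / 100), p - threshold)
        else:
            n_drop = 1  # normal step-down: one predictor at a time
        n_drop = min(n_drop, p - p_min)
        p -= n_drop
        counts.append(p)
    return counts


def least_important(std_coefs, n_drop):
    """Names of the n_drop predictors with the smallest absolute standardized coefficient."""
    ranked = sorted(std_coefs, key=lambda name: abs(std_coefs[name]))
    return ranked[:n_drop]


schedule = step_down_schedule(3000, p_min=10)
print(len(schedule), schedule[:4])
print(least_important({"age": 0.50, "income": -0.05, "tenure": -0.90}, 1))
```

Each step would refit the K-component model on the remaining predictors and then call something like `least_important` to decide which ones to drop next.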

You can specify not only the minimum number of predictors to end up with, but also the maximum. Suppose you had 10,000 predictors to begin with. If you step down to 1 predictor but specify a maximum of 20 predictors, you will end up with the best model containing between 1 and 20 predictors. Thus, the summary output at the bottom of the Model Output Window, as well as the associated graph, will report results only for models containing between 1 and 20 predictors.
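As a rough illustration of how a minimum and maximum restrict the candidate models, the sketch below picks the best cross-validated statistic among predictor counts inside the requested range. The dictionary of statistics, the function name, and the tie-breaking rule (prefer fewer predictors) are assumptions made for this example.

```python
def best_model_in_range(cv_stat_by_p, p_min, p_max):
    """Return (number of predictors, statistic) for the best model with p_min <= P <= p_max."""
    eligible = {p: stat for p, stat in cv_stat_by_p.items() if p_min <= p <= p_max}
    if not eligible:
        raise ValueError("no model was evaluated inside the requested range")
    # Scanning counts in ascending order makes ties resolve toward fewer predictors (assumption).
    best_p = max(sorted(eligible), key=lambda p: eligible[p])
    return best_p, eligible[best_p]


# Hypothetical cross-validated statistics (higher is better), keyed by number of predictors.
cv_stats = {1: 0.50, 5: 0.70, 20: 0.70, 50: 0.90}
print(best_model_in_range(cv_stats, p_min=1, p_max=20))
```

Models with more than 20 predictors are ignored here even when their statistic is higher, which mirrors the way the reported results are limited to the requested range.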

While R² is the primary statistic for assessing prediction in linear regression (CCR-Linear), accuracy (ACC) and the Area Under the ROC Curve (AUC) are the corresponding statistics for regression involving a dichotomous dependent variable, based on either logistic regression (CCR-Logistic) or linear discriminant analysis (CCR-LDA), the primary models used to predict a dichotomous dependent variable. A unique integrated ROC/scatter plot is available for the dichotomous case; it allows easy manipulation of the cutpoint for classifying cases and for assessing performance on training data, and on validation data when available.
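For readers less familiar with these statistics, the following sketch computes ACC at a chosen cutpoint and AUC from predicted scores using their standard definitions. The function names and the toy scores are hypothetical, and nothing here reflects how the program computes or displays the values internally.

```python
def accuracy(scores, labels, cutpoint):
    """Share of cases classified correctly when score >= cutpoint predicts class 1."""
    predicted = [1 if s >= cutpoint else 0 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def auc(scores, labels):
    """Probability that a random class-1 case outscores a random class-0 case (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


scores = [0.10, 0.40, 0.35, 0.80]  # toy predicted scores
labels = [0, 0, 1, 1]              # observed dichotomous outcome
for cutpoint in (0.3, 0.5):
    print(cutpoint, accuracy(scores, labels, cutpoint))
print(auc(scores, labels))
```

Moving the cutpoint changes ACC but leaves AUC untouched, since AUC depends only on how the scores rank the two classes.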

The use of M-fold cross-validation to determine the optimal value for one or more 'tuning parameters' is standard practice in data mining. It is used, for example, to obtain the single tuning parameter 'lambda' in the lasso penalized regression approach. In CORExpress®, we use it to optimize the two tuning parameters, P and K. We do this efficiently by performing the cross-validation for each component separately. In practice, it is very fast, since users will typically evaluate only a small number of models, with the number of components K less than 10 regardless of the number of predictors P.
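The general idea of M-fold cross-validation for a tuning parameter can be sketched as follows. This is a generic illustration using only the Python standard library, not the component-wise scheme used for P and K; the ridge-penalized slope model and all function names are hypothetical stand-ins chosen to keep the example short.

```python
import random


def m_fold_indices(n, m, seed=0):
    """Shuffle case indices 0..n-1 and deal them into m folds of near-equal size."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::m] for i in range(m)]


def cv_error_by_value(x, y, grid, fit, predict, m=10, seed=0):
    """Return {tuning value: mean squared prediction error on held-out cases}."""
    folds = m_fold_indices(len(y), m, seed)
    result = {}
    for value in grid:
        squared_error = 0.0
        for held_out in folds:
            excluded = set(held_out)
            train = [i for i in range(len(y)) if i not in excluded]
            model = fit([x[i] for i in train], [y[i] for i in train], value)
            squared_error += sum((y[i] - predict(model, x[i])) ** 2 for i in held_out)
        result[value] = squared_error / len(y)
    return result


# Hypothetical one-parameter model: a ridge-penalized slope through the origin.
def fit_ridge_slope(x, y, penalty):
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


def predict_slope(slope, x_new):
    return slope * x_new


x = [float(i) for i in range(1, 21)]
y = [2.0 * v for v in x]
errors = cv_error_by_value(x, y, [0.0, 10.0, 100.0], fit_ridge_slope, predict_slope, m=5)
print(min(errors, key=errors.get))
```

Every case is held out exactly once, so the statistic for each tuning value is based on predictions for data the corresponding model never saw.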

For cross-validation purposes, users can also specify R repeated rounds of M-folds. A table of predictor variable counts is provided when the step-down algorithm is used with cross-validation (CV) to complement the use of the default CV statistics in the selection of the number of predictors as well as the particular predictors to include in the model. For each round of M-folds, the predictor count table provides the total number of times a predictor was selected for inclusion in the model, with a maximum of M in the case that it was selected regardless of which fold was excluded during the cross-validation process. Irrelevant and extraneous predictors will tend to have a count of 0, or if non-zero, much lower than M in each round.
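A predictor count table of this kind can be tallied as in the sketch below. It assumes that, for each round, you already have the set of predictors selected when each fold was held out; the data layout, predictor names, and function name are hypothetical.

```python
from collections import Counter


def predictor_count_table(selected_by_round):
    """Build one Counter per round: predictor name -> number of folds in which it was selected.

    selected_by_round[r][f] holds the predictors chosen in round r with fold f held out.
    """
    table = []
    for folds in selected_by_round:
        counts = Counter()
        for selected in folds:
            counts.update(set(selected))  # each predictor counts at most once per fold
        table.append(counts)
    return table


# Two rounds (R = 2) of three folds (M = 3) with made-up selections.
rounds = [
    [{"x1", "x2"}, {"x1"}, {"x1", "x3"}],
    [{"x1"}, {"x1", "x2"}, {"x1", "x2"}],
]
for r, counts in enumerate(predictor_count_table(rounds), start=1):
    print(r, dict(counts))
```

A predictor that reaches the maximum count M in every round was selected no matter which fold was excluded, whereas predictors missing from a round's counter have a count of 0.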

- Full Windows implementation - point and click
- Automatic model scoring of training and validation (holdout) records
- Variables of several scale types can be used:
  - CCR-Linear – Continuous dependent variable
  - CCR-Logistic – Dichotomous dependent variable
  - CCR-LDA – Dichotomous dependent variable and continuous predictors satisfying the assumptions of linear discriminant analysis (LDA)

Big Data/High-Dimensional analysis book featuring PLS-Regression/Correlated Component Regression (CCR) approaches. Read Dr. Magidson’s keynote CCR chapter.

## System Requirements

CORExpress® is designed to operate on Windows XP/Vista, Windows 7/8, Windows 10, or Windows 11.

System requirements: 64 MB of drive space, 512 MB of RAM.

Input files: SPSS system files, delimited text files.
