CCR Online Course
Regression Modeling and Classification with Many Correlated Predictors:
Key Driver Regression and Other Applications Using CCR Methods
with exercises based on demo versions of CORExpress and XLSTAT-CCR
Course Overview: Recent advances in the analysis of high-dimensional data now allow reliable regression models to be developed even when the number of predictors exceeds the number of cases. In this course we begin by reviewing problems and limitations with traditional linear and logistic regression. Our applications-oriented presentation then shows how the new approaches work, through examples and an overview of the relevant theory with supporting equations. We use real and simulated data sets to illustrate the different approaches.
Dates: April 5, 2013 – May 3, 2013
Please note: There is no "live" element to the course, so there are no required login times. You can access the course materials and discussion board at your own pace, according to your schedule.
Register: Please go to our Course Registration Form to register for the course.
Course Homepage: The course website will contain links to all the assignments and discussions, and it will contain all the materials for each Session.
Note: Participants need not license a copy of the CORExpress or XLSTAT-CCR programs. All participants will have free access to the demo versions of these programs, which allow unrestricted analyses of all course datasets.
Who should sign up for this course: Marketing researchers, biomedical researchers, survey analysts, and anyone who wants to learn the latest tools to develop reliable regression models given the challenges of many correlated predictors that approach or exceed the sample size N (high dimensional data). Applications include predictive models based on gene expression, key driver regression, and more.
Prerequisite: Participants should have taken at least two courses in statistics, and be familiar with the use of linear and logistic regression.
Course Structure: The course takes place online at statisticalinnovations.com. Course participants will be given a username and password for access to a private bulletin board that serves as a forum for discussion and interaction with the instructor. The course is divided into four weekly sessions. Attendees typically spend about 5-15 hours on each session. At the beginning of each week, participants receive the relevant material, in addition to answers to exercises from the previous session. All course materials are posted to a dedicated course homepage, which can be accessed via the same username and password. During each session, participants review the course materials and work through exercises using the CORExpress program. The instructor will provide answers to the exercises and to posted questions, but participants may also engage in discussions with other course participants.
Course Material: No text is required. The material for this course will appear in Dr. Magidson's forthcoming book, to be published by Chapman & Hall/CRC. Through your comments and questions, you have the unique opportunity to contribute to this book. Copies of published articles, forthcoming book excerpts, and other material will be made available. All participants will have free access to the demo version of CORExpress, which allows unrestricted analyses of all course datasets.
Instructor: Dr. Jay Magidson, founder and president of Statistical Innovations Inc. Dr. Magidson's clients have included A.C. Nielsen Co., Household Finance Corp., Blue Cross Blue Shield Association, and Pfizer.
He taught statistics at Tufts and Boston University and is widely published on the theory and applications of multivariate statistical methods. Dr. Magidson designed SPSS CHAID, SI-CHAID®, GOLDMineR® and CORExpress®, is the co-developer (with Jeroen K. Vermunt) of the Latent GOLD® and Latent GOLD® Choice programs, and is co-developer (with Thierry Fahmy) of the XLSTAT-CCR module.
Course Fee:
- $495 (commercial)
- $295 (academic)
- multiple attendees – after the first attendee registers at the full price, additional attendees from the same organization receive a 50% discount.
(Further discounts are available for 4 or more registrants from the same organization – contact us)
- Includes all course materials and access to the CORExpress and XLSTAT-CCR demo programs.
Session 1: Regression Analysis Basics
Session 2: Regularization: Penalized Regression and Dimension Reduction Approaches
- Primary types of regression – linear, logistic, linear discriminant analysis (LDA)
- Prediction vs. classification
- Assessing model performance
- R² for linear regression
- Classification Table and ROC Curve for dichotomous dependent variable
- Accuracy and AUC
- Examples with simulated data
- Linear regression
- Logistic regression / LDA
- Assessing/Validating model performance with and without cut-points
- Holdout samples and M-fold cross-validation
- Generalizability and R²
- Repeated rounds of M-folds
- Graphical displays
- Raw vs. standardized coefficients and measures of predictor importance
- Problems with stepwise regression methods
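As a preview of the holdout and M-fold cross-validation topics listed above, here is a minimal sketch in plain Python. It assumes a single predictor and ordinary least squares. The function names `fit_simple_ols` and `m_fold_cv_r2` are illustrative only; they are not part of CORExpress or XLSTAT-CCR, and the course software may compute cross-validated R² differently.

```python
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def m_fold_cv_r2(x, y, m=5, seed=0):
    """Cross-validated R^2: each case is predicted by a model that never saw it."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)          # random assignment of cases to folds
    folds = [order[k::m] for k in range(m)]     # M roughly equal folds
    sse = 0.0  # squared error of the held-out predictions
    sst = 0.0  # squared error of a baseline that predicts the training mean
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        b0, b1 = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
        train_mean = sum(y[i] for i in train) / len(train)
        for i in held_out:
            sse += (y[i] - (b0 + b1 * x[i])) ** 2
            sst += (y[i] - train_mean) ** 2
    return 1.0 - sse / sst


# Illustrative simulated data (an assumption, not a course dataset).
rng = random.Random(42)
x_demo = [float(i) for i in range(30)]
y_demo = [2.0 * xi + 1.0 + rng.gauss(0.0, 5.0) for xi in x_demo]
print("5-fold cross-validated R^2:", round(m_fold_cv_r2(x_demo, y_demo, m=5), 3))
```

Repeating the call with different `seed` values gives the "repeated rounds of M-folds" idea: the spread of the results shows how much the estimate depends on the particular fold assignment.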
Session 3: Comparison of Variable Selection/Reduction Approaches
- Background/Introduction to Problems with High Dimensional Data
- Need for Regularization
- Naïve Bayes as an extreme form of regularization
- Naïve Bayes outperforms LDA and Logistic regression
- Naïve Bayes as starting point for Correlated Component Regression (CCR)
Tutorial with dichotomous dependent variable and P = 3,571 predictors
- Sparseness and Regularization
- Stepwise Regression
- Penalized regression approaches – Ridge Regression, Lasso, Elastic Net
- Principal Components Regression (PCR) and Partial Least Squares Regression (PLS-R)
- Correlated Component Regression (CCR)
- CCR step-down algorithm
- M-fold cross-validation with CCR step-down
- Right and wrong way to do cross-validation
- Improvements over Penalized Regression, PCR and PLS-R
- Relationship between CCR, Naïve Bayes and traditional regression
- Saturated CCR is equivalent to traditional regression
- K = 1-component CCR is equivalent to Naïve Bayes
- Example: Key Driver Regression
- CCR Variants
- Graphical displays: boxplots and coefficient trace plots
- Examples with real and simulated data
- Example with Near Infrared (NIR) Data
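The "right and wrong way to do cross-validation" item above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the course's own procedure. The predictors are pure noise, a single predictor is selected by its absolute correlation with the outcome, and a nearest-class-mean rule stands in for a real classifier. All function names are hypothetical. In the wrong way the predictor is chosen using every case before the folds are formed; in the right way the selection is repeated inside each training fold.

```python
import random


def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def best_predictor(columns, y, rows):
    """Index of the predictor most correlated with y, using only the given rows."""
    y_rows = [y[i] for i in rows]
    return max(
        range(len(columns)),
        key=lambda j: abs(correlation([columns[j][i] for i in rows], y_rows)),
    )


def count_correct(col, y, train, test):
    """Classify test cases by the nearer class mean estimated on the training cases."""
    ones = [col[i] for i in train if y[i] == 1]
    zeros = [col[i] for i in train if y[i] == 0]
    mean1, mean0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    return sum(
        1 for i in test
        if (abs(col[i] - mean1) < abs(col[i] - mean0)) == (y[i] == 1)
    )


def cv_accuracy(columns, y, m, select_inside_folds, seed=0):
    """M-fold accuracy with selection done inside the folds (right) or before them (wrong)."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    folds = [order[k::m] for k in range(m)]
    # Wrong way: this selection has already seen the cases it will later be tested on.
    chosen_all = None if select_inside_folds else best_predictor(columns, y, order)
    correct = 0
    for test in folds:
        held = set(test)
        train = [i for i in order if i not in held]
        j = best_predictor(columns, y, train) if select_inside_folds else chosen_all
        correct += count_correct(columns[j], y, train, test)
    return correct / len(y)


# Simulated example: 40 cases, 300 noise predictors, labels unrelated to any of them.
rng = random.Random(1)
y_demo = [i % 2 for i in range(40)]
noise = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(300)]
print("wrong way:", cv_accuracy(noise, y_demo, m=5, select_inside_folds=False))
print("right way:", cv_accuracy(noise, y_demo, m=5, select_inside_folds=True))
```

Because no predictor carries real information here, an honest estimate should hover around chance (0.5). Selecting on all cases first typically inflates the estimate, since the chosen predictor was picked for fitting the very cases later used for testing.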
Session 4: Issues and Extensions
- Importance of suppressor variables
- Using M-fold cross-validation for model tuning
- M-fold cross-validation with CCR step-down
- Examples with simulated data with many true predictors
- Logistic Regression and LDA
- Use of interactive ROC/Scatter plot
- Failure of common prescreening methods to capture suppressor variables
- Failure of Naïve Bayes to capture suppressor variables
- Coefficient path plots
- Guidelines to avoid over-fitting
- Extended CCR models
- Discrete dependent variable with more than 2 categories
- CCR-Survival/Event history model
- Hybrid CCR/Latent Class models
- Example: Key Driver Regression on orange juice ratings data
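To make the suppressor-variable topics in Session 4 concrete, here is a small constructed example in plain Python. It is a sketch built on artificial data chosen for illustration, and the helper names are hypothetical. The predictor `x2` has zero correlation with the outcome, so prescreening on marginal correlations would discard it, yet adding it to the regression removes the irrelevant variation in `x1` and raises R² from about 0.5 to 1.

```python
def centered(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pearson(u, v):
    """Pearson correlation between two equal-length lists."""
    cu, cv = centered(u), centered(v)
    return dot(cu, cv) / (dot(cu, cu) * dot(cv, cv)) ** 0.5


def r2_two_predictors(x1, x2, y):
    """R^2 of the least-squares regression of y on x1 and x2 (with intercept)."""
    c1, c2, cy = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
    s1y, s2y = dot(c1, cy), dot(c2, cy)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det   # solve the 2x2 normal equations
    b2 = (s11 * s2y - s12 * s1y) / det
    sse = sum((yc - b1 * a - b2 * b) ** 2 for a, b, yc in zip(c1, c2, cy))
    return 1.0 - sse / dot(cy, cy)


# Constructed data: 'signal' and 'nuisance' are exactly uncorrelated.
signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
nuisance = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]

y = signal                                        # outcome depends on the signal only
x1 = [s + e for s, e in zip(signal, nuisance)]    # signal contaminated by nuisance
x2 = nuisance                                     # suppressor: measures only the nuisance

print("corr(x2, y):      ", pearson(x2, y))              # zero by construction
print("R^2 with x1 only: ", pearson(x1, y) ** 2)         # about 0.5
print("R^2 with x1 + x2: ", r2_two_predictors(x1, x2, y))  # 1, since y = x1 - x2
```

In this construction the fitted coefficients are +1 for `x1` and -1 for `x2`. The negative weight on a predictor that is unrelated to the outcome on its own is the typical signature of a suppressor.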