The following data sets accompany the CORExpress demo version (typically installed in "C:\Program Files\CORExpress\DemoData"). You can also download all data sets at once or click on a filename to download that data set separately.
autoprice.sav [ Download data set ]
- Continuous Dependent Variable
- 6 predictors
- 24 cases
Source:
Michel Tenenhaus
Tutorial info:
This data set is used in Tutorial 1: Getting Started with Correlated Component Regression (CCR) in CORExpress.
LMTr&Val.sav [ Download data set ]
- Continuous Dependent Variable
- 56 predictors
  - 14 are valid predictors (true coefficients are non-zero)
  - 42 are extraneous predictors (true coefficients equal zero)
    - 14 are correlated with the 14 valid predictors
    - 28 are totally irrelevant predictors, uncorrelated with each of the other 55 predictors
- 10,000 cases (5,000 training cases in 100 sets of N=50, plus 5,000 used as test data)
Source:
Data simulated according to the traditional assumptions of ordinary least squares (OLS) regression.
Tutorial info:
This data set is used in Tutorial 2: Linear Regression with High-Dimensional Data.
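The structure of LMTr&Val.sav can be illustrated with a small simulation. This is only a sketch of the layout described above, not the procedure used to create the file: the coefficient values (`beta`), the correlation strength (`rho`), the error standard deviation (`noise_sd`), and the choice of independent standard-normal valid predictors are all assumptions made for illustration.

```python
import random

N_VALID, N_CORRELATED, N_IRRELEVANT = 14, 14, 28


def simulate_case(rng, beta, rho=0.6, noise_sd=1.0):
    """Draw one case: returns (y, list of 56 predictor values)."""
    # Valid predictors: assumed independent standard normal draws.
    valid = [rng.gauss(0.0, 1.0) for _ in range(N_VALID)]
    # Extraneous predictors correlated (assumed rho) with one valid predictor each;
    # their true coefficients are zero, so they never enter y directly.
    scale = (1.0 - rho ** 2) ** 0.5
    correlated = [rho * v + scale * rng.gauss(0.0, 1.0) for v in valid]
    # Totally irrelevant predictors: independent of everything else.
    irrelevant = [rng.gauss(0.0, 1.0) for _ in range(N_IRRELEVANT)]
    # OLS assumptions: linear mean, independent normal errors, constant variance.
    y = sum(b * v for b, v in zip(beta, valid)) + rng.gauss(0.0, noise_sd)
    return y, valid + correlated + irrelevant


rng = random.Random(0)
beta = [1.0] * N_VALID  # hypothetical non-zero coefficients

# 100 training sets of N=50 (5,000 cases) plus 5,000 test cases.
training_sets = [[simulate_case(rng, beta) for _ in range(50)] for _ in range(100)]
test_cases = [simulate_case(rng, beta) for _ in range(5000)]
```

With only 50 cases and 56 predictors in each training set, an ordinary least squares fit is not identified, which is the high-dimensional situation the tutorial addresses.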
LDASim.sav [ Download data set ]
- Dichotomous Dependent Variable (group 1 vs. group 2)
- 84 predictors
  - 28 are valid predictors (true coefficients are non-zero, 15 of which are weak)
  - 56 irrelevant predictors (true coefficients equal zero)
    - 28 are correlated with each other
    - 28 are uncorrelated with each of the other 83 predictors
- 5,000 cases (100 sets of N=50 with equal numbers in each group)
Source:
Data simulated according to the traditional assumptions of linear discriminant analysis (LDA).
Tutorial info:
This data set is used in Tutorial 3: High-Dimensional Data Analysis with a Dichotomous Dependent Variable.
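The layout of LDASim.sav can be sketched in the same spirit. The code below is a simplified illustration, not the generator behind the file: it assumes unit-variance normal predictors that are all independent (the actual data include correlated predictors), and the mean shifts `strong_shift` and `weak_shift` are hypothetical values chosen only to distinguish strong from weak valid predictors.

```python
import random

N_VALID, N_WEAK, N_IRRELEVANT = 28, 15, 56


def simulate_set(rng, n=50, strong_shift=1.0, weak_shift=0.2):
    """Return n cases as (group, predictors), with n/2 cases in each group."""
    # Per-predictor difference in group means; zero for irrelevant predictors.
    shifts = (
        [strong_shift] * (N_VALID - N_WEAK)
        + [weak_shift] * N_WEAK
        + [0.0] * N_IRRELEVANT
    )
    cases = []
    for i in range(n):
        group = 1 if i < n // 2 else 2
        offset = 0.0 if group == 1 else 1.0
        # LDA assumptions: normal predictors with the same variance in both
        # groups; only the means differ between groups.
        x = [rng.gauss(offset * s, 1.0) for s in shifts]
        cases.append((group, x))
    return cases


rng = random.Random(0)
sets = [simulate_set(rng) for _ in range(100)]  # 100 sets of N=50 -> 5,000 cases
```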
leukemia.Tr&Val.sav [ Download data set ]
- Dichotomous Dependent Variable (Two different types of leukemia)
- 3,571 predictors (measuring expression of 3,571 genes)
- Total N=72; 38 training cases plus 34 validation (test) cases
Source:
Golub et al. (1999)
Dudoit et al. (2004)
True_driver_data.sav [ Download data set ]
- Dependent variable (Rating of overall satisfaction)
- 34 predictors (Attitude ratings)
- N=76 respondents
Source:
Logit Research
isis400.demo.sav [ Download data set ]
- Dichotomous Dependent Variable
- 1,000 predictors
- N=400 cases
Source:
Data simulated according to the specifications in Fan et al. (2009).
Fan, J., R. Samworth, and Y. Wu (2009). Ultrahigh Dimensional Feature Selection: Beyond the Linear Model, Journal of Machine Learning Research 10, 2013-2038.
data10samples.sav [ Download data set ]
- Simulated data where dependent variable ytrue = exp[(x - 0.3)^2] - 1
- Observed dependent variable = ytrue + error, where error is normally distributed with mean 0 and standard deviation = 0.1
- Predictors are X, X^2, X^3, X^4, X^5
- N=210 cases in 10 sets (set = 1,2,...,10) of N = 21 cases
Source:
Simulated data provided by Tony Babinec, AB Analytics.
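A data set with this structure can be sketched as follows. The sketch reads the formula as ytrue = exp[(x - 0.3)^2] - 1 and treats the predictors as the first five powers of x; both readings are assumptions. The grid of 21 equally spaced x values on [-1, 1] and the column names are hypothetical, since the description does not state how x was chosen.

```python
import math
import random


def simulate_sets(n_sets=10, n_per_set=21, error_sd=0.1, seed=0):
    """Return a list of row dicts: 10 sets of 21 cases by default."""
    rng = random.Random(seed)
    rows = []
    for set_id in range(1, n_sets + 1):
        for i in range(n_per_set):
            # Assumed design: equally spaced x values from -1 to 1.
            x = -1.0 + 2.0 * i / (n_per_set - 1)
            ytrue = math.exp((x - 0.3) ** 2) - 1.0
            # Observed response = true value + normal error (mean 0, sd 0.1).
            y = ytrue + rng.gauss(0.0, error_sd)
            rows.append({
                "set": set_id,
                "y": y,
                "ytrue": ytrue,
                "X": x,
                "X2": x ** 2,
                "X3": x ** 3,
                "X4": x ** 4,
                "X5": x ** 5,
            })
    return rows


rows = simulate_sets()
```

Because the five predictors are powers of the same variable, they are strongly correlated with one another, which makes a small sample of 21 cases per set a demanding setting for regression.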
OJtutorial2lc.sav [ Download data set ]
- Liking ratings on each of 6 different orange juice (OJ) products by 96 judges
- Each of the 6 juices is also described by 16 physico-chemical attributes
- The data contain classification information for weighting the judges according to their posterior membership probability of belonging to each of two segments that have distinctly different OJ preferences. These probabilities were obtained from a random intercept latent class (LC) regression analysis.
Source:
Tenenhaus, M., J. Pagès, L. Ambroisine, and C. Guinot (2005). PLS methodology for studying relationships between hedonic judgments and product characteristics. Food Quality and Preference, 16(4):315-325.
cookie.sav [ Download data set ]
- 4 dependent variables -- calculated percentages of the ingredients fat (Y1), sucrose (Y2), dry flour (Y3), and water (Y4) of biscuit dough pieces
- 700 predictor variables, one per wavelength, obtained with quantitative NIR spectroscopy at 700 different wavelengths from 1100 to 2498 nanometers (nm) in steps of 2 nm
- N=40 training samples and N=32 validation samples
Source:
Osborne, B., T. Fearn, A. Miller, and S. Douglas (1984). Application of near infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit dough. Journal of the Science of Food and Agriculture, 35:99-105.
Brown, P., T. Fearn, and M. Vannucci (2001). Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. Journal of the American Statistical Association, 96(454):398-408.
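The predictor count for cookie.sav follows directly from the wavelength grid: 1100 to 2498 nm in steps of 2 nm. A minimal check, with `wavelengths_nm` as an illustrative name rather than a variable in the data file:

```python
# Wavelength grid described for cookie.sav: 1100 to 2498 nm inclusive, step 2 nm.
wavelengths_nm = list(range(1100, 2498 + 1, 2))

print(len(wavelengths_nm))  # 700
print(wavelengths_nm[0], wavelengths_nm[-1])  # 1100 2498
```

With 700 predictors and only 40 training samples, the number of predictors far exceeds the number of cases.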