  CORExpress®: Sample Datasets
 

All SI products are designed to operate on MS Windows 2000, XP, Vista, and 7.

System Requirements: 2MB Drive Space, 512MB of RAM

Input files: .sav and .txt

The following data sets accompany the CORExpress demo version (typically installed in "C:\Program Files\CORExpress\DemoData"). You can also download all data sets at once, or click a filename to download that data set separately.


autoprice.sav    [ Download data set ]

  • Continuous Dependent Variable
  • 6 predictors
  • 24 cases

Source:
Michel Tenenhaus

Tutorial info:
This data set is used in Tutorial 1: Getting Started with Correlated Component Regression (CCR) in CORExpress.






LMTr&Val.sav    [ Download data set ]

  • Continuous Dependent Variable
  • 56 predictors
    • 14 are valid predictors (true coefficients are non-zero)
    • 42 extraneous predictors (true coefficients equal zero)
      • 14 are correlated with the 14 valid predictors
      • 28 are totally irrelevant predictors, uncorrelated with each of the other 55 predictors
  • 10,000 cases (5,000 training cases, organized as 100 sets of N=50, plus 5,000 test cases)

Source:
Data simulated according to the traditional assumptions of ordinary least squares (OLS) regression.

Tutorial info:
This data set is used in Tutorial 2: Linear Regression with High-Dimensional Data.






LDASim.sav   [ Download data set ]

  • Dichotomous Dependent Variable (group 1 vs. group 2)
  • 84 predictors
    • 28 are valid predictors (true coefficients are non-zero, 15 of which are weak)
    • 56 irrelevant predictors (true coefficients equal zero)
      • 28 are correlated with each other
      • 28 are uncorrelated with each of the other 83 predictors
  • 5,000 cases (100 sets of N=50 with equal numbers in each group)

Source:
Data simulated according to the traditional assumptions of linear discriminant analysis (LDA).

Tutorial info:
This data set is used in Tutorial 3: High-Dimensional Data Analysis with a Dichotomous Dependent Variable.






leukemia.Tr&Val.sav   [ Download data set ]

  • Dichotomous Dependent Variable (Two different types of leukemia)
  • 3,571 predictors (measuring expression of 3,571 genes)
  • Total N=72; 38 training cases plus 34 validation (test) cases

Source:
Golub et al. (1999)

Dudoit et al. (2004)






True_driver_data.sav   [ Download data set ]

  • Dependent variable (Rating of overall satisfaction)
  • 34 predictors (Attitude ratings)
  • N=76 respondents

Source:
Logit Research






isis400.demo.sav   [ Download data set ]

  • Dichotomous Dependent Variable
  • 1,000 predictors
  • N=400 cases

Source:
Data simulated according to the specifications in Fan et al. (2009).

Fan, J., R. Samworth, and Y. Wu (2009). Ultrahigh Dimensional Feature Selection: Beyond the Linear Model, Journal of Machine Learning Research 10, 2013-2038.






data10samples.sav   [ Download data set ]

  • Simulated data where the true dependent variable is ytrue = exp[(x - 0.3)^2] - 1
  • Observed dependent variable = ytrue + error, where error is normally distributed with mean 0 and standard deviation = 0.1
  • Predictors are X, X^2, X^3, X^4, X^5
  • N=210 cases in 10 sets (set = 1,2,...,10) of N = 21 cases

Source:
Simulated data provided by Tony Babinec, AB Analytics.
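
The structure of this data set can be reproduced approximately with a short simulation. The following is a minimal sketch, not the procedure used to generate data10samples.sav. It reads the "2" in the formula as an exponent and treats the five predictors as the first five powers of x; both readings are assumptions. The x grid on [-1, 1], the random seed, and the names `simulate_sets` and `powers` are also hypothetical, since the page does not state them.

```python
import math
import random


def simulate_sets(n_sets=10, n_per_set=21, sd=0.1, seed=0):
    """Return rows (set, x, ytrue, y) mimicking the layout described above."""
    rng = random.Random(seed)
    # Assumption: x is an evenly spaced grid on [-1, 1]; the real x values are not given.
    xs = [-1 + 2 * i / (n_per_set - 1) for i in range(n_per_set)]
    rows = []
    for s in range(1, n_sets + 1):
        for x in xs:
            ytrue = math.exp((x - 0.3) ** 2) - 1  # true dependent variable
            y = ytrue + rng.gauss(0.0, sd)        # observed value = ytrue + normal error
            rows.append({"set": s, "x": x, "ytrue": ytrue, "y": y})
    return rows


def powers(x, degree=5):
    """Polynomial predictors x, x^2, ..., x^degree (assumed meaning of X .. X5)."""
    return [x ** k for k in range(1, degree + 1)]


rows = simulate_sets()
print(len(rows))  # total number of simulated cases
```

Each of the 10 sets can then be fit separately, which makes it possible to compare how stable a fitted polynomial is across small samples.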






OJtutorial2lc.sav   [ Download data set ]

  • Liking ratings on each of 6 different orange juice (OJ) products by 96 judges
  • Each of the 6 juices is also described by 16 physico-chemical attributes
  • The data contain classification information for weighting the judges according to their posterior membership probabilities for two segments that have distinctly different OJ preferences. These probabilities were obtained from a random intercept latent class (LC) regression analysis.

Source:
Tenenhaus, M., Pagès, J., Ambroisine, L., and Guinot, C. (2005). PLS methodology for studying relationships between hedonic judgments and product characteristics. Food Quality and Preference, 16(4), 315-325.






cookie.sav   [ Download data set ]

  • 4 dependent variables -- calculated percentages of the ingredients fat (Y1), sucrose (Y2), dry flour (Y3), and water (Y4) of biscuit dough pieces
  • 700 predictor variables, obtained with quantitative NIR spectroscopy at 700 wavelengths from 1100 to 2498 nanometers (nm) in steps of 2 nm
  • N=40 training samples and N=32 validation samples
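
As a quick consistency check on the figures above, the wavelength grid can be rebuilt from its endpoints and step size. This is only an illustrative sketch; the variable names are hypothetical and do not correspond to column names in cookie.sav.

```python
# Wavelength grid described above: 1100 nm to 2498 nm inclusive, in steps of 2 nm.
wavelengths = list(range(1100, 2498 + 1, 2))

# Sample counts stated for the training and validation sets.
n_train, n_validation = 40, 32

print(len(wavelengths))        # number of predictor variables implied by the grid
print(n_train + n_validation)  # total number of samples
```

With 700 predictors and only 40 training samples, the data set is a high-dimensional case in which predictors far outnumber cases.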

Source:
Osborne, B., T. Fearn, A. Miller, and S. Douglas (1984). Application of near infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit dough. Journal of the Science of Food and Agriculture, 35:99-105.

Brown, P. J., T. Fearn, and M. Vannucci (2001). Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. Journal of the American Statistical Association, 96(454):398-408.

E-mail Contact: will@statisticalinnovations.com
Address: Statistical Innovations, 375 Concord Avenue, Belmont, MA 02478-3084
Phone: +1.617.489.4490
Fax: +1.617.489.4499