Seminar Details


May 18

3:30 pm

Regression Modeling and Validation Strategies, with an Interactive Analysis of Titanic Passenger Survival

Frank Harrell, Jr.


University of Virginia, School of Medicine - Department of Health Evaluation Sciences

Multivariate regression models are powerful tools for uncovering the strength and shape of relationships between predictors and response, for testing partial association, and for predicting outcomes of future subjects. Successful modeling must deal with missing values in the predictors and with overfitting (addressed by data reduction or shrinkage). It must also relax linearity assumptions, fit nonlinear interactions, and present results graphically so that non-statisticians can understand complex models.

Several pre-modeling steps can help the modeling process, including (1) displaying distributions of predictors and response, (2) understanding inter-relationships among predictors, (3) displaying how missing values occur simultaneously among several predictors, and (4) uncovering what kinds of subjects tend to have missing values for some of the predictors. All of these are helpful in deciding whether and how to impute missing values. In addition, nonparametric regression is an excellent exploratory tool.
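As a rough illustration of step (3), the sketch below tabulates which predictors are missing together. It is written in Python rather than the S-PLUS used in the talk, and the records, the variable names (age, fare, cabin), and the helper functions are hypothetical, made up only for this example.

```python
from collections import Counter

# Hypothetical toy records (not real passenger data); None marks a missing value.
records = [
    {"age": 29.0, "fare": 211.3, "cabin": "B5"},
    {"age": None, "fare": 7.9, "cabin": None},
    {"age": 40.0, "fare": None, "cabin": None},
    {"age": None, "fare": 8.1, "cabin": None},
    {"age": 2.0, "fare": 151.6, "cabin": "C22"},
]


def missing_pattern(record):
    """Return the sorted tuple of variable names that are missing in one record."""
    return tuple(sorted(name for name, value in record.items() if value is None))


def pattern_counts(rows):
    """Count how often each combination of missing variables occurs."""
    return Counter(missing_pattern(row) for row in rows)


def missing_fraction(rows):
    """Fraction of records in which each variable is missing."""
    names = rows[0].keys()
    return {name: sum(row[name] is None for row in rows) / len(rows) for name in names}


counts = pattern_counts(records)
for pattern, n in counts.most_common():
    # An empty pattern means the record is complete.
    print(", ".join(pattern) if pattern else "<complete>", n)
print(missing_fraction(records))
```

A table like this shows at a glance whether missingness clusters (here, cabin is missing whenever anything else is), which bears on whether and how to impute.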

Once one is ready to model the response, techniques such as regression splines, pooled effect tests (main effects combined with interactions), automatic tests of linearity, shrinkage (penalized maximum likelihood estimation), and various graphical displays of how the predictors affect the response are very useful.
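To make the idea of a regression spline concrete, here is a minimal Python sketch of a restricted (natural) cubic spline basis built from truncated cubic terms and constrained to be linear beyond the outer knots. It is an illustration under stated assumptions, not the implementation in the Design library: the function name `rcs_basis`, the example knots, and the scaling by the squared knot range are choices made for this example.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis at x.

    Returns [x, s_1(x), ..., s_{k-2}(x)] for k knots, where each s_j is a
    combination of truncated cubics that is linear beyond the last knot
    and zero below the first knot.
    """
    t = sorted(knots)
    if len(t) < 3:
        raise ValueError("need at least 3 knots")

    def cube_plus(u):
        # Truncated cubic: u^3 when u > 0, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    last, prev = t[-1], t[-2]
    gap = last - prev
    scale = (last - t[0]) ** 2  # assumed scaling, keeps terms on the scale of x

    basis = [float(x)]
    for tj in t[:-2]:
        term = (
            cube_plus(x - tj)
            - cube_plus(x - prev) * (last - tj) / gap
            + cube_plus(x - last) * (prev - tj) / gap
        )
        basis.append(term / scale)
    return basis


knots = [1.0, 3.0, 5.0, 7.0]  # hypothetical knot locations
print(rcs_basis(4.0, knots))
```

With k knots the predictor contributes k - 1 columns to the design matrix, and a joint test that the coefficients of the k - 2 nonlinear columns are zero serves as a test of linearity.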

This talk will focus on live demonstrations of various pre-modeling and modeling strategies, using S-PLUS and add-on libraries Hmisc and Design (available from Statlib). Two datasets will be used to demonstrate the methods. One of them contains information on passengers on the Titanic, where we will answer the question "were women and children first?"