# In this lab we do more regression. # 1) Regression on real (hail) data: # In practice, "divergence" and "rotate" are measured by Doppler radar, # and so if we can predict hail size from divergence and rotate, then # we can predict hail size from Doppler radar. That's useful! # In regression lingo, size is the response variable, and the others # are predictors or covariates. # Do simple linear regression for predicting size from divergence. # Draw the fit on the scatterplot. dat = read.table("http://www.stat.washington.edu/marzban/390/hail_dat.txt", header=T) plot(dat) # For R geeks: this works because read.table returns a data.frame. # Never mind what a data.frame is! # Note that real scatterplots don't look as pretty as the ideal # ones we draw in class! size=dat[,3] # Give the 3 columns in dat their names, for use below. rotate=dat[,2] diverg=dat[,1] cor(dat) # This shows the correlations between ALL the vars # in the hail data, again, in one sweep! lm.1 = lm(size ~ diverg) plot(diverg,size) abline(lm.1) # Note that it seems like the line's slope # should be larger. The fit is in fact correct. # If you want to know why it looks wrong, either # ask me in class, or google "regression effect." # Use anova to decompose SST into SS_explained and SSE, and compute # R_squared, which is also called the coefficient of determination. # For now, you may ignore the adjusted R-squared. anova(lm.1) # Read-off SSExplained = 603829 and SSE=1616909 summary(lm.1) # Read-off R-squared = 0.27 # Note that the R-squared from summary() agrees with 1-(SSE/SST): # 1- (1616909/(1616909+603829)) = 0.2719047 ################################################# # 2) Now, let's do some multiple regression and model comparison, all in terms # of R-squared. Recall that the closer it is to 1, the better. # The closer to 0, the worse. But also keep overfitting in mind. # USE UP-ARROW lm.a = lm(size ~ diverg) # predicting size from divergence (simple regression) summary(lm.a)$r.squared # 0.2719047 lm.b = lm(size~rotate) # predicting size from rotation (simple regression) summary(lm.b)$r.squared # 0.2900612 lm.c=lm(size~diverg+rotate) # predicting size from both (multiple regression) summary(lm.c)$r.squared # 0.3628943 # multiple regression with interaction. lm.d = lm(size ~ diverg + rotate + I(diverg*rotate)) summary(lm.d)$r.squared # 0.3745302 # multiple quadratic regression lm.e=lm(size~diverg + rotate + I(diverg^2) + I(rotate^2) ) lm.e # Note: there are now *4* coefficients. summary(lm.e)$r.squared # 0.3799713 # multiple quadratic regression with interaction. lm.f=lm(size~diverg+rotate + I(diverg^2) + I(rotate^2) + I(diverg*rotate) ) summary(lm.f)$r.squared # 0.3800302 # Here is a discussion of all of the above results: It *seems* like # - rotate is a better predictor of size than diverg. # - The two of them together make for an even better model. # - Quadratic terms for each, make the model even better, but not by much # (R^2 goes from 0.3628943 to 0.3799713). # - An interaction term, without quadratic terms, gives a model that is # comparable to what we got from a quadratic model with no interation. # - Quadratic and interactions, together, seem to give the best model. # In the scatterplots, note the collinearity in the data. As such, the # regression coefficients are meaningless and unreliable. More, at the # bottom. But the regression models themselves are still OK. # I say "seem," because R^2 alone cannot tell which is the best model # in a predictive sense, because of overfitting. More later. # In fact, note that R^2 never decreases anyway as you add more # terms to the regression model. The main question (which you can address # only qualitatively at this point) is if the gain in R^2 is big enough # to warrant the new term, and taking the chance of overfitting the data. # In this example, the gain from R^2=0.3799 to R^2-0.3800 is probably # NOT worth the risk. So, we should keep the simpler model. That's # called the principle of "Occam's Razor," which posits that one should # go with simpler things! ########################################################################### # 3) Visual assessment of goodness-of-fit: # Even though we cannot yet identify the best predictive model, # for now, let's just say that it is lm.c . Although R-squared # does tell us something about the quality of the fit, there is nothing # that can replace a good figure, e.g., the scatterplot of predicted # versus actual y: # This is the way to get the predicted values, i.e., the y values from # the fitted line. We call this y_hat. y_hat = predict(lm.c) # This is a quick way of getting y_hat. y_hat # i.e., the y values from the fitted line. # i.e., the predicted values. # Alternatively, you can use the predictions stored in lm() itself: # y_hat = lm.c$fitted.values plot(size,y_hat) # actula size vs. predicted size. abline(0,1,col="red") # Just a diagonal line. # If the model were perfect, this scatterplot would be symmetrically # spread about the red line (i.e., with slope 1, intercept 0). Think why?! # But, obviously the model is not perfect. # This scatterplot (of actual response vs. predicted response) is often # the best way of visualizing how well the model is doing. For example, we # see that for smaller hail size (i.e., small x value), the predictions of # size are all above the diagonal. One says, the model overpredicts the size # of small hail, and underpredicts the size of large hail. This kind of # model daignosis can help in coming up with a better model (later). ####################################### # Another visual tool is the residual plot: # Although you can make the residuals by hand, they are already contained # in what lm.1 returns. So, plot( y_hat, lm.c$residuals) # or plot( lm.c$fitted.values, lm.c$residuals) abline(h=0) # In a good fit, these residuals (or errors) should not display any # correlation with the response variable. Don't let the tilted layers # fool your eye into thinking that there is some correlation. There isn't, # and you can confirm that by computing the correlation: cor( y_hat, lm.c$residuals) # USE UP ARROW. ############################################################################## # 4) Finally, here is a way of drawing the fit even if it's nonlinear. # Suppose we pick a relatively simple quadratic model for the Hail data: lm.g = lm(size ~ rotate + I(rotate^2) ) lm.g # Note there are now three coefficients. x = seq( min(rotate), max(rotate), .01) # Make a fake x . y.fit = lm.g$coeff[1] + lm.g$coeff[2] * x + lm.g$coeff[3] * x^2 plot(rotate,size) points(x , y.fit , col="red", type="l") # Alternatively, a fancier way is this (on your own time): # x = matrix(seq(min(rotate),max(rotate),.01),byrow=T) # Make a fake x . # colnames(x) = "rotate" # Necessary for predict(). # plot(rotate,size) # lines(x, predict(lm.g, newdata=data.frame(x) ) , col=2) ################################################################### # You can also plot the *surface* that goes through the cloud of points in 3d. # Suppose we decided that the best model is the most complex model, above: # cut/paste from http://www.stat.washington.edu/marzban/390/lab4_supp.txt lm.e=lm(size~diverg+rotate+ I(diverg^2) + I(rotate^2) + I(diverg*rotate)) x1 = seq(min(rotate),max(rotate), length=100) # This x and y simply define a x2 = seq(min(diverg),max(diverg), length=100) # grid in the x-y plane. f = function(x,y) {r = lm.e$coeff[1] + lm.e$coeff[2]*x + lm.e$coeff[3]*y + lm.e$coeff[4]*x^2 + lm.e$coeff[5]*y^2 + lm.e$coeff[6]*x*y} y.fit = outer(x1, x2, f) # y contains the "height" values of the surface, # over the x-y grid. library(lattice) # For cloud() cloud(y.fit,type="p",screen=list(z=10,x=-70,y=0) ) # This does a 3d plot of # the points of the plane. # or persp(x1, x2, y.fit, theta = 30, phi = 30) # Note that in spite of the nonlinearity of the regression function # itself, i.e. with quadratic and an interaction terms, the surface is # mostly planar in the range of our data (i.e. x1 and x2). ############################################################################# ############################################################################# # Time permitting: # Collinearity. # We are going to show how collinearity affects multiple regression. # To that end, we'll make an R *function*, which is nothing but some lines # of code intended to be used over and over again. # The function (we'll call it make.fit) first makes data on x1, x2, and y, # with some amount of collinearity (i.e., correlation between x1 and x2). # The function, then fits that data, and returns some stats about the # estimated regression coefficients. Cut/paste it from # http://www.stat.washington.edu/marzban/390/lab4_supp.txt ###################################################################### # Here is our make.fit function: Don't forget that you won't know if # each line of the function works properly until you call/use the whole # function. So, type-in such functions in a window that will allow you # to correct it quickly in case you type-in something wrong. make.fit = function(r) # This function takes in r (i.e. cor between x1, x2, { # not between y and anything.) set.seed(123) n=100 # Make 100 cases. x1 = runif(n,0,1) # Make data on predictor 1. xx = runif(n,0,1) # Make temporary predictor, called xx. x2 = xx + (x1-xx)*r # Make data on predictor 2. # Note that predictor x2 goes from xx to x1, as r goes from 0 to 1. # So, when r=0, x1 and x2 are two independent predictors. But when r=1, # then x2 is exactly the same as x1. In this way, r controls collinearity. y = 1 + 2*x1 + 3*x2 + rnorm(n,0,0.1) # Make y data, as usual, by adding noise. dat = data.frame(x1,x2,y) # Here is the whole data. Thanks to plot(dat) # data.frame in R, we can view all scatterplots. lm.1 = lm( y ~ x1 + x2) # Here we fit a plain through the data. summary(lm.1) # The summary() command returns useful info. } #################################################################### # Now, let's see what the regression coefficients look like, for # different amounts of collinearity. make.fit(0) # No collinearity. # First look at the scatterplots. Compare with the figs on page 5 of # lecture 12. This is what we would *ideally* expect, when x1 and x2 are # not correlated. # According to the returned values, the estimated regression coefficients are # 0.98592, 2.00973, 2.99704 - very close to the "real" (or population) # answers: 1, 2, and 3. See the make.fit function, where we make y data. # Also, note "Std Error" for these coefficients. You will learn in Chapter 11, # where these come from. For now, suffice it to say that they tell us how # uncertain the estimates are. The bigger they are, the more uncertain we # are about their values. Here, they are 0.02862, 0.03457, 0.03735 . make.fit(0.5) # Some collinearity. # Now, again, look at the scatterplots, and then examine the coefficients # and their uncertainty. What do you see? And check out the estimated # regression coefficients and their std errors. make.fit(0.9) # Extreme collinearity. # And now? # Note that in reality we don't make the data; all we see are the scatterplots. # As such, if we had looked at these scatterplots and how they become more # and more correlated, we would have felt better about doing regression. # But, as you see, this is all deceiving - as evidenced by the increasing # uncertainty in the regression coefficients. And that increase is all due # to the collinearity between the predictors. In short, ideally, we would want # our predictors to be as uncorrelated as possible.