# In this lab, we will look at three of the distributions we've been talking # about in class; binomial, poisson, normal. You'll also learn how to # draw boxplots to summarize histograms; so from now on you don't have to # those by hand! Finally, time permitting you'll learn how to make qqplots # for visually testing whether some data could have come from a Normal dist. ########################################################################## # 1) Binomial: The mass function itself. # First let's compute the binomial proportions in lecture 4 (example 1.22). # In R's convention, putting a "d" before the name of a distribution # returns the value of the distribution itself. E.g., dbinom(0, 100, 0.005) # The format is dbinom(x, n, pi), where in the lecture's notation, # x = number of heads out of n tosses of a coin, and pi= prob of head. # R allows you to run dbinom() for *multiple* values of x, in one sweep: dbinom( 0:3, 100, 0.005 ) sum( dbinom( 0:3, 100, 0.005) ) # sum of the above probs. # Compare the above with what we got in lecture 4. # Replacing the "sum" with "plot" will plot the numbers: plot( dbinom( 0:3, 100, 0.005) ) # Note no need to specify x explicitly. # So, now we can also plot the mass function for different values # of n and pi. Note the n and pi values that produce normal-looking # distributions, and those that produce poisson-looking distributions. par(mfrow=c(3,4)) plot(dbinom(0:20,5,0.01),type="b") #n=5, pi=0.01 No need to specify x plot(dbinom(0:20,5,0.1),type="b") #n=5, pi=0.1 USE UP-ARROW plot(dbinom(0:20,5,0.5),type="b") #n=5, pi=0.5 USE UP-ARROW plot(dbinom(0:20,5,0.9),type="b") #n=5, pi=0.9 Etc. plot(dbinom(0:20,10,0.01),type="b") #n=10, pi=0.01 USE UP-ARROW plot(dbinom(0:20,10,0.1),type="b") #n=10, pi=0.1 plot(dbinom(0:20,10,0.5),type="b") #n=10, pi=0.5 plot(dbinom(0:20,10,0.9),type="b") #n=10, pi=0.9 plot(dbinom(0:20,20,0.01),type="b") #n=20, pi=0.01 plot(dbinom(0:20,20,0.1),type="b") #n=20, pi=0.1 plot(dbinom(0:20,20,0.5),type="b") #n=20, pi=0.5 plot(dbinom(0:20,20,0.9),type="b") #n=20, pi=0.9 # Can you identify the poisson-looking one? # It's the one with the large n and small pi, as it should be. Recall that's # how we derived the poisson from binomial in class. # Which one is the one that looks normal? # It's the one with large n but mid-range pi values. # For those values of n and pi, you can actually approximate the binomial # with the normal, so again there are no factorials to worry about. # In short, we can approximate the binomial distribution with possion # or normal, depending on the the values of n and pi. # If you've forgotten what poisson looks like, you can plot it too: # Let's do the one we did in lecture 4's poisson example: plot ( dpois(0:9, 4) , type="b" ) # "b" stands for "both" points and lines. # If you want to overlay another curve on top of this, you can use # the lines() function, following a plot() function. For example, here are # two poisson distributions, with lambda = 4 and 5: plot ( dpois(0:9, 4) , type="b" ) # USE UP-ARROW lines( dpois(0:9, 5), col=2 ) # lines() overlays on previous graph(). # Finally, dnorm() does the same thing as above, but for the Normal dist. ######################################################### # 2) Simulation/generation. # In R, you can actually generate (or simulate) a bunch of numbers that # follow the binomial distribution. In other words, you can simulate the tossing # of a coin, without actually tossing coins. For example, this is how you # generate 200 numbers each of which is the number of heads out of n=10 tosses. rbinom(200,10,0.5) # format = rbinom(number of times, n, pi) # Effectively, you just tossed 10 coins, 200 times, each time noting the number # of heads out of 10. This way, you can do a lot of experiments on the computer, # without actually doing the experiment! If the coin is not fair, then just # change the parameter pi . # Putting an "r" before the name is R's way of generating the numbers. # For example, rpois(100,4) # generates 100 numbers from the poisson distribution. So, each of these # 100 numbers could be the "number of people arriving at a teller, per hour", # if the average number of people arriving per hour is 4. # Similarly, the following draws a single sample of size 10000 from a # normal distribution with mu=0, sigma=1. x = rnorm(10000, 0, 1) # I'm assigning these numbers to x, for use below. x # shows you all 10000 values. hist(x,breaks=200) # Check the histogram. Pretty Normal, like it should be! ############################################################################# # 3) Now, boxplots of data as a means of summarizing a histogram into five # numbers that capture the shape of the histogram. # First, do a boxplot of the sample we drew from the normal population, above. boxplot(x) # If you get some circles at the end of boxplot, # don't worry; they are outliers, according to some criterion. # You are not required to know how they are defined. boxplot(x,range=0) # Turns them off. # Now, recall the bimodal histogram we saw last time in hist_dat.txt. # It was bimodal because I actually joined two separate data files, each # one with 100 cases in it. We can separate the two and boxplot them, separately: dat = read.table("http://www.stat.washington.edu/marzban/390/hist_dat.txt", header=F) x = dat[,1] # Here is all of x. x1 = x[1:100] # Put the 1st 100 cases of x in x1, x2 = x[101:200] # and the remainder in x2. par(mfrow=c(1,2)) hist(x,breaks=20) # Now look at the histogram of the whole x, boxplot(x1,x2) # and the two boxplots of x1 and x2, separately. # In class we discussed how to interpret the relative position of two boxplots, # emphasizing that it's complicated business. Here, imagine x1 and x2 as being # the height of girls and boys, or some measure of hair length for Rogaine # users (right), and non-users (left). Now interpret! Keep in mind that # everything you say pertains to the sample, NOT the population! But these # boxplots do highligh the fact that the x in each of the two groups has # a spread; it's NOT just a single mean, or median, etc. # BTW, one can also see the relative position of the x1 and x2 *histograms* # by plotting both on the same plot. Unfortunately, the only way # I know how to do that in R is not very elegant, and so, I've put # it in the "time permitting" section, below. Don't worry about the R details! # Just use it when you need it. ################################################### # Here is another example to illustrate the complexity of the issue. # It's data I collected in stat/math 390 in a past quarter. The variable is # the "percentage of time student attends lectures", and the two groups are # boys and girls. dat <- read.table("http://www.stat.washington.edu/marzban/390/attend_dat.txt", header=T) x=dat$attendance x y=dat$Gender y # 0 = girl, 1 = boy (I think!) par(mfrow=c(2,2)) hist(x[y==0]) # A way of selecting cases in x that have some value of y. hist(x[y==1]) boxplot( x[y==0] , x[y==1] ) # Is there a difference between boys and girls with respect to their # attendance? If we are asking about boys and girls *in the population*, # probably not, because there is too much overlap between the boxplots; # more on that in chapters 7 and 8. But even if we are talking about boys # and girls in this particular *sample*, the answer is still complex. # For example, look at the two sample means: mean(x[y==0]) # sample mean attendance for girls mean(x[y==1]) # and for boys # The y=0 group (girls) has the higher sample mean than the y=1 group (boys); # (87.6 vs. 86.4). But the medians are reversed (92.5 vs. 95): median(x[y==0]) # sample median attendance for girls median(x[y==1]) # and for boys. # One source of the difficulty in comparing the two groups is that # sample mean and medians capture only the "center" or the "location" # of the distributions. You MUST open your eyes and see (pay attention to) # the *spread* in the data as well. Measures of location (e.g., mean, median) # do not tell the whole story. One also needs to know something about the # spread of the data. # One measure of spread is the sample standard deviation (or variance): sd(x[y==1]) # sample std. dev. of attendance for boys sd(x[y==0]) # and for girls. # In short, even if there is no simple answer (like "no, you cannot tell # if girls attended more"), make sure you know how to interpret the two # boxplots in relation to one another. # For example, even though both boys and girls have a maximum attendance of # 100 (percent), girls reach a lower minimum than boys. Or, although girls have # a lower median than boys, their first quartile is actually higher than the # first quartile of the boys. THINK about what that means! In short, the # answer to the question is NOT simple, because of the *spread* in the # two groups. Later, we will learn how to phrase a question that actually # CAN be answered unambiguously. One question is ``Does this data suggest # that the *population* from which this sample was drawn has a higher mean # for girls than for boys?" The answer will come in Chapters 7 and 8 . ############################################################################# # 4) Finally, let's get a feeling of what a Normal qqplot looks like for # different kinds of distributions. Recall, that if the data come from a # Normal distribution, then it's qqplot should look line a straight line, # at least in the bulk of the plot; the tails usually deviate from a straight # line, because there are usually few cases there anyway. A qqplot # is a *visual* method for checking whether your data are normally distributed. # Also, if linear, then the intercept and slope of the line can be used as # estimates of the mu and sigma of the normal distribution. # The answers you get on your screen may look different from what your TA # shows on screen, but that's because of the different random samples. x = rnorm(500,0,1) # Sample of size 500 from a normal dist with mu=0, sigma=1. hist(x) qqnorm(x) # Pretty straight! x = rexp(500,1) # Sample of size 500 from an exponential dist with lambda=1. hist(x) # Use UP-ARROW qqnorm(x) # Not straight at all. ##################################################################### ##################################################################### # Time permitting. # Note that sample standard deviation, as computed with sd(), above, # agrees with the defining formula we saw in class. It's a terse formula, # but try to take it apart and convince yourself: sd(x) sqrt( sum( ( x - mean(x) )^2 ) / (length(x) -1) ) ############################################## # This is how you overlay two histograms (not elegantly): dat = read.table("http://www.stat.washington.edu/marzban/390/hist_dat.txt", header=F) x = dat[,1] # Here is all of x. x1 = x[1:100] # Put the 1st 100 cases of x in x1, x2 = x[101:200] # and the remainder in x2. a = hist(x1,plot=F) b = hist(x2,plot=F) x.lim = range(c(a$mids,b$mids)) plot(a$mids,a$counts,type="h",xlim=x.lim) lines(b$mids+0.1,b$counts,type="h",col="red") # Note the shift of 0.1 # to void overlapping hists. ##############################################