# 1) Chapter 4. # It is important to go through this lab slowly. So, please listen to # your TA's explanations, and follow them step-by-step. I have placed a # good portion of the following on the web so that you don't have to # worry too much about typing. But make sure you understand what # you are cutting and pasting too. # Here, I will try to convince you that if a population is indeed stratified, # then the formulas of Neyman allocation (derived in class) will produce # an estimate of the population mean which is more precise than the estimate # one would get from taking a simple random sample (srs). # To that end, I will create a population which is indeed stratified - it has # two distinct strata in it - and then, the question is if the stratified # estimate of the population mean (according to Neyman's formulas) is more # precise than the srs estimate. # The IMPORTANT thing to realize is the following: a precise estimate of a # population mean means that the sample means have a small variance, where the # variance is computed by taking *multiple samples* from the population, # computing the sample mean for each sample, and then computing the variance # of the means (i.e., variance *between* group) ! # Using the symbols from the lecture notes (and the book). As I said, # understanding the symbols here is more than half the solution. ###################################################################### # First, set-up the parameters of the strata in the population. N1 = 1000 # Size of stratum 1 in population. mu1 = 0 # Mean of stratum 1 in population. sigma1 = 0.9 # Std dev of stratum 1 in population. N2 = 2000 # And for stratum 2. mu2 = 3 sigma2 = 1.2 N = N1 + N2 # Population size N # Note that the strata are different in terms of size, mean, and std dev. B = 0.03 # Specify the bound. This is the "3% margin of error", on CNN. ###################################################################### # Cut/paste from http://www.stat.washington.edu/marzban/390/lab5_supp.txt # In class we saw that Neyman provides us with formulas for the optimal # sample size, and their allocation to the various strata. # The following contains Neyman's formulas. Ignore the details, for now. n = ((N1*sigma1 + N2*sigma2)^2) / ( (N*B/1.96)^2 + ( N1*sigma1^2 + N2*sigma2^2) ) n = round(n) # Just round it to an integer. n # This is the minimum sample size required # to have a margin of error of 3% (i.e., B). # The following is the *allocation* of n to the 2 strata, according to Neyman: n1 = (n*N1*sigma1)/( (N1*sigma1 + N2*sigma2 ) ) n2 = (n*N2*sigma2)/( (N1*sigma1 + N2*sigma2 ) ) n1 = round(n1) n2 = round(n2) n1 n2 # Naively, one may have thought that n1 should be half of n2, because # N1 is half of N2. But that is not the case. Why? Under what condition # would it be? ###################################################################### # Cut/paste from http://www.stat.washington.edu/marzban/390/lab5_supp.txt # So far, we have no data at all; we have been only setting the # parameters of the problem. Now, let's make the population data. # Remember: What we are doing here is a *simulation*, which means # that we create a population whose mean we know! Then we try to # see if the theory (i.e., Neyman's formulas) give us the answer # we expect. set.seed(123) # so that we all get same answers. stratum1 = rnorm(N1,mu1,sigma1) # Populate stratum 1. stratum2 = rnorm(N2,mu2,sigma2) # and stratum 2, population = c(stratum1,stratum2) # and combine the two. pop.mean = mean(population) # Here is the true pop mean pop.mean # which we want to estimate. hist(stratum1,breaks=100) # Just look at the histograms. hist(stratum2,breaks=100) hist(population,breaks=100) # Note the two humps/strata. # So, like I always say, histogram! In this case, you would know that the # pop is stratified, i.e. has clusters/strata in it. So, you've got to # do either cluster or stratified sampling. Here, we'll do the latter. ##################################################################### # Cut/paste from http://www.stat.washington.edu/marzban/390/lab5_supp.txt # Let's start! First, let's take 1000 simple random samples (SRS) from the # population, and see how much their sample means vary: xbar.srs = numeric(1000) # Space for sample means. for(i in 1:1000){ sample1 = sample(population, n, replace=F) # SRS is w/o replacement. xbar.srs[i] = mean(sample1) } hist(xbar.srs,breaks=100) # Hist of the sample *means*. mean(xbar.srs) # This is the mean of *sample means* from SRS; # Note that it's close to pop.mean, as expected. sd(xbar.srs) # This is the *std dev* of the *sample means* from SRS, # Make a note of it : 0.02556108 ###################################################################### # USE UP-ARROW, and edit the previous block to get the following: # Now, let's do the same thing, but with stratified sampling. # I.e., take srs samples from each stratum, according to the sample size # (n1 and n2) that Neyman tells us: xbar.strat = numeric(1000) # Space for stratfied mean. for(i in 1:1000){ # For a 1000 samples, take a sample1 = sample(stratum1, n1, replace = F) # srs from each stratum, sample2 = sample(stratum2, n2, replace = F) # according to Neyman. xbar1 = mean(sample1) # Sample mean xbar2 = mean(sample2) # of each stratum. s1 = sd(sample1) # Sample std dev s2 = sd(sample2) # of each stratum. xbar.strat[i] = xbar1*(N1/N) + xbar2*(N2/N) # Weighted mean. # Here is the formula for s_stratified; but Do NOT type it in. # s.strat = sqrt( (N1^2)*(s1^2)*(N1-n1)/(n1*(N1-1) ) + (N2^2)*(s2^2)*(N2-n2)/(n2*(N2-1) ) )/N } # End of loop over 1000 samples. ###################################################################### # We will discuss the above formula for xbar.strat, later, in class. # It's just the weighted mean of the sample means. # According to Neyman, this estimate is the most precise estimate. # Neyman also gives us a *theoretical* formula s.strat, the standard # deviation of the stratified mean. This quantity is often called the # standard error. We don't really need it here for this experiment # we're performing, but it's computed anyway, in case you # want to check to see if it agrees with the *empirical* std deviation # of the xbar.strat, which we compute below. # So, now, let's just look at the 1000 stratified means we've got: hist(xbar.strat,breaks=200) # Here is the mean of the stratified means: # Note that this is also close to pop.mean. mean(xbar.strat) # But, more importantly, this is the std dev of the stratified means: sd(xbar.strat) # 0.01559863. # Which mean has the smaller std dev - the one from SRS or from stratified? # Again, recall what we just did: We made a population which we intentionally # stratified, by mixing together two normal data with different mu and sigma # parameters. Then, we took 1000 different simple random samples of size n, # and computed the mean of each sample. That's 1000 different sample means. # Then we computed the std dev of these 1000 means. The resulting value is a # measure of how precise we estimated the population mean. # Finally, we repeated the above exercise, but with the samples taken according # to Neyman's formulas. We computed the stratified mean of each sample, computed # the std dev of the 1000 stratified means, and noted that the std dev of *these* # means is smaller than the std dev of the means of the SRS samples. The # conclusion is that when the population is stratified, Neyman's formulas # for n and the allocation of that n to the various strata (i.e., n1 and n2), # lead to a more precise estimate of the population mean.