Case-Control Studies and Repeated Sampling
A standard counterintuitive result in statistics is that if the true model is logit, then it is okay to use a sample selected on the Y's, which is what the "case-control method" amounts to. You may select 1,000 observations with Y=1 and 1,000 observations with Y=0 and estimate the effect of every variable but the constant in the usual way, without any sort of weighting. This was shown in Prentice & Pyke (1979). They also purport to show that the standard errors may be computed in the usual way--- that is, using the curvature (second derivative) of the likelihood function.
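A quick way to see the coefficient half of that result (a standard argument; the standard-error half is the subtler part of Prentice and Pyke, and p1 and p0 below are my notation for the sampling probabilities): under the logit model, the odds of Y=1 given X=x are exp(alpha + beta*x). If observations with Y=1 are included in the sample with probability p1 and observations with Y=0 with probability p0, then Bayes' rule gives the odds among sampled observations as

(p1/p0)*exp(alpha + beta*x) = exp([alpha + log(p1/p0)] + beta*x),

which is again a logit with the same slope beta. Only the constant shifts, to alpha + log(p1/p0).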
Since I did not follow all the steps of the Prentice-Pyke proof, I could not point to any error in what they did, but I was still doubtful, so I tried doing a Monte Carlo study, which seemed to confirm my suspicion.
Since then, however, I have seen where my Monte Carlo study went wrong, and now I believe Prentice and Pyke. Some details are instructive.
1. An intuition-- a bit shaky, I think, but better than nothing (let me know if it's false). Suppose that a coefficient is estimated correctly by some estimator. We want to estimate the estimator's standard error, to know how variable the estimate would be if we repeated the estimation with different disturbances. For this, we need to know how noisy the data is. We do not need to know how noisy the data in the whole population is, however, just how noisy it is in the kind of sample we draw. If our procedure is to draw a biased sample, then we need to know what will happen in other biased samples, not in the population. It is okay to use the sample for this purpose. In using a standard error, we are not generalizing anything to the population (not estimating goodness of fit, for example); we are just generalizing to repeated samples.
2. How to think about repeated sampling and how to do a Monte Carlo study. What I did was to construct a population of 60,000 data points, drawing X from a uniform distribution on [0,1] and a disturbance epsilon from a logistic density, with a "constant" coefficient of alpha = -4 and an X coefficient of beta = 0. If alpha + beta*X + epsilon < 0 then Y=0; if alpha + beta*X + epsilon >= 0 then Y=1. That yields 1,039 points with Y=1, about 1.7% of the population.
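Here is a minimal sketch of that population construction in Python (my own code, not from the post; the seed is arbitrary, so the exact count of Y=1 points will differ from run to run):

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, my choice

N = 60_000
alpha, beta = -4.0, 0.0               # true coefficients from the setup above

x = rng.uniform(0.0, 1.0, size=N)               # X ~ Uniform[0,1]
eps = rng.logistic(loc=0.0, scale=1.0, size=N)  # logistic disturbance
y = (alpha + beta * x + eps >= 0).astype(int)   # Y = 1 when the latent index is >= 0

print(y.sum(), y.mean())   # around 1,000-1,100 cases, roughly 1.8% of the population
```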
Our estimation procedure is to combine two random samples, 1,000 observations with Y=0 and 1,000 observations with Y=1, and do a logit estimate of alpha and beta. We would expect the estimate of alpha to be wrong--- not close to its true value of -4--- and the estimate of beta to be right--- close to 0.000--- since we have a large enough sample that consistent estimates ought to be close to the true parameters.
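And a sketch of that single estimation step, continuing from the arrays above and using statsmodels for the logit fit (again my code; the sampling seed and variable names are mine):

```python
import statsmodels.api as sm

rng2 = np.random.default_rng(1)       # separate seed for the sampling step

cases = np.flatnonzero(y == 1)        # indices with Y = 1 (only slightly more than 1,000)
controls = np.flatnonzero(y == 0)     # indices with Y = 0 (the other ~59,000)

pick1 = rng2.choice(cases, size=1000, replace=False)     # 1,000 cases
pick0 = rng2.choice(controls, size=1000, replace=False)  # 1,000 controls
idx = np.concatenate([pick1, pick0])

X = sm.add_constant(x[idx])           # column of ones plus X
fit = sm.Logit(y[idx], X).fit(disp=0)

print(fit.params)   # [alpha-hat, beta-hat]: alpha-hat far from -4, beta-hat near 0
print(fit.bse)      # standard errors from the curvature of the log-likelihood
```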
The maximum likelihood estimation would give us standard errors based on the second derivative of the likelihood function or on bootstrapping. In repeated sampling, we would expect the standard deviation of the alpha estimates not to be close to the average of the estimates of its standard error. The question to be investigated is whether the standard deviation of the beta estimates is close to the average of the estimates of its standard error.
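The check being described is just this comparison (a small helper of my own that the later sketches reuse):

```python
def compare(beta_hats, beta_ses):
    """Does the reported standard error track how much beta-hat actually varies?"""
    sd_of_estimates = np.std(beta_hats, ddof=1)   # spread of the 100 beta estimates
    mean_reported_se = np.mean(beta_ses)          # average likelihood-based standard error
    return sd_of_estimates, mean_reported_se
```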
So far, so good. Where I made my mistake, I think, is in the definition of "repeated sampling". Ordinarily in frequentist thinking, in repeated sampling we keep the X values the same in each sample, and we draw new disturbances, which combine with the fixed X's to give new Y's. That also amounts to conditioning on the X's, though we wouldn't have had to condition on the X's, since our estimator should work fine even if we changed the X's in each sample too. (If we did change the X's, though, that would change the information content of each sample--- a sample in which X only varied between .3 and .4 would have less information and yield worse estimates than one with X varying widely between .02 and .94. So in small samples, especially, we'd have to make some allowance for that.)
Here, though, we can't keep the X's fixed. If we did, then although our first sample would have 1,000 observations with Y=1, our succeeding samples would have about 34. We wouldn't be using the case-control method.
So what we have to do is to think about repeated samples with 1,000 Y=0 observations and 1,000 Y=1's. Turning our usual thinking upside down, we need to keep the Y's fixed, draw new disturbances, and let the X's vary. This is especially hard to think about here, because knowing Y and epsilon does not tell us X-- remember, Y is coarse and contains less information than alpha + beta*X + epsilon, and beta is zero here too, making things even worse.
The best way to proceed is to think about repeating the entire scientific procedure, including the sampling as well as the estimation. The way I did this was to take 100 n=2000 samples from the 60,000-point population, each time combining equal-sized subsamples with Y=0 and with Y=1.
Recall, however, that there are only 1,037 Y=1 values in the entire population. Thus, my repeated sampling had to be with replacement, and was using the same Y=1 observations over and over. It is OK to use the same X values repeatedly, but these observations also had the same epsilon values each time, so the samples are not independent in the way needed for the law of large numbers to work. The standard errors computed by maximum likelihood came out wrong--- not equal to the standard deviation of the estimates, but that is to be expected when the draws are not independent.
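A sketch of that flawed design, reusing the fixed population and the helper above (my code; I write the within-sample draw with replace=True to match the description, though the real trouble is that essentially the same thousand-odd cases, with the same epsilons, show up in every replication):

```python
# Flawed Monte Carlo: 100 case-control samples drawn from the SAME fixed population,
# so nearly the same Y=1 observations (and the same epsilons) appear in every sample.
beta_hats, beta_ses = [], []
for r in range(100):
    pick1 = rng2.choice(cases, size=1000, replace=True)
    pick0 = rng2.choice(controls, size=1000, replace=True)
    idx = np.concatenate([pick1, pick0])
    fit = sm.Logit(y[idx], sm.add_constant(x[idx])).fit(disp=0)
    beta_hats.append(fit.params[1])
    beta_ses.append(fit.bse[1])

print(compare(np.array(beta_hats), np.array(beta_ses)))
# The replications are not independent, so these two numbers need not agree.
```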
Realizing this, I also tried doing the procedure with 100 n=200 samples instead of 100 n=2000 samples. I still used sampling with replacement, but now there was less overlap between samples and so less dependence between them. And now the estimated standard errors were close to the standard deviations.
This, I expect, is what would happen if I did the kind of repeated sampling that is our thought experiment for the kind of real studies that use the case-control method. That thought experiment is to take repeated draws of 60,000-point populations, with the same X's each time but with different epsilons and hence different Y's. Each of the 100 Monte Carlo samples would then come from a different population draw.
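A sketch of that corrected thought experiment (my code again): hold the 60,000 X's fixed, redraw the disturbances to get a fresh population for each replication, and take one case-control sample from each.

```python
# Corrected Monte Carlo: same X's each time, but fresh disturbances and hence a fresh
# population for every replication, then one case-control sample from each population.
beta_hats, beta_ses = [], []
for r in range(100):
    eps_r = rng2.logistic(size=N)                      # new epsilons; X's held fixed
    y_r = (alpha + beta * x + eps_r >= 0).astype(int)
    cases_r = np.flatnonzero(y_r == 1)
    controls_r = np.flatnonzero(y_r == 0)
    pick1 = rng2.choice(cases_r, size=1000, replace=True)   # guards against a draw with <1,000 cases
    pick0 = rng2.choice(controls_r, size=1000, replace=False)
    idx = np.concatenate([pick1, pick0])
    fit = sm.Logit(y_r[idx], sm.add_constant(x[idx])).fit(disp=0)
    beta_hats.append(fit.params[1])
    beta_ses.append(fit.bse[1])

print(compare(np.array(beta_hats), np.array(beta_ses)))
# If Prentice and Pyke are right, these two numbers should now be close for beta.
```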
Labels: case-control method, frequentist, math, statistics