Category Archives: Minitab
R-squared Shrinkage and Power and Sample Size Guidelines for Regression Analysis
Using a sample to estimate the properties of an entire population is common practice in statistics. For example, the mean from a random sample estimates that parameter for an entire population. In linear regression analysis, we’re used to the idea that the regression coefficients are estimates of the true parameters. However, it’s easy to forget that R-squared (R^{2}) is also an estimate. Unfortunately, it has a problem that many other estimates don’t have. R-squared is inherently biased!
In this post, I look at how to obtain an unbiased and reasonably precise estimate of the population R-squared. I also present power and sample size guidelines for regression analysis.
R-squared as a Biased Estimate
R-squared measures the strength of the relationship between the predictors and response. The R-squared in your regression output is a biased estimate based on your sample.
- An unbiased estimate is one that is just as likely to be too high as it is to be too low, and it is correct on average. If you collect a random sample correctly, the sample mean is an unbiased estimate of the population mean.
- A biased estimate is systematically too high or low, and so is the average. It’s like a bathroom scale that always indicates you are heavier than you really are. No one wants that!
R-squared is like the broken bathroom scale: it is deceptively large. Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.
This bias is a reason why some practitioners don’t use R-squared at all—it tends to be wrong.
R-squared Shrinkage
What should we do about this bias? Fortunately, there is a solution and you’re probably already familiar with it: adjusted R-squared. I’ve written about using the adjusted R-squared to compare regression models with a different number of terms. Another use is that it is an unbiased estimator of the population R-squared.
Adjusted R-squared does what you’d do with that broken bathroom scale. If you knew the scale was consistently too high, you’d reduce it by an appropriate amount to produce an accurate weight. In statistics this is called shrinkage. (You Seinfeld fans are probably giggling now. Yes, George, we’re talking about shrinkage, but here it’s a good thing!)
We need to shrink the R-squared down so that it is not biased. Adjusted R-squared does this by comparing the sample size to the number of terms in your regression model.
Regression models that have many samples per term produce a better R-squared estimate and require less shrinkage. Conversely, models that have few samples per term require more shrinkage to correct the bias.
The graph shows greater shrinkage when you have a smaller sample size per term and lower R-squared values.
Precision of the Adjusted R-squared Estimate
Now that we have an unbiased estimator, let’s take a look at the precision.
Estimates in statistics have both a point estimate and a confidence interval. For example, the sample mean is the point estimate for the population mean. However, the population mean is unlikely to exactly equal the sample mean. A confidence interval provides a range of values that is likely to contain the population mean. Narrower confidence intervals indicate a more precise estimate of the parameter. Larger sample sizes help produce more precise estimates.
All of this is true with the adjusted R-squared as well because it is just another estimate. The adjusted R-squared value is the point estimate, but how precise is it and what’s a good sample size?
Rob Kelly, a senior statistician at Minitab, was asked to study this issue in order to develop power and sample size guidelines for regression in the Assistant menu. He simulated the distribution of adjusted R-squared values around different population values of R-squared for different sample sizes. This histogram shows the distribution of 10,000 simulated adjusted R-squared values for a true population value of 0.6 (rho-sq (adj)) for a simple regression model.
With 15 observations, the adjusted R-squared varies widely around the population value. Increasing the sample size from 15 to 40 greatly reduces the likely magnitude of the difference. With a sample size of 40 observations for a simple regression model, the margin of error for a 90% confidence interval is +/- 20%. For multiple regression models, the sample size guidelines increase as you add terms to the model.
Power and Sample Size Guidelines for Regression Analysis
Satisfying these sample size guidelines helps ensure that you have sufficient power to detect a relationship and provides a reasonably precise estimate of the strength of that relationship. Specifically, if you follow these guidelines:
- The power of the overall F-test ranges from about 0.8 to 0.9 for a moderately weak relationship (0.25). Stronger relationships yield higher power.
- You can be 90% confident that the adjusted R-squared in your output is within +/- 20% of the true population R-squared value. Stronger relationships (~0.9) produce more precise estimates.
Terms |
Total sample size |
1-3 |
40 |
4-6 |
45 |
7-8 |
50 |
9-11 |
55 |
12-14 |
60 |
15-18 |
65 |
19-21 |
70 |
In closing, if you want to estimate the strength of the relationship in the population, assess the adjusted R-squared and consider the precision of the estimate.
Even when you meet the sample size guidelines for regression, the adjusted R-squared is a rough estimate. If the adjusted R^{2} in your output is 60%, you can be 90% confident that the population value is between 40-80%.
If you’re learning about regression, read my regression tutorial! For more histograms and the full guidelines table, see the simple regression white paper and multiple regression white paper.
This entry passed through the Full-Text RSS service – if this is your content and you’re reading it on someone else’s site, please read the FAQ at fivefilters.org/content-only/faq.php#publishers.
Comparing the College Football Playoff Top 25 and the Preseason AP Poll
The college football playoff committee waited until the end of October to release their first top 25 rankings. One of the reasons for waiting so far into the season was that the committee would rank the teams off of actual games and wouldn’t be influenced by preseason rankings.
At least, that was the idea.
Earlier this year, I found that the final AP poll was correlated with the preseason AP poll. That is, if team A was ranked ahead of team B in the preseason and they had the same number of losses, team A was still usually ranked ahead of team B. The biggest exception was SEC teams, who were able to regularly jump ahead of teams (with the same number of losses) ranked ahead of them in the preseason.
If the final AP poll can be influenced by preseason expectations, could the college football playoff committee be influenced, too? Let’s compare their first set of rankings to the preseason AP poll to find out.
Comparing the Ranks
There are currently 17 different teams in the committee’s top 25 that have just one loss. I recorded the order they are ranked in the committee’s poll and their order in the AP preseason poll. Below is an individual value plot of the data that shows each team’s preseason rank versus their current rank.
Teams on the diagonal line haven’t moved up or down since the preseason. Although Notre Dame is the only team to fall directly on the line, most teams aren’t too far off.
Teams below the line have jumped teams that were ranked ahead of them in the preseason. The biggest winner is actually not an SEC team, it’s TCU. Before the season, 13 of the current one-loss teams were ranked ahead of TCU, but now there are only 4. On the surface TCU seems to counter the idea that only SEC teams can drastically move up from their preseason ranking. However, of the 9 teams TCU jumped, only one (Georgia) is from the SEC. And the only other team to jump up more than 5 spots is Mississippi—who of course is from the SEC. So I wouldn’t conclude that the CFB playoff committee rankings behave differently than the AP poll quite yet.
Teams below the line have been passed by teams that had been ranked behind them in the preseason. Ohio State is the biggest loser, having had 9 different teams pass over them. Part of this can be explained by the fact that they have the worst loss (a 4-4 Virginia Tech game at home). But another factor is that the preseason AP poll was released before anybody knew Buckeye quarterback Braxton Miller would miss the entire season. Had voters known that, Ohio State probably wouldn’t have been ranked so high to begin with.
Overall, 10 teams have moved up or down from their preseason spot by 3 spots or less. The correlation between the two polls is 0.571, which indicates a positive association between the preseason AP poll and the current CFB playoff rankings. That is, teams ranked higher in the preseason poll tend to be ranked higher in the playoff rankings.
Concordant and Discordant Pairs
We can take this analysis a step further by looking at the concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.
For example, let’s compare Auburn and Mississippi. In the preseason, Auburn was ranked 3 (out of the 17 one-loss teams) and Mississippi was ranked 10. In the playoff rankings, Auburn is ranked 1 and Mississippi is ranked 2. This pair is concordant, since in both cases Auburn is ranked higher than Mississippi. But if you compare Alabama and Mississippi, you’ll see Alabama was ranked higher in the preseason, but Mississippi is ranked higher in the playoff rankings. That pair is discordant.
When we compare every team, we end up with 136 pairs. How many of those are concordant? Our favorite statistical software has the answer:
There are 96 concordant pairs, which is just over 70%. So most of the time, if a team ranked higher in the preseason poll, they are ranked higher in the playoff rankings. And consider this: of the one-loss teams, the top 4 ranked preseason teams were Alabama, Oregon, Auburn, and Michigan St. Currently, the top 4 one loss teams are Auburn, Mississippi, Oregon, and Alabama. That’s only one new team—which just so happens to be from the SEC.
That’s bad news for non-SEC teams that started the season ranked low, like Arizona, Notre Dame, Nebraska, and Kansas State. It’s going to be hard for them to jump teams with the same record, especially if those teams are from the SEC. Just look at Alabama’s résumé so far. Their best win is over West Virginia and they lost to #4 Mississippi. Is that really better than Kansas State, who lost to #3 Auburn and beat Oklahoma on the road? If you simply changed the name on Alabama’s uniform to Utah and had them unranked to start the season, would they still be ranked three spots higher than Kansas State? I doubt it.
The good news is that there are still many games left to play. Most of these one-loss teams will lose at least one more game. But with 4 teams making the playoff this year, odds are we’ll see multiple teams with the same record vying for the last playoff spot. And if this college football playoff ranking is any indication, if you’re not in the SEC, teams who were highly thought of in the preseason will have an edge.
This entry passed through the Full-Text RSS service – if this is your content and you’re reading it on someone else’s site, please read the FAQ at fivefilters.org/content-only/faq.php#publishers.