When Samples Become Reliable
by Eric Seidman - May 22, 2009 · Filed under Research
One of the most difficult tasks a responsible baseball analyst faces is avoiding the temptation to use small samples of data to make definitive claims about a player. If Victor Martinez goes 4-for-10, that does not automatically make him a .400 hitter. We have enough information about Martinez from previous seasons to know that his actual abilities fall well short of that mark. Not everything, however, merits a house call from the small sample size police, because some stats stabilize more quickly than others. Additionally, many small sample size criticisms stem from how the information is used, not from the information itself. If Pat Burrell struggled mightily after the All-Star break last season and started this season with similarly poor numbers, we can infer that his skills may be eroding. Isolating either of these stretches on its own can prove inaccurate, but taking them together offers some valuable information.
The question asked most often with regard to small sample sizes is essentially: when are the samples not small anymore? That is, at what juncture does the data become meaningful? Martinez at 4-for-10 is meaningless. Martinez at 66-for-165, where he stands right now, tells us much, much more, but it still is not enough playing time. What are the plate appearance benchmarks at which certain statistics become reliable? Before giving the actual numbers, let me point out that the results come from this article by a friend of mine, Pizza Cutter, over at Statistically Speaking. Warning: that article is very research-heavy, so you must put on your 3D Nerd Goggles before journeying into the land of reliability and validity. Also, Cutter mentioned that he would be able to answer any methodological questions here, so ask away. Half of my statistics background comes from school and independent study and the other half from Pizza Cutter, so do not be shy.
Cutter basically searched for the point at which split-half reliability tests produced a correlation of 0.70 or higher. A split-half reliability test involves partitioning one dataset in two and correlating the halves: for instance, separating all of Burrell's even-numbered plate appearances from his odd-numbered ones, computing the statistic in each half, and then correlating the two. When the halves agree closely, the statistic is reliable at that sample size. Though a 1.0 correlation indicates a perfect relationship, 0.70 is the usual benchmark in statistical studies, especially in baseball, where DIPS theory was derived from correlations of lesser strength. Without further delay, here are the results of his article on when certain statistics stabilize for individual hitters:
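To make the method concrete, here is a minimal sketch of a split-half reliability test on simulated data. This is not Cutter's actual code or data; the player pool, true strikeout rates, and 300-PA sample size are all invented for illustration. Each simulated hitter's plate appearances are split into even- and odd-indexed halves, the strikeout rate is computed in each half, and the two sets of rates are correlated across players.

```python
# Sketch of a split-half reliability test on simulated hitters.
# Hypothetical data: the real study used actual play-by-play logs.
import random

random.seed(1)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_correlation(players):
    """For each player, split PAs into even- and odd-indexed halves,
    compute the strikeout rate in each half, then correlate the
    even-half rates with the odd-half rates across players."""
    even_rates, odd_rates = [], []
    for outcomes in players:  # outcomes: 1 = strikeout, 0 = anything else
        even, odd = outcomes[0::2], outcomes[1::2]
        even_rates.append(sum(even) / len(even))
        odd_rates.append(sum(odd) / len(odd))
    return pearson(even_rates, odd_rates)

# Simulate 200 hitters with true strikeout rates between .10 and .30,
# each given 300 plate appearances (150 PA per half).
players = []
for _ in range(200):
    true_rate = random.uniform(0.10, 0.30)
    players.append([1 if random.random() < true_rate else 0
                    for _ in range(300)])

r = split_half_correlation(players)
print(f"split-half correlation: {r:.2f}")
```

With this much spread in true talent and 150 PA per half, the correlation comes out well above zero but below 1.0; shrink the sample to a handful of PAs per half and it collapses toward zero, which is exactly the small-sample problem the benchmarks below quantify.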
50 PA: Swing %
100 PA: Contact Rate
150 PA: Strikeout Rate, Line Drive Rate, Pitches/PA
200 PA: Walk Rate, Groundball Rate, GB/FB
250 PA: Flyball Rate
300 PA: Home Run Rate, HR/FB
500 PA: OBP, SLG, OPS, 1B Rate, Popup Rate
550 PA: ISO
Cutter went up to 650 PA as his maximum, meaning that the exclusion of statistics like BA, BABIP, WPA, and context-neutral WPA indicates they had not yet stabilized. So, here you go: I hope this assuages certain small sample misconceptions and provides some insight into when we can discuss a given metric from a skills standpoint. There are some red flags with an analysis like this, primarily that playing time is not assigned randomly: by requiring 650 PA, a selection bias may shine through, since the players given that many plate appearances tend to be the more consistent ones. Cutter avoids the brunt of this by comparing players to themselves. Even so, these benchmarks are tremendous estimates at the very least.