Small Sample Size [Archive] - RedsZone.com - Cincinnati Reds Fans' Home for Baseball Discussion

EddieMilner

04-23-2007, 02:52 PM

I hear small sample size quite a bit on this website. We are now over 10% into the season. When is it no longer a small sample size? In every situation I've ever seen (outside of baseball), a 10% sample is pretty adequate.

Doro

04-23-2007, 02:59 PM

one tenth of anything is not adequate

EddieMilner

04-23-2007, 03:10 PM

one tenth of anything is not adequate

For a sample 10% is overly thorough. At least in manufacturing instances.

joshnky

04-23-2007, 03:16 PM

For a sample 10% is overly thorough. At least in manufacturing instances.

10% is fine in manufacturing but not in cases with a great deal of variation such as a baseball season. This year that 10% came in under abnormally cold conditions, which favor the pitchers. Give it time for the climate to even things out and for slow starters to get going and then look at your sample. I would compare this to purchasing a new machine and sampling the first 100 pieces and expecting the next 900 to be the same. You wouldn't do that in that situation because there is a learning curve for the operators and usually a break-in period for the machine. I think it works similarly in baseball.

kaldaniels

04-23-2007, 03:25 PM

For a sample 10% is overly thorough. At least in manufacturing instances.

Ask Chris Shelton if 10% is thorough enough.

durl

04-23-2007, 03:27 PM

Sometimes 10% is adequate, but that depends on what you're researching. And a sample size in a poll is quite different than the first 10% of a baseball season.

I believe that a period of 3 weeks is a fairly short amount of time to determine what the season will be like. We can look at the Astros over the past 2 years and see that how you start doesn't determine how you finish.

George Anderson

04-23-2007, 03:28 PM

one tenth of anything is not adequate

David Ross and his .100 batting average are proof positive!!!

Dunner44

04-23-2007, 03:58 PM

If you sampled 10% of the games from the season randomly (so 16ish games) you could probably get a semi-accurate prediction of what a player's season looked like. But what you have here is the first ten percent of the season, which is intrinsically linked with the situations that occurred over the first 16ish games. So cold weather, a hitting slump because of a minor injury, etc.

rotnoid

04-23-2007, 04:17 PM

If you sampled 10% of the games from the season randomly (so 16ish games) you could probably get a semi-accurate prediction of what a player's season looked like. But what you have here is the first ten percent of the season, which is intrinsically linked with the situations that occurred over the first 16ish games. So cold weather, a hitting slump because of a minor injury, etc.

Exactly. The way these stats are being used isn't a sample, it's an extrapolation. A sample implies a piece of the whole. The whole is unknown as yet. There are a myriad of things that will happen over a season to affect the whole. To get a picture of them, one would take a random sample, not 20 games in a row.

kaldaniels

04-23-2007, 04:28 PM

If you sampled 10% of the games from the season randomly (so 16ish games) you could probably get a semi-accurate prediction of what a player's season looked like. But what you have here is the first ten percent of the season, which is intrinsically linked with the situations that occurred over the first 16ish games. So cold weather, a hitting slump because of a minor injury, etc.

NEVERMIND THIS POST...I see what you meant. I disagree...when you look at 16-game stretches you don't get an accurate sample. Or even a semi-accurate sample...real quick, here is Dunn's BA for each month last year (a much larger sample than your 16 games): .265 .212 .221 .354 .188 .157. Most of those numbers are not even close to his .234 season BA. Face it, baseball is a game of hot and cold streaks. Go to espn.com and look at players' stats for last year. Most have dramatic swings from month to month.

remdog

04-23-2007, 04:28 PM

Good explanation, Rotnoid.

Rem

texasdave

04-23-2007, 04:50 PM

Because I have no life I looked up a random number generator online and had it randomly pick out 16 games for AD during the 2006 season. I compiled the numbers and multiplied by 10. It seems as if this RNG is not a big AD fan. :)

ave - .183
slg - .350
obp - .258
ops - .608

ab - 600
hits - 110
runs - 90
hr - 20
rbi - 70
k - 300
bb - 60

This was only a test using a random sampling to see how close it would get.
I just wanted to see what it would look like. Infer nothing.
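
The experiment described above (pick 16 games at random, then see how far the result lands from the full-season line) can be sketched in Python. This is only a rough illustration under loud assumptions: the season below is made up by a random number generator, not Dunn's actual 2006 game log, and the names `simulate_season` and `batting_average`, the 160-game length, the 4 at-bats per game, and the fixed .234 hit probability are all hypothetical choices for the sketch.

```python
import random

def simulate_season(n_games=160, ab_per_game=4, p_hit=0.234, seed=0):
    """Return a MADE-UP season as a list of (at_bats, hits) per game.

    Assumption: every at-bat is an independent coin flip with the same
    hit probability. This is synthetic data, not a real game log.
    """
    rng = random.Random(seed)
    season = []
    for _ in range(n_games):
        hits = sum(rng.random() < p_hit for _ in range(ab_per_game))
        season.append((ab_per_game, hits))
    return season

def batting_average(games):
    """Hits divided by at-bats over a list of (at_bats, hits) games."""
    at_bats = sum(ab for ab, _ in games)
    hits = sum(h for _, h in games)
    return hits / at_bats

season = simulate_season()
season_avg = batting_average(season)

# Draw many different random 16-game samples to see how much they spread.
rng = random.Random(1)
sample_avgs = [batting_average(rng.sample(season, 16)) for _ in range(1000)]

print(f"full-season average: {season_avg:.3f}")
print(f"16-game samples range from {min(sample_avgs):.3f} to {max(sample_avgs):.3f}")
```

Running it many times rather than once is the point: a single draw of 16 games can land well above or below the season average even when the games are chosen at random.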

Dunner44

04-23-2007, 05:06 PM

Nice work, super sleuth. I guess this RNG picked some games Dunner was not doing so hot in. For the record, I don't think 10% is that great a sample in baseball even if you have the whole season to draw from (like what you did). Not bashing your work, just disagreeing with the original post that 10% is enough to draw conclusions, especially over a 162-game season.

oneupper

04-23-2007, 05:15 PM

You can go further with this and assume that there is a basic "ability" level which determines how well a batter will do (say...hit .300).

All other variations, such as weather, opposing pitchers, situation, small injuries are just randomness of the sampling process.

You can then calculate the probability of said hitter attaining a certain performance (use BA...it's easier) based on sample size.

Some examples:

Prob. of a .300 hitter underperforming:

.200 or less after 50 AB: 7.9%
.140 or less after 50 AB: 0.7%
.200 or less after 100 AB: 1.7%
.140 or less after 100 AB: 0.02%

So, if you believe your guy is .300 "talent" and he slumps badly over a couple of weeks...it can be randomness. If the slump is really bad...it could be a "career" slump.

If the "career" slump continues for another two weeks...most likely your guy isn't .300 talent anymore.
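
For anyone who wants to check figures like these, here is a minimal sketch of one way to compute them. It assumes each at-bat is an independent trial with a fixed .300 chance of a hit, so hits follow a binomial distribution. That model is an assumption about how the numbers were produced, not something stated in the post, and `binom_cdf` is a helper name made up for the sketch. The hit counts come from the averages: .200 of 50 AB is 10 hits, .140 of 50 AB is 7 hits, and so on.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): chance of k or fewer hits in n AB."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

TRUE_AVG = 0.300  # assumed "ability" level of the hitter

# (at_bats, hits) pairs matching .200 and .140 after 50 and 100 AB
cases = [(50, 10), (50, 7), (100, 20), (100, 14)]

results = {}
for at_bats, hits in cases:
    prob = binom_cdf(hits, at_bats, TRUE_AVG)
    results[(at_bats, hits)] = prob
    print(f"{hits / at_bats:.3f} or less after {at_bats} AB: {prob:.2%}")
```

Under this model the 50 AB cases come out at roughly 7.9% and 0.7%, and the 100 AB cases appear to line up with the quoted figures to within rounding.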

remdog

04-23-2007, 05:28 PM

So, are you saying that David Ross may never get a hit again in his career? :p:

Rem

oneupper

04-23-2007, 05:36 PM

So, are you saying that David Ross may never get a hit again in his career? :p:

Rem

Ross is 4 for 38...

Doing it backwards:

Prob .300 = 0.32% (most likely he's not a .300 hitter)
Prob .250 = 1.63% (he could be a .250 hitter in a BAD slump)
Prob .220 = 3.71% (this is a slump even for a BAD hitter)
Prob .200 = 6.0% (see above).

The fact that all those probabilities are still within the realm of "not so rare" ...PROBABLY indicates that the sample size is still small.
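
The "backwards" numbers can be checked the same way. A sketch, again assuming independent at-bats with a fixed hit probability (a binomial model, which is an assumption here): the quoted percentages appear to match the probability of getting exactly 4 hits in 38 AB at each assumed talent level, rather than 4 or fewer. `binom_pmf` is a hypothetical helper name.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p): chance of exactly k hits in n AB."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

HITS, AT_BATS = 4, 38  # Ross's line as given in the post

likelihoods = {}
for talent in (0.300, 0.250, 0.220, 0.200):
    likelihoods[talent] = binom_pmf(HITS, AT_BATS, talent)
    print(f"Prob of 4-for-38 as a {talent:.3f} hitter: {likelihoods[talent]:.2%}")
```

Note these are the chances of the observed line given each talent level, not the chances that Ross is a hitter of that level. Getting the latter would need some prior belief about his ability, which the post does not supply.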

Ltlabner

04-24-2007, 09:04 AM

For a sample 10% is overly thorough. At least in manufacturing instances.

In manufacturing the amount of variation is so small (for example, a few thousandths of an inch on a machined part) that 10% works just fine for random sampling.

The variation of a ballplayer's performance over the course of a season is so drastic that 10% is useless in determining anything other than what took place during that tiny window of time.

rotnoid

04-24-2007, 11:08 AM

In manufacturing the amount of variation is so small (for example, a few thousandths of an inch on a machined part) that 10% works just fine for random sampling.

The variation of a ballplayer's performance over the course of a season is so drastic that 10% is useless in determining anything other than what took place during that tiny window of time.

Right. In the type of samples used for quality assurance there are only two options, right or not right. Unlike in baseball statistics, there are no other variables such as weather, opponent, altitude, or whatever. There are simply too many variables across a season to get by with only using 10%. And as I said earlier, you can't sample from the front, it only works from the whole.