I've been increasingly frustrated by metrics used to measure starting pitcher quality. Let's look at the 3 Reds starters in 2006 who started 25 games or more:
Harang: 35 GS
Arroyo: 34 GS
Milton: 27 GS
ERA is surely decent, but it ignores the very real fact that innings and runs are accrued in bunches (35 games) and that their true value is contained within that given start. A pitcher could give you 34 complete-game shutouts and 1 outing with 153 ER and 0 outs. His ERA is 4.50 and, had we not known that distribution, we would've thunk him a league average starter, despite his having the best season ever. Is he really worse than the guy who gives you 35 starts of 9 IP and 4 ER? Of course not. So ERA could be better...
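Here's a quick sketch of that arithmetic, just to show the averages really do come out that way. Both pitchers are the made-up ones from the paragraph above, and the function names are my own for illustration.

```python
def season_era(starts):
    """ERA over a list of (earned_runs, outs) starts: earned runs per 27 outs."""
    total_er = sum(er for er, _ in starts)
    total_outs = sum(outs for _, outs in starts)
    return total_er * 27 / total_outs

# Hypothetical pitcher A: 34 complete-game shutouts, plus one start of 153 ER and 0 outs.
pitcher_a = [(0, 27)] * 34 + [(153, 0)]

# Hypothetical pitcher B: 35 starts of 9 IP and 4 ER each.
pitcher_b = [(4, 27)] * 35

print(season_era(pitcher_a))  # 4.5
print(season_era(pitcher_b))  # 4.0
```

By ERA alone, pitcher A looks half a run worse than pitcher B, even though A threw a shutout in 34 of his 35 starts.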
So how do we measure the individual games? Well, the easiest metric is W/L, but let's not go down that road. We want to measure pitchers only, not their offensive support. So how about QS? Well, it's better, but quite flawed, as there are "better" starts which don't get the QS tag. Eric Milton had a higher percentage of quality starts than Aaron Harang last year, which seems quite counter-intuitive. Aaron Harang was penalized for going deeper into games and giving up an extra run. So my first thought is, rather than simply throwing out QS, let's adjust it. The idea remains the same. Namely, how often did the pitcher give his team a decent chance to win?
The 3 Reds starters made a combined 8 starts of less than 5 IP and each time allowed at least as many runs as IP. So we can confidently say that there is no such thing (for the 2006 Reds) as a "good" sub 5 IP start. So let's put the first threshold there. You gotta throw 5 IP. Next, rather than look at raw ER, let's account for the per inning run allowance so that 3 more innings and 1 more run is a GOOD thing. So, rather than looking for 3 or fewer ER, we're looking for a "Start ERA" or SERA of 4.50 or lower. Let's see how that changes things (AQS: Adjusted Quality Start):
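To make the two definitions concrete, here's a minimal sketch. It assumes the standard QS rule (at least 6 IP, 3 or fewer ER) and the AQS rule described above (at least 5 IP, SERA of 4.50 or lower). Innings are passed as outs to avoid the "6.2 IP" notation problem, and the function names are mine.

```python
def sera(earned_runs, outs):
    """Start ERA: earned runs per 9 innings (27 outs) for a single start."""
    if outs == 0:
        return float("inf")  # no outs recorded, so the rate is unbounded
    return earned_runs * 27 / outs

def is_quality_start(earned_runs, outs):
    """Standard QS: at least 6 IP (18 outs) and 3 or fewer ER."""
    return outs >= 18 and earned_runs <= 3

def is_adjusted_quality_start(earned_runs, outs):
    """AQS as described above: at least 5 IP (15 outs) and SERA <= 4.50."""
    return outs >= 15 and sera(earned_runs, outs) <= 4.50

# 8 IP, 4 ER: misses the QS tag on runs, but the SERA is exactly 4.50.
print(is_quality_start(4, 24), is_adjusted_quality_start(4, 24))  # False True
# 6 IP, 3 ER: the classic borderline QS also has a 4.50 SERA.
print(is_quality_start(3, 18), is_adjusted_quality_start(3, 18))  # True True
```

The 8 IP, 4 ER line is exactly the kind of start the plain QS rule penalizes: two more innings than the borderline QS for one more run.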
Arroyo: QS 22/34 (65%), AQS 22/34 (65%)
Milton: QS 15/27 (56%), AQS 17/29 (59%)
Harang: QS 17/35 (49%), AQS 19/35 (54%)
Hmm... Harang still isn't looking so good. So why is his ERA so low compared to Milton's? Was he really a better pitcher? And herein is the crux of the problem. While Milton had more "quality" starts, Milton toed the line. While Harang had some really dominant starts, dragging down his ERA, Milton just made it over the AQS threshold. Ok, well, does that matter or not?
I say it does. If your team scores 4.5 runs per game, an AQS basically means you gave your team a 50% chance to win. But if you go 8 IP and 2 ER, you've done much more than that. Likewise, a 7 IP, 4 ER start isn't "quality", but it gives your team a much better shot than does a 5 IP, 6 ER start. So I want to account for that, while maintaining the "by game" measurement style. Instead of a simple binary split on the 50% line, let's put starts in 5 groups, sticking with the basic AQS methodology (the sub 5 IP starts all have very poor SERAs, so we'll toss out that requirement). The idea is to capture what kind of opportunity that start afforded the Reds to win the game. The pitcher can't control runs scored by the Reds, so we have to level the playing field and assume an average of 4.50 runs scored by the Reds. There's also a correlation of -.6 between ER and IP, saying basically that the fewer runs you allow, the more innings you pitch. So we can roughly say that lower SERAs in general have more innings attached to them, capturing that ideal of longer outings as well.
(I think we often inappropriately weight IP in our off the cuff measurements of start quality. Given a replacement level bullpen with a 5.50 ERA, a 5 IP, 1 ER outing IS better than a 7 IP, 3 ER outing. Is your bullpen going to give up 2 runs in 2 innings? Most of the time, it won't. It's why I get bugged when starters who are great for 80 pitches, like Javier Vazquez, routinely get forced to throw those last 30 pitches where they get rocked. If insanity is doing the same thing but expecting a different outcome, we've got a lot of insane managers out there. Give me a Kirk Saarloos or Matt Belisle in the pen and let him throw 2-3 innings twice a week when the starter is losing it early -- but get him in there BEFORE the walk, single, 3-run homer.)
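The bullpen comparison can be sketched as simple expected-value arithmetic. This assumes a 9-inning game and a bullpen that allows runs at exactly its 5.50 ERA over whatever innings are left, which is a simplification; the function name is mine.

```python
def expected_runs_allowed(starter_ip, starter_er, bullpen_era=5.50, game_innings=9):
    """Starter's earned runs plus the bullpen's expected runs over the remaining innings."""
    bullpen_innings = game_innings - starter_ip
    return starter_er + bullpen_era * bullpen_innings / 9

short_start = expected_runs_allowed(5, 1)  # 5 IP, 1 ER, bullpen covers 4 innings
long_start = expected_runs_allowed(7, 3)   # 7 IP, 3 ER, bullpen covers 2 innings

print(round(short_start, 2), round(long_start, 2))  # about 3.44 and 4.22
```

Under those assumptions the short start leaves the team expecting to allow roughly three quarters of a run less over the full game.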
It's important to remember that while the average of SERA across a season (ERA) has a pretty narrow distribution (I'd estimate 90% of SP fall between 3.50 and 5.50), the SERA of a given game has a fairly wide spread. That is, even a guy with a 3 ERA will give up 0 runs sometimes and 7 runs other times, and rarely gives up exactly 3.
SERA <2.00 = Excellent chance to win
SERA >2.00 & <3.50 = Good chance to win
SERA >3.50 & <5.00 = Mediocre chance to win
SERA >5.00 & <6.50 = Poor chance to win
SERA >6.50 = Terrible chance to win
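Here's a rough sketch of how the grouping and the resulting distribution could be computed. The cutoffs are the ones listed above; since the list doesn't say which group a start landing exactly on a cutoff (say, a 3.50 SERA) belongs to, I'm assuming it goes in the worse group. The sample starts are made up for illustration, not real 2006 game logs.

```python
from collections import Counter

LABELS = ["Excellent", "Good", "Mediocre", "Poor", "Terrible"]
# (upper cutoff, label); a SERA exactly on a cutoff falls into the next, worse group (assumption).
CUTOFFS = [(2.00, "Excellent"), (3.50, "Good"), (5.00, "Mediocre"), (6.50, "Poor")]

def sera(earned_runs, outs):
    """Start ERA: earned runs per 27 outs for a single start."""
    return float("inf") if outs == 0 else earned_runs * 27 / outs

def classify(earned_runs, outs):
    """Return the win-chance group for one start."""
    value = sera(earned_runs, outs)
    for upper, label in CUTOFFS:
        if value < upper:
            return label
    return "Terrible"

def distribution(starts):
    """Percentage of starts in each group, in the order %ES/%GS/%MS/%PS/%TS."""
    counts = Counter(classify(er, outs) for er, outs in starts)
    return {label: 100 * counts[label] / len(starts) for label in LABELS}

# Made-up (earned_runs, outs) lines, one per group.
sample = [(0, 27), (2, 21), (3, 18), (4, 18), (6, 12)]
print(distribution(sample))
```

With one start in each group, every percentage in the sample comes out to 20.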
Now we're looking at a distribution of starts. The chart/table below will be the percentage of starts in each group: %ES/%GS/%MS/%PS/%TS
That should tell you all you need to know. Milton, Arroyo, and Harang were all Poor or Terrible at about the same frequency (35-40% of their starts). However, when they were mediocre or better, they were much different pitchers. Arroyo was Good or Excellent in 56% of his starts, compared to 43% for Harang, and just 34% for Milton.
When discussing the performance of a pitcher over time, it's important to consider the distribution of the runs he's allowed, not just the average rate at which he allows them. As bad as Eric Milton is, he doesn't "lose" the Reds any more games than Harang or Arroyo. To me, this should be the #1 takeaway from this whole post. Read that last sentence again. Milton doesn't lose more games. However, he doesn't win very many games either. What made Arroyo so good last year is that he "won" over 50% of his starts. That's why he was our ace.
As a follow-up, I considered looking at Bill James' Game Scores. They weigh IP a little more, and also consider Hits, BBs, Ks, and unearned Runs, getting a little closer to the true effectiveness of the pitcher. You could group Game Scores as I've done with SERA and see similar distributions.
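For reference, here's a sketch of Game Score as the formula is commonly stated: start at 50, add 1 per out, 2 per inning completed after the fourth, and 1 per strikeout; subtract 2 per hit, 4 per earned run, 2 per unearned run, and 1 per walk. The function and parameter names are mine, and the stat line at the bottom is a made-up example.

```python
def game_score(outs, hits, earned_runs, unearned_runs, walks, strikeouts):
    """Bill James' Game Score for one start, per the commonly stated formula."""
    innings_completed = outs // 3
    score = 50
    score += outs                            # +1 per out recorded
    score += 2 * max(0, innings_completed - 4)  # +2 per inning completed after the 4th
    score += strikeouts                      # +1 per strikeout
    score -= 2 * hits                        # -2 per hit
    score -= 4 * earned_runs                 # -4 per earned run
    score -= 2 * unearned_runs               # -2 per unearned run
    score -= walks                           # -1 per walk
    return score

# Made-up line: 7 IP, 5 H, 2 ER, 0 unearned runs, 2 BB, 6 K.
print(game_score(outs=21, hits=5, earned_runs=2, unearned_runs=0, walks=2, strikeouts=6))  # 63
```

A list of these scores could then be cut into groups the same way the SERA values are, with whatever cutoffs you think are reasonable.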
Lastly, I want to make a point about the use of averages. Averages are somewhat useful in comparing the overall effectiveness of something over large samples. However, we need to be clear that an average is a middle point that isn't itself a reality. Aaron Harang didn't have 35 9-inning starts in each of which he gave up 3.78 runs. In fact, fewer than 20% of his starts were even within .5 ER of that. In baseball, games are what count. Every game results in a win or a loss and after that game, it doesn't matter if you lost by 10 or by 2. If you shut 'em out next time, you're gonna be 1-1, regardless of that ERA. We should be careful to account for the way in which stats are accrued as we value aggregated performances.