WAR



traderumor
01-03-2011, 11:01 PM
I'm trying to give this concept an ear, but help me to understand how this measure is reliable. First question--is the WAR reported on baseball-reference.com reliable? If so, is it really the position of this measure that Brandon Phillips is at or slightly below replacement level on defense for his career? Is that position supportable?
http://www.baseball-reference.com/players/p/phillbr01.shtml

Redsfan320
01-03-2011, 11:05 PM
The point still holds just as strongly really, but for dWAR, BR uses league average as replacement level, meaning that they're saying he's under 50% in defense, not under ~20%. Still doesn't help though.

320

Caveat Emperor
01-04-2011, 03:17 PM
So, what you're really asking is:

WAR, what is it good for?

traderumor
01-04-2011, 04:39 PM
;) Not quite yet, admittedly have not the depth of knowledge, but trying to get up to speed and starting with what seem to be common sense questions. I'm really not trying to start a WAR :)

Hoosier Red
01-04-2011, 04:49 PM
I'm trying to give this concept an ear, but help me to understand how this measure is reliable. First question--is the WAR reported on baseball-reference.com reliable? If so, is it really the position of this measure that Brandon Phillips is at or slightly below replacement level on defense for his career? Is that position supportable?
http://www.baseball-reference.com/players/p/phillbr01.shtml

Not to speak to the reliability one way or another, but in his worst year from a defensive standpoint (2006), Phillips had a lower than league average fielding percentage and a lower than league average range factor.

Now make of those stats' reliability what you wish.

Tom Servo
01-04-2011, 05:06 PM
So, what you're really asking is:

WAR, what is it good for?
Good God, y'all.

RedsManRick
01-04-2011, 07:19 PM
I would be careful to separate the WAR concept from a given implementation.

The framework for WAR is a standard set of component inputs which add up to a comprehensive total measure of production above replacement. But each implementation of WAR uses different versions of those inputs -- a different baseline for replacement level, different position adjustments, different data sources for defense or different methods of calculating production (for example, B-Ref uses adjusted runs allowed for pitchers, Fangraphs uses FIP).
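To make the "components add up" idea concrete, here is a rough sketch. The component names, the 10 runs per win rule of thumb, and every number in it are assumptions for illustration only; neither B-Ref nor Fangraphs necessarily structures its calculation this way.

```python
from dataclasses import dataclass

# Assumption: roughly 10 runs equal one win (a common rule of thumb).
RUNS_PER_WIN = 10.0


@dataclass
class Components:
    """Hypothetical run components for one player-season."""
    batting_runs: float      # runs above average at the plate
    fielding_runs: float     # runs above average on defense
    positional_runs: float   # adjustment for the position played
    replacement_runs: float  # gap between average and replacement level


def war(c: Components, runs_per_win: float = RUNS_PER_WIN) -> float:
    """Add the run components and convert the total to wins."""
    total_runs = (c.batting_runs + c.fielding_runs
                  + c.positional_runs + c.replacement_runs)
    return total_runs / runs_per_win


# Same made-up player, two made-up implementations that disagree
# on the defensive input and on the replacement baseline.
impl_a = Components(10.0, -2.0, 2.5, 20.0)
impl_b = Components(10.0, 8.0, 2.5, 22.0)

print(f"implementation A: {war(impl_a):.2f} WAR")
print(f"implementation B: {war(impl_b):.2f} WAR")
```

The framework is identical in both cases; only the inputs differ, and that alone is enough to move the total by more than a win.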

You ask if B-Ref's WAR is reliable. There's no definite answer to that question. There are a number of ways to define reliable. At heart, what we want to know is how well it reflects "reality". But while there might be a theoretical "real" answer, there's no place to go and find it. All we can do is come up with ways to estimate it (guess, if you prefer). It's not like we can say "B-Ref says 3.5, Fangraphs says 5.0. The right answer is 4.1, therefore B-Ref is more reliable."

So we have to get at reliability other ways. Is the process logical? Is it calculated in a consistent manner? Does it use high-quality data? Does it align with other trusted sources of information? Does it align with other observed data (such as actual team wins if you add it all up)?
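As a sketch of that last check (adding it all up against team wins): take an assumed replacement-level team record, add the roster's WAR, and compare with what the team actually won. The 0.294 replacement-level winning percentage, the roster WAR values, and the 75-win total below are all assumptions for illustration, not figures from either site's data.

```python
def expected_wins(player_wars, games=162, replacement_win_pct=0.294):
    """Wins of an all-replacement team plus the roster's summed WAR.

    replacement_win_pct is an assumed baseline; each WAR
    implementation defines its own.
    """
    return games * replacement_win_pct + sum(player_wars)


def war_gap(player_wars, actual_wins, **kwargs):
    """Actual wins minus WAR-implied wins (positive = team beat its WAR)."""
    return actual_wins - expected_wins(player_wars, **kwargs)


# Hypothetical roster WAR values, not real data.
roster = [6.1, 4.8, 3.9, 3.2, 2.5, 2.0, 1.5, 1.0, 0.5, -0.5]

print(f"WAR-implied wins: {expected_wins(roster):.1f}")
print(f"gap vs. 75 actual wins: {war_gap(roster, 75):+.1f}")
```

If the gaps across many teams and seasons are small and centered on zero, that is one point in the model's favor. It says nothing about whether any single player's number is right.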

For me, the answer is yes for both B-Ref and Fangraphs' implementation of WAR. Sure, they're different in some ways. There are arguments for each of the various decisions made which differentiate the two. But at the end of the day, the reality for me is that WAR is like democracy. It's not a great system, but it just so happens to be better than all the rest.

I would add, some people rely heavily on the sniff test to answer your question. That is, they simply look at the results and see how well they conform to currently held expectations. This is a good first pass. If the model tells you Babe Ruth was a replacement-level player, you KNOW something is wrong. But I think some people take it too far. The sniff test doesn't tell you whether or not something is good/reliable. It simply raises the red flag and suggests you should investigate more closely. It's possible for something to provide a result that is really counterintuitive. That doesn't make it wrong.

I would suggest you go take a look at the process B-Ref uses to measure defense. I think they use Total Zone. Ask yourself if you trust the inputs Total Zone uses and its calculation process. If the answer remains yes, then maybe Phillips hasn't performed as well as we think. If not, well, there you go -- you should be more skeptical about WAR's defense measure.


Has Phillips been an average defender over his career? I don't know. I think he's been above average, but all I have to arrive at the conclusion are my memories of his performance and a vague notion of what good 2B defense looks like. But I also know that we are all subject to some pretty big biases. I know that I haven't watched all the other second basemen play. I know that people tend to remember exceptional, exciting things and give them disproportionate weight in their memories. I know that our pre-existing beliefs have a massive effect on how we perceive events. Our brains simply aren't built to do things like calculating cumulative performance relative to an unseen baseline. It is ENTIRELY possible that Phillips has been just average over his career. But I'm skeptical. I certainly would not dismiss WAR out of hand simply because it suggests something that I have a hard time believing. I would dig deeper, and I give you kudos for doing so.

Just keep in mind, B-Ref's WAR could be the absolute best system for assessing overall player performance and still be missing the mark with any given player. All models produce outliers. The reliability of the model is about how often it does so. My opinion is that defense metrics are still somewhat unreliable because the data is soft (input by hand, using crude tools and inconsistently applied definitions). However, I'm not aware of any systematic biases I can adjust for and don't have a better way to do it. So I go with it, keeping in mind that it is simply an educated guess. I wouldn't defend what WAR says to the ends of the earth on any given player, but until and unless somebody shows a consistent way of doing it that makes more sense, I'll keep using it.

dougdirt
01-04-2011, 08:09 PM
I would suggest you go take a look at the process B-Ref uses to measure defense. I think they use Total Zone. Ask yourself if you trust the inputs Total Zone uses and its calculation process. If the answer remains yes, then maybe Phillips hasn't performed as well as we think. If not, well, there you go -- you should be more skeptical about WAR's defense measure.


I am not a fan of the TotalZone system. It has some severe limitations and I feel there are multiple better options out there to measure defensive abilities. It is why I don't pay much attention to WAR at B-Ref.

traderumor
01-04-2011, 08:17 PM
I am not a fan of the TotalZone system. It has some severe limitations and I feel there are multiple better options out there to measure defensive abilities. It is why I don't pay much attention to WAR at B-Ref.

Ok, so I was right to assume that not all WAR is created equal. So when you start to see "player X is a 3 WAR player," how do I know if they are using a generally accepted WAR rating?

Hoosier Red
01-04-2011, 08:42 PM
Ok, so I was right to assume that not all WAR is created equal. So when you start to see "player X is a 3 WAR player," how do I know if they are using a generally accepted WAR rating?

You mean the GAWR?:)

That's a good point, I had been using the B-ref WAR Ratings, but perhaps it's worth further investigation?

dougdirt
01-04-2011, 09:59 PM
Ok, so I was right to assume that not all WAR is created equal. So when you start to see "player X is a 3 WAR player," how do I know if they are using a generally accepted WAR rating?

Well, for future reference, if I use WAR, I am quoting Fangraphs :thumbup:

Like all statistics, there is always some leeway that needs to be accounted for.

jojo
01-05-2011, 09:26 PM
When using WAR, you have to be careful to make comparisons using the same version because they handle estimates of offense and defense differently. For instance, BR.com uses Total Zone data for defense and thus can calculate WAR across eras, while Fangraphs uses UZR based upon play-by-play data and can only go back as far as 2002.

So which version you use depends somewhat on the question you're asking. If you want to compare Phillips to Morgan, you would use BR for both players and accept the deficiencies in the Total Zone approach in order to compare apples to apples. If you wanted to compare Phillips to Utley, you'd probably want to use Fangraphs because UZR is a better way to estimate defense.

I'm pretty comfortable with the notion that Phillips has been a significantly above average defender.

RedsManRick
01-06-2011, 01:37 PM
Ok, so I was right to assume that not all WAR is created equal. So when you start to see "player X is a 3 WAR player," how do I know if they are using a generally accepted WAR rating?

As far as I'm aware, the only two widely used WAR models are those at B-Ref and at Fangraphs. Like Doug, I prefer the Fangraphs implementation.

It would be a good practice for all of us to use fWAR (Fangraphs) and rWAR (Baseball-Reference) as appropriate instead of the generic WAR.

I would STRONGLY urge anybody who has issues with WAR not to throw the baby out with the bathwater. Both implementations have their weaknesses, but I have little doubt that WAR in general is the best framework for comprehensive player valuation that we have currently. There are and will continue to be cases in which the system doesn't work as well as we would like it to, but the core tenet of the sabermetric approach is objectivity. Using a consistent approach at least takes personal bias out of the equation.

Feel free to use the basic WAR framework to create tWAR, in which you use a different set of inputs which you feel more comfortable with. Alternately, become comfortable with the idea that the numbers are just best guesses and include an unstated range of uncertainty. I like to assume a 0.5 WAR confidence interval on either side of the listed WAR figure. Occasionally, I forget this and argue with a bit too much implied certainty using WAR. But I stick to my belief that while it's not perfect, it's still the best we've got right now.
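For what it's worth, here is a minimal sketch of that rule of thumb. It treats the 0.5 on either side as a plain plus-or-minus band, not a formally derived confidence interval, and the function names are made up for this example.

```python
def war_interval(war, half_width=0.5):
    """Return the (low, high) band around a listed WAR figure."""
    return (war - half_width, war + half_width)


def clearly_different(war_a, war_b, half_width=0.5):
    """True only if the two bands do not overlap at all."""
    low_a, high_a = war_interval(war_a, half_width)
    low_b, high_b = war_interval(war_b, half_width)
    return high_a < low_b or high_b < low_a


# 3.0 vs 3.6: the bands overlap, so don't argue too hard about the gap.
print(clearly_different(3.0, 3.6))

# 3.5 vs 5.0: the bands are separated, so the gap is probably meaningful.
print(clearly_different(3.5, 5.0))
```

Under this reading, two players listed less than a full win apart are not worth arguing over on WAR alone.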