1. ## Predicting attendance rates

I don't know if we're allowed to plug our own work here (apologies if not...) but:

We often hear ballplayers and front office officials talk about how their fans are the greatest in baseball. But who are the best fans in baseball? How do Reds fans stack up?

I've just posted, on my little blog, the second part of an investigation I've been doing into the factors that affect ballclub attendance. The ultimate goal was to generate a model that predicts how much attendance a club should have, and then see which teams do better or worse than predicted. Quick preview of findings: the analysis indicates that Reds fans are quite average in terms of their interest and loyalty, while St. Louis fans are extraordinary and Toronto fans are extremely sparse.

I think it's pretty interesting stuff, so I thought I'd post a link here in case others are curious:
Part 1: http://jinaz-reds.blogspot.com/2006/...rest-pt-1.html
Part 2: http://jinaz-reds.blogspot.com/2006/...rest-pt-2.html

-JinAZ
...if nothing else, it's nice to see that Yankee and Boston fans don't top the charts, much to ESPN's surprise...

3. ## Re: Predicting attendance rates

One thing leaped out at me in Part 1:

As an initial stab, I just divided the city population by two. Unfortunately, this resulted in an overall decrease in the predictive power of the regression model: the R^2 decreased from 0.34 to 0.27. Therefore, this adjustment seems inappropriate and hurts the fit of our attendance model.
It's not an inappropriate adjustment just because it reduced the correlation. The idea is to make reasonable assumptions and then see what the correlation is, not to latch onto some preliminary figure as "correct" and discard any ensuing assumptions that don't fit the conclusion already drawn. I don't think you'd find many people who think it's more accurate to treat two-team markets as one-team markets. If making a reasonable guess for splitting two-team markets means that it's a little less possible to know attendance solely from population, that is what it is.

I think it would also be more accurate to use metro MSA rather than just the city population.

The BP gang worked up a lot of this sort of thing in Baseball Between The Numbers. Good read.

4. ## Re: Predicting attendance rates

Originally Posted by IslandRed
One thing leaped out at me in Part 1:
It's not an inappropriate adjustment just because it reduced the correlation. The idea is to make reasonable assumptions and then see what the correlation is, not to latch onto some preliminary figure as "correct" and discard any ensuing assumptions that don't fit the conclusion already drawn. I don't think you'd find many people who think it's more accurate to treat two-team markets as one-team markets. If making a reasonable guess for splitting two-team markets means that it's a little less possible to know attendance solely from population, that is what it is.
Good points, though I disagree. From my perspective, the goal is to construct a model that is good at predicting attendance rates. If splitting the city population in half, or splitting it by the ratio of attendance, results in a poorer fit of the model than just taking city populations as-is, then I'm going to go with the model that does a better job at prediction. Just because it seems reasonable to assume that two ballclubs sharing the same metro area will detract from each other's available market doesn't mean that it is the case.

I think it would also be more accurate to use metro MSA rather than just the city population.
Thanks for that. As I mentioned in my blog's comments, I've been looking for such a dataset and wasn't able to track it down. I'll see if I can get hold of these numbers and whether they make a difference.

The BP gang worked up a lot of this sort of thing in Baseball Between The Numbers. Good read.
Thanks also for this, hadn't seen it. I'll check it out.
-JinAZ

5. ## Re: Predicting attendance rates

Originally Posted by JinAZ
Good points, though I disagree. From my perspective, the goal is to construct a model that is good at predicting attendance rates. If splitting the city population in half, or splitting it by the ratio of attendance, results in a poorer fit of the model than just taking city populations as-is, then I'm going to go with the model that does a better job at prediction. Just because it seems reasonable to assume that two ballclubs sharing the same metro area will detract from each other's available market doesn't mean that it is the case.

Which of these two statements reflects what you're trying to do with respect to the relationship between market size and attendance?

1. Use the best set of real-world data and assumptions you can, and discover the correlation. Call it X.

2. Find the set of assumptions that makes X as high as possible.

If it's #1, then you simply have to adjust for two-team markets. It's just not realistic to treat the Mets as if the Yankees don't exist, or the A's as if the Giants don't exist. That doesn't mean the split has to be even, or there can't be overlap, but some adjustment needs to be made.

If your goal is #2, then that's just not a valid approach, IMHO. It smacks of working backwards from the result, which is not good technique, unless you're being paid by someone who wants the result. If your real goal is to discover the correlation, you shouldn't have a stake in the answer.

Anyway, wasn't the whole point of your article to measure fans? There's no reason to pump up measurables like market size and make them a bigger piece of the puzzle than deserved, because their only purpose in that exercise is to determine what percentage of attendance can be explained by known factors, with the remainder attributed to the fan base. Whether that fan-base variable is 25% or 50% or whatever is immaterial -- the math that compares actual to expected results would be done the same.

6. ## Re: Predicting attendance rates

I'll answer with a pair of questions of my own: is it better to let the data have a say in the assumptions one sticks to, or is it better to stick to one's preconceptions and ignore anything that is not consistent with those preconceptions? I side with the former approach, though not blindly.

I don't disagree for a second that there has to be some effect of two teams sharing a city. If two teams are in the same location, it seems very reasonable to assume that attendance would decrease to a degree vs. a situation in which those teams had the city to themselves. However, there may also be positive effects; having two teams may increase awareness of baseball in a market (competition, etc), and thus could enhance overall interest (and attendance) in both teams. Furthermore, it's also possible that a second team would attract fans who were not fans of the first team, and therefore we might not actually see the attendance reduction we'd expect.

It was not clear at all to me how to adjust for this. My approach was to come up with some alternative approaches and test them to see which best explained how attendance actually varied last year. The process of using R^2 (as well as other metrics like MSE, Mallows' Cp, etc.) to select a regression model is a well-established procedure in statistics, and is one that I routinely have seen and used in peer-reviewed publications (I'm a scientist in my day job).
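For readers unfamiliar with the three metrics named above, here are their usual textbook definitions. The notation is an assumption for illustration and is not taken from the blog posts: $n$ is the number of teams, $p$ the number of fitted coefficients including the intercept, $y_i$ and $\hat{y}_i$ the actual and predicted attendance, and $\hat{\sigma}^2$ the MSE of the largest model under consideration.

```latex
\begin{align*}
SS_{\text{res}} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
&
SS_{\text{tot}} &= \sum_{i=1}^{n} (y_i - \bar{y})^2, \\
R^2 &= 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}},
&
\text{MSE} &= \frac{SS_{\text{res}}}{n - p}, \\
C_p &= \frac{SS_{\text{res}}}{\hat{\sigma}^2} - n + 2p .
\end{align*}
```

Higher $R^2$ and lower MSE indicate a closer fit. Mallows' $C_p$ additionally penalizes extra coefficients, so a candidate model with $C_p$ near $p$ is usually preferred.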

In this case, I tried three models: a) the simplest approach, which was to not do an adjustment, b) splitting the city population in half, and c) splitting the city population along the ratio of attendance. I could have tried other approaches (say, reduce overall city population by 75% or something?), but anything else I could come up with seemed very arbitrary. One could go through and iteratively try a ton of different approaches. However, doing so is bad practice because it increases the probability that you'll stumble onto something that, just by chance, works for your dataset. Often that approach does not result in models that are good at predicting future datasets. This is not to say that I'm not open to trying other adjustments -- if you have a specific alternative suggestion, please do let me know and I'll consider giving it a go. I just want to have a reasonable justification for thinking that approach might reflect reality.
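The three-way comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the team names, market labels, and every population and attendance figure are invented, and the helper names `r_squared` and `adjusted_populations` are hypothetical, not anything from the blog posts.

```python
# Hypothetical illustration: every number below is invented, not real data.

def r_squared(x, y):
    """R^2 of a one-predictor least-squares line (the squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def adjusted_populations(teams, scheme):
    """Return one population figure per team under the given adjustment scheme."""
    clubs_in_market = {}
    market_attendance = {}
    for _, market, _, attendance in teams:
        clubs_in_market[market] = clubs_in_market.get(market, 0) + 1
        market_attendance[market] = market_attendance.get(market, 0.0) + attendance

    adjusted = []
    for _, market, population, attendance in teams:
        if scheme == "none":                 # a) leave city population as-is
            adjusted.append(population)
        elif scheme == "even_split":         # b) share it equally among clubs
            adjusted.append(population / clubs_in_market[market])
        elif scheme == "attendance_ratio":   # c) share it in proportion to attendance
            adjusted.append(population * attendance / market_attendance[market])
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return adjusted


# (team, market, city population in millions, season attendance in millions)
teams = [
    ("A", "m1", 8.0, 3.9), ("B", "m1", 8.0, 2.8),
    ("C", "m2", 2.8, 3.1), ("D", "m2", 2.8, 2.3),
    ("E", "m3", 1.5, 2.4), ("F", "m4", 0.6, 2.0),
    ("G", "m5", 0.3, 1.9), ("H", "m6", 0.4, 3.5),
]
attendance = [row[3] for row in teams]

for scheme in ("none", "even_split", "attendance_ratio"):
    fit = r_squared(adjusted_populations(teams, scheme), attendance)
    print(f"{scheme:>16}: R^2 = {fit:.3f}")
```

Keeping the candidate list to a few schemes chosen in advance, as here, matches the concern above about stumbling onto a chance fit by trying many variants.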

I fully expected that option c would best match the data, because it even goes to the extreme of forcing the city population data to reflect the attendance data (not something I was comfortable with). However, to my surprise, the data indicated that, of the three approaches I used, not adjusting city population data resulted in the model that best matched what actually happened. I wouldn't argue that this is the best possible approach--I'm sure it can be improved--it's just that it was the best of the three reasonable approaches I tried. In a way, this should not have been that surprising. It's often the case that a simpler model has explanatory power just as good as, if not better than, a more complex one.

Ultimately, my goal in this work was to find a model that best predicted variation in attendance. My current model uses past team performance (wins from '02 through '05) as well as unadjusted city population size, and it explains 60% of variation in '05 attendance rates (a figure I'd like to improve moving forward; one thing I'll definitely try is to use the population figures you recommended in your earlier post).
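A two-predictor model of this kind can be sketched as an ordinary least-squares fit. The sketch below makes several assumptions that the post does not confirm: the '02-'05 wins are averaged into a single predictor, the data rows are invented, and the function names `fit_ols`, `predict`, and `r_squared` are hypothetical. It also assumes the predictors are not perfectly collinear.

```python
# Hypothetical illustration: the data rows are invented, not real team figures.

def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented system (X'X | X'y).
    A = [[sum(xi[i] * xi[j] for xi in X) for j in range(k)]
         + [sum(xi[i] * yi for xi, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Predicted attendance for one team's predictor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def r_squared(coefs, rows, y):
    """Share of the variation in y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(coefs, r)) ** 2 for r, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# (average wins over four seasons, city population in millions)
rows = [(95, 8.0), (72, 8.0), (88, 2.8), (83, 2.8),
        (76, 1.5), (70, 0.6), (92, 0.3), (65, 0.4)]
attendance = [3.9, 2.8, 3.1, 2.3, 2.4, 2.0, 3.2, 1.7]  # millions, invented

coefs = fit_ols(rows, attendance)
print("coefficients:", [round(c, 4) for c in coefs])
print("R^2:", round(r_squared(coefs, rows, attendance), 3))
```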

After constructing that model, I could then turn my attention to looking at the residuals and seeing which teams have greater or lesser attendance rates than would be predicted. Given that the analysis showed the Yankees, Cubs, Dodgers, and Angels as all having above-expected fan interest, and shows the Mets and White Sox as having below-expected fan interest, it seems to me that my little model is working pretty darn well -- better, in fact, than I'd dared to hope it would at the outset.
-j
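The residual step in the post above amounts to subtracting predicted from actual attendance and sorting. A minimal sketch, assuming the predictions already exist; the team labels and all figures are invented, and `rank_by_residual` is a hypothetical helper name.

```python
# Hypothetical illustration: team labels and figures are invented.

def rank_by_residual(actual, predicted):
    """Sort teams from most above-expected to most below-expected attendance."""
    residuals = {team: actual[team] - predicted[team] for team in actual}
    return sorted(residuals.items(), key=lambda item: item[1], reverse=True)


actual = {"A": 3.5, "B": 2.1, "C": 2.6, "D": 1.8}      # millions per season
predicted = {"A": 2.9, "B": 2.4, "C": 2.6, "D": 2.5}   # from a fitted model

for team, residual in rank_by_residual(actual, predicted):
    print(f"{team}: {residual:+.2f}")
```

A positive residual means the team drew more than the model expected, which is the quantity the thread is using as a proxy for fan interest.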

7. ## Re: Predicting attendance rates

Originally Posted by JinAZ
I'll answer with a pair of questions of my own: is it better to let the data have a say in the assumptions one sticks to, or is it better to stick to one's preconceptions and ignore anything that is not consistent with those preconceptions? I side with the former approach, though not blindly.
Depends on which data you're talking about. It seemed to me -- and I just read what you wrote -- that you saw the initial r-squared of .34 and then decided that any further refinements of the assumptions were "inappropriate" if that number went down. That struck me as losing focus of the goal, which shouldn't have been to maximize the relationship between market size and attendance, but to simply discover what the relationship was. A correlation is meaningless if the data fed into it isn't accurate, and not adjusting for two-team markets seemed like a less than optimal set of data points.

Now, as to how to adjust for two-team markets, yeah, that's tricky. Most attempts of the sort just do a 50-50 split. That's not an unreasonable assumption for New York and Chicago, but less so for L.A. and the Bay Area, where one team is notably better situated with respect to the population center. I've seen one study that even accounts for out-of-town population slices when determining "market size."

Who knows, you might tweak for two-team markets and MSAs and other market-size factors and end up right where you started.

The BP article I cited earlier was even more specific, focusing on attendance revenue rather than simply on tickets sold. But for what it's worth, they determined that the following seven factors were primary in determining a team's attendance revenue, in decreasing order of significance:

2. Market size
3. Honeymoon effect
4. Games won in previous season
5. Playoff appearances in past ten years
6. Games won in current season
7. Per-capita income

But even having accounted for all that, there's still room for what you're figuring -- some teams' fans simply show up better than others.

8. ## Re: Predicting attendance rates

This was a helpful post, and it seems to me that we're not terribly far apart in our thinking. Our difference, of course, lies in which approaches we find most appropriate for handling two-team markets. I'm open to (in fact, searching for) alternative approaches to handling this issue, but I'm not yet convinced that leaving two-team markets unadjusted is a worse approach than splitting them 50/50 or by their attendance ratios. When there are several possible approaches, and none seems particularly better than the others, I think it's very appropriate to let the data be one's guide as to which path to take. I see your point about the danger of bogus correlations when faulty data are used, but I don't agree that this is necessarily what is going on here. There are so many factors that go into defining what a market is, and I see significant flaws with all three approaches I tried out. I'm sure that the ballclubs themselves have very specific ideas as to what their market size is, and those would probably be the best data for this purpose. Unfortunately, I haven't found a source for these data.

I *have* found the census data you mentioned, and will start fiddling with that dataset at some point this weekend (time permitting). I'm enthusiastic about it--it's not perfect, but it gives a much clearer idea of local population sizes than the city population data I had been using. Two-team areas are even more common in this dataset (e.g. Baltimore/DC, SF/Oakland, etc.), which hopefully will help me better determine a way to deal with them. In the end, however, I may not come to an ideal solution; at some point one has to make do with the best one can come up with.

The BP article sounds great, and is something I'm going to have to track down. I'm not surprised to find that others have done the same thing; I like to dork around with numbers, and in all honesty this little project of mine started as I was assembling a fictional league for Out of the Park Baseball (a computer game). Revenue figures would definitely be better than attendance rates, but I'm limited in what data I've been able to find thus far. Attendance was easy to come by, so I started with that. Maybe I can glean some data from the BP article (though unfortunately I haven't found these sorts of groups to be terribly forthcoming with raw data; damn financial interests). I know Forbes releases financial data each year too...

One thing I'll note that's reassuring (to me at least): of the variables I've looked at thus far, my data are generally in agreement with the BP article. Market size seems to be very important (R^2=.27-.34), as do wins the prior year, which explain 40% of attendance variation by themselves. Wins in the current year, to my surprise, had a much weaker relationship to attendance.
-j


RedsZone.com is a privately owned website and is not affiliated with the Cincinnati Reds or Major League Baseball
