Napoleon once said, “An army marches on its stomach.” Or was that Frederick the Great? I get my 18th-century warmongers mixed up. No matter, his meaning was that the quantity and quality of food for one’s troops are of critical importance when fighting a war.
In the case of the Lone Economist, the quantity and quality of data (preferably free and publicly available, since not every masked vigilante can be as loaded as Bruce Wayne) are the vital fuel for crusading against bias, confounding and all manner of other inappropriate statistical inference. And don’t even get me started about post-randomization subgroup analyses!
Fortunately, baseball (and for that matter healthcare) generates a huge volume of free data. I have data, courtesy of Retrosheet, covering 173,947 MLB games that date back to 1918. These games include 13,561,443 plate appearances by 13,196 different batters. I even have data on every pitch thrown since 1988, all 21,734,609 of them.
And yet, it isn’t enough.
Every masked vigilante has his or her nemesis, and the Lone Economist is no exception. Mine is the Curse of Dimensionality. My fists clench at the mere thought of it. Curse you, Curse of Dimensionality!
Can one curse a curse?
What am I saying? I’m a masked vigilante. I can do anything.
This curse refers to a data scarcity problem encountered often in statistical analysis. No matter how much data one possesses, slicing it along even a small number of dimensions will quickly exhaust the supply, because the number of cells to fill is the product of the number of values in each dimension.
For example, I have identified only six dimensions of baseball games: outs, the occupancy of each of the three bases, and called balls and strikes. Since each of these dimensions has but a few discrete values (three for outs, two for each base, four for balls and three for strikes), there are just 3 × 2 × 2 × 2 × 4 × 3 = 288 possible “locations” a ballgame can find itself in.
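For anyone who wants to check the 288, here is a minimal Python sketch that enumerates every location. The variable names are mine and purely illustrative, and the value ranges (0 to 2 outs, each base empty or occupied, 0 to 3 balls, 0 to 2 strikes) are the usual assumptions about the state of play before a pitch.

```python
from itertools import product

# Assumed value ranges for each of the six dimensions.
outs = range(3)                # 0, 1 or 2 outs
first_base = (False, True)     # empty or occupied
second_base = (False, True)
third_base = (False, True)
balls = range(4)               # 0 to 3 balls
strikes = range(3)             # 0 to 2 strikes

# Every combination of the six dimensions is one game "location".
locations = list(product(outs, first_base, second_base, third_base, balls, strikes))

print(len(locations))  # 288
```

Each extra dimension multiplies the count rather than adding to it, which is exactly where the trouble starts.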
Whenever the count is three balls and fewer than two strikes and the bases are loaded, the pitcher is at a distinct disadvantage. He can’t afford to throw another pitch outside of the strike zone. The batter knows this and can count on the next pitch being where it can be hit hard. It’s known as a cripple pitch. The worst cripple pitch from the pitcher’s perspective is when there are no strikes and no outs. Put these two situations together and we get the Ultimate Cripple Pitch (UCP).
For the average batter facing the average pitcher, the number of runs scored from that point until the end of the half-inning averages 2.88, higher than at any other game location. So, if the batter were, say, Mike Trout, the best hitter in the major leagues today and arguably the sixth best hitter over the last hundred years, the average number of runs scored from that location would be even higher. Right? I imagine this nightmare scenario has caused many pitchers to wake up screaming in the night.
There’s only one problem with this assessment. Mike Trout has never faced a UCP and he probably never will. Out of over 21 million pitches thrown since 1988, only 781 of them have been UCPs. That’s less than four thousandths of one percent.
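The arithmetic behind that percentage is easy to check. This is only a sketch using the two counts quoted above; the variable names are illustrative.

```python
# Counts quoted in the post.
total_pitches = 21_734_609   # every pitch thrown since 1988
ucp_pitches = 781            # pitches thrown in the UCP location

# Share of all pitches that were UCPs, expressed as a percentage.
share_percent = 100 * ucp_pitches / total_pitches

print(f"{share_percent:.4f}%")  # 0.0036%
```

For comparison, if pitches were spread evenly over the 288 locations, each one would hold roughly 75,000 of them. The data are anything but evenly spread.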
If I added just one more dimension to my list of six, e.g. the identity of the batter, a huge number of gaps would appear in the data. Are we then to conclude that if Mike Trout ever finds himself in a UCP situation, it is completely unknown what is likely to happen next? Is he just as likely to strike out as the average batter?
Of course not. Just because it has never happened before doesn’t mean we know nothing about what is likely to happen. We know how well lesser hitters do in that situation and we know how well Mike Trout does in situations less advantageous to the batter. It doesn’t take a crystal ball to conclude that the average pitcher would be in extremely deep doodoo.
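To put a rough number on those gaps, here is a back-of-the-envelope sketch of what adding the batter as a seventh dimension does to the cell count. It reuses the figures quoted earlier, with one loud caveat: the 13,196 batters cover every game back to 1918, while the pitch data only start in 1988, so the pitches-per-cell figure is an illustration rather than a real estimate.

```python
# Figures quoted earlier in the post.
game_locations = 288          # outs x bases x balls x strikes
batters = 13_196              # all batters since 1918 (assumption: used as a stand-in)
total_pitches = 21_734_609    # pitches since 1988

# Adding batter identity multiplies the number of cells to fill.
cells = game_locations * batters

# Average pitches per cell if they were spread perfectly evenly (they are not).
pitches_per_cell = total_pitches / cells

print(cells)  # 3800448
print(round(pitches_per_cell, 1))
```

Even under the generous assumption of an even spread, each cell would hold only a handful of pitches, and in practice most of the rare ones, such as Mike Trout facing a UCP, hold none at all.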
Well, Curse of Dimensionality, the Lone Economist has a silver bullet with your name on it and it’s called “extrapolation”.
Although, if the bullet is called “extrapolation”, shouldn’t that be written on it? These mythology idioms can be so confusing.
I’ll explain how extrapolation works in my next post.