In my last post, I stated that extrapolation is how we fill in gaps in the data created by the Curse of Dimensionality. It is a blending of observation and theory, a tradeoff between accuracy and simplicity. As much art as it is science, its partial reliance on subjective judgement can make it susceptible to manipulation. That is why Mark Twain popularized the saying (which he attributed to Disraeli), “There are three kinds of lies: lies, damned lies and statistics.” Four, if you count economic impact studies.
It is, therefore, with great care and transparency that one should approach extrapolation. Economists, in particular, are guilty of creating arcane statistical models that no one outside our profession can understand, much less believe.
This is the reason why medical journals tend to have a low tolerance for studies that rely on complex statistical modeling. The more complex the model, the more difficult it is to be sure one is drawing the correct inference. Unlike economics, medical research is an experimental science. A well-designed experiment need not rely on a confusing array of equations.
For statistical analyses, the medical profession has, in effect, adopted the KISS principle: keep it simple, stupid. In keeping with this principle, the Lone Economist designed the 6-D baseball statistics using simple ratios, like runs per plate appearance, and a relatively simple rule for extrapolation.
I divided the observed data into six dimensions, which results in 288 separate game locations. But every baseball fan knows there are many other bits of information by which to predict outcomes. There is the identity of the batter, the pitcher, and the stadium. There are even individual characteristics of the player that are useful predictors, such as handedness (i.e., which side of the plate the batter swings his bat from) and age. Some players might be better at night, some during daylight. The list is quite lengthy.
The Curse of Dimensionality prevents the data from being subdivided into all possible combinations of predictors, so extrapolation must be used for the non-dimensional factors. I call them effect modifiers.
The objective of 6-D Baseball is to predict what happens next during a live game. I want to make it easy for the casual spectator to know when a team is most likely to score without resorting to a hand calculator. The first step is to look at the separate effect of the current batter.
In an earlier post, I introduced the Individual Run Production (IRP) statistic. This statistic provides a value by which we can assess the scoring potential during a plate appearance for each of the 288 game positions. Although it can be used to compare batters across time, its main purpose is to measure the likelihood of scoring in a live game.
Every batter has two objectives. The first is to drive in runs during his plate appearance. The second is to set up the situation for the next batter. Consequently, the individual run production statistic (IRP) is the sum of two components: the runs scored during the plate appearance (RBI) and the value of the change in the game location.
The average or expected number of runs scored until the end of the half-inning is a function of the IRPs of the current and subsequent batters. In equation form it looks like this:
I’ll explain what these equations mean, one by one.
L represents “location” and refers to the count of balls and strikes, the disposition of each base and the number of outs. Specific values are six digits long, in the following order: outs, 3rd, 2nd, 1st, balls, and strikes. For example, at the start of each half-inning, there are no outs, the bases are empty and the count is 0 balls and 0 strikes. Therefore the value of L would be 000000. If there were two outs, a man on 1st and a count of three balls and two strikes, the value of L would be 200132.
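The six-digit location code can be sketched in a few lines of Python. This is only an illustration of the encoding described above, not the code used to build the 6-D statistics; the names `encode`, `decode`, `FIELDS`, `SIZES` and `ALL_LOCATIONS` are my own for this example. It also confirms that the six dimensions multiply out to 288 locations.

```python
from itertools import product

# Digit order follows the post: outs, 3rd, 2nd, 1st, balls, strikes.
FIELDS = ("outs", "third", "second", "first", "balls", "strikes")
# Number of possible values per digit:
# 0-2 outs, each base empty (0) or occupied (1), 0-3 balls, 0-2 strikes.
SIZES = (3, 2, 2, 2, 4, 3)


def encode(outs, third, second, first, balls, strikes):
    """Build the six-digit location code, e.g. '200132'."""
    values = (outs, third, second, first, balls, strikes)
    for name, value, size in zip(FIELDS, values, SIZES):
        if not 0 <= value < size:
            raise ValueError(f"{name}={value} is out of range")
    return "".join(str(v) for v in values)


def decode(code):
    """Split a six-digit location code back into its named parts."""
    if len(code) != 6 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"not a six-digit location code: {code!r}")
    values = [int(ch) for ch in code]
    for name, value, size in zip(FIELDS, values, SIZES):
        if value >= size:
            raise ValueError(f"{name}={value} is out of range")
    return dict(zip(FIELDS, values))


# Every possible game location, in code order.
ALL_LOCATIONS = [
    "".join(str(v) for v in combo)
    for combo in product(*(range(n) for n in SIZES))
]

if __name__ == "__main__":
    print(len(ALL_LOCATIONS))        # 3 * 2 * 2 * 2 * 4 * 3 locations
    print(encode(2, 0, 0, 1, 3, 2))  # two outs, man on 1st, full count
    print(decode("211130"))
```

Keeping the code as a string rather than an integer preserves the leading zeros, so the start of a half-inning stays 000000.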
Hb(L) is the average number of runs scored by the end of the half-inning when the game location is L. He(L) is the average number of runs scored after the current plate appearance.
Equations 1 and 2 say that Hb(L) is simply the average number of runs scored during the current PA, R(L), plus He(L), all the runs scored after the current PA.
Equation 3 says that the average number of runs when the current batter is j and the location is L is, in part, Hb(L) times s(L), the ratio of R(L) (the average runs scored during the current PA regardless of the batter) to Hb(L). This ratio is then divided by A(s), the overall average value of s across all locations. Since there are 288 values of L, there are 288 values of s(L). They range from 116.3% for L = 211130 to 3.1% for L = 000002. The overall average, A(s), is 35.8%.
The last component of Equation 3, sj, is specific to batter j. It is the batter’s average ratio of runs scored during the PA to Hb(L), which corresponds to the first part of the IRP statistic (the runs driven in during the plate appearance).
Equation 4 determines the average number of runs scored after the PA, the situation the current batter sets up for the next batter. d(L) is the average ratio of He(L) to Hb(L) at location L. A(d) is the average value of d(L) over all values of L and dj is the average ratio for player j. d(L) ranges from 160.2% for L = 200030 to 12.5% for L = 211102. The value of A(d) is 64.9%.
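To make the verbal description concrete, here is a minimal Python sketch of how Equations 1 through 4 read to me when taken literally. The multiplicative form, the function names, and the example inputs (an Hb(L) of 0.5 with s(L) = 30% and d(L) = 70%) are assumptions made for illustration; only A(s) = 35.8% and A(d) = 64.9% come from the figures quoted above.

```python
# Overall averages quoted in the post.
A_S = 0.358  # A(s): average ratio of runs during the PA to Hb(L)
A_D = 0.649  # A(d): average ratio of runs after the PA to Hb(L)


def runs_during_pa(hb, s_loc, s_batter, a_s=A_S):
    """Assumed reading of Equation 3: Hb(L) * s(L) / A(s) * s_j."""
    return hb * (s_loc / a_s) * s_batter


def runs_after_pa(hb, d_loc, d_batter, a_d=A_D):
    """Assumed reading of Equation 4: Hb(L) * d(L) / A(d) * d_j."""
    return hb * (d_loc / a_d) * d_batter


def expected_runs(hb, s_loc, d_loc, s_batter, d_batter):
    """Runs during the PA plus runs after it, by analogy with Equations 1 and 2."""
    return (runs_during_pa(hb, s_loc, s_batter)
            + runs_after_pa(hb, d_loc, d_batter))


if __name__ == "__main__":
    # Hypothetical location values, chosen so that s(L) + d(L) = 1.
    hb, s_loc, d_loc = 0.5, 0.30, 0.70

    # A batter whose ratios equal the overall averages leaves the
    # location's own expectation unchanged (close to hb).
    print(expected_runs(hb, s_loc, d_loc, A_S, A_D))

    # A hypothetical batter who drives in runs at a higher rate than average.
    print(expected_runs(hb, s_loc, d_loc, 0.45, A_D))
```

Under this reading, dividing by A(s) and A(d) is what turns the batter's ratios into adjustments relative to the average: an average batter multiplies the location's value by one, a better batter by more than one.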
Equations 5 through 8 show that the expected number of runs scored is a function of the current location, L, and the components of the IRPs for the current and subsequent batters.
Next time I’ll provide some illustrative examples using the IRPs of active batters.