Big Data is cited by many prognosticators as a major growth area in computer science over the next decade. While definitions of Big Data abound, the basic idea is that data is now being collected at such a rate and in such volume that traditional ways of saving and analyzing it no longer suffice.
(Neal Ford once told me about a bioengineering project that collects roughly 5 exabytes of data every few days, and as scientists they never want to throw anything away. Their simplest solution for storing it all may be to build a dedicated hard drive factory next door.)
The real issue, though, is not how much data you have. It’s what you do with it. How do you understand your situation when the data is voluminous, but may contain errors, omissions, and other problems?
This is not a new problem, and fortunately we have a great field of examples to use as a model. If you want to understand how to get insight from large amounts of data, look at baseball.
Even a brief summary of the history of sabermetrics (a term coined by Bill James himself, based on SABR — the Society for American Baseball Research) would go well beyond the scope of a simple blog post. If you’ve seen the Moneyball movie or read the book, you’ve got the general idea. Suffice it to say that sabermetrics is an attempt to understand how to win baseball games by focusing on the data, rather than on conventional wisdom.
Some of the interesting observations made by the stats community include ideas that seem blindingly obvious but were ignored for years:
1. Outs are everything, so don’t give them away unnecessarily. Baseball doesn’t have a time limit, and you can’t stall or play a prevent defense. You keep playing until 27 outs are recorded.
2. That means if you steal at less than about a 70% success rate, don’t run. It doesn’t matter how many bases you steal if you’re giving up too many by getting caught.
3. Similarly, bunting is done way, way too often. Moving a runner over from first to second (or sometimes from second to third) isn’t worth giving away an out. Or, as the stat guys say, if you play for one run, that’s likely all you’ll get, and most of the time not even that.
4. Pitcher wins are highly dependent on elements outside the pitcher’s control, like runs scored by his offense or errors made by the defense. Wins are a lousy measure of pitcher skill. Frankly, ERA (earned run average) isn’t that much better, because there are many subjective judgements in there, too (try looking at the difference between earned and unearned runs sometime and the arbitrariness of it all will make your head spin).
5. RBIs are so dependent on where you are in the batting order that they’re a lousy measure of hitting skill. If you are a major leaguer and you bat fourth, fifth, or sixth, you’re going to get a lot of RBIs no matter what you do. That’s why OBP (on base percentage, which measures how often you avoid giving up an out) is far better correlated with wins.
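To make items 2 and 5 a little more concrete, here is a minimal Python sketch. The break-even calculation assumes you have run-expectancy values (the average runs a team scores from a given base/out situation through the end of the inning); the three numbers below are rough, illustrative placeholders I picked for the example, not figures from any particular season, and the function names and the sample stat line are hypothetical as well. The OBP function uses the standard formula: times on base divided by plate appearances that count toward OBP.

```python
def break_even_steal_rate(re_before, re_success, re_caught):
    """Success rate at which a steal attempt neither gains nor loses runs.

    Solves p * gain == (1 - p) * loss for p, where gain and loss are
    measured in expected runs.
    """
    gain = re_success - re_before   # runs added by a successful steal
    loss = re_before - re_caught    # runs given up by getting caught
    return loss / (gain + loss)


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Standard OBP: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hit_by_pitch) / (
        at_bats + walks + hit_by_pitch + sac_flies
    )


# ASSUMED, illustrative run-expectancy values (not real season data):
RE_RUNNER_ON_FIRST_NO_OUTS = 0.86
RE_RUNNER_ON_SECOND_NO_OUTS = 1.10
RE_BASES_EMPTY_ONE_OUT = 0.25

rate = break_even_steal_rate(
    RE_RUNNER_ON_FIRST_NO_OUTS,
    RE_RUNNER_ON_SECOND_NO_OUTS,
    RE_BASES_EMPTY_ONE_OUT,
)
print(f"Break-even steal rate: {rate:.1%}")

# Hypothetical stat line, made up for the example.
obp = on_base_percentage(hits=150, walks=70, hit_by_pitch=5,
                         at_bats=550, sac_flies=5)
print(f"OBP: {obp:.3f}")
```

With those placeholder inputs the break-even rate lands a bit above 70%, which is where the rule of thumb comes from: getting caught costs much more than a successful steal gains, so you need to succeed well over half the time just to break even.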
Incidentally, that was one of the themes of the Moneyball movie. The idea was to get the players to get on base as much as possible, placing a higher than normal value on walks. Since OBP was much more important to wins than RBIs, the Oakland A’s could replace stars that left by filling in their on-base contributions rather than worrying about runs batted in.
As it turned out, they placed too little emphasis on defense. In the movie they moved Scott Hatteberg to first base where he had no defensive skills at all just to keep his bat in the lineup. Nowadays they would care more about the defense they gave up.
Actually, the real theme of Moneyball was to identify players that were undervalued based on traditional metrics and use them to build a successful team cheaply. At the time, OBP was a market inefficiency.
Rather than go on, I want to mention that there’s a reason I’m writing this post today. Tonight is the first game of the World Series between the Detroit Tigers and the San Francisco Giants, and from a sabermetric point of view it has all the elements of a potential classic.
I want to mention a couple of baseball writers and what they’ve said about the upcoming series. First, Jonah Keri (author of The Extra 2%) wrote a great preview article today at Grantland discussing the Giants. Rany Jazayerli (a medical doctor cursed by following the Kansas City Royals) wrote a similar article on the Tigers. Both are well worth reading.
My favorite writer of all, however, is Joe Sheehan. He’s brilliant, and controversial, and fascinating to follow on Twitter. He writes a newsletter that I’m very happy to subscribe to, and his World Series preview came out today. I hope he won’t mind if I quote from it a bit, just to demonstrate the difference between mindlessly quoting statistics and drawing true insight from them.
The Tigers beat you by striking you out and not letting you exploit their poor defense.
He builds up to this by showing how the strikeout rate of the Detroit pitching staff is second in the league, and that they exceeded the league record in total strikeouts (coming in second to the Rays, actually). He also points out that their defense is so bad (they are the worst defense ever to reach the World Series) that the high strikeout numbers are partly padded: because of the bad defense, the pitchers had to face more batters than they should have.
The Giants beat you by putting the ball in play and making you chase it.
The Giants were last in MLB in home runs. They were second in batting average and fourth in OBP. They have a park that suppresses homers and they play accordingly. On the other hand, they have a very low strikeout rate. By the stat called equivalent average, they were the third best offense in baseball despite never hitting home runs.
As it turned out, the Tigers — after winning a weak division — caught two postseason opponents ill-equipped to take advantage of their poor defense.
The A’s struck out a lot and the Yankees were almost entirely driven by homers. The Giants will be completely opposite.
There’s plenty more, but this is the flavor of the observations. Note that none of the articles I mentioned are full of statistics. There are a few choice stats presented with the goal of making a clear argument. That’s insight.
By the way, most organizations in baseball know this. The conflict between the scouts and the stats guys demonstrated in Moneyball is largely over (with the great exception being the Kansas City Royals, who not coincidentally keep coming in last). There’s one major group who still doesn’t “get it”, though, and that’s the media.
I cannot fathom why the networks continually put “analysts” in the broadcast booth who are completely unaware of the last twenty years of baseball research. In fact, they often disdain anything learned from studying the game as tricks with statistics. The greatest irony (as mentioned by Joe Sheehan many times) is that no stats guy is anywhere near as wedded to a particular metric as the so-called “traditionalists” are to RBIs or pitcher wins. If you watch the game broadcasts, you’ll see meaningless statistic after meaningless statistic based on small sample size (batter A is 2 for 7 against pitcher B) paraded out as though it meant anything, completely ignoring what’s now viewed as truly important.
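If you want to see why "2 for 7" tells you almost nothing, here is a rough Python sketch that puts a 95% Wilson score interval around a batting average. It treats each at-bat as an independent coin flip with a fixed chance of a hit, which is a simplifying assumption on my part, and the function name and the 160-for-560 season line are made up for comparison.

```python
import math


def wilson_interval(hits, at_bats, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%).

    Assumes independent at-bats with a constant hit probability,
    which is a simplification of real baseball.
    """
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    p = hits / at_bats
    z2 = z * z
    denom = 1 + z2 / at_bats
    center = (p + z2 / (2 * at_bats)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / at_bats + z2 / (4 * at_bats ** 2)) / denom
    )
    return center - half_width, center + half_width


# The broadcast-booth stat: batter A is 2 for 7 against pitcher B.
small_low, small_high = wilson_interval(2, 7)

# Hypothetical full season with the same raw average (160 for 560).
season_low, season_high = wilson_interval(160, 560)

print(f"2 for 7:     {small_low:.3f} to {small_high:.3f}")
print(f"160 for 560: {season_low:.3f} to {season_high:.3f}")
```

Both lines have the same raw average, but the interval for seven at-bats runs from well under .100 to over .600, so it can't distinguish a terrible matchup from a historically great one. The season-long interval is only a few dozen points wide on either side.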
For example, you’re guaranteed tonight to hear tons about the Giants’ “momentum” (which demonstrably doesn’t exist) or the effects of the Tigers’ long layoff since winning the ALCS, neither of which matters at all. Instead, the real story is whether the Giants can put enough balls in play to take advantage of the Tigers’ poor defense, or whether the Tigers’ great strikeout pitchers can keep them from doing it. It’s also whether the Tigers’ lineup, which is dominated by a few stars, will be able to beat the Giants’ pitching, which has relatively weak starters but lots of bullpen depth, and the excellent San Francisco defense.
I have no vested interest in either team. Frankly, if it wasn’t for what I learn by reading those writers and others like them, I probably wouldn’t care at all. Now I’m excited to watch the interplay of two diametrically contrasting styles. Even better, I’m looking forward to the snarky comments by the great writers I follow on Twitter during the games (nobody does snark like a good baseball writer).
Do I have a prediction? Please. I’m just happy the Yankees got swept. That almost, but not quite, made up for the disaster that was the Red Sox season. Still, one of the Red Sox owners claims part of the team’s problem was that they didn’t listen to Bill James enough, so that gives me hope for next year.
Also, I’m really looking forward to the articles written after each game, which will be beautiful demonstrations of how to make decisions based on insight rather than just quoting statistics as though they were significant in themselves. I’ll just have the mute button ready whenever Tim McCarver starts talking.
Joe Sheehan’s Newsletter is sent via email. An annual subscription is available at http://joesheehanbaseball.blogspot.com.