Getting Insight from Data: Baseball

Big Data is cited by many prognosticators as a major growth area in computer science over the next decade. While definitions of Big Data abound, the basic idea is that data is now being collected at such a rate and in such volume that traditional ways of saving and analyzing it no longer suffice.

(Neal Ford once told me about a bioengineering project that collects roughly 5 exabytes of data every few days, and as scientists they never want to throw anything away. Their simplest solution for storing it all may be to build a dedicated hard drive factory next door.)

The real issue, though, is not how much data you have. It’s what you do with it. How do you understand your situation when the data is voluminous, but may contain errors, omissions, and other problems?

This is not a new problem, and fortunately we have a great field of examples to use as a model. If you want to understand how to get insight from large amounts of data, look at baseball.

Even a brief summary of the history of sabermetrics (a term coined by Bill James himself, based on SABR — the Society for American Baseball Research) would go well beyond the scope of a simple blog post. If you’ve seen the Moneyball movie or read the book, you’ve got the general idea. Suffice it to say that sabermetrics is an attempt to understand how to win baseball games by focusing on the data, rather than on conventional wisdom.

Some of the interesting observations made by the stats community include ideas that seem blindingly obvious but were ignored for years:

1. Outs are everything, so don’t give them away unnecessarily. Baseball doesn’t have a time limit, and you can’t stall or play a prevent defense. You keep playing until 27 outs are recorded.

2. That means if you steal at less than about a 70% success rate, don’t run. It doesn’t matter how many bases you steal if you’re giving up too many by getting caught.

3. Similarly, bunting is done way, way too often. Moving a runner over from first to second (or sometimes from second to third) isn’t worth giving away an out. Or, as the stat guys say, if you play for one run, that’s likely all you’ll get, and most of the time not even that.

4. Pitcher wins are highly dependent on elements outside the pitcher’s control, like runs scored by his offense or errors made by the defense. Wins are a lousy measure of pitcher skill. Frankly, ERA (earned run average) isn’t that much better, because there are many subjective judgements in there, too (try looking at the difference between earned and unearned runs sometime and the arbitrariness of it all will make your head spin).

5. RBIs are so dependent on where you are in the batting order that they’re a lousy measure of hitting skill. If you are a major leaguer and you bat fourth, fifth, or sixth, you’re going to get a lot of RBIs no matter what you do. That’s why OBP (on base percentage, which measures how often you avoid giving up an out) is far better correlated with wins.
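The 70% figure in point 2 falls out of simple run-expectancy arithmetic. Here is a minimal Groovy sketch of it. The three run-expectancy values are illustrative round numbers I'm assuming for the example, not figures taken from any of the writers mentioned here, and real values shift with the season and the run environment.

```groovy
// Illustrative run expectancies (assumed values, not from the post):
// average runs scored in the rest of the inning from each situation.
def runnerOnFirstNoOuts  = 0.86
def runnerOnSecondNoOuts = 1.10
def basesEmptyOneOut     = 0.26

def gainIfSafe   = runnerOnSecondNoOuts - runnerOnFirstNoOuts   // reward for a successful steal
def lossIfCaught = runnerOnFirstNoOuts - basesEmptyOneOut       // cost of losing the runner and an out

// The break-even success rate p solves: p * gain = (1 - p) * loss
def breakEven = lossIfCaught / (gainIfSafe + lossIfCaught)

printf('Break-even steal rate: %.1f%%%n', breakEven * 100)
```

With those assumed inputs the break-even rate comes out a little over 71%, which is where the "about 70%" rule of thumb comes from. Getting caught costs far more than a successful steal gains.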

Incidentally, that was one of the themes of the Moneyball movie. The idea was to get the players to get on base as much as possible, placing a higher than normal value on walks. Since OBP was much more important to wins than RBIs, the Oakland A’s could replace stars that left by filling in their on-base contributions rather than worrying about runs batted in.
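To see why walks matter so much, it helps to put the standard OBP formula next to batting average. The sketch below does that in Groovy. The two stat lines are made-up hitters I'm inventing for illustration, not real players.

```groovy
// On-base percentage: times on base divided by the plate appearances that count.
def obp(int hits, int walks, int hitByPitch, int atBats, int sacFlies) {
    (hits + walks + hitByPitch) / (atBats + walks + hitByPitch + sacFlies)
}

// Batting average ignores walks entirely.
def battingAverage(int hits, int atBats) {
    hits / atBats
}

// Two hypothetical hitters with identical batting averages; all numbers are made up.
def freeSwinger   = [hits: 150, walks: 20, hbp: 2, atBats: 550, sacFlies: 5]
def patientHitter = [hits: 150, walks: 90, hbp: 2, atBats: 550, sacFlies: 5]

[freeSwinger: freeSwinger, patientHitter: patientHitter].each { name, s ->
    printf('%-14s AVG %.3f  OBP %.3f%n', name,
           battingAverage(s.hits, s.atBats),
           obp(s.hits, s.walks, s.hbp, s.atBats, s.sacFlies))
}
```

Both hitters show the same batting average, but the patient one reaches base far more often, and traditional metrics never saw the difference.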

As it turned out, they placed too little emphasis on defense. In the movie they moved Scott Hatteberg to first base where he had no defensive skills at all just to keep his bat in the lineup. Nowadays they would care more about the defense they gave up.

Actually, the real theme of Moneyball was to identify players that were undervalued based on traditional metrics and use them to build a successful team cheaply. At the time, OBP was a market inefficiency.

Rather than go on, I want to mention that there’s a reason I’m writing this post today. Tonight is the first game of the World Series between the Detroit Tigers and the San Francisco Giants, and from a sabermetric point of view it has all the elements of a potential classic.

I want to mention a couple of baseball writers and what they’ve said about the upcoming series. First, Jonah Keri (author of The Extra 2%) wrote a great preview article today at Grantland discussing the Giants. Rany Jazayerli (a medical doctor cursed by following the Kansas City Royals) wrote a similar article on the Tigers. Both are well worth reading.

My favorite writer of all, however, is Joe Sheehan. He’s brilliant, and controversial, and fascinating to follow on Twitter. He writes a newsletter that I’m very happy to subscribe to, and his World Series preview came out today. I hope he won’t mind if I quote from it a bit, just to demonstrate the difference between mindlessly quoting statistics and drawing true insight from them.

The Tigers beat you by striking you out and not letting you exploit their poor defense.

He builds up to this by showing how the strike-out rate of the Detroit pitching staff is second in the league, and that they exceeded the league record in total strikeouts (coming in second to the Rays, actually). He also points out that their defense is so bad (they are the worst defense ever to reach the World Series) that the high strike out numbers are partly padded by having to face more batters than they should have had to, given the bad defense.

The Giants beat you by putting the ball in play and making you chase it.

The Giants were last in MLB in home runs. They were second in batting average and fourth in OBP. They have a park that suppresses homers and they play accordingly. On the other hand, they have a very low strikeout rate. By the stat called equivalent average, they were the third best offense in baseball despite hitting so few home runs.

As it turned out, the Tigers — after winning a weak division — caught two postseason opponents ill-equipped to take advantage of their poor defense.

The A’s struck out a lot and the Yankees were almost entirely driven by homers. The Giants will be completely opposite.

There’s plenty more, but this is the flavor of the observations. Note that none of the articles I mentioned are full of statistics. There are a few choice stats, each presented with the goal of making a clear argument. That’s insight.

By the way, most organizations in baseball know this. The conflict between the scouts and the stats guys demonstrated in Moneyball is largely over (with the great exception being the Kansas City Royals, who not coincidentally keep coming in last). There’s one major group who still doesn’t “get it”, though, and that’s the media.

I cannot fathom why the networks continually put “analysts” in the broadcast booth who are completely unaware of the last twenty years of baseball research. In fact, they often disdain anything learned from studying the game as tricks with statistics. The greatest irony (as mentioned by Joe Sheehan many times) is that no stats guy is anywhere near as wedded to a particular metric as the so-called “traditionalists” are to RBIs or pitcher wins. If you watch the game broadcasts, you’ll see meaningless statistic after meaningless statistic based on small sample size (batter A is 2 for 7 against pitcher B) paraded out as though it meant anything, completely ignoring what’s now viewed as truly important.
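To put a number on how little “2 for 7” tells you, here is a rough Groovy sketch that computes the standard error of an observed batting average. It assumes a simple binomial model (every at-bat independent, with the same chance of a hit), which is my simplification rather than anything the writers above use, and the 160-for-560 season line is a made-up comparison.

```groovy
// Standard error of an observed batting average under a simple binomial model.
def standardError(int hits, int atBats) {
    double p = hits / atBats
    Math.sqrt(p * (1 - p) / atBats)
}

// Same observed average, wildly different amounts of evidence.
printf('2 for 7:     AVG %.3f, standard error %.3f%n', 2 / 7, standardError(2, 7))
printf('160 for 560: AVG %.3f, standard error %.3f%n', 160 / 560, standardError(160, 560))
```

The uncertainty on the seven at-bat sample is roughly 0.17, against roughly 0.02 for the full season, so the batter-versus-pitcher line is compatible with almost any true skill level.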

For example, you’re guaranteed tonight to hear tons about the Giants’ “momentum” (which demonstrably doesn’t exist) or the effects of the Tigers’ long layoff since winning the ALCS, neither of which matters at all. Instead, the real story is whether the Giants can put enough balls in play to take advantage of the Tigers’ poor defense, or whether the Tigers’ great strikeout pitchers can keep them from doing it. It’s also whether the Tigers’ lineup, which is dominated by a few stars, will be able to beat the Giants’ pitching, which has relatively weak starters but lots of bullpen depth, and the excellent San Francisco defense.

I have no vested interest in either team. Frankly, if it wasn’t for what I learn by reading those writers and others like them, I probably wouldn’t care at all. Now I’m excited to watch the interplay of two diametrically contrasting styles. Even better, I’m looking forward to the snarky comments by the great writers I follow on Twitter during the games (nobody does snark like a good baseball writer).

Do I have a prediction? Please. I’m just happy the Yankees got swept. That almost, but not quite, made up for the disaster that was the Red Sox season. Still, one of the Red Sox owners claims part of the team’s problem was that they didn’t listen to Bill James enough, so that gives me hope for next year.

Also, I’m really looking forward to the articles written after each game, which will be beautiful demonstrations of how to make decisions based on insight rather than just quoting statistics as though they were significant in themselves. I’ll just have the mute button ready whenever Tim McCarver starts talking.

—-

Joe Sheehan’s Newsletter is sent via email. An annual subscription is available at http://joesheehanbaseball.blogspot.com.

Twitter handles:
Joe Sheehan (@joe_sheehan)
Rany Jazayerli (@jazayerli)
Jonah Keri (@jonahkeri)
Keith Law (@keithlaw)
Joe Posnanski (@JPosnanski)

Never miss a ballgame

As Tim Kurkjian famously said, “Never miss the opportunity to go to a baseball game.  You might see something you’ve never seen before.”

This week I’m in Asheville, NC.  I’m very busy with my Securing Java Web Applications class while other issues keep coming up, but the bottom line is that the Asheville Tourists (the class A affiliate of the Colorado Rockies) are nearby and are in town.  I was debating whether to go or not when I spoke to my wife on the phone.  As usual, she encouraged me to go.  She claims I’m always in a better mood after I’ve attended a ball game, so who can blame her?

Even better, minor league baseball team names in North Carolina are great.  I really liked the Lehigh Valley Iron Pigs when I was in Allentown a couple weeks ago, but NC has great names in abundance.  You’ve got the Greensboro Grasshoppers, the Winston-Salem Warthogs, the Kannapolis Intimidators, the Carolina Mudcats, and even tonight’s opponent, the Hickory Crawdads.  That doesn’t even mention the classic Durham Bulls.  But honestly, how can you not go to a game between the Tourists (who have had that name since 1914!) and the Crawdads?  It’s just not possible.

So I did my usual practice, which is to show up at the box office about a half hour before game time, told them I needed only one ticket and asked for the best available seat in the house.  In Asheville, that turned out to be a special “Home Deck Suite” right behind the on-deck circle (probability of a foul ball: zero), which cost a fortune ($45, an insane amount for a minor league game) but included all you can eat on the menu, delivered for seven innings by a helpful staff person.

That’s right — all you can eat.  The guy kept coming back asking if I wanted more, and I kept doing massive rationalizations justifying horrible overeating in order to consume enough to make the ticket worthwhile.  Let’s say that I think I managed to do so (er, hot dogs, popcorn, cheese nachos, a giant pretzel, and an endless supply of sodas, but I showed some restraint — no crackerjacks, though I was tempted), which I’m already regretting and surely will regret more tomorrow.  I even got lucky and sat next to a charming couple who were in town on business and had tons of minor league baseball stories to tell.  The guy next to me also reminded me that the manager of the Tourists is good old Joe Mikulik, the immortal star of this classic YouTube video featuring a managerial meltdown that is topped only by this one by Phil Wellman, and I saw Earl Weaver in his prime.

As for the game, the Crawdads won 7 – 1, but I definitely saw some things I’d never seen before:

  • Hickory’s Bobby Spain went 4 for 5 with a home run, but he was outdone by his teammate Andrew Walker, who went 4 for 5 with two home runs.  They even went back-to-back in the top of the 2nd inning.  Is it too obscure a reference to think their slogan should be Walker and Spain and Pray for Rain?
  • Hickory’s Harrison Bishop and Tom Boleska combined to strike out six batters in a row from the bottom of the sixth to the bottom of the eighth.  I was surprised when they took out Bishop after striking out four in a row, but then Boleska came in and struck out two more before the next guy grounded out weakly to second.
  • The two teams combined for a total of seven (!) errors (Hickory made 4 and still won), which is more than I’ve seen in some Little League games.
  • The catcher’s name on the Crawdads is Lars Davis.  Yes, he’s the catcher.  Don’t they therefore, by law, HAVE to call him Crash?
  • The guy who sang the National Anthem was an excellent operatic singer.  Every anthem singer in Connecticut thinks they have to sing with a country twang or like they have vocal diarrhea (see Aguilera, Christina, or lament the sad, pathetic American Idolization of singing), but here I am in North Carolina and I get a trained voice with a fine instrument.  Go figure.

The weather was great, the crowd was small (2872) but enthusiastic.  Asheville is the champion of the first half of the season of the Northern Division of the South Atlantic league (an odd but interesting achievement), so on the way out they were giving away general admission tickets to any future game.

That means I have a free ticket to the game tomorrow, even if it’s not for a very good seat and I still have work to do.  Still, you should never miss going to a ballgame…

Minor league baseball rocks

I’ve been traveling a lot lately.  Fortunately, this is baseball season, so sometimes I get a chance to visit a park I’ve never been to before.

Last week I was in Allentown, PA.  Actually, that’s not quite true: I was in Schnecksville, PA, a small suburb of Allentown.  It turns out that this year Allentown has a new baseball team.  The Lehigh Valley Iron Pigs are playing their inaugural season as the AAA affiliate of the Philadelphia Phillies.

(Two years ago, as part of an extended weekend road trip, my son Xander and I did the Phillies circuit.  We got tickets to see the Phillies at Citizens Bank Park (a huge improvement over the old Veterans Stadium, but, then again, almost anything would be), then we saw the Reading Phillies (their AA affiliate), and finally swung around to see the Scranton/Wilkes-Barre Red Barons, who at the time were the Phillies’ AAA team.  Now that Scranton is the AAA team for the Yankees, we won’t be going back any time soon (I’d link to their web site, but hey, if you’re a Yankee fan, go find it yourself).  As a final aside, we were hoping to do a similar circuit for the Red Sox (Red Sox at Fenway, Portland Sea Dogs, Lowell Spinners), but couldn’t get tickets to any of them.  That’s right: the Single A Lowell Spinners were sold out, too.  Baseball is king in New England.)

I was teaching a private class last week, and the client was a major sponsor of the Iron Pigs.  That meant I was able to join a group of people in a good balcony section of Coca-Cola Park (an awful name, but there it is).  The whole pig theme was obvious, from the kids hanging out on the freshly mowed lawn in left-center, which was called Pigs on a Blanket, to the Pig Pen in right-center field.  Their program was even called Pork Illustrated.

We had a lot of fun, even though the Iron Pigs lost 5-4.  Still, the park was charming, we had excellent weather, and the people were friendly.  (Mostly — I did have an extended baseball discussion with a long suffering Cleveland Indians fan who hates all things Boston, which is probably understandable under the circumstances. ;))

This week I’m in Austin, TX.  Last night I drove out to Round Rock and got to see the Round Rock Express, the AAA affiliate of the Houston Astros.  Yesterday the temperature peaked at 102, but there was a warm breeze and it cooled off a bit as the sun went down.  The stadium wasn’t terribly full, but the people who were there were quite enthusiastic.  The Express even won 3-0 and hit two home runs.  Other than taking forever to find my rental car in the parking lot (a sign of traveling too much is that you forget what your rental car looks like), I had a great time.

I’ve now added baseball caps from the Iron Pigs and the Express to my collection.  I used to get T-shirts everywhere for my son Xander, but he told me he doesn’t want them any more.  Now that he’s 16, all he wears are T-shirts with various rock bands on them.  So be it.  Be sure, though, to check out his band’s excellent studio recording of their song “Don’t Tell Me” at their MySpace page.

Next week I’ll be in Asheville, NC, and it looks like the Asheville Tourists (the class A affiliate of the Colorado Rockies) will be in town.  Maybe I’ll be able to buy another hat. 🙂

MLB playoffs from a TV Networks perspective

When the playoffs began, there was a chance that the championship series would have involved teams from [Note: TV market size in square brackets]

It turned out that the cities actually involved are

Snicker. I imagine the executives at Fox are not the happiest people in the world right now. Since they still plan to inflict Joe Buck and Tim McCarver on a helpless baseball public, I’m glad they’re suffering.

(Note that I’m carefully not gloating about the Yankees’ loss last night. I know that pain. I’m glad the Yankees are gone, but I know how much it hurts to see a team you live and die for over the long months of a baseball season fall apart in a short series.)

(Although I must say that I will never — NEVER — forgive Johnny Damon.)

Baseball playoffs start (yay!)

I know, I know.  The Rockies – Padres game wasn’t technically in the playoffs.  The stats counted as regular season stats, which meant the batting title was still at risk and Jake Peavy had a chance to win his 20th game (which, of course, didn’t happen).

But still, that was some game.  Some quick observations:

  • I haven’t seen outfield play that bad in years, and I regularly attend AA minor league games.  Whew, that was cover-your-eyes awful.  Coco Crisp would have made every one of those catches with ease.
  • Despite the above, apparently the official scorers have forgotten how to put a check mark in the errors column.  Every bungled outfield play but one was listed as a hit.  No darn wonder errors are a misleading measure of defensive efficiency.  Worse, they contribute to ERA, which is also a mess.
  • I can’t remember who said it (probably Earl Weaver — he said practically everything else), but it’s still true: if you keep changing pitchers, sooner or later you’ll find one who is having a bad day.  Yesterday it was Jorge Julio for the Rockies.  It’s simply amazing the Rockies got away with it.
  • I’d heard about Troy Tulowitzki before seeing that game, but I had no idea how good this kid is (worst picture at ESPN I’ve ever seen, btw; see the link above).  As a Red Sox fan, I can say that the Rockies have basically found their own Derek Jeter, except that Troy is a much better fielder.  Wow.
  • Matt Holliday is really good, but his defense contributed to the outfielding nightmare.  I’ll have to check the VORP stats at Baseball Prospectus to see how he really compares to Jimmy Rollins.

Okay, I just checked.  Holliday, 75.0.  Rollins, 66.1.  The real surprise, though, is that Rollins isn’t even the highest VORP on the Phillies.  Chase Utley is at 68.8.  Wow.

  • If anyone needs to know why you shouldn’t slide head first, that last play is Exhibit A.
  • There is no way Holliday touched the plate.  No way.  That means that instead of the game being over, it should have been tied, with two outs and a man on 2nd in the bottom of the 13th.  That means Trevor Hoffman might — just might — have gotten out of the inning.  We could still be playing that game now.
  • It felt really weird to see a game that exciting without having a serious rooting interest.  I kind of liked both teams.  I remember thinking over and over that it was a shame either one had to lose.  Still, I’ll be rooting for the Phillies in the division series.
  • I SO enjoyed the announcers last night.  These guys (Don Orsillo, who I’ve listened to for years, and the other guy whose name I forget) were excellent.  Knowing that sooner or later I’m going to have Tim McCarver and Joe Buck inflicted on me made this brief respite all the sweeter.

Wednesday is going to be tough.  I’m teaching an online Ajax class and we have students on the west coast, so I’m committed until at least 5:30 pm and maybe 6 pm.  The Phillies – Rockies game starts at 3, the Sox are on at 6:30, and the Cubs start at 10 pm.  And I still have to teach Thursday morning.

Of course, I always have a tough time during the MLB playoffs.  I just hope the Indians beat the Yankees quickly and the Sox sweep the Angels.  Then we’ll see.

Groovier Box Scores

I made a couple more fixes to my box scores script to make it a bit groovier. First is a trivial one, but it’s much more in the Groovy idiom than in Java.

I replaced

def cal = Calendar.getInstance()

with

def cal = Calendar.instance

Groovy lets you call a getter with property syntax, so Calendar.instance invokes Calendar.getInstance(). In your own Groovy classes, a field declared without an access modifier becomes a property: a private field with a public getter and setter generated for you. That is much more intuitive than Java’s “package-private” default. Of course, methods are public by default.
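Here is a small sketch of the property rules in action. The Player class and its fields are hypothetical names I made up for the example; only the Calendar line comes from my script.

```groovy
class Player {
    String name          // no modifier: private field plus public getName()/setName()
    private int steals   // explicit private: a plain field, no accessors generated

    int getCaught() { 3 }   // a getter with no backing field still reads as a property
}

def p = new Player(name: 'Hypothetical Hitter')
assert p.name == p.getName()     // property syntax calls the generated getter
assert p.caught == 3             // getCaught() accessed as p.caught

// The same rule covers static getters on plain Java classes:
assert Calendar.instance instanceof Calendar   // calls Calendar.getInstance()
```

The asserts pass silently, which is the point: the property form and the getter form are the same call.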

The other modification I made had to do with the fact that I was concerned about reading the remote XML file line by line. I thought it might be more appropriate to read the entire file into a local variable and then parse the file.

To do that, I found that the URL class had a getText() method (or, more in the Groovy spirit, a text property). That meant I could read the entire page by writing

def gamePage = new URL(url).text

Now the matching can be done all at once via

def m = gamePage =~ pattern

which results in a collection of matches. The only complication is that the pattern I’m searching for (/${day}_(\w*)mlb_(\w*)mlb_(\d) /) appears twice in each line, once as the text value of the <a> tag and once as its href attribute. I figured the easiest way to deal with that was to use eachWithIndex and only process every second match (the ones at odd zero-based indices, which is what i % 2 selects):

def m = gamePage =~ pattern
if (m) {
    (0..<m.count).eachWithIndex { line, i ->
      if (i % 2) {
          away = m[line][1]
          home = m[line][2]
          num = m[line][3]

etc. The rest is essentially the same.
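For anyone who wants to try the matching without hitting the network, here is a self-contained sketch of the same idea. The sample page text and the three-letter team codes are placeholders I invented, and the real page markup may well differ, so treat it as an illustration of the technique rather than a copy of my script.

```groovy
def day = '24'
// Hypothetical stand-in for the downloaded page: each game id appears twice per line,
// once in the href and once as the link text.
def gamePage = '''\
<a href="gid_2007_10_24_aaamlb_bbbmlb_1/">gid_2007_10_24_aaamlb_bbbmlb_1/</a>
<a href="gid_2007_10_24_cccmlb_dddmlb_1/">gid_2007_10_24_cccmlb_dddmlb_1/</a>
'''
def pattern = /${day}_(\w*)mlb_(\w*)mlb_(\d)/

def games = []
def m = gamePage =~ pattern
(0..<m.count).each { i ->
    // Odd indices are the second copy of each pair, so every game is kept exactly once.
    if (i % 2) {
        def (full, away, home, num) = m[i]   // m[i] is [whole match, group 1, group 2, group 3]
        games << [away: away, home: home, num: num]
    }
}
games.each { println "${it.away} at ${it.home}, game ${it.num}" }
```

Collecting the results into a list of maps, rather than doing the work inside the loop, also makes it easy to drop duplicates some other way later, for example by calling unique() on the list.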

A good source for figuring out the Groovy way to do things is the PLEAC Groovy page. It rocks.

Groovy Box Scores (minor correction)

I noticed while running the Groovy code I posted the other day that I accidentally reversed home and away. It’s not critical, because I still got the URL right, but it’s better to be right.

The fix was just to switch the groups:

away = m.group(1)
home = m.group(2)

and then to update the ${away} and ${home} in the URL link for the individual games.

I’m not sure that the best way to go is to use the eachLine method on the open stream, either. It’s probably better to download the whole page and then process it. I’m not sure how eachLine is working under the hood. If it’s sending a new HTTP request per line, it’s going to be pretty slow.

I also did some very rudimentary Date processing, always an ugly and awkward thing in Java. The URLs for each game need the day, month, and year, where the day and month have two digits and the year has four. Just to keep things simple, I did it this way:

def cal = Calendar.getInstance()
def year = cal.get(Calendar.YEAR)
def m = cal.get(Calendar.MONTH) + 1  // Ugly off-by-one correction
def d = cal.get(Calendar.DAY_OF_MONTH)
def month = (m < 10) ? "0" + m : m
def day = (d < 10) ? "0" + d : d
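A slightly tidier way to get the zero padding is to let a format string do it. This is an alternative sketch, not what the script above actually does. It uses only String.format and java.text.SimpleDateFormat from the JDK, and the stamp variable and the yyyy/MM/dd layout are just for illustration.

```groovy
import java.text.SimpleDateFormat

def cal = Calendar.instance
def year  = cal.get(Calendar.YEAR)
def month = String.format('%02d', cal.get(Calendar.MONTH) + 1)   // Calendar.MONTH is zero-based
def day   = String.format('%02d', cal.get(Calendar.DAY_OF_MONTH))

// Or format all three pieces in one go from the same instant:
def stamp = new SimpleDateFormat('yyyy/MM/dd').format(cal.time)

println "${year}/${month}/${day}"
println stamp
```

Both lines should print the same thing. The format-string version also has the advantage that month and day are always Strings, instead of sometimes a String and sometimes an Integer.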

Now I can run the script without arguments and it checks on the status of the current day’s games. I’ll update it soon so that I can enter a date, but dates are always awkward so I’m hesitating. When I turn all this into a web app (probably using Grails), I’ll try to insert some calendar widget with some Ajaxy goodness.
