Getting Insight from Data: Baseball

Big Data is cited by many prognosticators as a major growth area in computer science over the next decade. While definitions of Big Data abound, the basic idea is that data is being collected at such a rate and in such volume now that traditional ways of saving and analyzing it no longer suffice.

(Neal Ford once told me about a bioengineering project that collects roughly 5 exabytes of data every few days, and as scientists they never want to throw anything away. Their simplest solution for storing it all may be to build a dedicated hard drive factory next door.)

The real issue, though, is not how much data you have. It’s what you do with it. How do you understand your situation when the data is voluminous, but may contain errors, omissions, and other problems?

This is not a new problem, and fortunately we have a great field of examples to use as a model. If you want to understand how to get insight from large amounts of data, look at baseball.

Even a brief summary of the history of sabermetrics (a term coined by Bill James himself, based on SABR — the Society for American Baseball Research) would go well beyond the scope of a simple blog post. If you’ve seen the Moneyball movie or read the book, you’ve got the general idea. Suffice it to say that sabermetrics is an attempt to understand how to win baseball games by focusing on the data, rather than on conventional wisdom.

Some of the interesting observations made by the stats community include ideas that seem blindingly obvious but were ignored for years:

1. Outs are everything, so don’t give them away unnecessarily. Baseball doesn’t have a time limit, and you can’t stall or play a prevent defense. You keep playing until 27 outs are recorded.

2. That means if you steal at less than about a 70% success rate, don’t run. It doesn’t matter how many bases you steal if you’re giving up too many by getting caught.

3. Similarly, bunting is done way, way too often. Moving a runner over from first to second (or sometimes from second to third) isn’t worth giving away an out. Or, as the stat guys say, if you play for one run, that’s likely all you’ll get, and most of the time not even that.

4. Pitcher wins are highly dependent on elements outside the pitcher’s control, like runs scored by his offense or errors made by the defense. Wins are a lousy measure of pitcher skill. Frankly, ERA (earned run average) isn’t that much better, because there are many subjective judgements in there, too (try looking at the difference between earned and unearned runs sometime and the arbitrariness of it all will make your head spin).

5. RBIs are so dependent on where you are in the batting order that they’re a lousy measure of hitting skill. If you are a major leaguer and you bat fourth, fifth, or sixth, you’re going to get a lot of RBIs no matter what you do. That’s why OBP (on base percentage, which measures how often you avoid giving up an out) is far better correlated with wins.
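The break-even idea behind point 2 can be sketched with a little arithmetic: stealing only pays when the expected gain from succeeding outweighs the expected loss from getting caught. This is a minimal sketch, and the two run values in it are illustrative placeholders rather than measured figures, so the exact threshold it prints should not be read as a published number.

```groovy
// Break-even success rate for a stolen-base attempt.
// ASSUMPTION: the two run values below are illustrative placeholders,
// not measured figures from any run-expectancy table.
double runsGainedOnSuccess = 0.20d   // hypothetical value of taking the extra base
double runsLostWhenCaught  = 0.45d   // hypothetical cost of losing the runner and an out

// Stealing breaks even when p * gain == (1 - p) * loss, so p = loss / (gain + loss).
double breakEven = runsLostWhenCaught / (runsGainedOnSuccess + runsLostWhenCaught)

println String.format('Break-even success rate: %.1f%%', breakEven * 100)
```

The point of the formula is that the cost of an out is larger than the gain of a base, which pushes the break-even rate well above 50%.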

Incidentally, that was one of the themes of the Moneyball movie. The idea was to get the players to get on base as much as possible, placing a higher than normal value on walks. Since OBP was much more important to wins than RBIs, the Oakland A’s could replace stars that left by filling in their on-base contributions rather than worrying about runs batted in.
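To see why walks matter so much to OBP, here is a small sketch using the standard on-base percentage formula. The two stat lines are made up for illustration: both hitters have the same hits and at-bats, so their batting averages are identical, and only the walk totals differ.

```groovy
// On-base percentage: times on base divided by the plate appearances that count toward it.
// Standard formula: (H + BB + HBP) / (AB + BB + HBP + SF)
def onBasePct = { int h, int bb, int hbp, int ab, int sf ->
    (h + bb + hbp) / ((ab + bb + hbp + sf) as double)
}
def battingAvg = { int h, int ab -> h / (ab as double) }

// ASSUMPTION: two made-up stat lines with identical hits and at-bats; only the walks differ.
def freeSwinger   = [h: 150, bb: 20, hbp: 2, ab: 550, sf: 5]
def patientHitter = [h: 150, bb: 90, hbp: 2, ab: 550, sf: 5]

[freeSwinger: freeSwinger, patientHitter: patientHitter].each { name, s ->
    println String.format('%-14s AVG %.3f  OBP %.3f', name,
        battingAvg(s.h, s.ab), onBasePct(s.h, s.bb, s.hbp, s.ab, s.sf))
}
```

A traditional stat line would rate these two hitters the same, while OBP separates them clearly.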

As it turned out, they placed too little emphasis on defense. In the movie they moved Scott Hatteberg to first base where he had no defensive skills at all just to keep his bat in the lineup. Nowadays they would care more about the defense they gave up.

Actually, the real theme of Moneyball was to identify players that were undervalued based on traditional metrics and use them to build a successful team cheaply. At the time, OBP was a market inefficiency.

Rather than go on, I want to mention that there’s a reason I’m writing this post today. Tonight is the first game of the World Series between the Detroit Tigers and the San Francisco Giants, and from a sabermetric point of view it has all the elements of a potential classic.

I want to mention a couple baseball writers and what they’ve said about the upcoming series. First, Jonah Keri (author of The Extra 2%) wrote a great preview article today at Grantland discussing the Giants. Rany Jazayerli (a medical doctor cursed by following the Kansas City Royals) wrote a similar article on the Tigers. Both are well worth reading.

My favorite writer of all, however, is Joe Sheehan. He’s brilliant, and controversial, and fascinating to follow on Twitter. He writes a newsletter that I’m very happy to subscribe to, and his World Series preview came out today. I hope he won’t mind if I quote from it a bit, just to demonstrate the difference between mindlessly quoting statistics and drawing true insight from them.

The Tigers beat you by striking you out and not letting you exploit their poor defense.

He builds up to this by showing how the strike-out rate of the Detroit pitching staff is second in the league, and that they exceeded the league record in total strikeouts (coming in second to the Rays, actually). He also points out that their defense is so bad (they are the worst defense ever to reach the World Series) that the high strike out numbers are partly padded by having to face more batters than they should have had to, given the bad defense.

The Giants beat you by putting the ball in play and making you chase it.

The Giants were last in MLB in home runs. They were second in batting average and fourth in OBP. They have a park that suppresses homers and they play accordingly. On the other hand, they have a very low strike out rate. By the stat called equivalent average, they were the third best offense in baseball despite never hitting home runs.

As it turned out, the Tigers — after winning a weak division — caught two postseason opponents ill-equipped to take advantage of their poor defense.

The A’s struck out a lot and the Yankees were almost entirely driven by homers. The Giants will be completely opposite.

There’s plenty more, but this is the flavor of the observations. Note that none of the articles I mentioned are full of statistics. There are a few choice stats presented with the goal of making a clear argument. That’s insight.

By the way, most organizations in baseball know this. The conflict between the scouts and the stats guys demonstrated in Moneyball is largely over (with the great exception being the Kansas City Royals, who not coincidentally keep coming in last). There’s one major group who still doesn’t “get it”, though, and that’s the media.

I cannot fathom why the networks continually put “analysts” in the broadcast booth who are completely unaware of the last twenty years of baseball research. In fact, they often disdain anything learned from studying the game as tricks with statistics. The greatest irony (as mentioned by Joe Sheehan many times) is that no stats guy is anywhere near as wedded to a particular metric as the so-called “traditionalists” are to RBIs or pitcher wins. If you watch the game broadcasts, you’ll see meaningless statistic after meaningless statistic based on small sample size (batter A is 2 for 7 against pitcher B) paraded out as though it meant anything, completely ignoring what’s now viewed as truly important.
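A rough way to see why “2 for 7” says so little is to look at the sampling error that comes with seven at-bats. The sketch below assumes a simple binomial model (each at-bat treated as an independent trial), which is a simplification used only to show the scale of the uncertainty, and it compares seven at-bats with a hypothetical 700 at the same rate.

```groovy
// How much does "2 for 7" really tell us? A rough look at the sampling error.
// ASSUMPTION: treats each at-bat as an independent trial (a simple binomial model),
// which is a simplification used here only to show the scale of the uncertainty.
def standardError = { int hits, int atBats ->
    double p = hits / (atBats as double)
    Math.sqrt(p * (1 - p) / atBats)
}

[[2, 7], [200, 700]].each { hits, atBats ->
    double avg = hits / (atBats as double)
    double se = standardError(hits, atBats)
    println String.format('%d for %d: average %.3f, standard error about %.3f',
        hits, atBats, avg, se)
}
```

With seven at-bats the uncertainty is so wide that the observed average is consistent with almost any hitter; with a hundred times the sample it narrows by a factor of ten.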

For example, you’re guaranteed tonight to hear tons about the Giants’ “momentum” (which demonstrably doesn’t exist) or the effects of the Tigers’ long layoff since winning the ALCS, neither of which matters at all. Instead, the real story is whether the Giants can put enough balls in play to take advantage of the Tigers’ poor defense, or whether the Tigers’ great strike-out pitchers can keep them from doing it. It’s also whether the Tigers’ lineup, which is dominated by a few stars, will be able to beat the Giants’ pitching, which has relatively weak starters but lots of bullpen depth, and the excellent San Francisco defense.

I have no vested interest in either team. Frankly, if it wasn’t for what I learn by reading those writers and others like them, I probably wouldn’t care at all. Now I’m excited to watch the interplay of two diametrically contrasting styles. Even better, I’m looking forward to the snarky comments by the great writers I follow on Twitter during the games (nobody does snark like a good baseball writer).

Do I have a prediction? Please. I’m just happy the Yankees got swept. That almost, but not quite, made up for the disaster that was the Red Sox season. Still, one of the Red Sox owners claims part of the team’s problem was that they didn’t listen to Bill James enough, so that gives me hope for next year.

Also, I’m really looking forward to the articles written after each game, which will be beautiful demonstrations of how to make decisions based on insight rather than just quoting statistics as though they were significant in themselves. I’ll just have the mute button ready whenever Tim McCarver starts talking.


Joe Sheehan’s Newsletter is sent via email. An annual subscription is available at

Twitter handles:
Joe Sheehan (@joe_sheehan)
Rany Jazayerli (@jazayerli)
Jonah Keri (@jonahkeri)
Keith Law (@keithlaw)
Joe Posnanski (@JPosnanski)


Never miss a ballgame

As Tim Kurkjian famously said, “Never miss the opportunity to go to a baseball game.  You might see something you’ve never seen before.”

This week I’m in Asheville, NC.  I’m very busy with my Securing Java Web Applications class while other issues keep coming up, but the bottom line is that the Asheville Tourists (the class A affiliate of the Colorado Rockies) are nearby and are in town.  I was debating whether to go or not when I spoke to my wife on the phone.  As usual, she encouraged me to go.  She claims I’m always in a better mood after I’ve attended a ball game, so who can blame her?

Even better, minor league baseball team names in North Carolina are great.  I really liked the Lehigh Valley Iron Pigs when I was in Allentown a couple weeks ago, but NC has great names in abundance.  You’ve got the Greensboro Grasshoppers, the Winston-Salem Warthogs, the Kannapolis Intimidators, the Carolina Mudcats, and even tonight’s opponent, the Hickory Crawdads.  That doesn’t even mention the classic Durham Bulls.  But honestly, how can you not go to a game between the Tourists (who have had that name since 1914!) and the Crawdads?  It’s just not possible.

So I followed my usual practice: I showed up at the box office about a half hour before game time, told them I needed only one ticket, and asked for the best available seat in the house.  In Asheville, that turned out to be a special “Home Deck Suite” right behind the on-deck circle (probability of a foul ball: zero), which cost a fortune ($45, an insane amount for a minor league game) but included all you can eat on the menu, delivered for seven innings by a helpful staff person.

That’s right — all you can eat.  The guy kept coming back asking if I wanted more, and I kept doing massive rationalizations justifying horrible overeating in order to consume enough to make the ticket worthwhile.  Let’s say that I think I managed to do so (er, hot dogs, popcorn, cheese nachos, a giant pretzel, and an endless supply of sodas, but I showed some restraint — no crackerjacks, though I was tempted), which I’m already regretting and surely will regret more tomorrow.  I even got lucky and sat next to a charming couple who were in town on business and had tons of minor league baseball stories to tell.  The guy next to me also reminded me that the manager of the Tourists is good old Joe Mikulik, the immortal star of this classic YouTube video featuring a managerial meltdown that is topped only by this one by Phil Wellman, and I saw Earl Weaver in his prime.

As for the game, the Crawdads won 7 – 1, but I definitely saw some things I’d never seen before:

  • Hickory’s Bobby Spain went 4 for 5 with a home run, but he was outdone by his teammate Andrew Walker, who went 4 for 5 with two home runs.  They even went back-to-back in the top of the 2nd inning.  Is it too obscure a reference to think their slogan should be Walker and Spain and Pray for Rain?
  • Hickory’s Harrison Bishop and Tom Boleska combined to strike out six batters in a row from the bottom of the sixth to the bottom of the eighth.  I was surprised when they took out Bishop after striking out four in a row, but then Boleska came in and struck out two more before the next guy grounded out weakly to second.
  • The two teams combined for a total of seven (!) errors (Hickory made 4 and still won), which is more than I’ve seen in some Little League games.
  • The catcher’s name on the Crawdads is Lars Davis.  Yes, he’s the catcher.  Don’t they therefore, by law, HAVE to call him Crash?
  • The guy who sang the National Anthem was an excellent operatic singer.  Every anthem singer in Connecticut thinks they have to sing with a country twang or like they have vocal diarrhea (see Aguilera, Christina, or lament the sad, pathetic American Idolization of singing), but here I am in North Carolina and I get a trained voice with a fine instrument.  Go figure.

The weather was great, the crowd was small (2872) but enthusiastic.  Asheville is the champion of the first half of the season of the Northern Division of the South Atlantic league (an odd but interesting achievement), so on the way out they were giving away general admission tickets to any future game.

That means I have a free ticket to the game tomorrow, even if it’s not for a very good seat and I still have work to do.  Still, you should never miss going to a ballgame…


Minor league baseball rocks

I’ve been traveling a lot lately.  Fortunately, this is baseball season, so sometimes I get a chance to visit a park I’ve never been to before.

Last week I was in Allentown, PA.  Actually, that’s not quite true — I was actually in Schnecksville, PA, a small suburb of Allentown.  It turns out that this year Allentown has a new baseball team.  The Lehigh Valley Iron Pigs are playing their inaugural season as the AAA affiliate of the Philadelphia Phillies.

(Two years ago, as part of an extended weekend road trip, my son Xander and I did the Phillies circuit.  We got tickets to see the Phillies at Citizens Bank park (a huge improvement over the old Veteran’s Stadium, but, then again, almost anything would be), then we saw the Reading Phillies (their AA affiliate), and finally swung around to see the Scranton/Wilkes-Barre Red Barons, who at the time were the Phillies AAA team.  Now that Scranton is the AAA team for the Yankees, we won’t be going back any time soon (I’d link to their web site, but hey, if you’re a Yankee fan, go find it yourself).  As a final aside, we were hoping to do a similar circuit for the Red Sox (Red Sox at Fenway, Portland Sea Dogs, Lowell Spinners), but couldn’t get tickets to any of them.  That’s right — the Single A Lowell Spinners were sold out, too.  Baseball is king in New England.)

I was teaching a private class last week, and the client was a major sponsor of the Iron Pigs.  That meant I was able to join a group of people in a good balcony section of Coca-Cola Park (an awful name, but there it is).  The whole pig theme was obvious, from the kids hanging out on the freshly mowed lawn in left-center, which was called Pigs on a Blanket, to the Pig Pen in right-center field.  Their program was even called Pork Illustrated.

We had a lot of fun, even though the Iron Pigs lost 5-4.  Still, the park was charming, we had excellent weather, and the people were friendly.  (Mostly — I did have an extended baseball discussion with a long suffering Cleveland Indians fan who hates all things Boston, which is probably understandable under the circumstances. ;))

This week I’m in Austin, TX.  Last night I drove out to Round Rock and got to see the Round Rock Express, the AAA affiliate of the Houston Astros.  Yesterday the temperature peaked at 102, but there was a warm breeze and it cooled off a bit as the sun went down.  The stadium wasn’t terribly full, but the people who were there were quite enthusiastic.  The Express even won 3-0 and hit two home runs.  Other than taking forever to find my rental car in the parking lot (a sign of traveling too much is that you forget what your rental car looks like), I had a great time.

I’ve now added baseball caps from the Iron Pigs and the Express to my collection.  I used to get T-shirts everywhere for my son Xander, but he told me he doesn’t want them any more.  Now that he’s 16, all he wears are T-shirts with various rock bands on them.  So be it.  Be sure, though, to check out his band’s excellent studio recording of their song “Don’t Tell Me” at their MySpace page.

Next week I’ll be in Asheville, NC, and it looks like the Asheville Tourists (the class A affiliate of the Colorado Rockies) will be in town.  Maybe I’ll be able to buy another hat. 🙂


MLB playoffs from a TV Networks perspective

When the playoffs began, there was a chance that the championship series would have involved teams from [Note: TV market size in square brackets]

It turned out that the cities actually involved are

Snicker. I imagine the executives at Fox are not the happiest people in the world right now. Since they still plan to inflict Joe Buck and Tim McCarver on a helpless baseball public, I’m glad they’re suffering.

(Note that I’m carefully not gloating about the Yankees’ loss last night. I know that pain. I’m glad the Yankees are gone, but I know how much it hurts to see a team you live and die for over the long months of a baseball season fall apart in a short series.)

(Although I must say that I will never — NEVER — forgive Johnny Damon.)


Baseball playoffs start (yay!)

I know, I know.  The Rockies – Padres game wasn’t technically in the playoffs.  The stats counted as regular season stats, which meant the batting title was still at risk and Jake Peavy had a chance to win his 20th game (which, of course, didn’t happen).

But still, that was some game.  Some quick observations:

  • I haven’t seen outfield play that bad in years, and I regularly attend AA minor league games.  Whew, that was cover-your-eyes awful.  Coco Crisp would have made every one of those catches with ease.
  • Despite the above, apparently the official scorers have forgotten how to put a check mark in the errors column.  Every bungled outfield play but one was listed as a hit.  No darn wonder errors are a misleading measure of defensive efficiency.  Worse, they contribute to ERA, which is also a mess.
  • I can’t remember who said it (probably Earl Weaver — he said practically everything else), but it’s still true: if you keep changing pitchers, sooner or later you’ll find one who is having a bad day.  Yesterday it was Jorge Julio for the Rockies.  It’s simply amazing the Rockies got away with it.
  • I’d heard about Troy Tulowitzki before seeing that game, but I had no idea how good this kid is (worst picture at ESPN I’ve ever seen, btw; see the link above).  As a Red Sox fan, I can say that the Rockies have basically found their own Derek Jeter, except that Troy is a much better fielder.  Wow.
  • Matt Holliday is really good, but his defense contributed to the outfielding nightmare.  I’ll have to check the VORP stats at Baseball Prospectus to see how he really compares to Jimmy Rollins.

Okay, I just checked.  Holliday, 75.0.  Rollins, 66.1.  The real surprise, though, is that Rollins isn’t even the highest VORP on the Phillies.  Chase Utley is at 68.8.  Wow.

  • If anyone needs to know why you shouldn’t slide head first, that last play is Exhibit A.
  • There is no way Holliday touched the plate.  No way.  That means that instead of the game being over, it should have been tied, with two outs and a man on 2nd in the bottom of the 13th.  That means Trevor Hoffman might — just might — have gotten out of the inning.  We could still be playing that game now.
  • It felt really weird to see a game that exciting without having a serious rooting interest.  I kind of liked both teams.  I remember thinking over and over that it was a shame either one had to lose.  Still, I’ll be rooting for the Phillies in the division series.
  • I SO enjoyed the announcers last night.  These guys (Don Orsillo, who I’ve listened to for years, and the other guy whose name I forget) were excellent.  Knowing that sooner or later I’m going to have Tim McCarver and Joe Buck inflicted on me made this brief respite all the sweeter.

Wednesday is going to be tough.  I’m teaching an online Ajax class and we have students on the west coast, so I’m committed until at least 5:30 pm and maybe 6 pm.  The Phillies – Rockies game starts at 3, the Sox are on at 6:30, and the Cubs start at 10 pm.  And I still have to teach Thursday morning.

Of course, I always have a tough time during the MLB playoffs.  I just hope the Indians beat the Yankees quickly and the Sox sweep the Angels.  Then we’ll see.

Baseball Groovy

Groovier Box Scores

I made a couple more fixes to my box scores script to make it a bit groovier. First is a trivial one, but it’s much more in the Groovy idiom than in Java.

I replaced

def cal = Calendar.getInstance()

with

def cal = Calendar.instance

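Groovy automatically calls the getter when you use property syntax, so Calendar.instance is simply shorthand for Calendar.getInstance(). In your own Groovy classes, a field declared without an access modifier becomes a property: the backing field is private and Groovy generates the public getter and setter for you, which is much more intuitive than Java’s “package-private” default. Of course, methods are public by default.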
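Here is a small sketch of that getter shorthand on a user-defined class. The Player class and its data are hypothetical, made up purely for illustration.

```groovy
// A field declared without an access modifier is a Groovy property:
// the backing field is private, and a public getter and setter are generated.
class Player {
    String name
    int hits
    int atBats

    // An explicit getter with no backing field also reads like a property.
    BigDecimal getAverage() { atBats ? hits / atBats : 0.0 }
}

def p = new Player(name: 'Example Batter', hits: 2, atBats: 4)  // hypothetical data

assert p.name == p.getName()                    // property syntax calls the generated getter
assert p.average == 0.5                         // calls getAverage()
assert Calendar.instance instanceof Calendar    // same shorthand for Java's static getInstance()
println "${p.name}: ${p.average}"
```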

The other modification I made had to do with the fact that I was concerned about reading the remote XML file line by line. I thought it might be more appropriate to read the entire file into a local variable and then parse the file.

To do that, I found that the URL class had a getText() method (or, more in the Groovy spirit, a text property). That meant I could read the entire page by writing

def gamePage = new URL(url).text

Now the matching can be done all at once via

def m = gamePage =~ pattern

which results in a matcher holding every match. The only complication is that the pattern I’m searching for (/${day}_(\w*)mlb_(\w*)mlb_(\d)/) appears twice in each line, once as the text value of the <a> tag and once as its href attribute. I figured the easiest way to deal with that was to use eachWithIndex and only worry about every second match (the odd indices, since counting starts at zero):

def m = gamePage =~ pattern
if (m) {
    (0..<m.count).eachWithIndex { line, i ->
        if (i % 2) {   // odd index: the second copy of each link
            away = m[line][1]
            home = m[line][2]
            num = m[line][3]
            // ... process the game as before ...
        }
    }
}

etc. The rest is essentially the same.
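As an alternative to skipping every second match, the duplicates can simply be dropped after matching. This is a sketch of a different technique, not the script's actual code: it uses String.findAll with a closure to pull out the groups, and the two-line listing is made up in the shape of the links described above so that it runs without a network connection.

```groovy
// ASSUMPTION: a made-up two-line directory listing standing in for the real page.
def day = '25'
def gamePage = '''
<a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a>
<a href="gid_2007_08_25_bosmlb_chamlb_1/">gid_2007_08_25_bosmlb_chamlb_1/</a>
'''

def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/

// findAll passes the full match followed by each captured group to the closure.
def matches = gamePage.findAll(pattern) { full, away, home, num ->
    [away: away, home: home, num: num]
}

// Each link matched twice (href and text); a LinkedHashSet keeps the first of each, in order.
def games = new LinkedHashSet(matches) as List

games.each { println "${it.away} at ${it.home} (game ${it.num})" }
```

De-duplicating does not depend on the matches arriving in strict pairs, which makes it a little more forgiving if the page layout ever changes.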

A good source for figuring out the Groovy way to do things is the PLEAC Groovy page. It rocks.

Groovy Box Scores (minor correction)

I noticed when running the Groovy code I posted the other day that I had accidentally reversed home and away. It’s not critical, because I still got the URL right, but it’s better to be right.

The fix was just to switch the groups:

away = m.group(1)
home = m.group(2)

and then to update the ${away} and ${home} in the URL link for the individual games.

I’m not sure that the best way to go is to use the eachLine method on the open stream, either. It’s probably better to download the whole page and then process it. I’m not sure how eachLine is working under the hood. If it’s sending a new HTTP request per line, it’s going to be pretty slow.
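On the question of what eachLine does under the hood: it reads lines from whatever InputStream it is handed, and URL.openStream() opens one connection and returns one stream, so the lines come out of a single response. The sketch below shows the reading behaviour offline; the ByteArrayInputStream is a stand-in for the network stream and the text is made up.

```groovy
// eachLine reads from the InputStream it is given, one line at a time.
// ASSUMPTION: a ByteArrayInputStream stands in for the network stream so this runs offline.
def fakeResponse = "first line\nsecond line\nthird line"
InputStream stream = new ByteArrayInputStream(fakeResponse.getBytes('UTF-8'))

def lines = []
stream.eachLine('UTF-8') { line -> lines << line }
println "Read ${lines.size()} lines from one stream"

// Reading everything at once yields the same content as a single String.
def whole = new ByteArrayInputStream(fakeResponse.getBytes('UTF-8')).getText('UTF-8')
assert whole.readLines() == lines
```

Either approach makes one pass over the data; pulling the whole page into a String mainly makes it easier to run a pattern across line boundaries.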

I also did some very rudimentary date processing, always an ugly and awkward thing in Java. The URLs for each game need the day, month, and year, where the day and month have two digits and the year has four. Just to keep things simple, I did it this way:

def cal = Calendar.getInstance()
def year = cal.get(Calendar.YEAR)
def m = cal.get(Calendar.MONTH) + 1  // Ugly off-by-one correction
def d = cal.get(Calendar.DAY_OF_MONTH)
def month = (m < 10)? "0" + m : m
def day = (d < 10) ? "0" + d : d

Now I can run the script without arguments and it checks on the status of the current day’s games. I’ll update it soon so that I can enter in a date, but dates are always awkward so I’m hesitating. When I turn all this into a web app (probably using Grails), I’ll try to insert some calendar widget with some Ajaxy goodness.
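One possible way to tidy the zero-padding is String.format with a %02d pattern, which replaces the manual “less than 10” checks. This is a sketch of an alternative, not the script's code, and the date is fixed only so the output is predictable.

```groovy
// Zero-padding with String.format instead of the manual "< 10" checks.
// ASSUMPTION: the date is fixed here (25 August 2007) only so the output is predictable;
// Calendar.instance would give the current date instead.
def cal = new GregorianCalendar(2007, Calendar.AUGUST, 25)

def year  = cal.get(Calendar.YEAR)
def month = String.format('%02d', cal.get(Calendar.MONTH) + 1)   // MONTH is zero-based
def day   = String.format('%02d', cal.get(Calendar.DAY_OF_MONTH))

def path = "year_${year}/month_${month}/day_${day}/".toString()
println path
```

Accepting a date from the command line would then only need to populate the calendar differently; the formatting stays the same.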

Groovy Box Scores

Long ago I decided the best thing about Ruby on Rails was Ruby. Ruby is a great language, with a friendly community and lots of samples to learn from. Still, it’s quite a radical change from Java, which is the language where I am most comfortable.

That brought me to Grails, on a journey I’ve discussed here before. Since Rails taught me about Ruby, I suspected that the coolest aspect of Grails was going to be Groovy. As I spend more and more time with both Groovy and Grails, I’m not sure I want to downplay Grails while praising Groovy, but Groovy sure is a lot of fun.

As we get deep into the baseball pennant races, I’ve been spending more time on the game online. Recently, to my surprise, I discovered that MLB actually makes the box scores from each game available online in XML format. In other words, just by processing some XML, I can access whatever game data I like.

I found out about this from the interesting book Baseball Hacks, by Joseph Adler. I got the book when it came out but didn’t get very far into it because the language of choice in the book was Perl. I’m really not a Perl hacker by any means, so I kind of lost interest. Then I saw the hacks on accessing data online, and I was hooked all over again.

Processing XML with Java is never a fun thing to do. The programming model is awkward at best, and filled with indirection (you have to get a factory to get a DocumentBuilder / SAXParser / Transformer, then set properties on it, then get the object you really wanted, etc). Then getting the element you wanted isn’t terribly fun, either. It wasn’t until Java 5 that the language finally introduced an XPath processor.
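To make the indirection concrete, here is a sketch of the factory-then-builder-then-document dance. It is written in Groovy syntax to stay consistent with the rest of the code in these posts, but it uses only the standard javax.xml classes, so it reads much as the Java version would. The XML is a tiny made-up document, not a real box score.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// ASSUMPTION: a tiny made-up document standing in for a real box score.
def xml = '<boxscore><batting><batter name="Example" h="2" ab="4"/></batting></boxscore>'

// Factory -> builder -> document
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))

// Factory -> XPath -> evaluate, just to read one attribute
def xpath = XPathFactory.newInstance().newXPath()
def hits = xpath.evaluate('/boxscore/batting/batter/@h', doc, XPathConstants.STRING)

println "Hits: ${hits}"
```

That is two factories and two intermediate objects to read a single attribute value.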

(Incidentally, I usually say that my two least favorite things to do in programming are debugging JavaScript and traversing DOM trees in Java. Ajax gives me the chance to do both at the same time! Fortunately, JavaScript is much friendlier to XML than Java is, and the great Ajax libraries like Prototype, Dojo, and Scriptaculous make everything easier. But I digress…)

The JSP Standard Tag Library (JSTL) makes all of that much easier, too. Not only are the tags simple (imports and transforms and the like), but you can just use a JavaScript-like EL dot notation to traverse the tree. Unfortunately, though, that doesn’t seem to have made its way into Java yet.

Enter Groovy. Since the data is online already in XML form, I wondered how I could access it in Groovy. It turns out that accessing and parsing the data takes about two lines:

def url = ... // whatever the url is, online or otherwise
def boxscore = new XmlParser().parse(url)

and we’re done. (Note that an XmlSlurper works just as well as an alternative.)

Traversing the resulting tree is also trivial.

To see an example, I was going to paste a box score here, but it’s probably just as easy to see it online. Here’s a link to the box score for the game Boston at Chicago from today, which the Red Sox won 14-2.

(Gee, I wonder why I picked that game? :))

The Baseball Hacks book shows how to examine all files of that form using Perl. Since I’m trying to learn more Groovy, I’m redoing the examples. Of course, since Groovy is object-oriented, the next step will be to create actual classes and objects out of these things, not just live on functional programming, but that will come later.

Here’s a snippet to grab the box score and do some basic processing with it.

// Just a sample for the moment:
def year = '2007'
def month =  '08'
def day = '25'
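def num = '1' // 1 for single game, 1 or 2 for double header
def away = 'bos' // away team abbreviation (Boston at Chicago, the game mentioned above)
def home = 'cha' // home team abbreviation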

// Build the URL
def base = ''
def url = base + "year_${year}/month_${month}/day_${day}/"
url += "gid_${year}_${month}_${day}_${away}mlb_${home}mlb_${num}/boxscore.xml"

// Read and parse the box score
def boxscore = new XmlParser().parse(url)

// Collect all the <batter> elements inside all the <batting> elements
def batters = boxscore.batting.batter
for (b in batters) {
    println b.'@name' + ' went ' + b.'@h' + ' for ' + b.'@ab'
}
println batters.size() + " total batters"
println 'Total hits: ' + batters.'@h'*.toInteger().sum()

println "Batters with at least one hit:"
println batters.findAll {
    it.'@h'.toInteger() > 0
}.collect {
    it.'@name' + '(' + it.'@h' + ')'
}
Note how easy it is to access child elements and even attributes (prefaced by the @ sign). I also love the spread operator (*.) which allows me to grab the “hits” attribute of each batter, convert them all into integers, and then add them up. I also get to use closures to find all the batters with at least one hit, collect them into a list, and print their names. There may be a more elegant (read “groovier”) way to do that, but this worked for me.
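For comparison, the same kind of traversal can be sketched with XmlSlurper, run against inline text so it works without a network connection. The batter names and numbers are made up, and the import assumes Groovy 3 or later (older versions have the class in groovy.util, where it needs no import).

```groovy
import groovy.xml.XmlSlurper   // ASSUMPTION: Groovy 3 or later; older versions use groovy.util.XmlSlurper

// ASSUMPTION: made-up batting lines in the shape the snippet above expects.
def xml = '''
<boxscore>
  <batting>
    <batter name="Alpha" h="2" ab="4"/>
    <batter name="Bravo" h="0" ab="3"/>
  </batting>
  <batting>
    <batter name="Charlie" h="1" ab="5"/>
  </batting>
</boxscore>
'''

def boxscore = new XmlSlurper().parseText(xml)
def batters = boxscore.batting.batter   // every <batter> under every <batting>

def totalHits = batters.collect { it.@h.toInteger() }.sum()
def withHits = batters.findAll { it.@h.toInteger() > 0 }
                      .collect { "${it.@name}(${it.@h})".toString() }

println "${batters.size()} total batters, ${totalHits} hits"
println withHits.join(', ')
```

With the slurper, attributes are read as it.@h rather than it.'@h', but the path expression boxscore.batting.batter is the same.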

The URL for each individual game is in a directory corresponding to the string above that begins with “gid”. The parent directory for that date lists all the games for that day. In order to process all the games for a given date, somehow I need a list of those directories.

Adler does the Perl equivalent of screen scraping to get those values. In other words, he basically reads the HTML page and looks for the link tags that have that href in them. Of course, as a Perl hacker, he uses regular expressions.

I’m a relatively normal Java programmer, which means I’ve spent most of my career avoiding regular expressions unless absolutely necessary. One of my absolute favorite programming quotes is in the Groovy in Action book (GinA), p. 76:

Once a programmer had a problem. He thought he could solve it with a regular expression. Now he had two problems.

That slays me. Unfortunately (or not, since I really do need to learn this stuff), the best way I could find to solve the same problem was still a regular expression. Since regex’s have been a part of Java for a couple of versions now, it’s high time I got better at them, especially if I want to make any progress in Groovy.

It took some time for me to realize it, but the key to making my program work was “grouping”. I hadn’t realized that if you put parentheses in a regular expression, you can easily get at the grouped values. In this particular case, the base URL for the day is a web page that contains a series of links in the form I want:

<a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a>

and so on for each game. Here’s what I ultimately did:

println "Games for ${month}/${day}/${year}"
def url = base + "year_${year}/month_${month}/day_${day}/"
def gamePage = new URL(url)

def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/

gamePage.openStream().eachLine() { line ->
    def m = pattern.matcher(line)
    if (m) {
        home = m.group(1)  // group 1 is the home team abbrev
        away = m.group(2)  // group 2 is the away team abbrev
        num = m.group(3)   // group 3 is the num (1 or 2)
        def game = "gid_${year}_${month}_${day}_${home}mlb_${away}mlb_${num}/boxscore.xml"

        // if the game hasn't started, the box score won't be there
        // Use a try/catch block for this situation
        try {
            def boxscore = new XmlParser().parse(url + game)

            // Team names are attributes of <boxscore>
            // Run totals are attributes of the single <linescore> child of <boxscore>
            def awayName = boxscore.'@away_fname'
            def awayScore = boxscore.linescore[0].'@away_team_runs'
            def homeName = boxscore.'@home_fname'
            def homeScore = boxscore.linescore[0].'@home_team_runs'
            println awayName + " " + awayScore + ", " +  homeName + " " + homeScore +
                 " (game " + num + ")"

            // Winning and losing pitchers are in a "note" attribute of <pitcher>
            def pitchers = boxscore.pitching.pitcher
            pitchers.each { p ->
                if (p.'@note' && p.'@note' =~ /W|L|S/) {
                    println "  " + p.'@name' + " " + p.'@note'
                }
            }
        } catch (Exception e) {
            println abbrevs[away] + " at " +  abbrevs[home] + " not started yet"
        }
    }
}

At the top of my script I have a map called “abbrevs” which looks like:

def abbrevs = [atl:"Atlanta", bos:"Boston",
    sln:"St. Louis", cha:"Chicago (A)", chn:"Chicago (N)"  // ... and so on ...
]

The result is a listing like:

Games for 08/25/2007
Atlanta Braves 3, St. Louis Cardinals 0 (game 1)
Boston Red Sox 14, Chicago White Sox 2 (game 1)
    Wakefield (W, 16-10)
    Buehrle (L, 9-9)
Arizona at Chicago (N) not started yet

and so on.
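
To see the grouping idea in isolation, here is a minimal sketch that runs the same pattern against a hard-coded sample line rather than the live page. The sample line and the variable names first, second, and num are illustrative assumptions, not part of the original script.

```groovy
// Sample line in the same form as the links on the day's index page (hard-coded assumption)
def day = '25'
def line = '<a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a>'

// Each pair of parentheses defines a capture group, numbered from 1
def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/
def m = pattern.matcher(line)

// Groovy truth on a Matcher calls find(), so the groups are available afterwards
assert m
def first  = m.group(1)   // first team abbreviation in the link
def second = m.group(2)   // second team abbreviation in the link
def num    = m.group(3)   // game number

println "first=${first}, second=${second}, num=${num}"
// prints: first=atl, second=sln, num=1
```

Group 0 is always the whole match, which is why the useful pieces start at group 1.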

The next step is to use a builder to convert the box score into HTML. I did that, but since this post is already getting out of hand I think I’ll save that for my next one. I also did some very rudimentary date processing so that I could get the box scores for the current date without having to hard-wire anything.
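
For what it's worth, the date part can be sketched in a few lines. This is only one possible approach, assuming a modern JVM with java.time available; the helper name dayPath is hypothetical and is not taken from the original script.

```groovy
import java.time.LocalDate

// Hypothetical helper: turns a date into the zero-padded path segment
// used in the day's URL, e.g. year_2007/month_08/day_25/
String dayPath(LocalDate date) {
    String.format('year_%04d/month_%02d/day_%02d/',
        date.year, date.monthValue, date.dayOfMonth)
}

// Today's games, with nothing hard-wired
println dayPath(LocalDate.now())

// A fixed date, to check the formatting
println dayPath(LocalDate.of(2007, 8, 25))
// prints: year_2007/month_08/day_25/
```

Zero-padding matters here because the URLs use two-digit months and days.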

It’s amazing how much easier this is than basic Java processing, but what also makes it so cool is that I was able to use my Java knowledge to help. For example, to get the basic web page for processing I already knew about the URL class and its openStream() method. The rest I got from Groovy.

Next time I’ll get into the builder and the date processing. Then I can start developing a much more object-oriented version, which will probably contain classes called Boxscore, Pitcher, Batter, and so on.


It’s not all Gagne’s fault

I’ll keep this short and sweet. Yes, if it weren’t for Eric Gagne, the Red Sox probably would have swept all three games against the Orioles rather than lose two of three. Yes, the Yankees are playing video game baseball right now and may win no matter what we do. But somebody please, please explain to Terry Francona that it’s okay to bring on your closer in a tie game on the road. Kyle freakin’ Snyder instead of Jonathan Papelbon? That’s just insane.

Bill James actually works for the Red Sox. Can’t he sit down with Theo and Terry and explain elementary bullpen usage? Please? He’s already on the payroll — use him!

Sorry, I had to get that off my chest. This is the hardest time to live in Connecticut. I really like living here, despite my mildly deprecating comments about it (“I married a local girl, so now I’m stuck,” and such things), but there are Yankee fans everywhere. And there are Red Sox fans everywhere. To paraphrase Jeffrey Pelt in The Hunt for Red October, “it would be well to consider that having your [fans] and ours in such proximity is inherently dangerous — wars have started that way!”

If we both make the playoffs again … blech. If the Sox miss the playoffs … no, I’m not even going to go there yet.


I got Potterred

Here I am, minding my own business, digging into Struts 2.0, when an owl delivered a package to my doorstep on Saturday.

There was a book inside.

To be honest, it wasn’t completely unexpected. I stopped reading the Harry Potter series after book 5, because I really didn’t like the extended scenes of cruelty and nastiness, not to mention the fact that it took over 500 pages before Harry started to fight back successfully. Still, Ginger read book 6 and we pre-ordered book 7 when it first became available.

The release of book 7 coincided with the release of the fifth movie. I was reluctant to go see it, for the same reasons, but I figured I’d give it a try. Besides, Xander was away at camp and Ginger really wanted to go.

I really liked the movie. They downplayed the torture aspects (almost too much, may Dolores Umbridge rot in hell for all eternity — yes, I know it’s fiction, but nevertheless) and did a great job on the rest of the action and character development. That meant I had to dig into book 6, knowing that book 7 was coming on Saturday.

Book 6 was really good. I finished it Saturday afternoon and started book 7 that evening, resigned to the fact that nothing productive was going to happen in my life until that book was finished.

I finally finished last night. All I can say is, wow. That was fun, and amazing, and all that. I have no idea how they’ll make a movie out of it; it felt like there was material for at least three of them.

That led me to the Wikipedia pages on the book, which were already quite complete and answered a few questions I had about horcruxes, hallows, and certain character biographies. Be sure to avoid the whole section until you’ve read the book.

I know this is a rather odd blog posting for my company site, but if anyone has tried to reach me and found me unavailable for the last few days, well, now you know.

As long as I’m off-topic, a couple of quick asides:

1. What I wouldn’t give for portkeys, or the floo network, or frankly anything to help me avoid the airlines, which are clearly in the service of the Dark Lord.

2. MyEclipse 6.0 should be out any day now. I’m really looking forward to it. I’ve been using the M1 release for almost a month, but I don’t want to get too heavily invested in it before the GA release comes out.

3. Now that the Yankees are in the midst of about 1000 games against bad teams, it’s a good thing the Red Sox picked now to go on a five-game winning streak. And who knew that Julio Lugo was a hitting machine? I hope that lasts a bit longer.

4. Xander is clearly having more fun this summer than I did at age 15. He spent a week and a half with my parents, then a couple of days at home, followed by a week at camp as a counselor, a week with a friend’s family at their vacation house in Rhode Island, then overnight with friends last weekend, a party on Sunday, a sleep-over last night, and he’s got another week in camp next week. Not to mention that his end-of-semester grades were good enough (barely) to keep his new guitar. Yeah, he’s really got it tough.

5. Soon I’ll have to write a post about my new part-time job scoring minor league baseball games for Baseball Info Solutions. I’ve got a CT Defenders game tonight and tomorrow, though, so that will have to wait.