Groovy Box Scores

Long ago I decided the best thing about Ruby on Rails was Ruby. Ruby is a great language, with a friendly community and lots of samples to learn from. Still, it’s quite a radical change from Java, which is the language where I am most comfortable.

That brought me to Grails, on a journey I’ve discussed here before. Since Rails taught me about Ruby, I suspected that the coolest aspect of Grails was going to be Groovy. As I spend more and more time with both Groovy and Grails, I’m not sure I want to downplay Grails while praising Groovy, but Groovy sure is a lot of fun.

As we get deep into the baseball pennant races, I’ve been spending more time on the game online. Recently, to my surprise, I discovered that MLB actually makes the box scores from each game available online in XML format. In other words, just by processing some XML, I can access whatever game data I like.

I found out about this from the interesting book Baseball Hacks, by Joseph Adler. I got the book when it came out but didn’t get very far into it because the language of choice in the book was Perl. I’m really not a Perl hacker by any means, so I kind of lost interest. Then I saw the hacks on accessing data online, and I was hooked all over again.

Processing XML with Java is never a fun thing to do. The programming model is awkward at best and filled with indirection: you have to get a factory, set properties on it, and only then get the object you really wanted (a DocumentBuilder, SAXParser, or Transformer). Getting at the element you want isn’t terribly fun, either. It wasn’t until Java 5 that the standard library finally included an XPath processor.
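To make that indirection concrete, here is a minimal sketch of the standard JAXP steps. It is written in Groovy syntax so it can be pasted into the Groovy console, but it uses only the Java DOM and XPath classes. The inline XML and the batter names are made up for illustration; this is not the real MLB format.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory
import org.w3c.dom.NodeList

// Made-up sample document, just for illustration
def xml = '<batting><batter name="Smith" h="2"/><batter name="Jones" h="0"/></batting>'

// Step 1: factory -> builder -> document
def factory = DocumentBuilderFactory.newInstance()
def builder = factory.newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))

// Step 2: another factory -> XPath object -> node list
def xpath = XPathFactory.newInstance().newXPath()
NodeList nodes = (NodeList) xpath.evaluate('/batting/batter', doc, XPathConstants.NODESET)

// Step 3: walk the node list by index and dig out one attribute per element
def names = []
for (int i = 0; i < nodes.getLength(); i++) {
    names << nodes.item(i).getAttributes().getNamedItem('name').getNodeValue()
}
println names
```

That is three factories or helper objects and an index loop just to read one attribute from each element.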

(Incidentally, I usually say that my two least favorite things to do in programming are debugging JavaScript and traversing DOM trees in Java. Ajax gives me the chance to do both at the same time! Fortunately, JavaScript is much friendlier to XML than Java is, and the great Ajax libraries like Prototype, Dojo, and Scriptaculous make everything easier. But I digress…)

The JSP Standard Tag Library (JSTL) makes all of that much easier, too. Not only are the tags simple (imports and transforms and the like), but you can just use a JavaScript-like EL dot notation to traverse the tree. Unfortunately, though, that doesn’t seem to have made its way into Java yet.

Enter Groovy. Since the data is online already in XML form, I wondered how I could access it in Groovy. It turns out that accessing and parsing the data takes about two lines:

def url = ... // whatever the url is, online or otherwise
def boxscore = new XmlParser().parse(url)

and we’re done. (Note that an XmlSlurper works just as well as an alternative.)
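A small sketch comparing the two parsers on an inline string, so it runs without a network connection. The miniature document and the batter name are made up; only the away_fname and home_fname attribute names come from the real files. The import is what Groovy 3.0 and later expect; on older versions both classes live in groovy.util and need no import.

```groovy
import groovy.xml.XmlParser
import groovy.xml.XmlSlurper

// Made-up miniature box score; the real files carry many more attributes
def xml = '''<boxscore home_fname="Chicago White Sox" away_fname="Boston Red Sox">
  <batting team_flag="away">
    <batter name="Smith" ab="4" h="2"/>
  </batting>
</boxscore>'''

def viaParser  = new XmlParser().parseText(xml)   // builds a tree of Node objects
def viaSlurper = new XmlSlurper().parseText(xml)  // returns a lazily evaluated GPathResult

// XmlParser hands back the attribute as a plain String
def parserName = viaParser.'@away_fname'

// XmlSlurper hands back a GPathResult, so call text() to get the String
def slurperName = viaSlurper.'@away_fname'.text()
def firstBatter = viaSlurper.batting.batter[0].'@name'.text()

println parserName
println slurperName
println firstBatter
```

The navigation syntax is the same either way; the main visible difference is the text() call on the slurper results.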

Traversing the resulting tree is also trivial.

To see an example, I was going to paste a box score here, but it’s probably just as easy to see it online. Here’s a link to the box score for the game Boston at Chicago from today, which the Red Sox won 14-2.

(Gee, I wonder why I picked that game? :))

The Baseball Hacks book shows how to examine all files of that form using Perl. Since I’m trying to learn more Groovy, I’m redoing the examples. Of course, since Groovy is object-oriented, the next step will be to create actual classes and objects out of these things, not just live on functional programming, but that will come later.

Here’s a snippet to grab the box score and do some basic processing with it.

// Just a sample for the moment:
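def year = '2007'
def month = '08'
def day = '25'
def away = 'bos' // visiting team abbreviation (Boston)
def home = 'cha' // home team abbreviation (Chicago White Sox)
def num = '1' // 1 for single game, 1 or 2 for double header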

// Build the URL
def base = 'http://gd2.mlb.com/components/game/mlb/'
def url = base + "year_${year}/month_${month}/day_${day}/"
url += "gid_${year}_${month}_${day}_${away}mlb_${home}mlb_${num}/boxscore.xml"

// Read and parse the box score
def boxscore = new XmlParser().parse(url)

// Collect all the <batter> elements inside all the <batting> elements
def batters = boxscore.batting.batter
for (b in batters) {
    println b.'@name' + ' went ' + b.'@h' + ' for ' + b.'@ab'
}
println batters.size() + " total batters"
println 'Total hits: ' + batters.'@h'*.toInteger().sum()

println "Batters with at least one hit:"
println batters.findAll {
    it.'@h'.toInteger() > 0
}.collect {
    it.'@name' + '(' + it.'@h' + ')'
}

Note how easy it is to access child elements and even attributes (prefaced by the @ sign). I also love the spread operator (*.) which allows me to grab the “hits” attribute of each batter, convert them all into integers, and then add them up. I also get to use closures to find all the batters with at least one hit, collect them into a list, and print their names. There may be a more elegant (read “groovier”) way to do that, but this worked for me.
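Here is a self-contained sketch of the same operations run against an inline string, which makes it easier to experiment without hitting the MLB site. The batter names and numbers are made up. It also shows one possibly "groovier" variation, sum with a closure, which skips the intermediate list. The import assumes Groovy 3.0 or later.

```groovy
import groovy.xml.XmlParser

// Made-up data in the same shape as the real file: <batting> elements holding <batter> elements
def xml = '''<boxscore>
  <batting>
    <batter name="Smith" ab="5" h="3"/>
    <batter name="Jones" ab="4" h="0"/>
  </batting>
  <batting>
    <batter name="Lee" ab="4" h="1"/>
  </batting>
</boxscore>'''

def batters = new XmlParser().parseText(xml).batting.batter

// Attribute access on the whole list, then the spread operator converts each value
def hitList = batters.'@h'*.toInteger()
def totalHits = hitList.sum()

// Variation: let sum() apply the conversion itself
def totalViaClosure = batters.sum { it.'@h'.toInteger() }

// findAll filters the nodes, collect turns each survivor into a string
def withHits = batters.findAll { it.'@h'.toInteger() > 0 }
                      .collect { it.'@name' + '(' + it.'@h' + ')' }

println hitList
println totalHits
println withHits
```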

The URL for each individual game is in a directory whose name begins with “gid”, built from the string above. The parent directory for that date lists all the games for that day. In order to process all the games for a given date, I need a list of those directories.

Adler does the Perl equivalent of screen scraping to get those values. In other words, he basically reads the HTML page and looks for the link tags that have that href in them. Of course, as a Perl hacker, he uses regular expressions.

I’m a relatively normal Java programmer, which means I’ve spent most of my career avoiding regular expressions unless absolutely necessary. One of my absolute favorite programming quotes is in the Groovy in Action book (GinA), p. 76:

Once a programmer had a problem. He thought he could solve it with a regular expression. Now he had two problems.

That slays me. Unfortunately (or not, since I really do need to learn this stuff), the best way I could find to solve the same problem was still a regular expression. Since regular expressions have been a part of Java for a couple of versions now, it’s high time I got better at them, especially if I want to make any progress in Groovy.

It took some time for me to realize it, but the key to making my program work was “grouping”. I hadn’t realized that if you put parentheses in a regular expression, you can easily get at the grouped values. In this particular case, the base URL for the day is a web page that contains a series of links in the form I want:

<li>
<a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a>
</li>

and so on for each game. Here’s what I ultimately did:

println "Games for ${month}/${day}/${year}"
def url = base + "year_${year}/month_${month}/day_${day}/"
def gamePage = new URL(url)

def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/

gamePage.openStream().eachLine() { line ->
    def m = pattern.matcher(line)
    if (m) {
        away = m.group(1)  // group 1 is the away team abbrev
        home = m.group(2)  // group 2 is the home team abbrev
        num = m.group(3)   // group 3 is the num (1 or 2)
        def game = "gid_${year}_${month}_${day}_${away}mlb_${home}mlb_${num}/boxscore.xml"

        // if the game hasn't started, the box score won't be there
        // Use a try/catch block for this situation
        try {
            def boxscore = new XmlParser().parse(url + game)

            // Team names are attributes of <boxscore>
            // Run totals are attributes of the single <linescore> child of <boxscore>
            def awayName = boxscore.'@away_fname'
            def awayScore = boxscore.linescore[0].'@away_team_runs'
            def homeName = boxscore.'@home_fname'
            def homeScore = boxscore.linescore[0].'@home_team_runs'
            println awayName + " " + awayScore + ", " +  homeName + " " + homeScore +
                 " (game " + num + ")"

            // Winning and losing pitchers are in a "note" attribute of <pitcher>
            def pitchers = boxscore.pitching.pitcher
            pitchers.each { p ->
                if (p.'@note' && p.'@note' =~ /W|L|S/) {
                    println "  " + p.'@name' + " " + p.'@note'
                }
            }
        } catch (Exception e) {
            println abbrevs[away] + " at " +  abbrevs[home] + " not started yet"
        }
    }
}

At the top of my script I have a map called “abbrevs” which looks like:

def abbrevs = [atl:"Atlanta", bos:"Boston",
    sln:"St. Louis", cha:"Chicago (A)", chn:"Chicago (N)"  // ... and so on ...
]

The result is a listing like:

Games for 08/25/2007
Atlanta Braves 3, St. Louis Cardinals 0 (game 1)
Boston Red Sox 14, Chicago White Sox 2 (game 1)
    Wakefield (W, 16-10)
    Buehrle (L, 9-9)
Arizona at Chicago (N) not started yet

and so on.
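The grouping idea is easier to see in isolation. This sketch applies the same pattern to a single line of the directory listing, with no network access. The variable names are only for illustration. The away team comes first in the directory name, matching the order used when the URL was built by hand earlier.

```groovy
// One link from the day's directory listing, as in the sample above
def line = '<a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a>'
def day = '25'

// Three pairs of parentheses give three capture groups
def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/

def awayAbbrev = null
def homeAbbrev = null
def gameNum = null

def m = pattern.matcher(line)
if (m.find()) {
    awayAbbrev = m.group(1)  // first parenthesized piece
    homeAbbrev = m.group(2)  // second parenthesized piece
    gameNum    = m.group(3)  // third parenthesized piece
}
println "${awayAbbrev} at ${homeAbbrev}, game ${gameNum}"

// Groovy shortcut: indexing a matcher yields [whole match, group 1, group 2, ...]
def (whole, a, h, n) = (line =~ pattern)[0]
println whole
```

The shortcut at the end does the same work as the explicit group() calls, using Groovy's multiple assignment.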

The next step is to use a builder to convert the box score into HTML. I did that, but since this post is already getting out of hand I think I’ll save that for my next one. I also did some very rudimentary date processing so that I could get the box scores for the current date without having to hard-wire anything.
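As a rough idea of what the builder step could look like, here is a minimal sketch using groovy.xml.MarkupBuilder. It is not the version promised for the next post. The rows are made-up values standing in for data pulled from a parsed box score, and the table layout is an assumption.

```groovy
import groovy.xml.MarkupBuilder

// Made-up rows standing in for values read from <batter> elements
def rows = [[name: 'Smith', ab: 4, h: 2],
            [name: 'Jones', ab: 3, h: 0]]

def writer = new StringWriter()
def html = new MarkupBuilder(writer)

// Each method call on the builder becomes an element of the same name
html.table {
    tr { th('Batter'); th('AB'); th('H') }
    rows.each { row ->
        tr {
            td(row.name)
            td(row.ab)
            td(row.h)
        }
    }
}

def markup = writer.toString()
println markup
```

Ordinary Groovy code such as rows.each can be mixed in with the element calls, which is what makes a builder convenient for generating tables from parsed data.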

It’s amazing how much easier this is than basic Java processing, but what also makes it so cool is that I was able to use my Java knowledge to help. For example, to get the basic web page for processing I already knew about the URL class and its openStream() method. The rest I got from Groovy.
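The date processing can lean on plain Java classes in the same way. This is one possible sketch, assuming java.text.SimpleDateFormat and a hypothetical dayUrl closure; it is not necessarily how the actual script does it.

```groovy
import java.text.SimpleDateFormat

def base = 'http://gd2.mlb.com/components/game/mlb/'  // same base URL as in the script above

// Hypothetical helper: turn any Date into the URL of that day's directory
def dayUrl = { Date date ->
    def year  = new SimpleDateFormat('yyyy').format(date)
    def month = new SimpleDateFormat('MM').format(date)
    def day   = new SimpleDateFormat('dd').format(date)
    base + "year_${year}/month_${month}/day_${day}/"
}

// Today's directory, with nothing hard-wired
println dayUrl(new Date())

// A fixed date, to check the formatting against the hand-built URL
def sampleDate = new SimpleDateFormat('yyyy-MM-dd').parse('2007-08-25')
def sampleUrl = dayUrl(sampleDate)
println sampleUrl
```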

Next time I’ll get into the builder and the date processing. Then I can start developing a much more object-oriented version, which will probably contain classes called Boxscore, Pitcher, Batter, and so on.
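To give a feel for the object-oriented direction, here is a hypothetical Batter class populated from a parsed node. The class design, the fromNode method, and the sample data are all assumptions for illustration; only the name, ab, and h attributes are modeled. The import assumes Groovy 3.0 or later.

```groovy
import groovy.xml.XmlParser

// Hypothetical class: one object per <batter> element
class Batter {
    String name
    int atBats
    int hits

    // Build a Batter from a parsed <batter> node, converting the numeric attributes
    static Batter fromNode(node) {
        new Batter(name: node.'@name',
                   atBats: node.'@ab'.toInteger(),
                   hits: node.'@h'.toInteger())
    }

    String toString() { "$name went $hits for $atBats" }
}

// Made-up sample data in the same shape as the real file
def xml = '<boxscore><batting><batter name="Smith" ab="4" h="2"/></batting></boxscore>'

def batters = new XmlParser().parseText(xml).batting.batter.collect { Batter.fromNode(it) }
batters.each { println it }
```

Once the attributes are typed fields, later code can compare or total them without repeating toInteger() everywhere.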

It’s not all Gagne’s fault

I’ll keep this short and sweet. Yes, if it weren’t for Eric Gagne, the Red Sox probably would have swept all three games against the Orioles rather than lose two of three. Yes, the Yankees are playing video game baseball right now and may win no matter what we do. But somebody please, please explain to Terry Francona that it’s okay to bring on your closer in a tie game on the road? Kyle freakin’ Snyder instead of Jonathan Papelbon? That’s just insane.

Bill James actually works for the Red Sox. Can’t he sit down with Theo and Terry and explain elementary bullpen usage? Please? He’s already on the payroll — use him!

Sorry, I had to get that off my chest. This is the hardest time to live in Connecticut. I really like living here, despite my mildly deprecating comments about it (“I married a local girl, so now I’m stuck,” and such things), but there are Yankee fans everywhere. And there are Red Sox fans everywhere. To paraphrase Jeffrey Pelt in The Hunt for Red October, “it would be well to consider that having your [fans] and ours in such proximity is inherently dangerous — wars have started that way!”

If we both make the playoffs again … blech. If the Sox miss the playoffs … no, I’m not even going to go there yet.

I got Potterred

Here I am, minding my own business, digging into Struts 2.0, when an owl from Amazon.com delivered a package on my doorstep on Saturday.

There was a book inside.

To be honest, it wasn’t completely unexpected. I stopped reading the Harry Potter series after book 5, because I really didn’t like the extended scenes of cruelty and nastiness, not to mention the fact that it took over 500 pages before Harry started to fight back successfully. Still, Ginger read book 6 and we pre-ordered book 7 when it first became available.

The release of book 7 coincided with the release of the fifth movie. I was reluctant to go see it, for the same reasons, but I figured I’d give it a try. Besides, Xander was away at camp and Ginger really wanted to go.

I really liked the movie. They downplayed the torture aspects (almost too much, may Dolores Umbridge rot in hell for all eternity — yes, I know it’s fiction, but nevertheless) and did a great job on the rest of the action and character development. That meant I had to dig into book 6, knowing that book 7 was coming on Saturday.

Book 6 was really good. I finished it Saturday afternoon and started book 7 that evening, resigned to the fact that nothing productive was going to happen in my life until that book was finished.

I finally finished last night. All I can say is, wow. That was fun, and amazing, and all that. I have no idea how they’ll make a movie out of it; it felt like there was material for at least three of them.

That led me to the Wikipedia pages on the book, which were already quite complete and answered a few questions I had about horcruxes, hallows, and certain character biographies. Be sure to avoid the whole section until you’ve read the book.

I know this is a rather odd blog posting for my company site, but if anyone has tried to reach me and found me unavailable for the last few days, well, now you know.

As long as I’m off-topic, a couple of quick asides:

1. What I wouldn’t give for portkeys, or the floo network, or frankly anything to help me avoid the airlines, which are clearly in the service of the Dark Lord.

2. MyEclipse 6.0 should be out any day now. I’m really looking forward to it. I’ve been using the M1 release for almost a month, but I don’t want to get too heavily invested in it before the GA release comes out.

3. Now that the Yankees are in the midst of about 1000 games against bad teams, it’s a good thing the Red Sox picked now to go on a five-game winning streak. And who knew that Julio Lugo was a hitting machine? I hope that lasts a bit longer.

4. Xander is clearly having more fun this summer than I did at age 15. He spent a week and a half with my parents, then a couple days at home, followed by a week at camp as a counselor, a week with a friend’s family at their vacation house in Rhode Island, then overnight with friends last weekend, a party on Sunday, a sleep-over last night, and he’s got another week in camp next week. Not to mention that his end of semester grades were good enough (barely) to keep his new guitar. Yeah, he’s really got it tough.

5. Soon I’ll have to write a post about my new part-time job scoring minor league baseball games for Baseball Info Solutions. I’ve got a CT Defenders game tonight and tomorrow, though, so that will have to wait.

Burned by MLB and DRM

I know this isn’t really business related, but I thought I’d write a post about this problem just in case anybody knows a solution.

In late 2004, after the Red Sox won the World Series, I purchased the downloadable broadcasts of all of their playoff games from Major League Baseball. That came to about 20 gigs of downloads covering 14 games, each of which was playable in Windows Media Player only (the only supported player at the time).

In each case, the first time I try to play a file on a new computer, the program contacts MLB for a license file, which is then stored locally. After that I can play the files without a problem.

Well, I’m not really an “early adopter,” but that was almost three years ago and I’ve gotten new systems since then. The field of digital downloads has also, shall we say, moved on. Anyway, the other day I tried to play one of those files on my current computer, only to discover that the license download site no longer exists.

I contacted MLB about this at their 800 number. I eventually had to talk to a manager in order to find somebody who understood the situation.

He informed me that their digital download service is down and therefore unavailable. I explained that the files had already been downloaded years ago and I only needed the licenses, which I’d bought and paid for long ago.

That didn’t matter. I still need to contact the digital download service, which is not available. Worse, it’s been down all year, and, believe it or not, he had no idea when — or even if — it would ever be back up. He suggested I keep checking periodically, because he’s doing the same thing. And no, there is no one else I’m allowed to talk to about this.

The bottom line is that I now have a complete set of video files that I can’t play. I guess this is yet another example of the evils of digital rights management. Probably serves me right for going through legal channels to get them in the first place.

Any help would be greatly appreciated.

Quick moment to gloat before the Yankees start hitting again…

Everyone says you should stop and smell the roses, right? My own variation on that line is, whenever you find yourself in a good position, be sure to enjoy it, because everything changes.

Well, the Red Sox are currently 36-15, the Yankees are 21-29, and the difference between them is a whopping 14 1/2 games. That’s so cool I can hardly stand it.
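For anyone who wants to check the arithmetic, the usual games-behind formula is half the sum of the win difference and the loss difference. A tiny sketch, with a hypothetical gamesBehind closure:

```groovy
// Games behind = ((leader wins - trailer wins) + (trailer losses - leader losses)) / 2
def gamesBehind = { int leaderW, int leaderL, int trailerW, int trailerL ->
    ((leaderW - trailerW) + (trailerL - leaderL)) / 2
}

// Red Sox 36-15, Yankees 21-29
def gap = gamesBehind(36, 15, 21, 29)
println gap
```

Groovy's division of two integers produces a decimal, so the half game comes through without any casting.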

Reality (in the form of the adjusted standings at Baseball Prospectus) says that the Sox are playing slightly over their heads (1.5 games above their third-order adjusted wins) while the Yankees are way below (by a huge 7 games). That’s not going to last. Part of that is due to the Sox closing games better than the Yankees these days, but that should level out some, too. Once the Yankees start hitting the cover off the ball again, watch out.

Still, this is really a fun time to be a Sox fan. I want to make sure I enjoy it, so that later in the season after the Yankees have won 10 in a row and the Sox have dropped six, I remember this.

20 days and counting…

In honor of the first Yankees/Red Sox game of the season (okay, pre-season), let me just remind everyone that opening day is only 20 days away. Sweet.

Just to prove to myself that my son (Xander, age 14) doesn’t read my blog, I’ll let you in on a little secret. I’ve told him that we’re going out together on April 15. He doesn’t know where. Well, as an official member of Red Sox Nation (a Monster member, no less), I was able to acquire a pair of front row tickets on the Green Monster in Fenway Park.

I’ve always said to my wife that someday before I die, I was going to get monster seats. That someday is April 15. Hey, why wait, even if I did have to take out a second mortgage to afford them?

(That was an exaggeration, if only a small one.)

Don’t tell anybody about this. If the boy comes to me tomorrow and knows about the game, then somebody here said something. 😉

Baseball Tour 2006

Normally here I’d rather spend time talking about technologies I’m working with and the process I’m going through learning them, but I thought I’d take a small moment as an aside to mention the baseball tour I just finished with my son Xander, 14.

A week ago Sunday (8/6) we went to McCoy Stadium to see the Pawtucket Red Sox, the Boston AAA affiliate, generally known as the PawSox. We were very fortunate to get two tickets. The stadium was packed and enthusiastic. We had a great time, even though the PawSox lost 12 to 2.

We originally planned to spend the following weekend checking out the Lowell Spinners (the Sox’s A league affiliate) and the Portland Sea Dogs (AA for the Sox), but both were sold out. That came as quite a surprise to me; I’m not used to minor league teams being sold out, but there it was. Instead we decided at the last minute Friday morning (8/11) to make a trip south. I was able to get tickets to the Philadelphia Phillies in Citizens Bank Park. We drove all the way to Philadelphia, which I now realize is not an easy thing to do on a Friday afternoon. We left at about 12:30 pm and made it to the park at 6:30 pm for a 7 pm game. Whew. Then the game (against Cincinnati) went 14 innings (!) before the Phillies won.

We spent the night at my sister’s, then traveled to Reading on Saturday (8/12) to see the Reading Phillies take on the Harrisburg Senators. That game, too, went extra innings, but Reading won in the 10th. That stadium was rocking, too. It was practically full and loud. Probably the fact that it was Harley night didn’t hurt.

On Sunday we then went to see the Scranton/Wilkes-Barre Red Barons, who were playing, interestingly enough, Pawtucket. That was the first place we went where the crowds were small and not really involved, but we had a good time anyway. That was also the first stadium Xander had ever been to that had artificial turf. After the game the Red Barons let kids run around the bases for five minutes, so we were able to see how spongy the turf was first hand.

A long drive later we were home. That ended that particular trip, but on Tuesday we went down to see the Connecticut Defenders (the former Norwich Navigators) defeat Altoona 2 to 1. I splurged at that game, paying the extra $5 for sky box seats, which were great. They even had fireworks after the game, which were very good, except for the fact that the smoke was thick and hovered over the field, making it hard to see the fireworks after a while.

So in the end it was five games in about a week and a half, including the Phillies and their AA and AAA affiliates, the AAA affiliate for Boston (once home and once away), and the San Francisco AA affiliate (the Defenders). Most amazing, in every case the weather was absolutely perfect. Hopefully we’ll be able to say the same next year.

I’m trying not to think about the fact that the Red Sox are two games behind the Yankees, who are coming into town for five games in four days. Jason Varitek is still on the DL, as are Trot Nixon and Tim Wakefield. The pitching is very shaky these days. This could be an ugly, ugly weekend. Or maybe not.
