Long ago I decided the best thing about Ruby on Rails was Ruby. Ruby is a great language, with a friendly community and lots of samples to learn from. Still, it’s quite a radical change from Java, which is the language where I am most comfortable.
That brought me to Grails, on a journey I’ve discussed here before. Since Rails taught me about Ruby, I suspected that the coolest aspect of Grails was going to be Groovy. As I spend more and more time with both Groovy and Grails, I’m not sure I want to downplay Grails while praising Groovy, but Groovy sure is a lot of fun.
As we get deep into the baseball pennant races, I’ve been spending more time on the game online. Recently, to my surprise, I discovered that MLB actually makes the box scores from each game available online in XML format. In other words, just by processing some XML, I can access whatever game data I like.
I found out about this from the interesting book Baseball Hacks, by Joseph Adler. I got the book when it came out but didn’t get very far into it because the language of choice in the book was Perl. I’m really not a Perl hacker by any means, so I kind of lost interest. Then I saw the hacks on accessing data online, and I was hooked all over again.
Processing XML with Java is never a fun thing to do. The programming model is awkward at best, and filled with indirection (you have to get a factory to get a DocumentBuilder / SAXParser / Transformer, then set properties on it, then get the object you really wanted, etc). Then getting the element you wanted isn’t terribly fun, either. It wasn’t until Java 5 that the standard library finally included an XPath processor.
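To make that indirection concrete, here is a minimal sketch of the standard JAXP steps. It is written in Groovy syntax only to keep one language throughout this post; the calls themselves are the plain Java APIs. The XML is a made-up fragment that only loosely resembles a box score, and the player names are placeholders.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Invented sample data (not a real box score)
def xml = '''<boxscore>
  <batting>
    <batter name="Player A" ab="5" h="3"/>
    <batter name="Player B" ab="4" h="0"/>
  </batting>
</boxscore>'''

// Factory -> builder -> document: three steps before any data is visible
def factory = DocumentBuilderFactory.newInstance()
def builder = factory.newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))

// A second factory is needed to get an XPath object
def xpath = XPathFactory.newInstance().newXPath()
def nodes = xpath.evaluate('//batter', doc, XPathConstants.NODESET)

// org.w3c.dom.NodeList has no iterator in the Java API, so index into it
def names = []
for (i in 0..<nodes.length) {
    names << nodes.item(i).getAttribute('name')
}
println names
```

In real Java each of those steps also needs explicit types and checked-exception handling, which is where most of the tedium comes from.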
(Incidentally, I usually say that my two least favorite things to do in programming are debugging JavaScript and traversing DOM trees in Java. Ajax gives me the chance to do both at the same time! Fortunately, JavaScript is much friendlier to XML than Java is, and the great Ajax libraries like Prototype, Dojo, and Scriptaculous make everything easier. But I digress…)
The JSP Standard Tag Library (JSTL) makes all of that much easier, too. Not only are the tags simple (imports and transforms and the like), but you can just use a JavaScript-like EL dot notation to traverse the tree. Unfortunately, though, that doesn’t seem to have made its way into Java yet.
Enter Groovy. Since the data is online already in XML form, I wondered how I could access it in Groovy. It turns out that accessing and parsing the data takes about two lines:
    def url = ... // whatever the url is, online or otherwise
    def boxscore = new XmlParser().parse(url)
and we’re done. (Note that an XmlSlurper works just as well as an alternative.)
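As a rough side-by-side of the two parsers, here is a small sketch that runs both over the same inline XML instead of a live URL. The XML and the names in it are invented for illustration. The imports are for Groovy 3 and later, where both classes live in groovy.xml; in older versions they are in groovy.util and need no import.

```groovy
import groovy.xml.XmlParser
import groovy.xml.XmlSlurper

// Invented sample data (not a real box score)
def xml = '''<boxscore away_fname="Away Team" home_fname="Home Team">
  <batting>
    <batter name="Player A" ab="5" h="3"/>
    <batter name="Player B" ab="4" h="0"/>
  </batting>
</boxscore>'''

def parsed  = new XmlParser().parseText(xml)    // tree of Node objects
def slurped = new XmlSlurper().parseText(xml)   // lazily evaluated GPathResult

// The navigation looks the same for both
def parserNames  = parsed.batting.batter.collect { it.'@name' }
def slurperNames = slurped.batting.batter.collect { it.'@name'.text() }

// XmlParser hands back attribute values as Strings; XmlSlurper hands back
// GPathResult objects, so call text() or toInteger() before comparing them
def parserHits  = parsed.batting.batter.collect { it.'@h'.toInteger() }
def slurperHits = slurped.batting.batter.collect { it.'@h'.toInteger() }

println parserNames
println slurperHits
```

The main practical difference in this sketch is the extra text() call on the slurper side.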
Traversing the resulting tree is also trivial.
To see an example, I was going to paste a box score here, but it’s probably just as easy to see it online. Here’s a link to the box score for the game Boston at Chicago from today, which the Red Sox won 14-2.
(Gee, I wonder why I picked that game? :))
The Baseball Hacks book shows how to examine all files of that form using Perl. Since I’m trying to learn more Groovy, I’m redoing the examples. Of course, since Groovy is object-oriented, the next step will be to create actual classes and objects out of these things, not just live on functional programming, but that will come later.
Here’s a snippet to grab the box score and do some basic processing with it.
    // Just a sample for the moment:
    def year = '2007'
    def month = '08'
    def day = '25'
    def num = '1'     // 1 for single game, 1 or 2 for double header
    def away = 'bos'  // away team abbreviation (Boston)
    def home = 'cha'  // home team abbreviation (Chicago White Sox)

    // Build the URL
    def base = 'http://gd2.mlb.com/components/game/mlb/'
    def url = base + "year_${year}/month_${month}/day_${day}/"
    url += "gid_${year}_${month}_${day}_${away}mlb_${home}mlb_${num}/boxscore.xml"

    // Read and parse the box score
    def boxscore = new XmlParser().parse(url)

    // Collect all the <batter> elements inside all the <batting> elements
    def batters = boxscore.batting.batter
    for (b in batters) {
        println b.'@name' + ' went ' + b.'@h' + ' for ' + b.'@ab'
    }
    println batters.size() + " total batters"
    println 'Total hits: ' + batters.'@h'*.toInteger().sum()
    println "Batters with at least one hit:"
    println batters.findAll { it.'@h'.toInteger() > 0 }.collect { it.'@name' + '(' + it.'@h' + ')' }
Note how easy it is to access child elements and even attributes (prefaced by the @ sign). I also love the spread operator (*.) which allows me to grab the “hits” attribute of each batter, convert them all into integers, and then add them up. I also get to use closures to find all the batters with at least one hit, collect them into a list, and print their names. There may be a more elegant (read “groovier”) way to do that, but this worked for me.
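For anyone who wants to try those operators without hitting the network, here is a small self-contained sketch over inline XML. The data is invented, and the import is for Groovy 3 and later (older versions find XmlParser in groovy.util without an import). It also shows the longhand collect call that the spread operator stands in for.

```groovy
import groovy.xml.XmlParser

// Invented sample data (not a real box score)
def xml = '''<boxscore>
  <batting>
    <batter name="Player A" ab="5" h="3"/>
    <batter name="Player B" ab="4" h="0"/>
    <batter name="Player C" ab="3" h="1"/>
  </batting>
</boxscore>'''

def batters = new XmlParser().parseText(xml).batting.batter

// Asking a list of nodes for an attribute gives a list of String values
def hits = batters.'@h'

// Spread operator: call toInteger() on every element, then sum the results
def totalHits = batters.'@h'*.toInteger().sum()

// The same thing written out with collect
def longhand = batters.collect { it.'@h'.toInteger() }.sum()

// findAll filters with a closure; collect maps each survivor to a String
def withHits = batters.findAll { it.'@h'.toInteger() > 0 }
                      .collect { it.'@name' + ' (' + it.'@h' + ')' }

println "Total hits: ${totalHits}"
println withHits.join(', ')
```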
The box score for each individual game lives in a directory whose name is the part of the string above that begins with “gid”. The parent directory for that date lists all the games for that day. In order to process all the games for a given date, somehow I need a list of those directories.
Adler does the Perl equivalent of screen scraping to get those values. In other words, he basically reads the HTML page and looks for the link tags that have that href in them. Of course, as a Perl hacker, he uses regular expressions.
I’m a relatively normal Java programmer, which means I’ve spent most of my career avoiding regular expressions unless absolutely necessary. One of my absolute favorite programming quotes is in the Groovy in Action book (GinA), p. 76:
Once a programmer had a problem. He thought he could solve it with a regular expression. Now he had two problems.
That slays me. Unfortunately (or not, since I really do need to learn this stuff), the best way I could find to solve the same problem was still a regular expression. Since regexes have been a part of Java for a couple of versions now, it’s high time I got better at them, especially if I want to make any progress in Groovy.
It took some time for me to realize it, but the key to making my program work was “grouping”. I hadn’t realized that if you put parentheses in a regular expression, you can easily get at the grouped values. In this particular case, the base URL for the day is a web page that contains a series of links in the form I want:
<li>
<a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a>
</li>
and so on for each game. Here’s what I ultimately did:
    println "Games for ${month}/${day}/${year}"
    def url = base + "year_${year}/month_${month}/day_${day}/"
    def gamePage = new URL(url)
    def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/
    gamePage.openStream().eachLine() { line ->
        def m = pattern.matcher(line)
        if (m) {
            away = m.group(1)  // group 1 is the away team abbrev
            home = m.group(2)  // group 2 is the home team abbrev
            num = m.group(3)   // group 3 is the num (1 or 2)
            def game = "gid_${year}_${month}_${day}_${away}mlb_${home}mlb_${num}/boxscore.xml"

            // if the game hasn't started, the box score won't be there
            // Use a try/catch block for this situation
            try {
                def boxscore = new XmlParser().parse(url + game)

                // Team names are attributes of <boxscore>
                // Run totals are attributes of the single <linescore> child of <boxscore>
                def awayName = boxscore.'@away_fname'
                def awayScore = boxscore.linescore[0].'@away_team_runs'
                def homeName = boxscore.'@home_fname'
                def homeScore = boxscore.linescore[0].'@home_team_runs'
                println awayName + " " + awayScore + ", " + homeName + " " + homeScore + " (game " + num + ")"

                // Winning and losing pitchers are in a "note" attribute of <pitcher>
                def pitchers = boxscore.pitching.pitcher
                pitchers.each { p ->
                    if (p.'@note' && p.'@note' =~ /W|L|S/) {
                        println " " + p.'@name' + " " + p.'@note'
                    }
                }
            } catch (Exception e) {
                println abbrevs[away] + " at " + abbrevs[home] + " not started yet"
            }
        }
    }
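The grouping idea is easier to see in isolation. Here is a minimal sketch that applies the same pattern to the sample link line shown earlier, with no network access. The variable names are only for illustration, and the second half uses Groovy’s find operator (=~) with indexing as an alternative to calling group() directly.

```groovy
def day = '25'
def line = '<a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a>'

// Each pair of parentheses in the pattern defines a numbered group
def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/

// Plain java.util.regex style: find a match, then ask for each group
def m = pattern.matcher(line)
def found = m.find()
def awayAbbrev = m.group(1)   // first team in the gid is the away team
def homeAbbrev = m.group(2)   // second team is the home team
def gameNum    = m.group(3)   // 1, or 2 for the second game of a double header

// Groovy style: indexing a matcher returns [whole match, group 1, group 2, ...]
def m2 = (line =~ pattern)
def (whole, a, h, n) = m2[0]

println "${awayAbbrev} at ${homeAbbrev}, game ${gameNum}"
println whole
```

Group 0 is always the whole match, which is why the team abbreviations start at index 1.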
At the top of my script I have a map called “abbrevs” which looks like:
    def abbrevs = [atl:"Atlanta", bos:"Boston", sln:"St. Louis",
                   cha:"Chicago (A)", chn:"Chicago (N)"
                   // ... and so on ...
    ]
The result is a listing like:
    Games for 08/25/2007
    Atlanta Braves 3, St. Louis Cardinals 0 (game 1)
    Boston Red Sox 14, Chicago White Sox 2 (game 1)
     Wakefield (W, 16-10)
     Buehrle (L, 9-9)
    Arizona at Chicago (N) not started yet
and so on.
The next step is to use a builder to convert the box score into HTML. I did that, but since this post is already getting out of hand I think I’ll save that for my next one. I also did some very rudimentary date processing so that I could get the box scores for the current date without having to hard-wire anything.
It’s amazing how much easier this is than basic Java processing, but what also makes it so cool is that I was able to use my Java knowledge to help. For example, to get the basic web page for processing I already knew about the URL class and its openStream() method. The rest I got from Groovy.
Next time I’ll get into the builder and the date processing. Then I can start developing a much more object-oriented version, which will probably contain classes called Boxscore, Pitcher, Batter, and so on.