Groovier Box Scores

I made a couple more fixes to my box scores script to make it a bit groovier. First is a trivial one, but it’s much more in the Groovy idiom than in Java.

I replaced

def cal = Calendar.getInstance()

with

def cal = Calendar.instance

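Groovy automatically calls the getter when you use property syntax, whether the getter comes from a Java class (like Calendar.getInstance()) or is one that Groovy generates for you. A field declared in a Groovy class without an access modifier becomes a property: the backing field is private and Groovy supplies a public getter and setter, which is much more intuitive than Java’s “package-private” default. Of course, methods are public by default.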
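Here’s a minimal sketch of that property access in action. The Game class and its fields are hypothetical, made up purely for illustration, and are not part of the box scores script.

```groovy
// Hypothetical class, invented for this example only
class Game {
    String away    // no access modifier: private field plus generated getter and setter
    String home

    // Hand-written getter with no backing field; it still reads like a property
    String getMatchup() { "${away} at ${home}" }
}

def game = new Game(away: 'bos', home: 'cha')

assert game.away == game.getAway()     // property syntax goes through the generated getter
assert game.matchup == 'bos at cha'    // and through hand-written getters as well

// Static getters on plain Java classes can be read the same way
assert Calendar.instance instanceof Calendar
```

The named-argument constructor comes for free as well, since Groovy calls the generated setters for each name in the map.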

The other modification I made had to do with the fact that I was concerned about reading the remote XML file line by line. I thought it might be more appropriate to read the entire file into a local variable and then parse the file.

To do that, I found that Groovy gives the URL class a getText() method (or, more in the Groovy spirit, a text property). That meant I could read the entire page by writing

def gamePage = new URL(url).text

Now the matching can be done all at once via

def m = gamePage =~ pattern

which results in a collection of matches. The only complication is that the pattern I’m searching for (/${day}_(\w*)mlb_(\w*)mlb_(\d)/) appears twice in each line, once as the text value of the <a> tag and once as its href attribute. I figured the easiest way to deal with that was to use eachWithIndex and only process every other match:

def m = gamePage =~ pattern
if (m) {
    (0..<m.count).eachWithIndex { line, i ->
      if (i % 2) {
          away = m[line][1]
          home = m[line][2]
          num = m[line][3]

etc. The rest is essentially the same.
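To see the matcher indexing in isolation, here’s a small self-contained sketch. The two-line page excerpt is made up (modeled on the link format in the original post), and stepping through the matches two at a time is just one alternative to checking i % 2, not what my script actually does.

```groovy
// Hypothetical excerpt of the day's index page: each game appears twice per line
def day = '25'
def gamePage = '''
<li><a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a></li>
<li><a href="gid_2007_08_25_bosmlb_chamlb_1/">gid_2007_08_25_bosmlb_chamlb_1/</a></li>
'''

def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/
def m = gamePage =~ pattern

def games = []
// m[i] is a list: the whole match followed by the three captured groups
(0..<m.count).step(2) { i ->
    def (whole, away, home, num) = m[i]
    games << [away: away, home: home, num: num]
}

assert m.count == 4    // two links, each matched in the href and in the link text
games.each { println "${it.away} at ${it.home} (game ${it.num})" }
```

Each entry in games ends up as a small map, which is easy to feed into the URL-building code.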

A good source for figuring out the Groovy way to do things is the PLEAC Groovy page. It rocks.

Groovy Box Scores (minor correction)

I noticed while running the Groovy code I posted the other day that I had accidentally reversed home and away. It’s not critical, because I still got the URL right, but it’s better to be right.

The fix was just to switch the groups:

away = m.group(1)
home = m.group(2)

and then to update the ${away} and ${home} in the URL link for the individual games.

I’m not sure that the best way to go is to use the eachLine method on the open stream, either. It’s probably better to download the whole page and then process it. I’m not sure how eachLine is working under the hood. If it’s sending a new HTTP request per line, it’s going to be pretty slow.
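For what it’s worth, openStream() opens a single connection and eachLine simply reads lines from that one stream, so there is no extra request per line. The sketch below uses an in-memory stream as a stand-in for the network so it can run anywhere; the sample text is made up.

```groovy
def page = "line one\nline two\nline three"

// Stand-in for new URL(url).openStream(): one stream, read from start to finish
def stream = new ByteArrayInputStream(page.getBytes('UTF-8'))
def lines = []
stream.eachLine { line -> lines << line }    // reads line by line from the same stream

// Stand-in for new URL(url).text: the same bytes, returned as a single String
def whole = new ByteArrayInputStream(page.getBytes('UTF-8')).text

assert lines == ['line one', 'line two', 'line three']
assert whole == page
```

So the choice between the two is more about convenience than speed: line by line is fine when each line can be handled on its own, while the whole-page String is handier when one regular expression needs to see everything.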

I also did some very rudimentary Date processing, always an ugly and awkward thing in Java. The URLs for each game need the day, month, and year, where the day and month have two digits and the year has four. Just to keep things simple, I did it this way:

def cal = Calendar.getInstance()
def year = cal.get(Calendar.YEAR)
def m = cal.get(Calendar.MONTH) + 1  // Ugly off-by-one correction
def d = cal.get(Calendar.DAY_OF_MONTH)
def month = (m < 10)? "0" + m : m
def day = (d < 10) ? "0" + d : d

Now I can run the script without arguments and it checks on the status of the current day’s games. I’ll update it soon so that I can enter a date, but dates are always awkward so I’m hesitating. When I turn all this into a web app (probably using Grails), I’ll try to insert some calendar widget with some Ajaxy goodness.
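The zero-padding can also be done without the ternary operators. This is only a sketch of two alternatives (padLeft and String.format); the date is pinned to a made-up value so the result is predictable, whereas the real script would leave the calendar on today’s date.

```groovy
def cal = Calendar.instance
cal.set(2007, Calendar.AUGUST, 5)    // fixed date for the example only

def year  = cal.get(Calendar.YEAR)
// Option 1: pad the number's string form on the left with zeros
def month = (cal.get(Calendar.MONTH) + 1).toString().padLeft(2, '0')
// Option 2: let a format string do the padding
def day   = String.format('%02d', cal.get(Calendar.DAY_OF_MONTH))

def path = "year_${year}/month_${month}/day_${day}/".toString()
assert path == 'year_2007/month_08/day_05/'
```

Either way month and day are always Strings, rather than a String for single digits and an Integer otherwise.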

Groovy Box Scores

Long ago I decided the best thing about Ruby on Rails was Ruby. Ruby is a great language, with a friendly community and lots of samples to learn from. Still, it’s quite a radical change from Java, which is the language where I am most comfortable.

That brought me to Grails, on a journey I’ve discussed here before. Since Rails taught me about Ruby, I suspected that the coolest aspect of Grails was going to be Groovy. As I spend more and more time with both Groovy and Grails, I’m not sure I want to downplay Grails while praising Groovy, but Groovy sure is a lot of fun.

As we get deep into the baseball pennant races, I’ve been spending more time on the game online. Recently, to my surprise, I discovered that MLB actually makes the box scores from each game available online in XML format. In other words, just by processing some XML, I can access whatever game data I like.

I found out about this from the interesting book Baseball Hacks, by Joseph Adler. I got the book when it came out but didn’t get very far into it because the language of choice in the book was Perl. I’m really not a Perl hacker by any means, so I kind of lost interest. Then I saw the hacks on accessing data online, and I was hooked all over again.

Processing XML with Java is never a fun thing to do. The programming model is awkward at best, and filled with indirection (you have to get a factory to get a DocumentBuilder / SAXParser / Transformer, then set properties on it, then get the object you really wanted, etc). Then getting the element you wanted isn’t terribly fun, either. It wasn’t until Java 5 that the standard library finally included an XPath processor.
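Just to illustrate the indirection, here’s a sketch of the JAXP route. It’s written in Groovy syntax to keep it short, but it uses only the standard Java classes. The one-element document is a made-up stand-in for a real box score.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory

// Made-up, minimal stand-in for a box score document
def xml = '<boxscore><linescore away_team_runs="14" home_team_runs="2"/></boxscore>'

// Factory -> builder -> document
def factory = DocumentBuilderFactory.newInstance()
def builder = factory.newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))

// Another factory -> XPath object -> finally, the value
def xpath = XPathFactory.newInstance().newXPath()
def awayRuns = xpath.evaluate('/boxscore/linescore/@away_team_runs', doc)

assert awayRuns == '14'
```

That’s two factories and three intermediate objects to read one attribute, and the Java version would add type declarations and checked exceptions on top.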

(Incidentally, I usually say that my two least favorite things to do in programming are debugging JavaScript and traversing DOM trees in Java. Ajax gives me the chance to do both at the same time! Fortunately, JavaScript is much friendlier to XML than Java is, and the great Ajax libraries like Prototype, Dojo, and Scriptaculous make everything easier. But I digress…)

The JSP Standard Tag Library (JSTL) makes all of that much easier, too. Not only are the tags simple (imports and transforms and the like), but you can just use a JavaScript-like EL dot notation to traverse the tree. Unfortunately, though, that doesn’t seem to have made its way into Java yet.

Enter Groovy. Since the data is online already in XML form, I wondered how I could access it in Groovy. It turns out that accessing and parsing the data takes about two lines:

def url = ... // whatever the url is, online or otherwise
def boxscore = new XmlParser().parse(url)

and we’re done. (Note that an XmlSlurper works just as well as an alternative.)

Traversing the resulting tree is also trivial.

To see an example, I was going to paste a box score here, but it’s probably just as easy to see it online. Here’s a link to the box score for the game Boston at Chicago from today, which the Red Sox won 14 — 2.

(Gee, I wonder why I picked that game? :))

The Baseball Hacks book shows how to examine all files of that form using Perl. Since I’m trying to learn more Groovy, I’m redoing the examples. Of course, since Groovy is object-oriented, the next step will be to create actual classes and objects out of these things, not just live on functional programming, but that will come later.

Here’s a snippet to grab the box score and do some basic processing with it.

// Just a sample for the moment:
def year = '2007'
def month =  '08'
def day = '25'
def num = '1' // 1 for single game, 1 or 2 for double header
def away = 'bos' // Boston at Chicago (A), the game linked above
def home = 'cha'

// Build the URL
def base = ''
def url = base + "year_${year}/month_${month}/day_${day}/"
url += "gid_${year}_${month}_${day}_${away}mlb_${home}mlb_${num}/boxscore.xml"

// Read and parse the box score
def boxscore = new XmlParser().parse(url)

// Collect all the <batter> elements inside all the <batting> elements
def batters = boxscore.batting.batter
for (b in batters) {
    println b.'@name' + ' went ' + b.'@h' + ' for ' + b.'@ab'
}
println batters.size() + " total batters"
println 'Total hits: ' + batters.'@h'*.toInteger().sum()

println "Batters with at least one hit:"
println batters.findAll {
    it.'@h'.toInteger() > 0
}.collect {
    it.'@name' + '(' + it.'@h' + ')'
}

Note how easy it is to access child elements and even attributes (prefaced by the @ sign). I also love the spread operator (*.) which allows me to grab the “hits” attribute of each batter, convert them all into integers, and then add them up. I also get to use closures to find all the batters with at least one hit, collect them into a list, and print their names. There may be a more elegant (read “groovier”) way to do that, but this worked for me.
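To try those operations without hitting the network, here’s a sketch that runs them against an inline document. The XML is a made-up, heavily trimmed box score (the player names and numbers are placeholders, not real data), and the import assumes Groovy 3 or later; on older versions XmlParser lives in groovy.util and needs no import.

```groovy
import groovy.xml.XmlParser    // assumption: Groovy 3+; older versions auto-import groovy.util.XmlParser

// Made-up miniature box score with placeholder batters
def xml = '''
<boxscore away_fname="Boston Red Sox" home_fname="Chicago White Sox">
  <batting team_flag="away">
    <batter name="Pedroia" ab="5" h="3"/>
    <batter name="Ortiz" ab="4" h="0"/>
  </batting>
  <batting team_flag="home">
    <batter name="Konerko" ab="4" h="1"/>
  </batting>
</boxscore>
'''.trim()

def boxscore = new XmlParser().parseText(xml)

// Child elements by name, across both <batting> elements
def batters = boxscore.batting.batter

// '@h' on the list gives every h attribute; *. converts each to an integer
def totalHits = batters.'@h'*.toInteger().sum()

// Closures: filter, then transform
def withHits = batters.findAll { it.'@h'.toInteger() > 0 }
                      .collect { it.'@name' + '(' + it.'@h' + ')' }

assert batters.size() == 3
assert totalHits == 4
assert withHits == ['Pedroia(3)', 'Konerko(1)']
```

Attribute values always come back as Strings, which is why the toInteger() calls are needed before doing any arithmetic or comparisons.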

The URL for each individual game is in a directory corresponding to the string above that begins with “gid”. The parent directory for that date lists all the games for that day. In order to process all the games for a given date, somehow I need a list of those directories.

Adler does the Perl equivalent of screen scraping to get those values. In other words, he basically reads the HTML page and looks for the link tags that have that href in them. Of course, as a Perl hacker, he uses regular expressions.

I’m a relatively normal Java programmer, which means I’ve spent most of my career avoiding regular expressions unless absolutely necessary. One of my absolute favorite programming quotes is in the Groovy in Action book (GinA), p. 76:

Once a programmer had a problem. He thought he could solve it with a regular expression. Now he had two problems.

That slays me. Unfortunately (or not, since I really do need to learn this stuff), the best way I could find to solve the same problem was still a regular expression. Since regexes have been a part of Java for a couple of versions now, it’s high time I got better at them, especially if I want to make any progress in Groovy.

It took some time for me to realize it, but the key to making my program work was “grouping”. I hadn’t realized that if you put parentheses in a regular expression, you can easily get at the grouped values. In this particular case, the base URL for the day is a web page that contains a series of links in the form I want:

<a href="gid_2007_08_25_atlmlb_slnmlb_1/">gid_2007_08_25_atlmlb_slnmlb_1/</a>

and so on for each game. Here’s what I ultimately did:

println "Games for ${month}/${day}/${year}"
def url = base + "year_${year}/month_${month}/day_${day}/"
def gamePage = new URL(url)

def pattern = ~/${day}_(\w*)mlb_(\w*)mlb_(\d)/

gamePage.openStream().eachLine() { line ->
    def m = pattern.matcher(line)
    if (m) {
        home = m.group(1)  // group 1 is the home team abbrev
        away = m.group(2)  // group 2 is the away team abbrev
        num = m.group(3)   // group 3 is the num (1 or 2)
        def game = "gid_${year}_${month}_${day}_${home}mlb_${away}mlb_${num}/boxscore.xml"

        // if the game hasn't started, the box score won't be there
        // Use a try/catch block for this situation
        try {
            def boxscore = new XmlParser().parse(url + game)

            // Team names are attributes of <boxscore>
            // Run totals are attributes of the single <linescore> child of <boxscore>
            def awayName = boxscore.'@away_fname'
            def awayScore = boxscore.linescore[0].'@away_team_runs'
            def homeName = boxscore.'@home_fname'
            def homeScore = boxscore.linescore[0].'@home_team_runs'
            println awayName + " " + awayScore + ", " +  homeName + " " + homeScore +
                 " (game " + num + ")"

            // Winning and losing pitchers are in a "note" attribute of <pitcher>
            def pitchers = boxscore.pitching.pitcher
            pitchers.each { p ->
                if (p.'@note' && p.'@note' =~ /W|L|S/) {
                    println "  " + p.'@name' + " " + p.'@note'
                }
            }
        } catch (Exception e) {
            println abbrevs[away] + " at " +  abbrevs[home] + " not started yet"
        }
    }
}

At the top of my script I have a map called “abbrevs” which looks like:

def abbrevs = [atl:"Atlanta", bos:"Boston",
    sln:"St. Louis", cha:"Chicago (A)", chn:"Chicago (N)"  // ... and so on ...
]

The result is a listing like:

Games for 08/25/2007
Atlanta Braves 3, St. Louis Cardinals 0 (game 1)
Boston Red Sox 14, Chicago White Sox 2 (game 1)
    Wakefield (W, 16-10)
    Buehrle (L, 9-9)
Arizona at Chicago (N) not started yet

and so on.

The next step is to use a builder to convert the box score into HTML. I did that, but since this post is already getting out of hand I think I’ll save that for my next one. I also did some very rudimentary date processing so that I could get the box scores for the current date without having to hard-wire anything.

It’s amazing how much easier this is than basic Java processing, but what also makes it so cool is that I was able to use my Java knowledge to help. For example, to get the basic web page for processing I already knew about the URL class and its openStream() method. The rest I got from Groovy.

Next time I’ll get into the builder and the date processing. Then I can start developing a much more object-oriented version, which will probably contain classes called Boxscore, Pitcher, Batter, and so on.

Green Monster time

Xander and I had our trip to Fenway today, with my first time ever sitting on the Green Monster.  The weather was great (low 80s, partly cloudy, with a nice breeze), the seats were excellent (Section 5, Row 1, Seats 9 and 10 — basically middle of the front row), and the Sox scored six runs in the bottom of the 1st on their way to an 8-4 win.

I’ll probably say more about this later, but in the meantime I finally have a picture to upload here.  The guy sitting next to us had an iPhone and was kind enough to take this photo and email it to me.

Green Monster seats at Fenway

The photo here is a thumbnail view.  Click it to see full size.  I’m the one on the left. 😉

Everything would have been perfect, but when we got home we saw the Sox go ahead 5-4 in the 8th inning of the nightcap, only to see Eric Gagne blow the save in the top of the 9th. Oh well. With the Yankees’ win today, the lead is back to 5 games.

It’s not all Gagne’s fault

I’ll keep this short and sweet. Yes, if it wasn’t for Eric Gagne, the Red Sox probably would have swept all three games against the Orioles rather than lose two of three. Yes, the Yankees are playing video game baseball right now and may win no matter what we do. But somebody please, please explain to Terry Francona that it’s okay to bring on your closer in a tie game on the road? Kyle freakin’ Snyder instead of Jonathan Papelbon? That’s just insane.

Bill James actually works for the Red Sox. Can’t he sit down with Theo and Terry and explain elementary bullpen usage? Please? He’s already on the payroll — use him!

Sorry, I had to get that off my chest. This is the hardest time to live in Connecticut. I really like living here, despite my mildly deprecating comments about it (“I married a local girl, so now I’m stuck,” and such things), but there are Yankee fans everywhere. And there are Red Sox fans everywhere. To paraphrase Jeffrey Pelt in The Hunt for Red October, “it would be well to consider that having your [fans] and ours in such proximity is inherently dangerous — wars have started that way!”

If we both make the playoffs again … blech. If the Sox miss the playoffs … no, I’m not even going to go there yet.
