Using Groovy to determine Unicode characters

(Technically speaking, this post doesn’t require Groovy. You could do the same thing in Java. Still, as usual, Groovy is easier.)

I’m teaching a Groovy course this week and having a great time doing it. One of the exercises I put together is to create a concordance, which is a map relating individual words to the lines in a file on which they appear. The program is a variation on a similar one shown in Barclay and Savage’s Groovy Programming book, and is a good illustration of how easy it is to work with maps in Groovy.

A concordance needs to be based on some text somewhere, so I decided to use section 8.00 of the Official Rules of Baseball, which deals with the pitcher.

(And even after reading it again, I couldn’t really explain what a balk is and what it isn’t, but so be it.)

I copied the text from the web page and pasted it into a text file. Then the exercise code reads the file line by line, breaks each line into words, and then adds them as keys to a map where the values are lists of line numbers. It’s a good example of using eachWithIndex, and map.get(word,[]) + 1, and so on. Once we’ve made each line lower case (so that ‘Pitcher’ and ‘pitcher’ are the same) and coerced the list values to Sets in order to eliminate duplicates, we’re pretty close to a reasonable solution.
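For readers who haven't seen the exercise, here is a minimal sketch of the concordance idea. It is not the lab solution: the two sample lines are made up so the script runs without a file, and the `index + 1` is my assumption about how the zero-based index from eachWithIndex becomes a line number.

```groovy
// Hypothetical sample lines standing in for the rules text
def lines = [
    'The pitcher shall take signs from the catcher',
    'The Pitcher may not balk'
]

def concordance = [:]
lines.eachWithIndex { line, index ->
    // Lower-case first so 'Pitcher' and 'pitcher' share one key
    line.toLowerCase().tokenize().each { word ->
        // get(key, default) supplies an empty list for a new word;
        // index is zero-based, so add 1 to get a line number
        concordance[word] = concordance.get(word, []) + (index + 1)
    }
}

// Coerce each list of line numbers to a Set to drop duplicates
concordance = concordance.collectEntries { word, nums -> [(word): nums as Set] }

assert concordance['pitcher'] == ([1, 2] as Set)
println concordance
```

The word 'the' appears twice on the first line, so its list picks up a duplicate 1 before the Set coercion removes it.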

The passage is filled with punctuation, however. Fortunately, the tokenize() method that Groovy adds to String is overloaded to take a String argument representing the delimiters. Most of the delimiters are obvious and no problem at all (i.e., " .,;:()\'\"").
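As a quick illustration of tokenize() with a delimiter string, here is a small sketch. The sample sentence is invented, and it sticks to plain ASCII punctuation.

```groovy
// Space, period, comma, semicolon, colon, parentheses, and both ASCII quotes
def delimiters = " .,;:()'\""

// Hypothetical sample sentence, not taken from the rules text
def line = 'The pitcher (right-handed) said: "Balk," and stopped.'

// Every character in the delimiter string acts as a separator
def words = line.tokenize(delimiters)

println words
// prints [The, pitcher, right-handed, said, Balk, and, stopped]
```

Note that the hyphen isn't in the delimiter string, so 'right-handed' stays in one piece.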

It turns out, however, that the passage also includes sections like:

Pitchers are constantly attempting to “beat the rule” in their efforts to hold runners on bases and in cases where the pitcher fails to make a complete “stop” called for in the rules, the umpire should immediately call a “Balk.”

which use so-called “smart” quotes. They don’t match the double-quotes in my delimiter string. In other places, there are also possessives that use “smart” apostrophes. How can I add those to my delimiters?

What I needed was the Unicode equivalents for the punctuation. If I know the Unicode values, I can add them as hex values to my delimiters string, like \uXXXX.

After some discussions, I decided to parse the entire passage character by character, and add all non-word characters to a map with their Unicode values. The code looks like this:


def delimiters = [:]
def data = new File('pitcherrules.txt').text
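data.each { c ->
    // c is a one-character String; take its char explicitly so the coercion
    // yields the character code rather than relying on how String-to-int
    // coercion behaves in a given Groovy version
    def str = Integer.toHexString(c.charAt(0) as int)
    // keep only the non-word characters
    if (!(c =~ /\w/)) {
        delimiters[c] = str
    }
}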
println delimiters

It’s pretty straightforward once you know what to look for. Java supplies the Integer.toHexString() method, which takes an int. I read the entire passage into the data variable, then iterated over it, passing each character to the toHexString method. The key was to coerce the character to an int; otherwise I got a MissingMethodException.

I originally had a different expression in the if statement. I was using (c < 'A' || c > 'z') instead. The result included the numbers 0 to 9, because the digits sort before 'A'. By matching against a regular expression consisting of \w, though, I check for all word characters, which is equivalent to [A-Za-z0-9_] (letters, digits, and the underscore).
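To make the difference between the two tests concrete, here is a small sketch that runs both against a handful of sample characters. The samples are my own picks, not characters from the rules text.

```groovy
// Letters, a digit, an underscore, a caret, a space, and a left smart quote
def samples = ['a', 'Z', '7', '_', '^', ' ', '\u201c']

// Original test: anything outside the range 'A'..'z' counts as a delimiter
def byRange = samples.findAll { c -> c < 'A' || c > 'z' }

// Regex test: anything that is not a word character counts as a delimiter
def byRegex = samples.findAll { c -> !(c =~ /\w/) }

// The range test wrongly flags the digit and misses the caret,
// which sits between 'Z' and 'a' in the character table
assert byRange == ['7', ' ', '\u201c']
assert byRegex == ['^', ' ', '\u201c']
```

Both tests leave the underscore alone, but for different reasons: it falls inside the 'A'..'z' range, and it is also part of \w.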

The output of the delimiter-finding script is

[" ":"20", ":":"3a", ".":"2e", "\r":"d", "\n":"a", ",":"2c",
"(":"28", ")":"29", "’":"2019", "“":"201c", "”":"201d",
"-":"2d", "—":"2014", ";":"3b"]

which tells me that the Unicode values I need are \u2019, \u201c, \u201d, and \u2014.
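With those values in hand, the delimiter string can be extended with Unicode escapes. Here is a sketch of how that might look. The second sample phrase is invented, and I've written the smart punctuation in the samples as escapes too so the script doesn't depend on the file's encoding.

```groovy
// ASCII delimiters plus right single quote, left and right double quotes, and the long dash
def delimiters = " .,;:()'\"\u2019\u201c\u201d\u2014"

// Tail of the quoted passage, with the smart quotes written as escapes
def quoted = 'the umpire should immediately call a \u201cBalk.\u201d'
assert quoted.tokenize(delimiters) ==
    ['the', 'umpire', 'should', 'immediately', 'call', 'a', 'Balk']

// Hypothetical possessive using a smart apostrophe
def possessive = 'the pitcher\u2019s glove'
assert possessive.tokenize(delimiters) == ['the', 'pitcher', 's', 'glove']
```

One side effect worth noticing: treating the apostrophe as a delimiter leaves a stray 's' token behind for every possessive, so a real concordance might want to filter out one-letter leftovers.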

It’s only a small part of a larger problem, but it’s an easy, useful, interesting script that was probably as much of a learning experience as the original lab. It’s all good. 🙂

Now the real question is how much of this will actually render properly in this blog post.
