(Technically speaking, this post doesn’t require Groovy. You could do the same thing in Java. Still, as usual, Groovy is easier.)
I’m teaching a Groovy course this week and having a great time doing it. One of the exercises I put together is to create a concordance, which is a map relating individual words to the lines in a file on which they appear. The program is a variation on a similar one shown in Barclay and Savage’s Groovy Programming book, and is a good illustration of how easy it is to work with maps in Groovy.
A concordance needs to be based on some text somewhere, so I decided to use section 8.00 of the Official Rules of Baseball, which deals with the pitcher.
(And even after reading it again, I couldn’t really explain what a balk is and what it isn’t, but so be it.)
I copied the text from the web page and pasted it into a text file. The exercise code then reads the file line by line, breaks each line into words, and adds them as keys to a map whose values are lists of line numbers. It’s a good example of using eachWithIndex, map.get(word,[]) + 1, and so on. Once we’ve made each line lower case (so that ‘Pitcher’ and ‘pitcher’ are the same) and coerced the list values to Sets in order to eliminate duplicates, we’re pretty close to a reasonable solution.
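The steps just described can be sketched roughly as follows. This is not the actual exercise solution: the sample text is made up and stands in for the rules file, and the variable names are only illustrative.

```groovy
// Hypothetical sample text standing in for the file of pitching rules
def text = '''The pitcher shall take signs
from the catcher while standing
on the rubber. The Pitcher may not balk.'''

def concordance = [:]
text.readLines().eachWithIndex { line, index ->
    // Lower-case each line so 'Pitcher' and 'pitcher' share one key
    line.toLowerCase().tokenize(' .').each { word ->
        // get(key, default) returns the default list when the key is missing;
        // index is zero-based, so add 1 to get a line number
        concordance[word] = concordance.get(word, []) + (index + 1)
    }
}
// Coerce each list of line numbers to a Set to drop duplicates
concordance = concordance.collectEntries { word, lines -> [word, lines as Set] }

println concordance
```

Here the word "the" appears twice on the third line, and the coercion to a Set is what keeps that line number from being listed twice.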
The passage is filled with punctuation, however. Fortunately, the tokenize() method in String is overloaded to take a String argument representing the delimiters. Most of the delimiters are obvious and no problem at all (i.e., " .,;:()\'\").
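As a small illustration of tokenize() with a delimiter string, here is a sketch using the plain ASCII delimiters mentioned above. The sample line is invented for the example.

```groovy
// Space, period, comma, semicolon, colon, parentheses, and both ASCII quotes
def delimiters = " .,;:()'\""

// Hypothetical line of text, not taken from the rules
def line = 'The pitcher (on the rubber) said: "balk"; it was not.'

// Every character in the delimiter string acts as a separator
def words = line.toLowerCase().tokenize(delimiters)
println words
```

Each character in the argument is treated as its own delimiter, so no empty tokens show up where several punctuation marks sit next to each other.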
It turns out, however, that the passage also includes sections like:
Pitchers are constantly attempting to “beat the rule” in their efforts to hold runners on bases and in cases where the pitcher fails to make a complete “stop” called for in the rules, the umpire should immediately call a “Balk.”
which are using so-called “smart” quotes. They don’t match the double-quotes in my delimiter string. In other places, there are also possessives which use “smart” apostrophes. How can I add those to my delimiters?
What I needed was the Unicode equivalents for the punctuation. If I know the Unicode values, I can add them as hex values to my delimiters string, like \uXXXX.
After some discussions, I decided to parse the entire passage character by character, and add all non-word characters to a map with their Unicode values. The code looks like this:
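    def delimiters = [:]
    def data = new File('pitcherrules.txt').text
    // String.each passes every character to the closure as a one-character String
    data.each { c ->
        // keep only the non-word characters
        if (!(c =~ /\w/)) {
            // cast the underlying char (not the String) to int to get its code point
            delimiters[c] = Integer.toHexString((int) c.charAt(0))
        }
    }
    println delimiters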
It’s pretty straightforward once you know what to look for. Java supplies the Integer.toHexString() method, which takes an int. I read the entire passage into the data variable, then iterated over it, passing each character to the toHexString method. The key was to coerce the character to an int, otherwise I get a MissingMethodException.
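To see the coercion issue in isolation, here is a minimal sketch. It assumes the character arrives as a one-character String, and it casts the underlying char rather than the String itself; the variable names are illustrative.

```groovy
// Left "smart" double quote, held as a one-character String
def quote = '\u201c'

// Cast the underlying char to int before handing it to toHexString(int)
def hex = Integer.toHexString((int) quote.charAt(0))
println hex

// Passing the String directly fails: there is no toHexString(String)
def failed = false
try {
    Integer.toHexString(quote)
} catch (MissingMethodException e) {
    failed = true
}
println failed
```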
I originally had a different expression in the if statement: (c < 'A' || c > 'z'). The result included the numbers 0 to 9. By matching against a regular expression consisting of \w instead, I check for all word characters, which is equivalent to [A-Za-z0-9_].
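The difference between the two tests shows up on a handful of sample characters. This is only a sketch, and the list of characters is chosen for illustration.

```groovy
// A digit, an underscore, a letter, a smart quote, and a semicolon
def chars = ['7', '_', 'a', '\u201c', ';']

// Range test: flags anything outside 'A'..'z', which includes the digits
def byRange = chars.findAll { c -> c < 'A' || c > 'z' }

// Regex test: flags anything that is not a word character
def byRegex = chars.findAll { c -> !(c =~ /\w/) }

println byRange
println byRegex
```

The digit is reported as punctuation by the range test but not by the regular expression, which is the behavior described above.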
The output of the code is
[" ":"20", ":":"3a", ".":"2e", "\r":"d", "\n":"a", ",":"2c",
"(":"28", ")":"29", "’":"2019", "“":"201c", "”":"201d",
"-":"2d", "—":"2014", ";":"3b"]
which tells me that the Unicode values I need are \u2019, \u201c, \u201d, and \u2014.
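With those four values in hand, the delimiter string might be extended along these lines. This is a sketch rather than the final lab solution; the sample text is a fragment of the passage quoted earlier.

```groovy
// The ASCII delimiters plus the right single quote, both smart double quotes,
// and the em dash, written as Unicode escapes
def delimiters = " .,;:()'\"" + "\u2019\u201c\u201d\u2014"

// Fragment of the quoted passage, with its smart quotes written as escapes
def line = 'the umpire should immediately call a \u201cBalk.\u201d'

def words = line.toLowerCase().tokenize(delimiters)
println words
```

Note that treating \u2019 as a delimiter also splits possessives, so a word ending in a smart apostrophe followed by "s" yields a stray "s" token.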
It’s only a small part of a larger problem, but it’s an easy, useful, interesting script that was probably as much of a learning experience as the original lab. It’s all good. 🙂
Now the real question is how much of this will actually render properly in this blog post.