Monday, April 30, 2018

Regular Expressions With Go: Part 2

Overview

This is part two of a two-part series of tutorials about regular expressions in Go. In part one we learned what regular expressions are, how to express them in Go, and the basics of using the Go regexp library to match text against regular expression patterns. 

In part two, we will focus on using the regexp library to its full extent, including compiling regular expressions, finding one or more matches in the text, replacing regular expressions, grouping submatches, and dealing with new lines.

Using the Regexp Library

The regexp library provides full-fledged support for regular expressions as well as the ability to compile your patterns for more efficient execution when using the same pattern to match against multiple texts. You can also find indices of matches, replace matches, and use groups. Let's dive in.

Compiling Your Regex

There are two functions for compiling regexes: Compile() and MustCompile(). Compile() returns an error if the provided pattern is invalid, while MustCompile() panics. Compiling once is recommended if you care about performance and plan to use the same regex multiple times. Let's change our match() helper function to take a compiled regex. Note that with MustCompile() there is no error to check: if the call returns, the compiled regex is valid.

Here is how to compile and use the same compiled regex multiple times:
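The original listing is not reproduced here, so what follows is a minimal sketch rather than the author's code. The match() signature is an assumption about the helper from part one, and the pattern and sample texts are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordRe is compiled once at package level and reused for every text.
// MustCompile panics on an invalid pattern, so there is no error to check.
var wordRe = regexp.MustCompile(`[a-z]+\d*`)

// match is a hypothetical stand-in for the helper from part one.
// It reports whether the already-compiled regex matches the text.
func match(r *regexp.Regexp, text string) bool {
	return r.MatchString(text)
}

func main() {
	for _, text := range []string{"abc123", "ABC", "x"} {
		fmt.Printf("%q matches: %v\n", text, match(wordRe, text))
	}

	// Compile reports an invalid pattern as an error instead of panicking.
	if _, err := regexp.Compile(`[a-z`); err != nil {
		fmt.Println("invalid pattern:", err)
	}
}
```

Keeping the compiled regex in a package-level variable is a common way to pay the compilation cost only once.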

Finding

The Regexp object has a lot of FindXXX() methods. Some of them return the first match, others return all matches, and yet others return an index or indexes. Interestingly enough, the names of all 16 of these methods match the following regex: Find(All)?(String)?(Submatch)?(Index)?

If 'All' is present then all matches are returned vs. the leftmost one. If 'String' is present then the target text and the return values are strings vs. byte slices. If 'Submatch' is present then submatches (groups) are returned vs. just simple matches. If 'Index' is present then indexes within the target text are returned vs. the actual matches.
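To make the naming scheme concrete, here is a small sketch that calls a few of the variants on the same text. The key=value pattern and the sample text are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// keyValue matches a key=value pair and captures both sides.
var keyValue = regexp.MustCompile(`(\w+)=(\d+)`)

func main() {
	text := "a=1, bb=22"

	// No 'All': only the leftmost match, as a string.
	fmt.Println(keyValue.FindString(text)) // a=1

	// 'Index': byte offsets [start end] instead of the matched text.
	fmt.Println(keyValue.FindStringIndex(text)) // [0 3]

	// 'Submatch': the full match followed by its groups.
	fmt.Println(keyValue.FindStringSubmatch(text)) // [a=1 a 1]

	// 'All': every match; n = -1 means no limit.
	fmt.Println(keyValue.FindAllString(text, -1)) // [a=1 bb=22]

	// No 'String': the input and the result are byte slices.
	fmt.Printf("%s\n", keyValue.Find([]byte(text))) // a=1
}
```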

Let's take one of the more complex methods, FindAllStringSubmatch(), and put it to work. It takes a string and a number n. If n is -1 (or any negative number), it returns all matches. If n is a non-negative integer, it returns at most the n leftmost matches. The result is a slice of string slices.

Each entry in the result is the full match followed by the captured groups. For example, consider a list of names where some of them have titles such as "Mr.", "Mrs.", or "Dr.". Here is a regex that captures the title as a submatch and then the rest of the name after a space: \b(Mr\.|Mrs\.|Dr\.) .*.
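A minimal sketch of that search follows. The regex is the one from the text, while the list of names is invented for illustration and is not the author's original data.

```go
package main

import (
	"fmt"
	"regexp"
)

// titled captures the title as group 1; ".*" then consumes the rest of the line.
var titled = regexp.MustCompile(`\b(Mr\.|Mrs\.|Dr\.) .*`)

func main() {
	// Hypothetical input: one name per line, some with titles.
	names := "Mr. Smith\nJane Doe\nDr. Jones\nMrs. Brown"

	// n = -1 asks for all matches. Each entry is a slice holding
	// the full match at index 0 and the captured title at index 1.
	for _, m := range titled.FindAllStringSubmatch(names, -1) {
		fmt.Printf("full=%q title=%q\n", m[0], m[1])
	}
}
```

Because the dot does not match a newline by default, each match stops at the end of its own line, and "Jane Doe" produces no entry at all.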

As you can see in the output, the full match is captured first and then just the title. For each line, the search resets.

Replacing

Finding matches is great, but often you need to replace the match with something else. As usual, the Regexp object has several ReplaceXXX() methods for dealing with strings vs. byte slices and literal replacements vs. expansions. In the great book 1984 by George Orwell, the slogans of the Party are inscribed on the white pyramid of the Ministry of Truth:

  • War is Peace 
  • Freedom is Slavery 
  • Ignorance is Strength 

I found a little essay on The Price of Freedom that uses some of these terms. Let's correct a snippet of it according to the party doublespeak using Go regexes. Note that some of the target words for replacement use different capitalization. The solution is to add the case-insensitive flag (?i) at the beginning of the regex.

Since the translation is different depending on the case, we need a more sophisticated approach than literal replacement. Luckily (or by design), the Regexp object has a replace method, ReplaceAllStringFunc(), that accepts a function it uses to perform the actual replacement. Let's define our replacer function that returns the translation with the correct case.

Now, we can perform the actual replacement:
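The essay snippet and the original listings are not reproduced here, so this is a sketch of both steps under stated assumptions: the sample sentence is invented, the names partyTerms, translations, replacer, and translate are hypothetical, and "correct case" is simplified to preserving a leading capital letter.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// partyTerms matches the three target words in any capitalization.
var partyTerms = regexp.MustCompile(`(?i)\b(war|freedom|ignorance)\b`)

// translations maps each lowercase term to its Party equivalent.
var translations = map[string]string{
	"war":       "peace",
	"freedom":   "slavery",
	"ignorance": "strength",
}

// replacer returns the translation of word, capitalized if word was.
func replacer(word string) string {
	translated, ok := translations[strings.ToLower(word)]
	if !ok {
		return word // not a known term: leave it untouched
	}
	if word[0] >= 'A' && word[0] <= 'Z' {
		return strings.ToUpper(translated[:1]) + translated[1:]
	}
	return translated
}

// translate applies replacer to every match in text.
func translate(text string) string {
	return partyTerms.ReplaceAllStringFunc(text, replacer)
}

func main() {
	// Hypothetical sample sentence, not the essay quoted in the article.
	text := "Freedom is never free. The price of freedom is war on ignorance."
	fmt.Println(translate(text))
	// Slavery is never free. The price of slavery is peace on strength.
}
```

ReplaceAllStringFunc() passes each full match to the function and uses the returned string verbatim, so no $1-style expansion takes place.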

The output is somewhat incoherent, which is the hallmark of good propaganda.

Grouping

We saw how to use grouping with submatches earlier. But it is sometimes difficult to keep track of multiple submatches by position. Named groups, written as (?P<name>...), can help a lot here. Here is how to name your submatch groups and populate a map for easy access by name:
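One way to do this is sketched below, assuming a date pattern chosen for illustration. The helper name namedGroups is hypothetical, while SubexpNames() is the standard regexp method that lists the group names by index.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe names each of its three groups with the (?P<name>...) syntax.
var dateRe = regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`)

// namedGroups returns a map from group name to captured text for the
// first match of r in text, or nil if there is no match.
func namedGroups(r *regexp.Regexp, text string) map[string]string {
	match := r.FindStringSubmatch(text)
	if match == nil {
		return nil
	}
	groups := make(map[string]string)
	for i, name := range r.SubexpNames() {
		// Index 0 (the whole match) and unnamed groups have an empty name.
		if name != "" {
			groups[name] = match[i]
		}
	}
	return groups
}

func main() {
	groups := namedGroups(dateRe, "Posted on 2018-04-30")
	fmt.Println(groups["year"], groups["month"], groups["day"]) // 2018 04 30
}
```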

Dealing With New Lines

If you remember, I said that the dot special character matches any character. Well, I lied. It doesn't match the newline (\n) character by default. That means that your matches will not cross lines unless you specify it explicitly with the special flag (?s) that you can add to the beginning of your regex. Here is an example with and without the flag.
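A small sketch of the difference, using a made-up pattern and a two-line text:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Without (?s) the dot stops at a newline.
	withoutS = regexp.MustCompile(`a.*z`)
	// With (?s) the dot matches a newline too.
	withS = regexp.MustCompile(`(?s)a.*z`)
)

func main() {
	text := "a1\n2z"

	fmt.Printf("%q\n", withoutS.FindString(text)) // "" (no match)
	fmt.Printf("%q\n", withS.FindString(text))    // "a1\n2z"
}
```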

Another consideration is how to treat the ^ and $ special characters. By default they match the beginning and end of the whole text. With the (?m) flag they match the beginning and end of each line.
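The effect of (?m) can be seen by anchoring the same pattern with and without it. The pattern and text here are illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Default: ^ matches only at the beginning of the whole text.
	firstWord = regexp.MustCompile(`^\w+`)
	// With (?m): ^ also matches right after each newline.
	firstWordPerLine = regexp.MustCompile(`(?m)^\w+`)
)

func main() {
	text := "one two\nthree four"

	fmt.Println(firstWord.FindAllString(text, -1))        // [one]
	fmt.Println(firstWordPerLine.FindAllString(text, -1)) // [one three]
}
```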

Conclusion

Regular expressions are a powerful tool when working with semi-structured text. You can use them to validate textual input, clean it up, transform it, normalize it, and in general deal with a lot of diversity using concise syntax. 

Go provides a library with an easy-to-use interface that consists of a Regexp object with many methods. Give it a try, but beware of the pitfalls.

