Friday, April 27, 2018

Regular Expressions With Go: Part 1

Overview

Regular expressions (AKA regex) are a formal language for describing patterns in sequences of characters. In practice, they solve many problems involving semi-structured text: you can extract the important pieces from text that is surrounded by decoration or unrelated content. Go's standard library includes the regexp package, which lets you slice and dice text with regexes.

In this two-part series, you'll learn what regular expressions are and how to use regular expressions effectively in Go to accomplish many common tasks. If you're not familiar with regular expressions at all, there are lots of great tutorials. Here is a good one.

Understanding Regular Expressions

Let's start with a quick example. You have some text, and you want to check if it is an email address. The address format was specified rigorously in RFC 822 (since superseded by RFC 5322). In short, an address has a local part followed by an @ symbol followed by a domain. Assume the candidate address has already been separated from the rest of the text, for example by splitting on whitespace.

To check whether a candidate is an email address, the following regex will do: ^\w+@\w+\.\w+$. Note that this regex is only an approximation: it lets some invalid addresses through and rejects some valid ones (for example, addresses with a dot in the local part). But it's good enough to demonstrate the concept. Let's try it on a couple of potential email addresses before explaining how it works:
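The original sample is not reproduced here, so the following is a minimal sketch with four hypothetical candidates chosen to mirror the cases discussed next. The helper name isEmail and the addresses themselves are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailRe is the simplified pattern from the text; it is not RFC-complete.
var emailRe = regexp.MustCompile(`^\w+@\w+\.\w+$`)

// isEmail reports whether the whole string matches the simplified pattern.
func isEmail(s string) bool {
	return emailRe.MatchString(s)
}

func main() {
	// Hypothetical candidates, one per case.
	candidates := []string{
		"jane@example",      // no dot in the domain
		"jane@example.",     // nothing after the dot
		"jane@example.com",  // well formed
		"jane@@example.com", // two @ symbols
	}
	for _, c := range candidates {
		fmt.Printf("%-20s %v\n", c, isEmail(c))
	}
}
```

Only the third candidate should print true.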

Our regular expression works on this little sample. The first two addresses were rejected because the domain didn't have a dot or didn't have any characters after the dot. The third address was well formed and matched. The last candidate was rejected because it had two @ symbols.

Let's break this regex down: ^\w+@\w+\.\w+$

Character/Symbol Meaning
^ Beginning of the target text
\w Any word character [0-9A-Za-z_]
+ One or more repetitions of the preceding character or group
@ Literally the @ character 
\. The literal dot character. Must be escaped with \
$ End of target text

Altogether, this regex matches text that consists entirely of one or more word characters, followed by the "@" character, followed by one or more word characters, followed by a dot, followed by one or more word characters.

Dealing With Special Characters

The following characters have special meanings in regular expressions: .+*?()|[]{}^$\. We have already seen many of them in the email example. If we want to match them literally, we need to escape them with a backslash. Let's introduce a little helper function called match() that will save us a lot of typing. It takes a pattern and some text, uses the regexp.Match() function to match the pattern against the text (after converting the text to a byte slice), and prints the results:
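The original helper is not shown, so here is a minimal sketch of what such a match() function might look like. Returning the result as a bool (in addition to printing it) and reporting invalid patterns are assumptions added here to make the helper easier to reuse.

```go
package main

import (
	"fmt"
	"regexp"
)

// match reports whether pattern is found anywhere in text and prints the result.
// An invalid pattern is reported and treated as "no match".
func match(pattern, text string) bool {
	matched, err := regexp.Match(pattern, []byte(text))
	if err != nil {
		fmt.Printf("invalid pattern %q: %v\n", pattern, err)
		return false
	}
	fmt.Printf("%q matches %q: %v\n", pattern, text, matched)
	return matched
}

func main() {
	match("abc", "xxabcxx") // found in the middle of the text
	match("abd", "xxabcxx") // not found
}
```

Note that regexp.Match looks for the pattern anywhere in the text; add ^ and $ to the pattern when the whole text must match.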

Here's an example of matching a regular character like z vs. matching a special character like ?:
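As a rough illustration, the sketch below matches a plain z, an escaped question mark, and an unescaped one. The sample sentence and the two-value match helper are hypothetical; the unescaped ? is expected to fail to compile because a repetition operator needs something to repeat.

```go
package main

import (
	"fmt"
	"regexp"
)

// match reports whether pattern is found in text, or an error if the
// pattern is not a valid regular expression.
func match(pattern, text string) (bool, error) {
	return regexp.MatchString(pattern, text)
}

func main() {
	text := "Can I haz cheezburger?" // hypothetical sample text

	// A regular character matches itself.
	fmt.Println(match("z", text))

	// A special character must be escaped. In a double-quoted Go string
	// the backslash itself is written as two backslashes.
	fmt.Println(match("\\?", text))

	// Unescaped, ? is a repetition operator with nothing to repeat,
	// so the pattern is rejected with an error.
	_, err := match("?", text)
	fmt.Println("unescaped ? is invalid:", err != nil)
}
```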

The regex pattern \? contains a backslash that must be escaped with another backslash when written as an interpreted (double-quoted) Go string. The reason is that the backslash is also used to escape special characters in Go strings, like newline (\n). If you want to match the backslash character itself, you'll need four backslashes!

The solution is to use Go raw strings, delimited by backticks (`) instead of double quotes, in which backslashes have no special meaning to the compiler. Escapes such as \n still work in a raw-string pattern because the regexp package interprets them itself. The one thing a raw string cannot contain is a backtick, so a pattern that needs one must go back to an interpreted string and its doubled backslashes.
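To make the difference concrete, here is a small sketch comparing interpreted and raw string literals for the same patterns. The isMatch helper and the sample texts are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// isMatch reports whether pattern is found anywhere in text.
// MustCompile panics on an invalid pattern, which is fine for fixed literals.
func isMatch(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	// Both literals denote the same two-character pattern: \?
	fmt.Println("\\?" == `\?`) // true

	// Matching one literal backslash: four backslashes in an
	// interpreted string, two in a raw string.
	fmt.Println(isMatch("\\\\", `C:\temp`)) // true
	fmt.Println(isMatch(`\\`, `C:\temp`))   // true

	// In a raw string, \n reaches the regexp engine as backslash + n,
	// and the engine itself treats that escape as a newline.
	fmt.Println(isMatch(`one\ntwo`, "one\ntwo")) // true
}
```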

Placeholders and Repetitions

In most cases, you aren't trying to match a specific literal sequence like "abc", but a sequence of unknown length with some known characters somewhere in it. Regexes support this use case with the dot (.) special character, which stands for any single character (by default, any character except a newline). The * special character repeats the previous character (or group) zero or more times. Combined as .*, they match any text on a line, including empty text, because the pattern simply means zero or more characters. The + is very similar to *, but it matches one or more of the previous character or group. So .+ matches any non-empty text.
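The following sketch exercises the dot, * and + on a few hypothetical inputs. The isMatch helper is an assumption, and the patterns are anchored with ^ and $ so that the whole text has to match.

```go
package main

import (
	"fmt"
	"regexp"
)

// isMatch reports whether pattern is found anywhere in text.
func isMatch(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	// .* allows zero characters between a and z; .+ requires at least one.
	fmt.Println(isMatch(`^a.*z$`, "az"))   // true
	fmt.Println(isMatch(`^a.*z$`, "abcz")) // true
	fmt.Println(isMatch(`^a.+z$`, "az"))   // false
	fmt.Println(isMatch(`^a.+z$`, "abcz")) // true

	// .* matches empty text, .+ does not.
	fmt.Println(isMatch(`^.*$`, "")) // true
	fmt.Println(isMatch(`^.+$`, "")) // false

	// By default the dot does not match a newline; the s flag changes that.
	fmt.Println(isMatch(`^a.c$`, "a\nc"))     // false
	fmt.Println(isMatch(`(?s)^a.c$`, "a\nc")) // true
}
```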

Using Boundaries

There are three types of boundaries: the start of the text denoted by ^, the end of the text denoted by $, and the word boundary denoted by \b. For example, consider this text from the classic movie The Princess Bride: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die." If you match just "father", you get a match, but if you anchor it to the end of the text with the $ character (father$), there is no match because more text follows. On the other hand, matching "Hello" at the beginning (^Hello) works well.
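A short sketch of the start and end anchors follows. It assumes the quote begins with "Hello." and uses a hypothetical isMatch helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// isMatch reports whether pattern is found anywhere in text.
func isMatch(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	// Assumed wording of the quote, starting with "Hello."
	text := "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

	fmt.Println(isMatch(`father`, text))  // true: found somewhere in the text
	fmt.Println(isMatch(`father$`, text)) // false: the text does not end with it
	fmt.Println(isMatch(`^Hello`, text))  // true: the text starts with it
	fmt.Println(isMatch(`die\.$`, text))  // true: the text ends with "die."
}
```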

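A word boundary is the position between a word character and a non-word character, or between a word character and the start or end of the text. You can start and/or end a pattern with \b. Note that punctuation marks like commas are not word characters, so they form a boundary and are not part of the word. Here are a few examples: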
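The original examples are not available, so the sketch below uses a hypothetical isMatch helper and an assumed wording of the quote to show \b at the start, at the end, and on both sides of a pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// isMatch reports whether pattern is found anywhere in text.
func isMatch(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	// Assumed wording of the quote.
	text := "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

	fmt.Println(isMatch(`\bname\b`, text))   // true: a whole word
	fmt.Println(isMatch(`\bnam\b`, text))    // false: "nam" is followed by "e"
	fmt.Println(isMatch(`\bkill`, text))     // true: "killed" starts with "kill"
	fmt.Println(isMatch(`\bkill\b`, text))   // false: "kill" is not a whole word here
	fmt.Println(isMatch(`ed\b`, text))       // true: "killed" ends with "ed"
	fmt.Println(isMatch(`\bfather\b`, text)) // true: the period is not part of the word
}
```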

Using Classes

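It's often useful to treat a whole set of characters as a unit, such as all digits, all whitespace characters, or all alphanumeric characters. Go supports the POSIX (ASCII) character classes listed below. A class name is only valid inside a bracket expression, so the digit class is written [[:digit:]] in a pattern. The classes are: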

Character Class Meaning
[:alnum:] alphanumeric (≡ [0-9A-Za-z])
[:alpha:] alphabetic (≡ [A-Za-z])
[:ascii:] ASCII (≡ [\x00-\x7F])
[:blank:] blank (≡ [\t ])
[:cntrl:] control (≡ [\x00-\x1F\x7F])
[:digit:] digits (≡ [0-9])
[:graph:] graphical (≡ [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
[:lower:] lower case (≡ [a-z])
[:print:] printable (≡ [ -~] == [ [:graph:]])
[:punct:] punctuation (≡ [!-/:-@[-`{-~])
[:space:] whitespace (≡ [\t\n\v\f\r ])
[:upper:] upper case (≡ [A-Z])
[:word:] word characters (≡ [0-9A-Za-z_])
[:xdigit:] hex digit (≡ [0-9A-Fa-f])

In the following example, I'll use the [:digit:] class (written [[:digit:]] in a pattern) to look for numbers in the text. I also show how to require an exact number of repetitions by putting the count in curly braces after the class.
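A minimal sketch of both ideas follows. The function names and sample texts are hypothetical. Because the {3} pattern is not anchored, it also matches inside a longer run of digits.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Exactly three consecutive digits somewhere in the text.
	threeDigits = regexp.MustCompile(`[[:digit:]]{3}`)
	// One or more consecutive digits.
	number = regexp.MustCompile(`[[:digit:]]+`)
)

// hasThreeDigits reports whether s contains three digits in a row.
func hasThreeDigits(s string) bool {
	return threeDigits.MatchString(s)
}

// findNumbers returns every run of digits in s, in order.
func findNumbers(s string) []string {
	return number.FindAllString(s, -1)
}

func main() {
	fmt.Println(hasThreeDigits("call 555 now"))  // true
	fmt.Println(hasThreeDigits("call 55 now"))   // false
	fmt.Println(hasThreeDigits("call 5555 now")) // true: unanchored, so a longer run also matches

	fmt.Println(findNumbers("Order 66 shipped 3 items")) // [66 3]
}
```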

You can define your own classes too by putting characters in square brackets. For example, if you want to check whether some text is a valid DNA sequence that contains only the characters ACGT, use the ^[ACGT]*$ regex:
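Here is a small sketch of the DNA check. The isDNA name and the sample sequences are assumptions. Note that, because of the *, the empty string also counts as valid; replace * with + if at least one base is required.

```go
package main

import (
	"fmt"
	"regexp"
)

// dnaRe matches text made up only of the characters A, C, G and T.
var dnaRe = regexp.MustCompile(`^[ACGT]*$`)

// isDNA reports whether s contains only the characters ACGT.
func isDNA(s string) bool {
	return dnaRe.MatchString(s)
}

func main() {
	fmt.Println(isDNA("GATTACA"))  // true
	fmt.Println(isDNA("GATTACA!")) // false: "!" is not in the class
	fmt.Println(isDNA("gattaca"))  // false: the class is case sensitive
	fmt.Println(isDNA(""))         // true: * allows zero characters
}
```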

Using Alternatives

In some cases, there are multiple viable alternatives. HTTP URLs, for example, begin with a scheme that is either http:// or https://. The pipe character | lets you choose between alternatives. Because | has the lowest precedence of all regex operators, the alternatives must be grouped together: (http|https)://\w+\.\w{2,}. Writing (http)|(https)://\w+\.\w{2,} instead would match either the bare text http or a complete https URL. The grouped regex matches text that contains http:// or https://, followed by at least one word character, followed by a dot, followed by at least two word characters.
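Alternation binds more loosely than concatenation, so where the parentheses go matters. The sketch below compares a pattern that groups the two schemes together with one that wraps each scheme separately; the variable names and sample inputs are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// The alternatives are grouped, so "://..." follows either scheme.
	grouped = regexp.MustCompile(`(http|https)://\w+\.\w{2,}`)
	// Here | splits the whole pattern: bare "http" OR a full https URL.
	ungrouped = regexp.MustCompile(`(http)|(https)://\w+\.\w{2,}`)
)

// isURL reports whether s contains an http or https URL (simplified).
func isURL(s string) bool {
	return grouped.MatchString(s)
}

// looseMatch shows what the ungrouped pattern accepts.
func looseMatch(s string) bool {
	return ungrouped.MatchString(s)
}

func main() {
	fmt.Println(isURL("http://example.com"))  // true
	fmt.Println(isURL("https://example.com")) // true
	fmt.Println(isURL("ftp://example.com"))   // false
	fmt.Println(isURL("https://x.y"))         // false: fewer than two characters after the dot

	// The ungrouped pattern is satisfied by the bare word "http".
	fmt.Println(isURL("http is a protocol"))      // false
	fmt.Println(looseMatch("http is a protocol")) // true
}
```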

Conclusion

In this part of the tutorial, we covered a lot of ground and learned a lot about regular expressions, with hands-on examples using Go's regexp package. We focused on pure matching and how to express our intentions using regular expressions.

In part two, we will focus on using regular expressions to work with text, including fuzzy finding, replacements, grouping, and dealing with new lines.

