Monday, March 12, 2018

Cleaning Up Your Data With Go: Part 2


Overview

This is part two out of two in a series on cleaning up data using Go. In part one, we covered the basic text facilities of Go and working with CSV files. In this tutorial, we'll dive into actual data cleaning. 

We'll start by understanding the problem of messy data and coming up with a strategy, and then we'll look into verifying individual fields, fixing the data where possible, and deciding what to do about missing values.

Data Cleaning Strategy

A strategy for cleaning up data should dictate what to do when encountering invalid, messy, partial, or missing data. It should also determine what level of reporting is needed about the cleanup process. 

The data we're focusing on here is tabular data, where each row is independent. There are no nested hierarchies or connections between different rows of data. A lot of real-world datasets have this nice property. 

Remove

The simplest approach for dealing with invalid data is to remove it. If any field is missing or contains invalid data, just get rid of the whole row. This is very easy, and sometimes it is the right thing to do. If the problematic field is critical and you have no way to recover it, then all you can do is drop the entire record.

Fix

The best solution is fixing the bad field. In some cases, it's easy to detect the problem and fix it. In the UFO sightings dataset, the state field must contain a valid US state abbreviation.

If the value must be all uppercase and some rows contain lowercase letters, you can just make them uppercase.

Report

Reporting on invalid rows, either dropped or fixed, is important. The organization may decide to let people try to fix dropped data. It may be necessary to run fixed data by QA to ensure the automatic fixes didn't introduce invalid data.

Stats

Collecting statistics on the cleanup process is necessary to evaluate the quality of the source data and sometimes to determine if the cleaned-up data is even worth processing. The stats can include the number of dropped and fixed rows and the number of bad and missing fields for each column.

Live Cleanup

So far I've described a pre-processing approach for data cleanup. However, it is possible to perform cleanup during processing, checking each row just before it is processed. This is useful when pre-processing adds little value, for example because nobody can fix the bad data ahead of time, or when the processing is time-sensitive.

In this scenario, the main purpose of the cleanup is to make sure bad data rows don't break the entire processing pipeline and can be skipped or fixed as necessary. 

Verifying Fields

How do you go about verifying fields? You need to know exactly what type of data is supposed to be there and sometimes what values. Here are a few examples. 

Verifying Numeric Fields

Numeric fields are very common in datasets. Beyond the type of number (integer, real, complex), some fields are more specialized. For example, a price field may be required to be positive and carry exactly two digits after the decimal point. Here is a function that checks if a string represents a price:
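
A minimal sketch of such a check, using the regexp package (the name isPrice and the exact rule are my reading of the requirement above):

    package main

    import (
        "fmt"
        "regexp"
    )

    // priceRegexp matches a positive amount with exactly two digits
    // after the decimal point, e.g. "17.99" or "0.50".
    var priceRegexp = regexp.MustCompile(`^\d+\.\d{2}$`)

    // isPrice reports whether s looks like a valid price.
    func isPrice(s string) bool {
        return priceRegexp.MatchString(s)
    }

    func main() {
        fmt.Println(isPrice("17.99"))  // true
        fmt.Println(isPrice("17.999")) // false: too many digits
        fmt.Println(isPrice("-3.50"))  // false: negative
    }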

Verifying URL Fields

Sometimes you need to go above and beyond. If you need to verify that a URL is valid, there are two approaches:

  1. Parse the URL.
  2. Try and fetch it (or at least get the headers).

If you only care if the URL is well-formed then the first approach works. But if you want to make sure the URL actually points to a real destination, you need to use the second approach. Since the second approach is a superset of the first approach, let's just use it:
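
A sketch of the second approach, combining net/url parsing with an HTTP HEAD request (production code should add a timeout and may want to distinguish network errors from 4xx/5xx responses):

    package main

    import (
        "fmt"
        "net/http"
        "net/url"
    )

    // isLiveURL reports whether rawURL is well-formed and answers a HEAD
    // request with a non-error status code.
    func isLiveURL(rawURL string) bool {
        u, err := url.ParseRequestURI(rawURL)
        if err != nil || u.Scheme == "" || u.Host == "" {
            return false
        }
        resp, err := http.Head(rawURL)
        if err != nil {
            return false
        }
        resp.Body.Close()
        return resp.StatusCode < 400
    }

    func main() {
        fmt.Println(isLiveURL("https://golang.org")) // true (if reachable)
        fmt.Println(isLiveURL("not a url"))          // false
    }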

Verifying Custom Format Fields

If the values must respect a custom format, you can usually either match it using simple functions from the strings package, like Split(), or, in more complex cases, use regular expressions. For example, if your dataset contains social security numbers (I hope not) in the format XXX-XX-XXXX, then you can split by "-" and ensure there are three tokens where the first is three digits long, the second is two digits long, and the third is four digits long. But it's more concise to use a regex like ^\d{3}-\d{2}-\d{4}$.
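
A short sketch of the regex approach for the XXX-XX-XXXX layout:

    package main

    import (
        "fmt"
        "regexp"
    )

    // ssnRegexp enforces the XXX-XX-XXXX layout exactly: three digits,
    // a dash, two digits, a dash, four digits.
    var ssnRegexp = regexp.MustCompile(`^\d{3}-\d{2}-\d{4}$`)

    func main() {
        fmt.Println(ssnRegexp.MatchString("123-45-6789")) // true
        fmt.Println(ssnRegexp.MatchString("123456789"))   // false: no dashes
    }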

Fixing Invalid Values

Fixing invalid values is not a trivial thing. If your fixing method is incorrect, you can end up with corrupt data. You should consider carefully the importance of the field, the range of possible valid values, and how confident you are that you can really fix any invalid value automatically.

Fixing Case

This is a pretty safe fix. If a text field is supposed to be all uppercase, you can fix it without risking much, because which characters were originally lowercase is not an important piece of information. There is no need to write special code, as the strings package has a ToUpper() function. There are also ToLower(), ToTitle(), and ToTitleSpecific() functions for other case mappings.
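
For example, a trivial sketch:

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        // Uppercase a state abbreviation that was entered in lowercase.
        fmt.Println(strings.ToUpper("ny")) // NY
    }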

Stripping Unwanted Characters

Another common easy fix is removing leading and trailing whitespace. You'll be surprised how many people add spaces or new lines when entering data. The strings package has a selection of TrimXXX() functions that can take care of most situations (see the sketch after the list):

  • Trim()
  • TrimFunc()
  • TrimLeft()
  • TrimLeftFunc()
  • TrimPrefix()
  • TrimRight()
  • TrimRightFunc()
  • TrimSpace()
  • TrimSuffix()
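
For example, TrimSpace() alone handles stray spaces and newlines, while Trim() and TrimPrefix() cover more specific cases (a short sketch):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        fmt.Printf("%q\n", strings.TrimSpace("  PRD-1234-5555 \n"))       // "PRD-1234-5555"
        fmt.Printf("%q\n", strings.Trim("--42--", "-"))                   // "42"
        fmt.Printf("%q\n", strings.TrimPrefix("www.example.com", "www.")) // "example.com"
    }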

Dropping Invalid Characters

In some cases, it's OK to drop invalid characters. I recommend only doing it for non-critical and optional fields. For example, you may have a description or notes field that contains free text, and you want to make sure it doesn't contain certain symbols like single or double quotes. Here is how to do it:
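
One way is with strings.Map(), which drops a character when the mapping function returns a negative rune (a sketch; the function name dropChars is a placeholder):

    package main

    import (
        "fmt"
        "strings"
    )

    // dropChars removes every rune that appears in cutset from s.
    func dropChars(s, cutset string) string {
        return strings.Map(func(r rune) rune {
            if strings.ContainsRune(cutset, r) {
                return -1 // a negative rune drops the character
            }
            return r
        }, s)
    }

    func main() {
        notes := `He said "great", 'would buy again'`
        fmt.Println(dropChars(notes, `"'`)) // He said great, would buy again
    }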

Fixing Numeric Values

Numeric values are often easy to fix. If you require a precision of two decimal digits, you can truncate or round additional digits. In the same fashion, it's easy to convert integers to floating-point numbers. 

Sometimes, there is a range of valid values, and you can bring too large or too small numbers to fit the range. The following function takes a string and a range of integers and returns a string that represents an integer within the range. Too large values become the maximum value, and too small become the minimum value.
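
A sketch of such a function (the name and exact signature are assumptions):

    package main

    import (
        "fmt"
        "strconv"
    )

    // clampInt parses s as an integer and forces it into [min, max].
    // The boolean result is false if s is not an integer at all.
    func clampInt(s string, min, max int) (string, bool) {
        n, err := strconv.Atoi(s)
        if err != nil {
            return s, false
        }
        if n < min {
            n = min
        }
        if n > max {
            n = max
        }
        return strconv.Itoa(n), true
    }

    func main() {
        fmt.Println(clampInt("150", 0, 100)) // 100 true
        fmt.Println(clampInt("-7", 0, 100))  // 0 true
        fmt.Println(clampInt("abc", 0, 100)) // abc false
    }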

Fixing URL Values

URLs can often be fixed safely by trying different schemes ("http" or "https") or adding or dropping "www" sub-domains. Combining the options with trying to fetch the candidates can give you confidence that the fix was correct. 
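
A sketch of that idea (the fixURL name and candidate list are my own; real code should add a timeout and cap the number of outgoing requests):

    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    // fixURL tries common scheme and "www" variants of host and returns
    // the first candidate that answers a HEAD request successfully.
    func fixURL(host string) (string, bool) {
        host = strings.TrimPrefix(host, "www.")
        candidates := []string{
            "https://" + host,
            "https://www." + host,
            "http://" + host,
            "http://www." + host,
        }
        for _, c := range candidates {
            resp, err := http.Head(c)
            if err != nil {
                continue
            }
            resp.Body.Close()
            if resp.StatusCode < 400 {
                return c, true
            }
        }
        return "", false
    }

    func main() {
        fmt.Println(fixURL("example.com"))
    }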

Dealing With Missing Values

Missing values are very common when ingesting real-world data. If the missing value is required, there are two primary ways to handle it (without rejecting the row altogether)—use default values or recover the value from an alternative source.

Applying Default Values

Default values are helpful because the processing code doesn't have to check whether a value is present. The data cleaning code ensures that there is always a value in place. In many cases, the default is so common that it also helps during data entry, sparing people from typing the same value again and again.
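
A minimal sketch of applying a default:

    package main

    import (
        "fmt"
        "strings"
    )

    // withDefault returns value unless it is empty (or only whitespace),
    // in which case it returns def.
    func withDefault(value, def string) string {
        if strings.TrimSpace(value) == "" {
            return def
        }
        return value
    }

    func main() {
        fmt.Println(withDefault("", "N/A"))       // N/A
        fmt.Println(withDefault("widget", "N/A")) // widget
    }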

Using Alternative Data

This approach is a little more involved. The idea is to consult another data source that has the requested information. For example, if you have a user's email, but the first and last name is missing, you may consult your user database and extract the user's name. This saves the processing code from accessing the DB or even being aware of this dependency.
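
A sketch of the idea, using an in-memory map in place of a real user database (the names and the email are hypothetical):

    package main

    import "fmt"

    // nameByEmail stands in for a real user database lookup.
    var nameByEmail = map[string]string{
        "jane@example.com": "Jane Smith",
    }

    // fillName returns name if present; otherwise it tries to recover it
    // from the alternative source keyed by email.
    func fillName(email, name string) string {
        if name != "" {
            return name
        }
        if found, ok := nameByEmail[email]; ok {
            return found
        }
        return name // still missing; the caller decides what to do
    }

    func main() {
        fmt.Println(fillName("jane@example.com", "")) // Jane Smith
    }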

Putting Everything Together

Let's clean up a little dataset of products. The fields are:

Column Name    Column Description
Id             PRD-XXXX-XXXX (where X is a digit)
Name           up to 40 characters long
Price          fixed-precision numeric field (two decimal places)
Description    up to 500 characters long (optional)
Here is the dataset in a readable form (whitespace will be trimmed during cleanup):
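
An illustrative stand-in for that dataset: the product ids and their problems match the description below, while the names, prices, and descriptions are made-up placeholder values.

    Id,Name,Price,Description
    PRD-1234-5555,Vacuum cleaner,129.99,Cordless stick vacuum
    PRD-1234-6666, Coffee grinder ,34.50,
    PRD-1234-666,Toaster,19.99,Two-slice toaster
    PRD-1234-7777,Blender,,High-speed blender
    prd-1234-8888,Kettle,24.00,Electric kettle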

The first two products are valid. The third product, "PRD-1234-666", is missing a digit in its id. The next product, "PRD-1234-7777", is missing a price. The last product, "prd-1234-8888", has an invalid product id, but it can be safely fixed (make it uppercase). 

The following code will clean up the data, fix what can be fixed, drop the rows that can't be fixed, and produce a clean dataset and a report that can be used to manually correct the invalid data.

To verify the product id and the price, I'll use regular expressions. Here are the two helper functions:
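
A sketch of two such helpers (the names are placeholders; the patterns follow the field table above):

    package main

    import (
        "fmt"
        "regexp"
    )

    var (
        // productIDRegexp matches PRD-XXXX-XXXX where X is a digit.
        productIDRegexp = regexp.MustCompile(`^PRD-\d{4}-\d{4}$`)
        // priceRegexp matches a non-negative amount with two decimal places.
        priceRegexp = regexp.MustCompile(`^\d+\.\d{2}$`)
    )

    func isValidProductID(id string) bool {
        return productIDRegexp.MatchString(id)
    }

    func isValidPrice(price string) bool {
        return priceRegexp.MatchString(price)
    }

    func main() {
        fmt.Println(isValidProductID("PRD-1234-5555")) // true
        fmt.Println(isValidProductID("PRD-1234-666"))  // false: missing digit
        fmt.Println(isValidPrice("19.99"))             // true
        fmt.Println(isValidPrice(""))                  // false: missing
    }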

Once the data is cleaned up and all the invalid rows of data have been dropped, the following function will write the clean data to a new CSV file called "clean.csv" and print it to the screen.
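
A sketch of that function, with error handling kept minimal:

    package main

    import (
        "encoding/csv"
        "io"
        "os"
    )

    // writeCleanData writes records to "clean.csv" and echoes them to the
    // screen by writing to both destinations at once.
    func writeCleanData(records [][]string) error {
        f, err := os.Create("clean.csv")
        if err != nil {
            return err
        }
        defer f.Close()

        w := csv.NewWriter(io.MultiWriter(f, os.Stdout))
        return w.WriteAll(records) // WriteAll flushes and reports any error
    }

    func main() {
        demo := [][]string{
            {"Id", "Name", "Price", "Description"},
            {"PRD-1234-5555", "Vacuum cleaner", "129.99", "Cordless stick vacuum"},
        }
        if err := writeCleanData(demo); err != nil {
            panic(err)
        }
    }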

The main() function does most of the work. It iterates over the original dataset, eliminates redundant whitespace, fixes what it can, keeps track of dropped data rows, writes the clean data to file, and finally reports on the dropped lines.
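
A sketch of such a main(); the input filename dirty.csv is an assumption, and it relies on the isValidProductID(), isValidPrice(), and writeCleanData() sketches above:

    package main

    import (
        "encoding/csv"
        "fmt"
        "log"
        "os"
        "strings"
    )

    func main() {
        // The input filename is an assumption; adjust it to your dataset.
        f, err := os.Open("dirty.csv")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        rows, err := csv.NewReader(f).ReadAll()
        if err != nil {
            log.Fatal(err)
        }

        var clean [][]string
        var dropped []int

        for i, row := range rows {
            if i == 0 {
                clean = append(clean, row) // keep the header line as-is
                continue
            }
            // Eliminate redundant whitespace in every field.
            for j := range row {
                row[j] = strings.TrimSpace(row[j])
            }
            // Fix what can be fixed: lowercase product ids.
            row[0] = strings.ToUpper(row[0])
            // Drop rows that still fail validation.
            if len(row) != 4 ||
                !isValidProductID(row[0]) ||
                len(row[1]) == 0 || len(row[1]) > 40 ||
                !isValidPrice(row[2]) ||
                len(row[3]) > 500 {
                dropped = append(dropped, i+1) // 1-based line number for the report
                continue
            }
            clean = append(clean, row)
        }

        if err := writeCleanData(clean); err != nil {
            log.Fatal(err)
        }
        fmt.Println("dropped lines:", dropped)
    }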

Conclusion

Go has well-designed packages for text processing. Unlike in most languages, a Go string is essentially just a read-only slice of bytes. All the string handling logic lives in separate packages such as "strings" and "strconv".

In the second part of the tutorial, we utilized a lot of concepts to accomplish a common real-world task of cleaning up a CSV formatted dataset before analysis.

