Thursday, December 14, 2017

Testing Data-Intensive Code With Go, Part 4

Overview

This is part four of five in a tutorial series on testing data-intensive code with Go. In part three, I covered testing against a complex local data layer that includes a relational DB and a Redis cache.

In this tutorial, I'll go over testing against remote data stores using shared test databases, using production data snapshots, and generating your own test data.

Testing Against Remote Data Stores

So far, all our tests have run locally. Sometimes, that's not enough. You may need to test against data that is hard to generate or obtain locally. The test data may be very large or change frequently (e.g., a production data snapshot).

In these cases, it may be too slow and expensive for each developer to copy the latest test data to their machine. Sometimes the test data is sensitive, and developers, remote ones in particular, shouldn't keep it on their laptops.

There are several options here to consider. You may use one or more of these options in different situations.

Shared Test Database

This is a very common option. There is a shared test database that all developers can connect to and test against. This shared test DB is managed as a shared resource and often gets populated periodically with some baseline data, and then developers can run tests against it that query the existing data. They may also create, update, and delete their own test data.

In this case, you need a lot of discipline and a good process in place. If two developers run the same test at the same time, and that test creates and deletes the same objects, then one or both runs are likely to fail. Note that even if you're the only developer, a test that doesn't clean up after itself properly can break your next test, because the DB now contains extra data left over from the previous run.
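One common way to reduce such collisions is to give every test run its own namespace, so that concurrent runs never create or delete the same objects. The following is a minimal sketch of that idea, not part of Songify; the helper names `newRunID` and `namespaced` and the `test-<id>-<key>` format are assumptions made for illustration.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newRunID returns a random 16-character hex string that identifies one test run.
// (Hypothetical helper, not part of Songify.)
func newRunID() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// namespaced prefixes a test object's key with the run ID, so two developers
// running the same test at the same time touch different records.
func namespaced(runID, key string) string {
	return fmt.Sprintf("test-%s-%s", runID, key)
}

func main() {
	runID, err := newRunID()
	if err != nil {
		panic(err)
	}
	// Use the namespaced value wherever the test would use a fixed key.
	fmt.Println(namespaced(runID, "user@example.com"))
}
```

A shared prefix such as `test-` also makes it easier to find and purge leftovers from tests that failed before cleaning up.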

Running the Tests Remotely

This is how CI/CD pipelines or even just automated build systems work. A developer commits a change, and an automated build and test start running. But you can also just connect to a remote machine that has your code and run your tests there.

The benefit is that you can replicate the exact local setup, but have access to data that's already available in the remote environment. The downside is that it is much harder to use your favorite local tools for debugging.

Ad-Hoc Remote Test Instance

Launching a remote ad-hoc test instance ensures that you are still isolated from other developers. It is pretty similar conceptually to running a local instance. You still need to launch your data store (or stores). You still need to populate them (remotely). However, your test code runs locally, and you can debug and troubleshoot using your favorite IDE (Gogland in my case). It can be difficult to manage operationally if developers keep test instances running after the tests are done.

Using Production Data Snapshots

When using a shared test data store, it is often populated with production data snapshots. Depending on how sensitive and critical the data is, some of the following pros and cons may be relevant.

Pros and Cons of Using Production Data for Testing

Pros:

  • You test against real data. If it works, you're good.
  • You can run load and performance tests against data that represents an actual load.
  • You don't need to write data generators that try to simulate real production data.

Cons:

  • It may not be easy to test error conditions.
  • Production data might be sensitive and require special treatment.
  • You need to write some code or manually synchronize your snapshot periodically.
  • You have to deal with format or schema changes.
  • It can be difficult to isolate issues that show up with messy production data.

Anonymizing Production Data

OK. You've made the leap and decided to use a production data snapshot. If your data involves humans in any shape or form, you may have to anonymize the data. This is surprisingly difficult.

You can't just replace all names and be done with it. There are many ways to recover PII (personally identifiable information) and PHI (protected health information) from badly anonymized data snapshots. Check out Wikipedia as a starting point if you're curious.

I work for Helix where we develop a personal genomics platform that deals with the most private data—the sequenced DNA of people. We have some serious safeguards against accidental (and malicious) data breaches.

Updating Tests and Data Snapshots

When using production data snapshots, you'll have to periodically refresh your snapshots and correspondingly your tests. The timing is up to you, but definitely do it whenever there is a schema or format change. 

Ideally, your tests shouldn't test for properties of a particular snapshot. For example, if you refresh your snapshots daily and you have a test that verifies the number of records in the snapshot, then you'll have to update this test every day. It's much better to write your tests in a more generic way, so you need to update them only when the code under test changes. 
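To make the difference concrete, here is a small sketch of a snapshot-independent check. Instead of asserting an exact record count, it asserts properties that should hold for any snapshot. The `Song` type and the specific invariants (non-empty fields, unique URLs) are assumptions made for illustration; your own data will have its own invariants.

```go
package main

import (
	"errors"
	"fmt"
)

// Song is a hypothetical record type standing in for whatever the snapshot holds.
type Song struct {
	URL   string
	Owner string
}

// checkInvariants verifies properties that should hold for ANY snapshot,
// rather than facts about one particular snapshot (such as its record count).
func checkInvariants(songs []Song) error {
	if len(songs) == 0 {
		return errors.New("snapshot is empty")
	}
	seen := make(map[string]bool, len(songs))
	for _, s := range songs {
		if s.URL == "" || s.Owner == "" {
			return fmt.Errorf("song %+v is missing a URL or an owner", s)
		}
		if seen[s.URL] {
			return fmt.Errorf("duplicate song URL %q", s.URL)
		}
		seen[s.URL] = true
	}
	return nil
}

func main() {
	snapshot := []Song{
		{URL: "http://a", Owner: "u1"},
		{URL: "http://b", Owner: "u2"},
	}
	fmt.Println(checkInvariants(snapshot)) // prints <nil>
}
```

A check like this keeps passing after a daily refresh and only needs to change when the rules of the data change.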

Generating Test Data

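Another approach is generating your own test data. The pros and cons are roughly the mirror image of those of production data snapshots: you control the data and can easily craft error conditions, but you have to write the generators, and the result may not reflect the messiness of real data. Note that you can also combine the two approaches and run some tests on production data snapshots and other tests using generated data.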

Random Test Data Generation

How would you go about generating your test data? You can go wild and use totally random data. For example, for Songify we can just generate totally random strings for user email, URL, description, and labels. The result will be chaotic, but valid data since Songify doesn't do any data validation.

Here is a simple function for generating random strings:
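A minimal sketch of such a function, using only the standard library. The name `makeRandomString` and the letters-only alphabet are illustrative choices, not necessarily the ones used in Songify.

```go
package main

import (
	"fmt"
	"math/rand"
)

// letters is the alphabet random strings are drawn from (an illustrative choice).
const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

// makeRandomString returns a string of n random ASCII letters.
func makeRandomString(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func main() {
	fmt.Println(makeRandomString(10))
}
```

Note that before Go 1.20 the global math/rand source is not seeded automatically, so every run produces the same sequence unless you seed it. That can actually be useful when you want reproducible test data.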

Let's write a function that adds five random users and then adds 100 random songs distributed randomly between the five users. We must generate users because songs don't live in a vacuum. Each song is always associated with at least one user.

Now, we can write some tests that operate on a lot of data. For example, here is a test that verifies we can get all 100 songs in one call. Note that the test calls PopulateWithRandomData() before making the call.
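The population function and the test described above can be sketched as follows. To keep the example self-contained, it runs against a hypothetical in-memory store rather than Songify's real data layer. The simplified `User` and `Song` types, `memoryStore`, and the `GetAllSongs` method are all assumptions made for illustration; only the name PopulateWithRandomData() comes from the text. Save it in a file ending in `_test.go` and run it with `go test`.

```go
package main

import (
	"math/rand"
	"testing"
)

// User and Song are simplified stand-ins for Songify's types.
type User struct{ Email, Name string }
type Song struct{ URL, Name, Description string }

// memoryStore is a hypothetical in-memory substitute for the real data layer.
type memoryStore struct {
	users []User
	songs map[string][]Song // songs keyed by the owner's email
}

func newMemoryStore() *memoryStore {
	return &memoryStore{songs: make(map[string][]Song)}
}

func (m *memoryStore) CreateUser(u User) { m.users = append(m.users, u) }

func (m *memoryStore) AddSong(u User, s Song) {
	m.songs[u.Email] = append(m.songs[u.Email], s)
}

// GetAllSongs returns every song, regardless of owner.
func (m *memoryStore) GetAllSongs() []Song {
	var all []Song
	for _, songs := range m.songs {
		all = append(all, songs...)
	}
	return all
}

const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

// makeRandomString returns a string of n random ASCII letters.
func makeRandomString(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

// PopulateWithRandomData adds 5 random users and then 100 random songs,
// assigning each song to a randomly chosen user.
func PopulateWithRandomData(m *memoryStore) {
	users := make([]User, 5)
	for i := range users {
		users[i] = User{Email: makeRandomString(10), Name: makeRandomString(10)}
		m.CreateUser(users[i])
	}
	for i := 0; i < 100; i++ {
		song := Song{
			URL:         makeRandomString(20),
			Name:        makeRandomString(10),
			Description: makeRandomString(30),
		}
		m.AddSong(users[rand.Intn(len(users))], song)
	}
}

// TestGetAllSongs populates the store and verifies all 100 songs come back in one call.
func TestGetAllSongs(t *testing.T) {
	store := newMemoryStore()
	PopulateWithRandomData(store)
	if got := len(store.GetAllSongs()); got != 100 {
		t.Fatalf("expected 100 songs, got %d", got)
	}
}
```

Because the songs are assigned at random, some users may end up with many songs and others with few. The test only depends on the total, so it stays valid however the songs are distributed.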

Rule-Based Test Data Generation

Usually, completely random data is not acceptable. Every data store has constraints you must respect and complex relationships that must be followed in order to create valid data the system can operate on. You may want to generate some invalid data too, to test how the system handles it, but those will be specific errors that you inject deliberately.

The approach will be similar to the random data generation except that you'll have more logic to enforce the rules. 

For example, let's say we want to enforce the rule that a user can have at most 30 songs. Instead of randomly creating 100 songs and assigning them to users, we can decide that each user will have exactly 20 songs, or maybe create one user with no songs and four other users with 25 songs each. 
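A small sketch of how that rule could drive the generator: decide up front how many songs each user gets, and refuse plans that would violate the limit. The function name `distribute` and the even-split strategy are assumptions made for illustration; the limit of 30 comes from the example above.

```go
package main

import "fmt"

// maxSongsPerUser is the rule from the example: a user can have at most 30 songs.
const maxSongsPerUser = 30

// distribute returns how many songs each of userCount users should get so that
// total songs are spread as evenly as possible. It returns an error if the
// "at most maxSongsPerUser songs per user" rule cannot be satisfied.
func distribute(total, userCount int) ([]int, error) {
	if userCount <= 0 || total < 0 || total > userCount*maxSongsPerUser {
		return nil, fmt.Errorf("cannot place %d songs on %d users (max %d each)",
			total, userCount, maxSongsPerUser)
	}
	counts := make([]int, userCount)
	for i := range counts {
		counts[i] = total / userCount
		if i < total%userCount {
			counts[i]++ // spread the remainder over the first users
		}
	}
	return counts, nil
}

func main() {
	counts, err := distribute(100, 5)
	fmt.Println(counts, err) // prints [20 20 20 20 20] <nil>
}
```

The generator would then create exactly counts[i] songs for user i. Asking for 100 songs across three users returns an error, since three users can hold at most 90.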

Narrative-Based Test Data Generation

In some cases, generating test data is very complicated. I recently worked on a project that had to inject test data into four different microservices, each one managing its own database, with the data in each database related to the data in the other databases. It was pretty challenging and labor-intensive to keep everything in sync.

Usually, in such situations it is easier to use the system's APIs and existing tools that create data, instead of going directly into multiple data stores and praying that you don't tear the fabric of the universe. We couldn't take this approach because we actually needed to create some invalid data intentionally to test various error conditions, and to skip some side effects on external systems that happen during the normal workflow.

Conclusion

In this tutorial, we covered testing against remote data stores, using shared test databases, using production data snapshots, and generating your own test data.

In part five, we will focus on fuzz testing, testing your cache, testing data integrity, testing idempotency, and handling missing data. Stay tuned.

