Monday, August 29, 2016

Serialization and Deserialization of Python Objects: Part 2

This is part two of a tutorial on serializing and deserializing Python objects. In part one, you learned the basics and then dove into the ins and outs of Pickle and JSON. 

In this part you'll explore YAML (make sure to have the running example from part one), discuss performance and security considerations, get a review of additional serialization formats, and finally learn how to choose the right scheme.

YAML

YAML is my favorite format. It is a human-friendly data serialization format. Unlike Pickle and JSON, it is not part of the Python standard library, so you need to install it:

pip install pyyaml

The yaml module's main entry points are load() and dump(). By default they work with strings, like loads() and dumps(), but both also accept an open stream (as the input of load(), or as the second argument of dump()), so they can read from and write to files as well.
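As a minimal sketch of the string and stream modes, assuming PyYAML is installed: the `simple` dictionary below is a hypothetical stand-in for the running example from part one. The sketch uses safe_dump() and safe_load() because recent PyYAML releases require an explicit Loader argument for plain yaml.load().

```python
import io

import yaml  # provided by the PyYAML package

# Hypothetical stand-in for the "simple" object from part one.
simple = {"a": [1, 2, 3], "b": {"c": "text", "d": 4.5}}

# With no stream argument, the dump function returns the YAML text as a string.
text = yaml.safe_dump(simple)
print(text)

# Loading accepts that string directly...
from_string = yaml.safe_load(text)

# ...or any open stream. A real file object works the same way as StringIO.
stream = io.StringIO()
yaml.safe_dump(simple, stream)
stream.seek(0)
from_stream = yaml.safe_load(stream)
```

Both round trips should give back a dictionary equal to the original.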

Note how readable YAML is compared to Pickle or even JSON. And now for the coolest part about YAML: it understands Python objects! No need for custom encoders and decoders. Here is the complex serialization/deserialization using YAML:

As you can see, YAML has its own notation to tag Python objects. The output is still very human readable. The datetime object doesn't require any special tagging because YAML inherently supports datetime objects. 

Performance

Before you start thinking about performance, consider whether performance is a concern at all. If you serialize/deserialize a small amount of data relatively infrequently (e.g. reading a config file at the beginning of a program), then performance is not really a concern and you can move on.

But, assuming you profiled your system and discovered that serialization and/or deserialization are causing performance issues, here are the things to address.

There are two aspects to performance: how fast can you serialize/deserialize, and how big is the serialized representation?

To test the performance of the various serialization formats, I'll create a largish data structure and serialize/deserialize it using Pickle, YAML, and JSON. The big_data list contains 5,000 complex objects.

Pickle

I'll use IPython here for its convenient %timeit magic function that measures execution times.

The default pickle takes 83.1 milliseconds to serialize and 29.2 milliseconds to deserialize, and the serialized size is 747,328 bytes.

Let's try with the highest protocol.

Interesting results. The serialization time shrank to only 21.2 milliseconds, and the deserialization time also dropped a little, to 25.2 milliseconds. The serialized size shrank significantly to 394,350 bytes (about 53% of the default size).
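A comparison like this can be approximated outside IPython with the standard timeit module. The sketch below is not the original benchmark: `big_data` here is a hypothetical stand-in for the article's list, and protocol 0 is assumed to play the role of "the default" (it was the default on Python 2; Python 3 defaults to a binary protocol). Absolute numbers will differ from machine to machine.

```python
import pickle
import timeit
from datetime import datetime

# Hypothetical stand-in for the article's big_data list of 5,000 objects.
record = {"a": list(range(20)), "b": "some text", "when": datetime(2016, 8, 29)}
big_data = [dict(record, index=i) for i in range(5000)]


def measure(protocol, repeat=5):
    """Return (size in bytes, dump seconds, load seconds) for one protocol."""
    blob = pickle.dumps(big_data, protocol=protocol)
    dump_time = timeit.timeit(
        lambda: pickle.dumps(big_data, protocol=protocol), number=repeat
    ) / repeat
    load_time = timeit.timeit(lambda: pickle.loads(blob), number=repeat) / repeat
    return len(blob), dump_time, load_time


for protocol in (0, pickle.HIGHEST_PROTOCOL):
    size, dump_time, load_time = measure(protocol)
    print(
        f"protocol {protocol}: {size:,} bytes, "
        f"dump {dump_time * 1000:.1f} ms, load {load_time * 1000:.1f} ms"
    )
```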

JSON

Ok. Performance seems to be a little worse than Pickle for encoding, but much, much worse for decoding: 6 times slower. What's going on? This is an artifact of the object_hook function that needs to run for every dictionary to check if it needs to convert it to an object. Running without the object hook is much faster.

The lesson here is that when serializing and deserializing to JSON, consider very carefully any custom encodings because they may have a major impact on the overall performance.
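To see the cost of the hook on your own data, a rough sketch like the following can help. The `Point` class and the `encode`/`decode_hook` functions are hypothetical illustrations, not the classes from part one. The point is only that object_hook is called once for every decoded dictionary, so its cost scales with the number of objects.

```python
import json
import timeit


class Point:
    """Hypothetical custom class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def encode(obj):
    # Passed as default=: called for objects json cannot serialize itself.
    if isinstance(obj, Point):
        return {"__class__": "Point", "x": obj.x, "y": obj.y}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def decode_hook(d):
    # Passed as object_hook=: called for every dictionary in the document.
    if d.get("__class__") == "Point":
        return Point(d["x"], d["y"])
    return d


points = [Point(i, i * 2) for i in range(5000)]
text = json.dumps(points, default=encode)

plain = timeit.timeit(lambda: json.loads(text), number=10)
hooked = timeit.timeit(lambda: json.loads(text, object_hook=decode_hook), number=10)
print(f"without hook: {plain:.3f} s, with hook: {hooked:.3f} s")
```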

YAML

Ok. YAML is really, really slow. But, note something interesting: the serialized size is just 200,091 bytes. Much better than both Pickle and JSON. Let's look inside real quick:

YAML is being very clever here. It identified that all 5,000 dicts share the same value for the 'a' key, so it stores it only once and references it using *id001 for all objects.
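The effect is easy to reproduce on a small scale. This is a sketch assuming PyYAML, with a hypothetical `shared` dictionary standing in for the value of the 'a' key. PyYAML creates an alias when the very same object appears more than once, not merely when two values compare equal.

```python
import yaml  # provided by the PyYAML package

# One dictionary object referenced from several records.
shared = {"x": 1, "y": [1, 2, 3]}
records = [{"id": i, "a": shared} for i in range(3)]

text = yaml.safe_dump(records)
print(text)  # the first record defines an anchor, later ones only reference it

# Loading resolves the aliases back to a single shared object.
restored = yaml.safe_load(text)
print(restored[0]["a"] is restored[2]["a"])
```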

Security

Security is often a critical concern. Pickle and YAML, by virtue of constructing Python objects, are vulnerable to code execution attacks. A cleverly formatted file can contain arbitrary code that will be executed by Pickle or YAML. There is no need to be alarmed. This is by design and is documented in Pickle's documentation:

Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

As well as in YAML's documentation:

Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load and so may call any Python function.

You just need to understand that you shouldn't load serialized data received from untrusted sources using Pickle or YAML. JSON is OK, but again, if you have custom encoders/decoders then you may be exposed, too.
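To make the risk concrete, here is a deliberately harmless sketch of the mechanism. The `Payload` class is a hypothetical illustration: its __reduce__ method tells pickle which callable to invoke when the data is loaded. An attacker would substitute something destructive for the arithmetic expression.

```python
import pickle


class Payload:
    """Hypothetical class whose pickled form runs code when loaded."""

    def __reduce__(self):
        # "To rebuild this object, call eval('6 * 7')."
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The receiver never sees the Payload class; loading simply runs the call.
result = pickle.loads(blob)
print(result)  # 42
```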

The yaml module provides the yaml.safe_load() function that will load only simple objects, but then you lose a lot of YAML's power and may opt to just use JSON.
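A short sketch of the difference, assuming PyYAML: plain data loads normally, while a document that asks the loader to call a Python function is rejected before anything runs. The `hostile` string is a hypothetical example that uses a harmless function.

```python
import yaml  # provided by the PyYAML package

# Plain data loads fine with safe_load().
config = yaml.safe_load("name: demo\nports: [8000, 8001]\n")
print(config)

# A document that asks the loader to call a Python function.
# safe_load() refuses to construct it, so nothing is executed.
hostile = "!!python/object/apply:os.getcwd []"
try:
    yaml.safe_load(hostile)
    rejected = False
except yaml.YAMLError as exc:
    rejected = True
    print("rejected:", type(exc).__name__)
```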

Other Formats

There are many other serialization formats available. Here are a few of them.

Protobuf

Protobuf, or protocol buffers, is Google's data interchange format. It is implemented in C++ but has Python bindings. It has a sophisticated schema and packs data efficiently. Very powerful, but not very easy to use.

MessagePack

MessagePack is another popular serialization format. It is also binary and efficient, but unlike Protobuf it doesn't require a schema. It has a type system that's similar to JSON, but a little richer. Keys can be of any type, not just strings, and non-UTF8 strings are supported.

CBOR

CBOR stands for Concise Binary Object Representation. Again, it supports the JSON data model. CBOR is not as well-known as Protobuf or MessagePack but is interesting for two reasons: 

  1. It is an official Internet standard: RFC 7049 (since superseded by RFC 8949).
  2. It was designed specifically for the Internet of Things (IoT).

How to Choose?

This is the big question. With so many options, how do you choose? Let's consider the various factors that should be taken into account:

  1. Should the serialized format be human-readable and/or human-editable?
  2. Is serialized content going to be received from untrusted sources?
  3. Is serialization/deserialization a performance bottleneck?
  4. Does serialized data need to be exchanged with non-Python environments?

I'll make it very easy for you and cover several common scenarios and which format I recommend for each one:

Auto-Saving Local State of a Python Program

Use pickle (cPickle on Python 2) here with the HIGHEST_PROTOCOL. It's fast, efficient and can store and load most Python objects without any special code. It can also serve as a local persistent cache.
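A minimal sketch of such auto-saving follows. The `save_state` and `load_state` helpers and the state dictionary are hypothetical names chosen for illustration. On Python 3 the plain pickle module already uses its C implementation when available, so no separate cPickle import is needed.

```python
import os
import pickle
import tempfile


def save_state(state, path):
    """Pickle state to path, replacing the old file only after a full write."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)


def load_state(path, default=None):
    """Return the saved state, or default if nothing has been saved yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


# Hypothetical program state; sets and tuples survive the round trip.
state_file = os.path.join(tempfile.mkdtemp(), "state.pickle")
state = {"open_files": ["a.txt", "b.txt"], "cursor": (10, 4), "tags": {"x", "y"}}

save_state(state, state_file)
restored = load_state(state_file)
print(restored == state)
```

Writing to a temporary file first means a crash in the middle of a save leaves the previous state file intact.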

Configuration Files

Definitely YAML. Nothing beats its simplicity for anything humans need to read or edit. It's used successfully by Ansible and many other projects. In some situations, you may prefer to use straight Python modules as configuration files. This may be the right choice, but then it's not serialization, and it's really part of the program and not a separate configuration file.

Web APIs

JSON is the clear winner here. These days, Web APIs are consumed most often by JavaScript web applications that speak JSON natively. Some Web APIs may return other formats (e.g. csv for dense tabular result sets), but I would argue that you can package csv data into JSON with minimal overhead (no need to repeat each row as an object with all the column names). 
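One way to package tabular data this way is to send the column names once and the rows as plain arrays. The layout below (the "columns" and "rows" keys) is just one possible convention, and the sample data is made up.

```python
import json

columns = ["id", "name", "score"]
rows = [[1, "alice", 9.5], [2, "bob", 7.25], [3, "carol", 8.0]]

# Verbose: every row repeats every column name.
verbose = json.dumps([dict(zip(columns, row)) for row in rows])

# Compact: column names appear once, rows are plain arrays.
compact = json.dumps({"columns": columns, "rows": rows})
print(len(verbose), len(compact))

# The consumer can rebuild row objects when it needs them.
payload = json.loads(compact)
records = [dict(zip(payload["columns"], row)) for row in payload["rows"]]
print(records[1])
```

The saving grows with the number of rows, since the per-row overhead of the column names disappears.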

High-Volume / Low-Latency Large-Scale Communication

Use one of the binary protocols: Protobuf (if you need a schema), MessagePack, or CBOR. Run your own tests to verify the performance and the representative power of each option.

Conclusion

Serialization and deserialization of Python objects is an important aspect of distributed systems. You can't send Python objects directly over the wire. You often need to interoperate with other systems implemented in other languages, and sometimes you just want to store the state of your program in persistent storage. 

Python comes with several serialization schemes in its standard library, and many more are available as third-party modules. Being aware of all the options and the pros and cons of each one will let you choose the best method for your situation.

