This is part two of a tutorial on serializing and deserializing Python objects. In part one, you learned the basics and then dove into the ins and outs of Pickle and JSON.
In this part you'll explore YAML (make sure to have the running example from part one), discuss performance and security considerations, get a review of additional serialization formats, and finally learn how to choose the right scheme.
YAML
YAML is my favorite format. It is a human-friendly data serialization format. Unlike Pickle and JSON, it is not part of the Python standard library, so you need to install it:
pip install yaml
The yaml module has only load()
and dump()
functions. By default they work with strings like loads()
and dumps()
, but can take a second argument, which is an open stream and then can dump/load to/from files.
import yaml print yaml.dump(simple) boolean: true int_list: [1, 2, 3] none: null number: 3.44 text: string
Note how readable YAML is compared to Pickle or even JSON. And now for the coolest part about YAML: it understands Python objects! No need for custom encoders and decoders. Here is the complex serialization/deserialization using YAML:
> serialized = yaml.dump(complex) > print serialized a: !!python/object:__main__.A simple: boolean: true int_list: [1, 2, 3] none: null number: 3.44 text: string when: 2016-03-07 00:00:00 > deserialized = yaml.load(serialized) > deserialized == complex True
As you can see, YAML has its own notation to tag Python objects. The output is still very human readable. The datetime object doesn't require any special tagging because YAML inherently supports datetime objects.
Performance
Before you start thinking of performance, you need to think if performance is a concern at all. If you serialize/deserialize a small amount of data relatively infrequently (e.g. reading a config file at the beginning of a program) then performance is not really a concern and you can move on.
But, assuming you profiled your system and discovered that serialization and/or deserialization are causing performance issues, here are the things to address.
The are two aspects for performance: how fast can you serialize/deserialize, and how big is the serialized representation?
To test the performance of the various serialization formats, I'll create a largish data structure and serialize/deserialize it using Pickle, YAML, and JSON. The big_data
list contains 5,000 complex objects.
big_data = [dict(a=simple, when=datetime.now().replace(microsecond=0)) for i in range(5000)]
Pickle
I'll use IPython here for its convenient %timeit
magic function that measures execution times.
import cPickle as pickle In [190]: %timeit serialized = pickle.dumps(big_data) 10 loops, best of 3: 51 ms per loop In [191]: %timeit deserialized = pickle.loads(serialized) 10 loops, best of 3: 24.2 ms per loop In [192]: deserialized == big_data Out[192]: True In [193]: len(serialized) Out[193]: 747328
The default pickle takes 83.1 milliseconds to serialize and 29.2 milliseconds to deserialize, and the serialized size is 747,328 bytes.
Let's try with the highest protocol.
In [195]: %timeit serialized = pickle.dumps(big_data, protocol=pickle.HIGHEST_PROTOCOL) 10 loops, best of 3: 21.2 ms per loop In [196]: %timeit deserialized = pickle.loads(serialized) 10 loops, best of 3: 25.2 ms per loop In [197]: len(serialized) Out[197]: 394350
Interesting results. The serialization time shrank to only 21.2 milliseconds, but the deserialization time increased a little to 25.2 milliseconds. The serialized size shrank significantly to 394,350 bytes (52%).
JSON
In [253] %timeit serialized = json.dumps(big_data, cls=CustomEncoder) 10 loops, best of 3: 34.7 ms per loop In [253] %timeit deserialized = json.loads(serialized, object_hook=decode_object) 10 loops, best of 3: 148 ms per loop In [255]: len(serialized) Out[255]: 730000
Ok. Performance seems to be a little worse than Pickle for encoding, but much, much worse for decoding: 6 times slower. What's going on? This is an artifact of the object_hook
function that needs to run for every dictionary to check if it needs to convert it to an object. Running without the object hook is much faster.
%timeit deserialized = json.loads(serialized) 10 loops, best of 3: 36.2 ms per loop
The lesson here is that when serializing and deserializing to JSON, consider very carefully any custom encodings because they may have a major impact on the overall performance.
YAML
In [293]: %timeit serialized = yaml.dump(big_data) 1 loops, best of 3: 1.22 s per loop In[294]: %timeit deserialized = yaml.load(serialized) 1 loops, best of 3: 2.03 s per loop In [295]: len(serialized) Out[295]: 200091
Ok. YAML is really, really slow. But, note something interesting: the serialized size is just 200,091 bytes. Much better than both Pickle and JSON. Let's look inside real quick:
In [300]: print serialized[:211] - a: &id001 boolean: true int_list: [1, 2, 3] none: null number: 3.44 text: string when: 2016-03-13 00:11:44 - a: *id001 when: 2016-03-13 00:11:44 - a: *id001 when: 2016-03-13 00:11:44
YAML is being very clever here. It identified that all 5,000 dicts share the same value for the 'a' key, so it stores it only once and references it using *id001
for all objects.
Security
Security is an often a critical concern. Pickle and YAML, by virtue of constructing Python objects, are vulnerable to code execution attacks. A cleverly formatted file can contain arbitrary code that will be executed by Pickle or YAML. There is no need to be alarmed. This is by design and is documented in Pickle's documentation:
Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
As well as in YAML's documentation:
Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load and so may call any Python function.
You just need to understand that you shouldn't load serialized data received from untrusted sources using Pickle or YAML. JSON is OK, but again if you have custom encoders/decoders than you may be exposed, too.
The yaml module provides the yaml.safe_load()
function that will load only simple objects, but then you lose a lot of YAML's power and maybe opt to just use JSON.
Other Formats
There are many other serialization formats available. Here are a few of them.
Protobuf
Protobuf, or protocol buffers, is Google's data interchange format. It is implemented in C++ but has Python bindings. It has a sophisticated schema and packs data efficiently. Very powerful, but not very easy to use.
MessagePack
MessagePack is another popular serialization format. It is also binary and efficient, but unlike Protobuf it doesn't require a schema. It has a type system that's similar to JSON, but a little richer. Keys can be any type, and not just strings and non-UTF8 strings are supported.
CBOR
CBOR stands for Concise Binary Object Representation. Again, it supports the JSON data model. CBOR is not as well-known as Protobuf or MessagePack but is interesting for two reasons:
- It is an official Internet standard: RFC 7049.
- It was designed specifically for the Internet of Things (IoT).
How to Choose?
This is the big question. With so many options, how do you choose? Let's consider the various factors that should be taken into account:
- Should the serialized format be human-readable and/or human-editable?
- Is serialized content going to be received from untrusted sources?
- Is serialization/deserialization a performance bottleneck?
- Does serialized data need to be exchanged with non-Python environments?
I'll make it very easy for you and cover several common scenarios and which format I recommend for each one:
Auto-Saving Local State of a Python Program
Use pickle (cPickle) here with the HIGHEST_PROTOCOL
. It's fast, efficient and can store and load most Python objects without any special code. It can be used as a local persistent cache also.
Configuration Files
Definitely YAML. Nothing beats its simplicity for anything humans need to read or edit. It's used successfully by Ansible and many other projects. In some situations, you may prefer to use straight Python modules as configuration files. This may be the right choice, but then it's not serialization, and it's really part of the program and not a separate configuration file.
Web APIs
JSON is the clear winner here. These days, Web APIs are consumed most often by JavaScript web applications that speak JSON natively. Some Web APIs may return other formats (e.g. csv for dense tabular result sets), but I would argue that you can package csv data into JSON with minimal overhead (no need to repeat each row as an object with all the column names).
High-Volume / Low-Latency Large-Scale Communication
Use one of the binary protocols: Protobuf (if you need a schema), MessagePack, or CBOR. Run your own tests to verify the performance and the representative power of each option.
Conclusion
Serialization and deserialization of Python objects is an important aspect of distributed systems. You can't send Python objects directly over the wire. You often need to interoperate with other systems implemented in other languages, and sometimes you just want to store the state of your program in persistent storage.
Python comes with several serialization schemes in its standard library, and many more are available as third-party modules. Being aware of all the options and the pros and cons of each one will let you choose the best method for your situation.
No comments:
Post a Comment