Thursday, September 21, 2017

An Introduction to ETS Tables in Elixir

An Introduction to ETS Tables in Elixir

When crafting an Elixir program, you often need to share a state. For example, in one of my previous articles I showed how to code a server to perform various calculations and keep the result in memory (and later we've seen how to make this server bullet-proof with the help of supervisors). There is a problem, however: if you have a single process that takes care of the state and many other processes that access it, the performance may be seriously affected. This is simply because the process can serve only one request at a time. 

However, there are ways to overcome this problem, and today we are going to talk about one of them. Meet Erlang Term Storage tables or simply ETS tables, a fast in-memory storage that can host tuples of arbitrary data. As the name implies, these tables were initially introduced in Erlang but, as with any other Erlang module, we can easily use them in Elixir as well.

In this article you will:

  • Learn how to create ETS tables and options available upon creation.
  • Learn how to perform read, write, delete and some other operations.
  • See ETS tables in action.
  • Learn about disk-based ETS tables and how they differ from in-memory tables.
  • See how to convert ETS and DETS back and forth.

All code examples work with both Elixir 1.4 and 1.5, which was recently released.

Introduction to ETS Tables

As I mentioned earlier, ETS tables are in-memory storage that contain tuples of data (called rows). Multiple processes may access the table by its id or a name represented as an atom and perform read, write, delete and other operations. ETS tables are created by a separate process, so if this process is terminated, the table is destroyed. However, there is no automatic garbage collection mechanism, so the table may hang out in the memory for quite some time.

Data in the ETS table are represented by a tuple {:key, value1, value2, valuen}. You can easily look up the data by its key or insert a new row, but by default there can't be two rows with the same key. Key-based operations are very fast, but if for some reason you need to produce a list from an ETS table and, say, perform complex manipulations of the data, that's possible too.

What's more, there are disk-based ETS tables available that store their contents in a file. Of course, they operate slower, but this way you get a simple file storage without any hassle. On top of that, in-memory ETS can be easily converted to disk-based and vice versa.

So, I think it's time to start our journey and see how the ETS tables are created!

Creating an ETS Table

To create an ETS table, employ the new/2 function. As long as we are using an Erlang module, its name should be written as an atom:

Note that until recently you could only create up to 1,400 tables per BEAM instance, but this is not the case anymore—you are only limited to the amount of available memory.

The first argument passed to the new function is the table's name (alias), whereas the second one contains a list of options. The cool_table variable now contains a number that identifies the table in the system:

You may now use this variable to perform subsequent operations to the table (read and write data, for example).

Available Options

Let's talk about the options that you may specify when creating a table. The first (and somewhat strange) thing to note is that by default you cannot use the table's alias in any way, and basically it has no effect. But still, the alias must be passed upon the table's creation.

To be able to access the table by its alias, you must provide a :named_table option like this:

By the way, if you'd like to rename the table, it can be done using the rename/2 function:

Next, as already mentioned, a table cannot contain multiple rows with the same key, and this is dictated by the type. There are four possible table types:

  • :set—that's the default one. It means that you can't have multiple rows with exactly the same keys. The rows are not being re-ordered in any particular manner.
  • :ordered_set—the same as :set, but the rows are ordered by the terms.
  • :bag—multiple rows may have the same key, but the rows still cannot be fully identical.
  • :duplicate_bag—rows can be fully identical.

There is one thing worth mentioning regarding the :ordered_set tables. As Erlang's documentation says, these tables treat keys as equal when they compare equal, not only when they match. What does that mean?

Two terms in Erlang match only if they have the same value and the same type. So integer 1 matches only another integer 1, but not float 1.0 as they have different types. Two terms are compare equal, however, if either they have the same value and type or if both of them are numerics and extend to the same value. This means that 1 and 1.0 are compare equal.

To provide the table's type, simply add an element to the list of options:

Another interesting option that you can pass is :compressed. It means that the data inside the table (but not the keys) will be—guess what—stored in a compact form. Of course, the operations that are executed upon the table will become slower.

Next up, you can control which element in the tuple should be used as the key. By default, the first element (position 1) is used, but this can be changed easily:

Now the second elements in the tuples will be treated as the keys.

The last but not the least option controls the table's access rights. These rights dictate what processes are able to access the table:

  • :public—any process can perform any operation to the table.
  • :protected—the default value. Only the owner process can write to the table, but all the processes can read.
  • :private—only the owner process can access the table.

So, to make a table private, you would write:

Alright, enough talking about options—let's see some common operations that you can perform to the tables!

Write Operations

In order to read something from the table, you first need to write some data there, so let's start with the latter operation. Use the insert/2 function to put data into the table:

You may also pass a list of tuples like this:

Note that if the table has a type of :set and a new key matches an existing one, the old data will be overwritten. Similarly, if a table has a type of :ordered_set and a new key compares equal to the old one, the data will be overwritten, so pay attention to this.

The insert operation (even with multiple tuples at once) is guaranteed to be atomic and isolated, which means that either everything is stored in the table or nothing at all. Also, other processes won't be able to see the intermediate result of the operation. All in all, this is pretty similar to SQL transactions.

If you are concerned about duplicating keys or do not want to overwrite your data by mistake, use the insert_new/2 function instead. It is similar to insert/2 but will never insert duplicating keys and will instead return false. This is the case for the :bag and :duplicate_bag tables as well:

If you provide a list of tuples, each key will be checked, and the operation will be cancelled even if one of the keys is duplicated.

Read Operations

Great, now we have some data in our table—how do we fetch them? The easiest way is to perform lookup by a key:

Remember that for the :ordered_set table, the key should compare equal to the provided value. For all other table types, it should match. Also, if a table is a :bag or an :ordered_bag, the lookup/2 function may return a list with multiple elements:

Instead of fetching a list, you may grab an element in the desired position using the lookup_element/3 function:

In this code, we are getting the row under the key :number and then taking the element in the second position. It also works perfectly with :bag or :duplicate_bag:

If you would like to simply check if some key is present in the table, use member/2, which returns either true or false:

You may also get the first or the last key in a table by using first/1 and last/1 respectively:

On top of that, it is possible to determine the previous or the next key based on the provided one. If such a key cannot be found, :"$end_of_table" will be returned:

Note, however, that the table traversal using functions like first, next, last or prev is not isolated. It means that a process may remove or add more data to the table while you are iterating over it. One way to overcome this issue is by using safe_fixtable/2, which fixes the table and ensures that each element will be fetched only once. The table remains fixed unless the process releases it:

Lastly, if you'd like to find an element in the table and remove it, employ the take/2 function:

Delete Operations

Okay, so now let's say you no longer need the table and wish to get rid of it. Use delete/1 for that:

Of course, you may delete a row (or multiple rows) by its key as well:

To clear out the entire table, utilize delete_all_objects/1:

And, lastly, to find and remove a specific object, use delete_object/2:

Converting the Table

An ETS table can be converted to a list anytime by using the tab2list/1 function:

Remember, however, that fetching the data from the table by the keys is a very fast operation, and you should stick to it if possible.

You may also dump your table to a file using tab2file/2:

Note that the second argument should be a charlist (a single-quoted string).

There are a handful of other operations available that can be applied to the ETS tables, and of course we are not going to discuss them all. I really recommend skimming through the Erlang documentation on ETS to learn more.

Persisting the State With ETS

To summarize the facts that we have learned so far, let's modify a simple program that I have presented in my article about GenServer. This is a module called CalcServer that allows you to perform various calculations by sending requests to the server or fetching the result:

Currently our server doesn't support all mathematical operations, but you may extend it as needed. Also, my other article explains how to convert this module to an application and take advantage of supervisors to take care of the server crashes.

What I'd like to do now is add another feature: the ability to log all the mathematical operations that were performed along with the passed argument. These operations will be stored in an ETS table so that we will be able to fetch it later.

First of all, modify the init function so that a new named private table with a type of :duplicate_bag is created. We are using :duplicate_bag because two identical operations with the same argument may be performed:

Now tweak the handle_cast callback so that it logs the requested operation, prepares a formula, and then performs the actual computation:

Here is the prepare_and_log private function:

We are logging the operation right away (the corresponding function will be presented in a moment). Then return the appropriate function or nil if we don't know how to handle the operation.

As for the log function, we should either support a tuple (containing both the operation's name and the argument) or an atom (containing only the operation's name, for example, :sqrt):

Next, the calculate function, which either returns a proper result or a stop message:

Finally, let's present a new interface function to fetch all the performed operations by their type:

Handle the call:

And perform the actual lookup:

Now test everything:

The result is correct because we have performed two :add operations with the arguments 1 and 2. Of course, you may further extend this program as you see fit. Still, don't abuse ETS tables, and employ them when it is really going to boost the performance—in many cases, using immutables is a better solution.

Disk ETS

Before wrapping up this article, I wanted to say a couple of words about disk-based ETS tables or simply DETS

DETS are pretty similar to ETS: they use tables to store various data in the form of tuples. The difference, as you've guessed, is that they rely on file storage instead of memory and have fewer features. DETS have functions similar to the ones we discussed above, but some operations are performed a bit differently.

To open a table, you need to use either open_file/1 or open_file/2—there is no new/2 function like in the :ets module. Since we don't have any existing table yet, let's stick to open_file/2, which is going to create a new file for us:

The filename is equal to the table's name by default, but this can be changed. The second argument passed to the open_file is the list of options written in the form of tuples. There are a handful of available options like :access or :auto_save. For instance, to change a filename, use the following option:

Note that there is also a :type option that may have one of the following values:

  • :set
  • :bag
  • :duplicate_bag

These types are the same as for the ETS. Note that DETS cannot have a type of :ordered_set.

There is no :named_table option, so you can always use the table's name to access it.

Another thing worth mentioning is that the DETS tables must be properly closed:

If you don't do this, the table will be repaired the next time it is opened.

You perform read and write operations just like you did with ETS:

Bear in mind, though, that DETS are slower than ETS because Elixir will need to access the disk which, of course, takes more time.

Note that you may convert ETS and DETS tables back and forth with ease. For example, let's use to_ets/2 and copy the contents of our DETS table in-memory:

Copy the ETS's contents to DETS using to_dets/2:

To sum up, disk-based ETS is a simple way to store contents in the file, but this module is slightly less powerful than ETS, and the operations are slower as well.

Conclusion

In this article, we have talked about ETS and disk-based ETS tables that allow us to store arbitrary terms in memory and in files respectively. We have seen how to create such tables, what the available types are, how to perform read and write operations, how to destroy tables, and how to convert them to other types. You may find more information about ETS in the Elixir guide and on the Erlang official page.

Once again, don't overuse ETS tables, and try to stick with immutables if possible. In some cases, however, ETS may be a nice performance boost, so knowing about this solution is helpful in any case. 

Hopefully, you've enjoyed this article. As always, thank you for staying with me, and see you really soon!


No comments:

Post a Comment