Membership checks and conversions
Finding an item in a collection of items is a recurring task in programming. Different programming languages differ in the way they approach this. For instance, PHP lets us determine whether the position of a substring in a string is greater than -1, which indicates no occurrence. A numeric index >= 0 indicates that a match was found. Python, on the other hand, lets us return whether a word is in the text (True/False). Sometimes we may hear that regular expression matching in Python is fast, but this seems to contradict to what we know about PHP, where string operations tend to be faster. Fast compared to what? For what kind of assumptions? For what types of data and input size? Then we hear: "Set operations are fast". We may have a gut feeling about this, but what if it is wrong? How do we decide then which claims to believe?
Python has many data structures that allow us to test for membership in a very similar way. This is highly flexible, but it is not always obvious which one is preferable when the data requirements tend to constantly change during the program flow. Will we do a simple membership or will there be ten more operations we didn’t foresee initially? The answer always depends on the context.
We could take a simple text, eliminate any punctuation in it, split it on empty characters and then convert the list of words to various other data structures. We could count the number of occurences of each word within the text and use this as a value in a dictionary, having the words as keys. We could use list, tuple, set, dictionary, named tuple, default dict, ordered dict, deque, but also normal strings to test how they respond to lookups for a particular keyword. For this purpose I took a relatively small 2062char-351word text, which corresponds to a normal-length blog post, having ants as its main topic. After converting the text to various data structures, I have chosen to test for the occurrence of the word “pheromone”, since I knew that it appeared exactly once, so as not to place an unfair burden on a regular expression. Here are the results from the test:
NamedTuple is actually a list of named tuples for each word. Since the default way of checking with the “in” operator didn’t seem to work here, I have chosen to iterate over the list, assign key and value to each tuple and compare the key with the chosen keyword. Computationally, this adds a lot of work, which is represented in the height of the bar indicating that this is 4.38 times slower than a lookup in the next slowest data structure here, which is a deque. Using the search method with a regular expression is really fast, but only comparing to the two structures already mentioned. In fact, regex search is 28.52 times slower here than a DefaultDict. To me, the DefaultDict result is a real surprise and I wouldn’t have found this, if I didn’t test most options. However, it would be wrong to immediately conclude that DefaultDict will always be faster than a set for larger inputs, especially when these can also be numeric. Since I tested on Python 2.6.6, I couldn’t examine newer data structures like OrderedDict.
Even when we know that certain checks will be faster by using a certain data structure, we may need to perform many conversions in order to get there, which will offset any initial advantages it has. In R, if we have a data frame and want to convert it to a vector, we may need to convert it to an intermediary matrix first. Every time such a conversion occurs, we are effectively slowing down the program. Clean designs alter the data with the smallest number of conversions. For this reason, we need to examine what happens, for instance, if we want to check for membership in a set, but need to first generate it from other existing data.
With the same text as an example, I have measured some conversions with other ones failing for some reasons. The fastest conversion is the one that isn’t made, but it would also be hard to label it here. Apart from this, we can see that converting a tuple to a list is around 3-5% faster compared to the other way around (again, for words as items). Then we see that dictionaries tend to convert better than sets. Since the time scales on this and the previous figure are the same, we can relate the data. For instance, membership checks in most data structures complete within 19 microseconds with the fastest taking half a microsecond. The fastest of all tested conversions took 4.7microseconds and the slowest 16.7 microseconds. This means that the fastest conversion is approximately 10 times slower than the fastest membership check. A stacked bar chart could give more context here for the case when both operations are needed. Other conversions would be even slower if they would need some form of iteration (native functions are usually faster).
But how do we convert a tuple, list or set to dictionary without adding more data? How do we convert a list, tuple or dictionary to a set? The answer might be different depending on what we want. I hope that this will help you to decide which data structure to choose if you need to do membership checks in large amounts of data.