Python Data Structures Compared
Let's take a look at 5 different Python data structures and see how they could be used to store data we might be processing in our everyday tasks, as well as the relative memory they use for storage and time they take to create and access.
Photo by Hitesh Choudhary on Unsplash
Choosing a structure for storing your data is an important part of solving programming tasks and implementing solutions, yet it is often not given the attention a choice of such potential weight deserves. Unfortunately, in Python I often see the list used as a catch-all data structure. The list has its advantages, of course, but also its drawbacks. There are lots of other data structure options as well.
Let's take a look at 5 different Python data structures and see how they could be used to store data we might be processing in our everyday tasks, as well as the relative memory they use for storage and time they take to create and access.
Types of Data Structures
First, let's lay out the 5 data structures we will consider herein, and provide some preliminary insight.
Class
In this case we are talking about vanilla classes (as opposed to data classes below), which are described at a high level in Python documentation as follows:
Classes provide a means of bundling data and functionality together. Creating a new class creates a new type of object, allowing new instances of that type to be made. Each class instance can have attributes attached to it for maintaining its state. Class instances can also have methods (defined by its class) for modifying its state.
The advantages of using classes is that they are conventional, and are well-used and -understood. Whether or not they are overkill in terms of relative required memory or time is something to look at.
Data Class
Added in Python 3.7, the data class is a special class meant for mainly holding data, which comes with some freebie methods out of the box for typical functionality like instantiating and printing instance contents. Creating a data class is accomplished using the @dataclass
decorator.
Although they use a very different mechanism, Data Classes can be thought of as "mutable namedtuples with defaults". Because Data Classes use normal class definition syntax, you are free to use inheritance, metaclasses, docstrings, user-defined methods, class factories, and other Python class features. Such a class is called a Data Class, but there's really nothing special about the class: the decorator adds generated methods to the class and returns the same class it was given.
As you can see, the automatically generated methods and the related time-savings are the main reason to consider data classes.
Named Tuple
Named tuples are an elegant implementation of a useful data structure, essentially tuple subclasses with named fields.
Named tuples assign meaning to each position in a tuple and allow for more readable, self-documenting code. They can be used wherever regular tuples are used, and they add the ability to access fields by name instead of position index.
At first look, named tuples appear to be the closest thing to simple C-like struct types natively available in Python, making them naively attractive to many.
Dictionary
The Python dictionary is a collection of key-value pairs.
Python Dictionaries are mutable unordered collections (they do not record element position or order of insertion) of key-value pairs. Keys within the dictionary must be unique and must be hashable. That includes types like numbers, strings and tuples. Lists and dicts can not be used as keys since they are mutable.
The advantage of dictionaries is that they are simple, the data within are easily accessible, and they are well-used and -understood.
List
Here it is, the one-size-fits-all Python data superstructure, or so lots of code would have you believe. Here is what the list really is:
Lists are mutable ordered and indexed collections of objects. The items of a list are arbitrary Python objects. Lists are formed by placing a comma-separated list of expressions in square brackets.
Why the widespread use of the list? It's very simple to understand and implement, and is usually the first structure one learns when picking up Python. Are there disadvantages related to speed and memory usage? Let's take a look.
Implementations
First off, let's have a look at the creation process of each of these structures and how they compare to one another.
The reason we might be using any of these data structures to store our data would vary widely, but for the unimaginative, let's imagine we are extracting data from a SQL database and need to store each record in one such structure in order to perform some processing prior to moving the data further along our pipeline.
With that in mind, here is instantiation code for creating each of the five structures.
""" class """ class CustomerClass: def __init__(self, cust_num:str, f_name:str, l_name:str, address:str, city:str, state:str, phone:str, age:int): self.cust_num = cust_num self.f_name = f_name self.l_name = l_name self.address = address self.city = city self.state = state self.phone = phone self.age = age def to_string(self): return(f'{self.cust_num}, {self.f_name}, {self.l_name}, {self.age}, {self.address}, {self.city}, {self.state}, {self.phone}'stgrcutures) """ data class """ from dataclasses import dataclass @dataclass class CustomerDataClass: cust_num: int f_name: str l_name: str address: str city: str state: str phone: str age: int """ named tuple """ from collections import namedtuple CustomerNamedTuple = namedtuple('CustomerNamedTuple', 'cust_num f_name l_name address city state phone age') """ dict """ def make_customer_dict(cust_num: int, f_name: str, l_name: str, address: str, city: str, state: str, phone: str, age: int): return {'cust_num': cust_num, 'f_name': f_name, 'l_name': l_name, 'address': address, 'city': city, 'state': state, 'phone': phone, 'age': age} """ list """ def make_customer_list(cust_num: int, f_name: str, l_name: str, address: str, city: str, state: str, phone: str, age: int): return [cust_num, f_name, l_name, address, city, state, phone, age]
Note the following:
- The creation of an instance of the built-in types dictionary and list have been place inside functions
- The difference between the class and the data class implementations, in light of the discussion above
- The (clearly subjective) elegance and simplicity of the named tuple
Let's have a look at instantiation of these structures, and a comparison of the resources required to do so.
Testing and Results
We will create a single instance of each of the 5 structures, each housing a single data record. We will repeat this process using the same data fields for each structure 1,000,000 times to get a better sense of average time, performing this process on my modest Dell notebook, using an Ubuntu-derived operating system.
Compare the code between the 5 structure instantiations below.
""" instantiating structures """ from sys import getsizeof import time # class customer_1 = CustomerClass('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) print(f'Data: {customer_1.to_string()}') print(f'Type: {type(customer_1)}') print(f'Size: {getsizeof(customer_1)} bytes') t0 = time.time() for i in range(1000000): customer = CustomerClass('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) t1 = time.time() print('Time: {:.3f}s\n'.format(t1-t0)) # data class customer_2 = CustomerDataClass('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) print(f'Data: {customer_2}') print(f'Type: {type(customer_2)}') print(f'Size: {getsizeof(customer_2)} bytes') t0 = time.time() for i in range(1000000): customer = CustomerDataClass('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) t1 = time.time() print('Time: {:.3f}s\n'.format(t1-t0)) # named tuple customer_3 = CustomerNamedTuple('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) print(f'Data: {customer_3}') print(f'Type: {type(customer_3)}') print(f'Size: {getsizeof(customer_3)} bytes') t0 = time.time() for i in range(1000000): customer = CustomerNamedTuple('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) t1 = time.time() print('Time: {:.3f}s\n'.format(t1-t0)) # dict customer_4 = make_customer_dict('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) print(f'Data: {customer_4}') print(f'Type: {type(customer_4)}') print(f'Size: {getsizeof(customer_4)} bytes') t0 = time.time() for i in range(1000000): customer = make_customer_dict('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) t1 = time.time() print('Time: {:.3f}s\n'.format(t1-t0)) # list customer_5 = make_customer_list('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) print(f'Data: {customer_5}') print(f'Type: {type(customer_5)}') print(f'Size: {getsizeof(customer_5)} bytes') t0 = time.time() for i in range(1000000): customer = make_customer_list('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56) t1 = time.time() print('Time: {:.3f}s\n'.format(t1-t0))
And here is the output of the above:
Data: EP90210, Edward, Perez, 56, 123 Fake Street, Cityville, TX, 888-456-1234 Type: <class '__main__.CustomerClass'> Size: 56 bytes Time: 0.657s Data: CustomerDataClass(cust_num='EP90210', f_name='Edward', l_name='Perez', address='123 Fake Street', city='Cityville', state='TX', phone='888-456-1234', age=56) Type: <class '__main__.CustomerDataClass'> Size: 56 bytes Time: 0.630s Data: CustomerNamedTuple(cust_num='EP90210', f_name='Edward', l_name='Perez', address='123 Fake Street', city='Cityville', state='TX', phone='888-456-1234', age=56) Type: <class '__main__.CustomerNamedTuple'> Size: 112 bytes Time: 0.447s Data: {'cust_num': 'EP90210', 'f_name': 'Edward', 'l_name': 'Perez', 'address': '123 Fake Street', 'city': 'Cityville', 'state': 'TX', 'phone': '888-456-1234', 'age': 56} Type: <class 'dict'> Size: 368 bytes Time: 0.318s Data: ['EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56] Type: <class 'list'> Size: 128 bytes Time: 0.184s
Finally, another useful piece of data would be to know the relative access times of values stored within our structures (in the case below, the address). The same retrieval will be repeated 1,000,000 times, and the average time reported below.
""" accessing an element """ # class t0 = time.time() for i in range(1000000): address = customer_1.address t1 = time.time() print(f'Type: {type(customer_1)}') print('Time: {:.3f}s\n'.format(t1-t0)) # data class t0 = time.time() for i in range(1000000): address = customer_2.address t1 = time.time() print(f'Type: {type(customer_2)}') print('Time: {:.3f}s\n'.format(t1-t0)) # named tuple t0 = time.time() for i in range(1000000): address = customer_3.address t1 = time.time() print(f'Type: {type(customer_3)}') print('Time: {:.3f}s\n'.format(t1-t0)) # dictionary t0 = time.time() for i in range(1000000): address = customer_4['address'] t1 = time.time() print(f'Type: {type(customer_4)}') print('Time: {:.3f}s\n'.format(t1-t0)) # list t0 = time.time() for i in range(1000000): address = customer_5[3] t1 = time.time() print(f'Type: {type(customer_5)}') print('Time: {:.3f}s\n'.format(t1-t0))
And the output:
Type: <class '__main__.CustomerClass'> Time: 0.098s Type: <class '__main__.CustomerDataClass'> Time: 0.092s Type: <class '__main__.CustomerNamedTuple'> Time: 0.134s Type: <class 'dict'> Time: 0.095s Type: <class 'list'> Time: 0.117s
The intent of this article is not to make a recommendation one way or another as to which data structure to use, nor is it to suggest that there is a universal best structure for every case. Instead, we wanted to have a look at some different options and their relative strength and weakness. As with all things, there are trade-offs to be made, and less quantitative considerations such as understandability, ease of use, etc. are to be taken into account when making these types of decisions.
That said, a few things do stand out from the above analysis:
- The dictionary uses the greatest amount of storage of all the structures, in our case almost 3 times as much as the next greatest — though we should be careful about generalizing until we look at the effects of scaling and internal field data types
- Unsurprisingly, the list is the fastest to instantiate, yet not the fastest from which to retrieve an element (it's almost the slowest)
- In our case, the named tuple is the slowest structure from which to retrieve an element, yet is middle of the pack for storage space
- Both classes take relatively longer to instantiate (expected), but element retrieval and space used, in both cases, are very competitive with the other structures
So not only are we not looking to recommend a single structure in every case, there is no clear winner that could be recommended in every case. Even taking the caution to generalize based on our small experiment, it is clear that priorities will need to be taken into account for making a decision as to which structure you use for a particular job. At the very least, this limited experimentation has provided some small window of insight into the performance of data structures available in Python.
Related: