Building a Toy Database in Python
Databases are one of the most fundamental components of modern computing. Behind the scenes of nearly every application you use on a daily basis, databases are hard at work storing, retrieving, and manipulating data. From social media apps and online stores to banking systems and government institutions, databases power our digital world.
While databases have existed in various forms since the early days of computing, they've evolved significantly over the decades. The rise of web applications and "big data" in recent years has driven explosive growth in both the scale and complexity of databases. With the amount of data being captured and analyzed today, designing efficient, scalable, and secure databases is more critical than ever.
Learning by Building
Database internals can be a complex topic, especially for those new to backend development. Terms like normalization, ACID transactions, and query optimization can be intimidating at first. But often the best way to learn a new concept is to get your hands dirty and build something with it.
That's our goal in this post – to implement a simple "toy" database from scratch using Python. We'll walk through the core components of a basic key-value store, one of the simplest but most important types of databases. In the process, we'll demystify some of the magic behind databases and see that even a modest amount of Python can get us pretty far!
Why Python?
You can implement a key-value store in any number of programming languages, but Python is a particularly great choice for learning. Here's why:
- Batteries included – Python ships with a powerful standard library that has built-in modules for many common tasks. In particular, we'll be using the json module to easily serialize our data to disk.
- Extensive 3rd party ecosystem – While we're using only the standard library today, Python has a wealth of excellent 3rd party packages for working with databases. Beyond the built-in sqlite3 module, options range from lightweight database drivers to full-fledged ORMs like SQLAlchemy.
- Easy to read and write – Python emphasizes readability and simplicity in its syntax. Compared to lower-level languages like C, or more verbose ones like Java, Python code is quicker to write and easier to understand. This lets us focus on the core database concepts without getting bogged down in complicated code.
- Interactive development – One of Python's killer features is its interactive interpreter. You can execute Python code line by line and get immediate feedback, making it great for experimentation and debugging. We'll take advantage of this to play around with our database as we go.
A Crash Course in JSON
To implement our database, we'll be using JSON (JavaScript Object Notation) as the format to persist the data to disk. JSON has become the de facto standard for data interchange on the web, and for good reason. Let's do a quick refresher on what makes JSON so useful.
At its core, JSON is a lightweight, human-readable format for structuring data. It supports a few basic data types:
- Strings – "Hello, world!", "json"
- Numbers – 42, -1.5
- Booleans – true, false
- Arrays – [1, 2, "c", true]
- Objects – {"key": "value", "count": 5}
- null – null
These simple building blocks can be nested and combined to represent more complex data structures. Here's an example JSON object representing a person:

    {
      "name": "John Smith",
      "age": 35,
      "employed": true,
      "hobbies": ["reading", "golf", "coding"],
      "address": {
        "street": "123 Main St",
        "city": "Anytown",
        "country": "USA"
      }
    }

One of JSON's big advantages is how seamlessly it maps to data structures in many programming languages. In Python, JSON objects become dictionaries, arrays become lists, and the rest of the types map to their Python equivalents. This symmetry makes conversion between JSON and Python trivial:

    import json

    person = {
        "name": "John Smith",
        "age": 35,
        "employed": True,
        "hobbies": ["reading", "golf", "coding"],
        "address": {
            "street": "123 Main St",
            "city": "Anytown",
            "country": "USA"
        }
    }

    # Encode the Python dictionary as a JSON string
    json_string = json.dumps(person)
    print(json_string)

    # Decode the JSON string back into a Python dictionary
    person_from_json = json.loads(json_string)
    print(person_from_json)

The json module's dumps function encodes a Python object into a JSON string, while loads decodes a JSON string into a Python object. You can also use dump and load to encode/decode to a file or file-like object. We'll be using these functions extensively in our database implementation.
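As a rough illustration of the file-based pair, the sketch below writes a dictionary with dump and reads it back with load. The file name and the sample record are made up for the example. It also shows one caveat worth knowing before building a database on JSON: the round trip is not perfectly symmetric, because tuples come back as lists and non-string dictionary keys come back as strings.

```python
import json
import os
import tempfile

# Sample data chosen to expose the round-trip caveats (hypothetical record)
record = {"name": "John Smith", "scores": (90, 85), 1: "numeric key"}

# Hypothetical file location inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "record.json")

# json.dump writes to an open file object
with open(path, "w") as file:
    json.dump(record, file)

# json.load reads from an open file object
with open(path) as file:
    restored = json.load(file)

print(restored["scores"])  # [90, 85] -- the tuple came back as a list
print("1" in restored)     # True -- the integer key 1 came back as "1"
```

This is part of why restricting keys to strings is a sensible rule for a JSON-backed store.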
JSON has exploded in popularity in recent years. A study in 2017 found that over 70% of all public web APIs use JSON as their primary data format. High-traffic sites like Twitter and Facebook use JSON extensively in their data exchange. JSON and JSON-like documents are also central to many NoSQL databases, including MongoDB (which stores documents in BSON, a binary variant of JSON), CouchDB, and Firebase.
Implementing a Python Database
With that JSON review out of the way, let's dive into the code! We'll start by sketching out a class to hold our database logic:

    import json
    import os


    class PyDB:
        def __init__(self, path):
            self.path = path
            self.data = {}
            self.load(self.path)

        def load(self, path):
            # Read existing data if the file is already there
            if os.path.exists(path):
                with open(path, 'r') as file:
                    self.data = json.load(file)
            else:
                self.data = {}

        def save(self):
            # Rewrite the whole file with the current in-memory data
            with open(self.path, 'w') as file:
                json.dump(self.data, file)

        def get(self, key):
            return self.data.get(key)

        def set(self, key, value):
            self.data[key] = value
            self.save()

        def delete(self, key):
            if key in self.data:
                del self.data[key]
                self.save()

Let's break this down piece by piece:
- In the __init__ constructor, we take a path parameter specifying where to store the database file. We store this path and initialize self.data as an empty dictionary to hold the in-memory data. Finally, we call load to read any existing data from the file.
- The load method checks if a file exists at the given path. If so, it uses json.load to read the file contents into self.data. If no file exists yet, self.data remains an empty dictionary.
- save is responsible for persisting the in-memory self.data to disk. It opens the file at self.path and uses json.dump to write out self.data as a JSON string. We'll call this any time we modify the database.
- get takes a key and returns the corresponding value from self.data, or None if the key isn't found. We just use the dictionary's get method here for simplicity.
- To insert or update data, set takes a key and value, stores them in self.data, and calls save to persist the change.
- Finally, delete removes a key-value pair from self.data and saves, if the key exists. It doesn't return anything.
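To see those methods working together, here is a short usage sketch. It repeats a compact copy of the class so the snippet runs on its own, and the file name demo_db.json and the sample keys are invented for the example. The point is that a second instance opened on the same path picks up whatever the first one saved.

```python
import json
import os
import tempfile


class PyDB:
    """Compact copy of the key-value store described above."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self.load(self.path)

    def load(self, path):
        if os.path.exists(path):
            with open(path, "r") as file:
                self.data = json.load(file)
        else:
            self.data = {}

    def save(self):
        with open(self.path, "w") as file:
            json.dump(self.data, file)

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        self.save()

    def delete(self, key):
        if key in self.data:
            del self.data[key]
            self.save()


# Hypothetical file location in a fresh temporary directory
db_path = os.path.join(tempfile.mkdtemp(), "demo_db.json")

db = PyDB(db_path)
db.set("language", "Python")
db.set("version", 3)
db.delete("version")

# A second instance pointed at the same file sees the saved data
reopened = PyDB(db_path)
print(reopened.get("language"))  # Python
print(reopened.get("version"))   # None
```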
And that's it – a basic but fully functional key-value store in about 30 lines of Python! It's not going to win any performance awards, but it's enough to get a taste of how databases work under the hood.
Of course, there are a number of enhancements we could make. Right now, our database will happily accept keys and values of any type, but we might want to enforce that keys are always strings, for example. We could also add some error handling to the set and delete methods. Here's an expanded version with these improvements:

    # Drop-in replacements for the set and delete methods of PyDB
    def set(self, key, value):
        if not isinstance(key, str):
            raise ValueError("Key must be a string")
        if value is None:
            raise ValueError("Value cannot be None")
        self.data[key] = value
        self.save()

    def delete(self, key):
        if key is None:
            raise ValueError("Key cannot be None")
        if key not in self.data:
            raise KeyError(f"Key not found: {key}")
        del self.data[key]
        self.save()

Now if we try to set a key that isn't a string or a value of None, we'll get an informative exception. Similarly, delete will raise an error if we attempt to delete a nonexistent or null key. We could even go a step further and create a custom exception hierarchy for our database, but we'll leave that as an exercise to the reader.
To make our database interactive, we could build a simple command-line interface that lets us type in commands to get, set, and delete values. But for something a bit more fun and visual, let's create a web-based frontend that runs Python code right in the browser!
Using a tool called Pyodide, we can run a Python interpreter in WebAssembly directly in a web page. I've created a little demo that presents a JavaScript-based HTML frontend to our PyDB class. It lets you type in keys and values, displaying the current contents of the database as you go. The "database" is just stored in memory, so it resets with each page load, but it's still pretty nifty!
Try out the interactive PyDB demo
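If you would rather stay in the terminal, the command-line interface mentioned earlier could look roughly like the sketch below. The command names, the run_command and repl helpers, and the MemoryStore stand-in are all assumptions made for this example. Any object with the same get, set, and delete methods, such as a PyDB instance, could be passed in instead.

```python
import shlex


class MemoryStore:
    """Hypothetical in-memory stand-in with PyDB's get/set/delete methods."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


def run_command(store, line):
    """Parse one command line, apply it to the store, and return a reply."""
    try:
        parts = shlex.split(line)  # honours quotes: set k 'two words'
    except ValueError as error:    # e.g. an unclosed quote
        return f"Error: {error}"
    if not parts:
        return ""
    command, args = parts[0].lower(), parts[1:]
    if command == "get" and len(args) == 1:
        return repr(store.get(args[0]))
    if command == "set" and len(args) == 2:
        store.set(args[0], args[1])
        return "OK"
    if command == "delete" and len(args) == 1:
        store.delete(args[0])
        return "OK"
    return "Usage: get KEY | set KEY VALUE | delete KEY"


def repl(store):
    """Read commands until the user types quit/exit or sends end-of-file."""
    while True:
        try:
            line = input("pydb> ")
        except EOFError:
            break
        if line.strip() in ("quit", "exit"):
            break
        print(run_command(store, line))


# To start an interactive session, call: repl(MemoryStore())
```

Keeping the parsing in run_command, separate from the input loop, makes the command handling easy to test without a terminal.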
Going Further
What we've built is just the beginning when it comes to databases. There are endless features and optimizations we could add to our humble key-value store, like:
- Indexing – Looking up a value by key is already fast, because Python's dictionary is a hash table. The real limitation is that the entire dataset has to be loaded into memory and is rewritten to disk on every change. That's fine for small datasets, but it quickly breaks down as the data grows. We could implement an on-disk index, like a hash index or B-tree, to find a key without reading the whole file, even with millions of keys.
- Query options – Key-value stores are fast and simple, but sometimes we need more powerful querying capabilities. We could expand PyDB's API to support things like range queries, filtering, sorting, and even basic aggregations. The more query power we add, the closer we get to a full-fledged database.
- Persistence options – We're currently using JSON files for data storage, but that's far from the only option. We could swap out the storage engine to use a more efficient binary format, or even switch to an established library like SQLite under the hood.
- Concurrency – PyDB is strictly single-threaded right now. If we wanted to support multiple concurrent readers and writers, we'd have to implement some kind of locking scheme to maintain data integrity. Atomic operations, multiversioning, and write-ahead logging are all techniques used by databases to stay fast and correct in the face of concurrency.
- Replication – For greater durability and availability, databases are often deployed in replicated configurations, where multiple database servers store copies of the data. Keeping replicas in sync through techniques like consensus algorithms and eventual consistency is a key challenge.
- Sharding – As data size grows, it often becomes necessary to split the dataset across multiple machines. This horizontal partitioning is called sharding, and it's key to achieving massive scale in distributed databases. Consistent hashing, shard rebalancing, and query routing are all important components of an effective sharding scheme.
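To make the concurrency item a little more concrete, here is a minimal sketch of two of the simpler techniques: a lock around every operation, and an atomic save that writes to a temporary file and then swaps it into place. The LockedStore class and its file name are hypothetical. The lock only protects threads inside one process, and the sketch makes no attempt at multiversioning or write-ahead logging.

```python
import json
import os
import tempfile
import threading


class LockedStore:
    """Hypothetical key-value store with a lock and atomic saves."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._lock = threading.Lock()

    def _save(self):
        # Write to a temporary file in the same directory, then swap it in.
        # os.replace is atomic on POSIX systems, so a reader sees either the
        # old file or the new one, never a half-written file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as file:
            json.dump(self.data, file)
        os.replace(tmp_path, self.path)

    def set(self, key, value):
        # Only one thread at a time may modify the data and write the file
        with self._lock:
            self.data[key] = value
            self._save()

    def get(self, key):
        with self._lock:
            return self.data.get(key)


store = LockedStore(os.path.join(tempfile.mkdtemp(), "locked_db.json"))


def writer(worker_id):
    for i in range(25):
        store.set(f"worker{worker_id}-item{i}", i)


# Four threads write 25 distinct keys each
threads = [threading.Thread(target=writer, args=(n,)) for n in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

with open(store.path) as file:
    on_disk = json.load(file)
print(len(on_disk))  # 100
```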
Most established databases have spent decades of engineering effort on the above concerns and many more. And we've barely scratched the surface of the broader database landscape – document stores, graph databases, time-series databases, and cloud object storage are all widely used variants, each with their own performance characteristics and use cases.
When it comes to Python and databases, there's a rich ecosystem to explore. Here are a few key libraries to be aware of:
- sqlite3 – Included in Python's standard library, sqlite3 provides an interface to the widely used SQLite embedded database. SQLite is lightweight and serverless, making it great for development, testing, and even production for low-to-moderate traffic applications.
- SQLAlchemy – The most popular Python ORM (object-relational mapper), SQLAlchemy provides a Pythonic interface to relational databases. It supports a wide range of database backends and offers extensive features for modeling, querying, and manipulating data.
- Django ORM – The Django web framework ships with its own full-featured ORM. While tightly coupled to Django itself, the Django ORM is a solid choice if you're already using Django and don't need the flexibility of a standalone ORM like SQLAlchemy.
- pymongo – The official Python driver for MongoDB, pymongo lets you integrate Python applications with the popular document-oriented database. Its API is designed to closely match the MongoDB query language, and it supports the full range of MongoDB features.
- redis-py – Python's interface to Redis, the lightning-fast key-value store. Redis is often used as a high-performance cache, message queue, and real-time analytics engine alongside a traditional database.
- Elasticsearch – While often lumped in with other NoSQL stores, Elasticsearch is really a distributed search and analytics engine. Its Python client provides a powerful API for indexing and querying structured and unstructured data alike.
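Since sqlite3 ships with Python, it is the easiest of these to try. The sketch below shows one way a key-value store with the same get, set, and delete shape as PyDB could sit on top of it. The SQLiteKV class, the kv table, and its column names are assumptions made for this example, and values are stored as JSON text so they round-trip the same way as before.

```python
import json
import sqlite3


class SQLiteKV:
    """Hypothetical key-value store backed by a single SQLite table."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the database in RAM; pass a file path to persist
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self.conn.commit()

    def get(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key, value):
        # "with self.conn" commits on success and rolls back on error;
        # INSERT OR REPLACE overwrites any existing row with the same key
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )

    def delete(self, key):
        with self.conn:
            self.conn.execute("DELETE FROM kv WHERE key = ?", (key,))


db = SQLiteKV()
db.set("user:1", {"name": "John Smith", "age": 35})
db.set("user:1", {"name": "John Smith", "age": 36})  # overwrite
print(db.get("user:1")["age"])  # 36
db.delete("user:1")
print(db.get("user:1"))  # None
```

Unlike the JSON-file version, each write here touches only one row, and SQLite's primary-key index handles the lookup.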
Conclusion
We covered a lot of ground in this post! We saw how even a simple database like our PyDB key-value store requires careful consideration of data structures, persistence mechanisms, and error handling. We took a whirlwind tour through JSON, Python's json module, and even ran Python in a web browser.
Most importantly, we approached databases with a spirit of curiosity and experimentation. Databases can seem arcane and unapproachable from the outside, but by rolling up our sleeves and building one, we demystify these ubiquitous but often hidden systems.
Python and databases are a fantastic combination. Python's simplicity and versatility make it a perfect language for tinkering with database concepts and paradigms. And when it's time to integrate a database into a real application, Python's extensive database ecosystem has you covered.
Of course, PyDB is just the beginning. To build production-ready, scalable databases, we need to consider a whole host of additional concerns, from performance and concurrency to replication and sharding. But the core concepts of data modeling, persistence, and querying are evergreen. Build those fundamentals, and you'll be well equipped to tackle databases of any shape and size.
So keep exploring! Dive into the source code of open-source Python databases. Experiment with different database backends and paradigms. And most importantly, keep building cool stuff. Databases are the backbone of our data-driven world, and with Python in your toolkit, you can help shape that world. Happy coding!