Making the Single Table Design Decision
During my time at Amazon I migrated a production Aurora MySQL database to DynamoDB in an application where nearly all of the access patterns were key-value lookups. However, the data was normalized, with relationships defined between different data types in portfolios. Single table design is the tool that let me bridge the gap between DynamoDB’s non-relational architecture and the relationships that I needed to maintain in my data.
DynamoDB as an opinionated data store
Amazon developed DynamoDB around the kernel of its internal Dynamo database, which was meant to solve recurring problems they were having with relational databases as the company’s growth exceeded the scale those databases could support. Werner Vogels was among the group at Amazon who did this original work, and he describes their motivations in A Decade of Dynamo. There are a number of central paradigms to DynamoDB — distributed architecture, eventual consistency, uncapped scalability — but the central important one for modeling relational data is the API. DynamoDB has no capability for joins between tables. All lookups on DynamoDB are by the primary key, and any operation that does not involved the primary key requires an expensive table scan. The primary key can be a composite primary key, however, with a partition key that must have an exact match in any query operation, and a sort key that can have a wild card match. The partition key and sort key have a physical relationship with the data: the data are separated into partitions by the partition key and are sorted as they are saved using the sort key.
DynamoDB also embraces the fact that storage has become cheaper while compute largely has not. Vertical scaling of CPUs has slowed down in the past two decades as Moore’s Law no longer holds, while storage has largely become cheaper over the same period. Hence DynamoDB optimizes for compute over storage, with the data in each table replicated at least 3 times by default. This approach also drives another key feature of DynamoDB: Global Secondary Indices.
A global secondary index in DynamoDB is a replication of a projection of fields from the table with an alternative primary key. This allows you to define alternative access patterns for data in the table that have the same fast speed as the pattern defined by the primary index on the table. The downside is that unlike primary keys there is no way to query GSIs with strong consistency, and you are limited to 25 GSIs (as of this date) for each table.
Each of these features plays a part in the single table design pattern, which is meant to leverage the advantages of DynamoDB when storing relational data.
What is single table design
With their new database designed around key-value stores, Amazon had a massive migration of their internal data from Oracle to DynamoDB. To do that migration without major hiccups, they had to figure out how to store relational data in a non-relational database, an effort led by Rick Houlihan and his team, as he describes in this interview.
Alex Debrie has been one of the main proponents of single table design in DynamoDB, and I used his book as the basis of my migration work. He lays out the basics of single-table design here.
The fundamentals of single table design are thus:
- Data are pre-joined in the table, rather than joined by queries
- The partition key groups together related items of different types into item collections
- An overloaded sort key allows you to select by type
The database I migrated has a collection of infrastructure portfolios, each with a set of servers and databases that belong to the portfolio. The general table structure looked like this:
| ID (PK) | Name |
|---|---|
| 1 | Jeff's Plumbing |
| 2 | Bob's TVs |
| ID (PK) | Portfolio ID (FK) |
|---|---|
| JeffSiteServer | 1 |
| JeffInternalServer | 1 |
| BobSiteServer | 2 |
| BobInternalServer | 2 |
| ID (PK) | Server ID (FK) |
|---|---|
| JeffSiteDB | JeffSiteServer |
| JeffInternalDB | JeffInternalServer |
| BobSiteDB | BobSiteServer |
| BobInternalDB | BobInternalServer |
The single table design looked like this:
| PK | SK | Portfolio ID | Server ID | Database ID | Portfolio Name |
|---|---|---|---|---|---|
| 1 | PORTFOLIO | 1 | Jeff's portfolio | ||
| 1 | SERVER#jeffsiteserver | 1 | JeffSiteServer | ||
| 1 | SERVER#jeffinternalserver | 1 | JeffInternalServer | ||
| 1 | DATABASE#jeffsitedb | 1 | JeffSiteServer | JeffSiteDB | |
| 1 | DATABASE#jeffinternaldb | 1 | JeffInternalServer | JeffInternalDB | |
| 2 | PORTFOLIO | 2 | Bob's portfolio | ||
| 2 | SERVER#bobsiteserver | 2 | BobSiteServer | ||
| 2 | SERVER#bobinternalserver | 2 | BobInternalServer | ||
| 2 | DATABASE#bobsitedb | 2 | BobSiteServer | BobSiteDB | |
| 2 | DATABASE#bobinternaldb | 2 | BobInternalServer | BobInternalDB |
A query for PK === 1 would return Jeff's portfolio plus all of its servers and databases.
| PK | SK | Portfolio ID | Server ID | Database ID | Portfolio Name |
|---|---|---|---|---|---|
| 1 | PORTFOLIO | 1 | Jeff's portfolio | ||
| 1 | SERVER#jeffsiteserver | 1 | JeffSiteServer | ||
| 1 | SERVER#jeffinternalserver | 1 | JeffInternalServer | ||
| 1 | DATABASE#jeffsitedb | 1 | JeffSiteServer | JeffSiteDB | |
| 1 | DATABASE#jeffinternaldb | 1 | JeffInternalServer | JeffInternalDB |
To get all of the servers for Jeff's portfolio, I would use a query like PK === 1, SK startsWith SERVER.
| PK | SK | Portfolio ID | Server ID | Database ID | Portfolio Name |
|---|---|---|---|---|---|
| 1 | SERVER#jeffsiteserver | 1 | JeffSiteServer | ||
| 1 | SERVER#jeffinternalserver | 1 | JeffInternalServer |
I can also have a GSI with Server ID as the partition key and SK as the sort key, if I want to be able to query all of the databases that belong to each server. If I project all the fields onto a GSI with Server ID as the partition key and SK as the sort key, I could query for Server ID = JeffSiteServer and SK startsWith DATABASE and get the following:
| PK | SK | Portfolio ID | Server ID | Database ID | Portfolio Name |
|---|---|---|---|---|---|
| 1 | DATABASE#jeffsitedb | 1 | JeffSiteServer | JeffSiteDB |
One issue with that secondary index is that you would want the server IDs to be globally unique in that scenario. This is because the GSI spans the entire table, so non-unique server IDs in different portfolios would collide. If your server IDs are unique for every portfolio, you can make a column combining the portfolio ID with the server ID, then use that column as the partition key for the GSI, like this:
| PK | SK | Portfolio ID | Server ID | GSI1PK | Database ID | Portfolio Name |
|---|---|---|---|---|---|---|
| 1 | PORTFOLIO | 1 | Jeff's portfolio | |||
| 1 | SERVER#jeffsiteserver | 1 | JeffSiteServer | 1#JeffSiteServer | ||
| 1 | SERVER#jeffinternalserver | 1 | JeffInternalServer | 1#JeffInternalServer | ||
| 1 | DATABASE#jeffsitedb | 1 | JeffSiteServer | 1#JeffSiteServer | JeffSiteDB | |
| 1 | DATABASE#jeffinternaldb | 1 | JeffInternalServer | 1#JeffInternalServer | JeffInternalDB | |
| 2 | PORTFOLIO | 2 | Bob's portfolio | |||
| 2 | SERVER#bobsiteserver | 2 | BobSiteServer | 2#BobSiteServer | ||
| 2 | SERVER#bobinternalserver | 2 | BobInternalServer | 2#BobInternalServer | ||
| 2 | DATABASE#bobsitedb | 2 | BobSiteServer | 2#BobSiteServer | BobSiteDB | |
| 2 | DATABASE#bobinternaldb | 2 | BobInternalServer | 2#BobInternalServer | BobInternalDB |
Now you can query all of the databases for JeffSiteServer on portfolio 1 by querying on the GSI with GSI1PK = 1#JeffSiteServer and SK startsWith DATABASE:
| PK | SK | Portfolio ID | Server ID | GSI1PK | Database ID | Portfolio Name |
|---|---|---|---|---|---|---|
| 1 | DATABASE#jeffsitedb | 1 | JeffSiteServer | 1#JeffSiteServer | JeffSiteDB |
What if I wanted to make a query for all of the servers in the DynamoDB table? The application design actually made that query unnecessary. In this application, the user only loads a single portfolio in each session, making a query across portfolios unnecessary for OLTP. Those queries are better served by an OLAP database that separates that work from the customer-facing data, such as a data lake populated by your DynamoDB table.
There are more complex conditions, such as many-to-many joins, that single table design can handle, and for those cases I suggest applying the techniques in Alex DeBrie’s book.
When to use single table design
While single table design solved the specific problem of how to store my relational data, it’s best used when the data have certain conditions:
- The data are related. If the data have no relationship, there is no reason to put them in the same table. Single-table design benefits from keeping related data close together in the data store for quick retrieval, but there is no advantage when the data have no relationship
- Access patterns can be defined solely on the partition key and sort key. This is a broader recommendation for the decision to use DynamoDB. Any query that does not use the primary key requires a table scan. If you’re doing a lot of queries on values that aren’t in the partition key or the sort key, DynamoDB in general is a poor choice
- The relationships in the data model are not likely to change. If you’re early in a project and you’re not sure how the data types are really related or what the general access patterns will be, you should not commit to a design that cements the structure of the joins in the database itself. If you’re at the start of a project, you can model the data in memory or in a small local store first before analyzing the access patterns and determining if single table design is a good fit.
In my case, I felt confident using single table design because the data structures were mature and would not be changing. A colleague of mine did actually add a new data type, but it fit in with the existing hierarchy and did not change the design. Additionally the access patterns had no ad-hoc queries and all queries to the data store could be performed on the primary key.
When not to use single table design
- Single table design should result from data modeling, rather than the other way around. Pete Naylor describes in his Momento blog the anti-pattern of choosing single-table design as the ideal endpoint for a data model. I agree with his judgement that the number of tables that you use should be the outcome of a good data modeling approach, rather than the target
- You use a lot of ad-hoc access patterns. This is more of an argument against DynamoDB in general for that use case, but if you analyze the access patterns that your application uses with your data store and you can’t simplify these to conform to primary key queries, you should strongly consider data stores other than DynamoDB
- Your data model is changing rapidly. If your data model is changing rapidly, cementing the relationship between the entities in your data store will get in your way. You will either find yourself forced to accommodate an outdated data model or you will need to remake the data store entirely, complete with a backfill of existing data. Adding new fields that don’t affect the relationship between entities should not be a problem, but changes requiring backfills with complex queries can be difficult and changes to the relationship between entities is nearly impossible.