Our engineers have learned a lot since adopting Amazon DynamoDB as a data store for the Nexosis machine learning platform. They learned so much that they decided to turn it into a blog post series. It’s not an exhaustive course on DynamoDB, but a set of vignettes focused on specific issues they encountered. John kicks it off below with costs and document sizing.
I am a relational database developer, like many of you. For over 15 years, I have focused a great deal of time and energy learning SQL Server, understanding the intricacies of query optimization and execution plans. I faithfully applied 3NF to schema design, limiting or eliminating redundant data like a good database developer should.
Then along came an opportunity to use Amazon DynamoDB, and I was thrilled with the promises it made. “Single-millisecond latency at any scale!” After some initial evaluation, reading documentation and marketing materials, it seemed like a good fit for our use case at Nexosis: storing time-series data without a fixed schema. So, we dove in and started building around it.
Then the dark times came. Query performance was more sluggish than we could have imagined. Costs were creeping (and sometimes leaping) upwards. Limitations, initially buried in fine print, were suddenly out in the open, hampering our ability to balance throughput with cost.
We fought through these obstacles though, and over time, we were able to get DynamoDB performing well enough to serve our customers at a reasonable cost. This blog series is a record of the lessons we learned adopting Amazon DynamoDB as a data store for the Nexosis machine learning platform. It’s not an exhaustive course on DynamoDB, but a set of vignettes focused on specific issues we encountered, and how we overcame them. We came at DynamoDB from a relational database perspective, and that led to certain blindnesses and pitfalls, which hopefully we can help you avoid.
Understanding costs: read/write capacity units
As a developer, you control the throughput of each DynamoDB table by configuring read and write capacity units (RCUs and WCUs, respectively) for the table and the table’s indexes.[1] Each RCU and WCU provides a guaranteed level of throughput that is spelled out in the DynamoDB documentation, summarized below:
- 1 RCU = Read 1 strongly consistent document (up to 4KB) per second
- 1 WCU = Write 1 document (up to 1KB) per second
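The two rules above can be sketched as a pair of small helpers. This is only an illustration, not anything from the DynamoDB SDK: the function names `read_units` and `write_units` are made up for this article, and the sketch assumes strongly consistent reads of single items, with sizes rounded up to the next 4KB (reads) or 1KB (writes).

```python
import math

READ_BLOCK_BYTES = 4 * 1024   # 1 RCU covers one strongly consistent read of up to 4KB
WRITE_BLOCK_BYTES = 1 * 1024  # 1 WCU covers one write of up to 1KB


def read_units(item_bytes: int) -> int:
    """RCUs consumed by one strongly consistent read of a single item (hypothetical helper)."""
    return max(1, math.ceil(item_bytes / READ_BLOCK_BYTES))


def write_units(item_bytes: int) -> int:
    """WCUs consumed by one write of a single item (hypothetical helper)."""
    return max(1, math.ceil(item_bytes / WRITE_BLOCK_BYTES))


if __name__ == "__main__":
    # Print the units consumed for a few document sizes.
    for size in (100, 1024, 1025, 4096, 5000):
        print(size, "bytes:", read_units(size), "RCU,", write_units(size), "WCU")
```

Note that a 100-byte document and a 1KB document consume the same single WCU, which is why document size matters so much in the cost examples that follow.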
Amazon charges per hour for configured read and write capacity. While prices vary between different AWS regions, for the purposes of this discussion we’ll use the US East region’s prices:
- 1 RCU = $0.00013 per hour, supporting up to 3,600 strongly consistent reads (or 7,200 eventually consistent reads) per hour
- 1 WCU = $0.00065 per hour, supporting up to 3,600 writes per hour
Sounds like peanuts, right? Let’s use these figures in a real-world example. Say you are designing a DynamoDB table and you anticipate the size of each document will be around 1KB. You want to support 100 document reads and 100 document writes per second for a modest application. We can figure out the monthly cost of your table based on the information above:
- 100 RCU * $0.00013 per hour * 730 hours per month = $9.49
- 100 WCU * $0.00065 per hour * 730 hours per month = $47.45
- Total cost for one table with 100 RCU and 100 WCU for one month = $56.94
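The same arithmetic as a short Python sketch, in case you want to plug in your own numbers. The prices and the 730-hour month are the figures quoted above; the function name `monthly_cost` is just an illustrative choice.

```python
RCU_PRICE_PER_HOUR = 0.00013  # US East price quoted in this article
WCU_PRICE_PER_HOUR = 0.00065  # US East price quoted in this article
HOURS_PER_MONTH = 730


def monthly_cost(rcu: int, wcu: int) -> tuple[float, float, float]:
    """Return (read cost, write cost, total) in dollars for one month of static capacity."""
    read_cost = rcu * RCU_PRICE_PER_HOUR * HOURS_PER_MONTH
    write_cost = wcu * WCU_PRICE_PER_HOUR * HOURS_PER_MONTH
    return read_cost, write_cost, read_cost + write_cost


if __name__ == "__main__":
    read_cost, write_cost, total = monthly_cost(rcu=100, wcu=100)
    print(f"reads ${read_cost:.2f}, writes ${write_cost:.2f}, total ${total:.2f}")
```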
The first thing to note is that writing to DynamoDB tables is always far more expensive than reading. In this example, where documents are sized optimally for writing, writing still costs 5x as much as reading. This difference in cost becomes even more pronounced when documents become smaller than 1KB. To make the comparison fair, we will assume that we’re considering the same total volume of data, just split into smaller individual documents.
Let’s consider an example of reading and writing 1024 documents of 100 bytes each per second, which is still 100KB of total data per second. Assuming the application uses the DynamoDB Query operation to read items with the same partition key together, the same 100 RCU will be sufficient to support the application.[2] However, the smaller documents each need to be written separately, meaning our table requires 1024 WCUs for the same throughput.
- 100 RCU * $0.00013 per hour * 730 hours per month = $9.49
- 1024 WCU * $0.00065 per hour * 730 hours per month = $485.89
- Total cost for one table with 100 RCU and 1024 WCU for one month = $495.38
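To see where the 1024 WCU figure comes from, here is a rough sketch of the sizing logic. It assumes each document is written individually (so each write rounds up to a full 1KB unit), and that reads go through one strongly consistent Query per second whose consumption is based on the combined size of the items returned, rounded up to 4KB. The helper names are hypothetical.

```python
import math

WCU_PRICE_PER_HOUR = 0.00065  # US East price quoted in this article
HOURS_PER_MONTH = 730


def wcus_for_writes(items_per_second: int, item_bytes: int) -> int:
    """Each item is written separately, and each write rounds up to whole 1KB units."""
    return items_per_second * max(1, math.ceil(item_bytes / 1024))


def rcus_for_query(total_bytes_per_second: int) -> int:
    """One strongly consistent Query per second (an assumption); sizes are summed, then rounded up to 4KB."""
    return max(1, math.ceil(total_bytes_per_second / 4096))


if __name__ == "__main__":
    large = wcus_for_writes(items_per_second=100, item_bytes=1024)   # 1KB documents
    small = wcus_for_writes(items_per_second=1024, item_bytes=100)   # 100-byte documents
    print("WCUs for 1KB documents:", large)
    print("WCUs for 100-byte documents:", small)
    print("RCUs to Query 100KB per second:", rcus_for_query(1024 * 100))
    print(f"Monthly write cost, small documents: ${small * WCU_PRICE_PER_HOUR * HOURS_PER_MONTH:.2f}")
```

Under these assumptions the same 100KB per second needs roughly ten times the write capacity once it is split into 100-byte documents, while the Query-based reads stay well within the configured 100 RCU.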
Whoa! That’s not good! This cost is for throughput on a single table, without any indexes. By way of comparison, a full RDBMS running on a db.r3.xlarge (4 cores, 30.5 GB RAM) Amazon RDS instance costs about as much as just a couple of these tables:
- PostgreSQL on db.r3.xlarge: $1.000 per hour * 730 hours per month = $730
- SQL Server on db.r3.xlarge: $1.520 per hour * 730 hours per month = $1109.60
Of course, there are plenty of ways in which these simple examples will be different from any real-world use cases. These examples assume static throughput for an entire month; real-world applications will almost certainly use Auto Scaling, which will dynamically increase and decrease configured throughput based on current load. Each table in an application will have different document sizes and a different usage profile. It is certainly possible to build an application that uses DynamoDB efficiently and cost-effectively. The point of this exercise is to demonstrate that without careful consideration of an application’s data access patterns, the cost of using DynamoDB will quickly grow beyond that of a standalone RDBMS or NoSQL database.
Optimal usage scenarios for DynamoDB
So when is DynamoDB a more cost-effective choice than a standalone database server? The following are some guidelines:
- Read-intensive workloads: Since writing to DynamoDB is much more expensive than reading, applications that read documents far more frequently than they write are good candidates for using it.
- Relatively few tables: On a standalone database server, server resources such as CPU and RAM are shared between all of the tables and indexes on the server. In DynamoDB, each table and index has its own configured throughput that it doesn’t share with any other table. Therefore, applications that have varied access patterns across many tables may be better served by the shared resources of a standalone server. On the other hand, an application that uses relatively few tables in a consistent manner may benefit from fine-grained control over each table’s throughput.
- Occasional, predictable access: DynamoDB really shines in applications that require occasional, sustained high throughput followed by periods of relatively low throughput. When high throughput is required, a table can be scaled up to process requests quickly, and once the period of high demand ends, the table can be scaled back down to lower cost.[3] Dynamic scaling of a DynamoDB table may be more cost-effective than an “always-on” standalone database server.
- Document sizes between 1KB and 4KB: As the examples presented above demonstrate, DynamoDB is not cost-effective for writing a large number of very small documents. Applications can maximize the throughput of a DynamoDB table by reading and writing documents between 1KB and 4KB in size. I will likely present some tips for transforming fine-grained documents into more coarse-grained documents within these size boundaries in a later article.
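As a rough illustration of the "occasional, predictable access" guideline, the sketch below compares a table held at peak write capacity all month with one that is scaled up only during busy periods. The schedule (1000 WCU for 60 hours a month, 50 WCU for the remaining 670) is entirely made up for the example, and the calculation ignores any scaling restrictions; only the WCU price comes from the figures earlier in this article.

```python
WCU_PRICE_PER_HOUR = 0.00065  # US East price quoted in this article


def provisioned_write_cost(schedule: list[tuple[int, int]]) -> float:
    """schedule is a list of (configured WCU, hours at that level) pairs for one month."""
    return sum(wcu * hours for wcu, hours in schedule) * WCU_PRICE_PER_HOUR


if __name__ == "__main__":
    # Hypothetical workload: 60 busy hours per month, quiet for the other 670.
    always_on = provisioned_write_cost([(1000, 730)])
    scaled = provisioned_write_cost([(1000, 60), (50, 670)])
    print(f"always at peak: ${always_on:.2f}, scaled to demand: ${scaled:.2f}")
```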
[1] Read/Write capacity units are only configured separately for global secondary indexes, which have a different partition key than the table.
[2] See “Capacity Unit Consumption for Reads” in the DynamoDB documentation for exact details about how the different read operations consume capacity units.
[3] Dynamic capacity has its own set of restrictions, but they are outside the scope of this article.