DynamoDB - Important Concepts

🌟 Brief Introduction to DynamoDB

DynamoDB is a fully managed NoSQL database service offered by AWS. Its main features make it an ideal choice for modern applications that require high performance and scalability. Here are some highlights:

🚀 Serverless: Automatic scaling without the need to manage servers.
⚡ Low Latency: Fast responses for real-time applications.
📈 Throughput Capacity: On-demand or provisioned capacity configuration.
💪 High Performance: Great solution for intensive and large-scale workloads.
🔗 Easy AWS Integration: Easily connect to various AWS services. See some examples at Serverless Land.

This summary captures some of DynamoDB's initial concepts, based on the official AWS documentation, to provide a quick overview of its main advantages.

→ Data Types Supported by DynamoDB

Before we start talking about our partition key and other concepts, here's a brief summary of the data types supported by this database. Amazon DynamoDB supports a variety of data types, allowing flexibility in storage. These types are classified into three main categories:

Scalar:
- String (S): Unicode text (UTF-8 encoded), bounded by the 400 KB maximum item size.
- Number (N): Numeric values, including integers and decimals, with up to 38 digits of precision.
- Binary (B): Binary data, such as files and images, bounded by the same 400 KB maximum item size.
- Boolean (BOOL): true or false values.
- Null (NULL): Represents an attribute with an unknown or undefined state.
Document:
- Map (M): Key-value structures, allowing nested data.
- List (L): Ordered collections of values, which can mix data types and include nested lists and maps.
Sets:
- String Set (SS): Set of strings without duplicate values.
- Number Set (NS): Set of unique numbers.
- Binary Set (BS): Set of binary data, without duplicate values.

Important note: "binary" in this context means raw bytes. Applications must encode binary values in base64 before sending them to DynamoDB.

Together, these data types give DynamoDB a flexible structure that can hold anything from simple values to complex, nested objects. For more detailed information, check the official DynamoDB Data Types documentation.
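
As an illustration of the type codes listed above, here is a minimal sketch that converts plain Python values into the typed attribute format used by DynamoDB's low-level API. The helper name `to_attribute_value` and the sample item are hypothetical (the AWS SDKs ship their own serializers), and rejecting `float` in favor of `int` and `Decimal` is a choice made here to avoid precision surprises.

```python
import base64
from decimal import Decimal


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, Decimal)) and not isinstance(value, bool)


def _b64(raw):
    return base64.b64encode(raw).decode("ascii")


def to_attribute_value(value):
    """Convert a Python value to a typed attribute (hypothetical helper)."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):
        return {"BOOL": value}
    if _is_number(value):
        return {"N": str(value)}  # numbers travel as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, bytes):
        return {"B": _b64(value)}  # binary travels base64-encoded
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    if isinstance(value, (set, frozenset)):
        if not value:
            raise ValueError("sets must contain at least one element")
        if all(isinstance(v, str) for v in value):
            return {"SS": sorted(value)}
        if all(_is_number(v) for v in value):
            return {"NS": [str(v) for v in sorted(value)]}
        if all(isinstance(v, bytes) for v in value):
            return {"BS": [_b64(v) for v in sorted(value)]}
        raise TypeError("a set must hold only strings, only numbers, or only bytes")
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Hypothetical item mixing scalar, document, and set types.
item = {
    "customer_id": "C-100",
    "age": 31,
    "active": True,
    "nickname": None,
    "address": {"city": "Lisbon", "zip": "1000-001"},
    "tags": {"vip", "newsletter"},
}
typed_item = {name: to_attribute_value(v) for name, v in item.items()}
print(typed_item)
```

Sets are sorted here only to make the output deterministic; DynamoDB itself does not preserve the order of set elements.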

🔑 Partition Key

The partition key is essential in DynamoDB: it determines which physical partition (backed by SSD storage) holds each item. This is what allows DynamoDB to distribute data efficiently.

According to AWS documentation:

"DynamoDB uses the partition key value as input to an internal hash function. The output of the hash function determines the partition (DynamoDB's internal physical storage) where the item will be stored. All items with the same partition key value are stored together, sorted by sort key value."

In summary, all items with the same partition key are stored together, ordered by the sort key value (when it exists). This allows quick access to data that shares the same partition key, optimizing queries and read/write operations in the database.
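
The idea can be sketched in a few lines of Python. DynamoDB's real hash function and partition count are internal and not public, so this example makes two loud assumptions: SHA-256 stands in for the hash, and `NUM_PARTITIONS` is a made-up fixed value. The key names are hypothetical too.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; DynamoDB manages the real partition count


def partition_for(partition_key, num_partitions=NUM_PARTITIONS):
    """Map a partition key to a partition index (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def store(items, num_partitions=NUM_PARTITIONS):
    """Group (partition_key, sort_key) pairs by partition, then by partition key.

    Within each partition key, sort keys are kept in sorted order.
    """
    partitions = defaultdict(lambda: defaultdict(list))
    for pk, sk in items:
        partitions[partition_for(pk, num_partitions)][pk].append(sk)
    for collections in partitions.values():
        for sort_keys in collections.values():
            sort_keys.sort()
    return partitions


items = [
    ("customer#1", "order#2024-03-01"),
    ("customer#2", "order#2024-01-15"),
    ("customer#1", "order#2024-01-10"),
]
layout = store(items)

# Both orders of customer#1 sit in the same partition, sorted by sort key.
customer_1 = layout[partition_for("customer#1")]["customer#1"]
print(customer_1)
```

The same key always hashes to the same partition, which is why a lookup by partition key can go straight to the right place instead of searching the whole table.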

For more information about how the partition key works, check the official DynamoDB documentation.

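The same documentation page also explains why the PK (partition key) is called the hash attribute and the SK (sort key) is called the range attribute: "hash" comes from the internal hash function that distributes items across partitions based on their partition key values, and "range" comes from the way items with the same partition key are stored physically close together, in sorted order by sort key value.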

Want some tips on how to choose your partition key? The documentation has great tips.

The DynamoDB documentation includes a diagram of how items are distributed across partitions. Knowing how this distribution works helps a lot when choosing a modeling approach for your table(s). The diagram's caption reads:

"DynamoDB stores an item with a composite partition key and sorts that item using the sort key attribute value."

🔑 Sort Key

The sort key orders the items that share a partition key, which makes targeted lookups and range queries possible. It is also called the range attribute, because it defines the sequence and range of items stored under a given partition key (or hash attribute). With a sort key, queries can use conditions such as =, <, <=, >, >=, BETWEEN, and begins_with, allowing ordered and precise filtering.
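
A minimal sketch of a range query follows. It only builds the parameters for a Query call, so it runs without AWS credentials; the resulting dictionary could be passed to a boto3 DynamoDB client as `client.query(**params)`. The table name `Orders` and the attribute names `customer_id` and `order_date` are hypothetical, and both keys are assumed to be strings.

```python
def build_range_query(table_name, pk_name, pk_value, sk_name, sk_low, sk_high):
    """Build Query parameters: exact partition key match plus a sort key range."""
    return {
        "TableName": table_name,
        # The partition key must use equality; the sort key may use a range.
        "KeyConditionExpression": "#pk = :pk AND #sk BETWEEN :low AND :high",
        # Placeholders avoid clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#pk": pk_name, "#sk": sk_name},
        "ExpressionAttributeValues": {
            ":pk": {"S": pk_value},
            ":low": {"S": sk_low},
            ":high": {"S": sk_high},
        },
    }


params = build_range_query(
    table_name="Orders",          # hypothetical table
    pk_name="customer_id",        # hypothetical partition key
    pk_value="C-100",
    sk_name="order_date",         # hypothetical sort key (ISO dates sort correctly as text)
    sk_low="2024-01-01",
    sk_high="2024-03-31",
)
print(params["KeyConditionExpression"])
```

Storing dates as ISO 8601 strings is a common design choice here, because their lexicographic order matches chronological order.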


🔍 Global Secondary Index

Creation at any time: A GSI can be added to an existing table; unlike an LSI, it does not have to be defined when the table is created.
Different keys: Supports a partition key and sort key combination different from the main table's.
Eventual consistency: Reads from a GSI are eventually consistent; strongly consistent reads are not supported.
Limit of 20 GSIs per table: This is the default quota per table.
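
Because a GSI can be added later, it is created through a table update rather than at table creation. The sketch below builds the parameters for an UpdateTable call without contacting AWS; they could be passed to a boto3 client as `client.update_table(**params)`. The table, index, and attribute names are hypothetical, all keys are assumed to be strings, and the table is assumed to use on-demand capacity (a provisioned table would also need a `ProvisionedThroughput` entry for the index).

```python
def build_gsi_create_request(table_name, index_name, pk_name, sk_name=None):
    """Build UpdateTable parameters that add a GSI to an existing table."""
    key_schema = [{"AttributeName": pk_name, "KeyType": "HASH"}]
    attribute_definitions = [{"AttributeName": pk_name, "AttributeType": "S"}]
    if sk_name is not None:
        key_schema.append({"AttributeName": sk_name, "KeyType": "RANGE"})
        attribute_definitions.append({"AttributeName": sk_name, "AttributeType": "S"})
    return {
        "TableName": table_name,
        # Every attribute used in the index key schema must be declared.
        "AttributeDefinitions": attribute_definitions,
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": index_name,
                    "KeySchema": key_schema,
                    "Projection": {"ProjectionType": "ALL"},
                }
            }
        ],
    }


params = build_gsi_create_request(
    table_name="Orders",            # hypothetical table
    index_name="orders-by-status",  # hypothetical index
    pk_name="status",               # differs from the table's partition key
    sk_name="order_date",
)
print(params["GlobalSecondaryIndexUpdates"][0]["Create"]["KeySchema"])
```

Note the `HASH` and `RANGE` key types: the API uses the hash attribute and range attribute terminology for partition and sort keys.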

🔍 Local Secondary Index

The Local Secondary Index (LSI) allows creating an alternative view of the main table, with a new sort key configuration without changing the partition key. This index is ideal for performing additional queries on already stored data, using different sorting criteria.

Main features:

Creation at table-creation time: An LSI must be created along with the table; it cannot be added later.
Same partition key as the main table: The LSI's partition key must be identical to the main table's, but the LSI defines an alternative sort key.
Consistency: Reads on an LSI are eventually consistent by default, but strongly consistent reads can be requested (unlike with a GSI), so that the data read reflects the latest changes.
Limit of 5 LSIs per table: Each table can have at most 5 LSIs, so they are a finite resource that should be planned around query needs.

LSI is useful for cases where different data orderings are needed without changing the partition structure.
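
The LSI rules above can be sketched as a small builder for CreateTable parameters. It does not call AWS; the result could be passed to a boto3 client as `client.create_table(**params)`. The table, index, and attribute names are hypothetical, every key attribute is assumed to be a string, and on-demand billing is chosen only to keep the example short.

```python
MAX_LSI_PER_TABLE = 5


def build_create_table_with_lsis(table_name, pk_name, sk_name, lsi_sort_keys):
    """Build CreateTable parameters with LSIs.

    lsi_sort_keys maps an index name to its alternative sort key attribute.
    """
    if len(lsi_sort_keys) > MAX_LSI_PER_TABLE:
        raise ValueError(f"a table can have at most {MAX_LSI_PER_TABLE} LSIs")

    # Declare each key attribute once, preserving order.
    key_attributes = list(dict.fromkeys([pk_name, sk_name, *lsi_sort_keys.values()]))

    return {
        "TableName": table_name,
        "BillingMode": "PAY_PER_REQUEST",
        "AttributeDefinitions": [
            {"AttributeName": name, "AttributeType": "S"} for name in key_attributes
        ],
        "KeySchema": [
            {"AttributeName": pk_name, "KeyType": "HASH"},
            {"AttributeName": sk_name, "KeyType": "RANGE"},
        ],
        "LocalSecondaryIndexes": [
            {
                "IndexName": index_name,
                "KeySchema": [
                    # Same partition key as the table, different sort key.
                    {"AttributeName": pk_name, "KeyType": "HASH"},
                    {"AttributeName": alt_sort_key, "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
            for index_name, alt_sort_key in lsi_sort_keys.items()
        ],
    }


params = build_create_table_with_lsis(
    table_name="Orders",                     # hypothetical table
    pk_name="customer_id",
    sk_name="order_date",
    lsi_sort_keys={"by-total": "order_total_padded"},  # hypothetical index
)
print(params["LocalSecondaryIndexes"][0]["KeySchema"])
```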

→ RCU (read capacity unit)

One read capacity unit (RCU) allows one strongly consistent read per second, or two eventually consistent reads per second, for an item of up to 4 KB. Larger items consume additional RCUs, with the item size rounded up to the next 4 KB. RCUs should be sized according to the expected read load and the size of the items to be read.
In summary: a strongly consistent read of up to 4 KB costs one RCU, and an eventually consistent read of up to 4 KB costs half an RCU. See the documentation for details.
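
The arithmetic can be sketched as follows. The function name `required_rcus` is hypothetical, and the model is deliberately simple: it covers single-item reads of a uniform size and ignores transactional reads.

```python
import math

RCU_BLOCK_BYTES = 4 * 1024  # one RCU covers up to 4 KB per strongly consistent read


def required_rcus(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate the RCUs needed to sustain a read rate for items of a given size."""
    # Item size is rounded up to the next 4 KB block.
    blocks_per_read = max(1, math.ceil(item_size_bytes / RCU_BLOCK_BYTES))
    units = blocks_per_read * reads_per_second
    if not strongly_consistent:
        units = units / 2  # eventually consistent reads cost half
    return math.ceil(units)


# A 6 KB item rounds up to two 4 KB blocks.
print(required_rcus(6 * 1024, reads_per_second=10))
print(required_rcus(6 * 1024, reads_per_second=10, strongly_consistent=False))
```

With these inputs, ten strongly consistent reads per second need 20 RCUs, and the same rate with eventual consistency needs 10.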

→ WCU (write capacity unit)

One write capacity unit (WCU) allows one write per second for an item of up to 1 KB. Larger items consume additional WCUs, with the item size rounded up to the next 1 KB. See the documentation for details.
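
The same estimate for writes, as a minimal sketch. The function name `required_wcus` is hypothetical, and the model covers standard (non-transactional) writes of a uniform item size.

```python
import math

WCU_BLOCK_BYTES = 1024  # one WCU covers a write of up to 1 KB


def required_wcus(item_size_bytes, writes_per_second):
    """Estimate the WCUs needed to sustain a write rate for items of a given size."""
    # Item size is rounded up to the next 1 KB block.
    blocks_per_write = max(1, math.ceil(item_size_bytes / WCU_BLOCK_BYTES))
    return blocks_per_write * writes_per_second


# A 3,500-byte item rounds up to four 1 KB blocks.
print(required_wcus(3500, writes_per_second=5))
```

Five writes per second of a 3,500-byte item therefore need 20 WCUs.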

🚀 Throughput Capacity in DynamoDB

DynamoDB offers two main options for managing throughput capacity, ensuring flexibility and scalability to meet different usage patterns:

On-Demand Capacity:
- Ideal for unpredictable workloads or with extreme variations.
- DynamoDB automatically allocates and adjusts capacity to meet demand peaks.
- Doesn't require prior planning of RCUs (Read Capacity Units) and WCUs (Write Capacity Units).
- Pay-per-request, ensuring you only pay for what you use.
Provisioned Capacity:
- More economical for constant and predictable workloads.
- Allows manually defining the number of RCUs and WCUs, with Auto Scaling option to automatically adjust capacity when demand increases.
- RCU (Read Capacity Unit): Each RCU allows one strongly consistent read per second, or two eventually consistent reads per second, for an item of up to 4 KB.
- WCU (Write Capacity Unit): Each WCU allows one write per second for an item of up to 1 KB.
- Properly configuring throughput avoids throttling and reduces costs in predictable workloads.

The choice between on-demand and provisioned capacity depends on the specific needs of the application, with on-demand being ideal for variable loads and provisioned for stable loads (with autoscaling flexibility).
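
In the API, this choice comes down to a couple of table settings. The sketch below builds them for either mode; the returned dictionary could be merged into the parameters of a boto3 `create_table` or `update_table` call. The helper name `capacity_settings` and the mode labels `"on-demand"` and `"provisioned"` are hypothetical, while `BillingMode` and `ProvisionedThroughput` are the parameter names used by the DynamoDB API.

```python
def capacity_settings(mode, read_capacity_units=None, write_capacity_units=None):
    """Return the table settings for on-demand or provisioned capacity."""
    if mode == "on-demand":
        # No RCU/WCU planning: billing is per request.
        return {"BillingMode": "PAY_PER_REQUEST"}
    if mode == "provisioned":
        if not read_capacity_units or not write_capacity_units:
            raise ValueError("provisioned mode needs both RCUs and WCUs")
        return {
            "BillingMode": "PROVISIONED",
            "ProvisionedThroughput": {
                "ReadCapacityUnits": read_capacity_units,
                "WriteCapacityUnits": write_capacity_units,
            },
        }
    raise ValueError(f"unknown capacity mode: {mode!r}")


print(capacity_settings("on-demand"))
print(capacity_settings("provisioned", read_capacity_units=20, write_capacity_units=5))
```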

For more details about throughput management, check the official documentation.

→ DynamoDB Streams

Definition: Allows generating an event stream from data modifications in the table.
DynamoDB Streams is a feature that captures changes to a DynamoDB table in near real time. Each time an item in the table is inserted, updated, or deleted, DynamoDB Streams records the modification and makes the change event available to other applications. Each change is kept in the stream for up to 24 hours and can be consumed by services like AWS Lambda, enabling automatic processing of these changes.

Explanatory diagram from the documentation: DynamoDB Streams and Lambda integration to automatically send a welcome email to new customers.
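
The welcome email scenario can be sketched as a Lambda-style handler. This is an illustration under assumptions, not the code from the documentation: the `email` attribute and the `send_welcome_email` placeholder are hypothetical, and the stream is assumed to be configured with a view type that includes the new item image (`NEW_IMAGE` or `NEW_AND_OLD_IMAGES`). The sample event at the bottom is hand-written to mimic the shape of stream records.

```python
def send_welcome_email(address):
    # Placeholder: a real function would call an email service here.
    print(f"Would send a welcome email to {address}")


def handler(event, context=None):
    """Process stream records and greet customers from newly inserted items."""
    welcomed = []
    for record in event.get("Records", []):
        # eventName is INSERT, MODIFY, or REMOVE; only new items matter here.
        if record.get("eventName") != "INSERT":
            continue
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        email = new_image.get("email", {}).get("S")  # hypothetical attribute
        if email:
            send_welcome_email(email)
            welcomed.append(email)
    return {"welcomed": welcomed}


# Hand-written sample event for a local dry run.
sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "Keys": {"customer_id": {"S": "C-100"}},
                "NewImage": {
                    "customer_id": {"S": "C-100"},
                    "email": {"S": "ana@example.com"},
                },
            },
        },
        {
            "eventName": "MODIFY",
            "dynamodb": {
                "Keys": {"customer_id": {"S": "C-007"}},
                "NewImage": {
                    "customer_id": {"S": "C-007"},
                    "email": {"S": "old@example.com"},
                },
            },
        },
    ]
}
result = handler(sample_event)
```

Filtering on `INSERT` keeps updates to existing customers from triggering a second welcome email.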

🔎 Scan vs Query in DynamoDB

In DynamoDB, scan and query operations are used to retrieve data from tables, but they work in very different ways and present distinct costs, especially in large tables. Understanding when to use each can have a significant impact on performance and cost.

🚀 Query

A query is an efficient operation, focused on retrieving specific items based on a partition key (and optionally a sort key). Since the query only searches data in a specific partition, the resource consumption is reduced, making it the more economical choice for targeted queries.

Efficiency: Since the query accesses only one partition, it avoids traversing unnecessary data.
Consistency: Can be configured for strongly consistent or eventually consistent reads, with the latter consuming half the read units.
Costs: Ideal for searches that can be delimited by the partition key, reducing the number of RCUs (Read Capacity Units) needed.

🌀 Scan

A scan is an operation that reads the entire table or index, applying filters to the data only after reading. In large tables, this approach can become quite costly, as the scan traverses all items and consumes more RCUs, including for data that will be discarded by filters.

Versatility: Allows filtering without needing a partition key, but accesses the entire table, making it heavier.
Consistency: Like queries, scan reads can be configured for strong or eventual consistency.
Costs: Large tables result in high RCU consumption, making scan expensive for large volumes of data.
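
To make the cost difference concrete, here is a simplified in-memory model of both operations. It is a sketch under assumptions, not DynamoDB's implementation: the table is a plain Python list, every item is assumed to be 1 KB, reads are eventually consistent, pagination is ignored, and consumption is modeled as the total bytes read rounded up to 4 KB blocks.

```python
import math

BLOCK_BYTES = 4 * 1024


def consumed_rcus(total_bytes_read, strongly_consistent=False):
    """RCUs for one operation: total size read, rounded up to 4 KB blocks."""
    blocks = math.ceil(total_bytes_read / BLOCK_BYTES)
    return blocks if strongly_consistent else blocks / 2


def query(table, partition_key):
    """Read only the items stored under one partition key."""
    read = [item for item in table if item["pk"] == partition_key]
    return read, consumed_rcus(sum(item["size"] for item in read))


def scan(table, predicate):
    """Read every item, then filter; discarded items are still paid for."""
    cost = consumed_rcus(sum(item["size"] for item in table))
    return [item for item in table if predicate(item)], cost


# Hypothetical table: 1,000 items of 1 KB spread over 100 customers.
table = [
    {"pk": f"customer#{n % 100}", "sk": f"order#{n:04d}", "size": 1024}
    for n in range(1000)
]

query_items, query_cost = query(table, "customer#7")
scan_items, scan_cost = scan(table, lambda item: item["pk"] == "customer#7")

print(len(query_items), query_cost)
print(len(scan_items), scan_cost)
```

Both calls return the same 10 items, but in this model the query consumes 1.5 RCUs while the scan consumes 125, because the scan pays for all 1,000 items before the filter is applied.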

For more details, see the documentation about Query and Scan in DynamoDB.