Unlocking Nested Data in Dremel: Repetition & Definition Levels

Shambhavi Shandilya
7 min readJun 12, 2024

--

The explosion of big data in the early 2000s presented a significant challenge for existing data processing solutions. Traditional relational databases struggled with the sheer volume and velocity of data, and early distributed processing frameworks had limitations. Here’s a timeline showcasing the evolution and limitations that led to Dremel’s introduction:

⌛ Big Data Timeline

Early 2000s — Rise of Distributed Processing Frameworks

  • Hadoop (2006): Emerged as a pioneering open-source framework for distributed storage (HDFS) and processing (MapReduce) of large datasets across clusters of commodity hardware. While powerful, MapReduce’s batch-oriented processing and reliance on disk I/O resulted in slow turnaround times for interactive queries.

Late 2000s — Addressing Hadoop’s Limitations

  • Pig (2006): Introduced a high-level language (Pig Latin) for describing data flows on top of Hadoop, making it easier to write MapReduce jobs. However, Pig still relied on MapReduce’s underlying batch-processing model.
  • Hive (2008): Built on top of Hadoop, providing a SQL-like interface for querying data stored in HDFS. While offering more familiar syntax, Hive still suffered from similar performance limitations as MapReduce for complex queries.

Dremel (2007)

Despite the advancements of Hadoop, Pig, and Hive, a critical gap remained: the ability to perform interactive analytics on massive datasets. Here’s how Dremel addressed these limitations:

  • Columnar Data Layout: Unlike traditional row-based storage, Dremel stores data in columns rather than entire rows. This allows for faster retrieval of specific data elements needed for a query, leading to improved performance compared to Hive’s row-based approach.
  • Multi-Level Execution Trees: Dremel broke down complex queries into smaller, independent subqueries. These subqueries were then executed in parallel on different cluster nodes, significantly reducing overall processing time compared to the sequential nature of MapReduce jobs.
  • In-Memory Processing: Dremel leverages in-memory processing whenever possible, further accelerating query execution compared to disk-based processing in Hadoop and Hive.

In this blog, I will try to cover the importance of Columnar Data Layout in Dremel and how Dremel goes beyond traditional columnar storage by introducing robust support for nested data structures.

📍 Importance of Columnar Data Layout

Traditional databases retrieve entire rows from disk even if only a few specific data points are needed. Imagine searching a library for a particular author’s books. In a row-based system, you’d have to scan the entire bookshelf (all rows) to find the author’s section (specific column). With columnar storage, Dremel acts like a librarian who knows exactly where author information is located. It only needs to access the “Author” column, significantly reducing disk reads and speeding up the query.

Difference between row-based and column-based storage | Credits: heavy.ai

While Dremel popularized columnar storage for large-scale data analysis, it wasn’t the first. Several database systems before Dremel, e.g. C-Store, Vertica, and InfiniDB explored and utilized column storage for various reasons.

Dremel, however, allows us to store nested data within columns, using techniques like repetition and definition levels to represent complex relationships between data points efficiently.

📍 Representation of Nested Data Type

Dremel utilizes repetition and definition levels to encode the hierarchical structure of nested data within columns. These levels are stored in separate integer columns alongside the actual data values.

  • Repetition Level: This level indicates how many times a value is repeated within the same nesting depth. A value with a repetition level of 0 signifies the first occurrence at its level. Higher values (1, 2, etc.) represent subsequent repetitions within the same parent group.
  • Definition Level: This level indicates the presence of optional or repeated fields in the path leading to the value. A definition level of 1 signifies the field is present at its level. A level of 2 (or higher, depending on the schema complexity) might indicate the field is implicitly defined by the existence of its parent group, even if it doesn’t have a separate value.

We shall understand these values using a proto example. Here is the schema of the proto we are using.

message UploadCartInformation {
required string cartId = 1;
repeated Item items = 2;
optional string discountCode = 3;
}
message Item {
required string id = 1;
required string name = 2;
repeated string tags = 3;
repeated string images = 4;
repeated ItemCustomization customizations = 5;
}
message ItemCustomization {
required string note = 1;
optional string countryCode = 2;
}
message Image {
required string url = 1;
optional string altText = 2;
}

Now, the two example UploadCartInformation values we will be using:

To understand repetition and definition levels, we shall perform the following steps:

  1. Write down the path of the required attribute.
  2. Ignore the required fields in the path and mark the repeated fields with a sequence index (starting from 1).
  3. Add another sequencing for non-required fields in the path (repeated and optional fields), starting from 1.
  4. The repetition level is the sequence index value of the most recent field that has been repeated.
  5. The definition level is the sequence index value of the most recent non-required field that has a value and has been defined.

✅ Example 1: items.name

r and d values for items.name

We have 3 field values of {items.name} in the examples: T-Shirt (cart1), Shoes, and T-Shirt (cart2). We shall first mark the sequence values for the path attributes. items is marked for both repeated and definition sequences as it is a repeated field. name is not marked as it is a required field.

T-Shirt (cart1): When we first encounter a value for {items.name}, its repetition level will be 0. Since the item attribute has been defined in the path, it’s definition level is 1 (definition sequence value of item).

Shoes (cart2): Similarly, in cart2, when we first encounter a value for {items.name}, its repetition level will be 0. Since the item attribute has been defined in the path, it’s definition level is 1 (definition sequence value of item).

T-Shirt (cart2): Here, we find that to reach “Shoe”, item attribute has been repeated, therefore its repetition level will be 1 (repetition sequence value of item). Since the item attribute has been defined in the path, it’s definition level is 1 (definition sequence value of item).

✅ Example 2: items.images.url

We have 5 field values of {items.name} in the examples: image1, image2 (cart1), image3, image4, and image5 (cart2). We shall first mark the sequence values for the path attributes. items and images are marked for both repeated and definition sequences as it is a repeated field. url is not marked as it is a required field.

The definition levels of all the fields are 2 since the images attribute has been defined in the path (definition sequence value of images)

image1 (cart1): When we first encounter a value for {items.images.url}, its repetition level will be 0.

image2 (cart1): In this case, images have been repeated therefore the repetition level is 2 (repetition index of images).

image3 (cart2): When we first encounter a value for {items.images.url}, its repetition level will be 0.

image4 (cart2): In this case, item field has been repeated, (images has not yet been repeated), therefore the repetition level is 1 (repetition index of images).

image5 (cart2): In this case, images have been repeated therefore the repetition level is 2 (repetition index of images).

✅ Example 3: items.images.altText

We have 2 field values of {items.images.altText} in the examples: Front view (cart1), and Side view (cart2). We shall first mark the sequence values for the path attributes. items and images are marked for both repeated and definition sequences as it is a repeated field. altText is marked for Definition Sequence as it is an optional field.

The definition levels of all the fields are 3since the altText attribute has been defined in the path (definition sequence value of altText)

Front View (cart1): When we first encounter a value for {items.images.altText}, its repetition level will be 0.

Side View (cart1): When we first encounter a value for {items.images.altText}, its repetition level will be 0.

📍 Conclusion

These levels are the heart of Dremel’s ability to represent and manipulate nested data efficiently. Repetition levels identify how often a value is repeated within its nesting depth, while definition levels indicate the presence of optional or repeated fields. Understanding these levels empowers users to interpret the structure and relationships within their data.

While Dremel is no longer actively developed, its successor, BigQuery, builds upon these principles. The underlying concepts of columnar storage and nested data representation with definition and repetition levels remain at the core of BigQuery’s efficient data processing capabilities.

References:

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf

--

--