Google Spanner: Whitepaper Review

Shambhavi Shandilya
6 min read · Dec 16, 2023


With the volume of data growing beyond what a single machine can be scaled up to handle, storage systems are bound to spread data across multiple instances. This gave rise to a new paradigm of data storage known as “distributed databases”. A distributed database is a database that runs and stores data across multiple computers, as opposed to doing everything on a single machine.

While a distributed database addresses scalability, availability and geographic replication, it is challenging to build one that is also fault-tolerant and consistent. In this article, I shall provide a basic understanding of Spanner, a scalable, globally distributed database designed, built and deployed by Google. Google Cloud Spanner is designed to seamlessly combine the benefits of traditional relational databases with the advantages of cloud-native NoSQL databases.

The distinguishing features provided by Spanner are:

  1. Schematised semi-relational tables
  2. Data versioning
  3. General-purpose transactions with a SQL-based query language

Vertically scalable databases (Oracle, MySQL, PostgreSQL) generally provide strong consistency, whereas horizontally scalable databases (HBase, Cassandra, etc.) typically offer only eventual consistency. Despite being a distributed database, Spanner guarantees both externally consistent reads and writes and globally consistent reads across the database at a timestamp.

In this article, I have discussed the basic implementation of Spanner and the mechanisms (Paxos and TrueTime) that make it an externally consistent database.

Credits: Priyanka Vergadia, Staff Developer Advocate, Google Cloud

Data Model

A tablet is the basic unit of data handled by Spanner. It implements a bag of mappings of the form (key, timestamp) → value. Since Spanner assigns timestamps to data, it is more like a multi-version database than a key-value store. On top of these key-value mappings, Spanner supports another layer of abstraction: a set of contiguous keys that share a common prefix, known as a directory.

A directory is the unit of data movement in Spanner. Whenever data is moved between Paxos groups (discussed later), it is moved directory by directory. A directory is also the smallest unit whose geographic-replication properties can be specified by an application. If a directory grows too large, Spanner shards it into multiple fragments.

Spanner supports data models based on schematized semi-relational tables. Every row in a table has a name, which is formed from the table’s primary-key values. The table as a whole defines a mapping from the primary-key columns to the non-primary-key columns.
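To make this layering concrete, here is a minimal Python sketch (my own illustration, not Spanner code) of a tablet as a multi-version mapping of (key, timestamp) → value, with a directory modelled as the set of contiguous keys sharing a common prefix. The key encoding and names are assumptions made for the example.

```python
from collections import defaultdict

class Tablet:
    """Toy multi-version store: a bag of (key, timestamp) -> value mappings."""

    def __init__(self):
        self._versions = defaultdict(dict)  # key -> {timestamp: value}

    def write(self, key, timestamp, value):
        self._versions[key][timestamp] = value

    def read(self, key, timestamp):
        """Return the latest value of `key` written at or before `timestamp`."""
        versions = self._versions.get(key, {})
        visible = [ts for ts in versions if ts <= timestamp]
        return versions[max(visible)] if visible else None

    def directory(self, prefix):
        """Keys sharing a common prefix form a directory (the unit of movement)."""
        return sorted(k for k in self._versions if k.startswith(prefix))


# Rows of a semi-relational table are named by their primary key(s); here the
# key encodes table name + primary key, so all rows of one user share a prefix.
t = Tablet()
t.write("users/alice/email", timestamp=10, value="alice@example.com")
t.write("users/alice/email", timestamp=20, value="alice@gmail.com")
print(t.read("users/alice/email", timestamp=15))   # alice@example.com
print(t.directory("users/alice/"))                 # all keys for user alice
```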

Architecture

The architecture of Spanner is divided into a set of administrative Zones. Zones can be added or removed from the system as new data centres are brought into service and old ones are turned down. The figure below gives a general visualisation.

Spanner Server Architecture. Source: https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf

The universemaster acts as a console which helps in debugging by displaying the status information of all the zones. The placement driver, as the name suggests, handles the movement of data across zones.

Each zone has a zonemaster that assigns data to between one hundred and several thousand spanservers. The spanservers serve data to the clients. A zone also contains location proxies, which help clients locate the spanserver holding their data.

Further, each spanserver is responsible for between 100 and 1,000 tablets.
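As a rough mental model of this hierarchy, the sketch below lays out the pieces described above; the names and counts used are purely illustrative, not part of Spanner itself.

```python
from dataclasses import dataclass, field

@dataclass
class Spanserver:
    name: str
    tablets: list = field(default_factory=list)      # roughly 100-1,000 tablets each

@dataclass
class Zone:
    name: str
    zonemaster: str                                   # assigns data to spanservers
    location_proxies: list                            # help clients find their data
    spanservers: list = field(default_factory=list)   # hundreds to thousands per zone

@dataclass
class Universe:
    universemaster: str                               # status console for all zones
    placement_driver: str                             # moves data across zones
    zones: list = field(default_factory=list)

universe = Universe(
    universemaster="console",
    placement_driver="placement-driver",
    zones=[Zone(name="zone-1", zonemaster="zm-1", location_proxies=["lp-1"],
                spanservers=[Spanserver(name="ss-1")])],
)
```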

Maintaining Consistency

External consistency is Spanner’s most important feature as a distributed database. To maintain consistency across the various replicas of a piece of data, Spanner runs the Paxos algorithm across those replicas.

I shall provide only a very basic understanding of the Paxos algorithm in this article; for further reading, please refer to the references attached. Paxos is a consensus algorithm in which a proposal (a change) is accepted or rejected by a majority of the nodes in the distributed system, thus keeping the system consistent. The system comprises three types of nodes:

  1. Proposers: Nodes that propose a change (Write, Relocate requests)
  2. Acceptors: Nodes that accept or reject a change
  3. Learners: Nodes that, after the proposal phase, learn the accepted change and apply it to define the new state of the system

A node can play all three roles during a complete Paxos run, as the simplified sketch below illustrates.
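The following is a heavily simplified, single-decree illustration of these roles (my own sketch, not Spanner’s Paxos implementation, and it ignores message loss and retries): a proposer asks a majority of acceptors to promise to ignore older proposals, then asks them to accept a value, which learners would then adopt.

```python
class Acceptor:
    """Accepts or rejects proposals, remembering the highest proposal seen."""

    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted_n = -1        # proposal number of the accepted value
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        """Phase 2: accept the value if no higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Proposer: drive one proposal through a majority of acceptors."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: gather promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None  # rejected; a real proposer would retry with a higher number

    # If any acceptor already accepted a value, we must propose that value.
    prev_n, prev_value = max(granted, key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value

    # Phase 2: ask acceptors to accept; learners would then learn this value.
    accepts = sum(a.accept(n, value) for a in acceptors)
    return value if accepts >= majority else None


acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="write x=42"))  # chosen by the majority
```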

Spanserver Software Stack. Source: https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf

To support replication, each spanserver implements a single Paxos state machine on top of each tablet. Each Paxos machine maintains its own metadata and log. A set of tablets that hold the same data (replicas) is collectively termed a Paxos group. Every Paxos group has a leader that proposes changes to the group.

Concurrency control is implemented with a lock table and a transaction manager at every replica leader. Whenever a transaction arrives at the replica leader, it is first checked to see whether it involves more than one Paxos group (which is ideally not the case). If multiple Paxos groups are involved, the transaction manager coordinates a two-phase commit.
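Here is a hedged sketch of how a coordinator could drive a two-phase commit across the leaders of the Paxos groups touched by a transaction. It is only an outline of the protocol shape: the class and method names are invented for the example, and real Spanner also involves locking and Paxos-replicated prepare/commit records.

```python
class GroupLeader:
    """Stand-in for a Paxos group leader participating in a transaction."""

    def __init__(self, name):
        self.name = name

    def prepare(self, txn):
        # Phase 1: acquire locks, log a prepare record, then vote yes/no.
        return True

    def commit(self, txn):
        # Phase 2: apply the transaction and release locks.
        print(f"{self.name}: commit {txn}")

    def abort(self, txn):
        print(f"{self.name}: abort {txn}")


def two_phase_commit(coordinator, participants, txn):
    """The coordinator is itself one of the participating group leaders."""
    leaders = [coordinator] + participants
    if all(leader.prepare(txn) for leader in leaders):
        for leader in leaders:
            leader.commit(txn)
        return "committed"
    for leader in leaders:
        leader.abort(txn)
    return "aborted"


print(two_phase_commit(GroupLeader("group-A"), [GroupLeader("group-B")], "txn-1"))
```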

TrueTime

In a distributed system, it is a challenge to maintain a global notion of time. Each machine has its own clock, and clocks drift out of synchronization over time. Providing each server with its own atomic clock cannot solve this problem, as atomic clocks are also prone to drift. Synchronizing the system against a single global clock is not feasible either, as RPC response times are uncertain.

Credits: Google Cloud

To handle these inconsistencies of time in a distributed system, Spanner implemented the TrueTime API. This implementation guarantees that a whole-database audit read at a timestamp t will see the effects of every transaction that has been committed as of t. The timestamps utilised by Spanner reflect the serialization order. TrueTime explicitly represents time as a TTinterval, which is an interval with bounded time uncertainty.
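The TrueTime API exposes this uncertainty directly: TT.now() returns an interval rather than a point, and TT.after(t)/TT.before(t) answer whether t has definitely passed or definitely not arrived. The toy mimic below illustrates that idea; the fixed EPSILON value and the use of the local wall clock are assumptions for the sketch, not how Spanner actually derives its bounds.

```python
import time
from dataclasses import dataclass

EPSILON = 0.007  # illustrative uncertainty bound in seconds, not a real Spanner value

@dataclass
class TTinterval:
    earliest: float
    latest: float

class TrueTime:
    """Toy mimic of the TrueTime API: time is an interval, never a point."""

    @staticmethod
    def now():
        t = time.time()
        return TTinterval(earliest=t - EPSILON, latest=t + EPSILON)

    @staticmethod
    def after(t):
        """True only if t has definitely passed on every correct clock."""
        return TrueTime.now().earliest > t

    @staticmethod
    def before(t):
        """True only if t has definitely not arrived yet."""
        return TrueTime.now().latest < t


interval = TrueTime.now()
print(interval.earliest, interval.latest)
print(TrueTime.after(interval.latest))   # False until the uncertainty has passed
```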

The clock references used by TrueTime are GPS and atomic clocks. Using two kinds of references minimizes failures, as they suffer from different failure modes. TrueTime is implemented by a set of time master machines per datacenter and a time slave daemon per machine.

The majority of time masters have GPS receivers from which they deduce the time; the remaining masters use atomic clocks. All masters’ time references are regularly compared against each other, and a master that diverges too far is evicted from the system. Every daemon polls a variety of masters and derives its TrueTime interval from their readings; these intervals are what Spanner uses when deciding on commit timestamps.

Conclusion

As the CAP theorem states, a distributed system cannot simultaneously guarantee Consistency, Availability and Partition tolerance. As Google describes it, in the case of a partition Spanner chooses C and forfeits A: it is technically a CP system.

However, it is Google’s engineering feat that it is able to minimize the probability of a network partition ‘P’ (to 1 failure in 10⁵ or less) with the help of its ultra-fast, redundant, global private network across Google datacenters.

To sum up, Spanner uses two-phase commit to achieve serializability, but it uses TrueTime for external consistency, consistent reads without locking, and consistent snapshots. Together, these make Spanner an externally consistent distributed database that scales far beyond what traditional scale-up database systems can offer.

References:

  1. https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
  2. https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/45855.pdf
  3. https://youtu.be/d7nAGI_NZPk?si=WBYqTe2H5js3RJj0
  4. https://youtu.be/JwPs6BgPvzo?si=mMAeFXkEVYGYKwfE
