CockroachDB is a distributed SQL database. The primary design goals are scalability, strong consistency and survivability (hence the name). CockroachDB aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. CockroachDB nodes are symmetric; a design goal is homogeneous deployment (one binary) with minimal configuration and no required external dependencies.

The entry point for database clients is the SQL interface. Every node in a CockroachDB cluster can act as a client SQL gateway. A SQL gateway transforms and executes client SQL statements to key-value (KV) operations, which the gateway distributes across the cluster as necessary and returns results to the client. CockroachDB implements a single, monolithic sorted map from key to value where both keys and values are byte strings.

The KV map is logically composed of smaller segments of the keyspace called ranges. Each range is backed by data stored in a local KV storage engine (we use RocksDB, a variant of LevelDB). Range data is replicated to a configurable number of additional CockroachDB nodes. Ranges are merged and split to maintain a target size, by default 64M. The relatively small size facilitates quick repair and rebalancing to address node failures, new capacity and even read/write load. However, the size must be balanced against the pressure on the system from having more ranges to manage.

CockroachDB achieves horizontally scalability:

  • adding more nodes increases the capacity of the cluster by the amount of storage on each node (divided by a configurable replication factor), theoretically up to 4 exabytes (4E) of logical data;
  • client queries can be sent to any node in the cluster, and queries can operate independently (w/o conflicts), meaning that overall throughput is a linear factor of the number of nodes in the cluster.
  • queries are distributed (ref: distributed SQL) so that the overall throughput of single queries can be increased by adding more nodes.

CockroachDB achieves strong consistency:

  • uses a distributed consensus protocol for synchronous replication of data in each key value range. We’ve chosen to use the Raft consensus algorithm; all consensus state is stored in RocksDB.
  • single or batched mutations to a single range are mediated via the range’s Raft instance. Raft guarantees ACID semantics.
  • logical mutations which affect multiple ranges employ distributed transactions for ACID semantics. CockroachDB uses an efficient non-locking distributed commit protocol.

CockroachDB achieves survivability:

  • range replicas can be co-located within a single datacenter for low latency replication and survive disk or machine failures. They can be distributed across racks to survive some network switch failures.
  • range replicas can be located in datacenters spanning increasingly disparate geographies to survive ever-greater failure scenarios from datacenter power or networking loss to regional power failures (e.g. { US-East-1a, US-East-1b, US-East-1c }, { US-East, US-West, Japan }, { Ireland, US-East, US-West}, { Ireland, US-East, US-West, Japan, Australia }).

CockroachDB provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes–both from a historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. CockroachDB implements a limited form of linearizability , providing ordering for any observer or chain of observers.

Similar to Spanner directories, CockroachDB allows configuration of arbitrary zones of data. This allows replication factor, storage device type, and/or datacenter location to be chosen to optimize performance and/or availability. Unlike Spanner, zones are monolithic and don’t allow movement of fine grained data on the level of entity groups.

Reference:

  1. CockroachDB设计文档(上)
  2. CockroachDB设计文档(中)
  3. CockroachDB设计文档(下)