This article contains my notes on Google Cloud Bigtable and is part of my series on Professional Cloud Architect certification.
I plan to take the Professional Cloud Database Engineer certification, so I am starting my training for the PCA with a database deep dive.
I do not work with Bigtable often, but when I do I need to get up to speed quickly. The typical GCP budget for a Bigtable deployment starts at $5K per month, so investing a couple of hours in a refresher is a good idea. These notes minimize my time searching for resources.
Note: this article is a work in progress while I train for the PCA and Database exams.
What is Cloud Bigtable
Cloud Bigtable is Google’s fully managed, scalable NoSQL database service. If you are familiar with Hadoop, note that Cloud Bigtable is HBase compatible.
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and millions of columns, enabling you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
In Bigtable, each row represents a single entity (such as an individual user or sensor) and is labeled with a unique row key. Each column stores attribute values for each row, and column families can be used to organize related columns. At the intersection of a row and column, there can be multiple cells, with each cell representing a different version of the data at a given timestamp.
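The data layout described above can be pictured as nested maps: row key → column family → column qualifier → timestamp → value. The sketch below is only a mental model in plain Python, not the Bigtable client API; the helper names are mine.

```python
# Illustrative model of Bigtable's data layout (NOT the client API):
# table -> row key -> column family -> column qualifier -> timestamp -> value
from collections import defaultdict

def make_table():
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def write_cell(table, row_key, family, qualifier, timestamp, value):
    # Each write at a new timestamp creates a new cell (version).
    table[row_key][family][qualifier][timestamp] = value

def read_latest(table, row_key, family, qualifier):
    cells = table[row_key][family][qualifier]
    return cells[max(cells)] if cells else None

table = make_table()
# Two versions of the same cell, distinguished by timestamp.
write_cell(table, "sensor#42", "stats", "temperature", 1_000_000, "20.5")
write_cell(table, "sensor#42", "stats", "temperature", 2_000_000, "21.0")
print(read_latest(table, "sensor#42", "stats", "temperature"))  # -> 21.0
```

Because rows only store the cells actually written, this model is naturally sparse, which mirrors the "sparsely populated table" description.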
Notes
- There is an inconsistency in the documentation. In some places, it says thousands of columns, in other places millions of columns.
- Even though a table can have millions of columns, a row should not
- A row key must be 4 KB or less
- Does not support joins
- Transactions are supported only within a single row
- Each table has only one index, the row key
- The intersection of a row and column can contain multiple timestamped cells
- Tables are sparse
History
- Bigtable development began in 2004
- Bigtable was released in February 2005
- Cloud Bigtable was released on May 6, 2015
- Google’s BigTable – Feb 2005
- Bigtable: A Distributed Storage System for Structured Data – August 2006
Google Research in Data Technologies
There is a massive amount of knowledge in these research papers. If your goal is to better understand how data is managed and processed at Google, read a few of these. Colossus is probably the most important.
- 2002 – GFS – The Google File System
- 2004 – MapReduce – Simplified Data Processing on Large Clusters
- 2006 – BigTable – Distributed Storage System for Structured Data
- 2008 – Dremel – Interactive Analysis of Web-Scale Datasets
- 2009 – Colossus – Google’s scalable storage system – successor to the Google File System (GFS)
- 2010 – Flume – Easy, Efficient Data-Parallel Pipelines
- 2011 – Megastore – Providing Scalable, Highly Available Storage for Interactive Services
- 2012 – Spanner – Spanner, TrueTime and the CAP Theorem
- 2012 – MillWheel – Fault-Tolerant Stream Processing at Internet Scale
- 2013 – PubSub – Thialfi: A Client Notification Service for Internet-Scale Applications
- 2014 – F1 – Distributed SQL Database That Scales – Google AdWords
- 2016 – Shasta – Interactive Reporting at Scale
- 2016 – Slicer – Auto-Sharding for Datacenter Applications
- 2019 – Zanzibar – Google’s Consistent, Global Authorization System
- 2019 – WarpFlow – Exploring Petabytes of Space-Time Data
- 2019 – SageDB – A Learned Database System
- 2021 – Napa – Powering Scalable Data Warehousing with Robust Query Performance at Google
- 2021 – Glean – Structured Extractions from Templatic Documents
- 2023 – Firestore – The NoSQL Serverless Database for the Application Developer
Products that use Cloud Bigtable
Bigtable is used internally at Google for a number of products.
- Gmail
- Google Analytics
- Google Blogger
- Google Books
- Google Code
- Google Earth
- Google Maps
- YouTube
Apache HBase and Apache Cassandra are two of the best-known open-source projects modeled after Bigtable.
As of January 2022, Bigtable manages over 10 exabytes of data and serves more than 5 billion requests per second.
Key Features
- Fully managed NoSQL database
- Horizontally scalable
- Optimized for high reads/writes per second
- Supports millions of requests per second with single-digit millisecond latency
- Low latency
- Highly scalable database
- Eventually consistent (replication between clusters is eventually consistent)
- SLO/SLA
- 99.9% – Cloud Bigtable – Zonal instance (single cluster)
- 99.999% – Replicated instance (2 or more clusters) with the multi-cluster routing policy (3 or more regions)
- Millions of columns
- Column Families
- Column Cells
- 256 MB per row
- Integration with Big Data Tools
- Apache HBase API Standard
- Apache Beam
- Apache Hadoop
- Apache Spark
- Integration with Google Products
- BigQuery
- Cloud Dataflow
- Cloud Dataproc
Architecture
- Horizontally scalable
- Throughput can be adjusted by adding or removing nodes
- Each node provides up to 10,000 queries per second
- Nodes do not store the data themselves; data is stored separately on Colossus, and each node manages a set of tablets
- No downtime while changing nodes
- Components
- Frontend Server Pool
- Bigtable Cluster
- Nodes
- Scalable Storage Backend
- Data is sharded into multiple “tablets”
- A Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Each tablet is associated with a node, and operations on these rows are performed on the node. To optimize performance, tablets are split or moved to a different node depending on access patterns. Based on user access patterns — read, write, and scan operations — tablets are rebalanced across the nodes.
- SSTable – Sorted String Table
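The tablet behavior described above can be sketched in plain Python. This is a simplified illustration of the idea, not Bigtable’s actual internals: the helper names are hypothetical, and the round-robin assignment stands in for Bigtable’s load-based rebalancing.

```python
def shard_into_tablets(row_keys, rows_per_tablet):
    """Group sorted row keys into contiguous blocks ("tablets")."""
    keys = sorted(row_keys)
    return [keys[i:i + rows_per_tablet] for i in range(0, len(keys), rows_per_tablet)]

def assign_to_nodes(tablets, num_nodes):
    """Round-robin assignment; real Bigtable rebalances based on access patterns."""
    assignment = {n: [] for n in range(num_nodes)}
    for i, tablet in enumerate(tablets):
        assignment[i % num_nodes].append(tablet)
    return assignment

keys = [f"user#{i:04d}" for i in range(10)]
tablets = shard_into_tablets(keys, 4)   # 3 tablets: 4 + 4 + 2 rows
nodes = assign_to_nodes(tablets, 2)
print(len(tablets), sorted(nodes))      # -> 3 [0, 1]
```

The key property to notice is that each tablet covers a contiguous, sorted range of row keys, which is why row-key design directly controls how load spreads across nodes.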
Use cases
- Key usage: Lots of data over lots of time (days)
- Time-series data
- Marketing data
- Low latency serving
- Financial data
- IoT data
- Graph data
Best practices
- Understand Cloud Bigtable performance – estimating throughput for Cloud Bigtable, how to plan Cloud Bigtable capacity by looking at throughput and storage use, how enabling replication affects read and write throughput differently, and how Cloud Bigtable optimizes data over time.
- Cloud Bigtable schema design – guidance on designing Cloud Bigtable schema, including concepts of key/value store, designing row keys based on planned read requests, handling columns and rows, and special use cases.
- Cloud Bigtable replication overview – how to replicate Cloud Bigtable across multiple zones or regions, understand performance implications of replication, and how Cloud Bigtable resolves conflicts and handles failovers.
- About Bigtable backups – how to save a copy of a table’s schema and data with Bigtable Backups, which can help you recover from application-level data corruption or from operator errors, such as accidentally deleting a table.
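The schema-design guidance above largely comes down to choosing good row keys. Below is a sketch of one common time-series pattern: prefix the key with an identifier so writes spread across tablets, and append a reversed timestamp so the newest data sorts first. The helper and key layout are illustrative, not an official API.

```python
import sys

def timeseries_row_key(device_id, timestamp_ms):
    """Build a row key of the form '<device>#<reversed timestamp>'.

    Leading with the device ID avoids hotspotting a single tablet;
    the reversed timestamp makes the newest cells sort first."""
    reversed_ts = sys.maxsize - timestamp_ms
    return f"{device_id}#{reversed_ts:019d}"

k1 = timeseries_row_key("sensor-7", 1_700_000_000_000)
k2 = timeseries_row_key("sensor-7", 1_700_000_001_000)
# The later event sorts lexicographically earlier, so a scan
# starting at "sensor-7#" returns the most recent data first.
print(k2 < k1)  # -> True
```

Starting a row key with a raw timestamp is the classic anti-pattern here: all writes land on the tablet holding the newest range, creating a hotspot.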
Cloud Bigtable pricing
When estimating the pricing for Bigtable, review the following major items:
- The type of Bigtable instance and the total number of nodes in your instance’s clusters
- The amount of storage that your tables use
- The amount of network bandwidth that you use
- Data Access audit log costs, if enabled. This item is often overlooked.
- Backup storage
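A back-of-the-envelope sketch of how these items combine into a monthly estimate. All rates below are assumed placeholders, not current GCP pricing; check the official Bigtable pricing page for real numbers.

```python
# Rough monthly cost sketch. Rates are illustrative placeholders,
# NOT current GCP pricing -- always check the Bigtable pricing page.
NODE_MONTH = 650.00       # assumed per-node monthly cost
SSD_GB_MONTH = 0.17       # hypothetical SSD storage rate, $/GB/month
BACKUP_GB_MONTH = 0.026   # hypothetical backup storage rate, $/GB/month

def estimate_monthly_cost(nodes, ssd_gb, backup_gb, network_egress=0.0):
    """Sum the major Bigtable cost drivers: nodes, storage, backups, network."""
    return (nodes * NODE_MONTH
            + ssd_gb * SSD_GB_MONTH
            + backup_gb * BACKUP_GB_MONTH
            + network_egress)

print(round(estimate_monthly_cost(nodes=3, ssd_gb=1000, backup_gb=500), 2))
```

Note that Data Access audit log costs (the often-overlooked item above) would show up as an additional Cloud Logging charge, not in this formula.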
Bigtable IAM
These notes on IAM are just tidbits that I want to remember. Refer to this guide for more details.
- You can configure access control at the following levels:
- project
- instance
- table
- IAM Permission Categories
- App profile permissions
- Backups permissions
- Cluster permissions
- Hot tablets permissions
- Instance permissions
- Key visualizer permissions
- Location permissions
- Table permissions
- Predefined IAM Roles
- Bigtable Administrator – roles/bigtable.admin
- Bigtable Reader – roles/bigtable.reader
- Bigtable User – roles/bigtable.user
- Bigtable Viewer – roles/bigtable.viewer
- In Bigtable, you cannot grant access to the following types of principals:
- allAuthenticatedUsers
- allUsers
- Instance-level IAM management
- gcloud bigtable instances set-iam-policy INSTANCE_ID POLICY_FILE
- Table-level IAM management
- gcloud bigtable instances tables set-iam-policy TABLE_ID --instance=INSTANCE_ID POLICY_FILE
- IAM conditions
- Date/time attributes
- Use to set temporary (expiring), scheduled, or limited-duration access to Bigtable resources. For example, you can allow a user to access a table until a specified date.
- Resource attributes
- Use to configure conditional access based on a resource name, resource type, or resource service attributes. In Bigtable, you can use attributes of instances, clusters, and tables to configure conditional access. For example, you can allow a user to manage tables only on tables that begin with a specific prefix, or you can allow a user to access only a specific table.
- Example IAM Policy File – note I have not verified this yet on a real Bigtable instance.
```json
{
  "bindings": [
    {
      "role": "roles/bigtable.user",
      "members": [
        "user:john@example.com"
      ],
      "condition": {
        "title": "expirable access",
        "description": "Does not grant access after Sep 2023",
        "expression": "request.time < timestamp('2023-10-01T00:00:00.000Z')"
      }
    }
  ],
  "etag": "BwWWja0YfJA=",
  "version": 3
}
```
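The same kind of policy file can be generated programmatically. A quick sketch (the member, role, and expiry values are placeholders, and the helper name is mine):

```python
import json

def expiring_binding(role, member, expire_iso):
    """Build an IAM binding with a time-bound condition (IAM Conditions)."""
    return {
        "role": role,
        "members": [member],
        "condition": {
            "title": "expirable access",
            "description": f"Does not grant access after {expire_iso}",
            "expression": f"request.time < timestamp('{expire_iso}')",
        },
    }

policy = {
    "bindings": [expiring_binding("roles/bigtable.user",
                                  "user:john@example.com",
                                  "2023-10-01T00:00:00.000Z")],
    "version": 3,  # version 3 is required for conditional bindings
}
print(json.dumps(policy, indent=2))
```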
Terraform
Documentation Resources
Interesting Google Articles
- Cloud Bigtable launches Autoscaling plus new features for optimizing costs and improved manageability
- Processing billions of events in real time at Twitter
- What’s new in Bigtable observability
Interesting Third-party Articles
- How does BigTable work/Best practices
- CSE 444: Database Internals – Lots of Bigtable internal information
- Google Bigtable: Distributed Structured Datastore (Part 1)
- Google Bigtable: Distributed Structured Datastore (Part 2)
- Google Bigtable: Bloom filters, commit logs. (Part 3)
- SSTable and Log Structured Storage: LevelDB
- Andrew Dawson: What is Bigtable?
Links for Developers
- All Bigtable code samples
- All Bigtable samples – treelist view
- Google BigTable Recipes – Interesting site with Bigtable code snippets in JavaScript and Python
- PHP
Practice Resources
Cloud Bigtable is fairly expensive to practice with personally. A single-node instance costs about $650 per month. For that reason, I recommend doing your practice in Google Qwiklabs so that the cost is zero beyond your membership fees.
- Qwiklabs – Manage Bigtable on Google Cloud – Quest
- GSP1053: Designing and Querying Bigtable Schemas
- GSP1054: Creating and Populating a Bigtable Instance
- GSP1055: Streaming Data to Bigtable
- GSP1056: Managing Bigtable Health and Performance
- Qwiklab – Additional Labs
- GSP099: Bigtable: Qwik Start – Command Line
- GSP1038: Introduction to Cloud Bigtable (Java)
- Cloud Bigtable Emulator
- gcloud components update
- gcloud components install beta
- gcloud components install bigtable
- gcloud beta emulators bigtable start
- gcloud beta emulators bigtable start --host-port=[HOST]:[PORT]
- docker run -p 127.0.0.1:8086:8086 --rm -ti google/cloud-sdk gcloud beta emulators bigtable start --host-port=0.0.0.0:8086
YouTube Videos
- Introduction
- Google: What is Cloud Bigtable? – Sep 15, 2021 – 5m 36s – good introduction
- Google: What can you do with Bigtable? – Sep 29, 2021 – 4m 30s – good introduction to use cases
- Architecture and Performance
- Google: Cloud Bigtable performance 101 – Nov 29, 2018 – 3m 12s – good introduction to architecture
- Google: Bigtable and Geolocation Performance – Jan 3, 2019 – 2m 39s – Bigtable with Pub/Sub and Functions
- Monitoring
- Google: Using Cloud Bigtable Monitoring UI – Dec 13, 2018 – 2m 58s – good introduction to monitoring
- Google: Key Visualizer for Cloud Bigtable – Jul 31, 2018 – 6m 13s – a must watch for the heat map explanation
- Longer Session Videos
- Google: Building a Global Data Presence with Cloud Bigtable (Cloud Next ’19) – 46m 36s – a must watch
- Google: Visualizing Cloud Bigtable Access Patterns at Twitter for Optimizing Analytics (Cloud Next ’18) – 35m 19s
- Google: Music Recommendations at Scale With Cloud Bigtable (Cloud Next ’19) – Apr 10, 2019 – 33m 22s – very good video
- Qwiklabs Courses
- Google: High-throughput streaming with Cloud Bigtable – Dec 15, 2021 – 13m 18s – very good training video
- Google: Optimizing Cloud Bigtable performance – Dec 15, 2021 – 7m 11s
Cloud Bigtable CLI
The cbt CLI is a command-line interface for performing several different operations on Cloud Bigtable.
- Written in Go using the Go client library for Cloud Bigtable
- Source code for the cbt CLI is available in the GoogleCloudPlatform/google-cloud-go GitHub repository
Installation
On Windows, installation requires “Run as Administrator”.
- gcloud components update
- gcloud components install cbt
- Create a .cbtrc file
- cbt listinstances
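For reference, a minimal `.cbtrc` lives in your home directory and sets the default project and instance so you do not have to pass them on every command. The IDs below are placeholders:

```
project = my-project-id
instance = my-instance-id
```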
Questions / Practice
- Backup/Restore
- How to create a backup
- How to export a backup
- How to restore from a backup
Note: this article is a work in progress while I train for the PCA and Database exams.
Summary
Of the Google Cloud managed database services, Bigtable is the most impressive. The management and monitoring GUIs are the best I have experienced across the big three cloud vendors. You can deploy a cluster in one region in 5 minutes, fill it with data and then add replicas around the world in minutes. No downtime, no customer interruption, no procedure to copy data, seed replication, etc. It does everything for you. Simply amazing.
I wish Google Cloud would consider creating a small single-node Bigtable instance that costs $100/month so that small companies can implement Bigtable without thinking about costs. Small companies do not need the performance that even a single node provides. As these companies grow they can switch to a standard instance type. Perhaps a limited-cost standard node at a reduced price for 12 months. Trust me, they won’t give up Bigtable when the promotion expires.
I design software for enterprise-class systems and data centers. My background is 30+ years in storage (SCSI, FC, iSCSI, disk arrays, imaging) and virtualization, and 20+ years in identity, security, and forensics.
For the past 14+ years, I have been working in the cloud (AWS, Azure, Google, Alibaba, IBM, Oracle) designing hybrid and multi-cloud software solutions. I am an MVP/GDE with several.