This article contains my notes on Google Cloud Bigtable and is part of my series on Professional Cloud Architect certification.
I plan to take the Professional Cloud Database Engineer certification, so I am starting my training for the PCA with a database deep dive.
I do not work with Bigtable often, but when I do I need to get up to speed quickly. The typical GCP budget for a Bigtable deployment starts at $5K per month, so investing a couple of hours in a refresher is a good idea. These notes minimize my time searching for resources.
Note: this article is a work in progress while I train for the PCA and Database exams.
What is Cloud Bigtable
Cloud Bigtable is Google’s fully managed, scalable NoSQL database service. If you are familiar with Hadoop, note that Cloud Bigtable is HBase compatible.
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and millions of columns, enabling you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
In Bigtable, each row represents a single entity (such as an individual user or sensor) and is labeled with a unique row key. Each column stores attribute values for each row, and column families can be used to organize related columns. At the intersection of a row and column, there can be multiple cells, with each cell representing a different version of the data at a given timestamp.
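The data layout described above can be pictured as nested maps: row key → column family → column qualifier → timestamp → value. The sketch below is only a mental model in plain Python, not the Bigtable client API; the helper names are mine.

```python
# Illustrative model of Bigtable's data layout (NOT the client API):
# table -> row key -> column family -> column qualifier -> timestamp -> value
from collections import defaultdict

def make_table():
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def write_cell(table, row_key, family, qualifier, timestamp, value):
    # Each write at a new timestamp creates a new cell (version).
    table[row_key][family][qualifier][timestamp] = value

def read_latest(table, row_key, family, qualifier):
    cells = table[row_key][family][qualifier]
    return cells[max(cells)] if cells else None

table = make_table()
# Two versions of the same cell, distinguished by timestamp.
write_cell(table, "sensor#42", "stats", "temperature", 1_000_000, "20.5")
write_cell(table, "sensor#42", "stats", "temperature", 2_000_000, "21.0")
print(read_latest(table, "sensor#42", "stats", "temperature"))  # -> 21.0
```

Because rows only store the cells actually written, this model is naturally sparse, which mirrors the "sparsely populated table" description.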
Notes
- There is an inconsistency in the documentation. In some places, it says thousands of columns, in other places millions of columns.
- Even though a table can have millions of columns, a row should not
- A row key must be 4 KB or less
- Does not support joins
- Transactions are supported only within a single row
- Each table has only one index, the row key
- The intersection of a row and column can contain multiple timestamped cells
- Tables are sparse
History
- Bigtable development began in 2004
- Bigtable was released in February 2005
- Cloud Bigtable was released on May 6, 2015
- Google’s BigTable – Feb 2005
- Bigtable: A Distributed Storage System for Structured Data – August 2006
Google Research in Data Technologies
There is a massive amount of knowledge in these research papers. If your goal is to better understand how data is managed and processed at Google, read a few of these. Colossus is probably the most important.
- 2002 – GFS – The Google File System
- 2004 – MapReduce – Simplified Data Processing on Large Clusters
- 2006 – BigTable – Distributed Storage System for Structured Data
- 2008 – Dremel – Interactive Analysis of Web-Scale Datasets
- 2009 – Colossus – Google’s scalable storage system – successor to the Google File System (GFS)
- 2010 – Flume – Easy, Efficient Data-Parallel Pipelines
- 2011 – Megastore – Providing Scalable, Highly Available Storage for Interactive Services
- 2012 – Spanner – Spanner, TrueTime and the CAP Theorem
- 2012 – MillWheel – Fault-Tolerant Stream Processing at Internet Scale
- 2013 – PubSub – Thialfi: A Client Notification Service for Internet-Scale Applications
- 2014 – F1 – Distributed SQL Database That Scales – Google AdWords
- 2016 – Shasta – Interactive Reporting at Scale
- 2016 – Slicer – Auto-Sharding for Datacenter Applications
- 2019 – Zanzibar – Google’s Consistent, Global Authorization System
- 2019 – WarpFlow – Exploring Petabytes of Space-Time Data
- 2019 – SageDB – A Learned Database System
- 2021 – Napa – Powering Scalable Data Warehousing with Robust Query Performance at Google
- 2021 – Glean – Structured Extractions from Templatic Documents
- 2023 – Firestore – The NoSQL Serverless Database for the Application Developer
Products that use Cloud Bigtable
Bigtable is used internally at Google for a number of products.
- Gmail
- Google Analytics
- Google Blogger
- Google Books
- Google Code
- Google Earth
- Google Maps
- YouTube
Apache HBase and Apache Cassandra are two of the best-known open-source projects modeled after Bigtable.
As of January 2022, Bigtable manages over 10 exabytes of data and serves more than 5 billion requests per second.
Key Features
- Fully managed NoSQL database
- Horizontally scalable
- Optimized for high reads/writes per second
- Supports millions of requests per second with single-digit millisecond latency
- Low latency
- Highly scalable database
- Eventually consistent (replication between clusters is eventually consistent)
- SLO/SLA
- 99.9% – Cloud Bigtable – Zonal instance (single cluster)
- 99.999% – Replicated instance (2 or more clusters) with the multi-cluster routing policy (3 or more regions)
- Millions of columns
- Column Families
- Column Cells
- 256 MB per row
- Integration with Big Data Tools
- Apache HBase API Standard
- Apache Beam
- Apache Hadoop
- Apache Spark
- Integration with Google Products
- BigQuery
- Cloud Dataflow
- Cloud Dataproc
Architecture
- Horizontally scalable
- Throughput can be adjusted by adding or removing nodes
- Each node provides up to 10,000 queries per second
- Nodes do not store the data themselves; data is stored separately on Colossus, and each node manages a set of tablets
- No downtime while changing nodes
- Components
- Frontend Server Pool
- Bigtable Cluster
- Nodes
- Scalable Storage Backend
- Data is sharded into multiple “tablets”
- A Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Each tablet is associated with a node, and operations on these rows are performed on the node. To optimize performance, tablets are split or moved to a different node depending on access patterns. Based on user access patterns — read, write, and scan operations — tablets are rebalanced across the nodes.
- SSTable – Sorted String Table
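The tablet behavior described above can be sketched in plain Python. This is a simplified illustration of the idea, not Bigtable’s actual internals: the helper names are hypothetical, and the round-robin assignment stands in for Bigtable’s load-based rebalancing.

```python
def shard_into_tablets(row_keys, rows_per_tablet):
    """Group sorted row keys into contiguous blocks ("tablets")."""
    keys = sorted(row_keys)
    return [keys[i:i + rows_per_tablet] for i in range(0, len(keys), rows_per_tablet)]

def assign_to_nodes(tablets, num_nodes):
    """Round-robin assignment; real Bigtable rebalances based on access patterns."""
    assignment = {n: [] for n in range(num_nodes)}
    for i, tablet in enumerate(tablets):
        assignment[i % num_nodes].append(tablet)
    return assignment

keys = [f"user#{i:04d}" for i in range(10)]
tablets = shard_into_tablets(keys, 4)   # 3 tablets: 4 + 4 + 2 rows
nodes = assign_to_nodes(tablets, 2)
print(len(tablets), sorted(nodes))      # -> 3 [0, 1]
```

The key property to notice is that each tablet covers a contiguous, sorted range of row keys, which is why row-key design directly controls how load spreads across nodes.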
Use cases
- Key usage: Lots of data over lots of time (days)
- Time-series data
- Marketing data
- Low latency serving
- Financial data
- IoT data
- Graph data
Best practices
- Understand Cloud Bigtable performance – estimating throughput for Cloud Bigtable, how to plan Cloud Bigtable capacity by looking at throughput and storage use, how enabling replication affects read and write throughput differently, and how Cloud Bigtable optimizes data over time.
- Cloud Bigtable schema design – guidance on designing Cloud Bigtable schema, including concepts of key/value store, designing row keys based on planned read requests, handling columns and rows, and special use cases.
- Cloud Bigtable replication overview – how to replicate Cloud Bigtable across multiple zones or regions, understand performance implications of replication, and how Cloud Bigtable resolves conflicts and handles failovers.
- About Bigtable backups – how to save a copy of a table’s schema and data with Bigtable Backups, which can help you recover from application-level data corruption or from operator errors, such as accidentally deleting a table.
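The schema-design guidance above largely comes down to choosing good row keys. Below is a sketch of one common time-series pattern: prefix the key with an identifier so writes spread across tablets, and append a reversed timestamp so the newest data sorts first. The helper and key layout are illustrative, not an official API.

```python
import sys

def timeseries_row_key(device_id, timestamp_ms):
    """Build a row key of the form '<device>#<reversed timestamp>'.

    Leading with the device ID avoids hotspotting a single tablet;
    the reversed timestamp makes the newest cells sort first."""
    reversed_ts = sys.maxsize - timestamp_ms
    return f"{device_id}#{reversed_ts:019d}"

k1 = timeseries_row_key("sensor-7", 1_700_000_000_000)
k2 = timeseries_row_key("sensor-7", 1_700_000_001_000)
# The later event sorts lexicographically earlier, so a scan
# starting at "sensor-7#" returns the most recent data first.
print(k2 < k1)  # -> True
```

Starting a row key with a raw timestamp is the classic anti-pattern here: all writes land on the tablet holding the newest range, creating a hotspot.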
Cloud Bigtable pricing
When estimating the pricing for Bigtable, review the following major items:
- The type of Bigtable instance and the total number of nodes in your instance’s clusters
- The amount of storage that your tables use
- The amount of network bandwidth that you use
- Data Access audit log costs, if enabled. This item is often overlooked.
- Backup storage
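A back-of-the-envelope sketch of how these items combine into a monthly estimate. All rates below are assumed placeholders, not current GCP pricing; check the official Bigtable pricing page for real numbers.

```python
# Rough monthly cost sketch. Rates are illustrative placeholders,
# NOT current GCP pricing -- always check the Bigtable pricing page.
NODE_MONTH = 650.00       # assumed per-node monthly cost
SSD_GB_MONTH = 0.17       # hypothetical SSD storage rate, $/GB/month
BACKUP_GB_MONTH = 0.026   # hypothetical backup storage rate, $/GB/month

def estimate_monthly_cost(nodes, ssd_gb, backup_gb, network_egress=0.0):
    """Sum the major Bigtable cost drivers: nodes, storage, backups, network."""
    return (nodes * NODE_MONTH
            + ssd_gb * SSD_GB_MONTH
            + backup_gb * BACKUP_GB_MONTH
            + network_egress)

print(round(estimate_monthly_cost(nodes=3, ssd_gb=1000, backup_gb=500), 2))
```

Note that Data Access audit log costs (the often-overlooked item above) would show up as an additional Cloud Logging charge, not in this formula.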
Bigtable IAM
These notes on IAM are just tidbits that I want to remember. Refer to this guide for more details.
- You can configure access control at the following levels:
- project
- instance
- table
- IAM Permission Categories
- App profile permissions
- Backups permissions
- Cluster permissions
- Hot tablets permissions
- Instance permissions
- Key visualizer permissions
- Location permissions
- Table permissions
- Predefined IAM Roles
- Bigtable Administrator – roles/bigtable.admin
- Bigtable Reader – roles/bigtable.reader
- Bigtable User – roles/bigtable.user
- Bigtable Viewer – roles/bigtable.viewer
- In Bigtable, you cannot grant access to the following types of principals:
- allAuthenticatedUsers
- allUsers
- Instance-level IAM management
- gcloud bigtable instances set-iam-policy INSTANCE_ID POLICY_FILE
- Table-level IAM management
- gcloud bigtable instances tables set-iam-policy TABLE_ID --instance=INSTANCE_ID POLICY_FILE
- IAM conditions
- Date/time attributes
- Use to set temporary (expiring), scheduled, or limited-duration access to Bigtable resources. For example, you can allow a user to access a table until a specified date.
- Resource attributes
- Use to configure conditional access based on a resource name, resource type, or resource service attributes. In Bigtable, you can use attributes of instances, clusters, and tables to configure conditional access. For example, you can allow a user to manage tables only on tables that begin with a specific prefix, or you can allow a user to access only a specific table.
- Example IAM Policy File – note I have not verified this yet on a real Bigtable instance.
```json
{
  "bindings": [
    {
      "role": "roles/bigtable.user",
      "members": [
        "user:john@example.com"
      ],
      "condition": {
        "title": "expirable access",
        "description": "Does not grant access after Sep 2023",
        "expression": "request.time < timestamp('2023-10-01T00:00:00.000Z')"
      }
    }
  ],
  "etag": "BwWWja0YfJA=",
  "version": 3
}
```
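The same kind of policy file can be generated programmatically. A quick sketch (the member, role, and expiry values are placeholders, and the helper name is mine):

```python
import json

def expiring_binding(role, member, expire_iso):
    """Build an IAM binding with a time-bound condition (IAM Conditions)."""
    return {
        "role": role,
        "members": [member],
        "condition": {
            "title": "expirable access",
            "description": f"Does not grant access after {expire_iso}",
            "expression": f"request.time < timestamp('{expire_iso}')",
        },
    }

policy = {
    "bindings": [expiring_binding("roles/bigtable.user",
                                  "user:john@example.com",
                                  "2023-10-01T00:00:00.000Z")],
    "version": 3,  # version 3 is required for conditional bindings
}
print(json.dumps(policy, indent=2))
```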
Terraform
Documentation Resources
Interesting Google Articles
- Cloud Bigtable launches Autoscaling plus new features for optimizing costs and improved manageability
- Processing billions of events in real time at Twitter
- What’s new in Bigtable observability
Interesting Third-party Articles
- How does BigTable work/Best practices
- CSE 444: Database Internals – Lots of Bigtable internal information
- Google Bigtable: Distributed Structured Datastore (Part 1)
- Google Bigtable: Distributed Structured Datastore (Part 2)
- Google Bigtable: Bloom filters, commit logs. (Part 3)
- SSTable and Log Structured Storage: LevelDB
- Andrew Dawson: What is Bigtable?
Links for Developers
- All Bigtable code samples
- All Bigtable samples – treelist view
- Google BigTable Recipes – Interesting site with Bigtable code snippets in JavaScript and Python
- PHP
Practice Resources
Cloud Bigtable is fairly expensive to practice with personally. A single-node instance costs about $650 per month. For that reason, I recommend doing your practice in Google Qwiklabs so that the cost is zero beyond your membership fees.
- Qwiklabs – Manage Bigtable on Google Cloud – Quest
- GSP1053: Designing and Querying Bigtable Schemas
- GSP1054: Creating and Populating a Bigtable Instance
- GSP1055: Streaming Data to Bigtable
- GSP1056: Managing Bigtable Health and Performance
- Qwiklab – Additional Labs
- GSP099: Bigtable: Qwik Start – Command Line
- GSP1038: Introduction to Cloud Bigtable (Java)
- Cloud Bigtable Emulator
- gcloud components update
- gcloud components install beta
- gcloud components install bigtable
- gcloud beta emulators bigtable start
- gcloud beta emulators bigtable start --host-port=[HOST]:[PORT]
- docker run -p 127.0.0.1:8086:8086 --rm -ti google/cloud-sdk gcloud beta emulators bigtable start --host-port=0.0.0.0:8086
YouTube Videos
- Introduction
- Google: What is Cloud Bigtable? – Sep 15, 2021 – 5m 36s – good introduction
- Google: What can you do with Bigtable? – Sep 29, 2021 – 4m 30s – good introduction to use cases
- Architecture and Performance
- Google: Cloud Bigtable performance 101 – Nov 29, 2018 – 3m 12s – good introduction to architecture
- Google: Bigtable and Geolocation Performance – Jan 3, 2019 – 2m 39s – Bigtable with Pub/Sub and Functions
- Monitoring
- Google: Using Cloud Bigtable Monitoring UI – Dec 13, 2018 – 2m 58s – good introduction to monitoring
- Google: Key Visualizer for Cloud Bigtable – Jul 31, 2018 – 6m 13s – a must watch for the heat map explanation
- Longer Session Videos
- Google: Building a Global Data Presence with Cloud Bigtable (Cloud Next ’19) – 46m 36s – a must watch
- Google: Visualizing Cloud Bigtable Access Patterns at Twitter for Optimizing Analytics (Cloud Next ’18) – 35m 19s
- Google: Music Recommendations at Scale With Cloud Bigtable (Cloud Next ’19) – Apr 10, 2019 – 33m 22s – very good video
- Qwiklabs Courses
- Google: High-throughput streaming with Cloud Bigtable – Dec 15, 2021 – 13m 18s – very good training video
- Google: Optimizing Cloud Bigtable performance – Dec 15, 2021 – 7m 11s
Cloud Bigtable CLI
The cbt CLI is a command-line interface for performing several different operations on Cloud Bigtable.
- Written in Go using the Go client library for Cloud Bigtable
- Source code for the cbt CLI is available in the GoogleCloudPlatform/google-cloud-go GitHub repository
Installation
On Windows, installation requires “Run as Administrator”.
- gcloud components update
- gcloud components install cbt
- Create a .cbtrc file
- cbt listinstances
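For reference, a minimal `.cbtrc` lives in your home directory and sets the default project and instance so you do not have to pass them on every command. The IDs below are placeholders:

```
project = my-project-id
instance = my-instance-id
```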
Questions / Practice
- Backup/Restore
- How to create a backup
- How to export a backup
- How to restore from a backup
Note: this article is a work in progress while I train for the PCA and Database exams.
Summary
Of the Google Cloud managed database services, Bigtable is the most impressive. The management and monitoring GUIs are the best I have experienced across the big three cloud vendors. You can deploy a cluster in one region in 5 minutes, fill it with data and then add replicas around the world in minutes. No downtime, no customer interruption, no procedure to copy data, seed replication, etc. It does everything for you. Simply amazing.
I wish Google Cloud would consider creating a small single-node Bigtable instance that costs $100/month so that small companies can implement Bigtable without thinking about costs. Small companies do not need the performance that even a single node provides. As these companies grow they can switch to a standard instance type. Perhaps a limited-cost standard node at a reduced price for 12 months. Trust me, they won’t give up Bigtable when the promotion expires.
I design software for enterprise-class systems and data centers. My background is 30+ years in storage (SCSI, FC, iSCSI, disk arrays, imaging) and virtualization, and 20+ years in identity, security, and forensics.
For the past 14+ years, I have been working in the cloud (AWS, Azure, Google, Alibaba, IBM, Oracle) designing hybrid and multi-cloud software solutions. I am an MVP/GDE with several.