1. What is BigQuery?
Answer: BigQuery is a fully-managed, serverless, and highly scalable data warehouse
provided by Google Cloud. It allows for super-fast SQL queries on large datasets by
utilizing Google’s infrastructure. It uses a distributed architecture to manage
storage and compute, which enables users to query large amounts of data in real-
time without worrying about provisioning servers or managing infrastructure.
External Table:
External tables reference data stored outside BigQuery (e.g., in GCS), while native
tables store data internally in BigQuery.
2. Explain the concept of partitioning in BigQuery.
Answer: Partitioning in BigQuery is a method of dividing a large table into
smaller, more manageable pieces called partitions. Partitioning improves query
performance by limiting the amount of data scanned during queries. BigQuery
supports partitioning by date (e.g., partitioning a table by a TIMESTAMP or DATE
field) and integer range partitioning. Each partition is stored separately,
allowing queries to scan only the relevant partitions instead of the entire table.
3. What is the difference between clustering and partitioning in BigQuery?
Answer:
Partitioning: Divides a table into smaller segments based on a specified field
(usually a date or integer). This reduces the amount of data scanned by queries
that filter on that partitioning field.
Clustering: Clustering organizes the data within each partition (or the entire
table if not partitioned) based on the values of specified columns. It optimizes
queries that filter on or aggregate by those clustered fields by reducing the
amount of data scanned.
While partitioning divides the table at a high level, clustering organizes the data
within those partitions.
Query to list partitions of a table
Use the special metadata table: [Link].INFORMATION_SCHEMA.PARTITIONS`
SELECT
table_name,
partition_id,
total_rows,
total_bytes,
last_modified_time
FROM
`your_project.your_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'your_partitioned_table';
SELECT
table_name,
clustering_column
FROM
`your_project.your_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
clustering_column IS NOT NULL;
4. How does BigQuery handle schema changes?
Answer: BigQuery allows for schema changes to be made in several ways:
Adding new fields: New fields can be added without affecting existing data.
Modifying data types: Some schema modifications, like changing a field’s data type,
are not supported directly and would require creating a new table and migrating the
data.
Deleting fields: BigQuery doesn’t allow direct deletion of fields, but you can
create a new table with the desired schema and load the data from the old table.
BigQuery also supports schema auto-detection during data load, but this can be
overridden by specifying the schema manually.
5. What is a "sharded table" in BigQuery, and when would you use it?
Answer: A "sharded table" in BigQuery refers to multiple tables with the same
schema but different suffixes (such as _202301, _202302, etc.). Instead of
partitioning a table, you can store data in multiple tables based on some criteria
(typically date). You would use this approach when you want to store data in
separate tables but still manage them as a single logical dataset.
Sharding is typically used when:
Data grows rapidly over time, and partitioning might not be a good fit.
Data is naturally segmented by time (e.g., daily or monthly data).
However, it requires more careful query management, as you need to specify the
right tables to query.
6. How do you optimize queries in BigQuery?
Answer: To optimize queries in BigQuery, you can use the following strategies:
Use partitioned and clustered tables: Partitioning and clustering your tables can
significantly reduce query costs by limiting the amount of data that is scanned.
Use approximate aggregation functions: BigQuery supports approximate functions like
APPROX_COUNT_DISTINCT() and APPROX_QUANTILES() that can provide faster results with
minimal loss of accuracy.
**Avoid SELECT ***: Always specify the columns you need in the query to reduce the
amount of data being processed.
Use materialized views: Materialized views store precomputed results, which can
speed up repetitive queries.
Use caching: BigQuery caches query results, so if the same query is run multiple
times, it will return results faster.
Limit use of JOINs: Use JOINs judiciously, as they can increase the cost and time
for a query. Consider using WITH clauses or subqueries to reduce complexity.
7. Explain the concept of "flat-rate" pricing vs "on-demand" pricing in BigQuery.
Answer:
On-demand pricing: With on-demand pricing, you pay for the amount of data processed
by your queries. BigQuery charges per byte of data processed, which means costs can
vary based on query complexity.
Flat-rate pricing: With flat-rate pricing, you pay a fixed monthly fee for a
predefined amount of query processing capacity. This model can be beneficial for
organizations with high query volumes, as it provides predictable costs regardless
of how much data is processed.
8. What are BigQuery Data Transfer Service and its use cases?
Answer: The BigQuery Data Transfer Service allows you to schedule and automate the
movement of data from various external data sources to BigQuery. Some common use
cases include:
Data transfer from Google Cloud Storage: Automating the loading of data stored in
Google Cloud Storage into BigQuery.
SaaS data integrations: BigQuery Data Transfer Service supports integrations with
third-party SaaS applications like Google Ads, YouTube, and others, making it easy
to bring in data from external services.
Scheduled data loading: It allows for the scheduled and automatic transfer of data
into BigQuery on a daily, weekly, or monthly basis.
9. What is BigQuery’s “streaming insert” feature, and when should you use it?
Answer: The streaming insert feature in BigQuery allows you to insert real-time
data into a table. This is useful when you need to load data continuously or with
minimal delay (e.g., IoT data, financial transactions, or logs).
You should use streaming inserts when:
You require near-real-time data ingestion.
The volume of incoming data is small to medium, as streaming inserts are more
expensive than batch loading data.
The data doesn’t need to be loaded in large batches.
However, keep in mind that streaming data incurs additional costs for the ingestion
and storage.
10. How does BigQuery handle large datasets, and what are the best practices for
querying them?
Answer: BigQuery is designed to handle very large datasets efficiently through its
distributed architecture. It automatically scales resources to manage large
queries. Some best practices include:
Optimize partitioning and clustering: Use partitioned and clustered tables to
minimize the data scanned during queries.
Use WITH clauses: Break down complex queries using common table expressions (CTEs)
to make them more readable and manageable.
Leverage BigQuery’s UI or tools like the BigQuery API or client libraries: These
provide helpful options for interacting with BigQuery at scale, such as query
retries or incremental data loads.
For very large datasets, it's important to design queries carefully to avoid
scanning unnecessary data, using the above optimizations.
11. What are the key differences between Google BigQuery and traditional databases
like MySQL or PostgreSQL?
Answer:
Serverless: BigQuery is serverless, meaning you don’t have to manage or provision
hardware, while traditional databases like MySQL and PostgreSQL require
infrastructure management.
Scalability: BigQuery automatically scales to handle petabytes of data, while
traditional databases may require additional configuration and maintenance as data
grows.
Storage and compute separation: BigQuery separates storage and compute, meaning you
pay for storage and compute independently. Traditional databases often combine
both, and scaling requires resizing the database or moving to a more powerful
instance.
Query performance: BigQuery uses distributed architecture to process queries,
enabling faster queries on large datasets compared to traditional databases, which
might struggle with very large datasets.
Pricing model: BigQuery uses an on-demand pricing model based on the amount of data
processed for each query, whereas traditional databases may use a more traditional
pricing model based on instance size or storage.
12. What are the different ways to load data into BigQuery?
Answer: BigQuery supports several ways to load data:
Batch loading: Using tools like the BigQuery web UI, bq command-line tool, or API
to load large datasets from files in Cloud Storage into BigQuery.
Streaming inserts: Real-time data can be inserted using the streaming API for low-
latency ingestion.
Data Transfer Service: Automates and schedules the transfer of data from various
external sources into BigQuery.
Federated queries: Queries can be run on external data sources like Google Cloud
Storage or Google Sheets without loading the data into BigQuery.
Different types of storage classes:
Storage Class Access Frequency Min Duration Retrieval Cost Use Case
Standard Frequent None Low Active data
Nearline Monthly 30 days Medium Monthly backup
Coldline Quarterly 90 days Higher Archival
Archive Yearly or less 365 days Highest Long-term storage
======================================
In Google BigQuery, partitioning and bucketing are techniques used to optimize the
performance and manageability of large datasets by organizing data into smaller,
more manageable pieces. These techniques help with querying efficiency and cost
reduction.
Partitioning in BigQuery
Partitioning divides a table into smaller, more manageable segments based on the
values of a particular column, typically a timestamp or date. Partitioning is
useful for tables that store time-based or range-based data, as it allows queries
to scan only relevant partitions, improving performance and reducing costs.
Types of Partitioning in BigQuery:
Ingestion-time Partitioning:
Data is partitioned automatically based on the time it was ingested into BigQuery.
Partition column: _PARTITIONTIME (auto-generated timestamp field).
Example Use Case: Large log datasets that you expect to insert over time.
Date Partitioning:
Data is partitioned by a specific DATE column in your table, such as the timestamp
of an event or transaction.
You specify the column in the PARTITION BY clause during table creation.
Example: A table storing transaction data, where each transaction has a
transaction_date column.
Syntax Example:
sql
Copy
CREATE TABLE `[Link]`
PARTITION BY DATE(transaction_date)
AS
SELECT * FROM `source_table`;
Integer Range Partitioning:
Data is partitioned by an integer column, with each partition representing a range
of integer values.
Example: Partitioning a table that tracks customer IDs or order IDs by ranges.
Benefits of Partitioning:
Improved Query Performance: Queries scan only relevant partitions, reducing the
data processed.
Cost Reduction: Since you're only scanning the partitions that matter, costs are
lowered.
Data Management: Easy management of time-based or range-based datasets.
Bucketing in BigQuery
Bucketing organizes data into multiple buckets based on the hash of a specific
column, helping with query performance, especially when queries involve JOIN
operations. Bucketing doesn't partition data by time or range but instead by a
column value.
Key Features of Bucketing:
Column-Based Hashing: Bucketing uses the hash of a column value to divide data into
a specified number of buckets.
Improves JOINs: Bucketing improves the efficiency of JOIN operations, particularly
when joining on the bucketed column. The data in each bucket is stored together, so
joining becomes faster.
Fixed Buckets: The number of buckets must be predefined when creating a bucketed
table.
Example Use Case:
If you have a dataset with customer IDs and need to join it with another dataset on
customer IDs, bucketing ensures that data with the same customer ID is placed in
the same bucket, improving query performance.
Syntax Example for Bucketing:
sql
Copy
CREATE TABLE `[Link]`
PARTITION BY DATE(transaction_date)
CLUSTER BY customer_id
AS
SELECT * FROM `source_table`;
Note: BigQuery bucketing can be combined with partitioning (e.g., partitioned by
date and bucketed by customer ID), but it’s not mandatory.
Benefits of Bucketing:
Efficient Joins: Bucketing improves the performance of queries with joins on
bucketed columns because related data is stored together.
Improved Query Performance: Especially for large tables with frequent JOIN
operations on specific columns.
Comparison: Partitioning vs. Bucketing
Feature Partitioning Bucketing
Purpose Divide data into segments based on a column value, usually time-based.
Divide data into buckets based on a hashed column value.
Column Usually uses a DATE or TIMESTAMP column. Uses any column (often for
JOIN optimization).
Query Optimization Optimizes for time or range-based queries. Optimizes
for JOIN operations on the bucketed column.
Cost Optimization Reduces the amount of data scanned for time/range-based queries.
Reduces the data scanned during JOIN operations.
Granularity Partitions data into ranges (time, date, or integer). Buckets are fixed
in number and determined by the hash of a column.
Use Case Ideal for time-series data, logs, or event data. Ideal for
optimizing JOIN queries on specific columns.
When to Use Partitioning and Bucketing:
Partitioning: Use partitioning when your data is time-based or logically divided
into distinct ranges (e.g., sales data per day, or logs).
Bucketing: Use bucketing when you frequently perform JOIN operations on a specific
column and want to optimize the join performance.
Both partitioning and bucketing in BigQuery are powerful ways to optimize
performance and reduce costs, but they serve different purposes and can be combined
to achieve even more efficient queries, especially when dealing with large
datasets.
Architecture:
🔹 Dremel
Columnar, interactive query engine that powers BigQuery for ultra-fast SQL analysis
on massive datasets.
🔹 Root Node (Dremel)
Coordinates the query execution across the tree hierarchy in Dremel.
🔹 Leaf Node (Dremel)
Executes the actual data scan and computation tasks on the underlying data.
🔹 Mirror Node (Dremel)
Acts as a replica for fault tolerance, mirroring another node's responsibilities.
🔹 Colossus
Google's distributed file system that stores all data used by services like
BigQuery, GCS, and Spanner.
🔹 Jupiter
High-bandwidth datacenter network fabric that connects servers in Google's
infrastructure with low latency.
🔹 Borg
Google’s cluster manager that schedules and runs containerized workloads — the
predecessor to Kubernetes.
======================
1. Frontend Layer:
This layer handles all user interactions and query requests. It processes and
parses SQL queries submitted through the BigQuery UI, API, or command-line
interface (CLI).
It also manages authentication and authorization for users to access the system.
2. Query Execution Engine:
Distributed Query Execution: BigQuery automatically distributes queries across
multiple nodes in a cluster. When a query is submitted, it is broken down into
multiple tasks that are processed in parallel across many machines to speed up the
query execution.
Dremel: BigQuery uses Dremel (Google’s internal distributed query execution engine)
to process queries efficiently. Dremel enables nested data processing, which is
useful when dealing with large, nested datasets (e.g., JSON).
3. Storage Layer:
Columnar Storage: BigQuery stores data in a columnar format, which optimizes
storage and enables efficient querying, especially for analytical workloads.
Distributed File System: Data is stored in a highly distributed and redundant
manner across multiple nodes in Google's infrastructure, ensuring both high
availability and scalability.
Storage is separated from Compute: Unlike traditional databases, BigQuery separates
compute (query processing) from storage. This means you can scale the compute and
storage independently, allowing flexibility and cost control.
4. Data Ingestion:
BigQuery supports multiple ways to load data, including streaming (real-time data
ingestion), batch loading from Google Cloud Storage, and importing data from
external sources (e.g., Google Sheets, Cloud Pub/Sub, or Google Cloud Dataflow).
Data is automatically partitioned and indexed when loaded into BigQuery for faster
querying.
5. Execution Resources:
Slots: BigQuery uses a concept called "slots" to allocate compute resources. A slot
is a unit of computational capacity, and queries are assigned a number of slots
depending on their complexity. You can purchase dedicated slots for better
performance, or BigQuery can automatically manage resources on a pay-per-use model.
6. Security and Access Control:
BigQuery integrates with Google Cloud’s Identity and Access Management (IAM) to
control who can access specific datasets, tables, and even query execution.
It supports encryption for data at rest and in transit.
7. APIs and Integration:
BigQuery provides REST APIs, client libraries (in languages like Python, Java, and
Go), and a command-line interface (CLI) for interacting with the system.
BigQuery also integrates seamlessly with other Google Cloud services like Dataflow,
Dataproc, AI Platform, and Data Studio for visualization and data analytics.
8. Machine Learning (BigQuery ML):
BigQuery allows users to run machine learning models directly within the platform
using SQL. This is an integrated approach where users can perform data analysis and
machine learning without needing to move data between systems.
9. Federated Queries:
BigQuery allows querying external data sources, such as data stored in Google Cloud
Storage, Google Sheets, or other external databases (via BigQuery Omni). This can
be done without importing the data into BigQuery itself.
10. Data Partitioning and Clustering:
Partitioning: BigQuery allows data to be partitioned by a specific field (such as
date). This improves query performance and reduces costs by scanning only the
relevant partitions.
Clustering: Within a partition, BigQuery can cluster data based on specified
columns, which further optimizes performance by reducing the amount of data
scanned.
===============================================
Restore the deleted data...
✅ 1. Check Trash (Soft Delete - 7 Days Retention)
As of late 2023, BigQuery supports a trash feature for deleted tables. If your
organization hasn't disabled it:
🔁 You can restore the table within 7 days.
Steps:
Go to BigQuery Console → Select your project.
Click on "Trash" in the left navigation pane.
Find the deleted table (you can sort/filter by deletion time).
Click "Restore" next to it.
⚠️ If "Trash" is not visible: Your organization might have disabled soft delete.
✅ 2. Time Travel Recovery (If Table was Overwritten)
If the table wasn't deleted, but overwritten or modified, you can use BigQuery time
travel to access past versions (up to 7 days back).
Use this SQL syntax:
sql
Copy
Edit
SELECT * FROM `[Link]` FOR SYSTEM_TIME AS OF
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
Or for a specific timestamp:
sql
Copy
Edit
SELECT * FROM `[Link]` FOR SYSTEM_TIME AS OF TIMESTAMP("2025-06-15
10:00:00")
You can then export that data to a new table.
✅ 3. Check Scheduled Backups or Export Jobs
If your project uses scheduled queries, dataform, or Dataflow jobs, check if:
There are automatic table exports to Cloud Storage
Snapshots were created (CREATE SNAPSHOT TABLE)
Replicas or views exist in another dataset
❌ Not Recoverable If:
Table was permanently deleted and trash is disabled.
More than 7 days have passed.
✅ Recovery Action (if eligible):
You can restore via:
bash
Copy
Edit
bq cp --restore_table true project:dataset.__TABLES__@timestamp
project:dataset.new_table
Replace timestamp with UNIX epoch seconds for when the table existed.
================================
customer managed encryption