Complete Guide To Mastering and Optimizing Google Bigquery
If you are looking to get started with Bigquery, here are the concepts that you need to be familiar with so you can get the most optimum results from Google Bigquery.
Table Partitioning
It is useful to use the date time column to partition Bigquery datasets, this helps with the improvement of performance. If a date-time column is not available in the data set, you can use the ingestion time to partition the data set.
Clustered Tables
You can further optimize your queries in Bigquery by clustering according to some rows.
You should cluster based on the most used rows for your queries. The query below will be optimized based on wiki and table
CREATE TABLE `fh-bigquery.wikipedia_v3.pageviews_2017`
PARTITION BY DATE(datehour)
CLUSTER BY wiki, title
OPTIONS(
description="Wikipedia pageviews - partitioned by day, clustered by (wiki, title). Contact https://twitter.com/felipehoffa"
, require_partition_filter=true
)
AS SELECT * FROM `fh-bigquery.wikipedia_v2.pageviews_2017`
WHERE datehour > '1990-01-01'
Nested Data
Bigquery works best with denormalized data, so the use of nested data and repeated fields is recommended over star schema or snowflake schema.
A good example of this is a library, usually in a relational database, you will have authors, books databases and then you will have a fact database defining the relationships between books and authors.
This example I have taken from the official bigquery documentation, and you will have the relationship in this way in a bigquery table.
- id first_name
- last_name
- dob (date of birth)
- addresses (a nested and repeated field)
- addresses.status (current or previous)
- addresses.address
- addresses.city
- addresses.state
- addresses.zip
- addresses.numberOfYears (years at the address)
Some other examples of the same are, you will sessions and have events that happened in that session as nested and repeated fields.
For practice, it is good to work with the google analytics public data set in Bigquery.
The google analytics table has a deep nested structure, with records fields that have other records that are nested in them.
Here is a query you can try out.
SELECT date, product.productPrice
FROM bigquery-public-data.google_analytics_sample.ga_sessions_20170801,UNNEST(hits) as hits, UNNEST(hits.product) as product
WHERE visitId=1501583974
No comments yet.
Add your comment