Complete Guide To Mastering and Optimizing Google Bigquery

If you are looking to get started with Bigquery, here are the concepts that you need to be familiar with so you can get the most optimum results from Google Bigquery.

Table Partitioning

It is useful to use the date time column to partition Bigquery datasets, this helps with the improvement of performance. If a date-time column is not available in the data set, you can use the ingestion time to partition the data set.

Clustered Tables

You can further optimize your queries in Bigquery by clustering according to some rows.

You should cluster based on the most used rows for your queries. The query below will be optimized based on wiki and table

CREATE TABLE `fh-bigquery.wikipedia_v3.pageviews_2017`
PARTITION BY DATE(datehour)
CLUSTER BY wiki, title
OPTIONS(
   description="Wikipedia pageviews - partitioned by day, clustered by (wiki, title). Contact https://twitter.com/felipehoffa"
   , require_partition_filter=true
)
AS SELECT * FROM `fh-bigquery.wikipedia_v2.pageviews_2017`
WHERE datehour > '1990-01-01' 

Nested Data

Bigquery works best with denormalized data, so the use of nested data and repeated fields is recommended over star schema or snowflake schema.

A good example of this is a library, usually in a relational database, you will have authors, books databases and then you will have a fact database defining the relationships between books and authors.

This example I have taken from the official bigquery documentation, and you will have the relationship in this way in a bigquery table.

  • id first_name
  • last_name
  • dob (date of birth)
  • addresses (a nested and repeated field)
    • addresses.status (current or previous)
    • addresses.address
    • addresses.city
    • addresses.state
    • addresses.zip
    • addresses.numberOfYears (years at the address)

Some other examples of the same are, you will sessions and have events that happened in that session as nested and repeated fields.

For practice, it is good to work with the google analytics public data set in Bigquery.

The google analytics table has a deep nested structure, with records fields that have other records that are nested in them.

Here is a query you can try out.

SELECT date, product.productPrice
  FROM bigquery-public-data.google_analytics_sample.ga_sessions_20170801,UNNEST(hits) as hits, UNNEST(hits.product) as product 
  WHERE visitId=1501583974

About the author

admin

Mastering Data Engineering/ Data science one project at a time. I have worked and developed multiple startups before, and this blog is for my journey as a startup in itself where I iterate and learn.

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Copyright © 2020. Created by Meks. Powered by WordPress.