Scheduling a Singer pipeline on Google Cloud – Part 3: AdWords to BigQuery

If you have followed the earlier parts of this tutorial, you will have a Docker image with your Singer pipeline in Google Container Registry.

To run that image on a schedule, we will implement the following setup in Google Cloud.

System architecture diagram showing Cloud Scheduler scheduling a Compute Engine instance via Pub/Sub

Here are the tasks needed to complete the setup:

  1. Create a VM instance to get the historical data
  2. Create Pub/Sub topics to trigger the Cloud Functions
  3. Create Google Cloud Functions to start and stop the instance
  4. Set up Cloud Scheduler jobs

Create a VM instance to get the historical data

The strategy I am following here is to download all the historical data up to the current date once and store that date in the state.json file. After that, a daily scheduled job loads the previous day's data into BigQuery.
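The bookmark idea can be sketched in shell. This is a minimal illustration, assuming the target prints one state line after each batch it commits (the usual Singer convention). The file names, the `example_report` bookmark key, and the dates are made up for the demo, and the `printf` stands in for the real tap-to-target pipe from Part 1.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir="$(mktemp -d)"

# Stand-in for `tap ... --state state.json | target ... >> state-log.json`:
# a Singer target writes a state line to stdout after each committed batch.
# The bookmark key and dates below are hypothetical.
printf '%s\n' \
  '{"bookmarks": {"example_report": {"date": "2020-02-29"}}}' \
  '{"bookmarks": {"example_report": {"date": "2020-03-01"}}}' \
  >> "$workdir/state-log.json"

# Keep only the newest line; the next run resumes from this bookmark.
tail -n 1 "$workdir/state-log.json" > "$workdir/state.json"

cat "$workdir/state.json"
```

Because only the last line is kept, each daily run starts from where the previous one stopped instead of reloading the full history.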

We will set up a containerized Compute Engine instance using the Docker image we created in the previous article. The reason for this is that we do not need the instance to be on all the time: it simply runs the specified container on boot and can be stopped once the load is done. Make sure to set the restart policy to On Failure and give the container Privileged Access. You could also try a startup script to achieve the same result.


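# --container-image is the Docker image to run; --image would select the boot disk image instead.
gcloud compute instances create-with-container temp-instance \
    --container-image gcr.io/careful-parser-269221/ga-bigquery-replication:latest \
    --zone us-east1-d \
    --machine-type n1-standard-1 \
    --container-restart-policy on-failure \
    --container-privileged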

Run this command in Google Cloud Shell to create the instance. Once the instance is up and the container has finished running, you should see a new table in BigQuery containing the historical data.

Create Pub/Sub Topics

After this, you will need to create two Pub/Sub topics: one for starting the instance and another for stopping it.
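The two topics can also be created from Cloud Shell. The sketch below assumes the topic names `start-instance-event` and `stop-instance-event`, which are my own placeholders; any names work as long as the functions and scheduler jobs use the same ones. By default it only prints the commands, so nothing is created until you run it with `GCLOUD=gcloud`.

```shell
# Topic names are assumptions; pick any names and reuse them in later steps.
START_TOPIC="start-instance-event"
STOP_TOPIC="stop-instance-event"

# Dry run by default: prints each command. Run with GCLOUD=gcloud to execute.
GCLOUD="${GCLOUD:-echo gcloud}"

$GCLOUD pubsub topics create "$START_TOPIC"
$GCLOUD pubsub topics create "$STOP_TOPIC"
```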

Create Google Cloud Functions

We will need two Google Cloud Functions, each triggered by its respective Pub/Sub topic.

Starting the Instance

Here is the code for index.js

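/**
 * Triggered from a message on a Cloud Pub/Sub topic.
 */
var Compute = require('@google-cloud/compute');
var compute = Compute();
// Pub/Sub (background) functions receive (event, context, callback), not (req, res).
// This signature assumes a Node.js 8 or later runtime.
exports.startInstance = function startInstance(event, context, callback) {
    var zone = compute.zone('Zone of your instance');
    var vm = zone.vm('Name of your instance');
    vm.start(function(err, operation, apiResponse) {
        if (err) {
            console.error('Failed to start instance:', err);
            return callback(err);
        }
        console.log('Instance start requested successfully');
        callback(null, 'Success start instance');
    });
};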

package.json

{
  "name": "sample-pubsub",
  "version": "0.0.1",
  "dependencies": {
    "@google-cloud/pubsub": "^0.18.0",
    "@google-cloud/compute": "0.7.1"
  }
}

Stopping the Instance

index.js

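var Compute = require('@google-cloud/compute');
var compute = Compute();
// Pub/Sub (background) functions receive (event, context, callback), not (req, res).
// This signature assumes a Node.js 8 or later runtime.
exports.stopInstance = function stopInstance(event, context, callback) {
    var zone = compute.zone('us-east1-d');
    var vm = zone.vm('temp-instance');
    vm.stop(function(err, operation, apiResponse) {
        if (err) {
            console.error('Failed to stop instance:', err);
            return callback(err);
        }
        console.log('Instance stop requested successfully');
        callback(null, 'Success stop instance');
    });
};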

package.json

{
  "name": "sample-pubsub",
  "version": "0.0.1",
  "dependencies": {
    "@google-cloud/pubsub": "^0.18.0",
    "@google-cloud/compute": "0.7.1"
  }
}
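If you prefer the command line over the console, each function can be deployed with a Pub/Sub trigger roughly as follows. This is a sketch under several assumptions: the topic names, the `./start-instance` and `./stop-instance` source directories, and the `nodejs20` runtime are placeholders of mine, and the pinned client library versions above are old enough that they may need updating for a current runtime. Newer gcloud releases may also default to a different function generation, so check `gcloud functions deploy --help` before running for real.

```shell
# Assumed names: topics created earlier, one source directory per function.
START_TOPIC="start-instance-event"
STOP_TOPIC="stop-instance-event"
RUNTIME="nodejs20"   # assumption: use a Node.js runtime your project supports

# Dry run by default: prints each command. Run with GCLOUD=gcloud to execute.
GCLOUD="${GCLOUD:-echo gcloud}"

# Each directory holds the index.js and package.json shown above.
# The function name doubles as the entry point (the exported function).
$GCLOUD functions deploy startInstance \
    --runtime "$RUNTIME" \
    --trigger-topic "$START_TOPIC" \
    --source ./start-instance

$GCLOUD functions deploy stopInstance \
    --runtime "$RUNTIME" \
    --trigger-topic "$STOP_TOPIC" \
    --source ./stop-instance
```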

Creating the Cloud Scheduler Jobs

This is the last step of the setup. I have set up two jobs. The first one starts the instance by sending a message to the start Pub/Sub topic; it runs every morning at 1:00 am.

The second job sends a message to the stop topic at 1:30 every morning to stop the instance.
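The two schedules translate to the cron expressions `0 1 * * *` and `30 1 * * *`. A minimal sketch of creating the jobs from Cloud Shell follows. The job names, topic names, and message bodies are assumptions for illustration. As before, the commands are only printed unless you run with `GCLOUD=gcloud`, and the printed form drops the quotes around the schedule.

```shell
# Assumed topic names from the Pub/Sub step.
START_TOPIC="start-instance-event"
STOP_TOPIC="stop-instance-event"
START_SCHEDULE="0 1 * * *"    # every day at 01:00
STOP_SCHEDULE="30 1 * * *"    # every day at 01:30

# Dry run by default: prints each command. Run with GCLOUD=gcloud to execute.
GCLOUD="${GCLOUD:-echo gcloud}"

# Schedules are evaluated in UTC unless --time-zone is passed.
$GCLOUD scheduler jobs create pubsub start-instance-job \
    --schedule "$START_SCHEDULE" \
    --topic "$START_TOPIC" \
    --message-body "start"

$GCLOUD scheduler jobs create pubsub stop-instance-job \
    --schedule "$STOP_SCHEDULE" \
    --topic "$STOP_TOPIC" \
    --message-body "stop"
```

The 30-minute gap is the window the container has to finish the daily load, so widen it if your runs take longer.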

That’s All Folks

Once all of the above steps are set up, you will have an automated pipeline running in Google Cloud.

Here are the links to all articles in this series:-

a) Part 1:- Creating a singer pipeline getting data from google adwords to bigquery
b) Part 2:- Creating the docker file for the singer pipeline
c) Part 3:- Automating the singer pipeline

About the author

admin

Mastering Data Engineering/ Data science one project at a time. I have worked and developed multiple startups before, and this blog is for my journey as a startup in itself where I iterate and learn.


Copyright © 2020. Created by Meks. Powered by WordPress.