Getting Data From Google AdWords to Google BigQuery using Singer - Part 2

Now that we have the pipeline running, we want to make sure the state.json file is stored somewhere safe. Since we might opt to destroy the VM instance at some point, it is best to keep the file in Google Cloud Storage.
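If you have not created the Cloud Storage bucket yet, a one-off command along these lines should do it (the bucket name matches the one used in the script below; the region is just an example, pick whatever suits you):

gsutil mb -l us-central1 gs://state_storage_singer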

Here are the steps we will take in this part:-

a) Create a Dockerfile

b) Create an ENTRYPOINT script that runs every time the Docker image starts: it loads state.json from Google Cloud Storage, runs the pipeline, and stores the updated state.json back

Creating the entrypoint.sh file

Here is the Bash file that will run our pipeline.

#!/bin/bash
# Make the pipeline check below reflect a failure in any command of the pipe, not just the last one
set -o pipefail

export GOOGLE_APPLICATION_CREDENTIALS="$PWD/client_secrets.json"
export STORAGE_BUCKET="gs://state_storage_singer"

# Authenticate gcloud so gsutil can read and write the state bucket
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"

# Pull the previous state, if any (this fails harmlessly on the very first run)
gsutil cp "$STORAGE_BUCKET/state.json" state.json
if [ -e state.json ]; then
  STATE="--state state.json"
fi

OUTPUT=$(mktemp)
# $STATE is intentionally left unquoted so the flag and filename are passed as two arguments
env-tap/bin/tap-adwords -c adwordsconfig1.json -p output.json $STATE | env-target/bin/target-bigquery -c bigqueryconfig1.json | tee "$OUTPUT"
if [ $? -eq 0 ]; then
  # The last line emitted by the target is the final state message
  tail -1 "$OUTPUT" > state.json.tmp
  # Validate that the new state is well-formed JSON before uploading it
  python3 -m json.tool state.json.tmp > /dev/null
  if [ $? -eq 0 ]; then
    gsutil cp state.json.tmp "$STORAGE_BUCKET/state.json"
    rm -f state.json.tmp state.json
    echo "state.json updated in $STORAGE_BUCKET"
  else
    echo "Not updating state.json. Invalid JSON"
    exit 1
  fi
else
  echo "Not updating state.json due to error."
  exit 1
fi

Here is an explanation of the code:-

a) We authenticate the gcloud service account so that we can use gsutil to copy the state.json file from Google Cloud Storage
b) If the state file exists, we set the STATE variable to hold the --state flag; otherwise it stays empty
c) We run the Singer pipeline, capture its output, and keep the last emitted state message (the bookmarks) as the new state.json
d) We upload state.json to Google Cloud Storage and remove the local copies from the compute instance
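After a successful run, a quick way to confirm that the bookmark made it to the bucket is:

gsutil cat gs://state_storage_singer/state.json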

Creating the Dockerfile

Here are the files that you need to have in your working directory:-

— adwordsconfig1.json (Config file for tap-adwords)
— output.json (Catalog file for tap-adwords)
— bigqueryconfig1.json (Config file for target-bigquery)
— client_secrets.json (Service account key used to authenticate with BigQuery and Cloud Storage)
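Once entrypoint.sh and the Dockerfile below are added, a quick listing should show everything the pipeline references:

ls
# adwordsconfig1.json  bigqueryconfig1.json  client_secrets.json
# Dockerfile           entrypoint.sh         output.json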

Once you have all these files, let's get started with the Dockerfile.

# Dockerfile
FROM python:3.8-buster
RUN apt-get update && \
    apt-get -y install python3-pip

# Install the Google Cloud SDK so gcloud and gsutil are available in the image
RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH="$PATH:/root/google-cloud-sdk/bin"

RUN mkdir -p /home/pavneet992
WORKDIR /home/pavneet992
COPY . .

# Install the tap and the target in separate virtual environments
RUN python3 -m venv env-tap && env-tap/bin/pip3 install tap-adwords --no-cache-dir
RUN python3 -m venv env-target && env-target/bin/pip3 install target-bigquery --no-cache-dir

RUN chmod u+x entrypoint.sh
ENTRYPOINT [ "./entrypoint.sh" ]

Some pointers about the Dockerfile:-

a) I didn't opt for the slim version of Python because the BigQuery libraries need gcc to compile the C extensions that provide encryption support, and the slim image does not include it
b) You need the Google Cloud SDK (which provides gcloud and gsutil for Cloud Storage) and pip3 installed in the image for the code to work
c) Copy all the files listed above into the Docker image
d) Use ENTRYPOINT to execute entrypoint.sh; that way we can run the image on a containerized Compute Engine instance and the pipeline runs every time the instance boots (a rough example of such an instance follows this list)
e) Install tap-adwords and target-bigquery in separate virtual environments, since taps and targets often pin conflicting dependency versions
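For reference only (we will set this up properly in Part 3), a container-optimized instance running this image can be created with something along these lines; the instance name, zone, and machine type are placeholders:

gcloud compute instances create-with-container singer-pipeline \
    --zone=us-central1-a \
    --machine-type=e2-small \
    --container-image=gcr.io/${PROJECT_ID}/ga-bigquery-replication:latest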

Once you have put the above code into a Dockerfile, build the image and push it to Google Container Registry. It is easier to work with Google Container Registry if you are looking to run the scheduler on Google Cloud.

docker build -t gcr.io/${PROJECT_ID}/ga-bigquery-replication:latest .
docker push gcr.io/${PROJECT_ID}/ga-bigquery-replication:latest

Before running these commands, set your project ID as an environment variable.
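Assuming you are already logged in with gcloud, something like the following should work; the second command authorizes Docker to push to Container Registry:

export PROJECT_ID=$(gcloud config get-value project)
gcloud auth configure-docker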

You can check if you have created the image properly by running it.

sudo docker run --rm gcr.io/${PROJECT_ID}/ga-bigquery-replication:latest

Next Steps

Now that we have the image, in the next post we will run it on a schedule in Google Cloud.

Here are the links to all articles in this series:-

a) Part 1:- Creating a Singer pipeline to get data from Google AdWords to BigQuery
b) Part 2:- Creating the Dockerfile for the Singer pipeline
c) Part 3:- Automating the Singer pipeline



