Getting Data from Google AdWords to Google BigQuery Using Singer - Part 2
Now that we have the pipeline running, we want to make sure the state.json file is stored somewhere safe. Since we might destroy the VM instance at some point, it is best to keep the file in Google Cloud Storage.
Here are the steps we will take in this part:-
a) Creating a Dockerfile
b) Creating an entrypoint script that runs every time the Docker image starts; it runs the pipeline and loads/stores the state.json file from/to Google Cloud Storage.
Creating the entrypoint.sh file
Here is the Bash file that will run our pipeline.
#!/bin/bash
set -o pipefail  # make the tap | target pipeline report a failure from any stage

export GOOGLE_APPLICATION_CREDENTIALS=$PWD/client_secrets.json
export STORAGE_BUCKET="gs://state_storage_singer"

# Authenticate gcloud so gsutil can read/write the state bucket
gcloud auth activate-service-account --key-file=$GOOGLE_APPLICATION_CREDENTIALS
gsutil cp $STORAGE_BUCKET/state.json state.json

# Pass the state file to the tap only if it exists (it won't on the first run)
if [ -e state.json ]; then
  STATE="--state state.json"
fi

OUTPUT=$(mktemp)
env-tap/bin/tap-adwords -c adwordsconfig1.json -p output.json $STATE | env-target/bin/target-bigquery -c bigqueryconfig1.json | tee $OUTPUT

if [ $? -eq 0 ]; then
  # The last line of the target's output is the latest state message
  tail -1 $OUTPUT > state.json.tmp
  # Validate the JSON and discard the output
  python3 -m json.tool state.json.tmp > /dev/null
  if [ $? -eq 0 ]; then
    gsutil cp state.json.tmp $STORAGE_BUCKET/state.json
    rm state.json.tmp
    rm -f state.json
    echo "state.json updated."
  else
    echo "Not updating state.json. Invalid JSON"
    exit 1
  fi
else
  echo "Not updating state.json due to error."
  exit 1
fi
Here is an explanation of the code:-
a) We authenticate the gcloud service account so that we can use gsutil to copy the state.json file from Google Cloud Storage
b) If the file exists, we set the STATE variable to point to it; otherwise it is left empty
c) We run the Singer pipeline and write the bookmarks to a new state.json file
d) We upload the new state.json to Google Cloud Storage and remove the local copies from the compute instance (a quick way to inspect the stored file is shown below)
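If you want to confirm that the bookmark made it to the bucket, you can read it back with gsutil. The output shown in the comment is purely illustrative; the actual bookmark keys depend on tap-adwords and the reports you selected.

# Read the state file straight from the bucket used in entrypoint.sh
gsutil cat gs://state_storage_singer/state.json

# Illustrative output only -- real keys depend on the tap and your streams:
# {"bookmarks": {"1234567890_CAMPAIGN_PERFORMANCE_REPORT": {"date": "2020-03-01T00:00:00Z"}}}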
Creating the Dockerfile
Here are the files that you need to have in your working directory; the names must match the ones used in entrypoint.sh. A quick check that they are all present follows the list.
— adwordsconfig1.json (Config file for tap-adwords)
— output.json (Output catalog file for tap-adwords)
— bigqueryconfig1.json (Config file for target-bigquery)
— client_secrets.json (Service account key file for authenticating with BigQuery and Cloud Storage)
— entrypoint.sh (The entrypoint script created above)
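Before building the image, you can sanity-check that everything is in place. A minimal sketch (file names match the entrypoint script above; adjust if yours differ):

# Fail early if any file the entrypoint script and Dockerfile expect is missing
for f in adwordsconfig1.json output.json bigqueryconfig1.json client_secrets.json entrypoint.sh; do
  [ -f "$f" ] || { echo "Missing: $f"; exit 1; }
done
echo "All files present."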
Once you have all these files, let's get started with the Dockerfile.
# Dockerfile
FROM python:3.8-buster

RUN apt-get update && \
    apt-get -y install python3-pip

# Install the Google Cloud SDK (provides gcloud and gsutil)
RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH="$PATH:/root/google-cloud-sdk/bin"

RUN mkdir -p /home/pavneet992
WORKDIR /home/pavneet992
COPY . .

# Install the tap and the target in separate virtual environments
RUN python3 -m venv env-tap && env-tap/bin/pip3 install tap-adwords --no-cache-dir
RUN python3 -m venv env-target && env-target/bin/pip3 install target-bigquery --no-cache-dir

RUN chmod u+x entrypoint.sh
ENTRYPOINT [ "./entrypoint.sh" ]
Some pointers about the Dockerfile:-
a) I didn't opt for the slim version of the Python image because the BigQuery target's dependencies need gcc to compile their cryptography-related packages, and the slim image does not ship a compiler.
b) You need the Google Cloud SDK (which provides gcloud and gsutil) and pip3 installed in the image for the script to work
c) Copy all the files listed above into the Docker image
d) Use ENTRYPOINT to execute entrypoint.sh, so that when we run the image on a container-optimized Compute Engine instance the script runs every time the instance boots
e) Install tap-adwords and target-bigquery in separate virtual environments, since Singer taps and targets often have conflicting dependencies
Once you have put the above code into a Dockerfile, build the image and push it to Google Container Registry. It is easier to work with Container Registry if you are looking to run the scheduler on Google Cloud.
docker build -t gcr.io/${PROJECT_ID}/ga-bigquery-replication:latest .
docker push gcr.io/${PROJECT_ID}/ga-bigquery-replication:latest
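Note that pushing to Container Registry requires Docker to be configured with your gcloud credentials; if the push fails with an authentication error, run this one-time setup:

# Register gcloud as a Docker credential helper for gcr.io
gcloud auth configure-docker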
Also, before running these commands, add your project ID as an environment variable.
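For example, assuming you are already logged in with gcloud:

# Pull the project ID from your active gcloud configuration (or set it by hand)
export PROJECT_ID=$(gcloud config get-value project)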
You can check that you have built the image properly by running it.
sudo docker run --rm gcr.io/${PROJECT_ID}/ga-bigquery-replication:latest
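You can also override the entrypoint to poke around inside the image, for example to confirm that both virtual environments were built (image name as in the build step above):

# Start a shell instead of entrypoint.sh and check that the tap and target executables exist
sudo docker run --rm --entrypoint /bin/bash \
  gcr.io/${PROJECT_ID}/ga-bigquery-replication:latest \
  -c "ls env-tap/bin/tap-adwords env-target/bin/target-bigquery"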
Next Steps
Now that we have the image, in the next blog post we will run it on a schedule in Google Cloud.
Here are the links to all articles in this series:-
a) Part 1:- Creating a Singer pipeline to get data from Google AdWords to BigQuery
b) Part 2:- Creating the Dockerfile for the Singer pipeline
c) Part 3:- Automating the Singer pipeline