Author - admin

1. Building a Web Analytics System Using Kafka and Spark Streaming in Google Cloud
2. Creating Your First Data Pipeline in Google Cloud with Apache Sqoop and Apache Airflow
3. Complete Guide To Mastering and Optimizing Google BigQuery
4. Scheduling a Singer Pipeline on Google Cloud – Part 3: AdWords to BigQuery
5. Getting Data From Google AdWords to Google BigQuery Using Singer – Part 2
6. Getting Data From Google AdWords into BigQuery Using Singer – Part 1
7. Simple and Complete Tutorial on Simple Linear Regression
8. Best Practices For Data Modeling in Power BI
9. How to Connect Power BI to an External API Using OAuth 2
10. How to Create Visualizations and Connect Power BI to Different Sources

Building a Web Analytics System Using Kafka and Spark Streaming in Google Cloud

The aim of this project is to create a data pipeline that receives hits from a website through a Flask REST API. The REST API publishes the data to Kafka topics, which we subscribe to from a Google Dataproc cluster. We then use Spark Streaming to read the data from the Kafka topics and push it into Google BigQuery. STEP 1 – Pushing data into Kafka topics from the REST API endpoints. Here is the code for the JavaScript snippet that I put on the website, followed by the Flask API code; for the Kafka producer, look into resources/webevents.py. The bootstrap servers in the case of Dataproc are the worker nodes; Kafka by default listens on port 9092, and you can connect to the Dataproc cluster using the internal IPs of the worker nodes.[…]
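To make the flow concrete, here is a minimal sketch of a Flask endpoint that publishes hits to Kafka using kafka-python. The route, topic name, and broker addresses are assumptions for illustration, not the exact code from the post.

```python
# Minimal sketch: a Flask endpoint that forwards web hits to a Kafka topic.
# Broker addresses, topic, and route are placeholders, not the post's values.
import json

from flask import Flask, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)

# On Dataproc, the worker nodes' internal IPs act as bootstrap servers (port 9092).
producer = KafkaProducer(
    bootstrap_servers=["<WORKER_1_INTERNAL_IP>:9092", "<WORKER_2_INTERNAL_IP>:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.route("/webevents", methods=["POST"])
def web_event():
    hit = request.get_json()          # JSON payload sent by the JS snippet
    producer.send("web-events", hit)  # topic name is an assumption
    return jsonify({"status": "queued"}), 202

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```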

Read More

Creating Your First Data Pipeline in Google Cloud with Apache Sqoop and Apache Airflow

Welcome to taking the first steps to create your first data pipeline. By the end of it, you will have a data pipeline that takes data from MySQL, does some preprocessing on it, and then stores it in Google BigQuery. In this exercise, we will be creating an imaginary pipeline for sales data that lives in a transactional database, in this case Cloud SQL. We are basing this exercise on the following Kaggle challenge: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data . Here are the steps we will take to set up this data pipeline: a) Create a Cloud SQL (MySQL) instance in Google Cloud and upload the database to it. b) Create a pipeline to get the data from Cloud SQL into Google BigQuery. Creating a SQL Database and Uploading Data For our first step, we will create a Cloud SQL database on Google Cloud and upload all the files we got from Kaggle into[…]
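As a rough picture of what the finished pipeline looks like, here is a minimal Airflow DAG sketch: Sqoop exports the MySQL table to GCS, then the bq CLI loads it into BigQuery. All connection details, bucket, dataset, and table names are placeholders.

```python
# A sketch of the Sqoop-to-BigQuery flow as an Airflow DAG (Airflow 2 imports).
# Every connection string, bucket, and table name below is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="mysql_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull the sales table out of Cloud SQL (MySQL) into a GCS staging dir.
    sqoop_import = BashOperator(
        task_id="sqoop_import",
        bash_command=(
            "sqoop import "
            "--connect jdbc:mysql://<CLOUD_SQL_IP>/sales "
            "--username <USER> --password '<PASSWORD>' "
            "--table sales_train --as-avrodatafile "
            "--target-dir gs://<BUCKET>/staging/sales_train"
        ),
    )

    # Load the staged Avro files into a BigQuery table.
    load_to_bq = BashOperator(
        task_id="load_to_bigquery",
        bash_command=(
            "bq load --source_format=AVRO --replace "
            "sales_dataset.sales_train "
            "'gs://<BUCKET>/staging/sales_train/*.avro'"
        ),
    )

    sqoop_import >> load_to_bq
```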

Read More

Complete Guide To Mastering and Optimizing Google BigQuery

If you are looking to get started with BigQuery, here are the concepts you need to be familiar with to get the best results from Google BigQuery. Table Partitioning It is useful to partition BigQuery tables on a date-time column, which improves query performance. If a date-time column is not available in the dataset, you can partition on ingestion time instead. Clustered Tables You can further optimize your queries in BigQuery by clustering tables on selected columns. You should cluster on the columns most frequently used in your queries; the query below is optimized by clustering on the wiki and title columns. Nested Data BigQuery works best with denormalized data, so the use of nested and repeated fields is recommended over a star schema or snowflake schema. A good example of this is a library: usually in a[…]
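For reference, here is a hedged sketch of creating a table that combines both ideas (daily partitioning on a timestamp column plus clustering) with the google-cloud-bigquery client. The project, dataset, and column names are made up for illustration.

```python
# Sketch: create a day-partitioned, clustered BigQuery table.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("wiki", "STRING"),
    bigquery.SchemaField("title", "STRING"),
    bigquery.SchemaField("views", "INTEGER"),
]

table = bigquery.Table("<PROJECT>.<DATASET>.pageviews", schema=schema)

# Partition by day on the timestamp column; omitting `field` would fall back
# to ingestion-time partitioning, as described above.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)

# Cluster on the columns most frequently used in query filters.
table.clustering_fields = ["wiki", "title"]

client.create_table(table)
```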

Read More

Scheduling a Singer Pipeline on Google Cloud – Part 3: AdWords to BigQuery

If you have followed the previous tutorial, you will have a Docker image with your Singer pipeline in Google Container Registry. To run this, we will implement the following setup in Google Cloud. Here are the tasks needed to complete the setup: create a VM instance to get the historical data; create a Pub/Sub topic to trigger Cloud Functions; create Google Cloud Functions to start and stop the instance; set up Cloud Scheduler jobs. Create a VM instance to get the historical data The strategy I am following in this case is to download all the historical data up to the current date, store the current date in the state.json file, and then run a cron job every day to load the previous day's data into BigQuery. We will set up a containerized Compute Engine instance using the Docker image we created in the previous article; the reason[…]
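The start/stop Cloud Functions are small; below is a hedged sketch of what the pair of Pub/Sub-triggered functions might look like with the google-cloud-compute client. The project, zone, and instance names are placeholders, and this is not the exact code from the post.

```python
# Sketch: Pub/Sub-triggered Cloud Functions (1st gen signature) that start and
# stop the pipeline VM. Project, zone, and instance names are placeholders.
from google.cloud import compute_v1

PROJECT, ZONE, INSTANCE = "<PROJECT_ID>", "us-central1-a", "singer-pipeline-vm"

def start_instance(event, context):
    """Triggered by the Cloud Scheduler 'start' Pub/Sub topic."""
    compute_v1.InstancesClient().start(
        project=PROJECT, zone=ZONE, instance=INSTANCE
    )

def stop_instance(event, context):
    """Triggered by the Cloud Scheduler 'stop' Pub/Sub topic."""
    compute_v1.InstancesClient().stop(
        project=PROJECT, zone=ZONE, instance=INSTANCE
    )
```

Cloud Scheduler then publishes to the two topics on a cron schedule, so the VM only runs for the duration of the daily sync.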

Read More

Getting Data From Google AdWords to Google BigQuery Using Singer – Part 2

Now that we have the pipeline running, we want to make sure the state.json file is stored somewhere safe: since we might opt to destroy the VM instance at some point, it is best to store the file in Google Cloud Storage. Here are the steps we will take in this part: a) Create a Dockerfile. b) Create an ENTRYPOINT script that runs every time the Docker image is run; it will run the pipeline and store/load the state.json file from Google Cloud Storage. Creating the entry.sh file Here is the Bash file that will run our pipeline, with an explanation of the code: a) We authenticate the gcloud service account so that we can use gsutil to copy the state.json file from Google Cloud Storage. b) If the file exists, we set the state environment variable to point to the file; otherwise it is kept empty. c) Run the[…]
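The post uses a Bash ENTRYPOINT; as a sketch of the same logic in Python, the flow looks like this. Bucket names are placeholders, and the tap's flag spelling (--catalog vs --properties) varies between tap versions.

```python
# Sketch of the entrypoint logic described above: load state from GCS if it
# exists, run the Singer pipeline, and push the new state back to GCS.
import subprocess

from google.cloud import storage

BUCKET, BLOB, LOCAL = "<STATE_BUCKET>", "state.json", "/tmp/state.json"

blob = storage.Client().bucket(BUCKET).blob(BLOB)

# Download the previous state if one exists (mirrors the gsutil cp step).
state_arg = ""
if blob.exists():
    blob.download_to_filename(LOCAL)
    state_arg = f"--state {LOCAL}"

# Run the Singer pipeline, piping the tap into the target; the target emits
# the updated state on stdout, which we capture back into state.json.
subprocess.run(
    f"tap-adwords --config config.json --catalog catalog.json {state_arg} "
    f"| target-bigquery --config target_config.json > {LOCAL}",
    shell=True,
    check=True,
)

# Persist the new state for the next scheduled run.
blob.upload_from_filename(LOCAL)
```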

Read More

Getting Data From Google AdWords into BigQuery Using Singer – Part 1

When I was working on this problem, I found the documentation to be lacking in a lot of aspects, so I decided to cover it in detail. Here are the steps we will follow: a) Getting the catalog.json file ready for tap-adwords. b) Getting data into target-bigquery. c) Working with the state.json file for tap-adwords. Creating a config.json file for tap-adwords Singer recommends separate virtual environments for the tap and the target, so the first step is to create a new virtual environment where we will install tap-adwords. virtualenv -p python3 env-tap source env-tap/bin/activate pip3 install tap-adwords The first thing you need to run tap-adwords is a config file. Here is where you get all these values from (start date and end date are pretty self-explanatory): a) Developer token – You can get a developer token by logging in to your Google AdWords manager account and navigating to the AdWords API Center. b) OAuth Client[…]
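As a hedged illustration of what that config file contains, here is a small script that writes a config.json with the commonly required tap-adwords fields. Every value is a placeholder obtained from the AdWords API Center and your OAuth client setup; check the tap's README for the exact keys your version expects.

```python
# Sketch: build the config.json that tap-adwords reads. All values below are
# placeholders; exact required keys may vary with the tap version.
import json

config = {
    "developer_token": "<DEVELOPER_TOKEN>",        # from the AdWords API Center
    "oauth_client_id": "<OAUTH_CLIENT_ID>",        # from your OAuth credentials
    "oauth_client_secret": "<OAUTH_CLIENT_SECRET>",
    "refresh_token": "<REFRESH_TOKEN>",
    "customer_ids": "<ADWORDS_CUSTOMER_ID>",
    "start_date": "2023-01-01T00:00:00Z",
    "end_date": "2023-01-31T00:00:00Z",
    "user_agent": "tap-adwords <your-email@example.com>",
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```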

Read More

Simple and Complete Tutorial on Simple Linear Regression

Linear regression is the first stepping stone into machine learning, so rather than just being content with a simple explanation, use it to refresh concepts related to linear algebra, statistics, and calculus. Simple linear regression and multiple linear regression are important, widely used algorithms. Here are the steps needed to master simple linear regression. Different Topics Covered Simple explanation of linear regression How to prepare your data for linear regression Why and when you should use linear regression The least squares method for solving linear regression Definition in less than 100 words There is a linear relationship between a dependent variable Y and an independent variable X when the relationship between the two variables can be represented by a straight line (called the regression line): Y = mX + c, where m is the slope and c is the intercept. Preprocessing Of Data Before Getting Started With Linear Regression Remove noise. Linear regression assumes that your input[…]
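Since the post covers the least squares method, here is a minimal sketch of fitting Y = mX + c with the closed-form least squares formulas; the data points are made up.

```python
# Sketch: simple linear regression by ordinary least squares (made-up data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Slope m = cov(x, y) / var(x); intercept c = mean(y) - m * mean(x).
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

print(f"regression line: y = {m:.3f} * x + {c:.3f}")
```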

Read More

Best Practices For Data Modeling in Power BI

To ensure top performance of your Power BI dashboard, you need to pay particular attention to data modeling. Here are the most recommended practices to take into account: Do not use wide tables – A lot of issues can crop up because of wide tables. The most common is simply performance: in a wide table you store redundant values again and again, so you are not making the most efficient use of your memory, and if your dataset keeps growing you can simply run out of it. Other issues arise when building relationships between two wide tables; you can end up with bi-directional and many-to-many relationships that are difficult to work with. To overcome these issues, it is highly recommended to use a star schema or a constellation schema for the[…]
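To make the wide-table problem concrete, here is a toy sketch (in pandas, with made-up columns) of splitting a wide table into a star schema before it reaches Power BI: repeated product attributes move into a dimension table, and the fact table keeps only measures plus a key.

```python
# Toy sketch: normalize a wide sales table into a star schema
# (fact table + product dimension). Columns and data are made up.
import pandas as pd

wide = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product_name": ["pen", "pen", "book"],       # repeated on every row
    "product_category": ["office", "office", "media"],
    "amount": [2.5, 2.5, 12.0],
})

# Dimension table: one row per product, with a surrogate key.
dim_product = (
    wide[["product_name", "product_category"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
dim_product["product_id"] = dim_product.index

# Fact table: keep the measure plus the foreign key only.
fact_sales = wide.merge(dim_product, on=["product_name", "product_category"])
fact_sales = fact_sales[["order_id", "product_id", "amount"]]
```

The redundant product attributes are now stored once in the dimension, which is exactly the memory saving the star schema buys you.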

Read More

How to Connect Power BI to an External API Using OAuth 2

Power BI is an excellent tool for data visualization, and it enables the user to connect to external APIs. This capability is especially useful because these days a lot of apps and websites expose APIs that can be used for BI purposes. For the sake of this tutorial, let's work with the Twitter API. The same can be done for other APIs; you just have to change the first step, which is creating the tokens necessary to access the API. Important note: the method I am going to discuss in this article only works with APIs that do not have a redirect URL, or where it is optional; the Twitter API fits this description. If an API requires a redirect URL, as Spotify's does, then a custom connector needs to be built. That is covered in a separate article on[…]
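For context on the token step, here is a hedged sketch of Twitter's application-only OAuth 2 flow, which needs no redirect URL and yields the bearer token Power BI can then send in an Authorization header. The consumer key and secret are placeholders, and the endpoint behavior may differ on newer API tiers.

```python
# Sketch: Twitter application-only OAuth 2 (client credentials, no redirect).
# Key and secret are placeholders.
import requests

resp = requests.post(
    "https://api.twitter.com/oauth2/token",
    auth=("<CONSUMER_KEY>", "<CONSUMER_SECRET>"),
    data={"grant_type": "client_credentials"},
)
resp.raise_for_status()
token = resp.json()["access_token"]

# Power BI's Web connector can call the API with this header:
headers = {"Authorization": f"Bearer {token}"}
print(headers)
```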

Read More

How to Create Visualizations and Connect Power BI to Different Sources

Power BI is an important and popular tool for business intelligence. In this article, we will go over some basics of how to use Power BI, including: connecting Power BI to Google Analytics, connecting Power BI to an external API, and connecting Power BI to a SQL database. Connecting Power BI to Google Analytics Step 1 – Connect Power BI to Google Analytics by clicking the Get Data icon in the top bar and selecting Google Analytics from the list of available sources. Integrate GA, then select the metrics, desired chart type, and filters for the report. Step 2 – Change dates Going forward, you can change the dates to get the report for the desired time frame. Step 3 – Add the required reports to a dashboard. Connect Power BI to an API Step 1 – Get Data Click the Get Data icon and select[…]

Read More
