- Simple and complete tutorial of Support Vector Machines
- Learning HBASE data model and important shell commands
- Understanding Cassandra Data Modelling, Partition and clustering keys
- Getting started with Pig scripting in Hadoop
- Map Reduce Example In Python, Getting Started With Big Data
- 13 Must-Know Hadoop HDFS Commands for Every Data Engineer
- Simple and Complete Tutorial on Logistic Regression
- Simple And Complete Tutorial For Understanding Decision Trees

## Simple Explanation of AdaBoost

Table of contents:

- Simple explanation of AdaBoost
- Step-by-step understanding of how AdaBoost works
- Number of weak learners required
- Bias and variance tradeoff in AdaBoost
- Parameter optimization in AdaBoost
- Feature selection in AdaBoost

SIMPLE EXPLANATION OF ADABOOST

AdaBoost builds an ensemble of weak learners to create a strong learner. Weak learners are models that achieve accuracy just above random chance on a classification problem. The algorithm most commonly used with AdaBoost is the decision tree with a single level; because these trees are so short and contain only one decision for classification, they are often called decision stumps. AdaBoost is a sequential learner. You run all of your data through a weak learner and try to classify it; then, in the next iteration, you give more weight to the training examples that were classified incorrectly, so your next weak learner does a[…]
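The sequential reweighting loop described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, assuming scikit-learn is available for the decision stumps; it is not the article's own implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; AdaBoost conventionally works with labels in {-1, +1}.
X, y01 = make_classification(n_samples=400, random_state=0)
y = np.where(y01 == 1, 1, -1)

n = len(y)
w = np.full(n, 1.0 / n)                 # start with uniform sample weights
stumps, alphas = [], []

for _ in range(20):                     # 20 sequential weak learners
    stump = DecisionTreeClassifier(max_depth=1)  # one-level tree = decision stump
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()            # weighted error of this learner
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # this learner's vote strength
    w *= np.exp(-alpha * y * pred)      # up-weight the misclassified examples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# The strong learner is the weighted vote of all the stumps.
ensemble = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print((ensemble == y).mean())
```

The key line is the weight update: examples a stump gets wrong have their weights multiplied by a factor greater than one, so the next stump is forced to pay attention to them.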

## Simple and complete tutorial of Support Vector Machines

Table of contents:

- Understanding how you calculate the distance from a plane to a point
- The maths behind how you find support vectors
- How and why a kernel works
- How to figure out if your support vector machine is good at generalizing
- Bias and variance tradeoff in support vector machines
- Imbalanced classes in SVM
- Loss function in SVM
- Assumptions under SVM and things to be careful about

SIMPLEST WAY TO IMAGINE SUPPORT VECTOR MACHINES

Take this plot of two-dimensional data with two different labels. Now try to imagine a line somewhere in the middle that separates the two groups. I am pretty sure most of you can picture this line. Our aim with SVM is to find this exact line. In more technical terms, we want a line or hyperplane such that the distance of the closest points in each group from the line/hyperplane is the[…]
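The first item in the table of contents, the distance from a point to the separating plane, has a closed form: for a hyperplane w·x + b = 0, the distance from a point x is |w·x + b| / ||w||. A minimal sketch with a hypothetical 2-D line:

```python
import numpy as np

# Hypothetical hyperplane x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1.
w = np.array([1.0, 1.0])
b = -1.0

def distance_to_hyperplane(x, w, b):
    # |w·x + b| / ||w|| : the perpendicular distance from x to the plane.
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

print(distance_to_hyperplane(np.array([1.0, 1.0]), w, b))  # 1/sqrt(2) ≈ 0.7071
```

SVM maximizes exactly this quantity for the closest points (the support vectors) on each side of the plane.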

## Learning HBASE data model and important shell commands

Why we need HBase

HDFS is good for sequential data access, but it lacks random read/write capability. HBase runs on top of the Hadoop file system and provides random read and write access. It is extremely fault-tolerant and well suited to storing sparse data.

Data Model in HBase

The components of the Apache HBase data model are tables, rows, column families, columns, cells and versions.

Table: HBase tables are made up of multiple rows, stored in the table sorted by their row keys.

Row: Each row has a row key, and corresponding to it you can have one or more column families/columns. Design the row key so that related entities are stored in adjacent rows, which increases read efficiency, while taking care to avoid hotspotting, which basically means concentrating most of the read and write operations on a single region server.[…]
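Because HBase stores rows sorted lexicographically by row key, a composite key keeps related entities physically adjacent. A plain-Python sketch of the idea, with hypothetical `user_id#timestamp` keys:

```python
# HBase sorts rows lexicographically by row key, so a composite key such as
# "<user_id>#<timestamp>" keeps all events for one user in adjacent rows.
# These keys are hypothetical, for illustration only.
keys = [
    "user42#20240103",
    "user07#20240101",
    "user42#20240101",
    "user07#20240102",
]

# HBase would physically store these rows in this order:
for k in sorted(keys):
    print(k)
```

Note the flip side: a purely monotonic key (e.g. a bare timestamp) makes all new writes land in the same region, which is exactly the hotspotting the excerpt warns about.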

## Understanding Cassandra Data Modelling, Partition and clustering keys

Let’s get started.

Overview of Cassandra

Cassandra is a NoSQL database originally developed at Facebook. It is a great database for running queries effectively over large amounts of structured and semi-structured data. To know when to choose Cassandra as your database, you need an understanding of the CAP theorem. The CAP theorem states that it is impossible for a distributed system to satisfy all three of the following guarantees at once; you have to give one up:

- C is Consistency: all nodes see the same data at the same time
- A is Availability: every request receives a response
- P is Partition tolerance: the system continues to operate in case of a network failure

In Cassandra, availability and partition tolerance are considered more important than consistency. However, you can also tune consistency with the replication factor and the consistency level to[…]
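The reason the partition key matters so much is that it alone decides which node owns a row. Cassandra's default partitioner hashes the partition key with Murmur3; the sketch below uses `md5` purely as a stand-in hash, and the three node names are hypothetical:

```python
import hashlib

nodes = ["node-a", "node-b", "node-c"]  # hypothetical 3-node cluster

def node_for(partition_key: str) -> str:
    # Hash the partition key to a token, then map the token to a node.
    # (Cassandra really uses Murmur3 and a token ring; md5 is a stand-in.)
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return nodes[token % len(nodes)]

# Every row with the same partition key lands on the same node, which is
# why efficient Cassandra queries always restrict the partition key.
print(node_for("user42"), node_for("user07"))
```

Clustering keys then only order rows *within* that partition on that node.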

## Getting started with Pig scripting in Hadoop

In this blog, we will cover some basics of Pig scripting and get you started with it. Pig is a high-level scripting language that lets programmers write SQL-like queries against data in the HDFS file system. Pig converts these queries into MapReduce tasks, which cuts down the time and effort that used to be needed to write MapReduce functions by hand.

Setting up the Pig scripting environment

I am writing this tutorial using the Google Dataproc service. Google Dataproc comes with a built-in configuration for the following services, so we do not need to do anything special to run a Pig script:

- Spark 2.3.1
- Apache Hadoop 2.9.0
- Apache Pig 0.17.0
- Apache Hive 2.3.2
- Apache Tez 0.9.0
- Cloud Storage connector 1.9.9-hadoop2
- Scala 2.11.8
- Python 2.7

Main components of a Pig script

Load data

First of all, let’s get some data to start processing; let us use the ml-100k[…]
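To see what Pig saves you from writing, it helps to sketch in plain Python what a typical `GROUP ... BY` followed by a `COUNT` does under the hood. The rows below are hypothetical `(user_id, movie_id, rating)` tuples in the shape of the ml-100k `u.data` file, not the real dataset:

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id, rating) rows, shaped like ml-100k u.data.
rows = [
    (1, 242, 3),
    (2, 242, 5),
    (1, 302, 4),
    (3, 242, 1),
]

# Roughly what  grouped = GROUP rows BY movie_id;
#               counts  = FOREACH grouped GENERATE group, COUNT(rows);
# compiles down to:
groups = defaultdict(list)
for user_id, movie_id, rating in rows:
    groups[movie_id].append(rating)

counts = {movie_id: len(ratings) for movie_id, ratings in groups.items()}
print(counts)  # {242: 3, 302: 1}
```

Pig expresses those two stages in two declarative lines and hands the grouping to MapReduce for you.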

## Map Reduce Example In Python, Getting Started With Big Data

If you are new to MapReduce, it is good to start with a simple example and understand how the map and reduce functions work in action. Here are the basic steps a MapReduce job goes through. There is no surprise in the fact that MapReduce has a map and a reduce function, but there is also an intermediate shuffle-and-sort phase between the map and the reduce. Here is a diagram for an analogy of how MapReduce works.

Map function: the main goal of any MapReduce job is to allow distributed processing using multiple cores. In our analogy, we have taken polling booths instead of computing cores, and we can imagine the key being a certain political party and the value denoting a vote cast for that party. At the different polling stations, we will have votes for the different parties. Basically, the[…]
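The polling-booth analogy above can be run directly in plain Python. The station data is made up for illustration; each station plays the role of one mapper:

```python
from collections import defaultdict

# Two hypothetical polling stations (our "mappers"), each with raw votes.
station_1 = ["red", "blue", "red"]
station_2 = ["blue", "blue", "red"]

def map_votes(votes):
    # Map phase: emit a (party, 1) pair for every vote cast.
    return [(party, 1) for party in votes]

mapped = map_votes(station_1) + map_votes(station_2)

# Shuffle/sort phase: group all values by key, so each reducer
# sees every count for exactly one party.
shuffled = defaultdict(list)
for party, count in mapped:
    shuffled[party].append(count)

# Reduce phase: sum the per-party counts.
totals = {party: sum(counts) for party, counts in shuffled.items()}
print(totals)  # {'red': 3, 'blue': 3}
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network, but the three phases are exactly these.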

## 13 Must-Know Hadoop HDFS Commands for Every Data Engineer

For those of you who are not used to working with servers from the command line, welcome to the command line: here are the Hadoop commands you need to know to explore the HDFS landscape. If you have already worked with servers before, you know the importance of managing files from the command line. Let’s get started.

1) hadoop dfs -ls <path> — lists all the files and folders at a particular path. There is no cd (change directory) command in the HDFS file system; you can only list directories and use their paths to reach the next directory.

2) hadoop dfs -mkdir <path> — creates a new folder in HDFS. Provide the path where you want to create the folder, then list it to check that the folder was created.

3) hadoop dfs -cp <source path> <destination path> — As I[…]

Starting off with big data, basically every small thing is a task, and it takes time and energy to understand it all.

What is Hadoop

Big data is so big that at some point it becomes impossible to keep it on a single disk, so you need a distributed system to manage that amount of data. Hadoop takes a big data file and distributes it across different nodes with replication, because it is inevitable that some nodes will crash or have problems. Hadoop is based on a framework developed by Google, but it was Yahoo that propagated and helped the open-source project. What is Hive Apache Hive is a system that lets data scientists run SQL-like queries on data files stored in Hadoop. It is super useful for this reason, as it essentially turns stored data files into SQL tables. For this exercise, we are going[…]

## Simple and Complete Tutorial on Logistic Regression

Here are the steps we are going to take to understand logistic regression:

- Basic understanding of logistic regression
- Some concepts you need to understand that will aid in understanding logistic regression
- Basic explanation of logistic regression
- Understanding maximum likelihood
- Bias-variance tradeoff in logistic regression
- Regularization in logistic regression
- How to evaluate a logistic regression model

BASIC UNDERSTANDING OF LOGISTIC REGRESSION

The first thing we need to understand is why we need logistic regression, or generalized linear models, at all.

1) The output of a linear regression model can be any real number, ranging from negative to positive infinity, whereas a categorical variable can only take on a limited number of discrete values within a specified range.

2) The error terms are not normally distributed for discrete output variables (i.e. 0 or 1), so the conditions for linear regression are not met.

Generalized Linear Models

In linear regression, we use a[…]
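Point 1 above is exactly what the logistic (sigmoid) link function fixes: it squashes any real-valued linear combination into the interval (0, 1), so the output can be read as a probability. A minimal sketch:

```python
import math

def sigmoid(z: float) -> float:
    # Maps any real z into (0, 1): large positive z -> near 1,
    # large negative z -> near 0, and z = 0 -> exactly 0.5.
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```

Logistic regression applies this function to the usual linear predictor w·x + b, which is what makes it a generalized linear model rather than plain linear regression.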