Getting started with Pig scripting in Hadoop

In this blog post, we will cover the basics of Pig scripting and get you started with it.

Pig is a high-level scripting language that lets programmers write SQL-like queries against data in the HDFS file system. Pig converts these queries into MapReduce jobs, which reduces the time and effort that used to be required to hand-write MapReduce functions.

Setting Up the Pig scripting environment

I am writing this tutorial using the Google Dataproc service. Google Dataproc comes with a built-in configuration for the following services, so we do not need to do anything special to run a Pig script.

  • Spark 2.3.1
  • Apache Hadoop 2.9.0
  • Apache Pig 0.17.0
  • Apache Hive 2.3.2
  • Apache Tez 0.9.0
  • Cloud Storage connector 1.9.9-hadoop2
  • Scala 2.11.8
  • Python 2.7

Main Components of a Pig Script

Load Data

First of all, let’s get some data to start processing. Let us use the ml-100k dataset from MovieLens; we will work with its ratings file.

movie = LOAD '/user/test/';
DESCRIBE movie;
Output for the above code

Our data file is tab-delimited, which is the default for Pig, so we do not have to specify a separator. Otherwise, the load query should include ‘USING PigStorage(‘delimiter used in the data’)’.
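For example, if the data were comma-separated instead, the load would look like this (same placeholder path as above):

movie = LOAD '/user/test/' USING PigStorage(',');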

Defining data columns and types

Now let us take it one step further and define the data types when we load the data. We will do this using ‘AS’ in our query:

movie= LOAD '/user/test/' AS (userId:int,movieId:int,rating:int,ratingTime:int);

Here is the output; this time we can see the column names:

View the data in Pig

In all the above examples we have used ‘DESCRIBE’ to view the schema of the relation. Now let us use DUMP to output its contents.

But rather than dumping the whole relation, we are going to output only its first 10 rows by using the ‘LIMIT’ operator.

movie= LOAD '/user/test/' AS (userId:int,movieId:int,rating:int,ratingTime:int);
subsetMovies= LIMIT movie 10;
DUMP subsetMovies;


Select specific columns from a relation

Now let us try to build a new relation from our existing dataset; for this we will use the ‘FOREACH … GENERATE’ statement.

movie= LOAD '/user/test/' AS (userId:int,movieId:int,rating:int,ratingTime:int);
movieRating = FOREACH movie GENERATE userId, movieId;
movieSubset = LIMIT movieRating 10;
DUMP movieSubset;

Now you can see that we only have two columns of data.

Store a processed relation into the HDFS file system

What if you wanted to store your processed relation in a new file in HDFS? For this you will want to use the ‘STORE’ query.

STORE movieSubset INTO 'user/test/';

We are using the default delimiter; otherwise, you can use ‘USING PigStorage(‘Delimiter’)’ to generate files with the delimiter you want.
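For instance, to write the result pipe-delimited into a hypothetical output directory (the path below is a placeholder, not from the dataset):

STORE movieSubset INTO '/user/test/output' USING PigStorage('|');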

Join relations with Pig

The JOIN query is pretty straightforward: we specify the two relations we want to join and the column we are going to use to join them.

final_dataset = JOIN movie BY movieId, movieNames BY movieId;

This query will join the two relations together using the column movieId.
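Note that both relations must be loaded before the join. A minimal sketch, assuming a hypothetical second file mapping movieId to a title (the path, delimiter, and schema below are assumptions; adjust them to your data):

-- hypothetical file of movie titles; path and delimiter are placeholders
movieNames = LOAD '/user/test/names' USING PigStorage('|') AS (movieId:int, title:chararray);
final_dataset = JOIN movie BY movieId, movieNames BY movieId;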

Group data

One important operator in Pig is ‘GROUP … BY’. It groups all tuples of data by the value of a specific column.

movie= LOAD '/user/test/' AS (userId:int,movieId:int,rating:int,ratingTime:int);
movieRating = GROUP movie BY movieId;
DUMP movieRating;

If you run DESCRIBE on the grouped relation, you will see that the first field holds the group value; that field is named group, and its type depends on the grouping column.
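For the ratings schema used above, DESCRIBE should print something like the following: a group field typed like the grouping column, plus a bag holding the original movie tuples:

DESCRIBE movieRating;
-- movieRating: {group: int, movie: {(userId: int, movieId: int, rating: int, ratingTime: int)}}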

A GROUP query is essentially the mapper side of a MapReduce job; let’s try to write a corresponding reducer-style query for it.

movie= LOAD '/user/test/' AS (userId:int,movieId:int,rating:int,ratingTime:int);
movieRating = GROUP movie BY movieId;
dataset = FOREACH movieRating GENERATE group AS movieId, AVG(movie.rating) AS movieRating;
DUMP dataset;

Pay attention to the names of the columns in the grouped relation: movie.rating pulls the rating field from the tuples of the movie relation inside each group.

Order Data

Now let us order this data so we see the movies with the best average rating first.

movie = LOAD '/user/test/' AS (userId:int,movieId:int,rating:int,ratingTime:int);
movieRating = GROUP movie BY movieId;
dataset = FOREACH movieRating GENERATE group AS movieId, AVG(movie.rating) AS movieRating;
ordered_dataset = ORDER dataset BY movieRating DESC;
DUMP ordered_dataset;

Filter Data

Now let us work only with movie ratings greater than 3; for this, we will filter our data.

movie= LOAD '/user/test/' AS (userId:int,movieId:int,rating:int,ratingTime:int);
movieFiltered= FILTER movie BY rating>3;
DUMP movieFiltered;

Filtered Data

About the author


Mastering data engineering and data science one project at a time. I have worked at and helped build multiple startups, and this blog is itself a journey of a startup, where I iterate and learn.

