Map Reduce Example In Python, Getting Started With Big Data

If you are new to map-reduce, it is good to start off with a simple example and understand how the different map and reduce functions work in action.

Here are the three basic steps that a MapReduce function goes through. There is no surprise in the fact that MapReduce has a map and reduce function, but there is also an intermediate shuffle sort phase between the map and reduce function

Here is a diagram for an analogy of how MapReduce works,

Example of a Map reduce function- Source Simplilearn

Map Function

The main goal of any MapReduce function is to allow distributed processing by using multiple cores. 

In our example, we have taken the example of pulling boots instead of computing cores, and we can imagine the key being a certain political party and value donates a vote cast for that party. In the different polling stations, we will also have votes for the different parties. Basically, the same key can be present in different cores.

In the mapping function, our goal is also to decrease the amount of data being sent on the network, so we will only send the required key-value pairs for further processing.

In the mapping function, our basic goal is to distribute the data into key-value pairs. Later on, using these keys we can sort our data and do out mathematical functions on it in the reduce stage.

So by the End of the Map Phase, we will have data in this format.

Polling station 1

Party1 – 1 vote

Party 2 – 1 vote

Party 1 – 1 vote

Party 1 – 1 vote

Party 2 – 1 Vote

Polling station 2

Party2 – 1 vote

Party 2 – 1 vote

Party 2 – 1 vote

Party 1 – 1 vote

Party 1 – 1 Vote

We can only send the sum of votes by the party to the next stage, but let us keep it as it for simplicity.

Shuffle and sort

The shuffle and sort stage the MapReduce function does on its own. We do not have to write any function for it.

In this stage, the function collects all values according to keys and combines all the values together.

Here is what you get after this stage.

Party 1- 1 vote, 1 vote, 1 vote, 1 vote, 1 vote

Party 2- 1 vote, 1 vote, 1 vote, 1 vote, 1 vote

Reduce Function

Now in our example, we can count votes for party 1 in one processing core and count votes for party 2 on a different processing core, this way we can increase the processing speed of our analysis.

In the reduce face we can do any mathematical calculation on the value for a particular key.

We finally get a draw in our election results.

Party 1 – 5 votes

Party 2- 5 votes

Now let us code our first MapReduce function together, we will work with the Ml-100k database and run our code on a Google cloud database.

The Tool we are going to use for setting up Map-reduce is MRJOB.

For Configuring MRJOB on the master node of your Hadoop cluster using the following commands.

 

sudo easy_install pip

sudo pip install mrjob

We are using ‘sudo’ as we are not logged in as the root user otherwise, you can use the ‘su’ command to log in as the root or as an admin user.

For my first mapper code I will follow along with the udemy course I am taking, that I am taking.

Here is the code

 
 
from mrjob.job import

MRJobfrom mrjob.step import MRStep

class RatingsBreakdown(MRJob):

  def steps(self):
    return [ MRStep(mapper=self.mapper_get_ratings, reducer=self.reducer_count_ratings)]

  def mapper_get_ratings(self, _, line):
    (userID, movieID, rating, timestamp) = line.split('\t')
    return rating, 1

  def reducer_count_ratings(self, key, values):
    return key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()

Here is a breakdown of the code:-

  1. First, we include MRJob in our code, and we define our class, RatingsBreakdown, this class is inheriting all the methods from the class MRJob
  2. Next, we define the “step” which consists of a mapper, a combiner, and a reducer. All of those are optional, though you must have at least one.
  3. Next, we define the mapper function, which gets all the ratings from the file, by splitting the data line which is tab-delimited.
  4. We pass this information to the reduce function, which sum all the rating for a particular key

For running it on a Hadoop cluster if you are google cloud you can simply use the following command, on Amazon or Google we do not need to specify the location of the Hadoop streaming file.

[Python]
sudo python firstmapper1.py -r hadoop hdfs:///user/test/u.data
[/Python]

This is the output from the first code.

You can even combine multiple mappers and reducers in MRJob and come up with more in-depth insights.

About the author

admin

Mastering Data Engineering/ Data science one project at a time. I have worked and developed multiple startups before, and this blog is for my journey as a startup in itself where I iterate and learn.

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Copyright © 2020. Created by Meks. Powered by WordPress.