MapReduce Example in Python: Getting Started with Big Data
If you are new to MapReduce, it is good to start with a simple example and see how the map and reduce functions work in action.
Here are the three basic steps that a MapReduce job goes through. Unsurprisingly, MapReduce has a map function and a reduce function, but there is also an intermediate shuffle and sort phase between the two.
Here is a diagram for an analogy of how MapReduce works:

[Diagram] Example of a MapReduce function (Source: Simplilearn)
Map Function
The main goal of any MapReduce function is to allow distributed processing by using multiple cores.
In our example, we have taken polling booths instead of computing cores; we can imagine the key being a certain political party and the value denoting a vote cast for that party. Different polling stations will have votes for different parties; in other words, the same key can be present on different cores.
In the mapping function, our basic goal is to split the data into key-value pairs. Later on, using these keys, we can sort our data and run our mathematical functions on it in the reduce stage. The mapper should also keep the amount of data sent over the network small, so we only emit the key-value pairs needed for further processing.
So, by the end of the map phase, we will have data in this format:
Polling station 1
Party 1 – 1 vote
Party 2 – 1 vote
Party 1 – 1 vote
Party 1 – 1 vote
Party 2 – 1 vote
Polling station 2
Party 2 – 1 vote
Party 2 – 1 vote
Party 2 – 1 vote
Party 1 – 1 vote
Party 1 – 1 vote
We could send only the per-party sum of votes to the next stage, but let us keep it as it is for simplicity.
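To make the map step concrete, here is a minimal, purely illustrative Python sketch of the same idea; the station lists and party names are made up, and a real MapReduce framework would run this step in parallel across nodes.
[Python]
# Minimal sketch of the map step for the vote-counting analogy.
# The ballots below are invented for illustration only.
def map_votes(ballots):
    """Emit a (party, 1) key-value pair for every ballot."""
    for party in ballots:
        yield (party, 1)

station_1 = ["Party 1", "Party 2", "Party 1", "Party 1", "Party 2"]
station_2 = ["Party 2", "Party 2", "Party 2", "Party 1", "Party 1"]

mapped = list(map_votes(station_1)) + list(map_votes(station_2))
print(mapped)  # [('Party 1', 1), ('Party 2', 1), ('Party 1', 1), ...]
[/Python]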
Shuffle and Sort
The MapReduce framework performs the shuffle and sort stage on its own; we do not have to write any function for it.
In this stage, the framework collects all values belonging to the same key and groups them together.
Here is what you get after this stage.
Party 1 – 1 vote, 1 vote, 1 vote, 1 vote, 1 vote
Party 2 – 1 vote, 1 vote, 1 vote, 1 vote, 1 vote
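Although the framework does this for us, a rough single-machine sketch of the grouping behaviour might look like this (assuming the mapped pairs from the previous step):
[Python]
# Rough illustration of what shuffle and sort does behind the scenes:
# collect every value emitted by the mappers under its key, sorted by key.
from collections import defaultdict

def shuffle_and_sort(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

pairs = [("Party 1", 1), ("Party 2", 1), ("Party 1", 1),
         ("Party 1", 1), ("Party 2", 1)]
print(shuffle_and_sort(pairs))
# [('Party 1', [1, 1, 1]), ('Party 2', [1, 1])]
[/Python]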
Reduce Function
Now, in our example, we can count votes for Party 1 on one processing core and votes for Party 2 on a different core; this way, we increase the processing speed of our analysis.
In the reduce phase, we can perform any mathematical calculation on the values for a particular key.
We finally get a draw in our election results.
Party 1 – 5 votes
Party 2 – 5 votes
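Conceptually, the reduce step is just a function applied to the grouped values of each key. Here is a tiny sketch using the vote counts above; the grouped input list is written out by hand for illustration.
[Python]
# Minimal sketch of the reduce step: sum the values collected for each key.
def reduce_votes(grouped_pairs):
    for party, votes in grouped_pairs:
        yield (party, sum(votes))

grouped = [("Party 1", [1, 1, 1, 1, 1]), ("Party 2", [1, 1, 1, 1, 1])]
print(list(reduce_votes(grouped)))
# [('Party 1', 5), ('Party 2', 5)]
[/Python]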
Now let us code our first MapReduce job together. We will work with the MovieLens 100k (ml-100k) dataset and run our code on a Hadoop cluster on Google Cloud.
The tool we are going to use for writing MapReduce jobs is mrjob.
To configure mrjob on the master node of your Hadoop cluster, use the following commands:
sudo easy_install pip
sudo pip install mrjob
We are using ‘sudo’ because we are not logged in as the root user; alternatively, you can use the ‘su’ command to log in as root or as an admin user.
For my first mapper code, I will follow along with the Udemy course I am taking.
Here is the code:
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Each line of u.data is tab-delimited: userID, movieID, rating, timestamp
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        # Sum up the 1s emitted for each rating value
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()
Here is a breakdown of the code:
- First, we import MRJob and MRStep and define our class, RatingsBreakdown, which inherits all its methods from the MRJob class.
- Next, we define the steps; each MRStep consists of a mapper, a combiner, and a reducer. All of those are optional, though each step must have at least one.
- Next, we define the mapper function, which extracts the rating from each line by splitting the tab-delimited data.
- We pass this information to the reducer function, which sums all the ratings for a particular key, as illustrated in the sketch below.
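To see what the mapper actually emits, here is a quick check on a single line in u.data's tab-delimited format; the sample values are only illustrative.
[Python]
# One line of u.data has the form: userID \t movieID \t rating \t timestamp.
# The values below are illustrative only.
sample_line = "196\t242\t3\t881250949"
userID, movieID, rating, timestamp = sample_line.split('\t')
print((rating, 1))  # ('3', 1) -- the rating is the key, 1 is the value
[/Python]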
To run it on a Hadoop cluster on Google Cloud, you can simply use the following command; on Amazon or Google Cloud, we do not need to specify the location of the Hadoop streaming jar.
[Python]
sudo python firstmapper1.py -r hadoop hdfs:///user/test/u.data
[/Python]

This is the output from the first code.
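If you want to test the job without a cluster first, mrjob can also run it locally; assuming you have a local copy of u.data next to the script, a command like this should work:
[Python]
python firstmapper1.py u.data
[/Python]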
You can even chain multiple mappers and reducers in mrjob and come up with more in-depth insights, as the hypothetical sketch below shows.
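For example, here is a hedged sketch (not taken from the course; the class and method names are my own) of a two-step job that first counts ratings per movie and then finds the most-rated movie:
[Python]
from mrjob.job import MRJob
from mrjob.step import MRStep

# Illustrative two-step job: the class and method names are hypothetical.
class MostPopularMovie(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_movies,
                   reducer=self.reducer_count_ratings),
            MRStep(reducer=self.reducer_find_max),
        ]

    def mapper_get_movies(self, _, line):
        # u.data is tab-delimited: userID, movieID, rating, timestamp
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield movieID, 1

    def reducer_count_ratings(self, movieID, counts):
        # Funnel every (count, movieID) pair to a single key so the
        # second step can compare them all in one reducer call.
        yield None, (sum(counts), movieID)

    def reducer_find_max(self, _, count_movie_pairs):
        # max() over (count, movieID) pairs picks the most-rated movie.
        yield max(count_movie_pairs)

if __name__ == '__main__':
    MostPopularMovie.run()
[/Python]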