Monday, November 11, 2013

Understanding MapReduce

In my previous posts, we understood Hadoop as a distributed file system with built-in fault tolerance and high scalability. But the file system is just a basic building block that supports a powerful distributed processing framework called MapReduce. It all started in 2003 and 2004, when Google released two academic papers describing its technology: the Google File System paper and the MapReduce paper.
These two together provided a platform for processing data at very large scale in a highly efficient manner. Doug Cutting was working on a similar problem in an open source project. When these two papers were published, Doug started implementing the published research, and Hadoop was born. Later, Doug joined Yahoo, and Yahoo became one of the most prominent supporters of Hadoop.
So, the point is that HDFS and MapReduce were born together, for each other, to solve the problem of processing data at very large scale: HDFS provides the file system, and MapReduce provides the distributed data processing framework.
MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters. At a very high level, there are two parts: Map and Reduce. Typically, an application provides a Map and a Reduce method in Java by implementing the appropriate interface or extending an abstract class, and also specifies input and output locations along with some configuration. The rest is taken care of by the framework. The framework consists of a single JobTracker and one TaskTracker per DataNode. The JobTracker is the master and all TaskTrackers are slaves. The master is responsible for scheduling tasks on the slaves, monitoring them, and re-executing failed tasks; the slaves execute the tasks. Hence, developers focus on the core logic of implementing the Map and Reduce methods. Once developed, we just submit the job and its configuration to the JobTracker, which then takes care of completing the job and reporting status back to us.
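In practice, submitting a job boils down to a single command like the one below. This is only an illustrative sketch: wordcount.jar, WordCount, /input and /output are placeholder names for a packaged job, its driver class, and the HDFS input and output paths.

# submit the packaged job; the JobTracker schedules its map and reduce tasks on the TaskTrackers
hadoop jar wordcount.jar WordCount /input /output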
However, there is a lot more that happens under the hood, but for now let's just focus on understanding the MapReduce concept.
To explain the MapReduce concept, experts have long used a solution to the word count problem. We will use a simplified version of it for now, so let's define the problem.
We need to find the count of two specific words, hello and world, in a given file. Let's assume I have a file test.txt with the simple text shown in the screen below.
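For illustration, the contents could be something like this (an assumed sample, not the exact text in the screenshot; any text with three occurrences of hello and two of world would behave the same):

hello world this is a test
hello hadoop hello world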
After applying my word count logic, I expect the output below.
hello=3
world=2
Let's understand how the MapReduce paradigm solves this problem. The MapReduce concept is not new for programmers; we have been using it for a long time. Instead of jumping directly into the Hadoop MapReduce framework, let's first see how a Linux developer might solve this problem. There are many ways to do it, but I will purposefully use two bash shell scripts, mapper.sh and reducer.sh, shown below.
I will feed my file test.txt into mapper.sh using a pipe. The script will read it line by line and tokenize each line. If a token is hello, it will print a <key,value> pair as <hello,1>; if a token is world, it will print <world,1>. All other tokens are ignored. Let's see this process in the screenshot below.
The logic is simple: the while loop reads each line, and the for loop tokenizes each word in the line. Each token is then examined, and if it is hello or world, the appropriate key-value pair is printed. If you try this on your own, don't forget to give execute permission to your mapper.sh script using the chmod command, as shown in the screen above.
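For reference, a minimal sketch of what such a mapper.sh could look like is below. This is my reconstruction of the logic just described, not necessarily identical to the script in the screenshot.

#!/bin/bash
# read standard input line by line
while read line
do
  # tokenize the line into words
  for token in $line
  do
    if [ "$token" = "hello" ]; then
      echo "hello,1"
    elif [ "$token" = "world" ]; then
      echo "world,1"
    fi
    # all other tokens are ignored
  done
done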
Let's look at reducer.sh. I will feed the output of mapper.sh into reducer.sh through a pipe. The reducer will examine each key-value pair produced by the mapper and simply count how many times it finds the first pair, i.e. "hello,1", and how many times it finds the second pair, i.e. "world,1". Finally, it will print my desired output, as shown in the screen below.
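Here too is a minimal sketch of what reducer.sh could look like, reconstructed from the description above:

#!/bin/bash
# count occurrences of each pair emitted by the mapper
hello_count=0
world_count=0
while read pair
do
  if [ "$pair" = "hello,1" ]; then
    hello_count=$((hello_count + 1))
  elif [ "$pair" = "world,1" ]; then
    world_count=$((world_count + 1))
  fi
done
# print the final result
echo "hello=$hello_count"
echo "world=$world_count"

Wiring the two scripts together with a pipe, cat test.txt | ./mapper.sh | ./reducer.sh prints hello=3 and world=2.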
That's what MapReduce is at a very high level. The Hadoop MapReduce framework feeds input data to the mapper method developed by the programmer. The mapper knows what to do with this data: it processes it and generates key-value pairs, which are handed back to the framework. The framework then performs a shuffle and sort operation on all the key-value pairs generated on various nodes across the cluster and feeds them to the reducer. The reducer is again a method written by the programmer, so it knows what to do with these key-value pairs and performs the reduce operation to generate the final result.
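We can mimic that flow on the command line by inserting the Unix sort command between the two scripts, playing the role of the framework's shuffle and sort phase. This is only a rough single-machine analogy; the real framework does this across many nodes.

# map -> shuffle/sort -> reduce, approximated with pipes
cat test.txt | ./mapper.sh | sort | ./reducer.sh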
This key-value paradigm may initially seem limiting, but it is a powerful and surprisingly widely applicable model, as can be seen from the adoption of Hadoop and MapReduce across a wide variety of industries and problem scenarios. The simple fact is that much of our data is either intrinsically key-value in nature or can be represented that way.
The data model itself is not the only thing that makes Hadoop and MapReduce useful. Their real power lies in parallel execution. We can have a large number of DataNodes, each storing part of the data and executing part of the processing locally, with a framework that manages the division of work and the combination of partial results into a final answer. And this model scales horizontally: we just add more nodes when we need them, without worrying about the growing size of the cluster, because fault tolerance comes by default. Isn't it fantastic?
I hope this at least builds a basic idea of the MapReduce paradigm. In the next post, we will try developing, compiling, and executing a real MapReduce program in Java.