In my previous post, we explored the MapReduce concept using a bash shell script. In this post we will compile and execute a MapReduce program written in Java. For this demonstration I will use HortonWorks HDP Sandbox 1.3. A beta version of the HortonWorks sandbox, 2.0, is also available now, and a few of the steps explained in this post differ on HDP 2.0. You don’t need to worry about that, as I will explain the differences wherever necessary. In case you do not have a HortonWorks HDP sandbox set up, or you are new to the HDP sandbox, I recommend going through at least the posts below.
Let’s get started by logging in to your HDP sandbox as the root user. The password for the root user is hadoop. The first thing we need to do is set up the PATH variable to include the Java binaries. Follow the screenshot shown below along with the explanation given. If you are an experienced Linux user, you can easily skip the explanations.
- The first command I executed was javac, the Java compiler, and the shell answered command not found. This means either Java is not installed or PATH is not set correctly.
- The second command, echo $JAVA_HOME, prints the Java home directory and confirms that Java is already installed.
- The third command, echo $PATH, prints the current PATH, and I can see that it does not include the location of the Java binaries.
- The fourth command, export PATH=$PATH:${JAVA_HOME}bin, changes the PATH variable to include the Java binaries directory. You will notice that JAVA_HOME already contains / at the end, so I have not included / in ${JAVA_HOME}bin. It is advisable to place this command in your .bash_profile file so that it runs automatically every time you log in, as this value is lost when you log out and log in again.
- The fifth command prints the value of PATH again to confirm that the JDK is now in your PATH.
Note for HDP sandbox 2.0: You will notice the following differences in HDP 2.0.
If you execute echo $JAVA_HOME, you will notice that the extra / present in HDP 1.3 has been removed in HDP 2.0. So your export command will change as shown in the screen below.
However, in HDP 2.0 PATH already includes the JDK binaries directory, but because of this / issue the entry is incorrect. I recommend correcting this typo in the /etc/bashrc file once and for all. Do not change the JAVA_HOME value in /etc/bashrc; instead change the PATH entry to include the missing /. This single correction makes Java available to you in HDP 2.0, and you do not need to export PATH again.
At this stage, we have the Java compiler available to compile our Java programs in the HDP sandbox. The next step is to set CLASSPATH to include the Java classes which we will be using in our Java MapReduce example. Follow the screenshot below. It is simple enough that I am omitting the explanation.
Note for HDP sandbox 2.0: You will notice the following differences in HDP 2.0.
In the HDP 2.0 VM, hadoop-core.jar has been removed and you will not find it. Instead, there are two jar files which you need to include in your CLASSPATH to be able to compile your Java MapReduce example. Use the command below in HDP 2.0.
export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.2.0.2.0.6.0-76.jar:/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core-2.2.0.2.0.6.0-76.jar
At this stage, your environment is set up and you are ready to compile your Java-based MapReduce program. So let’s get a sample program. We will use the word count example for this demonstration, which has become the hello world of Hadoop MapReduce. You can copy the example word count code from the Apache Hadoop website and save it into a WordCount.java file on your machine. We will walk through the code later; let’s first compile and execute it. You need to move your WordCount.java file into your HDP sandbox. If you do not know how to do this, please refer to my previous post, Using HDFS File system.
Once your WordCount.java is in your HDP sandbox, follow the screenshot and explanation below to compile and build your program.
- The first command, mkdir wordcount_classes, creates a directory to hold all compiled Java class files.
- The second command, javac -d wordcount_classes WordCount.java, compiles your program.
- The next command, ls ./wordcount_classes/org/myorg, lists all class files created by the Java compiler.
- The next command, jar -cvf wordcount.jar -C wordcount_classes/ . builds a jar file from our classes.
- The last command, ls, shows that the wordcount.jar file has been created in the current directory.
Now we need an input file for our WordCount program. Our program will count all words in this file using the MapReduce method. We will reuse the test.txt file which we created in the previous post. I have printed the content of this file in the screenshot below.
We need to move our input file into the HDFS file system. Follow the screenshot given below along with the explanation.
- The first command, hadoop fs -ls /user/hue/, lists the content of the directory /user/hue in HDFS.
- The next command, hadoop fs -mkdir /user/hue/input, creates a new directory in HDFS.
- Then we list the content of the current local directory.
- The next command, hadoop fs -put test.txt /user/hue/input, copies test.txt from the local file system to HDFS.
- Finally, we list the content of our HDFS input directory, and we can see that our file test.txt now sits in HDFS.
The screen below shows the final statement to execute our Java class org.myorg.WordCount using wordcount.jar. We are passing two arguments to our class: the first is the input directory and the second is the output directory.
When executed, this class creates the output directory and also creates an output file named part-00000 containing the final results.
You can check the results using the command hadoop fs -cat /user/hue/output/part-00000 or alternatively using the Hue file browser as shown in the screen below. You will notice that the results are already sorted.
Now, it’s time to walk through the Java code. If you are a Java programmer, you will find it fairly simple and may need no explanation. However, let’s cover some key things.
There is only one top-level public class, WordCount, and everything else is encapsulated inside it. This is the class which we executed in our example above. It has two inner classes, Map and Reduce.
The Map class defines one method called map. This is where you write your map logic. The framework passes a key-value pair to the map method. In our case, the key is irrelevant and the value is a single line from the input file. Our map code tokenizes the value and processes every token. The processing is fairly simple: we put each token into an output collector as the key, with the value hardcoded as 1. That’s it. The output goes to the framework, which collects the information from all mappers, performs sort and shuffle operations on the keys, combines the values for each key and passes each key to the reducer along with its set of values. The framework does a lot of things internally which we need not bother about at this stage.
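To make the map step concrete without needing the Hadoop libraries, here is a minimal plain-Java sketch of the same idea. It is not the WordCount code itself: MapSketch is a hypothetical class name, an ordinary List stands in for Hadoop's output collector, and using StringTokenizer to split on whitespace is my assumption about how the line is tokenized.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import java.util.StringTokenizer;

// Plain-Java imitation of the map step. MapSketch and its List "collector"
// are hypothetical stand-ins, not Hadoop classes.
class MapSketch {

    // Receives one line of input (the value; the key is ignored, as in the post)
    // and emits a (token, 1) pair for every whitespace-separated token.
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> collector = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            collector.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return collector;
    }

    public static void main(String[] args) {
        // Repeated words are NOT merged here; that happens after the map step.
        System.out.println(map("to be or not to be"));
        // prints [to=1, be=1, or=1, not=1, to=1, be=1]
    }
}
```

Notice that the mapper does no counting at all. It only emits pairs, and leaves grouping to the framework.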
Finally, there comes the role of the second class, Reduce, and its method named reduce. The framework passes a key and the list of its values, as an iterator, to the reduce method. Inside this method, we write our reduce logic. That’s again simple in our case. We just total the values in the iterator, which amounts to counting them since each value is 1, and place the result into the output collector, again as a key-value pair where the key is the same and the value is the total. This goes back to the framework, which takes care of the rest, i.e. collecting the information from all reducers and placing the final output into the output directory in sorted order.
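The reduce step can be sketched in plain Java in the same spirit. Again, this is only an illustration under my own assumptions: ReduceSketch is a hypothetical name, and a TreeMap stands in for the framework's sort and shuffle so that each key arrives with its values grouped and in sorted key order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the reduce step. ReduceSketch is a hypothetical name,
// not a Hadoop class.
class ReduceSketch {

    // Receives one key and an iterator over its values, and returns the total.
    // Every value emitted by the mapper is 1, so the total is the word count.
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Stand-in for sort and shuffle: group the 1s per key, keys kept sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : Arrays.asList("to", "be", "or", "not", "to", "be")) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        // One reduce call per key, printed as key<TAB>count.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue().iterator()));
        }
    }
}
```

Running it prints one line per distinct word (be 2, not 1, or 1, to 2), with the keys in sorted order because of the TreeMap.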
The final part is the main method. It is all about setting the configuration information which the framework needs to be able to handle our job.
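The division of labour between main and the framework can also be imitated in plain Java. In the sketch below, main only decides which input, which map logic and which reduce logic to use, and a hypothetical runJob method plays the part of the framework. DriverSketch, runJob and its functional parameters are my own inventions for illustration and are not the Hadoop API. As a simplification, the mapper returns only the keys and runJob attaches the value 1 to each.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java imitation of a job driver. DriverSketch, runJob and the
// functional parameters are hypothetical, not part of Hadoop.
class DriverSketch {

    // Stand-in for the framework: runs the configured mapper on every input
    // line, groups the emitted keys in sorted order, then runs the reducer.
    static Map<String, Integer> runJob(List<String> inputLines,
                                       Function<String, List<String>> mapper,
                                       Function<List<Integer>, Integer> reducer) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : inputLines) {
            for (String key : mapper.apply(line)) {
                // Simplification: the sketch attaches the value 1 to every key.
                grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            output.put(e.getKey(), reducer.apply(e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // The "configuration": which input, which map logic, which reduce logic.
        List<String> input = Arrays.asList("to be or not", "to be");
        Function<String, List<String>> mapper =
                line -> Arrays.asList(line.trim().split("\\s+"));
        Function<List<Integer>, Integer> reducer = values -> {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return sum;
        };
        System.out.println(runJob(input, mapper, reducer));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

The point of the sketch is that main contains no processing loop of its own. It hands the pieces over and lets runJob do the work.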
If you are not a Java programmer, you will find it difficult to follow all of this, but don’t get disappointed. Java is not the only way to develop MapReduce programs. There are many other ways, and most of them are simpler than Java for most tasks. Java is used when you need full control and flexibility, but we don’t always need that level of control and flexibility. In the world of business analytics, the ability to deliver quickly is critical for success, and using Java for that purpose might be time consuming. Still, knowledge of Java and Linux and an understanding of the core mechanism will help you become a better Hadoop professional, irrespective of which tool you use to build your MapReduce jobs.