Wednesday, November 27, 2013

Getting Started with PIG

The power of Hadoop is its distributed file system and MapReduce processing framework. We already know that the most flexible and powerful way to develop MapReduce jobs is to write them in Java. But we may not need that level of control and power for most ETL tasks, and Java turns out to be far too low level for simple ones. This thought gave birth to Hive and Pig. The two were designed and developed by different teams at different companies, both were given to open source, and both became popular through wide acceptance. Nowadays, most companies using Hadoop use both of them. I have explained Hive in several posts and will continue going deeper into Hive. By now we understand that Hive is a query language that inherits most of SQL's features and syntax. What is Pig then, and how does it differ from Hive?
The language is officially known as Pig Latin, and it is a parallel data flow language. So it's not a query language; it's a data flow language. What does that mean? In a query language, we state what result we want and let the query engine decide how to get it.
For example, when we write the SQL
SELECT DEPARTMENT, AVG(SALARY) FROM EMPLOYEES GROUP BY DEPARTMENT;
we are just asking the question "What is the average salary for each department?" and the query engine decides how to get it answered. Doing it procedurally is a completely different approach. Think of writing a function in C or Java to get the same answer; that is procedural. You have to open the file and load the data, divide it into groups, calculate the averages, and finally write the averages somewhere. Scared? Pig is not as low level as C or Java, but yes, it is procedural. If you know HQL (read the Hive related posts on this blog) or at least SQL, you will notice the differences as you learn more about Pig, so let's not waste time discussing differences and start learning Pig Latin.
Pig is a scripting language, and like any other, we have two ways to execute our scripts. You can place all of your Pig code into a file and execute that file, or you can start the Pig shell and enter your Pig commands interactively. We will start with the second method, although the first one is what you will use in real solutions. We are choosing the Pig shell to start learning because it compiles each command as I enter it and lets me know immediately if there is an error. If there is an error, I correct it and enter the command again; Pig remembers only the latest definition, so I need not worry about what I entered earlier. This makes it easy for beginners. Once we are comfortable with Pig syntax, we will move on to executing Pig files.
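As a quick preview of the first method, running a saved script is just a matter of passing the file name to Pig. The file name below is a hypothetical one of my own, not something used later in this post:
pig -x mapreduce myscript.pig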
I am assuming you have access to a Pig environment. If you don't, go back to my post here and get a virtual box running. You may also need to read this post to be comfortable with HDFS and the virtual box.
Let's log in to your virtual machine and type the command below, which starts the Pig shell. The Pig shell is known as Grunt, so from here on we will say Grunt instead of Pig shell.
pig -x mapreduce
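As an aside (assuming a standard Pig installation), Pig can also run entirely on your local file system without a cluster, which is handy for quick experiments:
pig -x local
Either way you land at the grunt> prompt, which is where the rest of this post happens.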
Once you are in Grunt, we will execute one simple data flow which produces the same result you would expect from the SQL below.
SELECT DEPARTMENT, AVG(SALARY) FROM EMPLOYEES GROUP BY DEPARTMENT;
It's our hello world in Pig. You will need a data file. I already have one which I used in some of my Hive tutorials and which is shared here. You need to download this file and place it into your HDFS directory. If you are not sure how to move files into your Hadoop file system, you should read this article.
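If your setup matches mine, that step looks roughly like the commands below. The path and file name are the ones used in this post; adjust them to your own environment, and treat this as a sketch rather than the exact commands from the article linked above.
hadoop fs -mkdir /user/pkp/hql
hadoop fs -put hql-data.txt /user/pkp/hql/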
Let's do it. If you look at the screen below, you will see that I am able to execute HDFS shell commands from Grunt. I used the -ls command to show you that my data file is named hql-data.txt and is placed in the /user/pkp/hql directory of my Hadoop file system. The commands inside the red box are my actual Pig Latin commands.
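In case the screenshot is hard to read, here is roughly what that session looks like. The alias mydata and the output directory dept_sal_avg come straight from this post; the column names name, department and salary and the aliases grouped and dept_avg are stand-ins of my own, since the exact layout of hql-data.txt is not spelled out here.
fs -ls /user/pkp/hql
mydata = LOAD '/user/pkp/hql/hql-data.txt'
         USING PigStorage(',')
         AS (name, department, salary);
grouped = GROUP mydata BY department;
dept_avg = FOREACH grouped GENERATE group, AVG(mydata.salary);
STORE dept_avg INTO 'dept_sal_avg';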

Let me explain. The first three lines, starting with mydata and ending at the semicolon, form a single Pig statement. It has three parts: LOAD, USING and AS. LOAD is the main command and takes a file name, so I gave the full path to my data file. USING is a clause of the LOAD command; since my file is comma separated, I used it to say so. By default, if I don't specify a USING clause, Pig assumes a tab separated file. Finally, the AS clause is where I specified the column names. By default Pig treats every column as a bytearray, which is fine for this example, so I have not specified any data types for my columns.
The fourth line creates a group on the department column.
The fifth line generates the results by looping over each group and calculating the average salary for that group. We will learn more about all of these things later; in this post my objective is simply to get started with a simple Pig script.
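If you are curious what the grouping actually produces, you can ask Grunt with the DESCRIBE command. With the stand-in column names from my sketch above, the output looks something along these lines (the exact schema depends on your AS clause and Pig version):
DESCRIBE grouped;
grouped: {group: bytearray,mydata: {(name: bytearray,department: bytearray,salary: bytearray)}}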
Finally, the sixth line creates a file and places the result into it. Grunt compiles every Pig statement as you enter it on the shell, but it does not execute anything until it reaches a STORE command. As soon as you press enter on that last line, Grunt executes the entire flow, creates a directory named dept_sal_avg in my home directory, and places a file loaded with the results into that directory.
It's time to check the results. Look at the screen below.
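If you would rather check from Grunt itself, something like this should do it. The part file name and the actual averages depend on your Hadoop version and your data, so treat this as a sketch:
fs -ls dept_sal_avg
fs -cat dept_sal_avg/part*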


Do you need any explanation? I don't think so.
Keep reading and keep learning.