Friday, November 29, 2013

Apache Pig Latin – A Paradigm shift

What is Pig? Ask this question to anyone and the answer will be something like “It is a parallel data flow language”. Such a flat answer does not give a proper picture. In this post we will try to answer this question and learn some other Pig fundamentals. Let’s start.
There are at least three paradigms of computer programming. Most of us are already familiar with two of them: programming languages and query languages.
In programming languages like Java and C, we control execution. When we write a program, we define the execution steps.
In query languages we ask questions.
A Java or C programmer who doesn’t know any query language will find SQL difficult during the early learning phase. It takes some time to get hold of this paradigm shift.
Pig Latin is another shift. It uses a third programming technique, known as a data flow language or sometimes a data stream language. If you are not familiar with any data flow language, it’s another paradigm shift for you, and it might take some extra effort to get familiar and comfortable with this method.
In this post, we will cover some basics about Pig Latin. Let’s Start.
  1. Pig is a mixed-case language. Keywords in Pig are case insensitive, but names in Pig are case sensitive. When I say names are case sensitive, that means variable names, function names, etc. When I say keywords are case insensitive, that means LOAD, USING, FOREACH, GROUP BY, etc.
  2. Like any other scripting language, Pig Latin scripts support comments. Single-line comments start with a double hyphen (--), and multiline comments are Java style, i.e. a pair of starting /* and ending */.
  3. Pig is a data flow language. Before you do anything with data, you have to load it, so almost every Pig script will begin with a load statement. An example is given below.
                    
    MYDATA = load '/user/pkp/test.txt';

For simplicity, you can think of MYDATA as a variable. The above statement will load data from the test.txt file and assign it to the variable MYDATA.

MYDATA is not a simple variable. It is similar to a table in a database, which contains multiple rows of data. In our example, MYDATA holds multiple rows from test.txt.

Pig Latin uses different terminology for tables and rows. A table in Pig terminology is a Bag. A row in Pig terminology is a Tuple. In this example, MYDATA is a Bag which holds multiple Tuples from the test.txt file.
  1. When Pig executes a load statement, it loads tuples from the file into a Bag. By default, the load statement assumes that your tuples are separated by the newline character.
  2. If you are visualizing a Tuple as a row, you should see fields in it. By default, the load statement assumes that your fields are tab delimited.
  3. Once you have loaded data into Pig, what do you want to do with it?
    The answer will be processing. But I don’t want to call it processing; I will call it “data flow”, because we will flow data from one bag to another. That’s all we do in Pig, and that’s the paradigm shift you have to understand. Pig is a data flow language: we don’t process data, we flow it from one bag to another. (Bag and Relation are almost synonymous in Pig.) You may think, what will I achieve by moving data from one bag to another? The doubt is genuine. We will transform the data from one format to another at each step, and the new bag will hold the data in the new format. Isn’t that great? No? Let me ask, what is the ETL process?
    E - We Extract data from a source; that’s what we did in the load statement.
    T - We Transform data into a new format; that’s what we will do in each step of our Pig Latin script.
    L - That’s the simple part. We will Load our final Bag wherever we want, most of the time into another file.

    Pig Latin is essentially an ETL scripting language. Coupled with the power of Hadoop, it is one of the simplest and most powerful ETL tools available today.
  4. When you are done with your transformation, or a series of transformations, you have your final data in a Bag. Most of the time, you will want to store it in a new file. You will use the command below for this.
             store BAG_NAME into 'FILE_NAME';

             If you just want to display your Bag’s content on the screen instead of storing it in a file,
             you can use dump BAG_NAME;
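Putting the pieces together, a minimal load-and-store script might look like the sketch below. The input path is the one from the earlier example, and the output path is hypothetical; PigStorage with a tab delimiter is the documented default loader, so the two load statements are equivalent.

```pig
-- load with the default loader (PigStorage: tab-delimited fields,
-- newline-separated tuples)
MYDATA = load '/user/pkp/test.txt';

-- the same load written out explicitly
MYDATA2 = load '/user/pkp/test.txt' using PigStorage('\t');

-- write the final Bag back to a file, or display it with dump
store MYDATA into '/user/pkp/output';
```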
We will run a simple example. I will load a tab-separated file, test.txt, and dump it on screen. Let me show you my file before we start. My data file has 5 rows. Each row has 4 columns, or fields, separated by a tab character.
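Three of the five rows can be reconstructed from the dump output we will see shortly; the other two are omitted here, so treat this only as a partial sketch of the file (fields separated by tabs):

```
101	Prashant	IT	Bangalore
102	Prakash	IT	Bangalore
105	Ravi	HR	Pune
```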

Let’s load it into a Pig Bag and dump it back on screen. My Bag name is MYDATA. Grunt prints many log messages on screen and, at the end of processing, shows the output. I have eliminated all those intermediate messages and kept only the output for better clarity.
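In the Grunt shell, the two statements behind that output are simply the load and dump we have already seen (the file path is the one from the earlier example):

```pig
grunt> MYDATA = load '/user/pkp/test.txt';
grunt> dump MYDATA;
-- dump prints each tuple on its own line, enclosed in parentheses,
-- with the fields separated by commas
```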


In the output above, you will notice that each tuple is enclosed in parentheses. If you remember, my original data file did not have these parentheses; it’s Pig that adds them. You can try storing the Bag into a file (store BAG_NAME into 'FILE_NAME';) instead of dumping it on screen, and you will notice that Pig does not add these parentheses in the file. This clarifies that Pig does not enclose top-level tuples in parentheses when it stores them back to a file. What is a top-level tuple? I say so because we can nest tuples, i.e. tuples inside a tuple. We will see nesting in the next step.

The next thing to notice here is that the fields in a tuple are separated by commas, while my original file was tab separated. That is again a dump behavior which does not happen when we store the data back to a file.
Now let’s perform a transformation: I will group the data on the third column and dump it on screen. Pig starts the column count from 0, so I used $2 for the third column.
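The transformation itself is a single statement; MYGROUP is the name used for the resulting Bag in the rest of this post:

```pig
-- group MYDATA on its third field ($2, since column count starts at 0)
MYGROUP = group MYDATA by $2;
dump MYGROUP;
```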

Great! You saw your first transformation and a new bag of tuples. That’s why I mentioned in the beginning that in Pig we move data from one bag to another and perform some transformation at each movement. If you understand this concept, you will easily master the data flow language. If you keep thinking like Java, C, SQL, or any other language you know, you will have difficulties in understanding and developing Pig Latin scripts.
Let’s discuss some important observations from the output.
  1. You can again see that each top-level tuple is enclosed in a pair of top-level parentheses. So what’s my first tuple? (HR,{(105,Ravi,HR,Pune)})
    And what is second one?
    (IT,{(101,Prashant,IT,Bangalore),(102,Prakash,IT,Bangalore)})
  2. Next question for you: in my second tuple, how many fields are there? If you answered two, you are right. My first field’s value is IT, and the rest of the string is the second field.
  3. You should have a question for me: why is the second field enclosed in {}? Let me answer. Curly braces represent an inner Bag; nesting has happened here. We have an outer Bag named MYGROUP. This bag contains three tuples. Each tuple contains two fields: the first field is a string, and the second field is an inner Bag. You may ask why we need this inner Bag. The answer is simple: we need a Bag because we need to store multiple tuples for each group. For example, for group IT, we have two tuples separated by a comma.
  4. Let’s store the results into a file. You will notice that Pig has again removed the outer parentheses from the top-level tuples, and the fields are separated by tabs instead of commas. This is the default behavior of the output function, which actually writes the data to the file. We will revisit this behavior when we discuss the output function in Pig.
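The store step is a single statement; the output directory name below is hypothetical, pick any HDFS path you like:

```pig
-- write the grouped Bag to a file; the outer parentheses are dropped
-- and the top-level fields are written tab-separated
store MYGROUP into '/user/pkp/grouped_output';
```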

My objective in this post was to help you cope with the paradigm shift from a programming language or query language to a data flow language. Once you start thinking in terms of data flowing from one Bag to another and from one form to another, you will be able to speed up your Pig learning. I hope I succeeded in my attempt. Your feedback is valuable.