We are left with an example of loading an inner bag into a Pig relation. What is an inner bag? We already understand a bag: it is a collection of tuples. So whenever we load data from our files, Pig creates several tuples and loads all of them into an outer bag. Look at the example below. Lines 1 and 2 load my file test-data.tsv into a relation named D. In this case D is an outer bag. Line 3 creates a new relation X by grouping D on the state column. In this case X is again an outer bag. At line 5, when we dump X, the content of X is displayed on screen. Look at the last line of the output.
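The original screenshot is not reproduced here, so below is a minimal sketch of the script being described. The file name test-data.tsv and the state column come from the text; the name column and the tab delimiter are assumptions. The lines are arranged so the load statement spans lines 1 and 2, the GROUP sits on line 3, and the DUMP on line 5, matching the description above.

```pig
D = LOAD 'test-data.tsv'
    USING PigStorage('\t') AS (name:chararray, state:chararray);
X = GROUP D BY state;

DUMP X;
```

With illustrative data (the person names are made up), the dump output ends in something like this:

```
(Karnataka,{(Ravi,Karnataka)})
(Maharashtra,{(Sandeep,Maharashtra),(Sachin,Maharashtra)})
```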
The first column's value is Maharashtra, and the second column is enclosed in {}. That entire second column is an inner bag. This inner bag holds two tuples, separated by a comma. I have saved this output into a file, and we will load it back using a new script.
Look at the example below to learn how we load an inner bag from our file. It is simple and needs little explanation.
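Since the original screenshot is missing, here is a sketch of what such a script can look like. It assumes the grouped output was stored in a file named grouped-data.txt (a hypothetical name), tab separated, with the bag column written in {(…),(…)} notation:

```pig
-- Declaring the second column as a bag of tuples lets PigStorage
-- parse the {(...),(...)} notation directly.
B = LOAD 'grouped-data.txt'
    AS (state:chararray, persons:bag{t:(name:chararray, state:chararray)});
DUMP B;
```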
I will summarize complex data types in the table below.

| Data type | Description | Example notation |
| --- | --- | --- |
| Tuple | An ordered set of fields | (Sandeep,Maharashtra) |
| Bag | A collection of tuples | {(f1,f2,…),(f1,f2,…)} |
| Map | A set of key/value pairs | [key#value] |
We have seen that when we load data from a source file, the format of the source data is also important. When we loaded map data, the data was encoded in [key#value] format. When we loaded an inner bag, the data was encoded in {(f1,f2,…),(f1,f2,…)} format.
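As a quick recap, here is a minimal sketch of loading a map column; the file name, field names, and sample keys are assumptions:

```pig
-- The details column holds values like [city#Mumbai,phone#12345].
P = LOAD 'profile-data.txt' AS (name:chararray, details:map[]);
C = FOREACH P GENERATE name, details#'city';
```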
You may doubt whether we really get data in such well-encoded Pig schema notations. Real-life data comes in all varieties. You may get data files encoded in Pig schema notations, but you may also get completely unstructured data, semi-structured data, or well-structured data in some other notation such as JSON or XML, an HBase table, or any other format. That is one of the top challenges in data analysis, and any tool claiming to be a data analysis tool has to deal with it. So the question is: how does Pig handle this challenge?
This role is played by load and store functions. Load/store functions determine how data goes into Pig and comes out of Pig. If you remember the examples where we loaded a comma-separated file, we used the PigStorage function to inform Pig that the data file was comma separated. PigStorage is the default load function for Pig. As per the Pig 0.12 documentation, there are 8 such functions to support various source data formats. We will see some examples of their usage, but what if your data is in a format other than those supported by these 8 functions? You can write your own function and use it in your Pig script to load your custom-formatted data. But if your data is in an industry-accepted standard format, or your schema is stored in HCatalog, you will find such load/store functions already developed and shared by the open source community. I will write about HCatalog separately, but if you need it you can get information about HCatLoader here.
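As a refresher on PigStorage, here is a minimal sketch of loading and storing with explicit delimiters; the file names and fields are assumptions:

```pig
-- Load a comma-separated file, then store it pipe-separated.
A = LOAD 'input-data.csv' USING PigStorage(',') AS (id:int, name:chararray);
STORE A INTO 'pipe-output' USING PigStorage('|');
```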
Now it's time for an example of using a load function other than the default PigStorage. I will show you how to use JsonLoader to load data from a JSON data file.
For this example, we need a simple JSON data file like the one shown below.
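The original screenshot is not reproduced here, but JsonLoader reads one JSON record per line, so a minimal example file (with made-up values) might look like this:

```json
{"name":"Sandeep","state":"Maharashtra"}
{"name":"Ravi","state":"Karnataka"}
```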
When used without a schema argument, JsonLoader expects a schema definition file named .pig_schema in the input directory. The content of that file for our example is shown below.
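The .pig_schema file is itself a JSON file describing the Pig schema, and it is normally generated by storing a relation with JsonStorage rather than written by hand. The sketch below is a rough illustration only; the numeric type codes come from Pig's internal DataType constants (for example, 55 for chararray), so verify the exact values against your Pig version:

```json
{"fields":[
  {"name":"name","type":55,"description":"autogenerated from Pig Field Schema","schema":null},
  {"name":"state","type":55,"description":"autogenerated from Pig Field Schema","schema":null}],
 "version":0,"sortKeys":[],"sortKeyOrders":[]}
```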
Finally, the script below loads the JSON-formatted data.
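Here is a sketch of that script, assuming the data file above and its .pig_schema sit in an input directory named json-input (a hypothetical name):

```pig
-- With no schema argument, JsonLoader picks up the schema from
-- the .pig_schema file in the input directory.
J = LOAD 'json-input' USING JsonLoader();
DUMP J;
```

Alternatively, you can pass the schema inline, for example JsonLoader('name:chararray, state:chararray'), and skip the .pig_schema file.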
Keep reading, keep learning.
Good Luck.