Tuesday, December 3, 2013

Pig – Data Types and Schema – Part 1

In previous posts, we learned how to load data into a Pig relation (an outer bag). When you load data into a relation, you have the option to specify a schema for your data. Specifying a schema lets Pig know how your data is organized: the field names and their data types. A schema is optional; if you don’t specify one, Pig will make its best guess. But if you do specify a schema for your data, Pig will use it for error checking and optimization. Let’s take an example.
I am loading tuples from hr-data.txt into three relations, one after the other. The first relation, MYDATA, is loaded without specifying any schema. I can still use the MYDATA bag. When I need to refer to a field, I can use positional notation, which starts from 0. If I use $0 in my Pig script, it points to the first column in MYDATA, which is the id column. If I use $2, it points to last_name. We will see examples later of how to use positional notation.
The second bag, MYDATA1, is loaded with partial schema information: I told Pig the column names but nothing about the data types. I specified only 4 columns, but the actual file contains more than that; Pig will ignore the extra fields and load only the 4 columns I named.
The third bag, MYDATA2, is loaded with much more detailed schema information: I specified both the field names and their data types. Before we get into more details about schemas, let’s understand the data types supported by Pig.
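The three LOAD statements described above would look roughly like this. The original script was shown only as a screenshot, so the exact file layout is an assumption here; only id, first_name and last_name are confirmed by the text, and the fourth column name is made up for illustration.

```pig
-- 1. No schema: fields can only be referenced positionally ($0, $1, ...)
MYDATA = LOAD 'hr-data.txt';

-- 2. Names only, no types (4 columns; any extra fields in the file are dropped)
MYDATA1 = LOAD 'hr-data.txt' AS (id, first_name, last_name, city);

-- 3. Names and types
MYDATA2 = LOAD 'hr-data.txt'
          AS (id:int, first_name:chararray, last_name:chararray, city:chararray);
```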

Scalar Data Types

Pig supports scalar and complex data types. We will focus on scalar data types first. The table below lists the scalar data types, along with the Java class each one is implemented with internally:

    int        32-bit signed integer    java.lang.Integer
    long       64-bit signed integer    java.lang.Long
    float      32-bit floating point    java.lang.Float
    double     64-bit floating point    java.lang.Double
    chararray  string (UTF-8)           java.lang.String
    bytearray  byte array (blob)        DataByteArray
    boolean    true/false               java.lang.Boolean
    datetime   date and time            org.joda.time.DateTime

There is not much more to say about these data types except that they are implemented internally using Java classes. This makes them easy to use with user-defined functions, and their behavior closely resembles that of their Java counterparts.

Complex Data Types

Pig has three complex data types: map, tuple and bag. The tuple is the simplest of the three. We have already seen tuples in the example above, where we loaded tuples of four fields. Most of the time you will start by loading data from your source as tuples and then transform them. The most important property of a tuple is that it is an ordered list of fields, so you can refer to them using positional notation. When you load data as tuples, a missing or malformed value causes that field to be set to NULL. When defining a schema, we use parentheses, i.e. (), to represent a tuple, and the fields inside are separated by commas. See the example below.
The next complex type is the bag, which is simply a collection of tuples. When you load data, you get a bag holding multiple tuples. A bag is like a table in a database, and its tuples are like the rows of that table; however, there are several differences, which you will come to understand gradually. Note that tuples are not ordered within a bag, and a bag may contain duplicate tuples. That is why bags do not support positional notation.
The last complex type is the map, a collection of key-value pairs. The key and value are separated by #, and the data type of the key must be chararray. We use [key#value] to represent a key-value pair in Pig. We will see an example below.
We will see some examples for all of these complex data types to get a better understanding.
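To make the notation concrete, here is a small sketch showing how each complex type is declared in a schema. The field names and file names are made up for illustration; only the bracket conventions ( ), { } and [ ] come from the discussion above.

```pig
-- Tuple: parentheses, fields separated by commas, e.g. (John,Pune)
A = LOAD 'data.txt' AS (t:tuple(name:chararray, city:chararray));

-- Bag: braces around a collection of tuples, e.g. {(1,2),(3,4)}
B = LOAD 'data.txt' AS (b:bag{t:tuple(x:int, y:int)});

-- Map: square brackets with key#value pairs, e.g. [city#Pune];
-- the key is always a chararray, only the value type is declared
C = LOAD 'data.txt' AS (m:map[chararray]);
```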
Let’s come back to schemas. Schemas enable you to assign names to fields and declare their types. They are optional, but we should use them whenever possible. The first opportunity to define a schema comes when we load data with the LOAD statement. Later, we can redefine the schema with a FOREACH statement. Let’s take an example of each method. Let me show you my data before I load it.
It’s a tab-separated file with four fields: name, city, state and country. Let’s load it without specifying a schema. We will use the DESCRIBE command to show its schema and then the DUMP command to display it on screen. The first three lines in the screen below are my Pig script and the rest is output.
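A sketch of that three-line script (the input file name is an assumption, since the original appeared only in a screenshot):

```pig
D = LOAD 'data.txt';  -- no AS clause, so no schema is attached
DESCRIBE D;           -- reports that the schema for D is unknown
DUMP D;               -- prints each record as a parenthesized tuple
```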

So Pig reported that the schema is unknown. The parentheses in the output represent a tuple, and we can see that the fields are separated by commas. Let’s define a schema for this same data using the AS clause of the LOAD statement. We will use the same parenthesis-and-comma notation to define it.
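A sketch of the same load with a schema attached, using the four field names from the text (the file name is again an assumption):

```pig
-- AS clause: a parenthesized, comma-separated list of name:type pairs
D = LOAD 'data.txt'
    AS (name:chararray, city:chararray, state:chararray, country:chararray);
DESCRIBE D;
-- DESCRIBE now prints the declared schema, with {} marking the bag:
-- D: {name: chararray,city: chararray,state: chararray,country: chararray}
```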
We can see that Pig has responded with the schema of our bag D. The braces {} represent a bag.
Let’s look at another example. In this example, I want to demonstrate two things: how to load the map data type, and how to redefine a schema using the FOREACH statement. For this example, I have slightly changed my data file. Check the screen below and you will notice that I have modified the city field: I encoded it in the format Pig uses to represent a key-value pair, i.e. [key#value].
Check the script and output below. If it is not clear at first, read the explanation that follows.
I loaded the data specifying a schema, almost the same as in the earlier example except for one change. Let’s look at m:map[chararray]: m is the field name, map is the keyword informing Pig that it’s a map type, and chararray is the data type of the value. We can’t define a data type for the key, as Pig always assumes it to be a chararray.
The output of the DESCRIBE command confirms the same schema that I specified with LOAD.
The FOREACH statement extracts only the city name from relation D and redefines the schema for that field as CityName:chararray. When we DESCRIBE relation X, we can see the schema we defined. Finally, I displayed the contents of relation X.
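The whole example can be sketched as below. The file name and the map key 'city' are assumptions (the screenshot with the actual script is not reproduced here); the # operator looks up a value in the map, and the AS clause in FOREACH attaches the new name and type.

```pig
-- Load with a map field in place of the plain city column
D = LOAD 'hr-data2.txt'
    AS (name:chararray, m:map[chararray], state:chararray, country:chararray);
DESCRIBE D;

-- Pull the value for key 'city' out of the map and redefine its schema
X = FOREACH D GENERATE m#'city' AS CityName:chararray;
DESCRIBE X;   -- X: {CityName: chararray}
DUMP X;
```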
I know this is a lot for a beginner to absorb about schemas. Try it yourself, digest it, and we will cover more about schemas in a later post.