Friday, November 29, 2013

Apache Pig Latin – A Paradigm Shift

What is Pig? Ask this question to anyone and the answer will be something like "It is a parallel data flow language." Such a flat answer does not give a proper picture. In this post we will try to answer this question and learn some other Pig fundamentals. Let's start.
There are at least three paradigms of computer programming. Most of us are already familiar with two of them: programming languages and query languages.
In programming languages like Java and C, we control execution. When we write a program, we define the execution steps.
In query languages we ask questions.
A Java or C programmer who doesn't know any query language will find SQL difficult during the early learning phase. It takes some time to get hold of this paradigm shift.
Pig Latin is another shift. It uses a third programming technique known as a data flow language, sometimes termed a data stream language. If you are not familiar with any data flow language, it's another paradigm shift for you, and it might take some extra effort to get familiar and comfortable with this approach.
In this post, we will cover some basics about Pig Latin. Let’s Start.
  1. Pig Latin is a mixed-case language. Keywords in Pig are case insensitive, but names in Pig are case sensitive. When I say names are case sensitive, I mean relation names, field names, function names, etc. When I say keywords are case insensitive, I mean LOAD, USING, FOREACH, GROUP BY, etc.
  2. Like any other scripting language, Pig Latin supports comments. Single-line comments start with a double hyphen (--) and multiline comments are Java style, i.e. a pair of starting /* and ending */.
  3. Pig is a data flow language. Before you do anything with data, you have to load it. Almost every pig script will begin with a load statement. An example is given below.
                    
    MYDATA = load '/user/pkp/test.txt';

For simplicity, you can think of MYDATA as a variable. The above statement will load data from the test.txt file and assign it to a variable named MYDATA.

MYDATA is not a simple variable. It is similar to a table in any database which contains multiple rows of data. In our example, MYDATA holds multiple rows from test.txt.

Pig Latin uses different terminology for Table and Rows. Table in Pig terminology is a Bag. Row in Pig terminology is a Tuple. In this example MYDATA is a Bag which holds multiple Tuples from test.txt file.
  1. When Pig executes a load statement, it loads tuples from the file into a Bag. By default, the load statement assumes that your tuples are separated by a newline character.
  2. If you are visualizing a Tuple as a row, you must see fields there. By default, the load statement assumes that your fields are tab delimited.
  3. Once you have loaded data into Pig, what do you want to do with it?
    The answer will be processing. But I don't want to call it processing. I will call it "data flow", because I will flow data from one bag to another bag. That's all we do in Pig, and that's the paradigm shift you have to understand. Pig is a data flow language and we don't process data; we flow data from one bag to another bag. Bag and Relation are almost synonymous in Pig. You may think, what will I achieve by moving data from one bag to another bag? The doubt is genuine. I will transform data from one format to another in each step, and the new bag will hold data in a new format. Isn't that great? No? Let me ask, what is the ETL process?
    E - We Extract data from a source; that's what we did in the load statement.
    T - We Transform data into a new format; that's what we will do in each step of our Pig Latin script.
    L - That's the simple part. We Load our final Bag wherever we want, most of the time into another file.

    Pig Latin is essentially an ETL scripting language. Coupled with the power of Hadoop, it is one of the simplest and most powerful ETL tools available today.
  4. When you are done with your transformation or a series of transformations, you have your final data in a Bag. Most of the time, you will want to store it into a new file. You will use the command below for this.
             store BAG_NAME into 'FILE_NAME';

             If you just want to display your bag content on the screen instead of storing it into a file,
             you can use dump BAG_NAME;
We will run a simple example. I will load a tab-separated file test.txt and dump it on the screen. Let me show you my file before we start. My data file has 5 rows. Each row has 4 columns, or fields, and they are separated by a tab character.

Let's load it into a Pig Bag and dump it back on the screen. My Bag name is MYDATA. Grunt throws many messages on the screen, and at the end of processing it shows the output. I have eliminated all those intermediate messages and kept only the output for better clarity.
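Here is a minimal sketch of that session in grunt; only the tuples quoted later in this post are shown, and the remaining rows are elided.

    grunt> MYDATA = load '/user/pkp/test.txt';
    grunt> dump MYDATA;
    (101,Prashant,IT,Bangalore)
    (102,Prakash,IT,Bangalore)
    (105,Ravi,HR,Pune)
    ...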


In the output above, you notice that each tuple is enclosed in parentheses. If you remember, my original data file did not have these parentheses; it's Pig that adds them. You can try storing the bag into a file (store BAG_NAME into 'FILE_NAME';) instead of dumping it on the screen, and you will notice that Pig does not add these parentheses in the file. This clarifies that Pig does not enclose top-level tuples in parentheses when it stores them back into a file. What is a top-level tuple? I said so because we can nest tuples, i.e. tuples inside a tuple. We will see nesting in the next step. The next thing to notice here is that the fields in a tuple are separated by commas, but my original file was tab separated. That's again a Pig display behavior which does not happen when we store the data back into a file.
Now let's perform a transformation. I will group the data on the third column and dump it on the screen. Pig starts the column count from 0, so I used $2 for the third column.
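A minimal sketch of this step; MYGROUP is the bag we examine in the observations that follow.

    grunt> MYGROUP = group MYDATA by $2;
    grunt> dump MYGROUP;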

Great! You saw your first transformation and a new bag of tuples. That's why I mentioned in the beginning that, in Pig, we move data from one bag to another bag and perform some transformation at each movement. If you understand this concept, you will easily master the data flow language. If you keep thinking like Java, C, SQL or any other language you know, you will have difficulties in understanding and developing Pig Latin scripts.
Let’s discuss some important observations from the output.
  1. You can again see that each top-level tuple is enclosed in a pair of top-level parentheses. So what's my first tuple? (HR,{(105,Ravi,HR,Pune)})
    And what is the second one?
    (IT,{(101,Prashant,IT,Bangalore),(102,Prakash,IT,Bangalore)})
  2. Next question for you. In my second tuple, how many fields are there? If you answered two, you are right. My first field value is IT and the rest of the string is the second field.
  3. You should have a question for me. Why is the second field enclosed by {}? Let me answer. Curly braces represent an inner Bag. Nesting has happened here. We have an outer Bag named MYGROUP. This bag contains three tuples. Each tuple contains two fields. The first field is a string and the second field is an inner Bag. You may ask why we need this inner Bag. The answer is simple. We need a Bag because we need to store multiple tuples for each group; for example, for group IT, we have two tuples separated by a comma.
  4. Let's store the results into a file (see the sketch below). You will notice that Pig has again removed the outer parentheses from top-level tuples, and fields are separated by tabs instead of commas. This is the default behavior of the output function which actually writes data into the file. We will revisit this behavior when we discuss output functions in Pig.
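A minimal sketch of that store, assuming a hypothetical output path:

    grunt> store MYGROUP into '/user/pkp/mygroup_out';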

My objective in this post was to help you cope with the paradigm shift from a programming language or query language to a data flow language. Once you start thinking in terms of data flowing from one Bag to another Bag and from one form to another form, you will be able to speed up your Pig learning. Hope I succeeded in my attempt. Your feedback is valuable.

Wednesday, November 27, 2013

Getting Started with PIG

The power of Hadoop is its distributed file system and the MapReduce processing framework. We already know that the most flexible and powerful method for developing MapReduce jobs is using Java. But we may not need that level of control and power for most of our ETL tasks, and Java turns out to be too low level for doing simple things. This thought gave birth to Hive and Pig. These two were designed and developed by two different teams in two different companies. Both were given to open source and became famous due to their wide acceptance. Nowadays, most companies using Hadoop use both of them. I have explained Hive in several posts and will continue going deeper into Hive. By now we understand that Hive is a query-like language and inherits most SQL features and syntax. What is Pig then, and how does it differ from Hive?
It is officially known as Pig Latin, and it is a parallel data flow language. So it's not a query language, it's a data flow language. What does that mean? In a query language, we tell what results we want and let the query engine decide how to get those results.
For example, when we write the SQL query SELECT DEPARTMENT, AVG(SALARY) FROM EMPLOYEES GROUP BY DEPARTMENT; we are just asking a question: "What is the average salary for each department?" The query engine will decide what to do to get this answered. But when we have to do it procedurally, it's a completely different approach. Think of writing a function in C or Java to get this answered; that's a procedural example. You have to open the file and load data, divide it into groups, calculate averages and finally write the averages somewhere. Scared? Pig is not as low level as C or Java but yes, it is procedural. If you know HQL (read the Hive related posts on this blog) or at least SQL, you will realize the differences as you learn more about Pig, so let's not waste time discussing differences and start learning Pig Latin.
We know Pig is a scripting language and, like any other, we have two methods to execute our scripts. You can place all of your Pig code into a file and execute that file, or you can start the pig shell and enter all your Pig commands interactively. We will start with the second method, although the first one is used in real solutions. We are choosing the second method, i.e. the pig shell, to start learning Pig because it immediately compiles each command I enter and lets me know if there is any error. In case of an error, I correct it and enter my command again. Pig remembers only the latest one, so I need not worry about what I entered earlier. This feature makes it easy for starters. Once we are comfortable with Pig syntax, we will move to executing our Pig files.
I am assuming you have access to a Pig environment; if you don't, go back to my post here and get a virtual box running. You may also need to read this post to be well versed with HDFS and the virtual box.
Let's log in to your virtual machine and type the command below, which will start the pig shell. The pig shell is known as Grunt, and we will say grunt instead of pig shell from now on.
             pig -x mapreduce
Once you are in grunt, we will execute one simple data flow which will produce the same results as you can expect from the SQL below.
             SELECT DEPARTMENT, AVG(SALARY) FROM EMPLOYEES GROUP BY DEPARTMENT;
It's a hello world for us in Pig. You will need a data file. I already have a data file which I used in some of my Hive tutorials, and it is shared here. You need to download this file and place it into your HDFS directory. If you are not aware how to move files into your Hadoop file system, you should read this article.
Let's do it. You will see that I am able to execute HDFS shell commands from grunt. I used the -ls command to confirm that my data file is named hql-data.txt and is placed in the /user/pkp/hql directory in my Hadoop file system. After that come my actual Pig Latin commands, sketched below.
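Here is a minimal sketch of that flow; the relation names and the column list in the as clause are my assumptions, while the file path, the comma delimiter and the output directory come from this post.

    mydata = load '/user/pkp/hql/hql-data.txt'
             using PigStorage(',')
             as (emp_id, fname, lname, salary, department);
    grp = group mydata by department;
    avg_sal = foreach grp generate group, AVG(mydata.salary);
    store avg_sal into 'dept_sal_avg';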

Let me explain. The first three lines, starting with mydata and ending at the semicolon, form a single Pig command. There are three parts to it: load, using and as. Load is the main command and it takes a filename, so I gave the full path of my data file. Using is a clause of the load command; since my file is comma separated, I used this clause to specify that. By default, if I don't specify the using clause, Pig will assume a tab-separated file. Finally, I have the as clause, where I specified column names. By default Pig will treat all columns as bytearray, which is fine for this example, hence I have not specified any data type for my columns.
The fourth line creates a group on the department column.
The fifth line generates results by looping over each group and calculating the salary average for each group. We will learn more about all these things later. In this post my objective is to begin with a simple Pig script.
Finally, the sixth line will create a file and place the results into it. Grunt will compile all Pig statements as you enter them on the grunt shell, but it will not execute anything until it finds a store command. As soon as you press enter on the store statement, grunt will execute the entire flow, create a directory "dept_sal_avg" in my home directory and place a file loaded with results into that directory.
It’s time to check results. Check below screen.


Do you need any explanation? I don't think so.
Keep reading and keep learning.

Tuesday, November 26, 2013

Hive Bucketed Tables and Sampling

Bucketing is a simple idea. You create multiple buckets, read each record, and place it into one of the buckets based on some logic, mostly some kind of hashing algorithm. This allows you to organize your data by decomposing it into multiple parts. You might think that we achieve the same thing using partitioning, so why bother about bucketing? There is one difference. When we do partitioning, we create a partition for each unique value of the column. This may lead to a situation where you need to create thousands of tiny partitions. But if you use bucketing, you can limit it to a number which you choose and decompose your data into that many buckets. In hive, a partition is a directory but a bucket is a file. We will see it in action.
In hive, bucketing does not work by default. You have to set the following variable to enable bucketing: set hive.enforce.bucketing=true; Once you have ensured that the above variable is set, let's create a bucketed table. In my previous post, we had an external table named employees. We will create a bucketed hive table named emphive and load data from the employees table. Once data is loaded into the bucketed table, we will go back to our warehouse to check if multiple data files are created, which should be equal to the number of buckets in our table.
Here is the command to create a bucketed table. I have bucketed data on the country column into 3 buckets.
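A minimal sketch of such a statement; the column list is my assumption, while the CLUSTERED BY clause is the part this post is about.

    CREATE TABLE emphive (
      employee_id INT,
      first_name  STRING,
      salary      DOUBLE,
      country     STRING
    )
    CLUSTERED BY (country) INTO 3 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';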
I have executed this command on the hive prompt and now it's time to load data. The command to load data is shown below.
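A sketch of that load, assuming the employees table exposes matching columns (the column names are illustrative):

    SET hive.enforce.bucketing=true;
    INSERT OVERWRITE TABLE emphive
    SELECT employee_id, first_name, salary, country FROM employees;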
I have executed this command on the hive prompt; now it's time to go and check how many data files are created in our warehouse. If you have read all my hive articles, you already know where these files will be created. You will see three files named 000000_0, 000001_0 and 000002_0; these are our data files.


You might have a question in your mind: how can we select data from only one bucket? I will answer that question. We can use sampling queries. Keep reading.
There are two benefits of bucketing. First, it is more efficient for certain types of queries in hive, particularly a join operation on two tables which are bucketed on the same column. Second, your sampling queries are more efficient if they are performed on bucketed columns.
Do you understand sampling? No worries, let me explain it. You might have a situation when accuracy is not very important and you just need to run your queries on a smaller set of data instead of the entire table. One use case for such a situation is testing. You have huge data in a table which takes a lot of time to read and answer your queries, and you want to test on just a smaller sample of your data instead of the entire set. Another use case is when you need to find some patterns or need some aggregation like an average, and accuracy is not your top priority; a general sense with a predictable degree of error is fine. You may just need to find your patterns on a smaller sample set of data instead of the full data set. Sampling can also be used to obtain a random subset of the actual rows in a table for auditing purposes. Hive supports two types of sampling, which we will explore through examples.
Let's count how many records we have in our different buckets. We will use a hive sampling query for this (see the sketch below).
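A sketch of such a query using hive's bucket sampling syntax; run it once per bucket by changing the bucket number.

    SELECT count(*) FROM emphive TABLESAMPLE(BUCKET 3 OUT OF 3 ON country);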
You will notice that the 3rd bucket has just one record. Let's check which country this record belongs to and whether my first and second buckets contain data for this country.
As expected, a specific country does not go into two buckets. Let's check which countries buckets 1 and 2 contain.



So, bucket 1 contains data for CA and US whereas bucket 2 contains data for UK. I hope you understood what hive did when we inserted data into this bucketed table. Let me summarize. Hive created three buckets, as I instructed it to do in the create table statement. When I loaded data into this table, hive used a hashing technique on each country value to generate a number in the range of 1 to 3. Based on the outcome of the hashing, hive placed each data row into the appropriate bucket. This hashing technique is deterministic, so when hive has to read the data back or write more data, it goes to the correct bucket.
One thing about the tablesample clause is worth mentioning: it's part of the from clause, so you can have a where clause or other clauses like order by as usual. Check the example below if you have any doubt.
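A sketch with where and order by added; the column names and the threshold are just an illustration.

    SELECT first_name, salary
    FROM emphive TABLESAMPLE(BUCKET 1 OUT OF 3 ON country)
    WHERE salary > 5000
    ORDER BY salary;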

Hive supports another type of sampling, called block sampling. In this sampling we specify the percentage of data to be sampled. Let's see an example.
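A sketch of a block sampling query, assuming the same table as above:

    SELECT count(*) FROM emphive TABLESAMPLE(10 PERCENT);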
Oops! It doesn't seem to be working correctly. I have a total of 106 records and I asked to sample 10%, so I expect the count to be something close to 10. How come it is 70? The answer is simple: because it is block sampling. When I asked hive to sample 10%, I actually asked it to read approximately 10% of the blocks, but I have just two blocks of data in this table and the minimum hive can read is one block. Yes, the granularity of block sampling is at the block level. For example, if your HDFS block size is 256MB, even if n% of the input size is only 100MB, you get 256MB of data.
Keep Reading and Keep Learning.

Saturday, November 23, 2013

HQL - Hive Query Language

We have learned creating hive tables and loading data into those tables. The next step is executing hive queries on those tables. I am not going to explain the basics of SQL, as I assume you are already familiar with SQL for relational databases. If you do not know SQL, step back and learn SQL basics before you start learning the hive query language. I am assuming that you have gone through all of my previous articles on this blog, so I will not cover concepts which I have already explained in other articles. If you are landing at this page for the first time, I recommend you start reading from the beginning to learn hive. I have followed a logical sequence for hive learning in this blog.
Editing HQL on the hive CLI is troublesome, so we will write our HQL in a file and execute the script from the Hive CLI. In case of any errors or changes, we go back to our script, change it there and execute it again.
If you have already executed HQL on the CLI, you must have noticed that the Hive CLI throws a lot of messages before printing query results. I feel a little lost in those messages and want to disable them and see only the query output on the screen. To do this we have to start hive in silent mode, which you can do by using hive -S to start the hive CLI instead of using only the hive command.
Let's test it. I have a file named q.hql which I have stored in my home directory. It has a simple select statement, as shown below.

Now, I will start hive in silent mode using hive -S and execute this file as shown below.
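A sketch of that, assuming q.hql holds a simple select (the query inside is just an illustration):

    $ cat q.hql
    select * from employee limit 5;

    $ hive -S -f q.hql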
Ok, now you know how to start hive in silent mode and how to execute hql from a source file.
The hive query language is also known as HQL. It mostly resembles the SQL syntax of the MySQL database, but there are lots of differences. I need a suitable table, loaded with appropriate data, to demonstrate HQL. For this purpose, I created a data file with 100+ employee records. I have shared the data file for your reference and download here.
The first step is to create an external table and map this data. I will not load it further into a managed table, and I will omit partitioning, as our objective is to focus on HQL. You have already learned creating a partitioned managed table and loading data into it from an external table. If you want, you can do that and execute all the HQLs demonstrated here on your partitioned hive managed tables. You should not expect any differences in the final output.
The command to create an external table is shown below.
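A sketch of that statement; the primitive column names, the struct members inside job_history and the delimiters are my assumptions, while the complex column types follow the descriptions below.

    CREATE EXTERNAL TABLE employee (
      emp_id      INT,
      name        STRING,
      contacts    MAP<STRING, STRING>,
      address     STRUCT<city:STRING, street:STRING, zip:STRING>,
      job_history ARRAY<STRUCT<title:STRING, company:STRING, years:INT>>
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':'
    LOCATION '/user/pkp/hql';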
You need to execute this command in the hive CLI and copy the data file into HDFS at location /user/pkp/hql before you start executing hql statements. Our table employee is mostly simple except for three columns: contacts, address and job_history.
The contacts column is a map, which means it can store multiple key-value pairs like office phone, email, mobile etc. We will see it in more detail when we execute hql statements on it.
The address column is a struct with three members: city, street and zip. We will see how we can manipulate them in hql.
job_history is the most complex one. It's an array, which means it can contain multiple values starting with array index 0. Arrays and maps are similar in hive, as they both can store multiple values. The difference is how they reference values: an array uses an index starting from 0, whereas a map uses a key to identify a value. If you don't get it now, don't worry, you will get it with the examples shown below.
To make this array more complex, and to show that hive supports nesting of complex data types, I made it an array of structs. We will see how we can work with nested data types.
Let's execute the first and most simple hql statement, selecting only the columns with primitive data types, i.e. string and int in our case.
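A sketch of that query, using the column names assumed in the create statement above:

    SELECT emp_id, name FROM employee LIMIT 2;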

And the output is as below.


Simple, no explanation required. Working with primitive data types is easy, and most of your SQL skills will work here. Complex data types like maps, structs and arrays are the ones we will focus on and learn how to work with.
Let's select a map type column, i.e. contacts, for these two records; the output is shown below.
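A sketch of that query:

    SELECT name, contacts FROM employee LIMIT 2;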
You will notice the following.
Hive returns them using JSON notation; you will see the same for other complex data types also. There are three key-value pairs for each record: office, email and mobile. This clarifies why I mentioned that maps are similar to arrays, because we can see that the single column contacts actually contains three values. Does it not appear like an array of three elements?
How do I access an individual element of a map?  Let’s do it.
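A sketch using one of the map keys listed above; the id in the where clause is just an illustration.

    SELECT name, contacts['office'] FROM employee WHERE emp_id = 101;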


Okay, I used array notation to access a single element, but instead of using an index number, I used a key. I have used a where clause in the above example just to emphasize that your SQL skills will mostly work in hql. I will not mention such things further; you should notice them yourself.
Let’s select a struct column.

 

How do I access an individual element of a struct? Check below.
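A sketch using dot notation on a struct member:

    SELECT name, address.city FROM employee LIMIT 2;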



Let’s select an array column.
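A sketch of that query:

    SELECT name, job_history FROM employee LIMIT 1;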



Did you notice that there are two elements in the array? The entire array value is enclosed in [ ], and each array element is enclosed in { } because each element is a struct. If you look at the output, there are two sets of { } separated by a comma, each representing one array element. Let's select only the first element of the array next.
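A sketch using array index notation:

    SELECT name, job_history[0] FROM employee LIMIT 1;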



Now you understand that there were two elements in the array, and looking at only one element is as simple as using array notation with an index value.
Each element of the array is again a struct of three members, as you can see above. Let's select only one member of the first element of the array.
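A sketch combining array index and struct dot notation; the member name title is one of the assumed struct members.

    SELECT name, job_history[0].title FROM employee LIMIT 1;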




I hope by now you have learned how to access complex columns in hive and how to access them when they are nested. I will leave you at this stage to explore your SQL skills in HQL.
A few notes before we conclude this post.
You can use a where clause in HQL as you might in SQL. Most of the operators are the same: =, !=, <, IS NULL, IS NOT NULL, LIKE etc. You can also use mathematical functions like round, floor, ceil, pow, sqrt etc., aggregate functions like count, sum, avg, min, max etc., and string functions like length, reverse, upper, lower, substr, trim, concat etc. HQL also supports the having clause along with aggregate functions. You can use the CASE WHEN THEN construct, subqueries like you use in SQL, and order by and union as well.
Joins are also supported with some limitations (only equijoins, combined with AND; OR is not supported in join conditions). Hive goes left to right when joining multiple tables, and WHERE clauses are evaluated after the joins are performed.


In summary, most of your SQL skills will work here. Let me show another example of the nested query syntax supported in HQL, which you may find different from some RDBMSs.
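A sketch of hive's FROM-first nested query form; the filter is just an illustration.

    FROM (
      SELECT emp_id, name FROM employee WHERE emp_id > 100
    ) e
    SELECT e.name;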



There are a lot of other things to cover in HQL, which we will cover individually later. Keep learning.

Wednesday, November 20, 2013

Loading data into hive Tables


One important limitation in hive is that it does not support row-level insert, update, and delete operations. So the only way to load data into a table is to use one of the bulk load methods or simply write files in the correct directories.
By now you have learned how to create tables in hive, and these tables may be managed tables or external tables. We have also seen that both managed and external tables can be partitioned. We learned some methods of loading data into these tables. For a managed table, whether it is partitioned or not, we used the LOAD DATA LOCAL statement. For an external table that is not partitioned, we used the LOCATION clause in the create table statement itself. When the external table was partitioned, we used ALTER TABLE to add new partitions to the external table. I just wanted to summarize the data loading methods in one place, hence this post.
We know the LOAD DATA LOCAL statement can be used to load data into a managed table. If the table is partitioned, you will use the PARTITION clause as well; otherwise it is not required. Let's look again at the statement we used in one of my previous posts; a sketch of it is shown below.
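A sketch of that statement; the table name and partition value are my assumptions, while the file path appears in point 4 below.

    LOAD DATA LOCAL INPATH '/home/pkp/bang-data.txt'
    INTO TABLE emp
    PARTITION (city = 'Bangalore');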

A few things need to be added here.
  1. You already know that LOAD DATA will create the appropriate target directory in the hive warehouse and copy the data file there. If the directory already exists, hive will not create it. The file name remains the same.
  2. Hive does not perform any verification on the data for compliance with the table schema.
  3. If the LOCAL keyword is used, the source path is assumed to be in the local file system. If it is omitted, the path is assumed to be in HDFS. There is a catch here. Without LOCAL, hive will move the data file instead of copying it, as it does with LOCAL. This is because hive assumes that you do not want duplicate copies in HDFS. If you don't want to move it, use it where it is with the external table method.
  4. '/home/pkp/bang-data.txt' is a file name, but by convention we provide a directory name, which allows hive to copy all files available in the directory at once. We already know that file names remain the same when hive moves them into the warehouse.
  5. If files which are being copied by hive already exist in the target directory, they will be overwritten, and all new files which do not exist will be copied. If you want to clean the target directory before moving any data, you can use OVERWRITE INTO TABLE in the second line of the statement above.
  6. We already know we can use the PARTITION clause in the LOAD DATA statement. But the example we saw was for just one partition, which doesn't make it very useful for us. We learned the ALTER TABLE command earlier to add new partitions to an external table, but ALTER TABLE is equally valid for a managed table as well. So if we have 10 partitions to add, we can use a single ALTER TABLE command to add all partitions at once, as explained in the previous post.
This completes some missing bits of what we already learned about loading data into hive. You might have noticed by now that I am not trying to explain the syntax or cover all the options available for the various commands; that is available on the Apache hive wiki. We are after the concepts and key capabilities of this tool called Apache Hive.
Let's learn a new method for loading data into hive tables.
In a previous post, I mentioned external tables. They are mostly used by mapping them into hive. Once mapped, we execute our queries on them, prepare reports, and once done, we un-map them from hive using a drop table statement. In that case, hive doesn't remove the data but only removes the metadata. There might be requirements where we want to load data from those external tables into hive tables. This is normally required during an ETL process, when we want to transform staged data from an external table using the hive query language and load it into a hive table in a new structure. It is not necessary that we do it using external tables; we may also need to transform a managed table into a new structure and store it as a new table. The question here is, does hive provide any method to load data from an existing table (managed or external) using a hive select statement? And the answer is yes. Let's demonstrate this. We will use a simple select * statement in this demonstration, without transformation, to keep the focus on the loading method instead of complicating it with transformation logic. In a real life scenario, we may not need such simple loads without transformation. The transformation may be anything, from simple aggregation to some complex logic.
Let's start and look at the data which I prepared for this demonstration. I have a plain text file of employee records with comma-separated fields as listed below.
employee_id, first_name, last_name, email, hire_date, salary, department_name, country_abr, state_province, city, street_address, postal_code
There are 105 records in this file. We will have a special focus on country and state, because I want to demonstrate partitioning on these two columns. The country- and state-wise number of records in my file is shown in the table below. I hope you can prepare such a file easily, but if you need to use my file, it is shared here.
Country Abbreviation    State         Number of records
US                      Texas          5
US                      Washington    18
US                      California    45
CA                      Ontario        2
UK                      Oxford        34
DE                      Bavaria        1
I have moved my data file (hr-data.txt) into the HDFS directory /user/pkp/hr as shown below.


In the next step, I created the external table as shown below and queried the first record to test whether my table was created correctly and the records are mapped properly into fields. We know hive does not check data against the table schema and will read whatever is there in the data file, so it is almost always necessary to test whether the table schema and data are in compliance.
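A sketch of that external table; the column names follow the field list above, and the data types are my assumptions.

    CREATE EXTERNAL TABLE employees (
      employee_id     INT,
      first_name      STRING,
      last_name       STRING,
      email           STRING,
      hire_date       STRING,
      salary          DOUBLE,
      department_name STRING,
      country_abr     STRING,
      state_province  STRING,
      city            STRING,
      street_address  STRING,
      postal_code     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/pkp/hr';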

Before we load data into a hive table, let's create one. We will name it emphive and keep the structure the same, as we are not doing any transformation. But we have to remove the country and state columns from our hive table because we want to partition the table on these columns. We already know that hive will append partition columns at the end. They are not actual columns, but virtual columns deduced from partitioning metadata. This is good in the sense that it removes unnecessary data from the file. So let's create the emphive table with the same structure, excluding the country and state columns from the table schema, and place them in the partition clause. You have already learned this, but just for completeness, a sketch is shown below as a reference.
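It mirrors the columns above, minus country and state, which move into the PARTITIONED BY clause.

    CREATE TABLE emphive (
      employee_id     INT,
      first_name      STRING,
      last_name       STRING,
      email           STRING,
      hire_date       STRING,
      salary          DOUBLE,
      department_name STRING,
      city            STRING,
      street_address  STRING,
      postal_code     STRING
    )
    PARTITIONED BY (country STRING, state STRING);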

The ground work is done. Let's load data into emphive from the external table employees. We will load a single partition for now. The command used for this is shown below.
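A sketch of that single-partition insert; the DE/Bavaria partition matches the one record we check for below.

    INSERT OVERWRITE TABLE emphive
    PARTITION (country = 'DE', state = 'Bavaria')
    SELECT employee_id, first_name, last_name, email, hire_date,
           salary, department_name, city, street_address, postal_code
    FROM employees
    WHERE country_abr = 'DE' AND state_province = 'Bavaria';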

It throws a lot of messages, and looking at those messages you find that it runs 3 MapReduce jobs to accomplish this task. Finally, the data is loaded into our table. Let's query records from the newly loaded table; you will find that only one record is returned, which is as expected. You should also notice that the returned record shows country and state as the last two columns.


With OVERWRITE, any previous contents of the partition (or the whole table if it is not partitioned) are replaced. If you do not use OVERWRITE and replace it with INTO, Hive appends the data rather than replacing it.
Let's load data for two more partitions, but how? Do I have to write the INSERT statement twice? No, there is another way to do it. Using this method, we read the source table once and write into as many partitions as we want. Let's see how it works in the sketch below.
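A sketch of the multi-insert form; the US/Texas and CA/Ontario partitions match the record counts we verify below.

    FROM employees
    INSERT INTO TABLE emphive
      PARTITION (country = 'US', state = 'Texas')
      SELECT employee_id, first_name, last_name, email, hire_date,
             salary, department_name, city, street_address, postal_code
      WHERE country_abr = 'US' AND state_province = 'Texas'
    INSERT INTO TABLE emphive
      PARTITION (country = 'CA', state = 'Ontario')
      SELECT employee_id, first_name, last_name, email, hire_date,
             salary, department_name, city, street_address, postal_code
      WHERE country_abr = 'CA' AND state_province = 'Ontario';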


The above statement will again spawn multiple MapReduce jobs and throw several messages before it completes. Let's query our emphive table again to check what is there now.
It now shows 8 records as expected: five records for US/Texas and two for CA/Ontario, which we just loaded, and one record for DE/Bavaria, which we loaded earlier.
The above statement reads all records from the source table named employees, then each record is evaluated against each SELECT ... WHERE ... clause. If a record satisfies a given SELECT ... WHERE ... clause, it gets written to the specified table and partition. This statement is more powerful than it appears; each INSERT clause is independent, so it can insert into a different table, and some of the targets may be partitioned while others are not. Try it yourself.
Let's move further and learn dynamic partitioning. In my external table, I had six combinations of country/state. If I want to load all of the data into their respective partitions of my hive table, I have to write a long statement. This situation could be more daunting if I had a file with 100 such combinations, resulting in that many partitions. That's where dynamic partitioning comes to the rescue.
Dynamic partitioning in hive is controlled by configuration variables. I have used the hive SET command to display the current values of these variables.


All of the above variables are briefly explained in the table below. Their default values are displayed by the SET command above.
Variable Name                               Description
hive.exec.dynamic.partition                 true/false to enable/disable dynamic partitioning
hive.exec.dynamic.partition.mode            strict/nonstrict, as explained below
hive.exec.max.dynamic.partitions.pernode    Max number of partitions each mapper/reducer can create
hive.exec.max.dynamic.partitions            Max number of partitions allowed by a single statement
hive.exec.max.created.files                 Max number of files that can be created globally


All of the above variables are there to protect the system from being overwhelmed by faulty code. In my environment, hive.exec.dynamic.partition is already true, so I should be able to create dynamic partitions, but hive.exec.dynamic.partition.mode is strict. This will impose some restrictions on us (we will see them later), so let's change it to nonstrict using the command below.
             SET hive.exec.dynamic.partition.mode=nonstrict;
Now we are ready to demonstrate dynamic partitioning in hive. Before we start, let me warn you about a catch here; I will explain it after I explain the dynamic partitioning statement shown below.
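Here is a sketch of that statement; the column names follow the external table created above, with the partition columns last.

    INSERT OVERWRITE TABLE emphive
    PARTITION (country, state)
    SELECT employee_id, first_name, last_name, email, hire_date,
           salary, department_name, city, street_address, postal_code,
           country_abr, state_province
    FROM employees;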

The above statement will do several things.
  1. Remove all data from my table emphive, caused by the OVERWRITE clause.
  2. Read all records from the source table employees, caused by SELECT ... FROM.
  3. Create a new directory to represent a new partition/sub-partition for every country/state if the directory does not already exist. This effect is caused by the PARTITION clause.
  4. Create a new file to hold data for each specific partition inside the appropriate directory. This is the first time we see hive creating a file; until now it just copied files.
  5. Insert data into the appropriate file.
You can go back to the HDFS file system and check if you got 4 directories created for the four countries we have in our example data file. You can further check your country directories, and you will find subdirectories created for each state inside a country. You may execute some select statements with country and state conditions in the where clause to test your work.
Let's talk about that catch I mentioned. Did you already notice it? If yes, you are getting hive now. Anyway, the catch is those last two columns, country and state, in my statement above. They are necessary, at the end, in that specific order. You cannot exclude them or change their order. Hive assumes the partition columns are at the end of your select statement, in the same order as you have specified them in your partition clause.
One last thing about dynamic partitioning I need to mention is why we changed to non-strict mode using the set hive.exec.dynamic.partition.mode=nonstrict; command.
Strict mode puts a restriction on us: in this mode, we can't use full dynamic partitioning. What does that mean? It means I am forced to use at least one partition column as static. Let's look at the new statement if I have to work in strict mode.
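A sketch of the strict-mode version, with the leading partition value made static:

    INSERT OVERWRITE TABLE emphive
    PARTITION (country = 'US', state)
    SELECT employee_id, first_name, last_name, email, hire_date,
           salary, department_name, city, street_address, postal_code,
           state_province
    FROM employees
    WHERE country_abr = 'US';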

You can notice the difference. I am forced to make three changes.
  1. In the partition clause, I must specify at least one leading partition value, as I specified country='US'.
  2. As a result, I had to remove the country column from the select list, as hive assumes all records to be for country='US'.
  3. Further, I am forced to place a where clause in my select statement to restrict it to US only, because hive assumes all records to be for US.
This ends my discussion about loading data into hive tables.