Saturday, November 23, 2013

HQL - Hive Query Language

We have learned creating hive tables and loading hive data into those tables. Next step should be executing hive queries on those tables. I am not going to explain basics of SQL as I assume you are already familiar with SQL for relational databases. If you do not know SQL, step back and learn SQL basics before you start leaning hive query language. I am assuming that you have gone through all of my previous articles on this blog so I will not cover those concepts which I have already explained in other articles. If you are directly landing at this page first time, I recommend you to start reading from beginning to learn hive. I have followed a logical sequence for hive learning in this blog.
Editing HQL on hive CLI is troublesome. So we will write our HQL in a file and execute script from Hive CLI, in case of any errors or changes, we go back to our script and change it there and execute it again.
If you have already executed HQL on CLI, you must have noticed that Hive CLI throws lot of messages before printing query results. I feel little lost in those messages and want to disable it and see only query output on the screen. To do this we have to start hive in silent mode which you can do using hive –S to start hive CLI instead of using only hive command.
Let’s test it. I have a file named
q.hql which I have stored in my home directory. It has a simple select statement as shown in below screen.

Now, I will start hive in silent mode using hive –S and execute this file using hive CLI as shown in below screen.
Ok, now you know how to start hive in silent mode and how to execute hql from a source file.
Hive query language is also known as HQL. It mostly resembles with SQL syntaxes of MYSQL database but there are lots of differences. I need a suitable table, loaded with appropriate data to demonstrate HQL. For this purpose, I created an employee 100+ employee records. I have shared data file for your reference and download here. 
First step is to create an external table and map this data. I will not load it further into managed table and omit partitioning as our objective is to focus on HQL. You have already learned creating a partitioned managed table and loading data into it from external table. If you want, you can do it and execute all HQLs demonstrated here on your partitioned hive managed tables. You should not expect and differences in final output.
Command to create and external table is shown below.
You need to execute this command in hive CLI and copy data file in HDFS at location /user/pkp/hql before you start executing hql statements. Our table employee is mostly simple except three columns contacts, address and job_history.
Contacts column is a map which means it can store multiple key-value pairs like office phone, email, mobile etc. we will see it in more detail when we execute hql statements on them.
Address column is a struct with three members as city, street and zip. We will see how we can manipulate them in hql.
job_history is the most complex one. It’s an array which means it can contain multiple values starting with array index 0. Array and maps are similar in hive as they both can store multiple values. Difference is how they represent values. Array will use index starting from 0 whereas map will use key to identify value. If you don’t get it now, don’t worry, we you will get it well with examples shown below.
To make this array more complex and to showcase that hive support nesting of complex data types, I made it an array of struc. We will see how we can work with nested data types.
Let’s execute first and most simple hql statement, selecting only columns with primitive data type i.e. string and int in our case.

And output is as below.


Simple, no explanation required. Working with primitive data types is simple and most of your SQL skills will work here. Complex data types like maps, structure and array are the one we will focus and learn how to work with them.
Let’s select a map type column i.e. contacts for these two records an output is shown below.
You will notice following.
Hive returns them using JSON notation. You will see it same for other complex data types aslo.There are three map-key pairs for each record. These are office, email and mobile.This clarifies why I mentioned maps are similar to array because we can see that single column contacts actually contains three values. Does it not appear like an array of three elements?
How do I access an individual element of a map?  Let’s do it.


Okay, I used array notation to access single element but instead of using index number, I used key. I have used where clause in above example just to emphasize that your SQL skills will work mostly in hql. I will not mention such things further and you should notice it yourself.
Let’s select a struct column.

 

How do I access an individual element of a struct? Check below.



Let’s select an array column.



Did you notice that there are two elements in the array? An entire record is enclosed in [ ]. Array elements are enclosed in { }. If you look at the output there are two sets of { } separated by comma, each representing one array element. Let’s select only first element of the array in next screen.



Now you understand that there were two elements in the array and looking at only one element is simple as using array notation with index value.
Each element of the array is again a struct of three members as you can see in the above screen. Let’s select only one member of first element of the array.




I hope by now you learned how to access complex columns in hive and how to access if they are nested. I will leave you at this stage to explore your SQL skills in HQL.
Few notes before we conclude this post.
You can use where clause in HQL as you might be using in SQL. Most of the operators are same like =, !=, <, IS NULL, IS NOT NULL LIKE etc. You can also use mathematical functions like round, floor, ceil, pow, sqrt etc. You can use aggregate functions like count, sum, avg, min, max etc. HQL also supports having clause along with aggregate functions. HQL also supports string functions like length, reverse, upper, lower, substr, trim, concat etc. You can use CASE WHEN THEN clause also. You can use sub queries like you use in SQL. You can use order by and union as well.
Joins are also supported with some limitations (only equijoins with AND, no OR is supported). Hive goes left to right when joining multiple tables and WHERE clauses are evaluated after joins are performed.


In summary, most of you SQL skills will work here. Let me show another example of nested query syntax supported in HQL which you may find different than some of the RDBMS.



There are lot of other things to cover in HQL which we will cover individually later. Keep learning.