Saturday, November 16, 2013

Data Types in Hive

Before we get into the details of creating tables, loading data and executing queries to learn the power of Hive as a data warehouse, let's cover some basics. We have already seen what a database (or schema) is in Hive, but I will cover a few more details about it in this tutorial. Then we will look at Hive data types. So let's start.
The first concept related to Hive is the database. In Hive, a database is just a namespace for tables, used mainly to organize tables into logical groups. For Hive, SCHEMA and DATABASE are synonymous, which means you can use the keyword SCHEMA instead of DATABASE in all the database-related commands explained below.
The simplest syntax for creating a database is CREATE DATABASE <database_name>; which we have already seen earlier in getting started with Hive.
When you are writing a script, it is useful to avoid the error that occurs if the database already exists. To accomplish this, Hive supports the following syntax.
CREATE DATABASE IF NOT EXISTS <database_name>;
To get a list of databases that already exist, use the SHOW DATABASES; command. This command also supports regular-expression search using SHOW DATABASES LIKE 'pk*';
All databases are created under a top-level directory specified by the property hive.metastore.warehouse.dir. You can override this default using the LOCATION clause in the CREATE DATABASE command, as shown in the example below.
CREATE DATABASE pkdb LOCATION '/pkp/mydatabases';
You can use the DESCRIBE DATABASE <database_name>; command to describe a database. It shows the database directory location and a comment about the database, which you can attach using the COMMENT clause as shown in the example below.
CREATE DATABASE pkdb COMMENT 'My default database';
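For instance, after creating pkdb with the comment above, you can describe it. This is only a sketch; the exact columns returned vary slightly between Hive versions, and the directory shown in the comment is just an illustration of what to expect.
DESCRIBE DATABASE pkdb;
-- returns the database name, the comment 'My default database',
-- and its directory under the warehouse, e.g. .../warehouse/pkdb.db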
A comment is a nice way to provide in-place documentation. Similarly, you can associate key-value properties with the database, which are displayed if you use the DESCRIBE DATABASE EXTENDED <database_name>; command. The syntax for associating key-value pairs is shown in the example below.
CREATE DATABASE pkdb WITH DBPROPERTIES ('Created_By' = 'Prashant', 'Date' = '2013-01-01');
The USE <database_name>; command makes a database the current working database.
To drop a database, we can use the DROP DATABASE IF EXISTS <database_name> CASCADE; command. The IF EXISTS clause is optional and suppresses the warning if the database does not exist. CASCADE is also optional: if present, it drops all tables in the specified database first; if absent, Hive will not allow you to drop the database unless it is empty.
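For example, to remove the pkdb database created above together with any tables it contains:
DROP DATABASE IF EXISTS pkdb CASCADE;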
As of now, Hive does not support altering a database after creation, except for its DBPROPERTIES. You can change the key-value pairs associated with your database using the ALTER command as shown in the example below.
ALTER DATABASE pkdb SET DBPROPERTIES ('Date' = '2013-01-02');

Hive Data Types


Unlike most popular relational databases, Hive supports collection data types in addition to the usual primitive data types. Most relational databases take full control of how data is stored on disk, but Hive leaves that control with us, which makes it easier to process data generated by a variety of tools. We will get into those details later; for now let's focus on data types.
Primitive Data Types
Hive supports four integer types, two floating-point types, Boolean, string, timestamp and binary data types, as shown in the table below. For readability I used uppercase, but the names are case insensitive. Since all of them are implemented internally in Java, you can expect their behavior to match the corresponding Java types.
Data Type    Size / Description                  Example
TINYINT      1 byte                              10, -10
SMALLINT     2 bytes                             10, -10
INT          4 bytes                             10, -10
BIGINT       8 bytes                             10, -10
FLOAT        Single-precision float              10.8932
DOUBLE       Double-precision float              10.8932
BOOLEAN      Boolean true or false               TRUE
STRING       Sequence of characters              'Sample string'
TIMESTAMP    Integer, float or string value      129357385, 8929245.879395, '2013-01-01 12:00:00.123456789'
BINARY       Array of bytes

TINYINT, SMALLINT, INT and BIGINT are the four integer data types; the only difference between them is their size. FLOAT and DOUBLE are the two floating-point data types. BOOLEAN stores true or false values.
STRING stores character strings. Note that in Hive we do not specify a length for STRING as in other databases; it is variable in length and therefore more flexible.
TIMESTAMP can be an integer, which is interpreted as seconds since the UNIX epoch; a float, where the digits after the decimal point are fractions of a second up to nanosecond precision; or a string, which is interpreted according to the JDBC date string format, i.e. YYYY-MM-DD hh:mm:ss.fffffffff. The time component is interpreted as UTC, and Hive provides functions to convert to and from specific time zones.
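As a small illustration of such conversions, the sketch below uses Hive's from_utc_timestamp and to_utc_timestamp functions; some_table stands for any existing table (older Hive versions require a FROM clause even for constant expressions), and the time zone ID is just an example.
-- convert a UTC timestamp to Indian Standard Time and back
SELECT from_utc_timestamp('2013-01-01 12:00:00', 'Asia/Kolkata'),
       to_utc_timestamp('2013-01-01 17:30:00', 'Asia/Kolkata')
FROM some_table LIMIT 1;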
BINARY is used to store raw bytes that Hive will not interpret, which makes it suitable for arbitrary binary data.
If you need to cast one type to another, such as a string to an integer, you can use Hive's CAST function, e.g. CAST(str AS INT).
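For example, a minimal sketch (again, some_table is only a placeholder for any existing table):
-- cast a string literal to an integer and use it in arithmetic
SELECT CAST('100' AS INT) + 1 FROM some_table LIMIT 1;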
Collection Data Types
Unlike most relational databases, Hive supports columns of the collection types STRUCT, MAP and ARRAY.
If you are familiar with the C language, you already know what a STRUCT is; a Hive STRUCT is analogous to a struct in C, with named fields.
A MAP is a collection of key-value pairs, where a value is looked up by its key.
An ARRAY is an ordered sequence of elements of the same data type, accessed using a zero-based index. A small access example is sketched below.
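Here is a minimal sketch of how such columns are declared and accessed; collection_demo is a hypothetical table used only for illustration, and the real employees example follows shortly.
-- a hypothetical table with one column of each collection type
CREATE TABLE collection_demo (
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, zip:INT>
);

-- ARRAY elements are accessed by index, MAP values by key,
-- and STRUCT fields by dot notation
SELECT subordinates[0], deductions['PF'], address.city
FROM collection_demo;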
Hive is clearly very flexible in offering collection data types, but a genuine concern is how such data is physically stored in flat files. Hive puts that flexibility in our hands as well. Let's take an example to understand it.
Assume an employees table with the structure below.
CREATE TABLE employees (
  name          STRING,
  salary        FLOAT,
  subordinates  ARRAY<STRING>,
  deductions    MAP<STRING, FLOAT>,
  address       STRUCT<street:STRING, city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Note that \001 is the octal code for ^A, \002 for ^B and \003 for ^C. For readability, we will use the caret notation instead of the octal codes in the explanation that follows. Let's assume one record with the values described below.

So, we have an employee named Prashant with a salary of 10000. He has two subordinates, Purna and Sudarshan. From his monthly salary there are two deductions: 750 for PF and 1000 for income tax. His address is street = Marathalli, city = Bangalore and zip code = 560037.
Let's build this record in the file format described by the table structure above.
Fields are terminated by ^A. We have five fields: Name, Salary, Subordinates, Deductions and Address. So the record looks like this:
Name^ASalary^ASubordinates^ADeductions^AAddress
Let's substitute actual values for the first two fields, since they are straightforward.
Prashant^A10000^ASubordinates^ADeductions^AAddress
Now, there are two subordinates, and we know that collection items are delimited by ^B, giving the record below. The same applies to the parts of the address (the STRUCT fields).
Prashant^A10000^APurna Dash^BSudarshan Raju^ADeductions^AMarathalli^BBangalore^B560037
Deductions is a bit more complex than the others. The deductions column has two collection items (PF=750 and IT=1000), and each item is a key-value pair of the MAP. Keys are separated from values by ^C (PF^C750 and IT^C1000), and the items themselves are delimited by ^B, giving the record below.
Prashant^A10000^APurna Dash^BSudarshan Raju^APF^C750^BIT^C1000^AMarathalli^BBangalore^B560037
There are other file formats, but for now you can consider this to be how data in a Hive table is stored: a series of records in the above format, each terminated by a newline, as specified in the LINES TERMINATED BY '\n' clause of CREATE TABLE.
All the delimiters used in the above example are Hive's defaults; if you do not specify them, Hive assumes these values. You are free to choose different delimiters for each table.
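To round this off, here is a hedged sketch of loading such a delimited file into the employees table and querying its collection columns; the HDFS path /pkp/employees.txt is just an illustrative placeholder.
-- load a text file containing records in the format described above
LOAD DATA INPATH '/pkp/employees.txt' INTO TABLE employees;

-- pull individual elements out of the collection columns
SELECT name,
       subordinates[0],     -- first element of the ARRAY
       deductions['PF'],    -- value for the key 'PF' in the MAP
       address.city         -- a field of the STRUCT
FROM employees;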
That kind of flexibility and looseness is what makes Hive powerful in utilizing data generated by various ETL tools.