Sunday, November 17, 2013

Managed Tables in Hive & Portioned Tables in Hive

Syntax for creating a table is fairly simple and may not need any explanation. Tables which we create using method explained below are called Hive managed tables which means if you drop the table, hive will remove table and data both.
CREATE TABLE IF NOT EXIST  pkdb.employees(name STRING                               COMMENT ‘Employee Name’,
                                                                                   salary FLOAT                                COMMENT ‘Employee Salary’,
                                                                                   subordinates         ARRAY<STRING>            COMMENT ‘Name of Subordinates’,
                                                                                   deductions            MAP<STRING, FLOAT>   COMMENT ‘Deductions’,
                                                                                   address                  STRUCT<stree:STRING,
                                                                                                                                  city:STRING,
                                                                                                                                  zip:INT>              COMMENT ‘Home Address’
                                                                                 ) COMMENT ‘Employee Table’
                                                   TBLPROPERTIES(‘Created_By’=’Prashant’,’Created_Date’=’2013-01-01’);
Few Notes
  1. IF NOT EXISTS is optional clause. IF NOT EXISTS does only a name check ignoring structure.
  2. You can optionally prefix database name if you want to create table in other than current database.
  3. You can optionally give comments to columns.
  4. You can optionally give comments to table.
  5. You can optionally attach key-value pairs as table properties.
  6. Hive automatically adds two table properties: last_modified_by which holds username of the last user who modified it and last_modified_time which holds time of that modification.
  7. You can optionally add other clauses which we learned in previous lessons like ROW FORMAT DELIMITED, LINES TERMINATED BY and LOCATION etc.
Suppose you already have several tables in your database, you can use SHOW TABLES command to list all the tables in your current database or you can use SHOW TABLES IN <database_name> to list all tables in specified database name.
You can use DESCRIBE EXTENDED <database_name>; command to show details about the table. As you might have guesses extended is optional.
You can drop table using DROP <table_name>; if table is present in other than your current database, you can prefix database name with table as pkdb.mytable.
Now you know how to create table, let’s create a table and load data into it. We will take a simple example to understand how data is physically stored in hive managed tables.
I have a database named pkdb which is empty as of now as shown in below screen.
Let’s re check it using show tables; command. As shown in below screen, there are no tables. I will create a new table named employees. I have overridden default record delimiters and changed. Fields will be terminated by comma, collection items are terminated by colon, map key-value pairs are terminated by hyphen. Record is terminated by a new line character. Table is created and we are ready to load data into it.
Before we load data into employees table, let’s go to our database directory and check what’s created their as result of our create table statement. You will find a new directory named as employees created under pkdb.db directory. We don’t have files as there is no data in our table.
We need a data file to load it into our table. I have created a sample file with just two records to test data load into hive table. You will notice that, it’s a plain text file and record structure is in compliance with record structure which we defined while creating table i.e. fields are terminated by comma and so on. For your reference, I have displayed content of my file in below screen.
It’s time to load data from file to hive table. Command is very simple as shown in below screen.
Data is loaded into employees table. Let’s execute a select statement to ensure that table is now holding data.
So, we have created a table in hive and loaded data into it. We executed a select statement and getting results as well. Let’s go back to our database directory and check what is created there.

Simple, we got same table copied into my HDFS database directory. Let’s drop our table and check, does it removed directory and data file as well. I hope you will check it yourself.
Next important thing to understand about hive is partitioned table. Let’s assume we are a global company and we have millions of customers across the globe. We have large data available about our customers from all over the world and we want to load it into a data warehouse. Combined data file is of many terabytes or may a couple of petabytes. You can imagine hod your quries will perform in such situations if you have not partitioned your table smartly and your queries does not include appropriate partition filters. If you have any experience in writing database queries on very large databases, probably you are already aware with such problems. This makes partitioning an basic feature for any database table no matter its apache hive or any other database. However, hive implementation of table partitioning is completely different than most of other databases. We will understnd it using an example. I will use very small data size for this dimonstration but to explain concept behind it, size does not matter.
Let’s create a customers table which is partitioned on two keys, country and state. Syntax for creating partioned table is shown in screen below. You will notice that I haven’t created any column for country and state. There is a reason for this. Since country and state information is encoded into my partition, why do I need to repeat that data into table? That’s redundent data so we do not create country and state columns in our table. Any partition itself will be able to provide that information and all data within specific partion belongs to same value.
After creating a partitioned table, I loaded data into this table for one partition only i.e. country=india and state=karnatka. There are two things to note here. First, I need to tell hive which partition I am loading using
partition(country=’india’,state=’karnatka’) clause. Second, my table kar-data.txt will only have data for india/karnatka. Data is loaded and now you are readu to execute your queries on this table. Partitioned table does not place any restriction on querying data, you can still execute your queries for entire world, but if you query for specific partition, hive need not to read all partitions making your query faster.



Now, we need to go back to our database directory as always to check what is created there. Check below screen and you will notice following.
  1. My database ‘pkdb’ is actually a directory named as ‘pkdb.db’.
  2. My table ‘customers’ is again a directory inside ‘pkdb.db’ directory.
  3. My country partition for India is also a directory named as ‘country=india’ under ‘customers’ table directory. This means if I load data for Japan, hive will create another directory named as ‘country=japan’ under ‘customers’ table directory. Isn’t it simple?
  4. My sub-partition for Karnataka is again a directory named as ‘state=karnataka’ under my main partition for India.
  5. Finally, my data file is placed under my sub-partition for karnatka.
This makes life simple for hive. If I want to read data for entire India, hive will read all files under directory ‘country=india’ and similarly if I want to read data only for Karnataka, hive will read all files under directory ‘state=karnatka’.
You might think why i said “all files under directory ‘state=karnatka’” when we want to read all data for Karnataka. That should be just one file with all data for Karnataka. That’s a catch. You know HDFS is a distributed file system and MapReduce is distributed processing framework. Hive is built on top of these two to utilize their power. This brings some limitations which had to be handled by hive. HDFS is essentially designed for write once and read several times. So if I load some more data for Karnataka, it won’t update old file, instead it will create another file in same directory. And when hive has to read all data for Karnataka, it has to read both the files.
Do you see any problem in this? You might be thinking about impact on performance. If you do, then you might still don’t understand what MapReduce has to offer as parallel processing framework and how your files are actually broken into blocks which are sprinkled across various DataNodes in hadoop. This directory and file are just a namespace which is managed at NameNode. Read my posts about HDFS and MapReduce to understand it correctly.