Tuesday, November 19, 2013

Hive External Tables

So far, we have seen Hive managed tables, both un-partitioned and partitioned. We loaded data into them and saw that Hive creates the appropriate directories and copies the files into them. When we drop those tables, Hive removes both the directories and the data files. That's why they are called Hive managed tables: Hive owns the data.
It is very common, in fact more often the case, that we need to use data within Hive which is owned by some other tool. In that situation, the data file already exists in an HDFS directory, but it is owned and used by some other tool such as Pig. With a managed table, we would be doing the unnecessary work of copying that data into the Hive warehouse. We just want to map that data into Hive, run some queries, prepare output, then remove the mapping and leave the data in its original place. Hive provides external tables for that purpose. When you work with Hive external tables, always remember that Hive assumes it does not own the data or the data files, and it behaves accordingly. This means that when you drop an external table, Hive removes the metadata about the external table but leaves the table data as it was.
Let's start. Check the screenshots and explanations below.

To demonstrate external tables, I created a directory named kar-data in the HDFS file system under my home directory. I created two files, kar-data.txt and kar-data2.txt, each containing two records, hence 4 records in total. I moved both files into the kar-data directory in HDFS. You can see in the screen above that these two files are present in the HDFS /user/pkp/kar-data directory. If you are wondering what data is inside these files, I have placed a screenshot below for you. It's plain text with comma separated values.
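Since the screenshots are not reproduced here, a rough sketch of the setup from the hive CLI looks like this. Only the paths and file names come from the post; the exact file contents and column layout are my assumption:

```sql
-- Run from the hive CLI; dfs passes commands through to HDFS.
dfs -mkdir -p /user/pkp/kar-data;
dfs -put kar-data.txt  /user/pkp/kar-data/;
dfs -put kar-data2.txt /user/pkp/kar-data/;
dfs -cat /user/pkp/kar-data/kar-data.txt;
-- Each line is a comma separated record, hypothetically something like:
-- 101,Ravi,Bangalore
-- 102,Anita,Mysore
```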


As we know, external tables in Hive are created to map existing data files, so I set up this situation where I have existing data files in a specific directory. Let's create an external table to map them. Check the screen below for the commands and syntax.

I created an external table using the create external table command. Notice the location clause at the end, specifying '/user/pkp/kar-data', where Hive should expect the actual data. Finally, I executed a select statement on this table and got 4 records, as expected.
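With the screenshot unavailable, the statement would look something like the following. The table name, column definitions, and delimiter are my assumption; the location path is from the post:

```sql
-- Hypothetical column layout; only the LOCATION path is from the post.
CREATE EXTERNAL TABLE kar_data (
  id   INT,
  name STRING,
  city STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/pkp/kar-data';

-- Maps both files under the directory; returns all 4 records.
SELECT * FROM kar_data;
```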
For this demonstration, we created the data files and placed them into the /user/pkp/kar-data directory first, and then created an external table specifying this path in the location clause. But that ordering is not a requirement from Hive's perspective. In fact, create external table doesn't care whether the directory or the files exist yet. The only implication is that your select statement will not return any records. You can create the directory and place the files later; as soon as the files are available in the right place, your queries will start returning results.
Let's look at partitioned external tables. Assume you have customer data from across the globe, well organized into country/state/city files. You want to create an external table and map all of those files into appropriate partitions. You might think you have to create the external table and provide the list of all partitioned files at once; that would be too big a statement to write. So instead we will create an external partitioned table without specifying a location for any data file. Look at the command shown in the screen below. It creates an empty partitioned external table.
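Again the screenshot is not shown here, so as a sketch: the regular columns are my assumption, while the table name and the country/state partition columns are from the post:

```sql
-- No LOCATION clause: the table is created empty, with no data mapped yet.
CREATE EXTERNAL TABLE customers (
  id   INT,
  name STRING,
  city STRING
)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```

Note that country and state appear only in the partitioned by clause, not in the column list.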

Once we have the table, we can use alter table to add partitions as and when required. Suppose I need to analyze only a few partitions; I will add only those and ignore the others. In the screenshot below, I added only one location, for india/karnataka. At this location I have two files, and both are now mapped to my customers table. I fired a select statement, and you will notice that the values in the last two columns are india and karnataka. Note that my data files do not actually contain these values, and the table structure shown in the screen above has no columns for country and state either. They appear only in the partitioned by clause, hence they are just metadata about the table. Hive knows that all data in this directory pertains to india/karnataka, and that's why it appends these two columns at the end of the query result. This behavior is an important concept to learn, and we will see why in future posts.
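The command in that screenshot would look something like this. The partition location is my assumption, reusing the /user/pkp/kar-data directory from earlier in the post:

```sql
-- Map the existing directory to the india/karnataka partition.
ALTER TABLE customers ADD PARTITION (country='india', state='karnataka')
LOCATION '/user/pkp/kar-data';

-- country and state values come from the partition metadata, not the files,
-- so Hive appends them as the last two columns of each result row.
SELECT * FROM customers;
```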
 

Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted. Try it yourself.
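To try it yourself, something along these lines should do, assuming the customers table and the directory used in this post:

```sql
DROP TABLE customers;        -- removes only the table metadata
dfs -ls /user/pkp/kar-data;  -- the data files are still in HDFS
```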
The next obvious question would be: can I drop a partition? Yes, of course. Use SHOW PARTITIONS <table_name>; to list all of your partitions, and when you know which partition you want to delete, use a statement like the one below.
ALTER TABLE customers DROP IF EXISTS PARTITION (country='india', state='karnataka');
You already know IF EXISTS is optional here.
In the above example we added just one partition after creating the external table. If you have several partitions to add, you don't need to write multiple alter table statements; a single statement can add them all.
ALTER TABLE <table_name> ADD IF NOT EXISTS
PARTITION (country='india', state='karnataka') LOCATION '/user/pkp/kar-data/'
PARTITION (country='india', state='Andhra') LOCATION '/user/pkp/andhra-data/'
PARTITION (country='india', state='Kerla') LOCATION '/user/pkp/kerla-data/'
...;
This ends our discussion about external tables in Hive.