Thursday, November 14, 2013

Hive Getting Started

In my previous post, we saw how we can execute MapReduce jobs using Java. Java is the most flexible and powerful method for doing all MapReduce tasks, but it requires a lot of time and engagement. There is a lot that is repetitive in the data analytics process, and hence an opportunity for a high-level tool to accomplish those things easily while hiding all the complexity inside. That’s where Hive comes in.
It provides a familiar model for those who know SQL and allows them to think and work from a database perspective. When commands and queries are submitted to Hive, they go to the driver. The driver compiles, optimizes and executes them as a series of MapReduce jobs.
It may seem obvious that the driver generates Java MapReduce jobs internally, but that is not the case. Hive has generic Mapper and Reducer modules which operate based on information in an XML plan file.
When we create a table, our table schema and other system metadata are stored in a separate metastore. This metastore is a traditional relational database, usually MySQL.
Hive gives a nice and quick start to those who are familiar with SQL, and an easy high-level tool for everyone to accomplish data analysis.
Let’s get started with hive.
In case you do not have the HortonWorks HDP Sandbox set up, or you are new to the HDP sandbox, I recommend going through at least the posts below first.
You need to log in to your virtual box using the root user and the password hadoop. It is not advisable to work as the root user, so I created a new user for myself to work with Hive. In Linux you can use the useradd user_name command to create a new user. After creating the new user, you must set a password for it using the passwd user_name command. Once the new user is created, log off the root user and log in using the credentials you just created.
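The user-creation steps above can be sketched as follows; run them as root, and note that hiveuser is just a made-up example name:

```shell
useradd hiveuser    # create a new Linux account (hiveuser is a placeholder name)
passwd hiveuser     # set its password; you will be prompted to type it twice
```

After this, log out of root and log back in as the new user before starting Hive.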
Follow the screens and explanations given here.
The first command I executed is hive. This command starts the Hive command-line interface, and your prompt changes to hive>. Once you see this prompt, you are in the Hive CLI and ready to execute Hive commands. Starting Hive Command Line Interface - CLI
If you are familiar with other relational databases like Oracle, MySQL or MS SQL Server, you must be aware of the concepts of database and schema. In Hive, database and schema are synonymous; both of them are actually just namespaces. They simply provide a method for organizing tables into logical groups. This grouping is valuable on large clusters where multiple people work in a team, as it avoids table name conflicts. The next command I executed in the above screenshot is a Hive CLI command called set. In Hive, the set command is used to set or display variables. We will talk about it in more detail later. For now, I have set the configuration variable hive.cli.print.current.db to true. Once this variable is set to true, the Hive prompt also displays the current database you are working in. You must have noticed that after the set command is executed, the prompt changes from hive> to hive (default)>. In this case, we are working in the default database, which is now displayed as part of the prompt.
The next command is executed to demonstrate that we can also use the set command to display the value of a variable. In this case, I displayed the value of the variable hive.metastore.warehouse.dir. This is another configuration variable in Hive, which stores the directory location where Hive will create all my databases and tables. We will demonstrate it in detail further down.
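The two uses of set described above can be sketched like this (the warehouse path shown is the HDP sandbox default):

```sql
-- Setting a variable: show the current database in the prompt
set hive.cli.print.current.db=true;

-- Displaying a variable: where Hive stores databases and tables
set hive.metastore.warehouse.dir;
-- prints: hive.metastore.warehouse.dir=/apps/hive/warehouse
```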
When we start the Hive CLI using the hive command, it looks for a file named
.hiverc in your home directory. If a .hiverc file is found, the CLI executes all commands placed in this file. Yes, you are right: you can place your set hive.cli.print.current.db=true; command in this file so that every time you start the CLI, it shows your current database in the prompt.
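A minimal .hiverc along those lines might contain just:

```sql
-- ~/.hiverc : the Hive CLI runs these commands at every startup
set hive.cli.print.current.db=true;
```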
But I don’t want to use the default database, so let’s create a new database using the create database database_name; command as shown in the screenshot below.
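For the database used in this walkthrough, the command looks like:

```sql
create database pkdb;
```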
You can use the
describe database database_name; command to describe your database. You can see in the screen below that it shows a URI, hdfs://sandbox:8020/apps/hive/warehouse/pkdb.db.
You must have noticed that
/apps/hive/warehouse is the location we saw as the output of the set command on a previous screen. So my database pkdb (automatically suffixed with .db) is created under this directory in the HDFS file system.
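Putting the two commands together (output trimmed; your sandbox host name may differ):

```sql
describe database pkdb;
-- pkdb    hdfs://sandbox:8020/apps/hive/warehouse/pkdb.db
```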
With the next command,
use database_name;, I changed my current database from default to the newly created pkdb database, and hence the Hive prompt changed accordingly. Now I can place the use pkdb; command in my .hiverc file so I am always using my own database instead of default.

Let’s go to the /apps/hive/warehouse directory in the Hadoop file system to check what was created there. In the screen below, I have simply listed the contents of this directory in the Hadoop file system. I can see that pkdb.db has been created as a directory under this location.
Great, which means that, in Hive, a database is nothing more than a directory.
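From the sandbox shell (not the Hive CLI), the check above looks roughly like this:

```shell
# List the Hive warehouse directory on HDFS
hadoop fs -ls /apps/hive/warehouse
# among the entries you should see the database directory:
#   /apps/hive/warehouse/pkdb.db
```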

Let’s create a table in our newly created database. The method for creating a table is almost the same as in other databases; we can do it using create table table_name (column_name column_data_type); as shown in the screen below. The table is created, and I fired a select statement on it. I do not get any records from the table because there is no data in it yet, but isn’t it simple? If you already know SQL, it’s just a matter of days for you to learn Hive. Keep reading.
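As a sketch, with employee as a made-up table name (the screenshot may use a different one):

```sql
-- create a simple two-column table; employee is a hypothetical name
create table employee (id int, name string);

-- returns no rows, since the table is empty
select * from employee;
```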

Now, since I have a table, I want to go back to the HDFS file system and check what was created in my database for the new table. You will be surprised to see that it is again a directory.
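Again from the sandbox shell, assuming the hypothetical employee table from above:

```shell
# List the database directory on HDFS
hadoop fs -ls /apps/hive/warehouse/pkdb.db
# the table shows up as a directory:
#   /apps/hive/warehouse/pkdb.db/employee
```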

Great, now you have learned how to get into Hive. Let’s drop the table and database we created.
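The cleanup can be sketched as follows (employee is the hypothetical table name used earlier):

```sql
drop table employee;   -- removes the table directory from the warehouse
use default;           -- switch away before dropping the database
drop database pkdb;    -- removes the pkdb.db directory
```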

One last thing: I mentioned in the beginning that, for Hive, database and schema are synonymous. That means you can use create schema schema_name; instead of create database database_name;, and the result of both commands is the same.
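For example, with a made-up name pkdb2, both of these lines create the same namespace:

```sql
create schema pkdb2;     -- identical in effect to: create database pkdb2;
drop schema pkdb2;       -- likewise identical to: drop database pkdb2;
```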

We will catch up again on Hive in more detail in the future.
Keep reading…..Keep learning…..Keep growing.