Monday, December 16, 2013

Apache Sqoop – An Introduction

Hadoop as a platform gives us a framework to interact with data in parallel using the powerful MapReduce framework. Hive and Pig give us high-level tools to perform analysis and transformations very efficiently on very large sets of data. But to do all this, the data needs to be available in HDFS. Data comes from various sources, and most of our traditional data lives in RDBMS systems. We need to bring it into HDFS to perform analysis, and we may further need to send final results or transformed data back to the RDBMS to take advantage of RDBMS capabilities.

There are various ways to move data back and forth between an RDBMS and HDFS. You can write scripts or small programs using many of the available technologies to extract data from your RDBMS and vice versa. But this becomes a repetitive task, and you have to handle challenges like connectivity, native RDBMS syntax and efficiency. This was the motivation for an open-source tool to simplify the process, and the answer was Sqoop. Sqoop is short for SQL to Hadoop.

Up to Sqoop 1.4.4 it is a simple command-line tool, and Sqoop2 is bringing a web interface as well. You can do a lot with Sqoop. For example, you can import (RDBMS to HDFS) one table or all tables, use SQL queries to import data, and perform incremental imports. You can export (HDFS to RDBMS) data, update an existing data set, merge (i.e., update and insert together), or use stored procedures instead of insert statements. But the bottom line is that Sqoop is a tool to handle data movement between an RDBMS and HDFS.
We will see some simple examples to get started with Sqoop. For demonstration, I am using HortonWorks Sandbox 2.0. If you want to play along and don’t have any Hadoop environment, you can get the sandbox. Steps to set up this box are shared here.
HortonWorks Sandbox comes with MySQL preinstalled, so let’s use it to create one table and load data from that table into HDFS.
As shown in the below screen, once you have logged in to the sandbox, just type mysql and you will get the MySQL prompt.
At the MySQL prompt, I typed the show databases; command and it shows that we have two databases. Let’s not touch information_schema. We will use the test database, as shown in the next statement. This database does not have any tables, so we have to create one sample table.
As shown in the below screen, I have created one table named mysql_data and inserted two records into it.
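If you prefer to type the statements yourself, they are roughly as follows. The database name test and the table name mysql_data come from the steps above; the columns and sample values are just my own illustration, so feel free to use your own.

    mysql> show databases;
    mysql> use test;
    mysql> create table mysql_data (id int, name varchar(50));
    mysql> insert into mysql_data values (1, 'first record');
    mysql> insert into mysql_data values (2, 'second record');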



We will import this table into HDFS. The data for my example is very small, but it’s good enough to show how Sqoop works. The sandbox comes with all the basic drivers and connectors Sqoop needs to connect to MySQL, so we do not need anything extra. We are ready to trigger the Sqoop command as shown below.
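In plain text, the command is essentially the following. I am assuming MySQL is running locally on the sandbox; adjust the host and database name in the connection string if yours are different.

    sqoop import --connect jdbc:mysql://localhost/test --table mysql_data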
The above command is the simplest form of a Sqoop command to import data from a MySQL database into the HDFS file system. import is the Sqoop subcommand that instructs Sqoop to start an import into HDFS. The next part is --connect, which specifies the JDBC connection string. Finally, we specify the table name to be imported. Let’s first check what is created by this command before we get into further details.
Once executed, the above command will create a new directory in my HDFS home directory. The name of this new directory will be the same as the table name. Let’s check what is created in this directory.
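You can inspect the result with the usual HDFS commands, for example the two below. The part-m-* files follow Sqoop's usual naming for map output; your exact file names may differ.

    hadoop fs -ls mysql_data
    hadoop fs -cat mysql_data/part-m-*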
Sqoop has created two files, each containing one record from the table. That’s why I mentioned that my test data is very small but good enough to explain what is happening inside Sqoop. Sqoop generates a Java program to move data from MySQL into HDFS. This Java program is nothing but a mapper, and Sqoop launches multiple map tasks to move data in parallel. In this case there were only two mappers, and each created a separate file in the same directory. We will discuss this again in more detail.
Let’s execute another example to import data from Oracle. If you already have an Oracle instance, you can use it. If you do not, I have detailed the steps below to install Oracle XE in your HortonWorks sandbox.
Download Oracle Express Edition (XE) from the below link.
You need to download Oracle XE for Linux x64, as shown in the below screenshot.
You will get a zip file, which you need to move to your Linux VM. I used WinSCP to move the file from my laptop to the Linux VM. You can use the same or any other tool of your choice.
After copying the compressed installer to your VM, you need to unzip it as shown in the below screen. It will create a new directory, Disk1, with various other subdirectories inside it.
Change your present working directory to Disk1 and execute the installer using rpm -ivh. Check the below screen if you have any doubts.
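The commands boil down to the following; the rpm file name here is the one for the 11.2.0 x64 download, so yours may differ if you picked another version.

    unzip oracle-xe-11.2.0-1.0.x86_64.rpm.zip
    cd Disk1
    rpm -ivh oracle-xe-11.2.0-1.0.x86_64.rpm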
I performed all these activities as the root user. Once the rpm completes, your installation is done. You will get a message to execute oracle-xe configure, as shown in the above screen, to complete the post-installation steps. Go ahead and execute it. This will configure Oracle XE and ask you a few questions, like port numbers and passwords. Just accept all the default port numbers and give an appropriate password.
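On my box the configure script is registered as an init script, so the exact command (run as root) is along these lines:

    /etc/init.d/oracle-xe configure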
It may take 10 to 15 minutes to complete the configuration. Once configuration is complete, you need to set the necessary environment variables. Oracle XE provides a script, oracle_env.sh, to set them up. This script is placed in the /u01/app/oracle/product/11.2.0/xe/bin directory. You can copy this script into the /etc/profile.d directory so that it is automatically executed whenever any user logs in to the system.
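That copy is a one-liner; source the script once in your current shell if you do not want to log out and back in.

    cp /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh /etc/profile.d/
    . /etc/profile.d/oracle_env.sh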
We are done with the Oracle XE installation. Let’s log out of the root user and log in using your working credentials. Check the below screen: after logging in, I executed the sqlplus system command to log in to my Oracle XE database using the system user account. I gave the password which I supplied during installation and finally got the SQL prompt.
 
The next screen shows the command to create a new user with the password hadoop. I gave the DBA role to this user.
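For reference, these are standard Oracle SQL statements; the user name pkp matches the credentials used in the next step.

    SQL> create user pkp identified by hadoop;
    SQL> grant dba to pkp;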
Now I will connect to Oracle XE again using the pkp/hadoop credentials, create a sample table and insert two records into it. The commands to do all this are shown in the below screen.
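If you are typing along, it looks roughly like this. Only the pkp/hadoop credentials come from the steps above; the table name and columns are just my own example.

    SQL> connect pkp/hadoop
    SQL> create table oracle_data (id number, name varchar2(50));
    SQL> insert into oracle_data values (1, 'first record');
    SQL> insert into oracle_data values (2, 'second record');
    SQL> commit;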
At this stage, my source table is set up and I am ready to import data from Oracle to HDFS. But I have to get the JDBC driver for Oracle. The JDBC driver for MySQL was already present in the sandbox, so I did not have to perform this step earlier, but the JDBC driver for Oracle is not present in my sandbox. It is freely available for anyone to download. You can download the JDBC driver for Oracle from here and place it in the Sqoop library directory, which is /usr/lib/sqoop/lib. The below screen shows that I have already placed ojdbc5.jar in the appropriate place.
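Placing the driver is just a copy into the Sqoop lib directory, run from wherever you downloaded the jar:

    cp ojdbc5.jar /usr/lib/sqoop/lib/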
We are ready to execute the Sqoop command for the import. The command to complete the import is shown below. You will notice that the JDBC connection string has changed to reflect the Oracle-specific connection string.
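Spelled out, the command is along these lines. I am assuming the default Oracle XE SID (XE) and port (1521); the table name is the sample one I created above, written in upper case because Sqoop generally expects Oracle table names that way.

    sqoop import --connect jdbc:oracle:thin:@localhost:1521:XE --username pkp --password hadoop --table ORACLE_DATA -m 1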
I had to provide a username and password as well, because Oracle does not allow access without proper credentials. You will also notice that I gave one extra parameter, -m 1. What is that? This option tells Sqoop to create only one mapper, since we know our table is very small and we don’t need multiple mappers to do the import in parallel.
As expected, the above Sqoop command will again create a directory, as shown below. You will notice that this time we have only one file created by Sqoop, which contains both records from the table. This happened because we restricted Sqoop to a single mapper.
We have seen something in action; now it’s time to understand how a Sqoop import works at a high level. It’s a two-step process.
  1. Sqoop connects to the RDBMS to gather the necessary metadata about the data being imported. In my examples above, I just gave Sqoop a table name, but to accomplish the task Sqoop needs to know the column names, their data types, sizes and so on. That is the metadata Sqoop gathers from the RDBMS in the first step.
  2. In the second step, Sqoop submits multiple map-only Hadoop jobs to your cluster. These jobs are responsible for the actual data transfer. They are Java mapper jobs that are already aware of the metadata and the source RDBMS; each one connects to the RDBMS and copies its part of the data into a separate file in HDFS. By default, fields in these files are comma-delimited and rows are terminated by a newline, but you can easily override these defaults, as shown in the example after this list. In fact, you can directly move data from the RDBMS into Hive or HBase tables.
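For instance, the import options below (all standard Sqoop options) change the field delimiter and load the data straight into a Hive table instead of plain HDFS files; the connection string and table are the ones from the MySQL example above.

    sqoop import --connect jdbc:mysql://localhost/test --table mysql_data --fields-terminated-by '\t' --hive-import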
This is just the beginning of Sqoop, and there is a lot more to learn. Sqoop 2 is coming with more power and flexibility. If you are interested, here is a fantastic video about Sqoop 2 features directly from the Sqoop developers.
If you followed this article, you now have an environment where you can play with Sqoop commands and options. Keep learning.