Monday, November 4, 2013

Using the HDFS File System




In one of my previous posts, we looked at HDFS as a distributed file system where we can do everything we do with other file systems: create directories, create files, copy them, move them, rename them, delete them, and many more operations. If you have your portable hadoop setup as explained in one of my previous posts, it's time to experience HDFS using file system commands. HDFS file system commands are modeled on Linux file system commands because HDFS is primarily built on the Linux platform.
In a typical hadoop installation, you will have three types of machines: NameNode, DataNode and client machines. You already understand NameNode and DataNode (if not, read this post). A client machine is simply your local or remote computer with the hadoop client components installed and configured to interact with the NameNode and DataNodes. Ideally, your client machine does not run any NameNode or DataNode daemons.
In our current setup of the portable Hortonworks hadoop sandbox, everything runs on a single virtual machine. My Windows laptop cannot act as a client machine in this setup because it does not have the hadoop client components installed. That is not a problem for demonstrating the concept, though, because our sandbox can also act as a client machine for us.
For this tutorial, I will first move one large file from my Windows laptop into the sandbox's local file system. The file will still not be in HDFS. That situation is exactly the same as having a file on your client machine, ready to be moved into HDFS to take advantage of the power of HDFS.
The easiest method for moving a file from a Windows machine to a Linux machine is through an SSH client, so I will use the open source WinSCP for this purpose. If you have another tool, you can use it; otherwise download WinSCP and use it as demonstrated below.
  1. Start your sandbox VM and you will see a screen similar to the one below.



  2. This means you can SSH into your VM using 127.0.0.1 at port 2222. Your VM's actual IP address is different from what is shown in the screen above, and SSH inside the VM actually listens on port 22. You can still reach it at 127.0.0.1 thanks to an Oracle VirtualBox feature called port forwarding: you connect to ports on your localhost's loopback interface and VirtualBox forwards the traffic to preconfigured ports on the VM. This feature allows you to connect your laptop to any network and still keep your sandbox working.
  3. Let's start WinSCP and connect to your Linux virtual machine. Select values as shown in the image below and click Login. (If you prefer a command-line client instead of WinSCP, the scp example after this step does the same job.)
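As a side note, the same port-forwarding rule works for plain scp and ssh. This is just a sketch, assuming an OpenSSH-style client is available on your Windows laptop; WinSCP users can skip it.

# Copy SampleLargeFile.tsv from the laptop into /root on the sandbox
# (127.0.0.1 and port 2222 come from the VirtualBox port-forwarding rule above)
scp -P 2222 SampleLargeFile.tsv root@127.0.0.1:/root/

# Or open an interactive shell on the sandbox the same way
ssh -p 2222 root@127.0.0.1

When prompted, log in as root with the password hadoop, the same credentials we will use at the VM console in a later step.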


  4. On successful login, you will see WinSCP showing the files on your local machine and on the remote machine side by side. In the screen shown below, the left side shows SampleLargeFile.tsv on my local machine and the right side shows the /root directory on my VM. Transferring the file from my local machine to the VM is now just a matter of drag and drop. Once you have moved your file into the /root directory of your VM, you are done with WinSCP, so just close it for now.


  5. Now go back to your VM window and press ALT+F5. You will get a Linux login screen. Log in to your VM with the user name root and the password hadoop. After a successful login, execute the clear command to get a clear screen.



  6. You can issue the ls command to see that your file is there. In my case it is SampleLargeFile.tsv. The file is available in my VM's local file system; it is still not in HDFS. This situation is exactly the same as having a file on your client machine, ready to be moved into the HDFS file system to take advantage of the power of HDFS. Now you are ready to use HDFS commands, so let's start.
  7. The first command I will use is hadoop fs -ls /user
All hadoop file system commands follow the general syntax hadoop fs <args>. In the example above, I am using the ls command to list the contents of the /user directory in HDFS. Please note that /user is an HDFS directory and you will not find it in the local file system. If you have any doubt, issue the ls /user command and you will get "no such file or directory" as the answer.
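To make the contrast concrete, here is roughly how those two commands behave on the sandbox; the exact contents listed under /user will depend on your sandbox version:

# List the /user directory in HDFS (shows HDFS directories such as /user/hue)
hadoop fs -ls /user

# The same path does not exist in the VM's local file system
ls /user
# ls: cannot access /user: No such file or directory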
  8. Next, I am going to use the mkdir command to make a directory under the /user/hue directory, and finally I will use the put command to copy my SampleLargeFile.tsv into the HDFS directory /user/hue/myfiles. You can see the syntax and results listed below and in the screen that follows.
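The commands I ran were along these lines; myfiles and SampleLargeFile.tsv are just the example names used in this tutorial, and on some hadoop versions you may need mkdir -p to create missing parent directories:

# Create a new directory in HDFS under /user/hue
hadoop fs -mkdir /user/hue/myfiles

# Copy the file from the VM's local file system into the new HDFS directory
hadoop fs -put SampleLargeFile.tsv /user/hue/myfiles

# Verify that the file is now in HDFS
hadoop fs -ls /user/hue/myfiles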


  9. You can get more details on the various available file system commands from the Apache hadoop documentation and practice them in the same way I have demonstrated some of them here.
  10. We have seen the hadoop file system shell interface in action, but that is not the only interface HDFS offers for users and developers to access and work with HDFS. At the lowest level are the Java APIs; using those APIs, the open source community has developed a web interface to make things simple for end users. Let's explore this web interface.
  11. While your VM is up and running, start your favorite browser and type in this URL: http://localhost:8888/
After completing the one-time registration process, you will see the Hortonworks sandbox home page, as shown below.


  12. Click on the Go to Sandbox link and you will reach the hadoop web interface called Hue. We will talk about Hue some other time. For now, just click on the File Browser icon on the Hue home page. It opens the HDFS file browser, and you should be able to see the myfiles directory we created in this tutorial. If you navigate into the myfiles directory, you will find the file we put there.
  13. The Hue file browser provides an easy interface to upload files directly into HDFS without following all those steps we used earlier to get our file into HDFS. You can perform all your day-to-day file system operations using the Hue file browser.
 
In a typical HDFS installation, the NameNode and each DataNode run an internal web server to display basic information about the current status of the cluster. With the default configuration, the NameNode front page is at http://namenode-name:50070/, but in the case of our sandbox it will be http://localhost:50070/. The screen looks like the image below.
It shows some basic reports about HDFS and the NameNode, and a link to the DataNodes. You can browse to a DataNode to get its reports. You can also browse the file system using the link on the page; that did not work for me until I changed the URL again, replacing the actual IP address of my NameNode with localhost.
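If you just want a quick check that these web servers are up without leaving the terminal, you can probe them from the sandbox shell (assuming curl is installed). Port 50070 for the NameNode comes from the default configuration above; 50075 is the usual DataNode default for this generation of hadoop, so treat it as an assumption for your particular sandbox:

# NameNode web UI (the same page you open in the browser)
curl -s http://localhost:50070/ | head

# DataNode web UI (50075 is assumed to be the default DataNode HTTP port here)
curl -s http://localhost:50075/ | head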
This ends a basic introduction to HDFS, the distributed file system. The next key concept of the hadoop ecosystem is the MapReduce process, which is what actually makes hadoop powerful and purposeful.