Wednesday, October 30, 2013

What is Hadoop?

We will use a question-and-answer approach to understand Hadoop properly.

Question: What is Hadoop in its simplest form?
Answer: It’s a distributed file system, named the Hadoop Distributed File System and abbreviated as HDFS. It is similar to the hierarchical file systems we use on our laptops, i.e. users can create directories and can create, move, remove and rename files inside those directories. Users interact with it through HDFS client software.
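To make this concrete, here is a minimal sketch of those namespace operations using the HDFS Java client API. The cluster address and paths are made-up placeholders, and error handling is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/demo/reports"));                // create a directory
        fs.rename(new Path("/user/demo/reports"),                 // rename/move it
                  new Path("/user/demo/archive"));
        fs.delete(new Path("/user/demo/archive"), true);          // remove it (recursively)

        fs.close();
    }
}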
Question: What do you mean by a distributed file system?
Answer: A single file is broken into chunks and distributed over multiple machines. These chunks are not arbitrary: they are equal-sized blocks, as defined by the HDFS block size for the file, except possibly the last block. Let’s assume you have a 500 MB file and that the HDFS block size for the file is 64 MB. The file is broken into seven blocks of 64 MB and an eighth block of 52 MB. These 8 blocks are then sprinkled amongst various machines.
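The arithmetic is simple; this tiny Java snippet just reproduces the example above.

public class BlockMath {
    public static void main(String[] args) {
        long fileSize  = 500L * 1024 * 1024;  // 500 MB file
        long blockSize =  64L * 1024 * 1024;  // 64 MB HDFS block size for this file

        long fullBlocks  = fileSize / blockSize;              // 7 full blocks
        long lastBlock   = fileSize % blockSize;              // 52 MB remainder
        long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);

        System.out.println(totalBlocks + " blocks: " + fullBlocks + " x 64 MB + "
                + (lastBlock / (1024 * 1024)) + " MB");
        // prints: 8 blocks: 7 x 64 MB + 52 MB
    }
}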
Question: How do we read a file that has been broken up like this?
Answer: To understand that, you need to understand the architecture. HDFS has a master/slave architecture. There is a master called the NameNode, and this master has one or many slaves called DataNodes. NameNode and DataNode are nothing but specialized software. In a typical Hadoop cluster, one machine runs the NameNode software and the other machines run the DataNode software; we usually have one instance of the DataNode running per machine. It is the NameNode that facilitates reading those broken pieces of your file.
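As a sketch of what that facilitation looks like from a client: through the Java API you can ask which DataNodes hold each block of a file. The file path below is hypothetical.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

        // The NameNode's metadata tells us, for each block, which DataNodes hold it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}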
Question: What are the duties of the NameNode and the DataNode?
Answer:
Duties of NameNode
  • Manage the file system namespace, i.e. execute namespace operations like opening, closing and renaming files and directories.
  • Keep track of blocks: their mapping to files and their location on DataNodes. In other words, manage all metadata.
Duties of DataNode
  • Manage the storage attached to the machine they run on, i.e. block creation, deletion and replication.
  • Serve read and write requests from clients; hence user data does not flow through the NameNode but goes directly to the DataNodes.
Question: What if one of the DataNodes fails? Do I lose some blocks of my file?
Answer: No, HDFS is designed to be fault tolerant by default. An application can define a replication factor for a file. Let’s assume you define the replication factor for your file as 3. This means that HDFS will maintain 3 copies of each block of your file on different DataNodes.
Question: Do we have a system-wide HDFS block size and replication factor?
Answer: No. Replication factor and block size can be configured per file. (Cluster-wide defaults exist, but both can be overridden for each file.)
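As an illustration, here is a minimal sketch of creating a file with an explicit replication factor and block size through the Java API; the values and path are made up, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettings {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/demo/per-file.dat");
        short replication = 3;                 // 3 copies of every block of this file
        long blockSize = 64L * 1024 * 1024;    // 64 MB blocks for this file
        int bufferSize = 4096;

        FSDataOutputStream out =
                fs.create(file, true, bufferSize, replication, blockSize);
        out.writeUTF("per-file block size and replication in action");
        out.close();
        fs.close();
    }
}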
Question: How do the NameNode and DataNodes sync up?
Answer: Each DataNode periodically sends a heartbeat and a block report to the NameNode. Receipt of a heartbeat implies that the DataNode is functioning properly. The block report contains the list of all blocks on that DataNode. The NameNode marks DataNodes without a recent heartbeat as dead and does not forward any new I/O requests to them. In this process, the NameNode never initiates communication; it only responds to remote procedure calls issued by the DataNodes. All of this communication is layered on the TCP/IP protocol.
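The following is a deliberately simplified model of the heartbeat idea, written only to illustrate the behaviour described above; it is not Hadoop’s actual implementation, and the interval and timeout values are assumptions made for the example.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatModel {
    static final long HEARTBEAT_INTERVAL_MS = 3000;  // assumed interval, for illustration
    static final long DEAD_AFTER_MS = 30000;         // assumed timeout, for illustration

    // NameNode side: last heartbeat time seen per DataNode.
    static final Map<String, Long> lastSeen = new ConcurrentHashMap<String, Long>();

    // DataNode side: send a heartbeat; the master only records it (it never calls back).
    static void heartbeat(String dataNodeId) {
        lastSeen.put(dataNodeId, System.currentTimeMillis());
    }

    // NameNode side: a node with no recent heartbeat is considered dead.
    static boolean isDead(String dataNodeId) {
        Long seen = lastSeen.get(dataNodeId);
        return seen == null || System.currentTimeMillis() - seen > DEAD_AFTER_MS;
    }

    public static void main(String[] args) throws InterruptedException {
        heartbeat("datanode-1");
        Thread.sleep(HEARTBEAT_INTERVAL_MS);
        heartbeat("datanode-1");
        System.out.println("datanode-1 dead? " + isDead("datanode-1")); // false
        System.out.println("datanode-2 dead? " + isDead("datanode-2")); // true (never reported)
    }
}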
Question: What if a DataNode failure causes the replication factor to fall below the specified value?
Answer: The NameNode will initiate re-replication of blocks whenever necessary. Re-replication might be needed for many reasons: a DataNode is dead, a block replica is corrupt, a DataNode is alive but its hard disk has failed making a block unavailable, or the replication factor of a file has been increased.
Question: You talked about data block corruption. How is it identified that a specific block is corrupt?
Answer: The HDFS client software used for file system operations implements checksum checking on the contents of HDFS files. When a client creates a file, it computes a checksum for each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, the client can opt to retrieve that block from another DataNode that has a replica of it.
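The sketch below shows the principle with java.util.zip.CRC32: compute a checksum when data is written, store it separately, and verify it when the data is read back. It illustrates the idea only; it is not HDFS’s exact checksum scheme.

import java.util.zip.CRC32;

public class ChecksumIdea {
    static long checksumOf(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block contents".getBytes();
        long stored = checksumOf(block);    // computed at write time, stored separately

        // Later, at read time: recompute and compare.
        byte[] received = block.clone();
        received[0] ^= 0x01;                // simulate a corrupted replica
        boolean ok = checksumOf(received) == stored;
        System.out.println(ok ? "block verified" : "corrupt - fetch another replica");
    }
}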
Question: So it appears that there is enough protection against DataNode failure. What if the NameNode fails?
Answer: That’s a single point of failure in HDFS as of now. If the NameNode fails, manual intervention is required. Automatic restart and failover of the NameNode software to another machine is not yet supported.
Question: You mentioned that user data does not flow through the NameNode but goes directly to the DataNodes. Can you explain this process?
Answer: HDFS client software is used to create data. Initially, the HDFS client caches the file data in a temporary local file. When the local file accumulates data worth one HDFS block size, the client contacts the NameNode. The NameNode inserts the new file name into the file system hierarchy and allocates a data block for it, then responds to the client with the identity of the DataNode and the destination data block. The client then flushes the block of data from the local temporary file to the specified DataNode. When the write is complete, the client tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into its persistent store.
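From the application’s point of view, all of this is hidden behind a simple create-and-write sequence, as in this sketch (the path is hypothetical and error handling is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteAFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        FSDataOutputStream out = fs.create(new Path("/user/demo/events.log"));
        for (int i = 0; i < 1000; i++) {
            out.writeBytes("event " + i + "\n");  // staged by the client until a block fills
        }
        out.close();  // closing the file lets the client tell the NameNode it is complete
        fs.close();
    }
}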
Question: How does replication happen?
Answer: In the above process, the identity the client retrieves for its data is in fact a list of DataNodes: the DataNodes that will host a replica of that block. The client flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes it to its repository and then flushes it to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus a DataNode can be receiving data from the previous node in the pipeline while at the same time forwarding data to the next one; the data is pipelined from one DataNode to the next.
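Here is a much-simplified, synchronous model of that pipeline, written only to show how each node writes a portion locally and forwards it to the next node in the list; it is not Hadoop’s internal code, and the block size is shrunk for the example.

import java.io.ByteArrayOutputStream;

public class PipelineModel {
    static final int PORTION = 4 * 1024;  // 4 KB portions, as described above

    static class DataNode {
        final String name;
        final ByteArrayOutputStream repository = new ByteArrayOutputStream();
        DataNode next;  // next node in the pipeline, if any
        DataNode(String name) { this.name = name; }

        void receive(byte[] portion) {
            repository.write(portion, 0, portion.length);  // write locally...
            if (next != null) next.receive(portion);       // ...and forward downstream
        }
    }

    public static void main(String[] args) {
        DataNode first = new DataNode("dn1");
        DataNode second = new DataNode("dn2");
        DataNode third = new DataNode("dn3");
        first.next = second;
        second.next = third;

        byte[] block = new byte[8 * 1024 * 1024];  // one block to replicate (size reduced here)
        for (int off = 0; off < block.length; off += PORTION) {
            int len = Math.min(PORTION, block.length - off);
            byte[] portion = new byte[len];
            System.arraycopy(block, off, portion, 0, len);
            first.receive(portion);  // the client only talks to the first DataNode
        }
        System.out.println("replica sizes: " + first.repository.size() + ", "
                + second.repository.size() + ", " + third.repository.size());
    }
}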
Question: What happens when I reduce the replication factor of a file?
Answer: When the replication factor of a file is reduced, the NameNode selects the excess replicas that can be deleted. The next heartbeat transfers this information to the DataNode, which then removes the corresponding blocks and frees up space.
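For example, an application can lower (or raise) the replication factor of an existing file with a single call; the NameNode then takes care of removing or adding replicas. The path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big-file.dat");  // hypothetical file

        fs.setReplication(file, (short) 2);  // drop from 3 to 2 replicas per block
        System.out.println("new replication factor: "
                + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}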
Question: Do we have any other methods to access HDFS besides the client software?
Answer: HDFS can be accessed from applications in many different ways. HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, a typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
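As a small example of the Java API mentioned above, this sketch opens a file and streams its contents to standard output (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadAFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        FSDataInputStream in = fs.open(new Path("/user/demo/events.log"));
        try {
            // Stream the file contents to stdout; the bytes come straight from DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close();
            fs.close();
        }
    }
}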
Question: Does HDFS support delete and undelete of files like other file systems?
Answer: When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed.
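One caveat for the sketch below: as far as I know, the trash behaviour described above is applied when you delete through the shell (hadoop fs -rm) and its retention is governed by the fs.trash.interval setting, whereas FileSystem.delete() in the Java API removes the file directly. The path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteAFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Direct (non-trash) delete from an application; second argument is "recursive".
        boolean deleted = fs.delete(new Path("/user/demo/events.log"), false);
        System.out.println(deleted ? "deleted" : "nothing to delete");
        fs.close();
    }
}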
Question: Any special notes about HDFS?
Answer: There is a lot more to HDFS, and it is continuously improving in many areas. One last thing that is necessary to understand at this stage: HDFS is designed for large and very large files. A typical file in HDFS is gigabytes to terabytes in size, and HDFS is tuned to support such files. It is designed with the assumption that you write a file once and read it many times; it assumes that once written, you won’t modify your files. That is the typical case for data warehouses and data analytics.