We will use a question-and-answer technique to properly understand Hadoop.
Question: What is Hadoop in its simplest form?
Answer: It is a distributed file system, named the Hadoop Distributed File
System and abbreviated as HDFS. It is similar to the hierarchical file systems
we use on our laptops, i.e. users can create directories and can create, move,
remove and rename files inside these directories. Users must use HDFS client
software for this purpose.
Question: What do you mean by a distributed file system?
Answer: A single file is broken into chunks and distributed over multiple
systems. These chunks are not arbitrary; they are equal-sized blocks (except
the last block), with the size defined by the HDFS block size. Let's assume
you have a 500 MB file and the HDFS block size for the file is 64 MB. The file
is broken into 7 blocks of 64 MB and an 8th block of 52 MB. These 8 blocks of
your file are sprinkled amongst various machines.
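The block arithmetic is straightforward; here is a small sketch using the
sizes from the example above.

long fileSize  = 500L * 1024 * 1024;   // 500 MB file
long blockSize =  64L * 1024 * 1024;   // per-file HDFS block size
long fullBlocks = fileSize / blockSize;                       // 7 full 64 MB blocks
long lastBlockSize = fileSize % blockSize;                    // 52 MB remainder
long totalBlocks = fullBlocks + (lastBlockSize > 0 ? 1 : 0);  // 8 blocks in total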
Question: How do we read such a broken-up file?
Answer: To understand that, you need to understand the architecture. HDFS has
a master/slave architecture. There is a master called the NameNode, and this
master has one or many slaves called DataNodes. The NameNode and DataNode are
nothing but specialized software. In a typical Hadoop cluster, one machine runs
the NameNode software and the other machines run the DataNode software. We
usually have one instance of the DataNode running per machine. It is the
NameNode that facilitates reading those broken pieces of your file.
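As a hedged sketch, a client read looks like an ordinary stream read; behind
the scenes the client library asks the NameNode where the blocks live and then
pulls the data from the DataNodes. The path and class name below are
hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // The client library consults the NameNode for block locations,
    // then streams the blocks directly from the DataNodes.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/big-file.dat"))) {
      IOUtils.copyBytes(in, System.out, 4096, false); // print file contents
    }
    fs.close();
  }
}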
Question: What are the duties of the NameNode and the DataNode?
Answer:
Duties of the NameNode
- Manage the file system namespace, i.e. execute namespace operations such as
opening, closing and renaming files and directories.
- Keep track of blocks: their mapping to files and their locations on
DataNodes; hence, manage all metadata (a small sketch of querying this
metadata follows this list).
Duties of the DataNode
- Manage the storage attached to the machine it runs on, i.e. block creation,
deletion and replication.
- Serve read and write requests from clients; hence, user data does not flow
through the NameNode but goes directly to the DataNodes.
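As a hedged illustration of the metadata the NameNode maintains, a client can
ask for the block locations of a file through the Java API. The path and class
name are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));
    // Ask the NameNode which DataNodes hold each block of the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}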
Question: What if one of the DataNodes fails? Do I lose some blocks of my file?
Answer: No, HDFS is designed with fault tolerance by default. An application
can define the replication factor for a file. Let's assume you define the
replication factor for your file as 3. This means that HDFS will maintain 3
copies of each block of your file on different DataNodes.
Question: Do we have a system-wide HDFS block size and replication factor?
Answer: No. The replication factor and block size can be configured per file.
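As a hedged sketch of per-file settings, the Java API lets a client pass a
replication factor and block size when creating a file; the path, class name
and values below are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettingsDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    short replication = 3;                  // keep 3 copies of every block
    long blockSize = 64L * 1024 * 1024;     // 64 MB blocks for this file
    int bufferSize = 4096;
    try (FSDataOutputStream out = fs.create(
        new Path("/user/demo/per-file.dat"), true, bufferSize, replication, blockSize)) {
      out.writeUTF("per-file block size and replication factor");
    }
    fs.close();
  }
}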
Question: How do the NameNode and DataNodes sync up?
Answer: Each DataNode periodically sends a heartbeat and a block report to the
NameNode. Receipt of a heartbeat implies that the DataNode is functioning
properly. The block report contains a list of all blocks on that DataNode. The
NameNode marks DataNodes without a recent heartbeat as dead and does not
forward any new I/O requests to them. In this process, the NameNode does not
initiate any communication; it only responds to DataNode remote procedure
calls. All of these communications are layered on the TCP/IP protocol.
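The following is only a toy sketch of the bookkeeping idea, not Hadoop's actual
implementation: the NameNode side remembers when each DataNode last reported
in and treats nodes that stay silent too long as dead. The class name and
timeout are assumptions.

import java.util.HashMap;
import java.util.Map;

public class HeartbeatTrackerSketch {
  // Hypothetical timeout; the real value and mechanism are configured inside HDFS.
  private static final long DEAD_AFTER_MS = 10 * 60 * 1000;

  private final Map<String, Long> lastHeartbeat = new HashMap<>();

  // Called when a heartbeat arrives from a DataNode.
  public void onHeartbeat(String dataNodeId) {
    lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
  }

  // A DataNode with no recent heartbeat is considered dead
  // and receives no new I/O requests.
  public boolean isDead(String dataNodeId) {
    Long last = lastHeartbeat.get(dataNodeId);
    return last == null || System.currentTimeMillis() - last > DEAD_AFTER_MS;
  }
}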
Question: What if one of the DataNodes fails, causing the replication factor
to fall below the specified value?
Answer: The NameNode will initiate replication of blocks if necessary.
Re-replication may be necessary for many reasons: a DataNode is dead, a data
block is corrupt, a DataNode is alive but its hard disk has failed (making the
block unavailable), or the replication factor has been increased.
Question: You talked about data block corruption. How is it identified that a
specific block is corrupt?
Answer: The HDFS client software used for file system operations implements
checksum checking on the contents of HDFS files. When a client creates a file,
it computes a checksum of each block of the file and stores these checksums in
a separate hidden file in the same HDFS namespace. When a client retrieves
file contents, it verifies that the data it received from each DataNode
matches the checksum stored in the associated checksum file. If not, the
client can opt to retrieve that block from another DataNode that has a replica
of it.
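As a hedged illustration of the idea (not the exact checksum format HDFS uses
on disk), here is a minimal compute-and-verify sketch using CRC32.

import java.util.zip.CRC32;

public class ChecksumSketch {
  // Compute a checksum over a block's bytes when the block is written.
  static long checksumOf(byte[] block) {
    CRC32 crc = new CRC32();
    crc.update(block, 0, block.length);
    return crc.getValue();
  }

  // On read, recompute and compare against the stored value; a mismatch means
  // this replica is corrupt and another replica should be tried.
  static boolean isCorrupt(byte[] blockReadBack, long storedChecksum) {
    return checksumOf(blockReadBack) != storedChecksum;
  }
}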
Question: So it appears that there is enough protection against DataNode
failure. What if the NameNode fails?
Answer: That is a single point of failure in HDFS as of now. If the NameNode
fails, it requires manual intervention. Automatic restart and failover of the
NameNode software to another machine is not yet supported.
Question: You mentioned that user data does not flow through the NameNode but
goes directly to the DataNodes. Can you explain this process?
Answer: HDFS client software is used to create data. In fact, the HDFS client
initially caches the file data in a temporary local file. When the local file
accumulates more than one HDFS block's worth of data, the client contacts the
NameNode. The NameNode inserts the new file name into the file system
hierarchy and allocates a data block for it. The NameNode responds to the
client request with the identity of the DataNode and the destination data
block. The client then flushes the block of data from the local temporary file
to the specified DataNode. When the job is complete, the client tells the
NameNode that the file is closed. At this point, the NameNode commits the file
creation operation into a persistent store.
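From the application's point of view this staging and flushing is hidden
behind an ordinary output stream; a hedged write sketch (path and class name
are hypothetical) looks like this.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // The client library stages the data and ships full blocks to DataNodes;
    // closing the stream tells the NameNode the file is complete.
    try (FSDataOutputStream out = fs.create(new Path("/user/demo/new-file.txt"))) {
      out.write("hello HDFS".getBytes(StandardCharsets.UTF_8));
    }
    fs.close();
  }
}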
Question: How does replication happen?
Answer: In the above process, the client retrieves from the NameNode the
identity of the destination DataNodes for its data. This identity is a list of
DataNodes; the list contains the DataNodes that will host a replica of that
block. The client then flushes the data block to the first DataNode. The first
DataNode starts receiving the data in small portions (4 KB), writes each
portion to its local repository and transfers that portion to the second
DataNode in the list. The second DataNode, in turn, starts receiving each
portion of the data block, writes that portion to its repository and then
flushes that portion to the third DataNode. Finally, the third DataNode writes
the data to its local repository. Thus, a DataNode can be receiving data from
the previous one in the pipeline while at the same time forwarding data to the
next one in the pipeline. In other words, the data is pipelined from one
DataNode to the next.
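The following is only a toy sketch of the forwarding idea, not Hadoop's wire
protocol: each node in the pipeline writes a 4 KB portion locally and
immediately passes it on to the next node. The class and method names are
assumptions.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class PipelineSketch {
  // Receive data in 4 KB portions, store each portion locally and,
  // if there is a downstream node, forward the same portion to it.
  static void receiveAndForward(InputStream fromUpstream,
                                OutputStream localStore,
                                OutputStream toDownstream) throws IOException {
    byte[] portion = new byte[4 * 1024];
    int n;
    while ((n = fromUpstream.read(portion)) > 0) {
      localStore.write(portion, 0, n);
      if (toDownstream != null) {
        toDownstream.write(portion, 0, n);
      }
    }
  }
}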
Question: What happens when I reduce the replication factor of a file?
Answer: When the replication factor of a file is reduced, the NameNode
selects excess replicas that can be deleted. The next Heartbeat transfers this
information to the DataNode. The DataNode then removes the corresponding blocks
and frees space.
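As a hedged sketch, the replication factor of an existing file can be changed
through the Java API; the path, class name and new value are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Lower the replication factor; the NameNode later tells the DataNodes
    // to drop the excess replicas.
    fs.setReplication(new Path("/user/demo/big-file.dat"), (short) 2);
    fs.close();
  }
}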
Question: Do we have any other method to access HDFS besides the client
software?
Answer: HDFS can be accessed from
applications in many different ways. HDFS provides a Java API for applications
to use. A C language wrapper for this Java API is also available. In addition, a
typical HDFS installation configures a web server to expose the HDFS namespace
through a configurable TCP port. This allows a user to navigate the HDFS
namespace and view the contents of its files using a web browser.
Question: Does HDFS support delete and undelete of files like other file
systems?
Answer: When a file is deleted by a
user or an application, it is not immediately removed from HDFS. Instead, HDFS
first renames it to a file in the /trash directory. The file can be restored quickly
as long as it remains in /trash. A file remains in /trash for a configurable amount
of time. After the expiry of its life in /trash, the NameNode deletes the file
from the HDFS namespace. The deletion of a file causes the blocks associated
with the file to be freed.
Question: Any
special notes about HDFS?
Answer: There is a lot more to HDFS, and it is continuously improving in many
areas. One last thing that is necessary to understand at this stage is that
HDFS is designed for large and very large files. A typical file in HDFS is
gigabytes to terabytes in size, and HDFS is tuned to support such files. It is
designed with the assumption that you write a file once and read it many
times; it assumes that once written, you won't modify your files. That is a
typical case for data warehouses and data analytics.