Monday, July 7, 2014

HDFS Federation

We understand that the DataNode provides the storage service in a Hadoop cluster by storing blocks on the local file system and allowing read/write access, whereas the NameNode is responsible for managing the entire file system namespace of an HDFS cluster, including block locations. We can summarize the responsibilities of the NameNode in two categories.
  1. Namespace Management
    1. Manage directories, files and block locations
    2. Support filesystem operations like create, delete, modify and list directories.
  2. Block storage Management
    1. Provide DataNodes membership by handling registration and periodic heartbeats.
    2. Process block reports and maintain block locations.
    3. Support block-related operations: create, delete, modify and get block location.
    4. Manage replica placement, replicate under-replicated blocks and delete over-replicated blocks.
The prior HDFS architecture allows only a single namespace for the entire cluster. Since the NameNode keeps all of this information in memory, two specific issues remain unaddressed.
  • For a very large cluster, memory becomes a limiting factor. Storage at DataNodes scales horizontally, but memory at the NameNode can only scale vertically. Performance of file system operations suffers under this memory limitation, and the only remedy is to add more memory to the NameNode.
  • You may want to use your cluster for a mix of purposes; for example, you may want to test experimental applications on your production cluster. In that situation, you would want to isolate the experimental namespace from the production namespace. Requirements for isolation can arise for various operational reasons, but the prior HDFS architecture doesn't allow it. This requirement exists even for small clusters.
HDFS Federation addresses these limitations of the prior architecture by allowing multiple NameNodes in a single Hadoop cluster. It is as if a large file system were broken into smaller pieces and distributed among multiple NameNodes. Each NameNode manages one or more parts of the entire file system; for example, /user and /tmp are managed by node-1 while /projects is managed by node-2. HDFS Federation was introduced in the Hadoop 0.23 release and is part of Hadoop 2.x.
It is important to understand the following key points about HDFS Federation.
  1. Federated NameNodes are independent and don't require coordination with each other. If one NameNode goes down, it doesn't impact the namespaces managed by the other NameNodes.
  2. DataNodes are common to all NameNodes, which means a DataNode can store blocks from any of the NameNodes.
  3. DataNodes send heartbeats and block reports to all NameNodes and handle commands from all NameNodes.
  4. The set of all blocks belonging to a single namespace is called the block pool of that namespace's NameNode. A NameNode together with its block pool is called a namespace volume in HDFS terminology.
  5. HDFS Federation mandates the use of a ClusterID to identify all the NameNodes in the cluster. This helps extend namespaces to a multi-cluster environment.

Implementation

Once we understand the basics of HDFS Federation, the next question is how it is implemented. There are two parts to the problem of implementing HDFS Federation in a cluster.
  • How do we configure multiple NameNodes? – This part is quite simple: we configure and start multiple NameNodes just as we do a single one. We have to address a few details, such as giving each NameNode a unique id and configuring its RPC and HTTP addresses; all such configuration is placed in the hdfs-site.xml file.
  • How do we map different namespaces to individual NameNodes? – This part carries another, implicit problem: how do we use the file system after breaking it into smaller parts and mapping them to different NameNodes?
This second problem, along with its implicit companion, is solved by ViewFS. We will discuss it shortly, but let's take the problems in sequence.
Let's assume that we want to configure two NameNodes in a cluster. To do this, we have to define a new property, dfs.nameservices, in the hdfs-site.xml file. The value of this property is a comma-separated list of NameServiceIDs; each NameServiceID is a unique logical id for a NameNode. An example is given below.
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
In the above example, ns1 and ns2 are two logical NameServiceIDs which represent two NameNodes. In the next step, we map these logical ids to physical NameNode addresses. This part is also defined in hdfs-site.xml. An example is given below.
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>node-1.mycompany.com:50070</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>node-2.mycompany.com:50070</value>
</property>
In the above example, ns1 is mapped to node-1 at port 50070 and, similarly, ns2 is mapped to node-2 at port 50070.
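The RPC addresses above have HTTP counterparts for each NameNode's web interface. A sketch of those companion properties is given below, reusing the example hostnames; the port 50071 is purely illustrative (the HTTP port must differ from the RPC port configured above):

```xml
<property>
  <name>dfs.namenode.http-address.ns1</name>
  <value>node-1.mycompany.com:50071</value>
</property>
<property>
  <name>dfs.namenode.http-address.ns2</name>
  <value>node-2.mycompany.com:50071</value>
</property>
```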
Once these configurations are specified, you need to format your new NameNodes using hdfs namenode -format [-clusterId <cluster_id>]. The cluster id is optional; if you don't provide one, a new id is generated automatically, and you should reuse it while formatting the other NameNodes in the same cluster. After formatting your new NameNodes, start them and refresh the DataNodes so that they pick up the new NameNodes. Several other pieces have also been changed to support federation: for example, the balancer has been changed to work with multiple NameNodes, and a Cluster Web Console has been added to monitor the federated cluster at http://<any_nn_host:port>/dfsclusterhealth.jsp. This completes our first problem.
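The formatting and refresh steps can be sketched as shell commands. This is a dry run (each command is echoed rather than executed; drop the echo prefixes on a real cluster); the cluster id and DataNode address are hypothetical, and hdfs dfsadmin -refreshNamenodes takes a DataNode host:ipc-port pair:

```shell
CLUSTER_ID="CID-example"                 # hypothetical; reuse the id printed when formatting the first NameNode
DN_IPC="node-3.mycompany.com:50020"      # hypothetical DataNode IPC address

echo hdfs namenode -format -clusterId "$CLUSTER_ID"   # format the new NameNode into the same cluster
echo hadoop-daemon.sh start namenode                  # start the newly formatted NameNode
echo hdfs dfsadmin -refreshNamenodes "$DN_IPC"        # tell a DataNode to register with the new NameNode
```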
With the knowledge of your NameNode URIs (scheme://authority) as specified in the configuration, you can use them for file system operations, for example creating directories and files. Creating a new directory is as simple as running hdfs dfs -mkdir with a fully qualified path name URI of the form scheme://authority/path.
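For instance (dry-run echoes again; the hostnames follow the earlier examples and the paths are hypothetical):

```shell
NN1="hdfs://node-1.mycompany.com:50070"   # ns1 from the example configuration
NN2="hdfs://node-2.mycompany.com:50070"   # ns2 from the example configuration

echo hdfs dfs -mkdir -p "$NN1/user/alice"   # create a hypothetical directory on ns1
echo hdfs dfs -ls "$NN2/projects"           # list a path owned by ns2
```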
At this stage, there is no global namespace defined. To a client, each NameNode looks like a separate HDFS cluster, and to perform any operation the client needs to know the NameNode URI. This makes client programs harder to write and less portable. This was our second problem; Hadoop addresses it using ViewFS.

ViewFS

The View File System (ViewFS) provides a way to manage multiple NameNodes in a Hadoop cluster, and hence multiple namespaces in HDFS Federation. In the old HDFS implementation, a cluster had a single NameNode, and the cluster's core-site.xml had a configuration property that set the default file system and mapped it to that NameNode. An example of such a property is shown below.
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://NameNodeOfClusterX:port</value>
</property>
On a cluster where core-site.xml is set as above, a typical path name will be something like /projects. Alternatively, you can specify fully qualified names when referring to an HDFS location, for example hdfs://NameNodeOfClusterX:port/projects. While both are valid, you should prefer the relative name /projects because it makes your code portable across clusters and independent of the underlying NameNode.
In a federated HDFS implementation, you have multiple NameNodes, and we have two problems to solve. The first is to assign parts of the file system to individual NameNodes. The second is to use relative file paths like /projects without needing to specify a particular NameNode in the URI.
Hadoop solves this problem using a mount table mechanism similar to Unix/Linux mount tables. In the Linux implementation, a mounted file system is given a logical name used by clients, and the mapping between logical names and actual physical locations is stored in a mount table. ViewFS uses a similar mechanism.
The first step is to specify that we are using ViewFS; the second is to create mount tables that resolve logical names to actual physical locations. Specifying that we are using ViewFS is done in the core-site.xml file by changing the value of the fs.defaultFS property. In a non-federated setup this property stores the NameNode URI, but in a federated implementation it stores the mount table name for the cluster. An example of such a configuration is shown below.
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://ClusterX</value>
</property>
In the above example, viewfs is the scheme; it indicates that HDFS Federation and ViewFS are implemented for this cluster and that the available NameNodes can be found in a mount table, so the client will look for mount table information within the cluster configuration. The authority ClusterX in the above example is the mount table name. In a multi-cluster environment, it is recommended that you name mount tables after cluster names. If you have just a single cluster, you can leave the authority unspecified, which indicates that the default mount table should be used. For the default mount table, the entry will look like <value>viewfs:///</value>.
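For a single-cluster setup using the default mount table, the corresponding core-site.xml entry would therefore be:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>viewfs:///</value>
</property>
```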
The next step is to define the mount tables. Mount tables can again be described in core-site.xml, or you can define them in a separate XML file and include it in your core-site.xml. An example of a mount table is given below.
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./home</name>
  <value>hdfs://node-1.mycompany.com:50070/home</value>
</property>
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./tmp</name>
  <value>hdfs://node-1.mycompany.com:50070/tmp</value>
</property>
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./projects</name>
  <value>hdfs://node-2.mycompany.com:50070/projects</value>
</property>
In this example, there are two NameNodes: node-1 manages /home and /tmp, while node-2 manages only /projects. Clients can directly use relative paths like /home, /tmp and /projects, which get resolved using the configuration file.
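To make the resolution concrete, here is a toy shell sketch of what ViewFS does with the ClusterX mount table: it matches the leading path component and rewrites the path onto the owning NameNode's URI. The hosts and ports follow the example configuration; this is an illustration of the idea, not how ViewFS is actually invoked.

```shell
# Toy mount-table resolver mirroring the ClusterX example.
resolve() {
  case "$1" in
    /home|/home/*)         echo "hdfs://node-1.mycompany.com:50070$1" ;;
    /tmp|/tmp/*)           echo "hdfs://node-1.mycompany.com:50070$1" ;;
    /projects|/projects/*) echo "hdfs://node-2.mycompany.com:50070$1" ;;
    *) echo "no mount point for $1" >&2; return 1 ;;
  esac
}

resolve /projects/etl   # -> hdfs://node-2.mycompany.com:50070/projects/etl
```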
If you are using the default mount table, the configuration will look like this:
<property>
  <name>fs.viewfs.mounttable.default.link./projects</name>
  <value>hdfs://node-2.mycompany.com:50070/projects</value>
</property>
You can use fully qualified names like viewfs://ClusterX/projects if you are outside the cluster. However, within the cluster you should avoid this and prefer relative paths like /projects; this makes your code portable from one cluster to another.