Monday, July 7, 2014

HDFS Federation

We understand that the DataNode provides the storage service in a Hadoop cluster by storing blocks on the local file system and allowing read/write access, whereas the NameNode is responsible for managing the entire file system namespace of the HDFS cluster, including block locations. We can summarize the responsibilities of the NameNode in two categories.
  1. Namespace Management
    1. Manage directories, files and block locations
    2. Support filesystem operations like create, delete, modify and list directories.
  2. Block storage Management
    1. Provide DataNodes cluster membership by handling registration and periodic heartbeats.
    2. Process block reports and maintain block locations.
    3. Support block-related operations such as create, delete, modify and get block location.
    4. Manage replica placement, replicate blocks that are under-replicated and delete blocks that are over-replicated.
The prior HDFS architecture allows only a single namespace for the entire cluster. Since the NameNode keeps all of this information in memory, two specific issues remain unaddressed.
  • For a very large cluster, memory becomes a limiting factor. Storage at DataNodes can scale horizontally, but memory at the NameNode can only scale vertically, and the performance of file system operations suffers from this memory limitation. In the prior architecture, the only way to handle this is to add more memory to the NameNode.
  • You may want to use your cluster for a mix of purposes; for example, you may want to test some experimental applications on your production cluster. In that situation, you would want to isolate the experimental namespace from the actual production namespace. Requirements for isolation can arise for various operational reasons, but the prior HDFS architecture does not allow it. This requirement exists even for small clusters.
HDFS Federation addresses these limitations of the prior architecture by allowing you to have multiple NameNodes in a single Hadoop cluster. It is as if a large file system were broken into smaller pieces and distributed amongst multiple NameNodes. Each NameNode manages one or more parts of the entire file system; for example, /user and /tmp may be managed by node-1 while /projects is managed by node-2. HDFS Federation was introduced in the Hadoop 0.23 release and is part of Hadoop 2.x.
It is important to understand the following key points about HDFS Federation.
  1. Federated NameNodes are independent and do not require coordination with each other. If one NameNode goes down, it does not impact the namespaces managed by the other NameNodes.
  2. DataNodes are shared by all NameNodes, which means a DataNode can store blocks belonging to any of the NameNodes.
  3. DataNodes send heartbeats and block reports to all NameNodes and handle commands from all of them.
  4. All the blocks that belong to a single namespace are managed by its NameNode and are called the block pool of that NameNode. A NameNode together with its block pool is called a namespace volume in HDFS terminology.
  5. HDFS Federation mandates the use of a ClusterID to identify all the NameNodes in a cluster. This helps extend namespaces to a multi-cluster environment.

Implementation

Once we understand the basics of HDFS Federation, the next question is how to implement it. There are two parts to the problem of implementing HDFS Federation in a cluster.
  • How do we configure multiple NameNodes? This part is quite simple: we configure and start multiple NameNodes just like we do a single one. We have to address a few details such as giving each NameNode a unique ID and configuring its RPC and HTTP addresses; all such configuration is placed in the hdfs-site.xml file.
  • How do we map different parts of the file system to individual NameNodes? This part carries an implicit third problem: how do we use the file system after breaking it into smaller parts and mapping them to different NameNodes?
The second problem, along with the implicit third problem, is solved by ViewFS. We will discuss it shortly, but let's take the problems in sequence.
Let's assume that we want to configure two NameNodes in a cluster. To do this, we have to define a new property, dfs.nameservices, in the hdfs-site.xml file. The value of this property is a comma-separated list of NameServiceIDs, where each NameServiceID is a unique ID for a NameNode. An example is given below.
<property>
   <name>dfs.nameservices</name>
   <value>ns1,ns2</value>
</property>
In the above example, ns1 and ns2 are two logical NameServiceIDs which represent two NameNodes. In the next step, we map these logical IDs to the physical NameNode addresses. This is also defined in hdfs-site.xml. An example is given below.
<property>
   <name>dfs.namenode.rpc-address.ns1</name>
   <value>node-1.mycompany.com:50070</value>
</property>
<property>
   <name>dfs.namenode.rpc-address.ns2</name>
   <value>node-2.mycompany.com:50070</value>
</property>
In the above example, ns1 is mapped to node-1 at port 50070 and, similarly, ns2 is mapped to node-2 at port 50070.
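Besides the RPC address, each NameServiceID normally also gets an HTTP address for the NameNode web UI through the dfs.namenode.http-address.<NameServiceID> property. A sketch for the same two nameservices is shown below; the property names are standard, but port 50071 is only an illustrative placeholder, not a value from the example above.
<property>
   <name>dfs.namenode.http-address.ns1</name>
   <value>node-1.mycompany.com:50071</value>
</property>
<property>
   <name>dfs.namenode.http-address.ns2</name>
   <value>node-2.mycompany.com:50071</value>
</property>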
Once these configurations are specified, you need to format your new NameNodes using hdfs namenode -format [-clusterId <cluster_id>]. The cluster ID is optional; if you don't provide one, a new ID is generated automatically, and you should reuse that ID while formatting the other NameNodes in the same cluster. After formatting your new NameNodes, start them and refresh the DataNodes so they pick up the new NameNodes. Several other components have also been changed to support federation; for example, the balancer now works with multiple NameNodes, and a Cluster Web Console has been added to monitor the federated cluster at http://<any_nn_host:port>/dfsclusterhealth.jsp. This takes care of our first problem.
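A minimal sketch of this sequence, assuming a Hadoop 2.x installation with the standard scripts on the PATH (the cluster ID, DataNode host name and default DataNode IPC port 50020 below are illustrative assumptions), could look like this:
# On node-1: format the first NameNode with a chosen cluster ID
hdfs namenode -format -clusterId myClusterID
# On node-2: format the second NameNode with the same cluster ID
hdfs namenode -format -clusterId myClusterID
# On each NameNode host: start the NameNode daemon
hadoop-daemon.sh start namenode
# For each DataNode: reload its configuration so it registers with the new NameNodes
hdfs dfsadmin -refreshNamenodes node-3.mycompany.com:50020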
Knowing your NameNode URIs (scheme://authority) as specified in the configuration, you can use them for file system operations such as creating directories and files. Creating a new directory is as simple as running hdfs dfs -mkdir with a fully qualified path name URI of the form scheme://authority/path.
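For example, with the two NameNodes configured earlier, creating a directory in each namespace could look like the following (the directory names are just examples):
hdfs dfs -mkdir hdfs://node-1.mycompany.com:50070/user
hdfs dfs -mkdir hdfs://node-2.mycompany.com:50070/projects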
At this stage, there is no global namespace defined; to a client, each NameNode looks like a separate HDFS cluster, and to perform any operation the client needs to know the right NameNode URI. This makes client programs more complicated and less portable. This is our second problem, and Hadoop addresses it using ViewFS.

ViewFS

The View File System (ViewFS) provides a way to manage multiple NameNodes in a Hadoop cluster, and hence multiple namespaces in HDFS Federation. In the old HDFS implementation, a cluster has a single NameNode, and the cluster's core-site.xml has a configuration property that sets the default file system and points to that NameNode. An example of such a property is shown below.
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://NameNodeOfClusterX:port</value>
</property>
On a cluster where core-site.xml is set as above, a typical path name will be something like /projects. Alternatively, you can specify fully qualified names when referring to an HDFS location, for example hdfs://NameNodeOfClusterX:port/projects. While both are valid, you should prefer the relative name /projects because it makes your code portable to a different cluster and independent of the underlying NameNode.
In a federated HDFS implementation, you have multiple NameNodes, and we have two problems to solve. The first is to assign parts of the file system to individual NameNodes. The second is to keep using relative paths like /projects without having to specify a particular NameNode in the URI.
Hadoop solves this using a mount-table mechanism similar to Unix/Linux mount tables. In the Linux implementation, we give a mounted file system a logical name for clients to use, and the mapping between logical mount names and actual physical locations is stored in a mount table. A similar mechanism is used in ViewFS.
The first step is to specify that we are using ViewFS, and the second step is to create mount tables that resolve logical names to actual physical locations. Specifying that we are using ViewFS is done in the core-site.xml file by changing the value of the fs.defaultFS property. In a non-federated setup this property stores the NameNode URI, but in a federated implementation it stores the mount table name for the cluster. An example of such a configuration is shown below.
<property>
   <name>fs.defaultFS</name>
   <value>viewfs://ClusterX</value>
</property>
In the above example, viewfs is the scheme, which indicates that HDFS Federation and ViewFS are implemented for this cluster and that the available NameNodes can be found in a mount table; the client will look for the mount table information within the cluster configuration. The authority ClusterX in the above example is the mount table name. In a multi-cluster environment, it is recommended that you name mount tables after the cluster names. If you have just a single cluster, you can leave the authority unspecified, which indicates that the default mount table should be used; in that case the entry looks like <value>viewfs:///</value>.
The next step is to define the mount tables. Mount tables can be described in core-site.xml itself; however, you can also define them in a separate XML file and include it in your core-site.xml. An example of a mount table is given below.
<property>
   <name>fs.viewfs.mounttable.ClusterX.link./home</name>
   <value>hdfs://node-1.mycompany.com:50070/home</value>
</property>
<property>
   <name>fs.viewfs.mounttable.ClusterX.link./tmp</name>
   <value>hdfs://node-1.mycompany.com:50070/tmp</value>
</property>
<property>
   <name>fs.viewfs.mounttable.ClusterX.link./projects</name>
   <value>hdfs://node-2.mycompany.com:50070/projects</value>
</property>
In this example, there are two NameNodes: node-1 manages /home and /tmp while node-2 manages only /projects. Clients can directly use relative path names like /home, /tmp and /projects, which get resolved using the configuration file.
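To illustrate, with the mount table above and fs.defaultFS set to viewfs://ClusterX, a client can run ordinary commands with relative paths and let ViewFS resolve them to the right NameNode (the file name below is made up):
hdfs dfs -ls /home                # resolved to node-1
hdfs dfs -put report.csv /tmp     # resolved to node-1
hdfs dfs -ls /projects            # resolved to node-2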
If you are using the default mount table, the configuration will look like the one below.
<property>
   <name>fs.viewfs.mounttable.default.link./projects</name>
   <value>hdfs://node-2.mycompany.com:50070/projects</value>
</property>
You can use fully qualified names like viewfs://ClusterX/projects if you are outside the cluster. However, if you are within the cluster, you should avoid that naming convention and prefer relative paths like /projects; this keeps your code portable from one cluster to another.

Wednesday, February 26, 2014

Big Data – Enterprise security requirements

In my previous post, I talked about the need for securing big data environments and mentioned 8 key areas of enterprise security. Let's discuss these areas in a little more detail to understand the concepts behind these terms. Such understanding is necessary to know what we intend to do under each vertical of enterprise security, and it forms the basis for requirements. Before that, I want to share some stats from the PwC survey “The Global State of Information Security Survey 2014”. The detailed report of this survey is available here.

Authentication

Authentication is the first level of security: one of the simplest terms to understand but one of the most complex things to implement. Authentication means proving that you are who you claim to be. Everyone has experienced authentication when providing a user name and password to a system they want to use. But there are several questions you may ask while implementing it; some of them are listed below.
  1. Who (which server) will authenticate users?
  2. When you log in from a client machine, do you want to send your password to the server for verification?
  3. Do you want all communication between client and server to be encrypted?
  4. How do you want to manage encryption keys?
  5. How do you want to manage the list of all valid users?
  6. Do you want to allow users to log in from any machine or device, such as their mobile?
  7. What are the valid applications (command line interfaces, commercial applications and custom-built applications) that will be allowed to access your systems?
  8. How will you ensure those applications use a secure method to establish connections?
  9. Do you want single authentication for all your services?
  10. How will you integrate it with your existing authentication system?

Authorization
Once you are authenticated and the system knows who you are, you can get into the system. What's next? You will want to perform some tasks: execute programs, scripts or commands, read or write files, or access services and other systems within the network. That's where authorization plays its role: it controls what you are allowed to do after entering the system. In fact, authorization and authentication are tightly coupled in any system, and authorizing a user may again trigger a need for subsequent authentication. For example, if you are authorized to execute a program that connects to a database, then you also need to pass database authentication. This is a typical situation where you may need single sign-on; in its absence you end up creating many credentials. The question is how you want to do this in your system and how you want to manage such passed-on authorizations in a secure manner. Do you want to embed such credentials within your applications? I have seen implementations where all application users share a single credential to connect to specific services like databases, while user management is controlled through another layer within the application. This approach may be fine as long as you can implement a mechanism to trace application users all the way to the end point of the database activity.
Another consideration under this vertical is how you want to manage machine-to-machine or service-to-service authentication and authorization within your system. For example, in a Hadoop system, when a node wants to register itself as a DataNode, how do you check whether it is authorized to register as a DataNode and, more importantly, how do you verify that it is the node it claims to be? Such issues are addressed under authorization.
Access Control
If you are familiar with any RDBMS technology, you probably already understand access control. Authorization tells you what you can do, for example which files you can read, but it does not extend to a fine-grained level. RDBMSs are excellent at providing such fine-grained access control. In a typical RDBMS, you can control which tables a user can access or read. It goes further and allows you to restrict specific columns, and some databases give you the capability to restrict rows as well; databases that don't offer row restriction directly provide it indirectly using views. Your big data environment is also going to store data, and in that respect it is no different from your database. Without such access control capabilities, how can you imagine a system that stores data? It is a very obvious requirement that you may not want everyone to see everything in a file. It doesn't end there; you need much more than that. For example, you may want to build various access profiles, assign access credentials to profiles and then assign profiles to users, or you may want other ways of defining policy-based or role-based access control. You may also want policy-based resource management, for example who can use how much disk, CPU and memory. Such things are considered under access control.
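In HDFS itself, the basic building blocks for this are file ownership and permissions, with finer-grained POSIX-style ACLs added in newer Hadoop releases. A small sketch is shown below; the user, group and path names are made up for illustration:
hdfs dfs -chown analyst:finance /data/payments
hdfs dfs -chmod 750 /data/payments                      # owner full access, group read/execute, others nothing
hdfs dfs -setfacl -m user:auditor:r-x /data/payments    # per-user ACL entry (available from Hadoop 2.4 onwards)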
Encryption & Masking
Access control gives you the ability to restrict access to individual data items for specific users, but it does not solve data security completely. There are many complex requirements for data security. For example, PCI-DSS is a necessary compliance requirement for anyone dealing with payment card holder data. One such card holder data element is the PAN (primary account number). As per PCI-DSS requirements, you can store the PAN in your system but you cannot reveal or render this number in readable form to just anyone. That's why your card number is always printed on your receipt as *********wxyz. It is a complex requirement: you want to store the PAN accurately, you want your users to have read access to it, but you don't want them to be able to see the actual number. How would you implement this in your system as a security rule? Yes, you guessed it correctly: the answer is encryption.
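As a trivial illustration of the rendering side of this requirement (not a substitute for real encryption or tokenization), a PAN can be masked so that only the last four digits remain readable. The number below is a standard test card number, not real data:
echo "4111111111111111" | sed -E 's/[0-9]{12}([0-9]{4})/************\1/'
# prints ************1111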
There are other requirements related to privacy, with well-known examples such as a retailer knowing that a girl is pregnant before her father does. A mobile operator has enough data to draw patterns of where you spend most of your time and to easily identify your relationships. There are plenty of regulations and laws to protect people's privacy, and your big data system will have to comply with them. Masking and tokenization are the most suitable techniques to take you closer to such compliance requirements.
Network Security
Network security is all about securing your network from unauthorized access. It is about drawing a virtual boundary around your network and restricting all access and entry into the network to one or more well-secured gates. By doing this you control every inbound and outbound movement of data and information across your network. Firewalls, proxies and gateways are the best answers for such requirements. Apart from these considerations, you may also have to protect data in flight while it is being transmitted over the network.
System security
System security covers file system security, software, patches, updates and so on. It is obvious that outdated software and missing patches and updates leave scope for vulnerabilities in your system. You have to have a mechanism for deployments, as much automation as possible, and methods to easily identify such issues and fix them. The file system is another key part of Hadoop security and you need to pay special attention to it. We all understand that HDFS is not a truly POSIX-compliant file system, and data lies in blocks that are by default exposed to everyone who has access to the DataNodes. You will have to secure block-level data; encryption/decryption may be a good solution here, but there are many complexities to be addressed with respect to Hadoop.
Infrastructure security
Infrastructure security is mostly about controlling physical access to your infrastructure, but it is not limited to actual physical access; remote access to your systems is as good as physical access, except for protection against physical damage. This is also the vertical where you plan for disaster recovery, backups/restores and business continuity.
Audit & Monitoring
Audit and monitoring is an extremely complex and vast area in terms of security considerations. You need an easy and workable mechanism for monitoring your system to confirm that everything is in place and working as expected, along with automated and manual mechanisms for discovering and alerting on unusual events and activities. Just look at the PwC survey results: current employees and trusted insider partners account for a major chunk of the likely sources of security breach incidents. In such an environment you are not secure just by implementing enough security measures to cover the items we discussed above. No security system is perfect, and you have to have a monitoring system in place to track who is doing what and to surface unusual patterns of activity so that you can strengthen your system further.
Implementing such a monitoring system may be an extremely complex requirement in the absence of the right methodology and technology. To implement an effective monitoring system, you have to enable extended logging, and before that you have to understand where and what should be logged. Once you have all the required logging in place, you run into a new problem: collecting all those logs from various individual machines and systems into a central repository. After that there are further problems to address: how will you draw information out of that raw data, what reports will you prepare, what are your KPIs and thresholds to trigger alerts, who will have access to such reports and alerts, and what actions should be taken? Logs and audit information are most valuable when an incident occurs; you need them for investigation and forensics. In the absence of logs and traces, you cannot complete your RCA, cannot understand what other vulnerabilities were introduced by the incident and cannot evaluate the damage.
In this post I tried to discuss the basics of various aspects of enterprise security which must be addressed by your implementation of a big data environment. It is not limited to the discussion above; there are many more considerations, for example you have to have a well-defined security policy and practice, you have to train your people about security and the associated risks and obligations, and you have to have a risk management practice in place. This could be a very long list and there is enough material on the internet. But I wanted to summarize the key concerns of enterprises regarding security, and the best reference I found was the summary of Generally Accepted Areas of Concern in the TOGAF™ 9.1 documents.
Any implementation of a big data solution will have to address all of these concerns to be successful.
Good luck with your implementations.

Tuesday, February 18, 2014

Big Data – Biggest security risk

Have you encountered questions like these?
  1. How secure is the Hadoop ecosystem?
  2. How will users, administrators and analysts use big data in a secure manner?
  3. How does the Hadoop ecosystem fit into existing enterprise security models?
Just open the Hadoop architecture document's “Assumptions and Goals” section here and you will notice that security was never a consideration. Hadoop was not built with enterprise security in mind. But when enterprises start adopting it, they will definitely ask questions similar to those I mentioned.
If you are using Hadoop in a closed, secure environment where no one except a few trusted members accesses it to run some POCs, you may be able to ignore security for a while.
But this is not how enterprises are adopting Hadoop and building Hadoop-based systems. If you are interested in understanding more about how enterprise Hadoop adoption is progressing, you may have to spend some time googling around, but for today's discussion I will refer to a simplified version of the high-level architecture shown by Hortonworks.
A quick scan of the above diagram shows that your Hadoop systems will be accessed by various applications and users in many ways, across the globe, over secured and unsecured networks. On any such platform security is a serious concern, but for Hadoop it is even more vital.
Why? Why more vital for Hadoop?
Look back at the architecture diagram: data is flowing into Hadoop from every possible source, your CRM, ERP, logs, clickstream, sensors, social media and so on. This data is precious to the enterprise and we must secure it from all possible security threats.

Your big data platform stores not only your big data but also all the insights, patterns and analytics results which you have derived or discovered from it. You cannot even ignore the intermediate results you generated during the process of discovering those insights.
Security is a serious concern and an important aspect of big data technology which you have to take care of. You will also have to manage the risk associated with big data security.
The next question is: which aspects of security do we have to consider and cover? To answer this question, we have to look at security from an enterprise perspective.
What is enterprise security?
Enterprise security is mainly driven by three things.
  1. Legislation
  2. Internal policies
  3. Business drivers
Legislation forces regulatory and standards compliance on various enterprises. It might be based on global regulations or on local laws and regulatory needs. A few examples of such standards and regulations are given below.

Global Standards

ISO/IEC 27002:2005 – Code of Practice for Information Security Management
ISO/IEC 27001:2005 – Information Security Management System Requirements
ISO/IEC 15408 – Evaluation Criteria for IT Security
ISO/IEC 13335 – IT Security Management
PCI-DSS – Payment Card Industry Data Security Standard
COBIT – Control Objectives for Information and Related Technology
ITIL – ISO/IEC 20000 series

Regulations in the US

SOX – Sarbanes-Oxley Act of 2002
COSO – Committee of Sponsoring Organizations of the Treadway Commission
HIPAA – Health Insurance Portability and Accountability Act of 1996
FISMA – Federal Information Security Management Act
FIPS – Federal Information Processing Standards

Regulations in EU

Data Protection Act 1984 amended 1998 – UK
Data Protection Act 2004 – France
Directive 95/46/EC of the European Parliament and of the Council – 1995 – EU
RIP/RIPA – Regulation of Investigatory Powers Act 2000 – UK
Federal Data Protection Act 2006 – Germany
Internal policies and business drivers are specific to each enterprise and vary from industry to industry.
Based on all this discussion and an in-depth analysis of various needs, I have tried to build a broader view of enterprise security in the diagram below.

The above diagram lists 8 key verticals of enterprise security, and any big data solution will have to address all of these verticals in order to be adopted by an enterprise.