
Wednesday, February 26, 2014

Big Data – Enterprise security requirements

In my previous post, I talked about the need to secure big data environments and mentioned 8 key areas of enterprise security. Let's discuss these areas in a little more detail to understand the concepts behind all these terms. Such an understanding is necessary to know what we intend to do under each vertical of enterprise security, and it forms the basis for requirements. Before that, I just want to share some stats from the PwC survey "The Global State of Information Security® Survey 2014". The detailed report of this survey is available here.

[Chart: selected statistics from the PwC survey]

Authentication

Authentication is the first level of security: one of the simplest terms to understand but one of the most complex things to implement. Authentication means you must be authenticated to ensure that you are the one you claim to be. Everyone has experienced authentication while giving a user name and password to a system they want to use. But there are several questions you may ask while implementing it; some of them are listed below, followed by a small sketch of one common answer.
  1. Who (which server) will authenticate users?
  2. When you log in from a client machine, do you want to send your password to the server for verification?
  3. Do you want all communication between client and server to be encrypted?
  4. How do you want to manage encryption keys?
  5. How do you want to manage the list of all valid users?
  6. Do you want to allow users to log in from any machine or device, like their mobile?
  7. What are the valid applications (command line interfaces, commercial applications and custom-built applications) that will be allowed to access your systems?
  8. How will you ensure those applications use a secure method to establish connections?
  9. Do you want a single authentication for all your services?
  10. How will you integrate it with your existing authentication system?
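In the hadoop world, the usual answer to questions 1 and 2 is Kerberos: a central KDC authenticates users, and no password ever travels to the hadoop services. As a minimal sketch, assuming a Kerberos-enabled cluster (the principal name and keytab path below are hypothetical placeholders), a client could authenticate like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Switch Hadoop from "simple" (trust whatever the client says)
            // to Kerberos-based authentication.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            // Log in from a keytab so no password is typed or transmitted;
            // the principal and keytab path are illustrative placeholders.
            UserGroupInformation.loginUserFromKeytab(
                    "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
            System.out.println("Logged in as " + UserGroupInformation.getLoginUser());
        }
    }

Once the login succeeds, every subsequent HDFS or MapReduce call carries the authenticated identity.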

Authorization
Once you are authenticated and the system knows who you are, you can get into the system. What's next? You will want to perform some tasks: execute some programs/scripts/commands, read or write some files, access some services or other systems within the network. That's where authorization plays its role; it controls what you are allowed to do after entering the system. In fact, authorization and authentication are tightly coupled in any system. Authorizing a user may again trigger a need for subsequent authentication. For example, you are authorized to execute a program which connects to a database; then you need to qualify for database authentication as well. This is a typical situation where you may need single sign-on. In the absence of single sign-on you will end up creating many credentials, and the question is how you want to do it in your system and how you want to manage such passed-on authorizations in a secure manner. Do you want to embed such credentials within your applications? I have seen implementations where all application users use a single credential to connect to specific services like databases, while user management is controlled through another layer within the application. This approach might be good as long as you can implement a mechanism to trace application users all the way to the end point of database activities.
Another consideration under this vertical is how you want to manage machine-to-machine or service-to-service authentication and authorization within your system. For example, in a hadoop system, when a node wants to register itself as a DataNode, how do you check whether it is authorized to be registered as a DataNode, and more importantly, how do you authenticate that it is the node it claims to be? Such issues are addressed under authorization.
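For tracing application users to the end point, hadoop offers a proxy-user (impersonation) mechanism: a trusted service authenticates once and then acts on behalf of each end user, so the user's real identity, not a shared service credential, reaches the NameNode. A minimal sketch, assuming the service is already authorized as a proxy user via the hadoop.proxyuser.* properties (the user name and path are hypothetical):

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The trusted service has already authenticated (e.g. via Kerberos).
            UserGroupInformation service = UserGroupInformation.getLoginUser();
            // Act on behalf of the end user "alice" instead of a shared credential.
            UserGroupInformation alice =
                    UserGroupInformation.createProxyUser("alice", service);
            alice.doAs((PrivilegedExceptionAction<Void>) () -> {
                FileSystem fs = FileSystem.get(conf);
                // Every file system operation below is audited as alice.
                System.out.println(fs.exists(new Path("/data/alice")));
                return null;
            });
        }
    }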
Access Control
If you are familiar with any RDBMS technology, you probably already understand access control. Authorization tells you what you can do, for example which files you can read, but it does not extend to a fine-grained level. RDBMSs are excellent at providing such fine-grained access control. In a typical RDBMS, you can control which tables a user can access or read. It goes further and allows you to restrict specific columns, and some databases give you the capability to restrict rows as well. Databases that don't provide row-level restriction directly provide it indirectly through views. Your big data environment is going to store data; in that respect, it is in no way different from your database. Without such access control capabilities, how can you imagine a system that stores data? It is a very obvious requirement that you may not want everyone to see everything in a file. It doesn't end there but requires much more: you may want to build various access profiles, assign access credentials to profiles and then assign profiles to users, or you may want different ways of defining policy-based or role-based access control. You may also want policy-based resource management, for example who can use how much disk, CPU and memory. Such things are considered under access control.
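HDFS itself offers a first step in this direction: POSIX-style permissions plus extended ACLs. As a small sketch (assuming dfs.namenode.acls.enabled is set to true; the user name and path are hypothetical), granting one named user read-only access to a file could look like this:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.AclEntry;
    import org.apache.hadoop.fs.permission.AclEntryScope;
    import org.apache.hadoop.fs.permission.AclEntryType;
    import org.apache.hadoop.fs.permission.FsAction;

    public class HdfsAclSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Grant the user "analyst" read-only access to one file, beyond
            // the plain owner/group/other permission bits.
            AclEntry readOnly = new AclEntry.Builder()
                    .setScope(AclEntryScope.ACCESS)
                    .setType(AclEntryType.USER)
                    .setName("analyst")
                    .setPermission(FsAction.READ)
                    .build();
            fs.modifyAclEntries(new Path("/data/sales.csv"),
                    Arrays.asList(readOnly));
        }
    }

Note this is still file-level control; column- or row-level restriction needs an extra layer on top, such as Hive.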
Encryption & Masking
Access control gives you the ability to restrict access to individual data items for specific users. But it doesn't solve data security completely; there are many complex requirements for data security. For example, PCI-DSS is a necessary compliance requirement for anyone dealing with payment card holder data. One such cardholder data element is the PAN (Primary Account Number). As per PCI-DSS compliance requirements, you can store the PAN in your system but you can't reveal or render this number in readable form to just anyone. That's why your card number is always printed on your receipt as ********wxyz. It's a complex requirement: you want to store the PAN accurately, you want your users to have read access to it, but you don't want them to be able to understand the actual number. How would you implement this in your system as a security rule? Yes, you guessed it correctly. The answer is encryption.
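A minimal sketch of what this could look like with standard Java crypto; in a real system the key would come from a key management service, and the algorithm and parameters here are just one reasonable choice (AES-256 in GCM mode), not a prescription:

    import java.security.SecureRandom;
    import java.util.Base64;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;

    public class PanEncryptionSketch {
        public static void main(String[] args) throws Exception {
            // Generated inline only to keep the sketch self-contained; in
            // production the key lives in a key management service.
            KeyGenerator keyGen = KeyGenerator.getInstance("AES");
            keyGen.init(256);
            SecretKey key = keyGen.generateKey();

            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv); // unique IV for every encryption

            String pan = "4111111111111111"; // a well-known test PAN
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal(pan.getBytes("UTF-8"));

            // Only holders of the key can ever recover the real PAN.
            System.out.println(Base64.getEncoder().encodeToString(ciphertext));
        }
    }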
There are some other requirements related to privacy, and there are some strange examples out there, like the retailer who knew a girl was pregnant before her father did. A mobile operator has enough data to draw patterns of where you spend most of your time and to easily determine and identify your relationships. There are enough regulations and laws to protect people's privacy, and your big data system will have to comply with such regulations. Masking and tokenization are the most suitable techniques to take you closer to such compliance requirements.
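Masking in its simplest form just replaces the sensitive part of a value before it is rendered; tokenization goes further and substitutes the whole value with a token that can be mapped back only through a secured vault. A toy sketch of the masking half, matching the receipt example above:

    public class MaskingSketch {
        // Render a PAN as ************wxyz: keep only the last four digits.
        static String maskPan(String pan) {
            StringBuilder masked = new StringBuilder();
            for (int i = 0; i < pan.length(); i++) {
                masked.append(i < pan.length() - 4 ? '*' : pan.charAt(i));
            }
            return masked.toString();
        }

        public static void main(String[] args) {
            System.out.println(maskPan("4111111111111111")); // ************1111
        }
    }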
Network Security
Network security is all about securing your network from unauthorized access. It's about drawing a virtual boundary around your network and restricting all access and entry into the network through one or more well-secured gates. By doing this you control every inward and outward movement of data and information across your network. Firewalls, proxies and gateways are the best answers to such requirements. Apart from these considerations, you also have to protect data on the fly, while it is being transmitted over the network.
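For hadoop specifically, data in transit can be protected with two configuration properties, normally set in core-site.xml and hdfs-site.xml; the sketch below just shows the equivalent programmatic form:

    import org.apache.hadoop.conf.Configuration;

    public class WireEncryptionSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Encrypt Hadoop RPC traffic (SASL quality of protection "privacy").
            conf.set("hadoop.rpc.protection", "privacy");
            // Encrypt HDFS block data moving between clients and DataNodes.
            conf.setBoolean("dfs.encrypt.data.transfer", true);
        }
    }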
System Security
What comes under system security? It's about file system security, software, patches and updates etc. It's very obvious that outdated software and missing patches and updates leave scope for vulnerabilities in your system. You have to have a mechanism for deployments, maximum possible automation, methods to easily identify such issues and mechanisms to fix them. The file system is another key part of hadoop security, and you need to pay special attention to it. We all understand that HDFS is not a truly POSIX-compliant file system, and data lies in blocks that are by default exposed to everyone who has access to the DataNodes. You will have to secure block-level data; encryption/decryption may be a good solution in this case, but there are many complexities to be addressed with respect to hadoop.
Infrastructure Security
Infrastructure security is mostly about controlling physical access to your infrastructure, but it is not limited to actual physical access. Remote access to your systems is as good as physical access, except for protection against physical damage. This is also the vertical where you will have to plan for disaster recovery, backups/restores and business continuity.
Audit & Monitoring
Audit and monitoring is an extremely complex and vast area of security. You will need an easy and workable mechanism for monitoring your system, to confirm that everything is in place and working as expected. You need automated and manual mechanisms for discovering and alerting on unusual events and activities. Just look at the PwC survey results: current employees and trusted insider partners make up the major chunk of likely sources of security breach incidents. In such an environment you are not secure just by implementing the security measures discussed above. No security system is perfect, and you will have to have a monitoring system in place to track who is doing what and to draw out unusual patterns of activity, so that you can strengthen your system further.

Implementing such a monitoring system may be extremely complex in the absence of the right methodology and technology. To implement an effective monitoring system, you will have to enable extended logging, and before that you will have to understand where and what should be logged. Once you have all the required logging in place, you get into a new problem: collecting all those logs from the various individual machines and systems into a central repository. After that come further problems: how will you draw information out of that raw data, what reports will you prepare, what are your KPIs and thresholds for triggering alerts, who will have access to such reports and alerts, and what actions are to be taken? Logs and audit information are most valuable when an incident occurs; you need them for investigation and forensics. In the absence of logs and traces, you can't complete your RCA, can't understand what other vulnerabilities were caused by the incident, and can't evaluate the damage.
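As a toy illustration of "drawing information out of raw data", assuming logs in the standard HDFS NameNode audit format (lines containing ugi=<user> ... cmd=<operation>), counting operations per user is a first step towards spotting unusual patterns:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AuditLogSummary {
        // Matches the user and command fields of an HDFS audit log line.
        private static final Pattern ENTRY =
                Pattern.compile("ugi=(\\S+).*?cmd=(\\S+)");

        // Usage: java AuditLogSummary <audit-log-file>
        public static void main(String[] args) throws Exception {
            Map<String, Integer> opsPerUser = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                Matcher m = ENTRY.matcher(line);
                if (m.find()) {
                    opsPerUser.merge(m.group(1), 1, Integer::sum);
                }
            }
            // A sudden spike for one user may be worth an alert.
            opsPerUser.forEach((user, n) -> System.out.println(user + ": " + n));
        }
    }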
In this post I tried to discuss the basics of the various aspects of enterprise security which must be addressed by your implementation of a big data environment. It is not limited to the discussion above; there are many more considerations. For example, you will have to have a well-defined security policy and practice, you will have to train your people on security and the associated risks and obligations, and you will have to have a risk management practice in place. This could be a very long list and there is enough material on the internet. But I wanted to summarize the key concerns of enterprises regarding security, and the best reference I found was in the TOGAF 9.1 documents. Here is a summary of the Generally Accepted Areas of Concern as per TOGAF™ 9.1.

[Figure: TOGAF 9.1 Generally Accepted Areas of Concern]

Any implementation of a Big Data solution will have to address all of the above concerns to be successful.
Good luck with your implementations.

Tuesday, February 18, 2014

Big Data – Biggest security risk

Have you encountered questions like these?
  1. How secure is the hadoop ecosystem?
  2. How will users, administrators and analysts use big data in a secure manner?
  3. How does the hadoop ecosystem fit into existing enterprise security models?
Just open the hadoop architecture document's "Assumptions and Goals" section here and you will notice that security was never a consideration. Hadoop was not built with enterprise security in mind. But as enterprises start adopting it, they will definitely ask questions similar to those mentioned above.
If you are using hadoop in a closed, secure environment, and no one except a few trusted members accesses it to perform some POCs, you may ignore security for a while.
But this is not how enterprises are adopting hadoop and building hadoop-based systems. If you are interested in understanding how enterprise hadoop adoption is progressing, you may have to spend some time googling around, but for today's discussion I will refer to a simplified version of the high-level architecture shown by Hortonworks.

[Diagram: Hortonworks high-level Hadoop architecture]
Just a quick scan of the above diagram shows that your hadoop systems will be accessed by various applications and users, in many ways, across the globe, over secured and unsecured networks. On any such platform security is a serious concern, but for hadoop it is even more vital.
Why? Why more vital for hadoop?
Just look back at the architecture diagram: data flows into hadoop from every possible source, your CRM, ERP, logs, click streams, sensors, social media etc. This data is precious to the enterprise, and we must secure it from all possible security threats.

Your big data platform doesn't only store your big data but also all the insights, patterns and analytics results which you have derived or discovered from it. You can't even ignore the intermediate results you generated during the process of discovering insights from your big data.
Security is a serious concern and an important aspect of big data technology which you have to take care of. You will also have to manage the risk associated with big data security.
The next question is: what aspects of security do we have to consider and cover? To answer this question, we have to consider the enterprise perspective on security.
What is enterprise security?
Enterprise security is mainly driven by three things.
  1. Legislation
  2. Internal policies
  3. Business drivers
Legislation forces regulatory and standards compliance on various enterprises. It might be based on global regulations or on local laws and regulatory needs. A few examples of such standards and regulations are given below.

Global Standards

ISO/IEC 27002:2005 – Code of Practice for Information Security Management
ISO/IEC 27001:2005 – Information Security Management System Requirements
ISO/IEC 15408 – Evaluation Criteria for IT Security
ISO/IEC 13335 – IT Security Management
PCI-DSS – Payment Card Industry Data Security Standard
COBIT – Control Objectives for Information and Related Technology
ITIL – ISO/IEC 20000 series

Regulation in US

SOX – Sarbanes-Oxley Act of 2002
COSO – Committee of Sponsoring Organizations of the Treadway Commission
HIPAA – Health Insurance Portability and Accountability Act 1996
FISMA – Federal Information Security Management Act
FIPS – Federal Information Processing Standards

Regulations in EU

Data Protection Act 1984 amended 1998 – UK
Data Protection Act 2004 – France
Directive 95/46/EC of the European Parliament and of the Council – 1995 – EU
RIP/RIPA – Regulation of Investigatory Powers Act 2000 – UK
Federal Data Protection Act 2006 – Germany
Internal policies and business drivers are specific to each enterprise and vary from industry to industry.
Based on all this discussion and an in-depth analysis of various needs, I have tried to build a broader view of enterprise security in the diagram below.

[Diagram: the 8 key verticals of enterprise security]

The above diagram lists the 8 key verticals of enterprise security; any big data solution will have to address all of these verticals to be adopted by an enterprise.

Tuesday, October 29, 2013

Big Data Analytics - Getting Started

So far we have seen what big data is and what businesses can do with it. We clearly understand that businesses will mostly use big data for analytics. Now it becomes an IT problem to facilitate this analytics. Data analytics has been around for a long time, and the IT industry has developed various solutions to facilitate it. But big data poses new challenges.

Data analytics has three major areas, and they are fairly simple to understand.
1. Data collection: we have to collect data to perform any kind of analytics.
2. Data crunching: once data is available, we perform the analysis.
3. Data visualization: finally, the analysis needs to be presented in an intuitive manner, which might be in the form of tables, charts, maps, patterns etc.
There is one more step followed in traditional data analytics, called data transformation. Data transformation is all about structuring data before data crunching. Transformation requires schema design, and developing transformation and loading routines to load data into those schemas. Data transformation has its own challenges. The biggest challenge is the cost and time needed to design and load these structures, knowing that no single structure fits all dynamic business requirements. This makes the work repetitive and reduces ROI for businesses. The need is to be able to deal with unstructured data quickly, in a more agile manner. That's the first problem. The second problem is being able to read and process large data. To read or write 1 TB of data on today's realistic hard drives at 100 MB/s takes more than 2.5 hours, and at 500 MB/s it still takes more than 30 minutes.
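The arithmetic behind those numbers, as a quick sanity check (and the reason Hadoop spreads reads across many disks in parallel):

    public class ReadTimeCheck {
        public static void main(String[] args) {
            double bytes = 1e12;          // 1 TB
            double at100 = bytes / 100e6; // seconds at 100 MB/s
            double at500 = bytes / 500e6; // seconds at 500 MB/s
            System.out.printf("100 MB/s: %.1f hours%n", at100 / 3600);  // ~2.8
            System.out.printf("500 MB/s: %.1f minutes%n", at500 / 60);  // ~33.3
        }
    }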
The answer to the first problem, dealing with unstructured data, is NoSQL databases. The next level on top of NoSQL databases is the tools and scripting languages to work with them. There are proprietary and open source solutions available.
The answer to the second problem, read/write and computing speed, is Hadoop and MapReduce. This is a purely open source solution. Hadoop and MapReduce have emerged as the core technology for big data. In fact, some NoSQL databases and other related technologies are also built on top of Hadoop and MapReduce. The reason is obvious: we need to deal with unstructured data, and we need speed as well.
So, what are the core technologies? The answer is Java and Linux. Hadoop is developed in Java on the Linux platform. Anyone willing to get into big data analytics should start learning Linux and Java first. There are many things you can learn and ways to get started with hadoop without knowing Java and Linux, but you can't go too far without them.
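To give a first taste of that Java, here is the classic word-count job, essentially the canonical Hadoop MapReduce example: the mapper emits (word, 1) pairs and the reducer sums them. The input and output paths come from the command line.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit (word, 1) for every token in the input line.
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                // Sum the counts produced for each word across all mappers.
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would run it with something like hadoop jar wordcount.jar WordCount /input /output (the paths are illustrative).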
Other than Linux and Java, there are many things to learn in the hadoop ecosystem; some of them are listed below.
  1. Hadoop or HDFS
  2. YARN, MapReduce and TEZ
  3. HBase and Cassandra
  4. Hive & HCatalog
  5. Pig
  6. Oozie
  7. Zookeeper
  8. Ambari
  9. Sqoop
  10. Hue
  11. Mahout
  12. Lucene and Solr
  13. Flume
  14. Avro
Before we start learning the individual components of the hadoop ecosystem, it is good to set up a portable hadoop environment of your own. I will cover that in the next post.

Tuesday, October 22, 2013

Data Monetization and Business Metamorphosis

Data has always been the most important asset of any organization. But limits on the amount of data which can be stored, maintained and processed cost-effectively and efficiently have constrained businesses from taking full advantage of it. With the advent of various big data tools and technologies to overcome these limits, data has become a real game changer. Think of Google Maps: all of the map and spatial services offered by Google would be impossible if they did not have that huge amount of digital data or were not able to process it so efficiently.

In this post, we will try to understand the highest levels of big data usage and how organizations are transforming themselves by taking advantage of big data. Obviously, at this time, Google is the best in this category. But let's take the example of a telecom company to understand how they are selling their data to generate additional revenue, creating new products to capture new markets, and transforming themselves. We will look at one of the world leaders in the telecommunications sector, Telefónica. You can visit the link below and I am sure you will get it easily.
The rest of the post just replicates the information available at that link.
This telecom major has used the fact that people keep their mobile within a few meters (less than 5 meters) of themselves. They collected mobile locations 24x7 and built a continuously updating data store. This database is transformed using proprietary logic to make it anonymous, aggregated and extrapolated, avoiding any compromise of people's privacy without losing relevance. A software product was developed on top of this transformed database, using big data technologies, to perform so-called crowd analytics. Crowd analytics is highly important in delivering insights for industries like retail, which have a direct business correlation with footfall rate. As per the company, it helps deliver key business insights. Some examples are given below.
  • How does my store performance compare to the performance of the locations in which I trade?
  • What is the best location for me to invest in opening a new store? And what format of store should I open?
  • What are the best opening times and staffing profiles for each of my stores?
  • Where are people travelling from to my stores?
  • Are there specific areas where I should target my marketing campaigns? How should I vary my message in different parts of my catchment?
  • Where am I competing for customers?
  • What is the profile of the crowd passing through areas A, B & C where I am looking to place an advert? How does that change during the day and the week?
The company itself has started its transformation under a new division, Telefónica Digital. This is happening not only in telecom but in almost all industries and all sizes of organizations. Raise your interest and Google any specific industry; a little hard work will give you a wealth of information on how organizations are moving towards harnessing the power of big data. The good news: this is the beginning of a lot of innovation in and around data.
This post ends the series "What will business do with big data?" However, I will keep posting interesting use cases from various industries in the future.
What's next?
We will explore "How to do it?"

Monday, October 21, 2013

Business Insight and Optimization

In this post, we will try to understand what business insight means and how we can achieve business optimization. We will take simple examples to get a feel for these concepts.
 
Business Insight Example:
An executive from an online retail company was concerned about customers abandoning shopping carts. He wanted to quantify the opportunity and understand the root cause. He initiated an analysis of the logs generated by their shopping cart. They noticed that a good number of their customers leave their cart at some point without completing their purchase. Doing further analysis, they realized that the items in those carts sum to approximately 18% of their revenue. From further analysis of what their customers were saying about them on social media, an important insight emerged: customers find it cheaper to buy those items from their neighboring local market. Online retailers on the portal offer reasonable discounts, but the delivery charges on such orders make them more expensive compared to local market prices. Most such orders are smaller than 750 INR. The portal offers free delivery for orders over 750 INR, and 12% out of those 18% of uncompleted orders are actually over 650 INR.
As a result of this analysis, the portal decided to reduce the minimum order value for free delivery to 650 INR for a trial period, and saw a reduction in abandoned shopping carts during that period.
This is a very simple example of discovering a business insight by analyzing data from various sources.
 
Business Optimization Example:
This one is the most complex to understand, and to implement as well. Since such models need no or minimal manual intervention, accuracy is a key concern.
One of the most common and well-known examples of such an implementation is the automatic filtering of spam in Gmail. Gmail has an automated system that helps detect spam by identifying viruses and suspicious messages, finding patterns across messages, and learning from what Gmail users like us commonly mark as spam.
We will consider another example.
For a retailer, it is extremely important to manage their out-of-stock rate, which means they ideally never want to be out of stock for any product. But at the same time, they have to manage their write-off rate, which means they do not want to throw away, or sell at a discounted price, expired or spoiled products. These two indicators are in conflict; over-managing one will distort the other.
The existing method at a retail chain with excellent processes might look as follows.
Every midnight, during a specified window, all sales predictions are made using a standard prediction mechanism. Based on those predictions, orders are calculated for each store and sent to the store manager. The store manager then looks at the orders and manually adjusts them, reducing or increasing quantities based on his experience and local knowledge. What's that local knowledge? It is location, season, weather, price changes, promotions, events etc. All these factors, and many more, are evaluated on the basis of human experience to adjust the orders.
We are assuming that all these orders are generated automatically and are just manually cross-checked and adjusted; imagine the amount of work it would take to do this seriously. A big retail superstore would easily have 10K+ products. Cross-checking and adjusting them manually every day is simply not possible, which leaves no option but to rely on the standard computer-generated orders. These orders have a direct relationship with the out-of-stock rate, directly impacting sales. On the other hand, they also have a direct relationship with the write-off rate, directly impacting profit.
Manual ordering is impossible, but if the order-generating system has that local knowledge, or access to the respective data and the speed to infer local knowledge, we can automate this decision-making process and optimize both sales and profit.
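As a toy sketch of that idea (all the factor values below are invented purely for illustration), the automated system would scale the baseline prediction by the same local factors a store manager applies from experience:

    public class OrderOptimizationSketch {
        // Scale a baseline sales prediction by local factors that a store
        // manager would otherwise apply manually. Factor values are invented.
        static int adjustOrder(int predictedDemand, boolean promotionRunning,
                               boolean badWeatherForecast) {
            double factor = 1.0;
            if (promotionRunning) factor *= 1.25;   // promotions lift demand
            if (badWeatherForecast) factor *= 0.85; // bad weather cuts footfall
            return (int) Math.round(predictedDemand * factor);
        }

        public static void main(String[] args) {
            System.out.println(adjustOrder(200, true, false));  // 250
            System.out.println(adjustOrder(200, false, true));  // 170
        }
    }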

Friday, October 18, 2013

Business Monitoring

This post will try to build an idea of business monitoring through a use case.
The most common BI implementations focus on monitoring business operations and performance.
They achieve this through the following key tools.
  1. Reports with aggregation and drill down capabilities.
  2. Dashboards with collaboration capabilities.
  3. Key Performance Indicators (KPIs) and metrics.
  4. Alerts and notifications.
Let’s take a use case to explain it.
Cash flow from customers and overdue balances are key indicators to monitor in any business. The average overdue balance exceeding a threshold should trigger an automatic alert to the responsible executive. The executive should then be able to get into the BI system, analyze the problem and take corrective actions.
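A toy sketch of such an alert rule (the threshold and customer figures are invented for illustration; a real BI system would make the KPI configurable and route the alert to a dashboard or email rather than stdout):

    import java.util.HashMap;
    import java.util.Map;

    public class OverdueAlertSketch {
        // Illustrative threshold; a real BI system makes this a configurable KPI.
        static final double THRESHOLD = 100000.0;

        static void checkOverdue(Map<String, Double> overdueByCustomer) {
            double total = 0;
            for (double amount : overdueByCustomer.values()) {
                total += amount;
            }
            double average = total / overdueByCustomer.size();
            if (average > THRESHOLD) {
                // A real system would notify the responsible executive here.
                System.out.println("ALERT: average overdue " + average
                        + " exceeds threshold " + THRESHOLD);
            }
        }

        public static void main(String[] args) {
            Map<String, Double> balances = new HashMap<>();
            balances.put("Customer A", 150000.0);
            balances.put("Customer B", 90000.0);
            checkOverdue(balances); // average 120000 -> alert fires
        }
    }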
Typically, he should be able to get an aggregated summary of the problem and might be interested in getting answers to follow-up questions such as the ones below.
  • What's the overdue pattern? For example, how much do we receive within periods of 30 days, 60 days and 90 days?
  • If the 90-day slot is a problem for him, he might be interested in drilling down to get customer-wise, region-wise or product-wise patterns for 90-day transactions. He should be able to analyze and get insights to narrow the problem down to a specific region, customer or maybe product.
  • Doing all this analysis by looking at various reports, he may discover a simple reason: a specific customer is causing this alert. He has identified a problem with a customer.
  • It may be a somewhat more complex discovery: invoices for a particular product are being paid in 90 days in a specific region by the majority of customers. He has identified a problem with a specific product in a particular region.
  • He might be interested in quarterly, half-yearly and annual patterns. He should be able to compare this year with the previous year, this quarter with the previous quarter, the same quarter in the previous year etc. to understand and discover patterns indicating the root cause.
Once the root cause is understood, appropriate corrective actions can be taken; for example, if a customer is causing the issue, we may reduce the risk by holding further supplies till the overdue balance is cleared.
The above example is a clear case of business monitoring. Such implementations can also deliver, or be extended to deliver, some level of business insight.
Implementing such systems requires a mix of relational and dimensional modeling techniques to model the data. These implementations are termed OLAP (online analytical processing) systems. OLAP implementations give tremendous capabilities, some of which are listed below.
  1. Calculating across dimensions and across hierarchies.
  2. Analyzing trends
  3. Drilling up and down through hierarchies
  4. Rotating to change the dimensional orientation
  5. Forecasting
  6. What-if analysis.
OLAP implementations for BI deliver great business monitoring. They have also been reasonable at delivering business insights. But due to performance problems, and limitations on the amount of data such systems can handle, they have not been able to move beyond that towards business optimization, data monetization and business metamorphosis.

Keep reading; the next post will cover business insights through a use case.

Wednesday, October 16, 2013

What business will do with big data?

There are numerous possibilities, and it is almost impossible to prepare a list of all the ways data can be used effectively. It varies from business to business and depends on their needs. It is actually intellectual property of the business, to be envisioned, built and nurtured. The most important thing for businesses is to get started. A typical business will start from its existing investment in BI infrastructure and extend it against a defined roadmap. There is no standard roadmap which fits all businesses, but the most interesting one I found was presented by Bill Schmarzo (CTO, EMC Consulting) during the O'Reilly Strata Conference 2013. The video compilation of the conference could be a valuable learning resource; it can be purchased from the O'Reilly website (http://shop.oreilly.com/product/0636920029618.do) or accessed from your Safari Books account.
In his presentation, Bill defined 5 levels of BI implementation.
  1. Business Monitoring: monitor business performance and flag areas of interest.
  2. Business Insights: uncover relevant insights buried in data, and use predictive analytics to generate recommendations and facilitate decision making in operational processes.
  3. Business Optimization: create self-sustained analytic models that automate and optimize business processes.
  4. Data Monetization: leverage your business data, insights and investment in analytics IP to identify new revenue opportunities, maybe through your customers or third parties.
  5. Business Metamorphosis: use insights about customers, products and market trends to identify new products, services and markets.
For a business, it is vital to understand the concepts and capabilities behind these implementations, and they are equally important for a big data professional who wants to excel at his job. Big data technologies will extend them to new extremes. The best way to understand them is through examples, which I will attempt in subsequent posts.

Sunday, October 13, 2013

What is BIG DATA?

Big data is a buzzword in business. The term describes a problem caused by the exponential growth and availability of data. The industry has defined it in terms of the three Vs.
  1. Velocity
  2. Variety
  3. Volume
Velocity: in today's world, data is being generated at a never-before-seen speed. Think of social media sites on the internet: millions of people generating social interaction data every second. This could be terabytes every hour, at least. The growth bursts when machines, instead of humans, start generating data; think of sensors, tracking devices etc.

Variety: data today comes in all types of formats. This may be structured data as in our traditional databases; unstructured text data like documents, emails and various logs; and also pictures, images, audio and video.

Volume: when varieties of data come in at extremely high velocity from various sources, like years of business and financial transactions, social interactions of millions of people every day, and machine-to-machine interactions, volume is the real problem at the bottom of it all. In earlier days storage used to be a problem, but the decreasing cost of storage allows businesses to store all of it. This makes really BIG DATA available to businesses in a way it never was before.

OK, then what's the problem here? Well, there are two problems.
  1. What to do with this data?
  2. How to do it?
It is important to understand that the first problem, "What to do with this data?", is really a business problem. The second one, "How to do it?", is probably a shared problem between business and IT.