Thursday, October 31, 2013

Getting a Portable Hadoop Environment

Before we start learning individual components of the Hadoop ecosystem, it is good to set up your own portable Hadoop environment. There are various options available, but after playing with some of them, I personally find the Hortonworks Sandbox virtual machine image for Oracle VirtualBox the easiest way to get started with the Hadoop ecosystem.
Everything you need is freely available and fairly simple to set up. Follow these steps if you are new to downloading and installing software.
  1. Download Oracle VirtualBox from here.
I have Microsoft Windows 7 on my laptop, so I need VirtualBox for Windows hosts as shown in the image below.
 
 
 
  2. In the previous step you downloaded the VirtualBox-4.3.0-89960-Win.exe file. Now you just need to execute it and follow the on-screen instructions. In fact, you just need to click the Next button three times, the Yes button once, the Install button once and finally the Finish button. That's it. You now have powerful virtualization software on your laptop.
  3. After installation, Oracle VirtualBox should start automatically and look like the image shown below. You can close it for now; when you need to start it again, you will find it in your Start menu.

  4. After installing Oracle VirtualBox on Microsoft Windows 7, I found a new network connection had been set up. The presence of this network stopped me from connecting to the internet using my data card. If you face a similar problem, you need to disable this network as follows.
Open Control Panel->Network and Internet->Network and Sharing Center->Change Adapter Settings. You should see a network connection as shown below. Right-click it and disable it. You may have to reconnect your internet once again.

  5. Now you are ready to set up a virtual machine in Oracle VirtualBox. For our purpose, you need to download the Hortonworks Sandbox appliance from here.



  6. You downloaded the Hortonworks+Sandbox+2.0+VirtualBox.ova file.
  7. Start Oracle VirtualBox and select File->Import Appliance as shown in the image below.




  8. You will see a dialog box. Click the Browse button and select the Hortonworks+Sandbox+2.0+VirtualBox.ova file which you downloaded from the Hortonworks website.



  9. Click the Next button and then the Import button when the new dialog box appears. Your virtual machine import will start and complete in a few minutes.


  10. After completion, you will see Hortonworks Sandbox 2.0 listed in your Oracle VirtualBox as shown below.
  11. You will notice that your virtual machine is configured with 2048 MB of memory. This means your guest OS, once started, will consume 2 GB of your RAM. If you have 3 GB or more RAM on your laptop, you don't need to worry. If you have less than that, you can click the System button just above and reduce the memory to 1 GB.
  12. To start your virtual machine, click the Start button. Once it boots, you will have your portable Hadoop running behind a familiar Linux terminal interface.


  13. To shut down your virtual machine, click ACPI Shutdown as shown in the screen below.




Wednesday, October 30, 2013

What is Hadoop?

We will use a question-and-answer format to properly understand Hadoop.

Question: What is Hadoop in its simplest form?
Answer: In its simplest form, it is a distributed file system named the Hadoop Distributed File System and abbreviated as HDFS. It is similar to the hierarchical file systems we use on our laptops, i.e. users can create directories and they can create, move, remove and rename files inside these directories. Users must use HDFS client software for this purpose.
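As an illustration, below is a minimal sketch of these namespace operations using the HDFS Java client (org.apache.hadoop.fs.FileSystem). The NameNode URI hdfs://sandbox:8020 and the /user/demo paths are assumptions for the sandbox; adjust them for your own cluster.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsNamespaceDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address for the sandbox; change it to match your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://sandbox:8020"), conf);

            fs.mkdirs(new Path("/user/demo/input"));              // create a directory
            fs.rename(new Path("/user/demo/input"),               // move/rename it
                      new Path("/user/demo/raw"));
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath());             // list directory contents
            }
            fs.delete(new Path("/user/demo/raw"), true);          // remove it recursively
            fs.close();
        }
    }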
Question: What do you mean by distributed file system?
Answer: A single file is broken into chunks and distributed over multiple systems. These chunks are not random; they are equal-sized blocks (except the last block), as defined by the HDFS block size. Let's assume you have a 500 MB file and that the HDFS block size for the file is 64 MB. The file is broken into 7 blocks of 64 MB and an 8th block of 52 MB. These 8 blocks of your file are sprinkled amongst various machines.
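The block arithmetic is just division with a remainder; a tiny illustrative sketch in plain Java (not HDFS code) makes it concrete:

    public class BlockMath {
        public static void main(String[] args) {
            long fileSize  = 500L * 1024 * 1024;   // 500 MB file
            long blockSize = 64L * 1024 * 1024;    // 64 MB HDFS block size

            long fullBlocks  = fileSize / blockSize;                 // 7 full blocks
            long lastBlock   = fileSize % blockSize;                 // 52 MB remainder
            long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);

            System.out.println(totalBlocks + " blocks, last block = "
                    + lastBlock / (1024 * 1024) + " MB");
        }
    }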
Question: How do we read such broken files?
Answer: To understand that, you need to understand the architecture. HDFS has a master/slave architecture. There is a master called the NameNode, and this master has one or many slaves called DataNodes. NameNode and DataNode are nothing but specialized software. In a typical Hadoop cluster, one machine runs the NameNode software and the other machines run the DataNode software; we usually have one instance of the DataNode running per machine. It is the NameNode that facilitates reading those broken pieces of your file.
Question: What are the duties of the NameNode and the DataNode?
Answer:
Duties of NameNode
  • Manage the file system namespace, i.e. execute namespace operations like opening, closing and renaming files and directories.
  • Keep track of blocks, their mapping to files and their location on DataNodes, and hence manage all metadata.
Duties of DataNode
  • Manage the storage attached to the machines they run on, i.e. block creation, deletion and replication.
  • Serve read and write requests from clients; hence user data does not flow through the NameNode but goes directly to the DataNodes.
Question: What if one of the DataNodes fails? Do I lose some blocks of my file?
Answer: No, HDFS is designed with fault tolerance by default. An application can define a replication factor for a file. Let's assume you define the replication factor for your file as 3. This means that HDFS will maintain 3 copies of each block of your file on different DataNodes.
Question: Do we have a system-wide HDFS block size and replication factor?
Answer: There are cluster-wide defaults, but both the replication factor and the block size can be configured per file.
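As a sketch of how this per-file configuration looks from the Java API (the URI, path and chosen values are assumptions): the FileSystem.create overload below accepts a replication factor and a block size, and setReplication changes the replication of an existing file.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileSettingsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://sandbox:8020"), conf);
            Path file = new Path("/user/demo/custom.txt");

            // Create a file with replication factor 3 and a 128 MB block size.
            FSDataOutputStream out = fs.create(file, true, 4096,
                    (short) 3, 128L * 1024 * 1024);
            out.writeBytes("per-file settings demo\n");
            out.close();

            // Replication can also be changed later for an existing file.
            fs.setReplication(file, (short) 2);

            FileStatus status = fs.getFileStatus(file);
            System.out.println("block size = " + status.getBlockSize()
                    + ", replication = " + status.getReplication());
            fs.close();
        }
    }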
Question: How do the NameNode and DataNodes sync up?
Answer: Each DataNode periodically sends a heartbeat and a block report to the NameNode. Receipt of a heartbeat implies that the DataNode is functioning properly. The block report contains a list of all blocks on the DataNode. The NameNode marks any DataNode without a recent heartbeat as dead and does not forward any new I/O requests to it. In this process, the NameNode does not initiate any communication; it only responds to DataNode remote procedure calls. All these communications are layered on the TCP/IP protocol.
Question: What if one of the DataNodes fails, causing the replication factor to fall below the specified value?
Answer: The NameNode will initiate replication of blocks whenever necessary. This might be necessary for many reasons: a DataNode is dead, a data block is corrupt, a DataNode is alive but its hard disk has failed making a block unavailable, or the replication factor of a file has been increased.
Question: You talked about data block corruption. How is it identified that a specific block is corrupt?
Answer: The HDFS client software, which is used for file system operations, implements a checksum on the contents of HDFS files. When a client creates a file, it computes a checksum for each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, the client can opt to retrieve that block from another DataNode that has a replica of it.
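Conceptually the check works like the sketch below, which uses a plain CRC32 from the JDK. This is only an illustration of the idea, not the actual HDFS checksum implementation.

    import java.util.zip.CRC32;

    public class ChecksumSketch {
        // Compute a CRC32 over a block of data; conceptually, the writer stores
        // this value and the reader recomputes it to detect corruption.
        static long checksum(byte[] block) {
            CRC32 crc = new CRC32();
            crc.update(block, 0, block.length);
            return crc.getValue();
        }

        public static void main(String[] args) {
            byte[] block = "some block contents".getBytes();
            long stored = checksum(block);        // computed at write time

            block[0] ^= 0x01;                     // simulate a corrupted bit
            long recomputed = checksum(block);    // computed at read time

            System.out.println(stored == recomputed
                    ? "block is intact"
                    : "block is corrupt, read another replica");
        }
    }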
Question: So it appears that there is enough protection against DataNode failure. What if the NameNode fails?
Answer: That is a single point of failure in HDFS as of now. If the NameNode fails, it requires manual intervention. Automatic restart and failover of the NameNode software to another machine is not yet supported.
Question: You mentioned that user data does not flow through the NameNode but goes directly to the DataNodes. Can you explain this process?
Answer: HDFS client software is used to create data. In fact, the HDFS client initially caches the file data in a temporary local file. When the local file accumulates more than one HDFS block worth of data, the client contacts the NameNode. The NameNode inserts the new file name into the file system hierarchy and allocates a data block for it, then responds to the client request with the identity of a DataNode and the destination data block. The client then flushes the block of data from the local temporary file to the specified DataNode. When the job is complete, the client tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store.
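From the application's point of view, all of this staging and flushing is hidden behind a simple output stream. A minimal write sketch using the Java client follows (the URI and path are assumptions):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://sandbox:8020"), conf);

            // create() returns an output stream; the client library takes care of
            // contacting the NameNode, allocating blocks and streaming to DataNodes.
            FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"));
            out.writeBytes("Hello HDFS\n");
            out.close();   // closing the stream is when the file is committed
            fs.close();
        }
    }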
Question: How does replication happen?
Answer: In the above process, the client retrieves the identity of the destination DataNodes for its data. This identity is actually a list of the DataNodes that will host a replica of the block. The client flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes it to its repository and then flushes it to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous node in the pipeline and at the same time forwarding data to the next node; the data is pipelined from one DataNode to the next.
Question: What happens when I reduce the replication factor of a file?
Answer: When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next heartbeat transfers this information to the DataNode, which then removes the corresponding blocks and frees space.
Question: Do we have any other methods to access HDFS besides the client software?
Answer: HDFS can be accessed from applications in many different ways. HDFS provides a Java API for applications to use, and a C language wrapper for this Java API is also available. In addition, a typical HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port, which allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
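For example, a minimal read through the Java API might look like the sketch below (URI and path are assumptions); the client asks the NameNode where the blocks live and then streams them directly from the DataNodes.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://sandbox:8020"), conf);

            FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);   // print the file to stdout
            in.close();
            fs.close();
        }
    }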
Question: Does HDFS support deleting and undeleting files like other file systems?
Answer: When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash, and it stays there for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace, and the deletion causes the blocks associated with the file to be freed.
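A hedged sketch of trash-aware deletion from the Java API: the raw FileSystem.delete call removes a file immediately, while the Trash helper (used by the HDFS shell) moves it into the trash directory first. The URI and path are assumptions, and trash must be enabled (fs.trash.interval > 0) for the move to happen.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class HdfsTrashDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://sandbox:8020"), conf);
            Path file = new Path("/user/demo/hello.txt");

            // Move the file into trash so it can still be restored until its
            // retention period expires; returns false if trash is disabled.
            boolean movedToTrash = Trash.moveToAppropriateTrash(fs, file, conf);
            System.out.println("moved to trash: " + movedToTrash);

            // fs.delete(file, false) would remove the file immediately, bypassing trash.
            fs.close();
        }
    }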
Question: Any special notes about HDFS?
Answer: There is a lot more to HDFS, and it is continuously improving in many areas. One last thing to understand at this stage is that HDFS is designed for large and very large files: a typical file in HDFS is gigabytes to terabytes in size, and HDFS is tuned to support such files. It is designed with the assumption that you write a file once and read it many times, and that once written, you won't modify your files. That is a typical case for data warehouses and data analytics.

Tuesday, October 29, 2013

Big Data Analytics - Getting Started

So far we have seen what big data is and what businesses can do with it. We clearly understand that businesses will mostly use big data for analytics. Now it becomes an IT problem to facilitate this analytics. Data analytics has been around for a long time, and the IT industry has developed various solutions to facilitate it. But big data poses new challenges.

Data analytics has three major areas, and they are fairly simple to understand.
1. Data collection: We have to collect data to perform any kind of analytics.
2. Data crunching: Once data is available, we perform analysis.
3. Data visualization: Finally, the analysis needs to be presented in an intuitive manner, which might be in the form of tables, charts, maps, patterns etc.
There is one more step followed in traditional data analytics, called data transformation. Data transformation is all about structuring data before data crunching. Transformation requires schema design and the development of transformation and loading routines to load data into these schemas. Data transformation has its own challenges. The biggest challenge is the cost and time needed to design and load these structures, knowing that no single structure fits all dynamic business requirements. This makes the work repetitive and reduces ROI for businesses. The need is to be able to deal with unstructured data quickly, in a more agile manner. That is the first problem. The second problem is being able to read and process large data. Reading 1 TB of data from today's realistic hard drives at 100 MB/s takes more than 2.5 hours, and even at 500 MB/s it still takes more than 30 minutes.
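A back-of-the-envelope sketch of that arithmetic (treating 1 TB as 10^12 bytes):

    public class TransferTime {
        public static void main(String[] args) {
            double terabyte = 1e12;                // 1 TB in bytes
            double[] ratesMBps = {100, 500};       // drive throughput in MB/s

            for (double rate : ratesMBps) {
                double seconds = terabyte / (rate * 1e6);
                System.out.printf("at %.0f MB/s: %.1f minutes%n", rate, seconds / 60);
            }
        }
    }

This prints roughly 167 minutes at 100 MB/s and 33 minutes at 500 MB/s, matching the figures above.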
The answer to the first problem, i.e. dealing with unstructured data, is NoSQL databases. The next layer on top of NoSQL databases is the tools and scripting languages used to work with them. There are proprietary and open source solutions available.
The answer to the second problem, i.e. read/write and computing speed, is Hadoop and MapReduce. This is a purely open source solution. Hadoop and MapReduce have emerged as the core technologies for big data. In fact, many NoSQL databases and other related technologies are developed on top of Hadoop and MapReduce. The reason is obvious: we need to deal with unstructured data, and we need speed as well.
So, what are the core technologies? The answer is Java and Linux. Hadoop is developed in Java on the Linux platform. Anyone willing to get into big data analytics should start learning Linux and Java first. There are many things you can learn and do in Hadoop without knowing Java and Linux, but you cannot go too far without them.
Other than Linux and Java, there are many things to learn in the Hadoop ecosystem; some of them are listed below.
  1. Hadoop or HDFS
  2. YARN, MapReduce and Tez
  3. HBase and Cassandra
  4. Hive & HCatalog
  5. Pig
  6. Oozie
  7. Zookeeper
  8. Ambari
  9. Sqoop
  10. Hue
  11. Mahout
  12. Lucene and Solr
  13. Flume
  14. Avro
Before we start learning individual components of the Hadoop ecosystem, it is good to get your own portable Hadoop environment. I will cover that in the next post.

Tuesday, October 22, 2013

Data Monetization and Business Metamorphosis

Data has always been the most important asset for any organization. But limits on the amount of data that can be stored, maintained and processed in a cost-effective and efficient manner have constrained businesses from taking full advantage of it. With the advent of various big data tools and technologies to overcome these limits, data has become a real game changer. Think of Google Maps: all of the map and spatial services offered by Google would be impossible if they did not have that huge amount of digital data, or were not able to process it so efficiently.

In this post, we will try to understand the highest levels of big data usage and how organizations are transforming themselves by taking advantage of their big data. Obviously, at the moment, Google is the best in this category. But let's take the example of a telecom company to understand how it sells its data to generate additional revenue, create new products, capture new markets and transform itself. We will look at one of the world leaders in the telecommunications sector, Telefónica. You can visit the link below and I am sure you will get it easily.
The rest of the post simply summarizes the information available at the above link.
This telecom major has used the fact that people keep their mobile within a few meters (less than 5 meters) of themselves. They collect mobile locations 24x7 and build a continuously updated data store. This database is transformed using proprietary logic to make it anonymous, aggregated and extrapolated, so that people's privacy is not compromised and relevance is not lost. A software product built on top of this transformed database using big data technologies performs so-called crowd analytics. Crowd analytics is highly valuable in delivering insights for industries like retail, which have a direct business correlation with footfall rate. According to the company, it helps deliver key business insights; some examples are given below.
  • How does my store performance compare to the performance of the locations in which I trade?
  • What is the best location for me to invest in opening a new store? And what format of store should I open?
  • What are the best opening times and staffing profiles for each of my stores?
  • Where are people travelling from to my stores?
  • Are there specific areas that I should target with my marketing campaigns? How should I vary my message in different parts of my catchment?
  • Where am I competing for customers?
  • What is the profile of the crowd passing through areas A, B & C where I am looking to place an advert? How does that change during the day and the week?
The company itself has started its transformation under a new division, Telefónica Digital. This is happening not only in telecom but in almost all industries and all sizes of organizations. Raise your interest and Google any specific industry; a little hard work will give you a wealth of information on how organizations are moving towards harnessing the power of big data. The good news: this is the beginning of a lot of innovation in and around data.
This post ends the series "What will business do with big data?" However, I will keep posting interesting use cases from various industries in the future.
What’s next?
We will explore "How to do it?"

Monday, October 21, 2013

Business Insight and Optimization

In this post, we will try to understand what business insight means and how we can achieve business optimization. We will take simple examples to get a feel for these concepts.
 
Business Insight Example:
An executive at an online retail company was concerned about customers abandoning their shopping carts. He wanted to quantify the opportunity and understand the root cause, so he initiated an analysis of the logs generated by the shopping cart. They noticed that a good number of customers leave their cart at some point without completing the purchase. On further analysis, they realized that the items in those carts summed to approximately 18% of their revenue. Further analysis of what customers were saying about them on social media surfaced an important insight: customers find it cheaper to buy those items from their neighboring local market. The online retailers on the portal do offer reasonable discounts, but the delivery charges on such orders make them more expensive than local market prices. Most of these orders are smaller than 750 INR. The portal offers free delivery for orders above 750 INR, and 12% of those 18% of uncompleted orders are actually above 650 INR.
As a result of this analysis, the portal decided to reduce the minimum order value for free delivery to 650 INR for a trial period, and saw a reduction in abandoned shopping carts during that period.
This is a very simple example of discovering business insight by analyzing data from various sources.
 
Business Optimization Example:
This one is the most complex to understand and to implement. Since such models need no or minimal manual intervention, accuracy is a key concern.
One most common and well-known example of such implementation is automatic filtering of spams in Gmail. Gmail has an automated system that helps detect spam by identifying viruses and suspicious messages, finding patterns across messages, and learning from what Gmail users like us commonly mark as spam.
We will consider another example.
For a retailer, it is extremely important to manage the out-of-stock rate, which means they ideally never want to be out of stock for any product. But at the same time, they have to manage the write-off rate, which means they do not want to throw away, or sell at a discounted price, expired or spoiled products. These two indicators are in conflict; over-managing one will distort the other.
The existing method at a retail chain with excellent processes might work as described below.
Every midnight, during a specified window, all sales predictions are made using a standard prediction mechanism. Based on those predictions, orders are calculated for each store and sent to the store manager. The store manager then looks at the orders and manually adjusts them by reducing or increasing quantities based on his experience and local knowledge. What is that local knowledge? It is location, season, weather, price changes, promotions, events etc. All these factors and many more are weighed on the basis of human experience to adjust the orders.
We are assuming that all these orders are generated automatically and are only manually cross-checked and adjusted; imagine the amount of work it would take to do this seriously. A big retail superstore would easily have 10K+ products. Cross-checking and adjusting them manually every day is simply not possible, so managers end up relying on the standard computer-generated orders. These orders have a direct relation with the out-of-stock rate, directly impacting sales. On the other hand, they also have a direct relation with the write-off rate, directly impacting profit.
Manual ordering at this scale is impossible. But if the order-generating system has that local knowledge, or access to the respective data and the speed to infer it, we can automate this decision-making process and optimize both sales and profit.

Friday, October 18, 2013

Business Monitoring

This post will try to build an idea of business monitoring through a use case.
The most common BI implementations focus on monitoring business operations and performance.
They achieve it through the following key tools.
  1. Reports with aggregation and drill down capabilities.
  2. Dashboards with collaboration capabilities.
  3. Key Performance Indicators (KPIs) and metrics.
  4. Alerts and notifications.
Let’s take a use case to explain it.
Cash flow from customers and overdue balances is a key indicator to monitor in any business. An average overdue balance exceeding a threshold should trigger an automatic alert to the responsible executive. The executive should then be able to get into the BI system, analyze the problem and take corrective action.
Typically, he should be able to get an aggregated summary of the problem and might be interested in answers to follow-up questions such as these.
  • What is the overdue pattern? For example, how much do we receive within 30 days, 60 days and 90 days?
  • If the 90-day slot is a problem for him, he might be interested in drilling down to customer-wise, region-wise or product-wise patterns for 90-day transactions. He should be able to analyze and get insights that narrow the problem down to a specific region, customer or perhaps product.
  • Doing all this analysis by looking at various reports, he may discover a simple reason: a specific customer is causing this alert. He has thereby identified a problem with a customer.
  • It may be a more complex discovery: his invoices for a particular product are being paid in 90 days in a specific region by the majority of customers. He has thereby identified a problem with a specific product in a particular region.
  • He might be interested in quarterly, half-yearly and annual patterns. He should be able to compare this year with the previous year, this quarter with the previous quarter, the same quarter of the previous year and so on, to understand and discover patterns indicating the root cause.
Once the root cause is understood, appropriate corrective actions can be taken. For example, if a customer is causing this issue, we may reduce the risk by holding further supplies until the overdue amount is cleared.
The above example is a clear case of business monitoring. Such implementations can also deliver, or be extended to deliver, some level of business insight.
Implementing such systems requires a mix of relational and dimensional modeling techniques to model the data. These implementations are termed OLAP (online analytical processing) systems. An OLAP implementation gives tremendous capabilities; some of them are listed below.
  1. Calculating across dimensions and across hierarchies.
  2. Analyzing trends
  3. Drilling up and down through hierarchies
  4. Rotating to change the dimensional orientation
  5. Forecasting
  6. What-if analysis.
OLAP implementations for BI deliver great results for business monitoring, and they have been reasonable at delivering business insights. But due to performance problems and limits on the amount of data such systems can handle, they have not been able to move beyond that towards business optimization, data monetization and business metamorphosis.

Keep reading; the next post will try to cover business insights through a use case.

Wednesday, October 16, 2013

What business will do with big data?

There are numerous possibilities, and it is almost impossible for anyone to prepare a list of how effectively data can be used. It varies from business to business and their needs. It is actually intellectual property of the business, to be envisioned, built and nurtured. The most important thing for businesses is to get started. A typical business will start from its existing investment in BI infrastructure and extend it against a defined roadmap. There is no standard roadmap which fits all businesses, but the most interesting one I have found was presented by Bill Schmarzo (CTO, EMC Consulting) during the O'Reilly Strata Conference 2013. Access to the video compilation of the conference could be a valuable learning resource; it can be purchased from the O'Reilly website (http://shop.oreilly.com/product/0636920029618.do) or accessed from your Safari Books account.
In his presentation, Bill defined 5 levels of BI implementation.
  1. Business Monitoring: Monitor business performance and flag areas of interest.
  2. Business Insights: Uncover relevant insights buried into data and use predictive analytics to generate recommendations and facilitate decision making into operational processes.
  3. Business Optimization: Create self-sustained analytic models that automate and optimize business processes.
  4. Data Monetization: Leverage your business data, insights and investment in your analytics IP to identify new revenue opportunities, maybe through your customers or third parties.
  5. Business Metamorphosis:  Use insights about customers, products and market trends to identify new products, services and markets.
For a business, it is vital to understand the concepts and capabilities of these implementations, and they are equally important for a big data professional to excel in his job. Big data technologies will push them to new extremes. The best way to understand them is through examples, which I will try to provide in subsequent posts.

Sunday, October 13, 2013

What is BIG DATA?

Big data is a buzzword in business. The term is used to describe a problem caused by the exponential growth and availability of data. The industry has defined this term through the three Vs.
  1. Velocity
  2. Variety
  3. Volume
Velocity: In today's world, data is being generated at a speed never seen before. Think of social media sites on the internet: millions of people generating social interaction data every second. This could easily be terabytes every hour. The growth explodes when machines start generating data instead of humans; think of sensors, tracking devices etc.

Variety: Data today comes in all types of formats. This may be structured data as in our traditional databases; unstructured text data like documents, emails and various logs; and pictures, images, audio and video.

Volume: When varieties of data come at extremely high velocity from various sources, such as years of business and financial transactions, the social interactions of millions of people every day and machine-to-machine interactions, volume is the real problem at the bottom. In earlier days storage used to be a problem, but the decreasing cost of storage now allows businesses to store all of it. This makes really BIG DATA available to businesses, which was not the case earlier.

OK, so what is the problem here? Well, there are two problems.
  1. What to do with this data?
  2. How to do it?
It is important to understand that the first problem, "What to do with this data?", is really a business problem. The second one, "How to do it?", is probably a shared problem between business and IT.

Why this blog?

  1. To express myself.
  2. To share my thoughts, knowledge and learning.
  3. To create a nice place for beginners learning BIG DATA and related technologies.