Data analytics has three major areas, and they are fairly simple to understand.
1. Data collection: We have to collect data to perform any kind of analytics.
2. Data crunching: Once data is available, we perform analysis.
3. Data visualization: Finally, the analysis needs to be presented in an intuitive manner, which might take the form of tables, charts, maps, patterns, etc.
There is one more step in traditional data analytics called data transformation. Data transformation is all about structuring data before data crunching. Transformation requires designing schemas and developing transformation and loading routines to move data into those schemas. This has its own challenges. The biggest is the cost and time to design and load these structures, knowing that no single structure fits all dynamic business requirements. That makes the work repetitive and reduces ROI for businesses. The need is to deal with unstructured data quickly, in a more agile manner. That's the first problem.

The second problem is reading and processing data at scale. Reading or writing 1 TB of data at 100 MB/s, a realistic speed for today's hard drives, takes more than 2.5 hours; even at 500 MB/s it still takes more than 30 minutes.
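Those figures are easy to sanity-check with quick arithmetic; the throwaway Java snippet below just does the unit conversion, taking 1 TB as 10^12 bytes:

```java
public class TransferTime {
    public static void main(String[] args) {
        long terabyte = 1_000_000_000_000L;               // 1 TB = 10^12 bytes
        double hddSecs = terabyte / (100.0 * 1_000_000);  // seconds at 100 MB/s
        double ssdSecs = terabyte / (500.0 * 1_000_000);  // seconds at 500 MB/s
        System.out.printf("100 MB/s: %.1f hours%n", hddSecs / 3600);  // ~2.8 hours
        System.out.printf("500 MB/s: %.1f minutes%n", ssdSecs / 60);  // ~33.3 minutes
    }
}
```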
The answer to the first problem, dealing with unstructured data, is NoSQL databases. The next layer on top of a NoSQL database is the tools and scripting languages used to work with it. Both proprietary and open source solutions are available.
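To give a taste of the schema-light style NoSQL enables, here is a minimal sketch using the Apache HBase Java client. The table name `events`, column family `d`, row keys, and values are all made up for the example, and the table is assumed to exist already; the point is that two rows in the same table can carry completely different columns, with no schema designed up front:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NoSqlSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {  // hypothetical table
            // A click event with one column...
            Put click = new Put(Bytes.toBytes("user1-20240101"));
            click.addColumn(Bytes.toBytes("d"), Bytes.toBytes("page"), Bytes.toBytes("/home"));
            // ...and a purchase event with different columns, in the same table.
            Put purchase = new Put(Bytes.toBytes("user2-20240101"));
            purchase.addColumn(Bytes.toBytes("d"), Bytes.toBytes("sku"), Bytes.toBytes("A-42"));
            purchase.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("19.99"));
            table.put(click);
            table.put(purchase);
        }
    }
}
```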
The answer to the second problem, read/write and computing speed, is Hadoop and MapReduce. These are purely open source solutions, and they have emerged as the core technology for big data. In fact, some NoSQL databases and related technologies (HBase, for example) are themselves built on top of Hadoop. The reason is obvious: we need to deal with unstructured data, and we need speed as well.
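To make the MapReduce model concrete, here is the canonical word-count job in Java, close to the example shipped with the Hadoop documentation. The mapper emits `(word, 1)` pairs, the framework shuffles them by key across the cluster, and the reducer sums the counts; input and output paths are passed on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in this node's input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word, in parallel across reducers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same code runs unchanged whether the input is a few kilobytes on one machine or terabytes spread across hundreds of nodes; that scale-out, rather than faster disks, is how Hadoop attacks the read/write problem described above.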
So, what are the core technologies? The answer is Java and Linux. Hadoop is developed in Java on the Linux platform. Anyone willing to get into big data analytics should start learning Linux and Java first. There are ways to learn and get started with Hadoop without knowing Java and Linux, but you can't go far without them.
Other than Linux and Java, there are many things to learn in the Hadoop ecosystem; some of them are listed below.
- Hadoop core and HDFS (see the sketch after this list)
- YARN, MapReduce and Tez
- HBase and Cassandra
- Hive & HCatalog
- Pig
- Oozie
- Zookeeper
- Ambari
- Sqoop
- Hue
- Mahout
- Lucene and Solr
- Flume
- Avro
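HDFS is the foundation that everything else in this list sits on. As a first taste, here is a minimal sketch that writes a file through the `org.apache.hadoop.fs.FileSystem` Java API; the path and contents are made up for the demo, and with no cluster configured the call simply falls back to the local filesystem:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath;
        // without a cluster configured, this resolves to the local filesystem.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");             // hypothetical path for the demo
            try (FSDataOutputStream out = fs.create(path, true)) {  // true = overwrite if present
                out.writeBytes("hello, hdfs\n");
            }
            System.out.println("Wrote " + fs.getFileStatus(path).getLen() + " bytes to " + path);
        }
    }
}
```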
Before we start learning the individual components of the Hadoop ecosystem, it is good to set up a portable Hadoop environment. I will cover that in the next post.