Thursday, August 14, 2014

Hadoop Introduction

Persistence is the ability of data to live beyond the life cycle of the process that created it. Software generates a lot of data as part of its interactions, and that data needs to outlive the programs themselves. In the beginning, data storage systems were file oriented. However, files did little to provide a good structure to the stored data. Another major issue with files was their proprietary formats, which made them impossible to read in the absence of specific programs. Processing file data across software versions was a further challenge, since the run-time structures that hold the data change across releases. Files also scaled poorly when it came to searching and making localized updates. To solve these problems, the industry came up with RDBMS-style systems.

An RDBMS imposes a very structured way of defining data. The rigorous structuring and the ability to apply constraints made it possible to preserve data integrity, popularly captured by the ACID properties. For the amount of data the world was dealing with at the time, RDBMS systems were sufficient and scaled well enough. This was the era when, to get indexed by a search engine, one had to submit a web link to it. People achieved higher scalability with techniques like sharding, but at the cost of increased complexity. (Developers love to make things complex.)

Then Google came along and changed the world. The amount of data it started dealing with became enormous by any sane measure. Google inverted the concept of web indexing: rather than waiting for people to submit links, it started crawling the web, downloading pages locally and indexing them. With this approach, however, came the problem of large data, now popularly termed "Big Data".

A parallel development was happening. Doug Cutting was working on a search engine infrastructure project called Nutch, but he ran into issues scaling it further. Around that time Google published a paper describing how it handled Big Data using GFS, the Google File System. Based on that paper, Doug started working on Hadoop along the lines of GFS and used the MapReduce programming model. Around the same time, Doug was employed by Yahoo and the project got the required funding and resources.

Today Hadoop is a clear winner for handling Big Data problems and is used in many enterprises that deal with Big Data. The core of Hadoop is the ability to push the processing logic to the data itself, rather than moving a lot of data around. This is different from technologies like grid computing, which move both the data and the processing logic. RDBMS systems do the opposite, pulling the data to the processing logic. That said, it does not mean we should throw away RDBMS and adopt Hadoop everywhere. The two complement each other and have their own strengths and weaknesses. A high-level comparison follows, with a minimal MapReduce sketch after the comparison.

Hadoop
  • Data integrity low
  • Non structured data
  • Linear/Horizontal scaling
  • Good for write-once, read-many workloads (initial insert, then multiple reads)

RDBMS
  • Data integrity high (ACID compliant)
  • Structured Data
  • Vertical/non-linear scaling. Techniques like sharding can provide horizontal scaling, but they come with their own challenges
  • Good for multiple updates
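
To make the "push the processing to the data" idea concrete, here is a minimal sketch of the classic MapReduce word-count job written against the Hadoop Java API. The mapper runs on the nodes that hold the input splits and emits (word, 1) pairs, and the reducer sums the counts for each word after the shuffle. The class names and the input/output HDFS paths passed on the command line are illustrative, not taken from any particular project.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs where the input split lives and emits (word, 1) for each token
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: after the shuffle, sums all the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once packaged into a jar, a job like this can be submitted with something along the lines of: hadoop jar wordcount.jar WordCount /input /output. The framework then schedules the map tasks close to the data blocks, which is exactly the data-locality idea discussed above.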
