Big data is one of the most popular terms in the software industry, and entire companies have been built around the notion. Big data is generally characterized by three attributes:
- Volume - The sheer amount of data one has to deal with. The big data landscape is full of stories of terabytes and petabytes of data, and the volume only increases with time. A quick internet search for the number of messages Facebook handles per second, or the number of search queries Google handles, gives an idea of the scale of data we are dealing with.
- Velocity - The speed at which data is created. Having a large amount of data is one thing, but if that data arrives in a short span of time, it requires ways and means to handle it on a regular basis. If a system generates terabytes or petabytes of data daily, that data has to be processed daily, otherwise it starts creating data debt: data that is still waiting to be processed. If the system cannot process the data in a timely fashion, the backlog has to be discarded because a new set of data has already arrived, which can mean lost revenue opportunities.
- Variety - The kinds of data involved. Variety can be temporal in nature, but more often it arises from data being pulled from multiple systems, each with its own format that must be handled. Variety is not itself a big data concept, but it makes the processing of big data more complex, adding yet another dimension to the problem.
Big data has become synonymous with Apache Hadoop. Though there are now other technologies for dealing with high volumes of data, Hadoop is still considered the default choice for processing big data. Hadoop can broadly be divided into two parts:
- Storage - The clusters where data is stored, such as HDFS or HBase. Since replication is supported inherently in Hadoop, data is automatically replicated across the cluster for fail-over.
- Computation framework - The infrastructure to process the data. Data is of no use if we cannot process it, and that is where the MapReduce framework comes into the picture.
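To make the computation side concrete, here is a toy, single-process sketch of the map, shuffle, and reduce phases that a MapReduce framework runs in a distributed fashion. This is only an illustration of the programming model (a word count, the canonical example), not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key (here, sum the counts)."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data needs processing", "big data needs storage"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'needs': 2, 'processing': 1, 'storage': 1}
```

In a real cluster the map tasks run in parallel on the nodes holding the data blocks, and the shuffle moves intermediate pairs across the network, but the shape of the computation is the same.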
|Hadoop Analytics Architecture|
One issue with the Hadoop style of approach is that frameworks like Hadoop are not real-time computation engines. Hadoop is good at processing large amounts of data, but it cannot deliver the results back in near real time.
Let's look at the present-day world. More and more devices are getting connected to each other. I know the next words that come to mind: the Internet of Things (IoT). Every one of us loves these fancy terms, but what do they mean? They mean a more connected world: devices and sensors talking to each other, and perhaps, in the long run, devices and sensors controlling us. That is a scary idea, but that is where the world is moving. That world will have its own good and bad parts, and only time will tell how everything evolves. So much for sermons; let's get back to technology.
Sensors are real-time things. If a sensor in my body is monitoring my blood pressure and it finds an anomaly, action needs to be taken now. I don't want that information to land as a log file in my Hadoop file system, get processed in a nightly batch, and show up the next morning as an email in my inbox about my raised blood pressure. Who knows, by that time I might literally be in a different world. I want the information processed and the value communicated in real time, though I would still want the data to go into long-term storage so that analytics can be run over the long term.
Take another example: fraud detection. If a pattern of fraud is identified, it needs to be raised in real time so that the fraud can either be stopped or at least reacted to quickly.
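The difference from the batch approach is that each event is checked as it arrives, not hours later. A minimal sketch of that idea, using a made-up rule (flag a transaction whose amount is far above the card's recent average) rather than any real fraud model:

```python
from collections import deque

WINDOW = 3         # assumed: compare against the last 3 transactions per card
SPIKE_FACTOR = 10  # assumed: flag amounts over 10x the recent average

def make_detector():
    history = {}  # card_id -> recent transaction amounts
    def check(card_id, amount):
        """Process one event as it arrives; return True if it looks anomalous.
        Contrast this with writing events to a log for a nightly batch job."""
        recent = history.setdefault(card_id, deque(maxlen=WINDOW))
        avg = sum(recent) / len(recent) if recent else None
        suspicious = avg is not None and amount > SPIKE_FACTOR * avg
        recent.append(amount)
        return suspicious
    return check

check = make_detector()
stream = [("card-1", 20), ("card-1", 25), ("card-1", 30), ("card-1", 5000)]
alerts = [tx for tx in stream if check(*tx)]
print(alerts)  # [('card-1', 5000)] - the spike is flagged the moment it arrives
```

A real stream-processing system distributes this kind of per-event logic across many machines, but the point stands: the alert is raised while stopping the fraud is still possible.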
For such real-time value retrieval, the data needs to be processed fast. This fastness is not about the velocity of the incoming data but about the urgency of processing it: beyond a certain timeline, data starts losing its real-time value. So we need to put an architecture in place that treats big data analytics and fast data processing on an equal footing. Nathan Marz has given this architecture a name: the Lambda Architecture. Let's look at the different elements of the Lambda Architecture in the next post.