Friday, August 1, 2014


With the advent of many no-sql databases, this has become a matter of debate for many organizations that which data persistence strategy to follow. In the past, in relational world the debate used to be which database to use. The relational world landscape is pretty matured with good choices in both commercial and open source world. With the advent of SQL, for many applications changing the database along the way also remained an option. Frameworks like Hibernate helped in bringing neutrality to that choice, though there are still features of databases which makes migration hard.

In the past, computing technology basically belonged to corporates and academics, who were more
structured and hence the structured databases fit well in that environment. However with the new age web or I would say social web has changed the landscape. The computers are becoming pervasive and the kind of data it is generating is not only huge in size but high in velocity and momentum. Take a look around any social site and you will hear numbers like 1 billion photos uploaded in an span of an hour kind of thing. The data comes with high velocity and it keeps coming, day and night. If you don't process it now, you may never be able to process it unless you have a good persistence strategy in place.

The big debate is of course in what situation to go for SQL and when to go for No-SQL. Is it a black and white decision or where we can start crossing the boundaries. Like any choices, this choice also has its finer aspects to take care of. There is no one wins all kind of situation. A good way is to define criterias and drive your decision based on that. I am listing down some of the points worth considering
  • You need data integrity. It's about bank transactions or sales invoices. Don't ever think of anything else other then relational database. Or I would say don't evaluate anything which cannot guarantee ACID. 
  • On the other end, is the data primarily free formed. For example logs of people surfing the web. This data is enormous in volume. So you need a storage mechanism which is free formed. Why relational database is not a good idea here? Relational databases work with constraints. And constraints results in locking tables and rows when inserting and updating the data. When there is a situation where a large volume of data has to be inserted continuously, the locking mechanisms can reduce the performance and at worst can lead to database failures in handling the data.
  • If you have well defined schemas then relational is the choice to go. However if the schema is not well defined or the schema is evolving all the time, no-SQL is a better choice. However you need to take care of other points mentioned, before taking a decision.
  • No-SQL databases can scale out easily than SQL databases. SQL databases can also scale out horizontally but you need to be careful about schema constraints. However beyond a point, no-SQL databases are easy to scale out, as there are no relationships between data buckets.
  • In terms of querying, relational databases provide more richness then no-SQLdatabases. No-SQL databases limit the richness of the querying because of the way they are stored. For example in key-value databases, it's easy to query till you are using key. However if you want to query based on other attributes, it's very costly unless you replicate the data with the other attribute as key. That's what Hadoop's Map reduce framework does. It creates the dataset around keys from different datastores and than process(reduce) them. With SQL databases, ad-hoc queries are easy to build. SQL databases are designed around data structures and No-SQL databases are often designed around querying needs.
  • Some no-sql databases like graph databases provide a sense of directionality and strength of link between two data points, which relational databases cannot handle directly. However this can always be handled by evolving the schema design.
  • The ecosystem of relational databases is quite mature in terms of vendors and their offering. No-SQL database is a rapidly evolving field.
So, which one to choose? Look at your needs and then come up with a solution. Data storage mechanism is an important decision for any software system. It's also is fine to use a combination of both but be careful about the need of expertise and maintenance required to manage the systems. Rather than SQL or no-SQL, the important criterias to establish are the need for data storage mechanism. Once you define those criterias, the choice will become easier. You might want to answer the following questions before doing a choice:
  • ACID requirement
  • Schema constraints and Data integrity
  • Velocity, Volume and Momentum of data. Velocity is about the speed, volume is about size and momentum is about if the high velocity is consistent or there are peaks.
  • Querying requirements. Are queries predetermined or there will ad-hoc querie?
  • How much and how long you want to keep the data?
  • Does query needs to return immediately or the results can be returned after a while?

No comments:

Post a Comment