Does NoSQL = NoDBA?

There's a joke doing the rounds at SQL conferences and seminars: three DBAs walk into a NoSQL bar and leave when they can't find a table. You may have heard it before, but it made Matt Hilbert sit down and ponder. What's happening? Is there a division opening up between the newly fashionable NoSQL followers and DBAs? Matt bravely enters the shiny new world of NoSQL to investigate.

What, no SQL?

NoSQL databases appear to be popping up all over the place. Google, Amazon, Facebook, LinkedIn, and Twitter all use them. New entrants in the NoSQL arena like MongoDB, CouchDB, Cassandra, and Riak are claiming converts everywhere. Articles about NoSQL (including this one) are being written, talked about, and discussed.

The figures behind it all sound impressive as well. In May 2014, the ‘NoSQL Market Forecast 2015-2020’ from Market Research Media forecast that the global NoSQL market would reach $3.4 billion in 2020, representing a compound annual growth rate of 21%.

What’s going on? The term NoSQL was first used by Carlo Strozzi in 1998 to name the file-based database he was developing. It was, in fact, a relational database but it didn’t use a SQL interface. The term re-emerged a decade later when a growing number of non-relational, distributed data stores started to appear.

Today, while there’s still no formal definition of NoSQL, the boundaries have been clearly established. NoSQL databases still don’t use SQL and the relational model has been completely abandoned. They’re designed to run on clusters. They tend to be open source. And perhaps most importantly, they forgo fixed schemas in favor of allowing any data to be stored in any record.
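To make the "any data in any record" idea concrete, here is a minimal Python sketch. It is not tied to any real NoSQL product: a plain list of dictionaries stands in for a document collection, and the names `collection` and `insert` are invented for illustration.

```python
# A plain Python list stands in for a document collection (an assumption for
# illustration; real document stores add indexing, persistence and clustering).
collection = []


def insert(document):
    """Store any dictionary as-is; no fixed schema is enforced."""
    collection.append(dict(document))


# Two records in the same collection with different fields.
insert({"name": "Ada", "email": "ada@example.com"})
insert({"name": "Grace", "twitter": "@grace", "tags": ["navy", "cobol"]})

# Queries have to cope with fields that may be missing from some records.
with_email = [doc["name"] for doc in collection if "email" in doc]
print(with_email)
```

The flexibility cuts both ways: nothing stops a record from omitting a field, so the checking a relational schema would do moves into the application code.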

Where’s the DBA?

Perhaps most interestingly, the rise of NoSQL seems to be at the expense of DBAs. In September 2013, Nick Heudecker, a Gartner Analyst, conducted an informal survey of NoSQL adopters to find out who was using NoSQL and why.

The result? Just 5.5% of respondents were DBAs. In Nick Heudecker’s words: “DBAs simply aren’t a part of the NoSQL conversation. This means DBAs, intentionally or not, are being eliminated from a rapidly growing area of information management.”

Instead, Application Developers and Software Engineers, followed by Enterprise Architects and the rather worrying job title ‘Management’, were responsible for NoSQL.

So why are companies abandoning a relational model that is widely used and understood and turning to an unstructured model that leaves the four principles of ACID – and, apparently, DBAs – in the dust?

Let’s talk Big Users

Facebook started just ten years ago but today over a billion users worldwide access it every second of every day, adding data, changing data, updating data. That’s just one example. Think Amazon, Google, Twitter and LinkedIn and they all face the same challenge: constantly scaling out – or in – to handle rapidly changing usage patterns.

And that’s the issue. Relational databases scale up rather than out and are a poor fit for easy, dynamic scalability. Scaling up requires a bigger machine. Scaling horizontally – having large clusters of machines – costs less, and new machines can be added at will.

So if a new app or website launches and grows from a few thousand users to a million users in a very short time (it does happen), a relational database would have a hard time coping. A NoSQL database, on the other hand, could scale simply by adding more machines to the cluster.
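How can a cluster grow "simply by adding more machines"? One common technique is consistent hashing, which the Dynamo paper mentioned later in this article also describes. The Python sketch below is a toy version under stated assumptions: the class `HashRing`, the node names and the choice of 100 virtual points per node are all invented for illustration, and real systems add replication and failure handling on top.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: each key belongs to the next node clockwise."""

    def __init__(self, nodes=(), points_per_node=100):
        self.points_per_node = points_per_node  # virtual points even out load
        self._ring = []                          # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.points_per_node):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        index = bisect.bisect(self._ring, (_hash(key), ""))
        if index == len(self._ring):  # wrap around the end of the ring
            index = 0
        return self._ring[index][1]


keys = [f"user:{n}" for n in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")  # scale out by one machine
after = {key: ring.node_for(key) for key in keys}

moved = sum(1 for key in keys if before[key] != after[key])
print(f"{moved} of {len(keys)} keys moved to the new machine")
```

The point of the design is that only the keys claimed by the new machine move; everything else stays where it was, so the cluster can grow without reshuffling all of its data.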

Now let’s talk Big Data

Relational databases are great for handling structured data, but Big Data isn’t just structured data. Typically, it’s large, complex data sets that are a mix of structured, unstructured, semi-structured or multi-structured data. Think text-heavy social media posts at one end of the scale, and web log data with a combination of text and visual images at the other.

And Big Data – the data that is anathema to schema-based relational databases – is going to get bigger. In its ‘Worldwide Big Data Technology and Services 2013-2017 Forecast’, IDC predicts that the Big Data technology and services market will be worth over $32 billion in 2017, and will have grown six times faster than the overall ICT market. That’s a lot of data.

So what’s the big idea behind NoSQL?

The CAP Theorem

The CAP Theorem emerged from the University of California, Berkeley, in 1998, when computer scientist Eric Brewer proposed that it was impossible for a distributed computer system to simultaneously guarantee Consistency, Availability and Partition tolerance.

[Figure: CAP Theorem diagram]

In a relational database, Consistency and Availability are regarded as essential qualities at the expense of Partition Tolerance. As a consequence, if a network partitioning event occurs, causing a loss of connection, the database can’t service all requests. This Enforced Consistency ensures that all clients always have the same view of data.

NoSQL databases take a different approach, by offering Availability and Partition Tolerance at the expense of Consistency. The idea is that network partitioning events can occur because the database is run on clusters. If one node fails, another one steps in, allowing read and write operations to continue. Updates are then propagated asynchronously so that Eventual Consistency is achieved.

For a bank where transactions have to be consistent, that just wouldn’t work. For companies like Google, it’s acceptable. It doesn’t really matter if someone in Ohio has a slightly different search result from someone in Syracuse. There are still hundreds of thousands, sometimes millions, of results.
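Eventual Consistency is easier to see than to describe. The following Python sketch is a toy simulation, not how any particular database does it: the `Cluster` class and node names are invented, and a single global version counter stands in for the more elaborate versioning schemes real systems use.

```python
from collections import deque


class Cluster:
    """Toy cluster: a write lands on one node and reaches the others later."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}  # node -> {key: (version, value)}
        self.pending = deque()  # updates not yet delivered to other nodes
        self.version = 0        # single global counter: a simplification

    def write(self, node, key, value):
        self.version += 1
        self.nodes[node][key] = (self.version, value)
        for other in self.nodes:
            if other != node:
                self.pending.append((other, key, self.version, value))

    def read(self, node, key):
        entry = self.nodes[node].get(key)
        return None if entry is None else entry[1]

    def propagate(self):
        """Deliver queued updates; on each node the highest version wins."""
        while self.pending:
            node, key, version, value = self.pending.popleft()
            current = self.nodes[node].get(key)
            if current is None or current[0] < version:
                self.nodes[node][key] = (version, value)


cluster = Cluster(["node-1", "node-2"])
cluster.write("node-1", "copies-left", 1)
cluster.propagate()                              # both nodes now agree on 1

cluster.write("node-1", "copies-left", 0)        # the last copy is sold
stale = cluster.read("node-2", "copies-left")    # node-2 still answers 1
cluster.propagate()
fresh = cluster.read("node-2", "copies-left")    # now 0 everywhere
print(stale, fresh)
```

Between the second write and the call to `propagate`, the two nodes disagree. That window is exactly what a bank cannot tolerate and a search engine can.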

To find the answer, we need to pop over to Amazon. In 2007, Amazon discovered that every 100ms of latency on the Amazon website cost 1% in sales. At the time their annual sales were around $14.7 billion. And 1% of $14.7 billion is a lot of sales to lose.
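To put a number on that, here is the arithmetic as a few lines of Python. The two inputs are the figures quoted above; the function name `sales_at_risk` and the straight-line scaling for other latencies are assumptions made for illustration, not Amazon's own model.

```python
ANNUAL_SALES = 14_700_000_000  # dollars, the figure quoted above
LOSS_PER_100_MS = 0.01         # 1% of sales per 100 ms of latency


def sales_at_risk(extra_latency_ms):
    """Linear extrapolation of the quoted figure (an assumption, not Amazon's model)."""
    return ANNUAL_SALES * LOSS_PER_100_MS * (extra_latency_ms / 100)


print(f"${sales_at_risk(100):,.0f} a year for 100 ms of latency")
```

On those figures, a tenth of a second of delay works out at about $147 million a year.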

The problem was that, according to the CAP Theorem (see sidebar), it is impossible for a distributed computer system to simultaneously offer Consistency, Availability and Partition tolerance. Only two can ever be guaranteed at the same time and, in a standard relational database model, Consistency and Availability are always chosen over Partition tolerance.

But it was the Partition tolerance that was causing the latency problem.

So Amazon did a very clever thing. In their seminal 2007 white paper, ‘Dynamo: Amazon’s Highly Available Key-value Store’, they outlined an approach for a new kind of database. One that guaranteed Availability and Partition tolerance at the expense of Consistency.

Rather than the Enforced Consistency of a traditional database where all clients always have the same view of the data, they opted for Eventual Consistency, where data would be consistent in the end. In effect, they transferred the latency issue from the time delivering the website content to users, to the data being delivered.

They figured that absolute Consistency wasn’t absolutely necessary. Even if a customer paid for a book whose last copy had been sold five seconds earlier, Amazon could simply order another copy and the customer would wait another day or so.

They resolved their latency problem, they regained those lost sales – and at the same time, their white paper inspired the development of many other NoSQL databases including Cassandra, Voldemort and Riak.

Saying yes to NoSQL

That development, interestingly, was being led by developers, not DBAs. There has always been a divide between developers and DBAs, one side wanting change, the other wanting continuity, one side wanting to move fast, the other demanding caution.

The move to NoSQL by giants like Amazon gave developers the ammunition they needed to escape the strictures of DBA thinking. They saw NoSQL as a fundamental enabler for Agile practices in contrast to long-cycle DBA-centric migration processes.

They also saw that whereas relational databases have fixed schemas that enforce strict rules, NoSQL opened up a world of storing lots of different kinds of data, from document to graph, key-value to time-series. Quickly. Simply. Now.

The figures behind the facts

All this talk about NoSQL sounds disheartening for DBAs but – and this is a big but – is NoSQL really grabbing the headlines as well as the sales?

DB-Engines, for example, is an online initiative that ranks the popularity of database management systems, and updates the list monthly. In September 2014, only two NoSQL databases, MongoDB and Cassandra, were in the top ten. The clever people at DB-Engines also provide deeper insights into the ranking, one of which is job offers. And do you know what? A lot more jobs are out there for Oracle, Microsoft SQL Server, and DB2 developers and DBAs than for those in the NoSQL world.

Similarly, the ‘2014 State of Database Technology Survey’ from InformationWeek also bucked the trend. It found that 75% of respondents were using Microsoft SQL Server and 47% were using Oracle. Compare that with 13% using Hadoop and 5% using MongoDB. Even FileMaker was ahead of newer NoSQL databases like Cassandra and Riak.

As the survey points out: “While Riak and other newer databases may be making inroads into enterprises, they don’t seem to be displacing existing RDBMSes as much as they are being used for greenfield applications.”

So what’s the deal, then?

The deal is that we need to use possibly the ugliest phrase in the technology dictionary: polyglot persistence.

Polyglot persistence simply means using different databases depending on the level and type of data persistence you require.

Many companies will keep their relational databases for applications like OLTP where the level of data persistence is, by default, very high.

At the same time, when new needs arise because of Big Users or Big Data, revolutionary apps or cloud-based offerings, they’ll think non-relational.

And in some cases, both will be chosen. A relational database, for example, is an expensive way to store data, so lots of people will use, say, Hadoop to store the raw data and then process into a relational database for fast service and interactive queries.
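That two-store pattern can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: a list of JSON strings plays the part of the raw data in Hadoop, an in-memory SQLite database plays the part of the relational store, and the event fields and the `page_views` table are invented.

```python
import json
import sqlite3
from collections import Counter

# Raw, loosely structured events. In the scenario above these would sit in a
# cheap bulk store such as Hadoop; a list of JSON strings stands in for it here.
raw_events = [
    '{"user": "ada", "action": "view", "page": "/books/42"}',
    '{"user": "grace", "action": "buy", "page": "/books/42", "price": 9.99}',
    '{"user": "ada", "action": "view", "page": "/books/7", "referrer": "search"}',
]

# Batch step: boil the raw data down to the summary the application needs.
views_per_user = Counter(
    event["user"]
    for event in map(json.loads, raw_events)
    if event["action"] == "view"
)

# Serving step: load the summary into a relational table for interactive queries.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE page_views (username TEXT PRIMARY KEY, views INTEGER NOT NULL)"
)
db.executemany("INSERT INTO page_views VALUES (?, ?)", views_per_user.items())

top = db.execute(
    "SELECT username, views FROM page_views ORDER BY views DESC LIMIT 1"
).fetchone()
print(top)
```

The raw events keep whatever shape they arrived in, while the relational table holds only the small, strictly typed summary that needs fast answers.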

So it’s actually not a question of SQL or NoSQL, it’s more one of SQL and NoSQL. Rather than shying away from NoSQL, it’s probably a good idea to start learning about it, because soon SQL and NoSQL will be co-existing in the same company.

And we’ve saved the best news for last. The US Bureau of Labor Statistics forecasts 15.1% employment growth for database administrators between 2012 and 2022.

The Bureau also advises an ‘above average’ stress level for DBAs, but you probably knew that anyway.


The Mexican standoff

So, what’s in store for the future of databases, and database developers? Will managing NoSQL databases become a necessary skill for everyone, or will traditional RDBMSs continue to be a safe bet? Or will SQL on Hadoop and other crossover options win out? I’d like to see what you think: there’s a quick poll in the blogs section, as well as a chance to win a sombrero if you leave your vote and email address in the comments section.