There are many Big Data problems whose output is also Big Data.

Splout targets these problems by offering a solution for batch-processing big datasets and deploying them atomically to a read-only cluster that can serve SQL queries over the resulting data.

Why read-only? Why Big Data views?

Database-centric systems are complex. Handling incremental state is hard and error-prone, and there are many use cases where this complexity can be avoided. Batch processing of all historical raw data is a simpler data processing paradigm with many advantages: for example, it is easy to change the entire data model or processing logic without much effort, and it is easy to scale. Nathan Marz explains this well in his Lambda Architecture idea.

Splout makes it dead-simple to serve an arbitrarily big dataset by batch-processing, indexing and deploying it, and it does so with atomicity and efficiency in mind. Instead of being updated with random inserts, Splout atomically replaces already-built database files without affecting query serving. Other tools, such as Voldemort, work this way too, but they lack a rich query language like SQL.
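As a rough illustration of this atomic-replacement idea (a minimal sketch, not Splout's actual internals; all class and method names here are hypothetical), a serving node can hold a reference to the current read-only database version and publish a freshly built one in a single swap, so in-flight queries never see a half-updated state:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of atomic version swapping on a serving node.
// "ReadOnlyDb" stands in for an already-built, immutable database file
// produced by a batch indexing job; it is not a Splout class.
public class VersionedServer {

    private final AtomicReference<ReadOnlyDb> current = new AtomicReference<>();

    // Queries always run against whatever version is current right now.
    public Result query(String sql) {
        return current.get().execute(sql);
    }

    // Deployment: the new version was built and indexed offline; publishing
    // it is a single pointer swap, so query serving is never blocked and
    // never observes partially written data.
    public void deploy(ReadOnlyDb newVersion) {
        ReadOnlyDb old = current.getAndSet(newVersion);
        if (old != null) {
            old.close(); // the previous version can safely be discarded
        }
    }

    // Minimal stand-ins so the sketch is self-contained.
    interface ReadOnlyDb {
        Result execute(String sql);
        void close();
    }

    interface Result {}
}
```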

Why SQL?

In the past few years we have witnessed a revolution in the Big Data processing and serving world. With the advent of Big Data we have slowly abandoned monolithic database-centric systems and replaced them with a wide variety of alternatives: from Hadoop-centric data flows to NoSQL systems (document-oriented databases, graph-oriented databases and so on).

The combination of Hadoop and NoSQL offers the best of both worlds. However, there are few options when it comes to serving Big Data, and the power of richer query languages such as SQL has been lost along the way. There are many problems where one would like to pre-compute certain aggregations while still being able to aggregate data further in real time.
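For example (a hypothetical schema, made up for illustration), a batch job could pre-aggregate page views per page and per day, and SQL at query time can still roll those daily counts up into an arbitrary date range on the fly. The sketch below queries a local SQLite file standing in for one deployed partition, and assumes the sqlite-jdbc driver is on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// "daily_pageviews" is a table pre-aggregated offline (one row per page and
// day). The heavy counting was done in batch, but SQL still allows further
// aggregation at serving time, e.g. a top-10 over a chosen date range.
// Table and column names are illustrative assumptions.
public class RangeAggregation {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:partition0.db")) {
            String sql =
                "SELECT page, SUM(views) AS total_views " +
                "FROM daily_pageviews " +
                "WHERE day BETWEEN ? AND ? " +
                "GROUP BY page ORDER BY total_views DESC LIMIT 10";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "2013-01-01");
                ps.setString(2, "2013-01-31");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("page") + " -> " + rs.getLong("total_views"));
                    }
                }
            }
        }
    }
}
```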

Recapitulating

Splout is, on one hand, a datastore designed from the start for atomically deploying and web-serving Hadoop-scale datasets, and, on the other hand, a rich datastore that offers a full SQL query language. This is achieved by imposing two simple constraints: read-onlyness and data partitioning. It also means that data is replaced entirely in a matter of hours, not minutes or seconds. For applications that need real-time updates, Splout is not the ultimate solution on its own, but it can be combined with a real-time layer to form a “Lambda Architecture”, as Nathan Marz explains.
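The data partitioning constraint can be pictured with a small sketch (illustrative only; the hashing scheme and class names are assumptions, not Splout's actual routing code): the batch job and the query router agree on how a partition key maps to a partition, so a query for a given key only needs to hit the single partition that holds it.

```java
// Illustrative key-based partition routing.
public class PartitionRouter {

    private final int nPartitions;

    public PartitionRouter(int nPartitions) {
        this.nPartitions = nPartitions;
    }

    // The batch indexing job and the query router must both use this
    // mapping, so each key's rows and queries land on the same partition.
    public int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), nPartitions);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(8);
        System.out.println("Key 'user_42' lives in partition " + router.partitionFor("user_42"));
    }
}
```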