Getting started with Splout SQL

Installing Splout

You need to have Java => 1.6 preinstalled. Download a release of Splout from Maven Central:

splout-*-mr2 version for Hadoop >= 2.X (YARN)
splout-*-mr1 version for Hadoop < 2.X
Others (cdh5, etc) for your particular Hadoop distribution

Other distros: We don't provide builds for every existing Hadoop distribution and version. If you need Splout working for your distro, you'll probably need to recompile Splout adapting a few dependencies. You can have a look to cdh5 profile at the different pom.xml to have a reference.

Download Splout

After that, you just need to decompress it.

Environmental variables

You'll need to set properly a few environmental variables (i.e. at ~/.bashrc)

For Hadoop >= 2.X (YARN)

SPLOUT_HADOOP_COMMON_HOME	Pointing to the folder where the hadoop-common-*.jar can be found
SPLOUT_HADOOP_HDFS_HOME	Pointing to the folder where where the hadoop-mapreduce-client-*.jar can be found
SPLOUT_HADOOP_MAPRED_HOME	Pointing to the folder where the hadoop-common-*.jar can be found
SPLOUT_HADOOP_CONF_DIR	Optional: Optionally, specify the Hadoop configuration folder (e.g. /etc/hadoop/conf). Will default to SPLOUT_HADOOP_MAPRED_HOME/conf

Example of env variables for Cloudera CDH5 using parcels

export SPLOUT_HADOOP_COMMON_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export SPLOUT_HADOOP_HDFS_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-hdfs
export SPLOUT_HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
export SPLOUT_HADOOP_CONF_DIR=/etc/hadoop/conf

For Hadoop < 2.X

HADOOP_HOME must be properly defined pointing to your Hadoop installation folder.

Running a local cluster

The best thing for getting started with Splout is launching a local server in your machine:

bin/splout-service.sh qnode start
bin/splout-service.sh dnode start

Now, if everything went fine, and assuming you had nothing running on port 4412 before, you will see a nice panel in localhost:4412 like this:

Notice how there is no tablespace loaded in the system. That is expected, we will load an example one later. In logs/ folder you will find the logs of both the QNode and the DNode service. You can shutdown the services anytime with:

bin/splout-service.sh qnode stop
bin/splout-service.sh dnode stop

Splout comes with a few examples, one of which consists of miscellaneous world facts. We will load it and play with it to get used to the system and its behavior.

First, upload to HDFS the examples folder, as it contains the input files for all examples.

hadoop fs -put examples examples

Single Table Loading:

Load the table city from the CSV file city.csv and use the country_code field to partition it into 4 partitions:

hadoop jar splout-*-hadoop*.jar simple-generate --input examples/world/city.csv --output \
  database-files --tablespace city_pby_country_code --table city --separator , --escape \\ --quotes \
  \"\"\" --nullstring \\N --schema \
  "id:int,name:string,country_code:string,district:string,population:int" --partitionby country_code \
  --partitions 4

Finally, deploy the results into Splout, using replication 2. That is, there will be 2 replicas per each partition:

hadoop jar splout-*-hadoop*.jar deploy --root database-files --tablespaces city_pby_country_code \
  --replication 2 --qnode http://localhost:4412

In this particular example, as you have only one dnode, replication will be adjusted automatically to 1. Look at the panel to see the information of the created tablespace:

http://localhost:4412/tablespace.html?tablespace=city_pby_country_code

Particularly, look at how partitions has been distributed using different country_code ranges. Now, you can launch some queries at the query console:

http://localhost:4412/console.html

Perform the following query to show all the tables in the tablespace:

Tablespace	Partition key	Query
city_pby_country_code	(empty)	SELECT * FROM sqlite_master WHERE type='table';

Now, let's find the cities with country_code

Tablespace	Partition key	Query
city_pby_country_code	JPN	SELECT * FROM city WHERE country_code = "JPN"

Note that the shard 2 (partition 2) was hit. That is because we provided "JPN" as partition key. Otherwise, a wrong partition would have been hit, and no results would have been presented. Perform a test. Repeat the query, but keep the partition key empty. You'll see no results and partition 0 being hit.

The former query can be performed just using the REST Splout SQL interface. For the former query the URL would be the following:

http://localhost:4412/api/query/city_pby_country_code?key=JPN&sql=SELECT%20*%20FROM%20city%20WHERE%20country_code%20%3D%20%22JPN%22

Multiple Tables:

Multiple tables per tablespace are allowed. But all of them must be partitioned by the same key. We are going to create two different tablespaces:

world-pby-country:

country	partitioned by country
city	partitioned by country_code
country_language	full table present in all partitions

world-pby-continent-region:

country	partitioned by continent and region
country_language	full table present in all partitions

The structure of these tablespaces, and the the specification of the input data are in the files: world-pby-country.json and world-pby-continent-region.json

Launch the following command to create both tablespaces:

hadoop jar splout-*-hadoop*.jar generate --output database-files --tablespacefile \
  examples/world/world-pby-country.json --tablespacefile examples/world/world-pby-continent-region.json

Have a look to the database-files folder:

hadoop fs -ls database-files

At this point, the tablespace files have been created, but they are not still present at Splout SQL. We have to deploy them. We can deploy both tablespaces at the same time atomically, so we can be sure that information between tablespaces will be consistent. Whichever of the following command performs the deploy of the tablespaces.

hadoop jar splout-*-hadoop*.jar deploy -root database-files -ts world-pby-continent-region -ts world-pby-country -r 2 -q http://localhost:4412

hadoop jar splout-*-hadoop*.jar deploy --config-file examples/world/deployment.json --qnode http://localhost:4412

Now both tablespaces should be present at Splout SQL. Have a look to the console.

You can also check out the following queries:

	Tablespace	Partition key	SQL Query
All Japan Languages:	world-pby-country	JPN	SELECT country_language.* FROM country, country_language WHERE country.code = country_language.country_code AND country.code = "JPN"
Distinct districs in Japan:	world-pby-country	JPN	SELECT country.name, count(distinct district) as num_districts FROM country, city WHERE country.code = "JPN" and country.code = city.country_code
Population of Western Europe:	world-pby-continent-region	EuropeWestern Europe	SELECT continent,region,sum(population) total_population FROM country WHERE continent = "Europe" AND region = "Western Europe";
Biggest countries on Central Africa:	world-pby-continent-region	AfricaCentral Africa	SELECT name, surface_area FROM country WHERE continent = "Africa" AND region = "Central Africa" ORDER BY surface_area DESC;
Most talked languages on Central Africa:	world-pby-continent-region	AfricaCentral Africa	SELECT language, sum((percentage/100)*population) as people FROM country, country_language WHERE country.code = country_language.country_code AND continent = "Africa" AND region = "Central Africa" GROUP BY language ORDER BY people DESC;

Note that all the queries presented before are compatible with the partitioning decided for the tablespace. Otherwise they would have not worked properly. Also, be careful to properly define the needed indexes in order to answer queries fast.