How do you start with SQL in Spark? After reading this article you will be able to run your first SQL queries against CSV files using Databricks, a managed solution that takes away the maintenance overhead.
SQL is the obvious choice for Spark
Behold: starting with major version 3.0, released in June 2020, Apache Spark offers an ANSI SQL mode, which makes the engine even more SQL-friendly. SQL has always played an important role in Spark, and this is another step in that direction.
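If you want to try the stricter ANSI behaviour yourself, it can be switched on with a session configuration. The `spark.sql.ansi.enabled` flag was introduced in Spark 3.0 and defaults to `false`, so this is strictly opt-in; a minimal sketch:

```sql
-- Opt in to ANSI SQL mode for the current session (Spark 3.0+).
SET spark.sql.ansi.enabled = true;

-- Under ANSI mode, an invalid cast raises an error
-- instead of silently returning NULL.
SELECT CAST('abc' AS INT);
```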
What is the usual blocker before you start learning a useful library or tool? You need to learn its API and its way of dealing with data. For example, if you want to do data wrangling with Pandas, you had better learn the Pandas API: how to transform a DataFrame and so on. Spark differs from other frameworks in that it introduces full-blown parallel processing. You may call it ‘a better Hadoop’ (of course, Hadoop believers are going to kill me now…). If you have ever struggled to analyse large amounts of data, especially unstructured data requiring preprocessing, Spark is a go-to tool. You could, of course, learn its RDD or DataFrame API in the language of your choice: Python, Scala, or R. But no worries, SQL is also here!
SQL is an important part of the Spark framework. To start playing with it there is no need to buy heavy and costly hardware, nor any costly licences: there is a very friendly way to try it with an online Spark cluster. All you need to do is go to https://databricks.com/try-databricks and sign up for a free Spark engine. Then go to https://community.cloud.databricks.com/ and enjoy your Spark environment: Databricks.
Load and query data with Spark SQL
Disclaimer: the dataset used here is not truly big data, but it is enough to showcase the functionality of Spark SQL. With a dataset of this size your calculations will be fast, and you can easily reproduce everything yourself.
Let us give it a try and import a dataset. You may use any dataset you own; we will use an open one that everyone can enjoy. Living through COVID-19, it is surely safer to ride a bike than to take public transportation. It is better to own your bike, but otherwise you can hire one. In my city, Wroclaw (Poland), we have a bike-sharing company, with archive data (March 2019 to January 2020) available here. Maybe we can ask our bike data some questions, such as: “What are the most popular stations?” We are going to run some queries on these historical CSV files using Spark. Ready? Buckle up!
To read our data, we need a cluster. You can create one for free and name it as you like.
Then you need to wait a few moments for the cluster to be created and started. If you don’t see any clusters on your list at first, wait a few seconds and refresh the page. You should then see your cluster in the Pending state.
Once it has been created and started up, you may continue. Loading the data is very simple through the Databricks UI: just open it and go to the Import data section shown here.
After choosing the correct item, the Create New Table dialogue is displayed. This is where you actually put your data.
All you need to do is fetch the data from the link mentioned above and add your files here. They will be uploaded to the Databricks file system one after another. Just drop them there or find them via the browse link. Once they have loaded correctly, you will see the upload results.
Let us choose the Create Table with UI option. Select your newly created cluster and click Preview Table. Rename the table to “rides” and tick the checkboxes First row is header and Infer schema. Now it’s time to Create Table. After a few seconds the table is ready to be used.
Now, let’s query our table! You can do so simply by writing SQL queries in a SQL notebook. Let us first check that the table is there. Go to the Databricks home page and choose the Create a Blank Notebook command. We chose the name `rides_analyze`, but feel free to use any notebook name you like 🙂
In a notebook cell, write the following SQL command:
```sql
SELECT count(*) FROM rides
```
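If you happened to create a Python notebook instead of a SQL one, you can still run the same query: Databricks notebooks let you switch the language of a single cell with a magic command. A minimal sketch, assuming a Python notebook attached to the same cluster:

```sql
%sql
-- The %sql magic runs this one cell as SQL inside a Python notebook.
SELECT count(*) FROM rides
```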
You should receive a count of 39926 if you used the same dataset as we did. What about our question: which station is the most popular? It turns out that most people hired bikes outside of an official docking station (Polish “Poza oficjalna stacja”). Apart from that, the Main Railway Station is the most popular rental spot. Such a result could be expected.
```sql
SELECT rental_station, count(*) AS cnt
FROM rides
GROUP BY rental_station
ORDER BY cnt DESC
```
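If you would rather ignore the rides that started outside any official station, a `WHERE` clause filters them out before aggregating. This is a sketch built on the query above: the `rental_station` column name and the Polish label are taken from this article, so adjust them if your CSV headers or values differ.

```sql
-- Top 10 official stations, excluding rides started outside any station.
SELECT rental_station, count(*) AS cnt
FROM rides
WHERE rental_station <> 'Poza oficjalna stacja'
GROUP BY rental_station
ORDER BY cnt DESC
LIMIT 10
```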
I hope that after digesting this article you will have a good feeling for how to start with Spark. There really is no need to install anything: just sign up for the Community Edition of Databricks, upload some data, and start querying. Obviously, every tool has its own API, and the problem is that you have to learn yet another specific language. Spark’s reuse of SQL is therefore a definite speed-up, which makes learning it easier.