Introduction to Apache Spark — Part 2

Ch Ravi Raj
5 min read · Jun 25, 2021


While big data tools have made a huge impact on the world, processing data from billions of devices takes a deeper understanding of these tools to truly leverage their potential. Maximizing efficiency in terms of response speed, storage, and stability is of utmost importance, so understanding the nuances of these tools allows users to deploy better solutions at a larger scale. To further this goal and share a little of what I know, the previous article covered the architecture of Apache Spark. Now that we know the architecture, we can proceed to the data processing side of Apache Spark.

What is SparkSQL?

SparkSQL is Spark's module for querying databases and other structured data sources through the Spark engine. Its syntax closely resembles standard SQL and shares most of its keywords. The question then arises: why SparkSQL at all?

  1. Traditional SQL engines are built around relational databases and cannot easily accommodate alternative data sources.
  2. They also lack the distributed optimizations required to speed up queries over very large datasets.

SparkSQL handles a variety of data sources, including files stored on HDFS and other distributed storage. It also has an active optimizer (Catalyst) to improve its response times. While handling data of such volumes, this optimizer plays a key role in real-time analytics.
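
For example, SparkSQL can query files sitting in distributed storage directly, without loading them into a database first. A minimal sketch; the paths below are placeholders, not real datasets:

-- Query Parquet and JSON files in place (hypothetical paths)
SELECT * FROM parquet.`/mnt/data/events.parquet` LIMIT 10;
SELECT * FROM json.`/mnt/data/logs.json` LIMIT 10;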

The optimizer performs 2 main tasks:

  • Fail fast on type or naming errors. Type matching and a lookup of names in the data catalog happen before the query starts executing on the data. If such an error is found, execution stops immediately and the error message points you to the exact source of the problem.
  • Reorganizing query logic. Spark takes your logic (the query), breaks it down into a sequence of logical plans, and then looks for the most efficient way to accomplish the task (you can inspect these plans yourself, as shown in the EXPLAIN sketch below).
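
To see what the optimizer produces, Spark SQL exposes the EXPLAIN command. A minimal sketch, where some_table is a hypothetical table name:

-- EXTENDED prints the parsed, analyzed and optimized logical plans plus the physical plan
EXPLAIN EXTENDED
SELECT first_name, count(*) FROM some_table GROUP BY first_name;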

Optimization of queries in Spark

Assume we run a query that joins the contents of two tables, a and b, on first_name and selects the people born after January 1st, 2000:

CREATE OR REPLACE TEMPORARY VIEW joined_data AS
SELECT a.first_name, to_date(a.birthday) AS birthday
FROM a
JOIN b
ON a.first_name = b.first_name;

CREATE OR REPLACE TEMPORARY VIEW filtered_data AS
SELECT first_name, count(first_name)
FROM joined_data
WHERE birthday >= '2000-01-01'
GROUP BY first_name, birthday;

The query is deliberately complex (for demonstration purposes) and, when executed, undergoes the following optimization:

Image by Databricks about query optimization.

Since the join is the more expensive operation, the optimizer filters the data first and then joins it, even though the queries were written in the opposite order. This rewrite is commonly known as predicate pushdown.
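
Written out by hand, the optimized logic looks roughly like the query below. This is only a sketch of the idea, not the actual plan Spark generates:

-- Push the filter below the join, so only rows born after 2000-01-01
-- are shuffled and joined
CREATE OR REPLACE TEMPORARY VIEW filtered_then_joined AS
SELECT a.first_name, count(a.first_name)
FROM (SELECT first_name, to_date(birthday) AS birthday
      FROM a
      WHERE to_date(birthday) >= '2000-01-01') a
JOIN b
  ON a.first_name = b.first_name
GROUP BY a.first_name, a.birthday;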

Image by Databricks, describing the working of the optimizer.

Setting up Spark using Databricks

Setting up a notebook on the Databricks Community Edition to run SQL queries is one of the simplest ways to use the power of Apache Spark without any compute overhead on your personal machine.

After creating an account on the Databricks Community Edition, the first step is to create a cluster to run our computations.

Step 1: Login and Cluster creation
After creating an account and logging in, hover over the menu on the left and you will find options like the ones below. Click on CREATE to start your personal cluster. (This is needed only the first time; on subsequent logins you can reuse the same cluster.)

Step 1 — Login and Select

Step 2: Cluster Specification
Assign a cluster name, pick the Databricks runtime version, and click on Create Cluster. (It takes a while for the cluster to be created.)

Step 2 — Create Cluster

Once the cluster is created, clicking on the COMPUTE option shown in Step 1 will display the running instance of your cluster (as shown below).

Step 2 — Create Cluster

Step 3: Data import
Upload your local data via the UI. This step can also be done directly through SQL commands on files stored in cloud storage; however, for beginners with local data, the UI is the easiest method (see the sketch after the column description below). For the sake of the demo, we shall use the Boston Pricing dataset, whose columns are as follows:

Boston Pricing Dataset Column Description
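
For reference, here is what the SQL-only route could look like once the file already sits in DBFS or cloud storage. The path and table name below are assumptions, not values from this demo:

-- Create a table straight from a CSV file (hypothetical path and table name)
CREATE TABLE IF NOT EXISTS boston_pricing
USING CSV
OPTIONS (path '/FileStore/tables/boston_pricing.csv',
         header 'true',
         inferSchema 'true');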

To start with the data upload process, click on Import and Explore data.

Step 3 — Import Data

A new page opens up like the screenshot below. Drag and drop your data file, select CREATE TABLE WITH UI, and select your active cluster.

Step 3 — Import Data

After selecting the cluster, you will find options to describe the dataset, define the data types of the columns, and preview the data that has been uploaded. You can set the table name and various other parameters as shown. Finally, click on CREATE TABLE.

Step 3 — Import Data

You can confirm that the dataset is present by clicking on the DATA option shown in the Step 1 picture, or by running a couple of SQL commands from a notebook (see the sketch below the screenshot).

Step 3 — Imported Data
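
The same check can also be done from a notebook cell. The table name here is whatever you chose during the upload:

SHOW TABLES;
DESCRIBE TABLE boston_pricing;  -- replace with the table name you set in Step 3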

Step 4: Notebook creation
Once your cluster is ready and your data is uploaded, you can go ahead and create your first notebook to try out Spark and its functionalities. Hover back to the CREATE option, as seen in Step 3, and you will find another option: creating a notebook.

Step 4 — Create Notebook

Step 5: Running the queries
Clicking on CREATE, as in the previous picture, will start your notebook instance. You can now run SQL commands on the dataset, as seen in the next two screenshots and the example queries below.

Step 5 — Running SQL queries
Step 5 — Running SQL queries
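
As a starting point, here are a couple of example queries. The table name boston_pricing and the column names rm (average number of rooms) and medv (median home value) are assumptions based on the standard Boston housing columns, so adjust them to match your upload:

-- Peek at the data
SELECT * FROM boston_pricing LIMIT 10;

-- Average median home value, bucketed by the rounded number of rooms
SELECT round(rm) AS rooms, avg(medv) AS avg_value
FROM boston_pricing
GROUP BY round(rm)
ORDER BY rooms;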

In this manner, we set up our notebooks on Spark via the UI. You can now start writing your queries using SparkSQL syntax (it has a few differences from MySQL). All available functions and query syntax for SparkSQL can be found here.

Thank you for reading this blog. Please do leave a clap if you like the article. If there are any suggestions on how to improve the quality of my work, I’d like to hear them. :)
Suggestions on topics to cover in the next blog are also welcome. :D
