Getting Started With Apache Spark Using Databricks Community Edition

Hassan Abedi
6 min read · Jan 17, 2022
The remaining ruins of the city of Babylon (source: Wikipedia)

This is a mini-tutorial that aims to help the reader get started with Databricks Community Edition for processing data on an Apache Spark cluster. It walks through the steps you need to take to become familiar with the Databricks Community Edition environment.

Step 1: Getting started

The first thing to do is go to the webpage for Databricks Community Edition and create a user account (if you do not already have one).

Databricks Community Edition’s login page

When you have finished creating your account, log into Databricks Community Edition.

Databricks Community Edition’s environment after you have logged into it

Now go to https://github.com/habedi/datasets and download the contents of the folder datascience.stackexchange.com to your machine. It includes the data that we are going to use here.

Step 2: Creating a Spark cluster

Now, we need to create an Apache Spark cluster to run our code on. To do so, click on “Compute” in the dark green sidebar on the left side of the screen.

After pressing the “Compute” button, this page will show up

Now press “Create Cluster” to create your cluster. In Databricks Community Edition, you have to set a name for your cluster and select its Databricks runtime. Note that more low-level configuration options are available, but I will not go through their details in this tutorial. The different Databricks runtimes mainly differ in the Spark version they ship with; I chose the default runtime (9.1 LTS), which comes with Spark 3.1.2, and named the cluster “MyCluster”.

Creating an Apache Spark cluster in Databricks Community Edition named “MyCluster”
The cluster has been created and is running and ready to use
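As an aside, once a notebook is attached to the cluster (which we do in Step 5), you can confirm which Spark version the chosen runtime ships with. The snippet below is just a quick sanity check and uses the spark session object that Databricks notebooks provide out of the box:

```python
# Run in a notebook attached to the cluster (see Step 5); the version printed
# should match the runtime you picked, e.g. 3.1.2 for runtime 9.1 LTS.
print(spark.version)
```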

Step 3: Installing libraries on the cluster

Usually, at the start, we have to install some additional libraries and packages on our cluster to do something useful with our data. Imagine we want to do some graph analytics on our Spark cluster in Python. To do that, we have to install GraphFrames, a graph processing library for Apache Spark. To install libraries on our cluster, click on the cluster’s name in the “Compute” section and then click on “Libraries”. Libraries can be installed from different sources; here, we use the Maven repository to download and install the artefacts related to GraphFrames on our cluster. Apache Spark has Python, R, and Java (and Scala) APIs, and depending on the API we are using, we can install libraries for the corresponding programming language and environment.

We can install libraries on our Spark cluster by clicking on the name of our cluster on the page that opens after pressing “Compute”; then, by pressing “Install New” under the “Libraries” tab, we can open the page for installing a library
After pressing the “Install New” button, we can choose the repository of the library we want to install and install it on our cluster
Here, we search for the GraphFrames library and pick the newest version that matches the Spark version installed on our cluster, which in our case is GraphFrames 0.8.2 for Spark 3.1.x
Finally, after finding and selecting a library, we can press the “Install” button to install the library on the cluster
As you can see, GraphFrames is installed on MyCluster and is ready to use
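To verify the installation later (once a notebook is attached to the cluster in Step 5), a tiny GraphFrames example along these lines should run without import errors. The vertices and edges below are made-up toy data for illustration, not part of the Stack Exchange dataset we will use:

```python
from graphframes import GraphFrame

# Toy vertex and edge DataFrames; GraphFrames expects an "id" column for
# vertices and "src"/"dst" columns for edges.
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"]
)
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"]
)

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # number of incoming edges per vertex
```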

Now that the cluster is created and running, we can upload the data into the Databricks Community Edition environment.

Step 4: Uploading the data

In many real-world scenarios, we have the data in CSV format and want to do some analytics or train a machine learning model on it. Here, let’s assume we want to upload four compressed CSV files. (These are the files you downloaded to your machine in Step 1 from https://github.com/habedi/datasets.) To move the files to our cluster, click on the “Data” icon in the sidebar on the left side of the screen to open the user interface for managing data.

The interface for data management can be accessed by clicking on “Data” in the dark green sidebar on the left side of the screen

Now press the “Create Table” button to upload your data to DBFS. DBFS (the Databricks File System) is the file system that Databricks Community Edition uses to store and access data. In general, you can connect to different data providers to get your data, but in this tutorial, we assume that you have the data on your machine and need to move it to DBFS.

To upload the data to DBFS, choose the “Upload File” button, select your files, and press “Open”.

Uploading files from your machine to DBFS
As you can see, the files have been successfully uploaded from your machine to DBFS and are available under the path “/FileStore/tables/NAME_OF_FILE”
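To double-check that the upload worked, you can list the contents of that directory from a notebook; dbutils is available out of the box in Databricks notebooks. The exact file names will depend on what you uploaded:

```python
# List the files that were uploaded to DBFS under /FileStore/tables/.
display(dbutils.fs.ls("/FileStore/tables/"))
```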

Optionally, at this stage, we can select an uploaded CSV file and turn it into a Hive table. The main benefit of doing so is that we can run Spark SQL queries directly over a table stored as a Hive table on DBFS. But this is not a requirement, and we will still be able to load the compressed CSV files that we have already uploaded to DBFS into Spark DataFrames.

An example table with its schema inferred using “Create Table with UI”
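If you did register one of the files as a table through the UI, you can query it with Spark SQL from a notebook. The table name below is a placeholder; use whatever name you gave the table when creating it:

```python
# "my_table" is a placeholder; replace it with the name you chose in
# the "Create Table with UI" step.
df = spark.sql("SELECT * FROM my_table LIMIT 10")
df.show()
```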

Step 5: Processing the data on the cluster

At this stage, we are ready to start the real work: doing something interesting with the data we have uploaded to the cluster. To do so, we open a notebook and choose the API we want to use to work with the data; by API, I mean the default programming language of the notebook. Press “Workspace” to open the side panel for creating a notebook.

We can create a new notebook or access an existing one under “Workspace”
We can create a notebook with the default programming language set to Python to process our data on the MyCluster Spark cluster

Now copy the code below into the notebook you have just created and run it by pressing the “Run All” button at the top.
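The code loads the uploaded files into Spark DataFrames and takes a first look at them. Here is a minimal sketch of that idea; the file name users.csv.gz is a placeholder, so substitute the path of one of the files you uploaded in Step 4 (Spark reads gzip-compressed CSV files directly):

```python
# Load one of the compressed CSV files uploaded to DBFS into a Spark DataFrame.
# The file name below is a placeholder; use the actual path shown after your upload.
df = (
    spark.read
    .option("header", "true")       # the first line holds the column names
    .option("inferSchema", "true")  # let Spark guess the column types
    .csv("/FileStore/tables/users.csv.gz")
)

df.printSchema()       # inspect the inferred schema
print(df.count())      # number of rows in the DataFrame
display(df.limit(10))  # preview the first few rows
```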

When you run the above code in the notebook that you have just created, you should be able to see something like this in the output:

Congratulations, you have a running Spark cluster ready to use to solve your Big Data and Data Science problems.

A final thing: remember that Databricks Community Edition is a free service, so it naturally comes with some limitations. Apart from the limits on your cluster’s resources, such as the number of CPUs and the amount of RAM, keep in mind that a cluster is shut down after two hours of inactivity. When a cluster is shut down due to inactivity, you will need to delete it and create a new one. This can be inconvenient, but it does not affect the data you have stored on DBFS; usually, the only significant nuisances are the hassle of creating a new cluster and installing the appropriate libraries on it again.

Moreover, I suggest reading the following books to get familiar with Apache Spark and its applications:

  1. Learning Spark: Lightning-Fast Data Analytics; https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
  2. Learning Apache Spark with Python; https://runawayhorse001.github.io/LearningApacheSpark/
  3. Spark: The Definitive Guide; https://pages.databricks.com/definitive-guide-spark.html
