How to set up a very simple Apache Spark cluster on Windows 10

Hassan Abedi
5 min read · Feb 12, 2020

Introduction

This is a step-by-step guide that aims to help the reader install Apache Spark on their Windows machine. The Spark installation shown in this tutorial is the bare-minimum, single-instance cluster installation that you need to get a useful Spark development environment. You can use this installation to develop, test, and debug your Spark applications in languages like Java, Scala, and Python. The instructions in this tutorial are mainly for setting up a Spark environment inside Ubuntu; still, the main steps will apply to other Unix(-like) OSes such as macOS without much modification. We assume throughout the tutorial that your Windows machine is connected to the internet.

Currently, Apache Spark is the most popular open-source cluster-computing framework

Step 1: Installing the Ubuntu 18.04 app

Windows 10 comes with a feature that allows its users to have a fully functional Linux environment within Windows. To install Ubuntu 18.04 LTS, go to the Microsoft Store and search for the Ubuntu 18.04 LTS app there. Before you can install Ubuntu, you must first make sure that the Windows Subsystem for Linux (WSL) feature is enabled on your machine. You can read more about how to install the Ubuntu app on Windows 10 here.

The Ubuntu 18.04 LTS app in the Microsoft Store

Step 2: Setting a username and a password for your new Ubuntu environment

At this point, we assume that you have already successfully installed the Ubuntu 18.04 LTS app on your machine. When you start the Ubuntu app for the first time, it asks for the username and password you want to use when working in the Ubuntu environment. For the sake of simplicity, we used sparkusr for both the username and the password here.

Once you have installed the Ubuntu app, you can start it for the first time by pressing the Launch button
When you start the Ubuntu app for the first time, you need to provide a username and a password; in our setup we used “sparkusr” for both

Step 3: Setting up your Ubuntu environment

At this point, you need to configure your Ubuntu environment properly by installing the software packages you will need to start Apache Spark’s services inside your newly installed Ubuntu. You can install these packages by running the following commands in your terminal emulator:

We need to install quite a few packages when we start our Ubuntu environment for the first time

One important thing to remember is that we need the Java 8 Runtime Environment to be able to run the current stable version of Spark (version 2.4.x).
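As a sketch, the package installation for a fresh Ubuntu 18.04 environment could look like the following; the Java 8 JDK is the only hard requirement, while wget and htop are assumptions based on the download and process-checking steps used later in this tutorial:

# Refresh the package index and upgrade the packages that are already installed
sudo apt update && sudo apt -y upgrade
# Install the Java 8 JDK (needed by Spark 2.4.x) plus wget and htop, which are used later in this tutorial
sudo apt -y install openjdk-8-jdk wget htop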

Step 4: Installing Apache Spark

In this step, you need to go to Apache Spark’s website and download Spark’s pre-built binary package. At the time of writing this tutorial, the latest stable version of Spark was 2.4.5, which you can download from here. To download and install Spark, you can execute the following commands in your terminal:

First, you need to download and extract Spark’s pre-built binaries; after that, by setting the correct JAVA_HOME and SPARK_HOME environment variables, you get a working Spark installation!
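A minimal sketch of these commands, assuming the spark-2.4.5-bin-hadoop2.7 build from the Apache archive and the default OpenJDK 8 installation path on Ubuntu 18.04 (adjust the version, mirror, and paths to match your setup):

# Download and extract the pre-built Spark 2.4.5 binaries into the home directory
cd ~
wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
tar -xzf spark-2.4.5-bin-hadoop2.7.tgz
# Point JAVA_HOME and SPARK_HOME at the right directories and put Spark's scripts on the PATH
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
echo 'export SPARK_HOME=$HOME/spark-2.4.5-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
source ~/.bashrc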

At this point, if there was no problem in your setup, both the SPARK_HOME and JAVA_HOME environment variables should be set correctly and you are ready to go! You can check this by running echo $JAVA_HOME and echo $SPARK_HOME in your terminal; the output should look like the one in the picture below:

If JAVA_HOME and SPARK_HOME are correctly set up, you should see their values printed in the terminal

Step 5: Starting Spark services

At this point, you can start your Spark cluster services by executing the following commands in your terminal:

Running these commands will start Spark’s master and slave (worker) services
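A sketch of the start-up commands for the standalone cluster set up above; the spark://localhost:7077 master URL is an assumption, and the actual URL is printed in the master’s log and shown on the web UI:

# Start the master service (its logs end up under $SPARK_HOME/logs)
$SPARK_HOME/sbin/start-master.sh
# Start a worker (slave) service and attach it to the master
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077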

You can check various indicators for your Spark cluster (e.g., the number of workers and the available resources in your cluster) by going to Spark’s web UI dashboard. By default, the web UI is available at http://localhost:8080/.

The Spark master service is running, but there are no available workers to execute a Spark application in the cluster
When you have started the worker service successfully, it shows up as a worker in your cluster; it attaches itself to the master, ready to execute Spark jobs

You can start Spark’s shell by running spark-shell in your terminal.

In the Spark shell environment, you can run Spark SQL and Scala code that will be executed on your cluster
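For example, to attach the shell to the standalone master started above instead of the default local mode, you could run something like this (again assuming the spark://localhost:7077 master URL):

# Start an interactive Spark shell connected to the standalone master
spark-shell --master spark://localhost:7077
# Inside the shell, a quick sanity check such as spark.range(1000).count() should return 1000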

When you have finished your work with the cluster, you need to stop the Spark services. The general pattern is that you first stop the worker (slave) service and then stop the master. The easiest way to shut down our simple cluster with our current settings is to execute stop-slave.sh and then stop-master.sh in the terminal. If everything went well, after you shut down the cluster you will see no Java processes belonging to your Spark cluster. You can check this by running the htop command in the terminal and inspecting the output; there should be no Spark-related Java processes running inside your Ubuntu environment.
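A sketch of the shutdown and verification steps under the settings used in this tutorial (ps with grep is shown here as a non-interactive alternative to htop):

# Stop the worker first, then the master
$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh
# Verify that no Spark-related Java processes are left running
ps aux | grep -i spark | grep -v grep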

When you have shut down your cluster, or have not started it in the first place, you should see no Spark-related processes running within your Ubuntu environment

Conclusions

In this brief tutorial, we described the general steps to install and set up a very rudimentary Apache Spark cluster on a machine running the Microsoft Windows 10 operating system. Apache Spark is a sophisticated open-source cluster-computing framework with many different modes of operation and settings. What we described here is suitable for people who want to get familiar with Spark for the first time, or for people who wish to develop Spark applications on their own machines. The resulting cluster is meant as a mock cluster for learning purposes rather than for real-world use. Nevertheless, working with something as exciting and useful as Apache Spark can always be enjoyable for everybody :0).

Additional resources:

With the increasing popularity of machine learning nowadays, it is natural to expect that people will use Apache Spark in their machine learning pipelines. So, here is a nice little Spark tutorial for those who want to deploy their machine learning models onto a Spark cluster: https://neptune.ai/blog/apache-spark-tutorial.
