How to set up a very simple Apache Spark cluster on your Windows 10 machine
Introduction
This is a step-by-step guide that aims to help the reader install Apache Spark on their Windows machine. The Spark installation shown in this tutorial is the typical bare-minimum single-instance cluster installation you need to get a useful Spark development environment. Typically, you could use this new Spark installation to develop, test and debug your Spark applications in languages like Java, Scala, and Python. Since the Spark environment itself is set up inside Ubuntu, the main steps taken in this tutorial also apply to other Unix(-like) OSes such as macOS without many modifications. We assume throughout the tutorial that your Windows machine is connected to the internet.

Step 1: Installing the Ubuntu 18.04 app
Windows 10 comes with a feature that allows its users to have a fully functional Linux environment within Windows. To install Ubuntu 18.04 LTS, you can go to the Microsoft Store and search for the Ubuntu 18.04 LTS app there. Before you can install Ubuntu, you must first make sure that the Windows Subsystem for Linux (WSL) feature is enabled on your machine. You can read more about how to install the Ubuntu app on the Windows 10 operating system here.

Step 2: Setting a username and a password for your new Ubuntu environment
At this point, we assume that you have already successfully installed the Ubuntu 18.04 LTS app on your machine. When you start the Ubuntu app for the first time, it asks for the username and password you want to use when working in the Ubuntu environment. For the sake of simplicity, we used sparkusr for both the username and the password here.


Step 3: Setting up your Ubuntu environment
At this point, you need to properly configure your Ubuntu environment by installing the software packages required to start Apache Spark’s services inside your newly installed Ubuntu. One important thing to remember is that the current stable version of Spark (version 2.4.x) needs a Java 8 Runtime Environment, so Java 8 is the main package you have to install. You can install the required packages by running commands like the ones below in your terminal emulator.
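As a rough sketch, the following commands install OpenJDK 8 (the Java version Spark 2.4.x expects) together with a couple of small utilities used later in this tutorial. The package names assume Ubuntu 18.04’s default apt repositories:

sudo apt update
# Spark 2.4.x requires Java 8
sudo apt install -y openjdk-8-jdk
# wget is used to download Spark, htop to inspect running processes later
sudo apt install -y wget htop
# verify the installation; the reported version should start with 1.8
java -version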
Step 4: Installing Apache Spark
At this step, you need to go to Apache Spark’s website and download Spark’s pre-built binary package. At the time of writing this tutorial, the latest stable version of Spark was 2.4.5, which you can download from here. To download and install Spark, you can execute the following commands in your terminal:
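As one possible way to do this, the sketch below downloads the Spark 2.4.5 package pre-built for Hadoop 2.7 from the Apache archive, unpacks it into your home directory, and appends the JAVA_HOME, SPARK_HOME and PATH settings to your ~/.bashrc. The JAVA_HOME path assumes the default location of the OpenJDK 8 package on 64-bit Ubuntu; adjust the paths and the download URL if your setup differs:

cd ~
# download and unpack the pre-built Spark 2.4.5 binaries
wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
tar -xzf spark-2.4.5-bin-hadoop2.7.tgz
# make the environment variables available in every new shell session
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
echo 'export SPARK_HOME=$HOME/spark-2.4.5-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
# reload the configuration in the current shell
source ~/.bashrc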
At this point, if there was no problem in your setup, both the SPARK_HOME and JAVA_HOME environment variables should be set up correctly and you are ready to go! You can check this by running echo $JAVA_HOME and echo $SPARK_HOME in your terminal; the output should show the paths where Java and Spark are installed.
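For example, with the paths used in the sketch above, the check would look roughly like this (your actual paths depend on where you installed Java and unpacked Spark):

echo $JAVA_HOME
# /usr/lib/jvm/java-8-openjdk-amd64
echo $SPARK_HOME
# /home/sparkusr/spark-2.4.5-bin-hadoop2.7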

Step 5: Starting Spark services
At this point, you can start your Spark cluster services by executing the following commands in your terminal:
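A minimal sketch, assuming $SPARK_HOME/sbin is on your PATH as configured earlier: start the master first, then attach a single worker to it. The spark://localhost:7077 address is the usual default for a local standalone master; if the worker fails to connect, use the exact master URL shown at the top of the master’s web UI instead:

# start the standalone master
start-master.sh
# start one worker (slave) and register it with the master
start-slave.sh spark://localhost:7077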
You can check various indicators for your Spark cluster (e.g., the number of workers and the resources available in your cluster) by going to Spark’s web interface dashboard. By default, the web UI is available at http://localhost:8080/.


You can start Spark’s shell by running spark-shell in your terminal.
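If you want the shell to run against the standalone cluster you just started rather than in local mode, you can pass the master URL explicitly, for example:

spark-shell --master spark://localhost:7077

Once the shell is up, it should also show up as a running application on the master’s web UI.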

When you have finished your work with the cluster, you need to stop the Spark services. The general pattern is to first stop the worker (slave) service and then stop the master. The easiest way to shut down our simple cluster with the current settings is to execute stop-slave.sh and then stop-master.sh in the terminal, in that order. If everything went well, after you shut down the cluster there will be no Java processes belonging to your Spark cluster services left. You can check this by running the htop command in the terminal and inspecting the output: there should be no Spark-related Java processes running inside your Ubuntu environment.
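If you prefer not to use htop, a quick alternative is to list the Java processes directly; after a clean shutdown, no Master or Worker entries should remain:

# jps ships with the JDK and lists running JVM processes
jps
# or, more generally (note that the grep line itself will appear in the output):
ps aux | grep -i spark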

Conclusions
In this brief tutorial, we described the general steps to install and set up a very rudimentary Apache Spark cluster on a machine running the Microsoft Windows 10 operating system. Apache Spark is a sophisticated open-source cluster computing framework with many different modes of operation and settings. What we described here is suitable for people who want to get familiar with Spark for the first time, or people who wish to develop Spark applications on their own machines. The resulting cluster is meant as a mock cluster for learning purposes rather than for real-world workloads. Nevertheless, working with something as exciting and useful as Apache Spark can always be enjoyable for everybody :o).
Additional resources:
With the ever-growing popularity of machine learning nowadays, it is natural to expect that people will use Apache Spark in their machine learning pipelines. So, here is a nice little Spark tutorial for those who want to deploy their machine learning models onto a Spark cluster: https://neptune.ai/blog/apache-spark-tutorial.