Writing a HelloWorld Spark application with IntelliJ IDE and Python 3 in Windows 10

Florence, Italy; source: https://flic.kr/p/2jPh9KA

Introduction and context

In this tutorial, I will show you how to set up a minimal working environment for developing Apache Spark applications on your Windows machine. So, without further ado, let’s go!

Step 1: installing Java SE Development Kit 11

First, go to Oracle’s website and download the Java SE Development Kit 11 (JDK 11) installer for 64-bit Windows. Then run the downloaded file to install JDK 11 on your computer. After the installation finishes, you can check that Java 11 is available by executing `java -version` in the Windows Command Prompt or PowerShell.
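
If you would rather script this check, the sketch below runs `java -version` from Python and extracts the major version number. The helper name `parse_java_major` is my own, and the parsing assumes the usual `version "11.0.x"` banner format, which is an assumption about your JDK build and not a guarantee.

```python
import re
import shutil
import subprocess


def parse_java_major(banner):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.x (for example "1.8.0_271").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


if __name__ == "__main__":
    java = shutil.which("java")
    if java is None:
        print("java was not found on Path; check the JDK installation.")
    else:
        # `java -version` writes its banner to stderr, so read both streams.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        print("Detected Java major version:", parse_java_major(result.stderr + result.stdout))
```

If the script reports a version other than 11, another Java installation probably comes first on your `Path`.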

Step 2: installing Python 3

Go to Python’s website and download the Windows binary installation file for Python 3.9. Then execute the downloaded file to install Python 3. After the installation finishes, you should be able to start Python 3’s interactive shell by executing `python` in the Windows Command Prompt or PowerShell.

Step 3: downloading and configuring Spark 3

First, create a folder named `BigData` in the root of your `C:` drive. Go to Apache Spark’s website and download Spark 3’s binary files (pre-built for Hadoop 2.7) to `C:\BigData`. Then extract the downloaded file in that same directory (`C:\BigData`). After this step, the content of the `BigData` folder should look like this:

Now rename the `spark-3.*` folder to `spark3`. Then go to https://github.com/cdarlint/winutils and download the files in this Git repository as a Zip file using the green `Code` button in the top right corner. (By default, the downloaded file will be named `winutils-master.zip`.)

Move the file you downloaded to `C:\BigData` and extract it there.

Now you must set a few environment variables in Windows 10. To do so, type `edit the system environment variables` in the Windows search bar (in the bottom left corner) and press ENTER. You should now see the `System Properties` window (see the picture below).

Press the button labeled `Environment Variables` in the `System Properties` window. You should now see the `Environment Variables` window (see the picture below).

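Now press the `New` button in the top panel (`User variables`) and add the `SPARK_HOME` and `HADOOP_HOME` environment variables. `SPARK_HOME` should point to the Spark folder (`C:\BigData\spark3`), and `HADOOP_HOME` to the folder inside the extracted winutils archive whose `bin` subfolder contains `winutils.exe` for your Hadoop version.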

Now select the `Path` environment variable in the top panel (or list) and press the `Edit` button.

Now add `%HADOOP_HOME%\bin` and `%SPARK_HOME%\bin` to the `Path` environment variable.
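
Before moving on, it can help to confirm that the new variables are visible to newly started programs. The following is a minimal sketch, assuming the variables are named `SPARK_HOME` and `HADOOP_HOME` as referenced above; the function name `missing_variables` is hypothetical. Run it from a freshly opened terminal, because terminals that were already open do not pick up the new values.

```python
import os
import shutil


def missing_variables(env, names=("SPARK_HOME", "HADOOP_HOME")):
    """Return the variable names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("SPARK_HOME =", os.environ["SPARK_HOME"])
        print("HADOOP_HOME =", os.environ["HADOOP_HOME"])
    # shutil.which looks the command up on Path; None means it was not found there.
    print("spark-shell resolves to:", shutil.which("spark-shell"))
```

If `spark-shell` resolves to `None`, the `%SPARK_HOME%\bin` entry is most likely missing from `Path` or the terminal was opened before the change.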

Now open PowerShell, type `spark-shell`, and press ENTER. Wait until Spark’s interactive shell has started, then open the Spark web UI in your web browser, using the address that `spark-shell` prints while starting up.

Step 4: installing IntelliJ IDE with Python plugin

Go to IntelliJ IDE’s website and download the IntelliJ IDE. You can use either the free Community edition or the Ultimate edition. (Students can get a free license for the Ultimate edition; they only need a university email address.) After the installation finishes, open the `Settings` window from the `File` menu and go to the `Plugins` section. Make sure the Python plugin for IntelliJ is installed.

Step 5: Writing and executing a Hello World Spark application

In IntelliJ IDE, create a new Python project (go to `File/New/Project`) and select Python 3.9, which you installed in Step 2 of this tutorial, as the `Project SDK`. Then press the `Next` button, and then `Next` again.

Now pick a name for the project and a path to save the project files, and press the `Finish` button. (I chose the name PySparkHelloWork for my project and saved it in my Documents directory under a folder named IntelliJ, as you can see in the picture.)

Open the `Terminal` window in the bottom left corner of the project’s main window and execute `pip install pyspark findspark` in it. Wait until `pip` finishes installing these packages.

Now create a folder named `src` in the project and create a new Python script called main (or `main.py`) inside the `src` folder. Then copy the Python code you see in the box below into the `main.py` file and save it.
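
A minimal version of such a script could look like the sketch below. It assumes that `findspark` can locate Spark through the `SPARK_HOME` variable set in Step 3; the application name `HelloWorld`, the helper `build_rows`, and the sample data are illustrative choices of mine, so treat it as one possible Hello World and adapt it freely.

```python
import os


def build_rows():
    """Return the sample (word, length) rows shown by the application."""
    words = ["Hello", "World", "from", "Spark"]
    return [(word, len(word)) for word in words]


def main():
    # Imported here so the file can be loaded even before the packages are installed.
    import findspark

    # findspark.init() uses SPARK_HOME to make the Spark installation importable.
    findspark.init()

    from pyspark.sql import SparkSession

    # local[*] runs Spark inside this process using all available cores.
    spark = SparkSession.builder.master("local[*]").appName("HelloWorld").getOrCreate()
    try:
        df = spark.createDataFrame(build_rows(), ["word", "length"])
        df.show()
    finally:
        # Always release the local Spark resources, even if show() fails.
        spark.stop()


if __name__ == "__main__":
    if os.environ.get("SPARK_HOME"):
        main()
    else:
        print("SPARK_HOME is not set; finish Step 3 before running this script.")
```

The script builds a tiny DataFrame with one row per word and prints it as a table, which is enough to confirm that Python, Java, and Spark can talk to each other.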

Run (or execute) the `main.py` script. You should now see the results in the output.

Congratulations! You have developed your first Spark application in Python. Happy making (more!) Spark applications. :0)