Configuration of Apache Spark and Spark UI using Google Colab

Luis Carlos

In Google Colab, click ‘File’, then ‘Upload notebook’, and finally choose GitHub, where you can paste the following link to open the notebook with the code and instructions:


You can leave a star on GitHub if this tutorial was useful to you.

The repository link is:


For Windows users, configuring Spark can be a headache. Fortunately, we can use Google Colab to work in the cloud. Here I’m going to explain what you need to use Apache Spark with Google Colab.

First, let’s install Java, Spark, and the findspark module. The last one lets Python locate the Spark installation when you work without Anaconda.

# Install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Install Spark
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz

# Install findspark for use with Python
!pip install -q findspark

Then, we need to define the environment variables so that Python knows where Java and Spark were installed.

import os 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
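
Before going further, it can help to confirm that both variables point to directories that actually exist, since a typo in either path only shows up later as a confusing error. The following is a minimal sketch; the helper name missing_env_dirs is hypothetical and not part of the original notebook.

```python
import os

def missing_env_dirs(names):
    """Return the variable names that are unset or do not point to an existing directory."""
    missing = []
    for name in names:
        value = os.environ.get(name)
        if not value or not os.path.isdir(value):
            missing.append(name)
    return missing

# In the notebook, an empty list means both locations exist.
print(missing_env_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

If a name is printed, check that the corresponding install step finished and that the version in the path matches the file you downloaded.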

Now, let’s configure the Spark UI. For that, we need to install ngrok and import some modules.

import findspark
findspark.init()  # Make the Spark installation under SPARK_HOME importable

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# Install ngrok
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

# Configure the Spark UI on port 4050
# (the default port, 4040, is also used by ngrok's local API)
conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)

To open the Spark UI, you need to execute:

# Create a URL through which you can access the Spark UI
get_ipython().system_raw('./ngrok http 4050 &')

Then wait about 10 seconds and run:

# Ask ngrok's local API for the public URL of the tunnel
!curl -s http://localhost:4040/api/tunnels
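
The curl command prints a block of JSON, and the address you want is buried inside it. As a rough sketch, the URL can be pulled out with the standard json module. This assumes the response has a top-level "tunnels" list whose entries carry a "public_url" field; the helper name public_urls and the sample payload below are illustrative only.

```python
import json

def public_urls(tunnels_json):
    """Return the public_url of every tunnel listed in an ngrok /api/tunnels response."""
    data = json.loads(tunnels_json)
    return [t["public_url"] for t in data.get("tunnels", []) if "public_url" in t]

# Made-up payload for illustration; the real response has more fields and a different URL.
sample = '{"tunnels": [{"proto": "https", "public_url": "https://example.ngrok.io"}]}'
print(public_urls(sample))
```

In the notebook, the same function could be fed the text returned by the request above instead of the sample string. An empty list usually means the tunnel is not up yet, so wait a few more seconds and try again.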

If no context is active, the Spark UI will show a failed connection. The context created earlier exists only to configure the Spark UI, so we stop it with sc.stop() once the configuration is done.

Finally, I’m going to show you how to attach files.

# Mount Google Drive, the source from which the necessary files will be taken
from google.colab import drive
drive.mount('/content/drive')

Here, you need to grant Google Colab permission to access your Drive. When you do, Google will give you an authorization code. All that is left is to define the path where our files are stored.

path = '/content/drive/MyDrive/Colab Notebooks/fundamentos-de-Spark/'

Adjust this to match your own path.
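
A quick way to confirm that the path is right is to list what it contains before handing any file to Spark. The sketch below uses only the standard library; the helper name list_files and the ".csv" extension in the usage note are assumptions for illustration, not part of the original notebook.

```python
import os

def list_files(folder, extension=""):
    """Return sorted full paths of the files in folder whose names end with extension."""
    if not os.path.isdir(folder):
        # A wrong or unmounted path yields an empty list instead of an exception.
        return []
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(extension) and os.path.isfile(os.path.join(folder, name))
    )
```

For example, list_files(path, ".csv") would return the CSV files in the folder defined above. An empty result suggests that the Drive is not mounted or that the path needs adjusting.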

That’s all! Now you can work with Apache Spark.

If you find any errors, please let me know.
