Inside Google Colab, click on ‘File’, then on ‘Upload Notebook’, and finally choose GitHub, where you can paste the following link to open the notebook with the code and instructions:
You can leave a star on GitHub if this tutorial was useful to you.
The repository link is:
https://github.com/lcgc99/configure_Apache_Spark_using_Colab
For Windows users, configuring Spark can be a headache. Fortunately, we have the option of using Google Colab to work in the cloud. Here I’m going to explain what you need to use Apache Spark with Google Colab.
First, let’s install Java, Spark, and the findspark module. The latter is needed when you work outside Anaconda.
# Install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Install Spark
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
# Install findspark for use with plain Python
!pip install -q findspark
Then, we need to define the environment variables.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
Now, let’s configure Spark UI. For that, we need to install ngrok and import some modules.
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
# Install ngrok
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
# Configure Spark UI
conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)
sc.stop()  # stop this throwaway context; it was only needed to set the UI port
To open Spark UI, you need to execute:
# Create a URL through which you can access the Spark UI
get_ipython().system_raw('./ngrok http 4050 &')
Then, wait about 10 seconds and run:
# Wait ~10 s, then fetch the public URL from the ngrok API
!curl -s http://localhost:4040/api/tunnels
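The curl command returns JSON. If you prefer to pull out just the public URL in Python, here is a small sketch (assuming ngrok’s local API is on its default port, 4040):
# Parse the ngrok API response and print only the public tunnel URL
import requests
tunnels = requests.get('http://localhost:4040/api/tunnels').json()['tunnels']
print(tunnels[0]['public_url'])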
If you don’t have an active context, the Spark UI will show a failed connection. The context created above was only used to configure the Spark UI; that is why we stopped it.
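So, to actually see jobs in the UI, you need to start a session of your own. A minimal sketch, reusing the port 4050 we configured above:
# Start a new session on port 4050 so the ngrok tunnel shows a live UI
spark = SparkSession.builder.config('spark.ui.port', '4050').getOrCreate()
# Run a tiny job so something shows up in the UI
spark.range(1000).count()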
Finally, I’m going to show you how to access your files by mounting Google Drive.
# Mount Google Drive, the source from which the necessary files will be taken
from google.colab import drive
drive.mount('/content/drive')
Here, you need to give Google Colab permission to access your Drive. When you do, Google will give you an authorization code to paste into the notebook. All that’s left is to define our path, where our files live.
path = '/content/drive/MyDrive/Colab Notebooks/fundamentos-de-Spark/'
You will need to adjust this to your own path.
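For example, to load a file from that folder into a Spark DataFrame (the file name data.csv is hypothetical, and this assumes the SparkSession created earlier):
# 'data.csv' is a hypothetical file name; replace it with one of your own files
df = spark.read.csv(path + 'data.csv', header=True, inferSchema=True)
df.show(5)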
That’s all! Now you can work with Apache Spark.
If you find any errors, please let me know.
These installation steps are no longer valid for running Spark on Colab; if I manage to get it working, I will update this comment.
I got it working, although I still can’t use the Spark UI. Here is the Colab link with the method that works as of July 2021:
https://colab.research.google.com/drive/1fRCVu8nBmpgd8S7Agglif-VDIOleFSz0?usp=sharing