In Google Colab, click 'File', then 'Upload Notebook', and finally choose GitHub, where you can paste the following link to open the notebook with the code and instructions:
You can leave a star on GitHub if this tutorial was useful to you.
The repository link is:
For Windows users, configuring Spark can be a headache. Fortunately, we have the option of using Google Colab to work in the cloud. Here I'm going to explain what you need in order to use Apache Spark with Google Colab.
First, let's install Java, Spark, and the findspark module. The latter is needed when you work outside Anaconda.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark
Then, we need to define the Environment Variables.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
Now, let’s configure Spark UI. For that, we need to install ngrok and import some modules.
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# Install ngrok
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

# Configure Spark UI
conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)
sc.stop()
To open Spark UI, you need to execute:
# Create a URL through which you can access the Spark UI
get_ipython().system_raw('./ngrok http 4050 &')
Then wait about 10 seconds and run:
# Then wait for 10 s to access the URL
!curl -s http://localhost:4040/api/tunnels
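The `curl` call returns JSON, and the Spark UI address is in the `public_url` field of each tunnel entry. As a small sketch (with a trimmed, hypothetical response shape), you could extract it in Python instead of reading the raw JSON by eye:

```python
import json

def public_urls(tunnels_json):
    """Return the public URLs from an ngrok /api/tunnels JSON response."""
    return [t['public_url'] for t in json.loads(tunnels_json)['tunnels']]

# Example response, trimmed to the field we actually use
sample = '{"tunnels": [{"public_url": "https://abc123.ngrok.io"}]}'
print(public_urls(sample))  # ['https://abc123.ngrok.io']
```

In Colab you would feed it the body returned by `curl -s http://localhost:4040/api/tunnels`.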
If you don't have an active context, the Spark UI will show a failed connection. The context created above serves only to configure the Spark UI, which is why we stop it.
Finally, I’m going to show you how to attach files.
# Mount Google Drive, the source from which the necessary files will be taken
from google.colab import drive
drive.mount('/content/drive')
Here, you need to grant permissions to Google Colab. When you do, Google will give you an authorization code. All that's left is to define the path where our files are stored.
path = '/content/drive/MyDrive/Colab Notebooks/fundamentos-de-Spark/'
You need to adjust this to your own path.
That’s all! Now you can work with Apache Spark.
If you find any errors, please let me know.