
Spark : Installation

We can use programming languages like Java, Scala, or Python to write Spark applications. It is also very important to understand the version compatibility of the different pieces of software we need to run Spark.


Important points to remember about Scala :

  • Java is an object-oriented programming language, but it is not a functional one
  • Scala is both object-oriented and functional
  • Scala is built on top of Java (the JVM), so we can call Java code directly from Scala code (see the sketch below)
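
A minimal sketch of that interoperability (plain JDK classes, nothing Spark-specific): the Java classes below are used from Scala with no wrapper code at all.

// Using ordinary Java classes directly from Scala
import java.util.ArrayList
import java.time.LocalDate

object JavaInteropDemo {
  def main(args: Array[String]): Unit = {
    val list = new ArrayList[String]()   // a Java collection
    list.add("Spark")
    list.add("Scala")
    println(list)                        // [Spark, Scala]
    println(LocalDate.now())             // today's date via Java's time API
  }
}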


Version compatibility : We have to make sure that we use compatible Scala, Java, and Spark versions while installing them in standalone mode for development/testing purposes. If you are working in the cloud (for example AWS), there is no need to worry; compatibility is taken care of automatically when you create a service such as an EMR cluster with the required big data software. Below are the compatible combinations, followed by a quick runtime check.
  • Scala-2.9,   Java-1.7+, Spark-1.x
  • Scala-2.10, Java-1.7+, Spark-1.x
  • Scala-2.11, Java-1.8+, Spark-1.x, Spark-2.x
  • Scala-2.12, Java-1.8+, Spark-2.x, Spark-3.x
  • Scala-2.13, Java-1.8+, Spark-3.x
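
Once Spark is installed (covered below), a quick runtime check of the combination you actually ended up with can be done from inside spark-shell; the values in the comments are only examples.

println(spark.version)                          // e.g. 3.4.0
println(scala.util.Properties.versionString)    // e.g. version 2.12.17 (the Scala bundled with Spark)
println(System.getProperty("java.version"))     // e.g. 1.8.0_432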

Programming language backward compatibility :
  • Source code compatibility (.java in Java, .py in Python)
  • Binary compatibility (.class files in Java, pickle files in Python - this is what we rely on in real-world projects)

Note :
  • Scala doesn't support binary compatibility across versions, meaning that if you compile with Scala 2.9 and create .class and JAR files, you can use those binaries only with Scala 2.9. You CAN'T use those binaries with higher versions of Scala.
  • Java doesn't have this problem; binaries built with an older version keep working on newer JVMs

For more information on compatibility, check the spark-core page in the Maven repository : https://mvnrepository.com/artifact/org.apache.spark/spark-core



If you look at the spark-core page linked above, Spark 3.5.4 publishes 2 JARs for Scala (2.13 and 2.12) precisely because Scala does not support binary compatibility across versions. Hence we need 2 different binaries of the same Spark code, one built against Scala 2.12 and another against Scala 2.13. It doesn't mean we have to install 2 different versions of Scala in our production environments; we just need to decide which version of Scala to use (from the set of compatible versions) and install that one.
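
This is also visible in the build definition. As a sketch (assuming sbt as the build tool, which this post itself doesn't use), the %% operator appends the project's Scala binary version to the artifact name, so the right spark-core_2.12 or spark-core_2.13 JAR is picked automatically.

// build.sbt sketch
ThisBuild / scalaVersion := "2.12.18"

// %% appends the Scala binary version, resolving to spark-core_2.12 here
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.4"

// Equivalent explicit form, pinning the suffix by hand:
// libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "3.5.4"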


Let's get to the installation part now : 

This installation process is for development purposes, as we are doing it on a plain computer (not in the cloud).

Prerequisites :
  • We need a Linux environment; we can either use VMware to set up a Linux system or dual-boot our Windows machine
  • I have installed Ubuntu
  • Let's install Spark-3.x, Scala-2.x & Java 1.8

Commands to install Java on Ubuntu : 
  • To install Java 8  : sudo apt-get install openjdk-8-jdk
  • To install Java 11 : sudo apt-get install openjdk-11-jdk
To install Scala, we go with the binary (pre-built) approach :
  • Download Scala binaries from Scala website : https://www.scala-lang.org/download/2.12.0.html
  • Copy and extract this file into our work directory inside the Linux environment 
  • orienit@orienit:~/work$ pwd
    • /home/orienit/work
    • This is my work directory inside the Linux environment
  • Update '~/.bashrc' file with below changes
    • command: gedit ~/.bashrc
    • export SCALA_HOME=/home/orienit/work/scala-2.12.12
    • export PATH=$SCALA_HOME/bin:$PATH
  • Re-open the terminal
  • Verify the installation with the following command
    • orienit@orienit:~$ echo $SCALA_HOME
    • /home/orienit/work/scala-2.12.12
  • We are done with the Scala installation; a quick Scala REPL smoke test is sketched below
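
This is only a minimal sketch, not part of the original steps: open the Scala REPL with the scala command and evaluate a couple of expressions to confirm the binaries on the PATH are the ones we just installed (the version string in the comment assumes the 2.12.12 binaries above).

println(scala.util.Properties.versionString)   // expect: version 2.12.12
println((1 to 5).map(_ * 2))                   // Vector(2, 4, 6, 8, 10)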

Spark Installation :

  • Download Spark version from https://spark.apache.org/downloads.html
  • Prefer a slightly older release (the very latest versions may not be stable yet)
  • Extract the downloaded archive and place it in the work directory
  • My work directory is : /home/orienit/work
  • Update ~/.bashrc using the command gedit ~/.bashrc and add the below 2 lines 
  • export SPARK_HOME=/home/orienit/work/spark-3.4.0-bin-hadoop3
  • export PATH=$SPARK_HOME/bin:$PATH
  • Add the below 2 lines to enable PySpark support
  • export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
  • export PYSPARK_PYTHON=python3
  • Re-open terminal
  • Verify using the command : echo $SPARK_HOME (a quick spark-shell smoke test is sketched below)
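
As a minimal end-to-end smoke test (not part of the original steps), run a tiny computation in spark-shell; the shell pre-creates sc (SparkContext) and spark (SparkSession) for you.

val nums = sc.parallelize(1 to 100)          // distribute a small range
println(nums.filter(_ % 2 == 0).sum())       // sum of the even numbers: 2550.0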

Important commands :
  • To start Spark : $SPARK_HOME/sbin/start-all.sh (Pseudo mode installation)
  • To stop Spark : $SPARK_HOME/sbin/stop-all.sh (Pseudo mode installation)
  • Spark with Java : spark-shell
  • Spark with Scala : spark-shell
  • Spark with Python : pyspark
  • Spark with R : sparkR

Note : I have installed Spark 3.4 (I did this installation a while back)

Confirmations :

orienit@orienit:~$ echo $SPARK_HOME
/home/orienit/work/spark-3.4.0-bin-hadoop3

orienit@orienit:~$ python3
Python 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

orienit@orienit:~$ scala -version
Scala code runner version 2.12.12 -- Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.

orienit@orienit:~$ spark-shell
25/01/30 12:57:22 WARN Utils: Your hostname, orienit resolves to a loopback address: 127.0.0.1; using 192.168.147.129 instead (on interface ens33)
25/01/30 12:57:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/30 12:57:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.147.129:4040
Spark context available as 'sc' (master = local[*], app id = local-1738222053714).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/
         
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_432)
Type in expressions to have them evaluated.
Type :help for more information.

scala>


orienit@orienit:~$ pyspark 
Python 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
25/01/30 12:58:12 WARN Utils: Your hostname, orienit resolves to a loopback address: 127.0.0.1; using 192.168.147.129 instead (on interface ens33)
25/01/30 12:58:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/30 12:58:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/01/30 12:58:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Python version 3.10.12 (main, Jan 17 2025 14:35:34)
Spark context Web UI available at http://192.168.147.129:4041
Spark context available as 'sc' (master = local[*], app id = local-1738222095884).
SparkSession available as 'spark'.
>>> 


All older Apache versions are available on this site : https://archive.apache.org/dist/



Points to remember :
  • If you learn Spark with one programming language like Python, then working with Scala is also easy; we just need to know the basics of Scala
  • We have to understand which versions of Scala, Java, and Spark are compatible
  • A Hadoop installation is not required for Spark
  • Spark just uses the Hadoop libraries
  • We need to install Hadoop only when we need to read data from Hadoop while running Spark code
  • We can call Java code from Scala easily; calling Scala code from Java is not as straightforward
  • Similar to Hadoop's installation modes, Spark also has 3 installation modes
    • Local mode
    • Pseudo mode
    • Cluster mode
  • The above installation is a Local mode installation (a standalone-app sketch follows below)
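
To make the Local mode point concrete, here is a sketch of a standalone application (the object name LocalModeApp is hypothetical): unlike spark-shell, a packaged application builds its own SparkSession and sets the master explicitly.

import org.apache.spark.sql.SparkSession

object LocalModeApp {
  def main(args: Array[String]): Unit = {
    // local[*] = Local mode using all cores on this machine; in Pseudo/Cluster
    // mode the master is normally supplied via spark-submit instead.
    val spark = SparkSession.builder()
      .appName("local-mode-demo")
      .master("local[*]")
      .getOrCreate()

    println(spark.version)   // confirm which Spark we are running on
    spark.stop()
  }
}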

Sample code to read a student.json file locally and print its contents:

Scala :
val df = spark.read.json("file:///home/orienit/work/input/student.json")
df.show()

Python :
df = spark.read.json("file:///home/orienit/work/input/student.json")
df.show()
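
A slightly fuller Scala sketch of the same read; the column names id and name are hypothetical and must match whatever your student.json actually contains.

val df = spark.read.json("file:///home/orienit/work/input/student.json")
df.printSchema()                 // inferred columns and types
df.select("name").show()         // project a single column
df.filter(df("id") > 1).show()   // simple row filter (assumes an id column)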


Let's see more in the coming blogs. Have a great day!



Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111












