
Spark : Installation

We can use programming languages like Java, Scala, or Python to write Spark applications. It is also very important to understand the version compatibility of the different pieces of software we need to run Spark.


Important points to remember about Scala :

  • Java is an object-oriented programming language, but it is not a functional one
  • Scala is both object-oriented and functional
  • Scala is built on top of Java (the JVM), so we can call Java code directly from Scala code (see the sketch below)
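
A minimal sketch of that interoperability (plain JDK classes, nothing Spark-specific): the Java classes below are used from Scala with no wrapper code at all.

// Using ordinary Java classes directly from Scala
import java.util.ArrayList
import java.time.LocalDate

object JavaInteropDemo {
  def main(args: Array[String]): Unit = {
    val list = new ArrayList[String]()   // a Java collection
    list.add("Spark")
    list.add("Scala")
    println(list)                        // [Spark, Scala]
    println(LocalDate.now())             // today's date via Java's time API
  }
}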


Version compatibility : We have to make sure that we use compatible Scala, Java, and Spark versions while installing them in standalone mode for development/testing purposes. If you are working in the cloud (for example AWS), there is no need to worry; compatibility is taken care of automatically when you create a service such as an EMR cluster with the required big data software. Below are the compatible combinations, followed by a quick runtime check.
  • Scala-2.9,   Java-1.7+, Spark-1.x
  • Scala-2.10, Java-1.7+, Spark-1.x
  • Scala-2.11, Java-1.8+, Spark-1.x, Spark-2.x
  • Scala-2.12, Java-1.8+, Spark-2.x, Spark-3.x
  • Scala-2.13, Java-1.8+, Spark-3.x
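
Once Spark is installed (covered below), a quick runtime check of the combination you actually ended up with can be done from inside spark-shell; the values in the comments are only examples.

println(spark.version)                          // e.g. 3.4.0
println(scala.util.Properties.versionString)    // e.g. version 2.12.17 (the Scala bundled with Spark)
println(System.getProperty("java.version"))     // e.g. 1.8.0_432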

Programming language backward compatibility :
  • Source code compatibility (.java in Java, .py in Python)
  • Binary compatibility (.class files in Java, pickle files in Python - this is what we rely on in real-world projects)

Note :
  • Scala doesn't support binary compatibility across versions, meaning that if you compile with Scala 2.9 and create .class and JAR files, you can use those binaries only with Scala 2.9. You CAN'T use those binaries with higher versions of Scala.
  • Java doesn't have this problem; binaries built with an older version keep working on newer JVMs

For more information on compatibility, check the spark-core page in the Maven repository : https://mvnrepository.com/artifact/org.apache.spark/spark-core



If you look at the spark-core page linked above, Spark 3.5.4 publishes 2 JARs for Scala (2.13 and 2.12) precisely because Scala does not support binary compatibility across versions. Hence we need 2 different binaries of the same Spark code, one built against Scala 2.12 and another against Scala 2.13. It doesn't mean we have to install 2 different versions of Scala in our production environments; we just need to decide which version of Scala to use (from the set of compatible versions) and install that one.
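
This is also visible in the build definition. As a sketch (assuming sbt as the build tool, which this post itself doesn't use), the %% operator appends the project's Scala binary version to the artifact name, so the right spark-core_2.12 or spark-core_2.13 JAR is picked automatically.

// build.sbt sketch
ThisBuild / scalaVersion := "2.12.18"

// %% appends the Scala binary version, resolving to spark-core_2.12 here
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.4"

// Equivalent explicit form, pinning the suffix by hand:
// libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "3.5.4"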


Let's get to the installation part now : 

This installation process is for development purposes, as we are doing it on a plain computer (not in the cloud).

Prerequisites :
  • We need a Linux environment; we can either use VMware to set up a Linux system or dual-boot our Windows machine
  • I have installed Ubuntu
  • Let's install Spark-3.x, Scala-2.x & Java 1.8

Commands to install Java on Ubuntu : 
  • To install Java 8  : sudo apt-get install openjdk-8-jdk
  • To install Java 11 : sudo apt-get install openjdk-11-jdk
To install Scala, we go with the binary (pre-built) approach :
  • Download Scala binaries from Scala website : https://www.scala-lang.org/download/2.12.0.html
  • Copy and extract this file into our work directory inside the Linux environment 
  • orienit@orienit:~/work$ pwd
    • /home/orienit/work
    • This is my work directory inside the Linux environment
  • Update '~/.bashrc' file with below changes
    • command: gedit ~/.bashrc
    • export SCALA_HOME=/home/orienit/work/scala-2.12.12
    • export PATH=$SCALA_HOME/bin:$PATH
  • Re-open the terminal
  • Verify the installation with the following command
    • orienit@orienit:~$ echo $SCALA_HOME
    • /home/orienit/work/scala-2.12.12
  • We are done with the Scala installation; a quick Scala REPL smoke test is sketched below
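
This is only a minimal sketch, not part of the original steps: open the Scala REPL with the scala command and evaluate a couple of expressions to confirm the binaries on the PATH are the ones we just installed (the version string in the comment assumes the 2.12.12 binaries above).

println(scala.util.Properties.versionString)   // expect: version 2.12.12
println((1 to 5).map(_ * 2))                   // Vector(2, 4, 6, 8, 10)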

Spark Installation :

  • Download Spark version from https://spark.apache.org/downloads.html
  • Prefer a slightly older release (the very latest versions may not be stable yet)
  • Extract the downloaded archive and place it in the work directory
  • My work directory is : /home/orienit/work
  • Update ~/.bashrc using the command gedit ~/.bashrc and add the below 2 lines 
  • export SPARK_HOME=/home/orienit/work/spark-3.4.0-bin-hadoop3
  • export PATH=$SPARK_HOME/bin:$PATH
  • Add the below 2 lines to enable PySpark support
  • export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
  • export PYSPARK_PYTHON=python3
  • Re-open terminal
  • Verify using the command : echo $SPARK_HOME (a quick spark-shell smoke test is sketched below)
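
As a minimal end-to-end smoke test (not part of the original steps), run a tiny computation in spark-shell; the shell pre-creates sc (SparkContext) and spark (SparkSession) for you.

val nums = sc.parallelize(1 to 100)          // distribute a small range
println(nums.filter(_ % 2 == 0).sum())       // sum of the even numbers: 2550.0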

Important commands :
  • To start Spark : $SPARK_HOME/sbin/start-all.sh (Pseudo mode installation)
  • To stop Spark : $SPARK_HOME/sbin/stop-all.sh (Pseudo mode installation)
  • Spark with Java : spark-shell
  • Spark with Scala : spark-shell
  • Spark with Python : pyspark
  • Spark with R : sparkR

Note : I have installed Spark 3.4 (I did this installation a while back)

Confirmations :

orienit@orienit:~$ echo $SPARK_HOME
/home/orienit/work/spark-3.4.0-bin-hadoop3

orienit@orienit:~$ python3
Python 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

orienit@orienit:~$ scala -version
Scala code runner version 2.12.12 -- Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.

orienit@orienit:~$ spark-shell
25/01/30 12:57:22 WARN Utils: Your hostname, orienit resolves to a loopback address: 127.0.0.1; using 192.168.147.129 instead (on interface ens33)
25/01/30 12:57:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/30 12:57:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.147.129:4040
Spark context available as 'sc' (master = local[*], app id = local-1738222053714).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/
         
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_432)
Type in expressions to have them evaluated.
Type :help for more information.

scala>


orienit@orienit:~$ pyspark 
Python 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
25/01/30 12:58:12 WARN Utils: Your hostname, orienit resolves to a loopback address: 127.0.0.1; using 192.168.147.129 instead (on interface ens33)
25/01/30 12:58:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/30 12:58:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/01/30 12:58:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Python version 3.10.12 (main, Jan 17 2025 14:35:34)
Spark context Web UI available at http://192.168.147.129:4041
Spark context available as 'sc' (master = local[*], app id = local-1738222095884).
SparkSession available as 'spark'.
>>> 


All older Apache versions are available on this site : https://archive.apache.org/dist/



Points to remember :
  • If you learn Spark with one programming language like Python, then working with Scala is also easy; we just need to know the basics of Scala
  • We have to understand which versions of Scala, Java, and Spark are compatible
  • A Hadoop installation is not required for Spark
  • Spark just uses the Hadoop libraries
  • We need to install Hadoop only when we need to read data from Hadoop while running Spark code
  • We can call Java code from Scala easily; calling Scala code from Java is not as straightforward
  • Similar to Hadoop's installation modes, Spark also has 3 installation modes
    • Local mode
    • Pseudo mode
    • Cluster mode
  • The above installation is a Local mode installation (a standalone-app sketch follows below)
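
To make the Local mode point concrete, here is a sketch of a standalone application (the object name LocalModeApp is hypothetical): unlike spark-shell, a packaged application builds its own SparkSession and sets the master explicitly.

import org.apache.spark.sql.SparkSession

object LocalModeApp {
  def main(args: Array[String]): Unit = {
    // local[*] = Local mode using all cores on this machine; in Pseudo/Cluster
    // mode the master is normally supplied via spark-submit instead.
    val spark = SparkSession.builder()
      .appName("local-mode-demo")
      .master("local[*]")
      .getOrCreate()

    println(spark.version)   // confirm which Spark we are running on
    spark.stop()
  }
}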

Sample code to read a student.json file locally and print its contents:

Scala :
val df = spark.read.json("file:///home/orienit/work/input/student.json")
df.show()

Python :
df = spark.read.json("file:///home/orienit/work/input/student.json")
df.show()
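
A slightly fuller Scala sketch of the same read; the column names id and name are hypothetical and must match whatever your student.json actually contains.

val df = spark.read.json("file:///home/orienit/work/input/student.json")
df.printSchema()                 // inferred columns and types
df.select("name").show()         // project a single column
df.filter(df("id") > 1).show()   // simple row filter (assumes an id column)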


Let's see more in the coming blogs. Have a great day!



Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111












