We can use programming languages like Java, Scala, and Python to write Spark applications. It is also very important to understand the version compatibility of the different software components needed to run Spark.
Important points to remember in Scala :
- Java is an object-oriented programming language, but it is not a functional language
- Scala is both object-oriented and functional
- Scala is built on top of Java (it runs on the JVM), so we can call Java code directly from Scala code (see the sketch below)
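A minimal sketch of this Java interoperability (the class and object names here are just illustrative; any JDK class can be used the same way from Scala):

// Calling standard Java (JDK) classes directly from Scala
import java.util.ArrayList
import java.time.LocalDate

object JavaInteropDemo {
  def main(args: Array[String]): Unit = {
    val list = new ArrayList[String]()   // plain Java collection, used like any Scala class
    list.add("spark")
    list.add("scala")
    println(list)                        // prints [spark, scala]
    println(LocalDate.now())             // today's date via the Java time API
  }
}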
Version compatibility : We have to make sure we use a compatible Scala, Java and Spark combination when installing them in standalone mode for dev/testing purposes (a quick way to check which versions you are actually running is shown after the list below). If you are working in the cloud (for example AWS), there is no need to worry; compatible versions are taken care of automatically when you create a service such as an EMR cluster with the required big data software. Below are the compatible versions.
- Scala-2.9, Java-1.7+, Spark-1.x
- Scala-2.10, Java-1.7+, Spark-1.x
- Scala-2.11, Java-1.8+, Spark-1.x, Spark-2.x
- Scala-2.12, Java-1.8+, Spark-2.x, Spark-3.x
- Scala-2.13, Java-1.8+, Spark-3.x
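Once Spark is installed (see the installation steps below), the following expressions, run inside spark-shell, print the Spark, Scala and Java versions actually in use; a small sketch, and the example values in the comments are illustrative :

// Run inside spark-shell to confirm which combination you have
println(spark.version)                         // Spark version, e.g. 3.4.0
println(scala.util.Properties.versionString)   // Scala version, e.g. version 2.12.17
println(System.getProperty("java.version"))    // Java version, e.g. 1.8.0_432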
Programming language backward compatibility :
- Source code compatibility (.java in Java, .py in Python)
- Binary compatibility (.class files in Java, pickle files in Python - this is what we rely on in real projects)
Note :
- Scala doesn't guarantee binary compatibility across versions. This means that if you compile with Scala 2.9 and create .class and jar files, you can use those binaries only with Scala 2.9; you CAN'T use them with higher versions of Scala.
- Java doesn't have this problem: class files compiled with an older JDK still run on newer JVMs.
For more information on compatibility, check the spark-core artifact page on the Maven repository : https://mvnrepository.com/artifact/org.apache.spark/spark-core
If you look at that page, you will see that Spark 3.5.4 publishes 2 jars for Scala (2.13 and 2.12), precisely because Scala doesn't have binary compatibility. Hence we need 2 different binaries of the same Spark code, one compiled against Scala 2.12 and another compiled against Scala 2.13. It doesn't mean that we have to install 2 different versions of Scala in our production environments; we just need to decide which version of Scala to use (from the set of compatible versions) and install it. The build fragment below shows how the Scala version appears in the artifact name.
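For example, in an sbt build (a sketch only; the version numbers here are illustrative), the %% operator appends the Scala binary version to the artifact name, which is exactly why one Spark release ships one jar per supported Scala version :

// build.sbt fragment (sketch) - %% adds the Scala binary suffix to the artifact name
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.4"
// resolves to the artifact spark-core_2.12; with scalaVersion 2.13.x it would pick spark-core_2.13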
Let's go to installation part now :
This installation process is meant for development, as we are doing it on a plain computer (not in the cloud).
Prerequisites :
- We need a Linux environment; we can either use VMware to set up a Linux virtual machine or dual-boot our Windows machine
- I have installed Ubuntu
- Let's install Spark-3.x, Scala-2.x & Java 1.8
Commands to install Java on Ubuntu :
- To install Java 8 : sudo apt-get install openjdk-8-jdk
- To install Java 11 : sudo apt-get install openjdk-11-jdk
To install Scala, we use the pre-built binaries :
- Download Scala binaries from Scala website : https://www.scala-lang.org/download/2.12.0.html
- Copy this file into our work directory inside the Linux environment and extract it there
- orienit@orienit:~/work$ pwd
- /home/orienit/work
- This is my work directory inside the Linux environment
- Update the '~/.bashrc' file with the changes below
- command: gedit ~/.bashrc
- export SCALA_HOME=/home/orienit/work/scala-2.12.12
- export PATH=$SCALA_HOME/bin:$PATH
- Re-open the terminal
- Verify the installation with the following command
- orienit@orienit:~$ echo $SCALA_HOME
- /home/orienit/work/scala-2.12.12
- We are done with Scala installation
Spark Installation :
- Download Spark version from https://spark.apache.org/downloads.html
- Prefer a slightly older release over the very latest one (the newest versions may not be as stable yet)
- Extract the downloaded archive and put it in the work directory
- My work directory is : /home/orienit/work
- Update ~/.bashrc using the command gedit ~/.bashrc and add the 2 lines below
- export SPARK_HOME=/home/orienit/work/spark-3.4.0-bin-hadoop3
- export PATH=$SPARK_HOME/bin:$PATH
- Add the 2 lines below to enable PySpark support
- export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
- export PYSPARK_PYTHON=python3
- Re-open terminal
- Verify using command : echo $SPARK_HOME
Important commands :
- To start Spark : $SPARK_HOME/sbin/start-all.sh (Pseudo mode installation)
- To stop Spark : $SPARK_HOME/sbin/stop-all.sh (Pseudo mode installation)
- Spark with Java : spark-shell (there is no separate Java shell; the Scala shell is used)
- Spark with Scala : spark-shell (see the quick smoke test after this list)
- Spark with python : pyspark
- Spark with R : sparkR
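Once spark-shell is open, a tiny smoke test (a sketch; spark and sc are pre-created by the shell, as the welcome banner below shows) confirms that the installation works :

// Paste inside spark-shell: spark (SparkSession) and sc (SparkContext) already exist
val df = spark.range(1, 6)                    // DataFrame with the values 1..5
println(df.count())                           // prints 5
println(sc.parallelize(Seq(1, 2, 3)).sum())   // RDD API through the SparkContext, prints 6.0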
Note : I have installed Spark 3.4 (this installation was done a while back).
Confirmations :
orienit@orienit:~$ echo $SPARK_HOME
/home/orienit/work/spark-3.4.0-bin-hadoop3
orienit@orienit:~$ python3
Python 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
orienit@orienit:~$ scala -version
Scala code runner version 2.12.12 -- Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.
orienit@orienit:~$ spark-shell
25/01/30 12:57:22 WARN Utils: Your hostname, orienit resolves to a loopback address: 127.0.0.1; using 192.168.147.129 instead (on interface ens33)
25/01/30 12:57:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/30 12:57:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.147.129:4040
Spark context available as 'sc' (master = local[*], app id = local-1738222053714).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_432)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
orienit@orienit:~$ pyspark
Python 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
25/01/30 12:58:12 WARN Utils: Your hostname, orienit resolves to a loopback address: 127.0.0.1; using 192.168.147.129 instead (on interface ens33)
25/01/30 12:58:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/30 12:58:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/01/30 12:58:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/
Using Python version 3.10.12 (main, Jan 17 2025 14:35:34)
Spark context Web UI available at http://192.168.147.129:4041
Spark context available as 'sc' (master = local[*], app id = local-1738222095884).
SparkSession available as 'spark'.
>>>
All older Apache releases are available on this site : https://archive.apache.org/dist/
Points to remember :
- If you learn Spark with one programming language like Python, then working with Scala is also easy; we just need to know the basics of Scala
- We have to understand which versions of Scala, Java and Spark are compatible with each other
- A Hadoop installation is not required for Spark
- Spark just uses the Hadoop libraries
- We need to install Hadoop only when our Spark code has to read data from Hadoop (HDFS)
- We can call Java code from Scala seamlessly; calling Scala code from Java is possible but far less straightforward
- Similar to Hadoop's installation modes, Spark also has 3 installation modes
- Local mode
- Pseudo mode
- Cluster mode
- The installation above is a Local mode installation (see the standalone-application sketch after this list)
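In the shells, the SparkSession is created for you; in a standalone application running in local mode you create it yourself. A minimal sketch, assuming the object name and app name are just illustrative :

// Minimal local-mode Spark application (sketch)
import org.apache.spark.sql.SparkSession

object LocalModeApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-mode-demo")   // illustrative application name
      .master("local[*]")           // local mode: run Spark inside this JVM using all cores
      .getOrCreate()
    println(spark.version)
    spark.stop()
  }
}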
Sample code to read a student.json file locally and print its contents :
Scala :
val df = spark.read.json("file:///home/orienit/work/input/student.json")
df.show()
Python :
df = spark.read.json("file:///home/orienit/work/input/student.json")
df.show()
Let's see more details in the coming blogs. Have a great day!
Arun Mathe
Gmail ID : arunkumar.mathe@gmail.com
Contact No : +91 9704117111