
Spark : What is Spark ?

Apache Spark is an open-source, distributed system for processing large amounts of data. It's used for analytics, machine learning, and other applications that require fast processing of large data sets.


History of Spark :

Around 2009, a project called Mesos was started at UC Berkeley. It is a resource management system, similar to YARN in Hadoop.

In Hadoop, we have a data processing module called MapReduce. It consists of processes called JobTracker and TaskTracker. The people who started Mesos were aware of the drawbacks of MapReduce. To test Mesos, they implemented Spark, but their primary goal was still Mesos.

The initial Spark program was just about 100 lines of code, and they observed that Spark was almost 10x faster than Hadoop. The focus then shifted from Mesos to Spark. The project was open-sourced in 2010 and donated to the Apache Software Foundation in 2013.

In early 2014, Spark became a top-level project in Apache. Instead of using MapReduce (MR), people started using Spark, which is almost 100x faster than MR.

Now, Spark is an essential Big Data technology for data processing.



What is the main programming language for Spark (Spark 1.x)?

The main programming language of Spark is Scala. To support other programming languages, wrappers are available. We also have a good number of use cases where we have to use Python (via the Py4J library).

We can use any of the below programming languages :

  • Scala
  • Python
  • Java
  • R
  • SQL


Why Scala?

Scala is implemented on top of Java (it runs on the JVM). The advantage of Scala is that it has all the Java features, you can directly use Java code in Scala, and it is a scalable language. Scala development started in the early 2000s, with the main goal of fixing the issues in Java. Also, Hadoop itself is built using Java.


Points to remember:

  • We have to use either Java or Scala to implement new features in Spark itself.
  • The major programming language is Scala.
  • We can use Python as well, via the Py4J library.
  • We can run Java code in Scala, but the reverse is not as straightforward.
  • We can directly call Java classes inside a Scala program (see the small sketch below).
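
As a small illustration of the Java interoperability point above, here is a minimal Scala sketch (my own example, not from the original post) that uses a plain Java class, java.util.ArrayList, directly from Scala:

import java.util.ArrayList

object JavaInteropDemo {
  def main(args: Array[String]): Unit = {
    // java.util.ArrayList is a plain Java class, used here directly from Scala
    val names = new ArrayList[String]()
    names.add("spark")
    names.add("scala")
    println(names.size())   // prints 2
  }
}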

Note : Above context is for Spark 1.x


From Spark 2 onwards, the unified engine for large-scale data analytics approach came into the picture :

  • Performance is roughly the same no matter which programming language we use
  • The APIs used in Scala, Java, and Python are almost the same (roughly ~90% overlap)
  • The advantage of this approach: say we learnt Spark in Python, then switching to Scala only needs the basics of Scala


Note : Spark 3 further fixed the minor issues that remained from Spark 2.


The code snippets below confirm that the code is almost identical across programming languages like Python, Scala and Java. So let's remove the myth that you need in-depth programming knowledge to learn Spark. We need to be strong in Spark; programming basics are good enough.


Python code : Create a DataFrame by reading data from logs.json

df = spark.read.json("logs.json")

df.where("age > 21").select("name.first").show()


Scala code : Create a DataFrame (stored in an immutable val) by reading data from logs.json

val df = spark.read.json("logs.json")

df.where("age > 21").select("name.first").show()


Java code : Create a Dataset<Row> by reading data from logs.json

Dataset<Row> df = spark.read().json("logs.json");

df.where("age > 21").select("name.first").show();



More information about Spark :

  • The heart of Spark is Spark Core, and the 4 main libraries below are built on top of it 
    • Spark SQL
    • Spark Streaming
    • Spark MLlib
    • Spark GraphX
  • These 4 libraries are built on top of SparkContext & RDD
  • SparkContext & RDD are the primary concepts of Spark
  • SparkContext 
    • Entry point for any operations (filter, groupBy, min, max, etc.)
    • Using the SparkContext we create an RDD
    • On top of the RDD, we can run the above operations (see the small sketch after this list)
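
To make the SparkContext and RDD flow concrete, here is a minimal Scala sketch (my own illustration, assuming a local Spark setup; the application name and sample numbers are made up):

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // SparkContext is the entry point; local[*] runs Spark on all local cores
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Using the SparkContext we create an RDD, here from an in-memory collection
    val ages = sc.parallelize(Seq(18, 25, 32, 21, 40))

    // On top of the RDD we can run operations like filter, min and max
    println(ages.filter(_ > 21).count())   // 3
    println(ages.min())                    // 18
    println(ages.max())                    // 40

    sc.stop()
  }
}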


Conclusion :

Basically, Spark is all about processing large amounts of data (Big Data). Going forward, I will be discussing more about Spark: how to process large amounts of data, and how to do the same in the Cloud (AWS, Azure, etc.). Let's see more about Spark in the coming blogs.

Have a great day!






