
Spark : Spark context and RDD

Spark context is the entry point for any Spark operation. Whether we run Spark on a single node or across multiple nodes, all the required details about the cluster are available through an object called the Spark context.

Once we have a Spark context, we can create an RDD from it and perform whatever operations we need, such as min(), max(), groupBy, filter, etc. There are two important kinds of operations in Spark: Transformations and Actions.

Learning Spark is essentially learning how to transform data and perform actions on it.


Below is the flow:

SPARK CONTEXT --> RDD --> (Transformations, Actions)
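
As a rough illustration of this flow, here is a minimal spark-shell style sketch (it assumes an interactive session where the SparkContext is already available as sc; the numbers and variable names are only examples):

// SparkContext --> RDD
val numbers = sc.parallelize(List(5, 3, 8, 1, 9))

// Transformations: each one produces a new RDD
val doubled  = numbers.map(_ * 2)
val filtered = doubled.filter(_ > 5)

// Actions: each one returns a result to the driver
val smallest = filtered.min()
val total    = filtered.sum()

println(s"min = $smallest, sum = $total")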


RDD (Resilient Distributed Dataset): Resilient means fault tolerant (it can recover quickly from failures), Distributed means the data is spread across multiple machines (not on a single server), and a Dataset is a collection of data. An RDD supports two kinds of operations:

  • Transformation
  • Action


A simple use case to understand the difference between a Transformation and an Action:

Example 1: f(x) = x + 1
If we apply this function to multiple inputs, we get multiple outputs (one output per input). This is called a Transformation.


Example 2: f([x1, x2, x3, ..., xn]) = min([x1, ..., xn]) OR max([x1, ..., xn]) OR sum([x1, ..., xn])
Even if we give multiple inputs to this function, we get only one output. This is called an Action.
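
To connect this back to Spark, here is a small sketch of the same idea in spark-shell (assuming sc is already available; the input list is just an example):

val xs = sc.parallelize(List(1, 2, 3, 4))

// Transformation: f(x) = x + 1 applied to every element.
// Multiple inputs --> multiple outputs, and the result is another RDD.
val plusOne = xs.map(x => x + 1)   // conceptually (2, 3, 4, 5)

// Actions: many inputs --> a single output value.
val minimum = plusOne.min()        // 2
val maximum = plusOne.max()        // 5
val total   = plusOne.sum()        // 14.0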



Features of an RDD:
  • Immutable (can't be modified)
    • If we need to transform an RDD, we create another RDD from it
      • Example: RDD1 --> RDD2
    • This is called a Transformation
    • Spark is meant for analytics, not transactions (OLAP, not OLTP)
    • If RDDs were not immutable, distributed parallel processing could produce inconsistent results
  • Cacheable (a caching sketch follows after this list)
    • Caching persists an RDD so it can be reused by future operations
    • The cache can live at the disk level, the memory level, or both
    • This caching technique is one of the main reasons Spark can be up to ~100x faster than disk-based MapReduce for iterative workloads
    • Cached results are reused for further computations instead of being recomputed
      • Example: a search engine can cache results so that later users issuing the same query are served the stored results
    • You can cache the entire data on hard disk
    • You can cache the entire data in memory
    • You can cache data in both memory and hard disk
      • Spark will try to keep as much data as possible in memory and spill the rest to hard disk
  • Lazy evaluation (see the sketch after this list)
    • RDD evaluation follows a bottom-to-top approach: Spark builds a lineage of transformations and works backwards from the action that needs a result
    • Transformations are not executed immediately; Spark first builds a plan of what to execute and runs it only when an action is called
    • Main constraint is
      • if the information is static, lazy evaluation works well
      • if the information is dynamic (changing while the job runs), this approach is not suitable
  • Distributed, partitioned, and replicated
  • Type inference (the element type of an RDD is inferred from the data)
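
To make the caching and lazy evaluation points above more concrete, here is a spark-shell style sketch (the file name logs.txt is hypothetical, and sc is assumed to be the usual SparkContext):

import org.apache.spark.storage.StorageLevel

// Nothing is read or filtered yet: these are just transformations (a plan / lineage).
val lines  = sc.textFile("logs.txt")
val errors = lines.filter(_.contains("ERROR"))

// Choose where the cached data should live:
errors.persist(StorageLevel.MEMORY_ONLY)          // memory only
// errors.persist(StorageLevel.DISK_ONLY)         // disk only
// errors.persist(StorageLevel.MEMORY_AND_DISK)   // memory first, spill the rest to disk
// errors.cache()                                 // shorthand for MEMORY_ONLY

// The first action triggers the actual execution and fills the cache.
val errorCount = errors.count()

// Later actions reuse the cached RDD instead of re-reading and re-filtering the file.
val firstFew = errors.take(5)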

Let's learn more in further blogs. Have a great day!





Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111









