
Spark : Spark context and RDD

Spark context is the entry point for any Spark operation. Whether we run Spark on a single node or across multiple nodes, all the required details are available as part of an object called the Spark context.

Once we have a Spark context, we can create an RDD from it and proceed with whatever operations we want to perform, such as min(), max(), groupBy(), filter(), etc. There are two important kinds of operations in Spark: one is a Transformation and the other is an Action.

Learning Spark is essentially learning how to transform data and how to perform actions on it.


Below is the flow :

SPARK CONTEXT --> RDD --> (Transformations, Actions)


RDD (Resilient Distributed Dataset) : Resilient means fault-tolerant (it can recover quickly from failures), Distributed means the data is spread across multiple servers/machines rather than kept on a single one, and a Dataset is a collection of data.

  • Transformation 
  • Action


A simple use case to understand the difference between a Transformation and an Action :

Example 1 : f(x) = x + 1
If we give multiple inputs to the above function, we get multiple outputs. This is called a Transformation.


Example 2 : f([x1, x2, x3, ..., xn]) = min(xi) OR max(xi) OR sum(xi)
Even if we give multiple inputs to the above function, we get only one output. This is called an Action.
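The two examples above can be sketched in plain Python (an analogy only, not the Spark API): an element-wise function behaves like a Transformation, while min/max/sum behave like Actions.

```python
# Transformation analogy: f(x) = x + 1 applied element-wise.
# Many inputs in, many outputs out -- like map() on an RDD.
inputs = [1, 2, 3, 4, 5]
transformed = [x + 1 for x in inputs]
print(transformed)  # [2, 3, 4, 5, 6]

# Action analogy: many inputs reduced to a single output --
# like min(), max(), or sum() on an RDD.
print(min(transformed), max(transformed), sum(transformed))  # 2 6 20
```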



Key properties of an RDD :
  • Immutable (can't be modified)
    • If we have to transform an RDD, we create another RDD from it
      • Example : RDD1 --> RDD2
    • This is called a Transformation
    • Spark is meant for analytics, not transactions (OLAP, not OLTP)
    • If RDDs were not immutable, distributed parallel processing could produce inconsistent results
  • Cacheable
    • It persists data for future operations
    • Caching can be done at either the disk level or the memory level
    • Spark can be up to ~100x faster than disk-based processing, largely because of this caching technique
    • It reuses existing results for further computations
      • Example : Google caches search results so they can be shown to the next users who issue the same search string
    • You can cache the entire data on hard disk
    • You can cache the entire data in memory
    • You can cache data in both hard disk and memory
      • Spark tries to keep as much data as possible in memory and spills the rest to hard disk
  • Lazy evaluation
    • RDDs follow a bottom-to-top approach
    • Spark does not execute everything immediately; before execution, it comes up with a plan of what to execute
    • The main constraint is
      • if the information is static, we can go for lazy evaluation
      • if the information is dynamic, we can't go for this approach
  • Distributed (partitioning, replication)
  • Type inference
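The immutability property above can be illustrated with a plain-Python analogy (not the Spark API): a transformation never changes the input collection, it produces a new one, just like RDD1 --> RDD2.

```python
# Tuples are immutable, like RDDs: a "transformation" must
# produce a new collection instead of modifying the old one.
rdd1 = (1, 2, 3)
rdd2 = tuple(x * 2 for x in rdd1)  # RDD1 --> RDD2

print(rdd1)  # (1, 2, 3) -- the original is unchanged
print(rdd2)  # (2, 4, 6) -- the transformed copy
```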
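The caching idea can also be sketched in plain Python (an analogy, not Spark's persist/cache API), using `functools.lru_cache`: once a result is computed, it is kept and reused instead of being recomputed, much like the Google search example above. The function name `expensive_lookup` is hypothetical.

```python
from functools import lru_cache

calls = []  # track how many times we actually compute

@lru_cache(maxsize=None)
def expensive_lookup(query):
    calls.append(query)       # runs only on a cache miss
    return query.upper()      # stand-in for an expensive job

expensive_lookup("spark rdd")  # computed
expensive_lookup("spark rdd")  # served from cache, no recomputation
print(len(calls))  # 1
```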
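Lazy evaluation can likewise be sketched with a plain-Python generator (an analogy, not the Spark API): like chained RDD transformations, a generator only describes the work, and nothing runs until a terminal operation (the "action") pulls the results.

```python
data = range(1, 6)

# Nothing is computed yet -- this only builds a plan,
# like chaining map()/filter() on an RDD.
plan = (x + 1 for x in data)

# The "action" (here, sum) triggers the actual evaluation.
result = sum(plan)
print(result)  # 20
```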

We will learn more in further blogs. Have a great day!





Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com









