The Spark context is the entry point for any Spark operation. Whether we run Spark on a single node or across multiple nodes, all the required details are available through an object called the Spark context.
Once we have a Spark context, we can create an RDD from it and then perform whatever operations we need, such as min(), max(), groupBy, filter, etc. We can do two important kinds of things in Spark: one is a Transformation and the other is an Action.
Learning Spark is nothing but learning how to transform data and perform actions on it.
Below is the flow:
SPARK CONTEXT --> RDD -->(Transformations, Actions)
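The flow above can be sketched in plain Python. This is only a toy model and not Spark's API: `ToyContext` and `ToyRDD` are hypothetical names made up for illustration, and everything runs in a single process. The method names (parallelize, filter, map, min, max, collect) are chosen to mirror the ones found on a real Spark context and RDD.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: an immutable collection of records."""
    records: Tuple[int, ...]

    # Transformations: each returns a NEW ToyRDD and leaves this one untouched.
    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        return ToyRDD(tuple(fn(x) for x in self.records))

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(tuple(x for x in self.records if pred(x)))

    # Actions: each returns a plain value to the caller.
    def min(self) -> int:
        return min(self.records)

    def max(self) -> int:
        return max(self.records)

    def collect(self) -> List[int]:
        return list(self.records)


class ToyContext:
    """Hypothetical stand-in for the Spark context: the entry point that creates datasets."""

    def parallelize(self, data: Iterable[int]) -> ToyRDD:
        return ToyRDD(tuple(data))


sc = ToyContext()                            # 1. entry point
rdd1 = sc.parallelize([5, 3, 8, 1, 9, 2])    # 2. dataset created from the context
rdd2 = rdd1.filter(lambda x: x > 2)          # 3. transformation: RDD1 --> RDD2
rdd3 = rdd2.map(lambda x: x * 10)            # 3. transformation: RDD2 --> RDD3
smallest, largest = rdd3.min(), rdd3.max()   # 4. actions return values
```

Note that `rdd1` still holds the original six numbers after the transformations: each step produced a new dataset instead of modifying the old one.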
RDD (Resilient Distributed Dataset): Resilient means fault tolerant (it can recover quickly from failures), Distributed means the data is spread across multiple machines (not held on a single server), and Dataset means a collection of data.
- Transformation
- Action
- Immutable (can't be modified)
  - To transform an RDD, we create another RDD from it instead of changing the original
    - Example: RDD1 --> RDD2
    - This is called a Transformation
  - Spark is meant for analytics, not transactions (OLAP, not OLTP)
  - If RDDs were not immutable, distributed parallel processing could produce inconsistent results
- Cacheable
  - An RDD can be persisted for future operations
  - Persistence can be at either disk level or memory level
  - In-memory caching is a major reason Spark is often cited as up to ~100x faster than Hadoop MapReduce for some workloads
  - Spark reuses existing results for further computations
  - Analogy: a search engine can retain the results for a search string and show them to the next users who enter the same string
  - You can cache the entire data on hard disk
  - You can cache the entire data in memory
  - You can cache the data in both hard disk + memory
    - Spark will try to keep as much data as possible in memory and put the rest on hard disk
- Lazy evaluation
  - RDDs follow a bottom-to-top approach: evaluation starts from the action at the end and works back through the transformations it depends on
  - Spark won't execute everything immediately; before execution it comes up with a plan of what to execute
  - Main constraint:
    - if the information is static, we can go for lazy evaluation
    - if the information is dynamic, we can't go for this approach
- Distributed, partitioned, replicated
- Type inference
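To make lazy evaluation and caching more concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not how Spark is implemented: `LazyDataset` and its `runs` counter are hypothetical names invented for this example. Transformations only record a plan, the plan is executed when an action is called, and a cached result is reused by later actions.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class LazyDataset:
    """Hypothetical toy model: records a plan and runs it only when an action is called."""

    def __init__(self, source: Iterable[int], plan: Tuple = ()):
        self._source = tuple(source)   # input data, never modified
        self._plan = plan              # recorded transformations, not yet executed
        self._cache_enabled = False
        self._cached: Optional[List[int]] = None
        self.runs = 0                  # how many times the plan was actually executed

    # Transformations: extend the plan and return a new dataset; nothing is computed.
    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def cache(self) -> "LazyDataset":
        # Only marks the dataset; the result is stored when the first action runs.
        self._cache_enabled = True
        return self

    # Actions: these trigger execution of the recorded plan.
    def collect(self) -> List[int]:
        if self._cached is not None:
            return list(self._cached)          # reuse the existing result
        self.runs += 1
        data = list(self._source)
        for kind, fn in self._plan:
            if kind == "map":
                data = [fn(x) for x in data]
            else:                              # kind == "filter"
                data = [x for x in data if fn(x)]
        if self._cache_enabled:
            self._cached = data
        return list(data)

    def count(self) -> int:
        return len(self.collect())


base = LazyDataset(range(1, 11))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
runs_before_action = evens_squared.runs   # plan recorded, nothing executed yet
first = evens_squared.collect()           # first action: the plan runs once
total = evens_squared.count()             # second action: served from the cache
runs_after_actions = evens_squared.runs
```

In this sketch the plan has not run at all before the first action, and it has run exactly once after two actions, because the second action reads the cached result. The `base` dataset is never changed along the way.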
Arun Mathe
Gmail ID : arunkumar.mathe@gmail.com
Contact No : +91 9704117111