
Posts

Showing posts from March, 2025

Spark : Pair RDDs

Pair RDD operations are the operations we use day to day in real-time projects; most real-time problems can be solved with pair RDDs. In a distributed environment, a value-only approach is not enough to handle complex problems: each value should also have a key associated with it. Remember that in MapReduce, the internal calls work on records, where a record is a key-value pair. In an RDBMS, multiple columns are associated with one primary key.

Record: <Key, Value> (the values can be multiple, but they are associated with a key)

Example:

scala> val namesrdd = sc.parallelize(List("raj", "venkat", "sunil", "kalyan", "anvith", "raju", "dev", "hari"), 2)
namesrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at <console>:23

scala> val prdd1 = namesrdd.map(x => (x, x))
prdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[39] at map at <console>:2...
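The excerpt cuts off mid-example. As a minimal sketch of where pair RDDs lead (assuming a spark-shell session where sc is already in scope; the grouping key by first letter is purely illustrative), a typical key-based transformation such as reduceByKey looks like this:

// Minimal sketch, assuming spark-shell (`sc` in scope); grouping by first letter is illustrative.
val namesrdd = sc.parallelize(List("raj", "venkat", "sunil", "kalyan", "anvith", "raju", "dev", "hari"), 2)
val byLetter = namesrdd.map(name => (name.head, 1))   // pair RDD: key = first character, value = 1
val counts = byLetter.reduceByKey(_ + _)              // combine values per key across partitions
counts.collect().foreach(println)                     // e.g. (r,2) for "raj" and "raju"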

Spark : Internals of Spark Job

It is very important to understand the internals of a Spark job, such as which stages are involved when we run a Spark job. This helps us understand the job's performance, decide which particular steps are hurting it, and exclude such steps in case they are not needed. The entire job is divided into stages, and there are 2 types of dependencies. We need to understand how stages arise, why those stages matter, and what the dependencies are; all of these are interconnected. When we trigger some action, a Spark job is started.

Spark Job
- Stages
- Tasks (every stage has a set of tasks which execute in parallel)
- Dependencies
  - Narrow dependency
  - Shuffle dependency

Let's understand the dependencies (a short sketch follows this excerpt). Ideally, below are the 4 types of mappings for any actions in Spark:
- One to One (one element to another element; each element is independent)
- One to Many (this is also independent for each element)
- Many to...
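Since the excerpt is truncated, here is a minimal sketch (assuming a spark-shell session with sc in scope) of how a narrow dependency (map) and a shuffle dependency (reduceByKey) split one job into two stages; toDebugString prints the lineage so the stage boundary is visible:

// Minimal sketch, assuming spark-shell (`sc` in scope).
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)
val pairs = words.map(w => (w, 1))        // narrow dependency: one-to-one, stays in the same stage
val counts = pairs.reduceByKey(_ + _)     // shuffle dependency: forces a new stage
println(counts.toDebugString)             // lineage shows a ShuffledRDD, i.e. the stage boundary
counts.collect()                          // the action: launches the job with its 2 stages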

Spark Core : RDD operations

We have different types of transformations and actions that we can perform on top of RDDs. This blog explains most of the important functions we can use on RDDs. Please focus on the Scala code to understand what each particular function (action/transformation) does.

aggregate() : Aggregates the elements of each partition, and then the results of all the partitions, using the given combine functions and a neutral "zero value" (initial value). This function can return a different result type, U, than the type of this RDD, T. This is a very important function which we regularly use in real-time projects. If you have a clear understanding of aggregate(), then you will understand how distributed parallel processing works in distributed environments.

Syntax:

def aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

zeroValue : the initial value of the accumulated result of each partition for the seqOp, and also the ini...
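The excerpt ends before the worked example; as a minimal sketch (assuming spark-shell, with sc in scope), here is aggregate() computing a (sum, count) pair in one pass, where the result type U = (Int, Int) differs from the element type T = Int:

// Minimal sketch, assuming spark-shell (`sc` in scope).
// U = (Int, Int) accumulates (sum, count); T = Int is the element type.
val nums = sc.parallelize(1 to 10, 2)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),     // seqOp: fold one element into the partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)      // combOp: merge two partition accumulators
)
println(s"sum=$sum count=$count avg=${sum.toDouble / count}")   // sum=55 count=10 avg=5.5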