Pair RDD operations are the operations we rely on in real-world projects; most practical problems can be solved with pair RDDs. In a distributed environment, handling complex problems with a value-only approach is not enough; we also need a key associated with each value. Remember that in MapReduce the internal calls operate on records, where each record is a <Key, Value> pair. In an RDBMS, multiple columns are associated with one primary key.

Record: <Key, Value>
There can be multiple values, but each value is associated with a key.

Example:

scala> val namesrdd = sc.parallelize(List("raj", "venkat", "sunil", "kalyan", "anvith", "raju", "dev", "hari"), 2)
namesrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at <console>:23

scala> val prdd1 = namesrdd.map(x => (x, x))
prdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[39] at map at <console>:2...
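Continuing the example above, here is a minimal sketch of a few common pair RDD operations in the Spark shell. Keying the names by their first letter and by their length are assumed scenarios added for illustration, not part of the original example; the shell variable sc is assumed to be available as usual.

// Key each name by its first letter (illustrative pairing, not from the original example)
val byFirstLetter = namesrdd.map(name => (name.charAt(0), name))

// groupByKey collects all names that share the same first letter
val grouped = byFirstLetter.groupByKey()
grouped.collect().foreach { case (letter, names) => println(s"$letter -> ${names.mkString(", ")}") }

// reduceByKey aggregates per key; here we count how many names have each length
val lengthCounts = namesrdd.map(name => (name.length, 1)).reduceByKey(_ + _)
lengthCounts.collect().foreach(println)

// countByKey returns the per-key counts to the driver as a Map
val perLetterCount = byFirstLetter.countByKey()

In general, reduceByKey is preferred over groupByKey for aggregations because it combines values within each partition before the shuffle.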
It is very important to understand the internals of a Spark job: what stages are involved when we run a job, and so on. This helps us understand the performance of the job, identify which steps are impacting performance, and exclude those steps if they are not needed. The entire job is divided into stages, and there are 2 types of dependencies. We need to understand how stages are formed, why they matter, and what the dependencies are; all of these are interconnected. When we trigger an action, a Spark job is started.

Spark Job
  Stages
    Tasks (every stage has a set of tasks which execute in parallel)
Dependencies
  Narrow dependency
  Shuffle dependency

Let's understand the dependencies (a small stage/dependency sketch follows below). Broadly, there are 4 types of mappings for any operation in Spark:
  One to One (one element maps to another element; each element is independent)
  One to Many (this is also independent for each element)
  Many to...
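To make the stage and dependency idea concrete, here is a minimal Spark-shell sketch, reusing the namesrdd from the earlier example (an assumption). map is a narrow dependency (each output partition depends on a single input partition), while reduceByKey introduces a shuffle dependency and therefore a new stage; toDebugString prints the lineage, where the ShuffledRDD marks the stage boundary.

// Narrow dependency: map works within each partition, no data movement across partitions.
val lengths = namesrdd.map(name => (name.length, 1))

// Shuffle dependency: reduceByKey must bring equal keys together,
// so it introduces a shuffle and a new stage.
val counts = lengths.reduceByKey(_ + _)

// Inspect the lineage; the indentation change at the ShuffledRDD shows where the new stage begins.
println(counts.toDebugString)

// Triggering an action starts the job; with one shuffle it runs as two stages,
// each stage executing one task per partition in parallel.
counts.collect()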