Pair RDD operations are the operations we rely on in real-world projects; most practical problems can be solved with pair RDDs. In a distributed environment, handling complex problems with a value-only approach is not enough; we also need a key associated with each value. Remember that in MapReduce the internal calls operate on records, where each record is a <Key, Value> pair. In an RDBMS, multiple columns are associated with one primary key.

Record: <Key, Value>
There can be multiple values, but each value is associated with a key.

Example:

scala> val namesrdd = sc.parallelize(List("raj", "venkat", "sunil", "kalyan", "anvith", "raju", "dev", "hari"), 2)
namesrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at <console>:23

scala> val prdd1 = namesrdd.map(x => (x, x))
prdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[39] at map at <console>:2...
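Continuing the example above, here is a minimal sketch of a few common pair RDD operations in the Spark shell. Keying the names by their first letter and by their length are assumed scenarios added for illustration, not part of the original example; the shell variable sc is assumed to be available as usual.

// Key each name by its first letter (illustrative pairing, not from the original example)
val byFirstLetter = namesrdd.map(name => (name.charAt(0), name))

// groupByKey collects all names that share the same first letter
val grouped = byFirstLetter.groupByKey()
grouped.collect().foreach { case (letter, names) => println(s"$letter -> ${names.mkString(", ")}") }

// reduceByKey aggregates per key; here we count how many names have each length
val lengthCounts = namesrdd.map(name => (name.length, 1)).reduceByKey(_ + _)
lengthCounts.collect().foreach(println)

// countByKey returns the per-key counts to the driver as a Map
val perLetterCount = byFirstLetter.countByKey()

In general, reduceByKey is preferred over groupByKey for aggregations because it combines values within each partition before the shuffle.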
It is very important to understand the internals of a Spark job: what stages are involved when we run a job, and so on. This helps us understand the performance of the job, identify which steps are impacting performance, and exclude those steps if they are not needed. The entire job is divided into stages, and there are 2 types of dependencies. We need to understand how stages are formed, why they matter, and what the dependencies are; all of these are interconnected. When we trigger an action, a Spark job is started.

Spark Job
  Stages
    Tasks (every stage has a set of tasks which execute in parallel)
Dependencies
  Narrow dependency
  Shuffle dependency

Let's understand the dependencies (a small stage/dependency sketch follows below). Broadly, there are 4 types of mappings for any operation in Spark:
  One to One (one element maps to another element; each element is independent)
  One to Many (this is also independent for each element)
  Many to...
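To make the stage and dependency idea concrete, here is a minimal Spark-shell sketch, reusing the namesrdd from the earlier example (an assumption). map is a narrow dependency (each output partition depends on a single input partition), while reduceByKey introduces a shuffle dependency and therefore a new stage; toDebugString prints the lineage, where the ShuffledRDD marks the stage boundary.

// Narrow dependency: map works within each partition, no data movement across partitions.
val lengths = namesrdd.map(name => (name.length, 1))

// Shuffle dependency: reduceByKey must bring equal keys together,
// so it introduces a shuffle and a new stage.
val counts = lengths.reduceByKey(_ + _)

// Inspect the lineage; the indentation change at the ShuffledRDD shows where the new stage begins.
println(counts.toDebugString)

// Triggering an action starts the job; with one shuffle it runs as two stages,
// each stage executing one task per partition in parallel.
counts.collect()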