
Scala : Scala for Spark

Let's go through some basic concepts in Scala.


We can use any one of the below approaches to practice.

  • Terminal (Command line)
  • Notebook (Jupyter)
  • IDE (Eclipse, IntelliJ, etc.)

Below are 2 important keywords :

  • Val
    • Means Value
    • Immutable
    • Use when you need a constant (value won't be changed)
  • Var
    • Means Variable
    • Mutable
    • Use when you need a variable (value may change)
Scala syntax :
val <identifier> : <data_type> = <value / expression>

Ex : val name : String = "Arun" + " " + "Spark"    (can have a value or an expression)
 
Java syntax :
<data_type> <identifier> = <value / expression>;

Ex : String name = "Arun";


Question : In terms of performance, are Java, Python and Scala the same while working on Spark ?
Answer : Until Spark 2.x, Scala always gave better performance compared with Java & Python. But from Spark 2.x, the Spark DataFrame was introduced; you can use the DataFrame API from any of these languages and the performance is the same. Spark, Hadoop and Hive are written by experts in the industry and have the ability to perform well irrespective of the programming language.
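For context, below is a tiny sketch of what "using the DataFrame API" looks like from Scala. It assumes a spark-shell session where spark (the SparkSession) is predefined, and the CSV path is hypothetical; the same logical operations written in Python or Java are optimized by Catalyst into the same plan, which is why performance is similar.

// Assumes spark-shell (SparkSession available as `spark`); the file path is hypothetical
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("/tmp/people.csv")

// The equivalent PySpark code produces the same optimized execution plan
df.filter(df("age") > 30).select("name").show()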

Note:
Shortcut to clear screen in Scala prompt : Ctrl + l
Command to enter into Scala prompt : scala

Please check the below snippet to understand the concept of Val vs Var :

  • We are UNABLE to re-assign when we initialize using Val
  • We are ABLE to re-assign when we initialize using Var
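A minimal REPL-style sketch of this behaviour (the identifier names and values are illustrative) :

scala> val language : String = "Scala"
language: String = Scala

scala> language = "Java"          // not allowed : a val is immutable
<console>:23: error: reassignment to val

scala> var count : Int = 1
count: Int = 1

scala> count = 2                  // allowed : a var is mutable
count: Int = 2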

One more important point to understand is that Scala supports a concept called REPL :
  • REPL means Read - Evaluate - Print - Loop
  • It reads what we write, evaluates it, prints the result (or the error) and loops back for the next input
  • Below is an example, where it read one line of code, evaluated it and printed the error info
  • This helps users understand errors clearly, for each line of code

Example :
scala> val name : String = "Arun"
name: String = Arun
scala> name = "Vynateya"
<console>:23: error: reassignment to val
       name = "Vynateya"
            ^

Next concept is TYPE INFERENCE :
  • Scala automatically identifies the data type of the value assigned to a val
  • See the below example, we haven't mentioned the type String in the declaration, but Scala inferred it as String
  • A few more data types are shown in the snippet after the example
scala> val name = "Arun"
name: String = Arun
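A few more REPL-style examples of type inference (the values are illustrative) :

scala> val age = 25
age: Int = 25

scala> val pi = 3.14
pi: Double = 3.14

scala> val isActive = true
isActive: Boolean = true

scala> val grade = 'A'
grade: Char = A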



Static Typing Vs Dynamic Typing :
  • Scala and Java follow static typing
  • See the below example, we declared a variable of type String
  • Later we try to re-assign an integer value to it
  • Scala won't allow it; it expects us to re-assign a String value
  • But this works in dynamically typed languages like Python (see the Python example further below)
scala> var name = "Arun"
name: String = Arun

scala> name = 10
<console>:23: error: type mismatch;
 found   : Int(10)
 required: String
       name = 10
              ^

Python allows it, hence Python is dynamically typed :
>>> name = "Arun"
>>> print(name)
Arun
>>> name = 10
>>> print(name)
10



Working with Jupyter Notebook : 

I am interested in learning further basics in a Jupyter notebook. It is a web application that allows users to create and share interactive documents that contain code, equations and other resources.

Please find the following documentation for installing Jupyter notebook : https://jupyter.org/install

To open a Jupyter notebook, use the below command from the terminal.

orienit@orienit:~$ jupyter notebook

Scala work location : /work/scala-basics/

While creating a new notebook for practice, we have to choose the spylon-kernel to work with Scala.

Please download Scala basic examples from : https://github.com/amathe1/scala_repo



Below Scala basics are needed to work with Spark (a few of them are illustrated in the sketch after this list) :
  • Static typing (while re-assigning a value to a variable, it should be same data type)
  • Type inference (automatically identifies the data type based on the assigned value)
  • Object overloading
  • Available operations +, -, min, max etc.
  • if else, if else ladder
  • different flavors of for loop including nested for loop
  • Arrays in Scala
  • String Interpolation
  • Tuples
  • Functions
    • Anonymous functions
    • Named functions
    • Curried functions
  • Data Types
  • Collections
    • Immutable
    • Mutable
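Below is a minimal sketch (the object name, identifiers and values are illustrative, not from the original post) touching a few of the items above : string interpolation, tuples, named/anonymous/curried functions and immutable vs mutable collections.

object BasicsDemo extends App {

  // String interpolation
  val name = "Arun"
  println(s"Hello, $name")                       // Hello, Arun

  // Tuple : a fixed-size group of values, possibly of different types
  val person = ("Arun", 30)
  println(person._1 + " is " + person._2)        // Arun is 30

  // Named function
  def add(a : Int, b : Int) : Int = a + b

  // Anonymous function assigned to a val
  val square = (x : Int) => x * x

  // Curried function
  def multiply(a : Int)(b : Int) : Int = a * b

  println(add(2, 3))                             // 5
  println(square(4))                             // 16
  println(multiply(2)(5))                        // 10

  // Immutable vs mutable collections
  val immutableList = List(1, 2, 3)                              // cannot be modified in place
  val mutableBuffer = scala.collection.mutable.ListBuffer(1, 2)  // can be modified in place
  mutableBuffer += 3
  println(immutableList)                         // List(1, 2, 3)
  println(mutableBuffer)                         // ListBuffer(1, 2, 3)
}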


Java Vs Scala : How to run an application ?

  • In Java, we write main() inside a class; this is the entry point for the application to run
  • But in Scala, we can write main() only inside an Object; an Object is meant for running a Scala program

First way :
Define an Object and extend App.

object Example4 extends App {
  
}

Second way :
Define an Object and include the main(), so that we can run the Scala application.

object Example4 {

  def main(args: Array[String]): Unit = {

  }

}
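For instance, filling in either template with a simple println (the message is just illustrative) gives a runnable program :

object Example4 extends App {
  // statements in the body of an App object run when the program starts
  println("Hello from Scala")
}

Save it as Example4.scala, compile with scalac Example4.scala and run with scala Example4 from the terminal (or run it directly from the IDE).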


Example Scala code :


/*
 * A trait in Scala is like an interface in Java
 * 
 * Example1 is an Object (a singleton) that defines 3 traits : Maths, Stats, Database
 * 
 * Calculations is another trait (interface) which extends the above 3 traits
 * 
 * */


object Example1 extends App {
  
  trait Maths {
    
     def add(a :Int, b : Int) : Int
     def mul(a :Int, b : Int) : Int
  }
  
  trait Stats {
    def avg(a : Int, v : Int) : Double
  }
  
  trait Database {
    def convert(a : Int) : Double
  }
  
  trait Calculations extends Maths with Stats with Database {
    
  }
  
  // Below is an abstract class as it hasn't
  // implemented anything from trait Maths
  abstract class A extends Maths {
    
  }
  
  // Below is also an abstract class, as it partially implemented Maths
  abstract class B extends Maths {
     def add(a : Int, b : Int) : Int = { a + b}
     
  }
  
  // Below is a class(not an abstract class), as it fully implemented Maths
  class C extends Maths {
    def add(a : Int, b : Int) : Int = { println("From Example1"); a + b}
    def mul(a : Int, b : Int) : Int = { println("From Example1"); a * b}
  }
  
  // Created an object to class C and calling methods
  val c = new C
  println("c.add(1, 2)" + c.add(1, 2))
  println("c.mul(1, 2)" + c.mul(1, 2))
  
}
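Although the code above defines the combined trait Calculations, nothing implements it. Below is a minimal, self-contained sketch (the object and class names are illustrative) of a concrete class that mixes in Calculations and therefore has to implement all three parent traits :

object Example2 extends App {

  trait Maths {
    def add(a : Int, b : Int) : Int
    def mul(a : Int, b : Int) : Int
  }

  trait Stats {
    def avg(a : Int, b : Int) : Double
  }

  trait Database {
    def convert(a : Int) : Double
  }

  // Combined trait, same idea as Calculations above
  trait Calculations extends Maths with Stats with Database

  // Concrete class : it implements every abstract method inherited through Calculations
  class D extends Calculations {
    def add(a : Int, b : Int) : Int    = a + b
    def mul(a : Int, b : Int) : Int    = a * b
    def avg(a : Int, b : Int) : Double = (a + b) / 2.0
    def convert(a : Int) : Double      = a.toDouble
  }

  val d = new D
  println("d.add(3, 4) = " + d.add(3, 4))        // 7
  println("d.avg(3, 4) = " + d.avg(3, 4))        // 3.5
  println("d.convert(5) = " + d.convert(5))      // 5.0
}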



I will be adding more Scala basics to this same blog tomorrow, so please revisit it regularly to learn more of the Scala that is required for Spark.

Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111





