Python : Python for Spark

Python is a general-purpose programming language used for a variety of tasks such as web development and data analytics. It is a multi-paradigm language: it supports procedural, object-oriented, and functional styles of programming. In this post we will see the Python basics we need in order to work with Spark.

If you want to practice Spark in a Big Data environment, you can use Databricks.
  • URL : https://community.cloud.databricks.com
  • Databricks is widely used by programmers in production environments
  • Both a Community Edition (free, with limited support) and paid versions are available
  • Register online for free and practice

Indentation is very important in Python. We don't use braces as we do in Java; the scope of a block, loop, or definition is determined by the indentation of the code.

Correct Indentation :

def greet():
    print("Hello!")  # Indented correctly
    print("Welcome to Python.")  # Inside the function

greet()  # No indentation needed here

Incorrect Indentation :

def greet():
print("Hello!")  # ❌ IndentationError: expected an indented block

  • Regarding data types, Python has no separate long or double types. The common built-in types are int, float, str, complex, tuple, list, set, and dict.
# Int
a = 10
print(type(a))

# Float
f = 10.2
print(type(f))

# String (avoid naming a variable 'str', which would shadow the built-in type)
text = "Spark"
print(type(text))

# Tuple, it will allow duplicates
# Tuple is immutable, we can't modify elements
t1 = (1, 2, 3)
print(type(t1))

# List allows duplicates, it is ordered
# We can add and remove elements
l1 = [1, 2, 3]
print(type(l1))

# Set does NOT allow duplicates, and it is not ordered
s1 = {1, 2, 3}
print(type(s1))

# Dictionary is a key, value concept (Map in Java)
d1 = {1 : 'a', 2 : 'b', 3 : 'c'}
print(type(d1))

Terminal Output :

<class 'int'>
<class 'float'>
<class 'str'>
<class 'tuple'>
<class 'list'>
<class 'set'>
<class 'dict'>

String reverse using slicing : 

word = "Python"
print(word[::-1])  # Output: "nohtyP"
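
The reversal above is one case of the general slice syntax word[start:stop:step]. Here is a small sketch of a few other slices (the variable names are only for illustration):

```python
word = "Python"

first_three = word[:3]      # characters at index 0, 1 and 2
last_two = word[-2:]        # negative index counts from the end
every_other = word[::2]     # step of 2, starting at index 0
reversed_word = word[::-1]  # negative step walks the string backwards

print(first_three, last_two, every_other, reversed_word)
```

Leaving out start or stop means "from the beginning" or "to the end", which is why [::-1] covers the whole string.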



For loop using range :

for i in range(10) :
    print(i)

'''
Output

PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
0
1
2
3
4
5
6
7
8
9

'''

Iterating collections, DICT :

# Avoid naming a variable 'dict', which would shadow the built-in type
person = {"name" : "Arun", "state" : "Telangana", "city" : "Hyd" }

for key, value in person.items():
    print(key, value)

'''
Output

PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
name Arun
state Telangana
city Hyd

'''


Defining variables :

# Correct way of defining variables
_name = "Arun"
my_name = "Arun"
my_name123 = "Arun"

# Incorrect way of defining variables
# 123name = "Arun"
# my-name = "Arun"

Adding boolean values to int :

# True means 1
print(1 + True)

# False means 0
print(1 + False)

'''
Output :

PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
2
1

'''

# The real parts are added and the imaginary part is carried over as is.
print(1 + (2 + 3j))

'''
Output :

(3+3j)

'''

String Operations :

# String operations
data = "Welcome to Hyderbad. How are you doing today ?"

print("data              :", data)
print("data.split()      :", data.split())
print("data.split('.')   :", data.split('.'))
print("len(data)         :", len(data))
print("type(data)        :", type(data))

'''
Output :

PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
data              : Welcome to Hyderbad. How are you doing today ?
data.split()      : ['Welcome', 'to', 'Hyderbad.', 'How', 'are', 'you', 'doing', 'today', '?']
data.split('.')   : ['Welcome to Hyderbad', ' How are you doing today ?']
len(data)         : 46
type(data)        : <class 'str'>

'''

x = "   abc   "
print(x.replace("", "*"))

'''
Output :

* * * *a*b*c* * * *

'''

x = "   abc   "
print(x.strip())
print(x.lstrip().replace(" ", "*"))
print(x.rstrip().replace(" ", "*"))

'''
Output :

abc
abc***
***abc
'''


List Functionality :
  • List allows duplicates
  • List is ordered and index based
  • Mixed data types are allowed
  • Mutable
Set Functionality :
  • Set doesn't allow duplicates
  • Set is not ordered and not index based
  • Mixed data types are allowed
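  • Mutable (elements can be added and removed, but each element must be hashable; frozenset is the immutable variant)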
Tuples Functionality :
  • Tuples allow duplicates
  • Tuple is ordered and index based
  • Mixed data type allowed
  • Immutable
Dictionary Functionality :
  • Dictionary keys are unique
  • Mixed data types allowed
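
A short sketch that checks these properties in the interpreter. Note that a set itself can be modified with add() and remove(); only frozenset is immutable. The variable names and values are made up for illustration.

```python
# List: ordered, index based, allows duplicates, mutable
items = [1, 2, 2, "spark"]
items.append(3.5)
second_item = items[1]

# Set: duplicates are dropped, no indexing; add() modifies it in place
unique = {1, 2, 2, 3}
unique.add(4)

# Tuple: ordered and index based, but it cannot be modified
point = (1, 2, 2)
try:
    point[0] = 10
except TypeError as err:
    tuple_error = str(err)

# Dictionary: assigning to an existing key overwrites the old value
ages = {"a": 1, "b": 2}
ages["a"] = 100

print(items, unique, tuple_error, ages)
```
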
list1 = [1, 2, 3, 4]

for x in list1:
    print(x)

'''
Output :

1
2
3
4

'''

list1 = [1, 2, 3, 4]

l2 = [x for x in list1]
print(l2)

'''
Output :

[1, 2, 3, 4]

'''

list1 = [1, 2, 3, 4]

list2 = [x*x for x in list1]
print(list2)

'''
Output :

[1, 4, 9, 16]

'''

list1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

list2 = [x*x for x in list1 if(x % 2) == 0]
print(list2)

'''
Output :

[4, 16, 36, 64, 100]

'''


# Declare a dictionary
dict1 = {"x" : 1, "y" : 2, "z" : 3}

print("dict1.keys() :", dict1.keys() )
print("dict1.values() :", dict1.values() )
print("dict1.items() :", dict1.items() )

'''
Output :

dict1.keys() : dict_keys(['x', 'y', 'z'])
dict1.values() : dict_values([1, 2, 3])
dict1.items() : dict_items([('x', 1), ('y', 2), ('z', 3)])

'''

print(range(5))
print(list(range(5)))
print(list(range(1, 5)))
print(list(range(1, 5, 1)))
print(list(range(1, 5, 2)))

'''
Output :

range(0, 5)
[0, 1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 3]

'''

x = "abcdef"

if ('a' in x):
    print("A is available in given string")

'''
Output :
A is available in given string

'''

x = "abcdef"

if ('p' in x):
    print("P is available in given string")
else:
    print("P is not available in given string")

'''
Output :
P is not available in given string

'''

x = "abcdef"

if ('p' in x):
    print("P is available in given string")
elif('q' in x):
    print("q is available in given string")
elif('r' in x):
    print("r is available in given string")
else:
    print("No luck")

'''
Output :
No luck

'''

Break :
for i in range(5):
    print("Starting the loop : " +str(i))
    stop = input("Do you want to stop the loop (y/n) ? ")
    if stop == "y":
        break
    print("Ending the loop : " +str(i))

print("PROGRAM FINISHED! ")

'''
Output : Once break executes, the loop exits immediately and the rest of the loop body is skipped

Starting the loop : 0
Do you want to stop the loop (y/n) ? y
PROGRAM FINISHED!

'''

Continue :
"""
The continue keyword skips the rest of the loop body for the current iteration;
unlike break, it does not exit the loop

"""

for i in range(5):
    print("Starting the loop : " +str(i))
    stop = input("Do you want to stop the loop (y/n) ? ")
    if stop == "y":
        continue
    print("Ending the loop : " +str(i))
   

print("PROGRAM FINISHED! ")

'''
Output :

Starting the loop : 0
Do you want to stop the loop (y/n) ? n
Ending the loop : 0
Starting the loop : 1
Do you want to stop the loop (y/n) ? y
Starting the loop : 2
Do you want to stop the loop (y/n) ? y
Starting the loop : 3
Do you want to stop the loop (y/n) ? y
Starting the loop : 4
Do you want to stop the loop (y/n) ? y
PROGRAM FINISHED!


'''
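
The two examples above wait for keyboard input. As a non-interactive sketch of the same idea, the loop below uses both keywords together (the numbers are chosen only for illustration):

```python
visited = []

for i in range(10):
    if i % 2 == 1:
        continue      # skip odd numbers and move on to the next iteration
    if i > 6:
        break         # leave the loop entirely once i exceeds 6
    visited.append(i)

print(visited)
```

Only the even numbers up to 6 are collected: continue skips the odd values, and break stops the loop when i reaches 8.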

Pass keyword :
# Use the "pass" keyword as a placeholder where Python's syntax requires a statement
# Later we can replace "pass" with actual code

def addition():
    pass

def main():
    pass


Functions :
def add(a, b):
    return a + b

print("add(1,2)  : ", add(1, 2))

'''
Output :

add(1,2)  :  3

'''

# Anonymous (lambda) functions

print((lambda a, b : a + b)(2, 3))  # prints 5

print((lambda a, b : a * b)(2, 3))  # prints 6
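
Lambdas are most useful when passed to another function. The sketch below uses Python's built-in map, filter and sorted; Spark's transformations accept small functions in a similar way. The sample data is made up for illustration.

```python
numbers = [1, 2, 3, 4, 5]

# map applies the lambda to every element
squares = list(map(lambda x: x * x, numbers))

# filter keeps the elements for which the lambda returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))

# sorted uses the lambda to pick the sort key
pairs = [("spark", 3), ("python", 1), ("numpy", 2)]
by_count = sorted(pairs, key=lambda pair: pair[1])

print(squares, evens, by_count)
```
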

Factorial of a given number using recursion (an iterative version is kept first, inside a string literal, for comparison) :
"""

def factorial_solution(value):
    result = 1
    for i in range(2, value+1):
        result = result * i
    return result

def main():
    result = factorial_solution(10)
    print("Factorial of given value is", result)

main()

"""


# factorial of a number (defined only for non-negative integers)
# n! = n * (n - 1) * (n - 2) * .... 3 * 2 *1

# Ex : 5! = 5 * 4 * 3 * 2 * 1
# 5! = 120

# Factorial of zero is 1.
# 0! = 1

def recursive_factorial(n):
    if n <= 1:  # 0! and 1! are both 1; this also ends the recursion
        return 1
    else:
        return n * recursive_factorial(n-1)


def main():
    number = input("Please enter the number : ")
    number = int(number)

    if number < 0:
        print("Please enter a number greater than '0'")
    elif number == 0:
        print("factorial of zero is 1")
    else:
        print(f"Factorial of given number {number} is : ", recursive_factorial(number) )

main()


'''
Output :

Please enter the number : 5
Factorial of given number 5 is :  120

'''


Apart from the above, we need some knowledge of additional Python libraries such as NumPy and Pandas. What Java calls libraries are usually referred to as modules or packages in Python.


Numpy : 

NumPy stands for Numerical Python. Numerical operations on large data sets are slow in pure Python, and NumPy was created to make them faster.

  • In practice, programmers prefer NumPy arrays to lists for numerical work because they perform better
  • NumPy is faster at accessing and processing values
  • NumPy can handle large amounts of data
  • NumPy is especially important for Data Scientists
How to install NumPy & Pandas ? 
  • pip3 install numpy
  • pip3 install pandas
I already have both the NumPy and Pandas libraries installed locally, so pip reports them as already satisfied.

PS D:\GitHub\Python\python_practise> pip install numpy
Requirement already satisfied: numpy in c:\users\arunk\anaconda3\lib\site-packages (1.26.4)

PS D:\GitHub\Python\python_practise> pip install pandas
Requirement already satisfied: pandas in c:\users\arunk\anaconda3\lib\site-packages (2.2.2)

We need to import these packages in our Python code to use the libraries. The aliases np and pd are the usual convention :
import numpy as np
import pandas as pd

If the import statements above run without errors, the installation was successful and we are ready to work with these libraries. 
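
To illustrate why NumPy arrays are preferred over lists for numerical work, here is a minimal sketch comparing the two. It does not measure speed; it only shows that an array operation applies to every element without an explicit loop. The prices list is invented for illustration.

```python
import numpy as np

prices = [10, 20, 30, 40]

# Plain list: element-wise work needs an explicit loop or comprehension
doubled_list = [p * 2 for p in prices]

# NumPy array: the operation applies to every element at once
arr = np.array(prices)
doubled_arr = arr * 2
above_15 = arr[arr > 15]   # a boolean mask keeps only the matching elements

print(doubled_list, doubled_arr, above_15)
```
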

Creating an array in NumPy :
"""

NumPy is a package that allows us to manipulate arrays of data.
The data is usually numerical, but we can also put strings in it.

We can easily manipulate numbers mathematically in an array.

It is like an array class, but a turbo-charged one (more power!)

Note : Please use 'pip3 list' to see if numpy is already installed on your machine.
       If not, use 'pip3 install numpy' to install this Python package.


"""

import numpy as np



def main():
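    # Creating a one-dimensional array using 'np'
    # dtype=int maps to the platform's default integer type (int32 in the output below; int64 on many other setups)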
    num1 = np.array([1, 2, 3, 4], dtype=int)

    print(num1)

    # How to check the data type of the array elements ?
    print("Type of the array is : ",num1.dtype)
   
    # How to check dimension of an array ?
    print("Dimension of the array is : ",num1.ndim)
   
    # What is the shape of the array ?
    print("Shape of the array is : ",num1.shape)
   
    # How many bytes does this array have ?
    print("Bytes : ", num1.nbytes)

    # Creating a two-dimensional array using 'np'
    num2 = np.array([[1, 2], [3, 4], [5, 6]], dtype=int)
   
    print(num2)

    print("Type of the array is : ", num2.dtype)

    print("Dimension of the array is : ", num2.ndim)

    print("Shape of the array is : ", num2.shape)


if __name__ == "__main__":
    main()
   
'''
Output :

[1 2 3 4]
Type of the array is :  int32
Dimension of the array is :  1
Shape of the array is :  (4,)
Bytes :  16
[[1 2]
 [3 4]
 [5 6]]
Type of the array is :  int32
Dimension of the array is :  2
Shape of the array is :  (3, 2)

'''

Zero dimensional Array:
import numpy as np

data = np.array(10)
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)

'''
Output :

Data in numpy array is :  10
Dimension of this numpy array is :  0

'''

One dimensional Array:
import numpy as np

data = np.array([10, 20, 30, 40, 50])
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)

'''
Output :

Data in numpy array is :  [10 20 30 40 50]
Dimension of this numpy array is :  1

'''

Two dimensional Array:
import numpy as np

data = np.array([[10, 20, 30, 40, 50], [60, 70, 80, 90, 100]], dtype=int)
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)

'''
Output :

Data in numpy array is :  [[ 10  20  30  40  50]
 [ 60  70  80  90 100]]
Dimension of this numpy array is :  2

'''
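
Once an array has more than one dimension, elements are accessed with one index per dimension. A small sketch using the same two-dimensional data (the variable names are only for illustration):

```python
import numpy as np

data = np.array([[10, 20, 30, 40, 50], [60, 70, 80, 90, 100]])

element = data[1, 2]        # row 1, column 2
first_row = data[0]         # the whole first row
third_column = data[:, 2]   # column 2 taken from every row

print(element, first_row, third_column)
```
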

Some mathematical operations : (min, max, sum, avg, mean, variance etc.)
import numpy as np

data = np.array([10, 20, 30, 40, 50], dtype=int)
print("Data in numpy array is : ",data)

print("Dimension of this numpy array is : ", data.ndim)
print("Minimum number in this One dimensional array is : ", data.min())
print("Maximum number in this One dimensional array is : ", data.max())
print("Sum of all elements in this One dimensional array is : ", data.sum())

'''
Output :

Data in numpy array is :  [10 20 30 40 50]
Dimension of this numpy array is :  1
Minimum number in this One dimensional array is :  10
Maximum number in this One dimensional array is :  50
Sum of all elements in this One dimensional array is :  150

'''
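
The example above covers min, max and sum. As a short sketch, the mean, variance and standard deviation can be computed on the same array in the same way:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50], dtype=int)

mean_value = data.mean()   # average of all elements
variance = data.var()      # population variance (ddof=0 by default)
std_dev = data.std()       # square root of the variance

print(mean_value, variance, std_dev)
```

Pass ddof=1 to var() or std() if the sample variance is needed instead of the population variance.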

Re-shaping an existing Array :
  • Reshaping is very helpful when working with large multi-dimensional arrays
  • For example, we may need to flatten an n-dimensional array into a one-dimensional array, partition it across a distributed system, apply operations at the partition level, and then reshape the result back to its original shape. This is just one use case; there are many others in Data Science.
  • This is background information to help us understand why reshaping is used.
import numpy as np

data_set1 = np.array([10, 20, 30, 40, 50, 60], dtype=int)
print(data_set1)
print("Dimension before reshaping : ", data_set1.ndim)
print()

# While reshaping, considering the number of elements in initial data set is very important
# we have six elements and hence we can reshape to (3 * 2) or (2 * 3) or (1 * 6) or (6 * 1)
data_set2 = data_set1.reshape(2, 3)
print(data_set2)
print("Dimension after reshaping : ", data_set2.ndim)

data_set3 = data_set1.reshape(3, 2)
print(data_set3)
print("Dimension after reshaping : ", data_set3.ndim)


'''
Output :

[10 20 30 40 50 60]
Dimension before reshaping :  1

[[10 20 30]
 [40 50 60]]
Dimension after reshaping :  2
[[10 20]
 [30 40]
 [50 60]]
Dimension after reshaping :  2

'''
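
The flatten, process, and reshape-back use case can be sketched on a single machine as follows. Multiplying by 2 is only a stand-in for whatever processing would happen on each partition; no distributed system is involved here.

```python
import numpy as np

matrix = np.array([[10, 20, 30], [40, 50, 60]])

flat = matrix.reshape(-1)                # -1 lets NumPy infer the length (6)
scaled = flat * 2                        # stand-in for the per-element processing
restored = scaled.reshape(matrix.shape)  # back to the original (2, 3) shape

print(flat)
print(restored)
```
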

Incorrect reshaping :
import numpy as np

data_set1 = np.array([10, 20, 30, 40, 50, 60], dtype=int)
print(data_set1)
print("Dimension before reshaping : ", data_set1.ndim)
print()

data_set2 = data_set1.reshape(3, 3)


'''
Output :

[10 20 30 40 50 60]
Dimension before reshaping :  1

Traceback (most recent call last):
  File "d:\GitHub\Python\python_practise\numpy\np_practise.py", line 8, in <module>
    data_set2 = data_set1.reshape(3, 3)
                ^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot reshape array of size 6 into shape (3,3)

'''



Pandas :

Pandas is a powerful open source Python library used for data analysis and manipulation. It provides data structures and functions for performing efficient operations on data, and it is well suited to tabular data such as spreadsheets and SQL tables.

It is built on top of the NumPy library, which means many NumPy structures are used in Pandas. The data produced by Pandas is often used as input for plotting functions in Matplotlib and for machine learning algorithms.

Here is a list of things that we can do using Pandas.
  • Data set cleaning, merging, and joining.
  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
  • Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
  • Powerful group by functionality for performing split-apply-combine operations on data sets.
  • Data Visualization.
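
A few of these features can be sketched in a handful of lines. The example builds a small table with pd.DataFrame; the Marks column and its values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the Marks column is invented for illustration
df = pd.DataFrame({
    "Name": ["Arun", "Anu", "Vynateya"],
    "Location": ["Hyd", "Ban", "Ban"],
    "Marks": [80.0, np.nan, 90.0],
})

# Missing data: replace NaN with the mean of the column
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())

# Insert a derived column, then delete it again
df["Passed"] = df["Marks"] >= 50
df = df.drop(columns=["Passed"])

# Group by: average marks per location (split-apply-combine)
avg_by_location = df.groupby("Location")["Marks"].mean()
print(avg_by_location)
```
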

Creating a simple data frame using Pandas :
import pandas as pd

students = {
    "Name" : ["Arun", "Anu", "Vynateya"],
    "ID" : ["1", "2", "3"],
    "Location" : ["Hyd", "Ban", "Ban"]
}

df = pd.DataFrame(students)

print(df)

'''
Output :

       Name ID Location
0      Arun  1      Hyd
1       Anu  2      Ban
2  Vynateya  3      Ban

'''

Manipulating Index :
import pandas as pd

students = {
    "Name" : ["Arun", "Anu", "Vynateya"],
    "ID" : ["1", "2", "3"],
    "Location" : ["Hyd", "Ban", "Ban"]
}

df = pd.DataFrame(students, index=["row1", "row2", "row3"])

print(df)

'''
Output :

          Name ID Location
row1      Arun  1      Hyd
row2       Anu  2      Ban
row3  Vynateya  3      Ban

'''
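
With custom index labels in place, rows can be selected by label or by position. A minimal sketch using the same students data (the result variable names are only for illustration):

```python
import pandas as pd

students = {
    "Name": ["Arun", "Anu", "Vynateya"],
    "ID": ["1", "2", "3"],
    "Location": ["Hyd", "Ban", "Ban"],
}
df = pd.DataFrame(students, index=["row1", "row2", "row3"])

by_label = df.loc["row2"]            # select a row by its index label
by_position = df.iloc[0]             # select a row by its integer position
name_cell = df.loc["row3", "Name"]   # single value: row label plus column name

print(by_label["Name"], by_position["Name"], name_cell)
```
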


Important points :
  • Thus, we can convert many kinds of data sets into tables and perform the required operations on them. This is why we use Pandas.
  • Spark SQL has the same concept, the DataFrame, which is inspired by Python Pandas
  • From Spark 2.x, we can convert a Spark DataFrame into a Pandas DataFrame and vice versa
  • We will see more about DataFrames when we cover Spark
  • PySpark uses a library called Py4J, which lets Python code call Java objects running in the JVM

Please use Databricks Community Edition for practicing both Python and Scala. 


Thanks,
Arun Mathe
Email ID : arunkumar.mathe@gmail.com
Contact ID : 9704117111
