Python : Python for Spark

Python is a general-purpose programming language used for a variety of tasks such as web development and data analytics. It is a multi-paradigm language that supports procedural, object-oriented, and functional programming styles. We will see which Python basics we need in order to work with Spark.

If you want to practice Spark in a Big Data environment, you can use Databricks.

URL : https://community.cloud.databricks.com
This is one of the main tools programmers use in real production environments
Both a Community Edition (free version with limited support) and paid versions are available
Register for the above tool online for free and practice

Indentation is very important in Python. We don't use braces in Python like we do in Java, and the scope of the block/loop/definition is interpreted based on the indentation of code.

Correct Indentation :

def greet():
    print("Hello!") # Indented correctly
    print("Welcome to Python.") # Inside the function

greet() # No indentation needed here

Incorrect Indentation :

def greet():

print("Hello!") # ❌ IndentationError: expected an indented block

Regarding data types, Python has no separate long or double types; it has int, float, complex, str, tuple, list, set, dict, etc.

# Int
a = 10
print(type(a))

# Float
f = 10.2
print(type(f))

# String (avoid naming a variable 'str'; that would shadow the built-in type)
s = "Spark"
print(type(s))

# Tuple allows duplicates and is ordered
# Tuple is immutable, so we can't modify its elements
t1 = (1, 2, 3)
print(type(t1))

# List allows duplicates and is ordered
# We can add and remove elements
l1 = [1, 2, 3]
print(type(l1))

# Set does NOT allow duplicates and is not ordered
s1 = {1, 2, 3}
print(type(s1))

# Dictionary stores key-value pairs (like a Map in Java)
d1 = {1 : 'a', 2 : 'b', 3 : 'c'}
print(type(d1))

String reverse using slicing :

word = "Python"
print(word[::-1])  # Output: "nohtyP"
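
The slice above uses the general form sequence[start:stop:step]. As a minimal sketch, here are a few other common slices on the same word (the variable names are chosen only for illustration):

```python
word = "Python"

first_three = word[:3]      # characters at index 0, 1 and 2
last_three = word[-3:]      # the last three characters
every_second = word[::2]    # every second character, starting at index 0
reversed_word = word[::-1]  # a step of -1 walks the string backwards

print(first_three)    # Pyt
print(last_three)     # hon
print(every_second)   # Pto
print(reversed_word)  # nohtyP
```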

For loop using range :

for i in range(10) :
    print(i)

'''
Output

PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
0
1
2
3
4
5
6
7
8
9

'''

Iterating collections, DICT :

# Avoid naming the variable 'dict'; that would shadow the built-in type
person = {"name" : "Arun", "state" : "Telangana", "city" : "Hyd" }

for key, value in person.items():
    print(key, value)

'''
Output

PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
name Arun
state Telangana
city Hyd

'''

Defining variables :

# Correct way of defining variables
_name = "Arun"
my_name = "Arun"
my_name123 = "Arun"

# Incorrect way of defining variables
# 123name = "Arun"
# my-name = "Arun"

Adding boolean values to int :

# True means 1
print(1 + True)

# False means 0
print(1 + False)

'''
Output :

PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
2
1

'''

# The real parts are added; the imaginary part stays as is.
print(1 + (2 + 3j))

'''
Output :

(3+3j)

'''

String Operations :

# String operations
data = "Welcome to Hyderbad. How are you doing today ?"

print("data              :", data)
print("data.split()      :", data.split())
print("data.split('.')   :", data.split('.'))
print("len(data)         :", len(data))
print("type(data)        :", type(data))

'''
Output :

PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
data              : Welcome to Hyderbad. How are you doing today ?
data.split()      : ['Welcome', 'to', 'Hyderbad.', 'How', 'are', 'you', 'doing', 'today', '?']
data.split('.')   : ['Welcome to Hyderbad', ' How are you doing today ?']
len(data)         : 46
type(data)        : <class 'str'>

'''

x = "   abc   "
print(x.replace("", "*"))

'''
Output :

* * * *a*b*c* * * *

'''

x = "   abc   "
print(x.strip())
print(x.lstrip().replace(" ", "*"))
print(x.rstrip().replace(" ", "*"))

'''
Output :

abc
abc***
***abc
'''
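
The split() method shown earlier is the first step of a word count, a classic exercise in Big Data tutorials. The following is only a rough sketch in plain Python, using an illustrative sentence and the standard library's Counter; it is not Spark code:

```python
from collections import Counter

# Illustrative input, not taken from the examples above
sentence = "spark is fast and spark is simple"

# split() with no argument breaks the string on whitespace
words = sentence.split()

# Counter builds a dict-like mapping of word -> number of occurrences
word_counts = Counter(words)

print(word_counts["spark"])  # 2
print(word_counts["fast"])   # 1
print(len(word_counts))      # 5 distinct words
```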

List Functionality :

List allows duplicates
List is ordered and index based
Mixed data types are allowed
Mutable

Set Functionality :

Set doesn't allow duplicates
Set is not ordered and not index based
Mixed data types are allowed (elements must be hashable)
Mutable (frozenset is the immutable variant)

Tuples Functionality :

Tuples allow duplicates
Tuple is ordered and index based
Mixed data type allowed
Immutable

Dictionary Functionality :

Dictionary keys are unique
Mixed data types allowed
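
The properties listed above can be checked with a short sketch. The variable names are illustrative. Note that a set can be changed after creation with add(), while a tuple rejects item assignment:

```python
numbers_list = [1, 2, 2, 3]
numbers_list.append(4)          # lists are mutable and keep duplicates
print(numbers_list)             # [1, 2, 2, 3, 4]

numbers_set = {1, 2, 2, 3}      # the duplicate 2 is dropped
numbers_set.add(4)              # a set can still be modified
print(sorted(numbers_set))      # [1, 2, 3, 4]

numbers_tuple = (1, 2, 2, 3)
tuple_error = ""
try:
    numbers_tuple[0] = 10       # tuples are immutable, so this raises TypeError
except TypeError as error:
    tuple_error = str(error)
    print("Tuple error:", tuple_error)

scores = {"x": 1, "y": 2}
scores["x"] = 100               # keys are unique: assigning again overwrites
print(scores)                   # {'x': 100, 'y': 2}
```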

list1 = [1, 2, 3, 4]

for x in list1:
    print(x)

'''
Output :

1
2
3
4

'''

list1 = [1, 2, 3, 4]

l2 = [x for x in list1]
print(l2)

'''
Output :

[1, 2, 3, 4]

'''

list1 = [1, 2, 3, 4]

list2 = [x*x for x in list1]
print(list2)

'''
Output :

[1, 4, 9, 16]

'''

list1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

list2 = [x*x for x in list1 if(x % 2) == 0]
print(list2)

'''
Output :

[4, 16, 36, 64, 100]

'''
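
The same comprehension syntax also works for dictionaries and sets. A minimal sketch, with illustrative variable names:

```python
numbers = [1, 2, 3, 4, 5, 6]

# Dictionary comprehension: map each even number to its square
even_squares = {x: x * x for x in numbers if x % 2 == 0}
print(even_squares)        # {2: 4, 4: 16, 6: 36}

# Set comprehension: remainders after dividing by 3, duplicates removed
remainders = {x % 3 for x in numbers}
print(sorted(remainders))  # [0, 1, 2]
```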

# Declare a dictionary
dict1 = {"x" : 1, "y" : 2, "z" : 3}

print("dict1.keys() :", dict1.keys() )
print("dict1.values() :", dict1.values() )
print("dict1.items() :", dict1.items() )

'''
Output :

dict1.keys() : dict_keys(['x', 'y', 'z'])
dict1.values() : dict_values([1, 2, 3])
dict1.items() : dict_items([('x', 1), ('y', 2), ('z', 3)])

'''

print(range(5))
print(list(range(5)))
print(list(range(1, 5)))
print(list(range(1, 5, 1)))
print(list(range(1, 5, 2)))

'''
Output :

range(0, 5)
[0, 1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 3]

'''

x = "abcdef"

if ('a' in x):
    print("A is available in given string")

'''
Output :
A is available in given string

'''

x = "abcdef"

if ('p' in x):
    print("P is available in given string")
else:
    print("P is not available in given string")

'''
Output :
P is not available in given string

'''

x = "abcdef"

if ('p' in x):
    print("P is available in given string")
elif('q' in x):
    print("q is available in given string")
elif('r' in x):
    print("r is available in given string")
else:
    print("No luck")

'''
Output :
No luck

'''

Break :

for i in range(5):
    print("Starting the loop : " +str(i))
    stop = input("Do you want to stop the loop (y/n) ? ")
    if stop == "y":
        break
    print("Ending the loop : " +str(i))

print("PROGRAM FINISHED! ")

'''
Output : Once break is executed, the rest of the loop body is skipped and the loop exits

Starting the loop : 0
Do you want to stop the loop (y/n) ? y
PROGRAM FINISHED!

'''

Continue :

"""
The continue keyword skips the rest of the loop body for the current iteration,
but it doesn't exit the loop like the break statement does

"""

for i in range(5):
    print("Starting the loop : " +str(i))
    stop = input("Do you want to stop the loop (y/n) ? ")
    if stop == "y":
        continue 
    print("Ending the loop : " +str(i))
    

print("PROGRAM FINISHED! ")

'''
Output :

Starting the loop : 0
Do you want to stop the loop (y/n) ? n
Ending the loop : 0
Starting the loop : 1
Do you want to stop the loop (y/n) ? y
Starting the loop : 2
Do you want to stop the loop (y/n) ? y
Starting the loop : 3
Do you want to stop the loop (y/n) ? y
Starting the loop : 4
Do you want to stop the loop (y/n) ? y
PROGRAM FINISHED! 


'''
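
The two examples above need keyboard input. As a non-interactive sketch of the same idea (the conditions here are chosen only for illustration), continue and break can be combined in a single loop:

```python
visited = []

for i in range(10):
    if i % 2 == 0:
        continue        # skip even numbers and move to the next iteration
    if i > 7:
        break           # leave the loop entirely once i exceeds 7
    visited.append(i)

print(visited)  # [1, 3, 5, 7]
```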

Pass keyword :

# Use the "pass" keyword as a placeholder where Python's syntax requires a statement
# Later we can replace "pass" with actual code

def addition():
    pass

def main():
    pass

Functions :

def add(a, b):
    return a + b

print("add(1,2)  : ", add(1, 2))

'''
Output :

add(1,2)  :  3

'''

# Anonymous (lambda) functions

lambda a, b : a + b

lambda a, b : a * b
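
A lambda on its own does nothing until it is called or passed to another function. The sketch below shows some typical uses with the built-in map, filter and sorted functions; the data is illustrative. Spark transformations such as map and filter accept functions in a similar way.

```python
# Assigning a lambda to a name so that it can be called
add = lambda a, b: a + b
print(add(1, 2))  # 3

numbers = [1, 2, 3, 4, 5]

# map applies the function to every element
squares = list(map(lambda x: x * x, numbers))
print(squares)    # [1, 4, 9, 16, 25]

# filter keeps the elements for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens)      # [2, 4]

# sorted uses the lambda as the sort key (here: the second item of each pair)
pairs = [("b", 2), ("a", 3), ("c", 1)]
by_count = sorted(pairs, key=lambda pair: pair[1])
print(by_count)   # [('c', 1), ('b', 2), ('a', 3)]
```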

Factorial of a given number (iterative and recursive methods) :

"""

def factorial_solution(value):
    result = 1
    for i in range(2, value+1):
        result = result * i
    return result

def main():
    result = factorial_solution(10)
    print("Factorial of given value is", result)

main()

"""


# factorial of a number (applicable only to non-negative integers)
# n! = n * (n - 1) * (n - 2) * .... 3 * 2 *1

# Ex : 5! = 5 * 4 * 3 * 2 * 1
# 5! = 120

# Factorial of zero is 1.
# 0! = 1 

def recursive_factorial(n):
    # Base case: 0! and 1! are both 1
    if n <= 1:
        return 1
    else:
        return n * recursive_factorial(n-1)


def main():
    number = input("Please enter the number : ")
    number = int(number)

    if number < 0:
        print("Please enter a number greater than '0'")
    elif number == 0:
        print("factorial of zero is 1")
    else:
        print(f"Factorial of given number {number} is : ", recursive_factorial(number) )

main()


'''
Output :

Please enter the number : 5
Factorial of given number 5 is :  120

'''
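
One way to gain confidence in a hand-written factorial is to compare it with math.factorial from the standard library. This sketch uses a base case of n <= 1 so that 0 is handled by the function itself; that is an assumption of this example:

```python
import math

def recursive_factorial(n):
    # Base case covers both 0! and 1!
    if n <= 1:
        return 1
    return n * recursive_factorial(n - 1)

# Compare with the standard library for a few small inputs
results = [recursive_factorial(n) for n in range(6)]
expected = [math.factorial(n) for n in range(6)]

print(results)              # [1, 1, 2, 6, 24, 120]
print(results == expected)  # True
```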

Apart from the above, we need some knowledge of additional Python libraries such as NumPy and Pandas. In Java we call them libraries; in Python they are usually called packages or modules.

Numpy :

NumPy stands for Numerical Python. Numerical operations on large Python lists are slow because every element is processed one at a time in the interpreter. NumPy was created to improve performance by storing data in compact arrays and running operations on whole arrays in compiled code.

In practice, programmers use NumPy arrays rather than lists for numerical work, as this gives better performance
NumPy is faster for operations applied to whole arrays
NumPy can handle large amounts of data
NumPy is especially important for Data Scientists
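
To make the difference from a list concrete, here is a small sketch with illustrative values. With a list, every element is visited explicitly; with a NumPy array, the operation is written once and applied to the whole array. No timing is shown here, only the difference in style.

```python
import numpy as np

prices = [10, 20, 30, 40]

# Plain list: a comprehension visits each element one by one
doubled_list = [p * 2 for p in prices]

# NumPy array: the multiplication is applied to every element at once
doubled_array = np.array(prices) * 2

print(doubled_list)   # [20, 40, 60, 80]
print(doubled_array)  # [20 40 60 80]
```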

How to install NumPy & Pandas ?

pip3 install numpy
pip3 install pandas

I already have both the NumPy & Pandas libraries installed on my local machine.

PS D:\GitHub\Python\python_practise> pip install numpy

Requirement already satisfied: numpy in c:\users\arunk\anaconda3\lib\site-packages (1.26.4)

PS D:\GitHub\Python\python_practise> pip install pandas

Requirement already satisfied: pandas in c:\users\arunk\anaconda3\lib\site-packages (2.2.2)

We need to import the below packages in Python code to use these libraries (either form works; the np and pd aliases are the common convention) :

import numpy
import pandas

import numpy as np
import pandas as pd

If the above import statements don't give any errors in your code, then the installation was successful and we are ready to play with these libraries.

Creating an array in Numpy :

"""

Numpy is a package that allows us to manipulate arrays of data.
Usually numerical, but we can also put strings in it.

We can easily manipulate numbers mathematically in an array.

It is like an array class, but a kind of turbo-charged array (more power!)

Note : Please use 'pip3 list' and see if numpy is already installed in your machine.
       If not then, use 'pip3 install numpy' to install this python package


"""

import numpy as np



def main():
    # creating a one-dimensional array using 'np'
    num1 = np.array([1, 2, 3, 4], dtype=int)

    print(num1)

    # How to check the type of array ?
    print("Type of the array is : ",num1.dtype)
    
    # How to check dimension of an array ?
    print("Dimension of the array is : ",num1.ndim)
    
    # What is the shape of the array ?
    print("Shape of the array is : ",num1.shape)
    
    # How many bytes does this array have ?
    print("Bytes : ", num1.nbytes)

    # creating a two-dimensional array using 'np'
    num2 = np.array([[1, 2], [3, 4], [5, 6]], dtype=int)
    
    print(num2)

    print("Type of the array is : ", num2.dtype)

    print("Dimension of the array is : ", num2.ndim)

    print("Shape of the array is : ", num2.shape)


if __name__ == "__main__":
    main()
    
'''
Output :

[1 2 3 4]
Type of the array is :  int32
Dimension of the array is :  1
Shape of the array is :  (4,)
Bytes :  16
[[1 2]
 [3 4]
 [5 6]]
Type of the array is :  int32
Dimension of the array is :  2
Shape of the array is :  (3, 2)

'''

Zero dimensional Array:

import numpy as np

data = np.array(10)
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)

'''
Output :

Data in numpy array is :  10
Dimension of this numpy array is :  0

'''

One dimensional Array:

import numpy as np

data = np.array([10, 20, 30, 40, 50])
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)

'''
Output :

Data in numpy array is :  [10 20 30 40 50]
Dimension of this numpy array is :  1

'''

Two dimensional Array:

import numpy as np

data = np.array([[10, 20, 30, 40, 50], [60, 70, 80, 90, 100]], dtype=int)
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)

'''
Output :

Data in numpy array is :  [[ 10  20  30  40  50]
 [ 60  70  80  90 100]]
Dimension of this numpy array is :  2

'''

Some mathematical operations : (min, max, sum, mean, variance etc.)

import numpy as np

data = np.array([10, 20, 30, 40, 50], dtype=int)
print("Data in numpy array is : ",data)

print("Dimension of this numpy array is : ", data.ndim)
print("Minimum number in this One dimensional array is : ", data.min())
print("Maximum number in this One dimensional array is : ", data.max())
print("Sum of all elements in this One dimensional array is : ", data.sum())

'''
Output :

Data in numpy array is :  [10 20 30 40 50]
Dimension of this numpy array is :  1
Minimum number in this One dimensional array is :  10
Maximum number in this One dimensional array is :  50
Sum of all elements in this One dimensional array is :  150

'''
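
The mean and variance mentioned in the heading follow the same pattern. A minimal sketch on the same values; note that var() and std() compute the population variance and standard deviation by default (ddof=0):

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

mean_value = data.mean()  # (10 + 20 + 30 + 40 + 50) / 5
variance = data.var()     # average squared distance from the mean
std_dev = data.std()      # square root of the variance

print("Mean     :", mean_value)  # 30.0
print("Variance :", variance)    # 200.0
print("Std dev  :", std_dev)
```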

Re-shaping an existing Array :

Reshaping is very helpful when working with large multi-dimensional arrays.
In such cases, we may need to convert 'n'-dimensional arrays into a one-dimensional array, partition it across a distributed system, apply operations at the partition level, and then reshape back to the original shape. This is just one use case; there are many more like it in Data Science.
This is background information to help us understand why reshaping is used.

import numpy as np

data_set1 = np.array([10, 20, 30, 40, 50, 60], dtype=int)
print(data_set1)
print("Dimension before reshaping : ", data_set1.ndim)
print()

# While reshaping, considering the number of elements in initial data set is very important
# we have six elements and hence we can reshape to (3 * 2) or (2 * 3) or (1 * 6) or (6 * 1)
data_set2 = data_set1.reshape(2, 3)
print(data_set2)
print("Dimension after reshaping : ", data_set2.ndim)

data_set3 = data_set1.reshape(3, 2)
print(data_set3)
print("Dimension after reshaping : ", data_set3.ndim)


'''
Output :

[10 20 30 40 50 60]
Dimension before reshaping :  1

[[10 20 30]
 [40 50 60]]
Dimension after reshaping :  2
[[10 20]
 [30 40]
 [50 60]]
Dimension after reshaping :  2

'''
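
The round trip described earlier (flatten to one dimension, then reshape back) can be sketched as follows. The array and variable names are illustrative; passing -1 to reshape asks NumPy to infer that dimension from the number of elements.

```python
import numpy as np

matrix = np.array([[10, 20, 30], [40, 50, 60]])

# -1 lets NumPy infer the length (6 here), giving a one-dimensional array
flat = matrix.reshape(-1)
print(flat)       # [10 20 30 40 50 60]
print(flat.ndim)  # 1

# Reshape back using the original shape, (2, 3)
restored = flat.reshape(matrix.shape)
print(restored.shape)  # (2, 3)
```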

Incorrect reshaping :

import numpy as np

data_set1 = np.array([10, 20, 30, 40, 50, 60], dtype=int)
print(data_set1)
print("Dimension before reshaping : ", data_set1.ndim)
print()

data_set2 = data_set1.reshape(3, 3)


'''
Output :

[10 20 30 40 50 60]
Dimension before reshaping :  1

Traceback (most recent call last):
  File "d:\GitHub\Python\python_practise\numpy\np_practise.py", line 8, in <module>
    data_set2 = data_set1.reshape(3, 3)
                ^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot reshape array of size 6 into shape (3,3)

'''

Pandas :

Pandas is a powerful open-source Python library used for data analysis and manipulation. It provides data structures and functions for performing efficient operations on data, and it is well suited to tabular data such as spreadsheets and SQL tables.

It is built on top of the NumPy library, which means many NumPy structures are used in Pandas. Data produced by Pandas is often used as input for plotting functions in Matplotlib and for machine learning algorithms.

Here is a list of things that we can do using Pandas.

Data set cleaning, merging, and joining.
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
Powerful group by functionality for performing split-apply-combine operations on data sets.
Data Visualization.

Creating a simple data frame using Pandas :

import pandas as pd

students = {
    "Name" : ["Arun", "Anu", "Vynateya"],
    "ID" : ["1", "2", "3"],
    "Location" : ["Hyd", "Ban", "Ban"] 
}

df = pd.DataFrame(students)

print(df)

'''
Output :

       Name ID Location
0      Arun  1      Hyd
1       Anu  2      Ban
2  Vynateya  3      Ban

'''

Manipulating Index :

import pandas as pd

students = {
    "Name" : ["Arun", "Anu", "Vynateya"],
    "ID" : ["1", "2", "3"],
    "Location" : ["Hyd", "Ban", "Ban"] 
}

df = pd.DataFrame(students, index=["row1", "row2", "row3"])

print(df)

'''
Output :

          Name ID Location
row1      Arun  1      Hyd
row2       Anu  2      Ban
row3  Vynateya  3      Ban

'''
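
Two of the operations from the list of Pandas features, filtering rows and group by, can be sketched on the same students data. This is only a small illustration, not a full tour of the API:

```python
import pandas as pd

students = {
    "Name": ["Arun", "Anu", "Vynateya"],
    "ID": ["1", "2", "3"],
    "Location": ["Hyd", "Ban", "Ban"],
}
df = pd.DataFrame(students)

# Boolean filtering: keep only the rows where Location is "Ban"
ban_students = df[df["Location"] == "Ban"]
print(ban_students["Name"].tolist())  # ['Anu', 'Vynateya']

# Group by: count the students in each location
counts = df.groupby("Location")["Name"].count()
print(counts.to_dict())               # {'Ban': 2, 'Hyd': 1}
```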

Important points :

Thus, we can convert many kinds of data sets into tables and perform the required operations. This is why we use Pandas.
Spark SQL has the same concept, the DataFrame, which is inspired by the Pandas DataFrame
From Spark 2.x, we can convert a Spark data frame into a Pandas data frame and vice versa
We will see more information related to data frames in Spark
There is a library called Py4J, which lets Python code call Java objects running in a JVM; PySpark uses this library

Please use Databricks Community Edition for practicing both Python and Scala.

Thanks,

Arun Mathe

Email ID : arunkumar.mathe@gmail.com

Contact ID : 9704117111
