Python is a general-purpose programming language used for a variety of tasks such as web development and data analytics. It supports several styles of programming, including procedural, object-oriented, and functional. In this post we will cover the Python basics we need to work with Spark.
If you want to practice Spark in a Big Data environment, you can use Databricks.
- URL : https://community.cloud.databricks.com
- This is a tool that many programmers use in real production environments
- Both a Community Edition (free, with limited support) and paid versions are available
- Register for the tool online for free and practice
- Indentation is very important in Python. We don't use braces in Python like we do in Java, and the scope of the block/loop/definition is interpreted based on the indentation of code.
Correct Indentation :
def greet():
    print("Hello!")  # Indented correctly
    print("Welcome to Python.")  # Inside the function

greet()  # No indentation needed here
Incorrect Indentation :
def greet():
print("Hello!") # ❌ IndentationError: expected an indented block
- Regarding data types, Python doesn't have long or double as Java does. It has int, float, complex, str, tuple, list, set, dict, etc.
# Int
a = 10
print(type(a))
# Float
f = 10.2
print(type(f))
# String (named 'text' so that the built-in 'str' is not shadowed)
text = "Spark"
print(type(text))
# Tuple is immutable, we can't modify elements; it allows duplicates
t1 = (1, 2, 3)
print(type(t1))
# List allows duplicates, it is ordered
# We can add and remove elements
l1 = [1, 2, 3]
print(type(l1))
# Set does NOT allow duplicates, and it is not ordered
s1 = {1, 2, 3}
print(type(s1))
# Dictionary is a key-value concept (Map in Java)
d1 = {1 : 'a', 2 : 'b', 3 : 'c'}
print(type(d1))
Terminal Output :
<class 'int'>
<class 'float'>
<class 'str'>
<class 'tuple'>
<class 'list'>
<class 'set'>
<class 'dict'>
String reverse using slicing :
word = "Python"
print(word[::-1]) # Output: "nohtyP"
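The reverse trick above is one case of the general slice form sequence[start:stop:step]. Here is a small sketch of a few other slices on the same word; the variable names are only for illustration.

```python
# Slice syntax is sequence[start:stop:step]; the stop index is exclusive.
word = "Python"

first_three = word[0:3]     # characters at index 0, 1, 2
every_second = word[::2]    # step of 2 from the start
last_two = word[-2:]        # negative index counts from the end
reversed_word = word[::-1]  # step of -1 walks backwards

print(first_three)    # Pyt
print(every_second)   # Pto
print(last_two)       # on
print(reversed_word)  # nohtyP
```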
For loop using range :
for i in range(10):
    print(i)
'''
Output
PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
0
1
2
3
4
5
6
7
8
9
'''
Iterating collections, DICT :
# Avoid naming a variable 'dict', because that shadows the built-in type
person = {"name" : "Arun", "state" : "Telangana", "city" : "Hyd" }
for key, value in person.items():
    print(key, value)
'''
Output
PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
name Arun
state Telangana
city Hyd
'''
Defining variables :
# Correct way of defining variables
_name = "Arun"
my_name = "Arun"
my_name123 = "Arun"
# Incorrect way of defining variables
# 123name = "Arun"
# my-name = "Arun"
Adding boolean values to int :
# True means 1
print(1 + True)
# False means 0
print(1 + False)
'''
Output :
PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
2
1
'''
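The reason this works is that bool is a subclass of int in Python. A common practical use is counting how many items satisfy a condition. The sketch below uses made-up marks and an assumed pass mark of 50.

```python
# bool is a subclass of int, so True and False behave as 1 and 0 in arithmetic.
print(isinstance(True, int))  # True

marks = [35, 72, 48, 90, 66]  # hypothetical sample data
# Each comparison yields True or False; sum() adds them up as 1s and 0s.
passed = sum(m >= 50 for m in marks)
print(passed)  # 3
```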
# The integer is added to the real part; the imaginary part stays as is.
print(1 + (2 + 3j))
'''
Output :
(3+3j)
'''
String Operations :
# String operations
data = "Welcome to Hyderbad. How are you doing today ?"
print("data :", data)
print("data.split() :", data.split())
print("data.split('.') :", data.split('.'))
print("len(data) :", len(data))
print("type(data) :", type(data))
'''
Output :
PS D:\GitHub\Python\python_practise> & C:/Users/arunk/AppData/Local/Programs/Python/Python312/python.exe d:/GitHub/Python/python_practise/basics/sample.py
data : Welcome to Hyderbad. How are you doing today ?
data.split() : ['Welcome', 'to', 'Hyderbad.', 'How', 'are', 'you', 'doing', 'today', '?']
data.split('.') : ['Welcome to Hyderbad', ' How are you doing today ?']
len(data) : 46
type(data) : <class 'str'>
'''
x = "   abc   "
print(x.replace("", "*"))
'''
Output :
* * * *a*b*c* * * *
'''
x = "   abc   "
print(x.strip())
print(x.lstrip().replace(" ", "*"))
print(x.rstrip().replace(" ", "*"))
'''
Output :
abc
abc***
***abc
'''
List Functionality :
- List allows duplicates
- List is ordered and index based
- Mixed data types are allowed
- Mutable
Set Functionality :
- Set doesn't allow duplicates
- Set is not ordered and not index based
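- Mixed data types are allowed, as long as the elements are hashable (for example, a list cannot be a set element)
- Mutable (we can add and remove elements; frozenset is the immutable variant)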
Tuples Functionality :
- Tuples allow duplicates
- Tuple is ordered and index based
- Mixed data type allowed
- Immutable
Dictionary Functionality :
- Dictionary keys are unique
- Mixed data types allowed for values; keys must be hashable
- Mutable
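The properties listed above can be checked with a few lines of code. This is only a small sketch with made-up values, not an exhaustive test of each type.

```python
# List: ordered, index based, mutable, keeps duplicates
numbers = [1, 2, 2, 3]
numbers.append(4)
print(numbers)         # [1, 2, 2, 3, 4]

# Set: duplicates are dropped; elements can still be added or removed
unique = set(numbers)
unique.add(10)
unique.discard(1)
print(sorted(unique))  # [2, 3, 4, 10]

# Tuple: ordered and index based, but it cannot be modified
point = (1, 2, 2)
tuple_error = ""
try:
    point[0] = 99
except TypeError as err:
    tuple_error = str(err)
    print("Tuple error:", err)

# Dictionary: keys are unique, so assigning to an existing key overwrites its value
person = {"name": "Arun"}
person["name"] = "Anu"
person["city"] = "Hyd"
print(person)          # {'name': 'Anu', 'city': 'Hyd'}
```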
list1 = [1, 2, 3, 4]
for x in list1:
    print(x)
'''
Output :
1
2
3
4
'''
list1 = [1, 2, 3, 4]
l2 = [x for x in list1]
print(l2)
'''
Output :
[1, 2, 3, 4]
'''
list1 = [1, 2, 3, 4]
list2 = [x*x for x in list1]
print(list2)
'''
Output :
[1, 4, 9, 16]
'''
list1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
list2 = [x*x for x in list1 if x % 2 == 0]
print(list2)
'''
Output :
[4, 16, 36, 64, 100]
'''
# Declare a dictionary
dict1 = {"x" : 1, "y" : 2, "z" : 3}
print("dict1.keys() :", dict1.keys() )
print("dict1.values() :", dict1.values() )
print("dict1.items() :", dict1.items() )
'''
Output :
dict1.keys() : dict_keys(['x', 'y', 'z'])
dict1.values() : dict_values([1, 2, 3])
dict1.items() : dict_items([('x', 1), ('y', 2), ('z', 3)])
'''
print(range(5))
print(list(range(5)))
print(list(range(1, 5)))
print(list(range(1, 5, 1)))
print(list(range(1, 5, 2)))
'''
Output :
range(0, 5)
[0, 1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 3]
'''
x = "abcdef"
if 'a' in x:
    print("A is available in given string")
'''
Output :
A is available in given string
'''
x = "abcdef"
if 'p' in x:
    print("P is available in given string")
else:
    print("P is not available in given string")
'''
Output :
P is not available in given string
'''
x = "abcdef"
if 'p' in x:
    print("P is available in given string")
elif 'q' in x:
    print("q is available in given string")
elif 'r' in x:
    print("r is available in given string")
else:
    print("No luck")
'''
Output :
No luck
'''
Break :
for i in range(5):
    print("Starting the loop : " + str(i))
    stop = input("Do you want to stop the loop (y/n) ? ")
    if stop == "y":
        break
    print("Ending the loop : " + str(i))
print("PROGRAM FINISHED! ")
'''
Output : Once break is executed, the loop exits immediately and the rest of the loop body is skipped
Starting the loop : 0
Do you want to stop the loop (y/n) ? y
PROGRAM FINISHED!
'''
Continue :
"""
The continue keyword skips the rest of the loop body for the current iteration
and moves on to the next iteration; unlike break, it does not exit the loop
"""
for i in range(5):
    print("Starting the loop : " + str(i))
    stop = input("Do you want to stop the loop (y/n) ? ")
    if stop == "y":
        continue
    print("Ending the loop : " + str(i))
print("PROGRAM FINISHED! ")
'''
Output :
Starting the loop : 0
Do you want to stop the loop (y/n) ? n
Ending the loop : 0
Starting the loop : 1
Do you want to stop the loop (y/n) ? y
Starting the loop : 2
Do you want to stop the loop (y/n) ? y
Starting the loop : 3
Do you want to stop the loop (y/n) ? y
Starting the loop : 4
Do you want to stop the loop (y/n) ? y
PROGRAM FINISHED!
'''
Pass keyword :
# "pass" is a placeholder statement that does nothing. Use it where Python
# requires a body (for example, an empty function) to avoid a syntax error.
# Later we can replace "pass" with actual code
def addition():
    pass

def main():
    pass
Functions :
def add(a, b):
    return a + b

print("add(1,2) : ", add(1, 2))
'''
Output :
add(1,2) : 3
'''
# Anonymous functions (lambdas)
lambda a, b : a + b
lambda a, b : a * b
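A lambda on its own line is created and then discarded, so in practice it is either given a name or passed straight to another function. The sketch below shows both uses; the sample data is made up. Spark transformations such as map and filter take functions in the same way.

```python
# Assign lambdas to names so they can be called later
add = lambda a, b: a + b
multiply = lambda a, b: a * b
print(add(2, 3))       # 5
print(multiply(2, 3))  # 6

# Lambdas are most often passed directly to other functions
numbers = [1, 2, 3, 4, 5, 6]
squares = list(map(lambda x: x * x, numbers))
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(squares)  # [1, 4, 9, 16, 25, 36]
print(evens)    # [2, 4, 6]

# Sorting (name, age) pairs by age; the data is hypothetical
people = [("Arun", 35), ("Anu", 30), ("Vynateya", 5)]
by_age = sorted(people, key=lambda person: person[1])
print(by_age)   # [('Vynateya', 5), ('Anu', 30), ('Arun', 35)]
```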
Factorial of a given number in recursive method :
"""
def factorial_solution(value):
    result = 1
    for i in range(2, value+1):
        result = result * i
    return result

def main():
    result = factorial_solution(10)
    print("Factorial of given value is", result)

main()
"""
# factorial of a number (defined only for non-negative integers)
# n! = n * (n - 1) * (n - 2) * .... 3 * 2 *1
# Ex : 5! = 5 * 4 * 3 * 2 * 1
# 5! = 120
# Factorial of zero is 1.
# 0! = 1
def recursive_factorial(n):
    if n == 1:
        return n
    else:
        return n * recursive_factorial(n-1)

def main():
    number = input("Please enter the number : ")
    number = int(number)
    if number < 0:
        print("Please enter a number greater than '0'")
    elif number == 0:
        print("factorial of zero is 1")
    else:
        print(f"Factorial of given number {number} is : ", recursive_factorial(number) )

main()
'''
Output :
Please enter the number : 5
Factorial of given number 5 is : 120
'''
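The recursive function above relies on the caller to handle zero and negative input. One possible variant, shown here as a sketch rather than the only way to do it, moves those checks into the function itself and compares the result with math.factorial from the standard library.

```python
import math

def recursive_factorial(n):
    # Reject negative input inside the function instead of in the caller
    if n < 0:
        raise ValueError("factorial is not defined for negative numbers")
    # Base case covers both 0! and 1!
    if n <= 1:
        return 1
    return n * recursive_factorial(n - 1)

for value in (0, 1, 5, 10):
    # math.factorial is used here only as a cross-check
    assert recursive_factorial(value) == math.factorial(value)
    print(value, recursive_factorial(value))
```

For very large inputs the recursive version will hit Python's recursion limit, which is one reason the iterative solution is also worth knowing.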
Apart from the above, we need some knowledge of additional Python libraries such as NumPy and Pandas. In Java we call them libraries; in Python they are installed as packages and imported as modules.
Numpy :
NumPy stands for Numerical Python. Numerical operations on plain Python lists are slow for large data, and NumPy was created to make them faster.
- In practice, programmers use NumPy arrays rather than lists for numerical work because they perform better
- NumPy is faster at accessing and operating on values
- NumPy can handle large amounts of data
- NumPy is crucial for data scientists
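To illustrate the difference in style between a list and a NumPy array, here is a minimal sketch that doubles every value. It assumes NumPy is already installed, and the prices are made-up values. It does not measure speed; it only shows that the array applies the operation to all elements without an explicit loop.

```python
import numpy as np

prices = [100, 200, 300, 400]  # hypothetical sample values

# Plain list: a loop (here a comprehension) touches each element
doubled_list = [p * 2 for p in prices]

# NumPy array: the operation applies to every element at once
prices_arr = np.array(prices)
doubled_arr = prices_arr * 2

print(doubled_list)  # [200, 400, 600, 800]
print(doubled_arr)   # [200 400 600 800]

# Note: multiplying a list by 2 repeats it instead of doubling the values
repeated = prices * 2
print(repeated)      # [100, 200, 300, 400, 100, 200, 300, 400]
```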
How to install NumPy & Pandas ?
- pip3 install numpy
- pip3 install pandas
I already have both NumPy and Pandas installed on my local machine.
PS D:\GitHub\Python\python_practise> pip install numpy
Requirement already satisfied: numpy in c:\users\arunk\anaconda3\lib\site-packages (1.26.4)
PS D:\GitHub\Python\python_practise> pip install pandas
Requirement already satisfied: pandas in c:\users\arunk\anaconda3\lib\site-packages (2.2.2)
We need to import the packages below in our Python code to use these libraries :
- import numpy
- import pandas
import numpy as np
import pandas as pd
If the import statements above do not raise any errors, the installation was successful and we are ready to work with these libraries.
Creating an array in NumPy :
"""
NumPy is a package that allows us to manipulate arrays of data.
The data is usually numerical, but we can also put strings in an array.
We can easily manipulate numbers mathematically in an array.
It is like an array class, but a turbo-charged one (more power!)
Note : Please use 'pip3 list' to see if numpy is already installed on your machine.
If not, use 'pip3 install numpy' to install this Python package
"""
import numpy as np

def main():
    # creating a one-dimensional array using 'np'
    num1 = np.array([1, 2, 3, 4], dtype=int)
    print(num1)
    # How to check the type of array ?
    print("Type of the array is : ", num1.dtype)
    # How to check dimension of an array ?
    print("Dimension of the array is : ", num1.ndim)
    # What is the shape of the array ?
    print("Shape of the array is : ", num1.shape)
    # How many bytes does this array have ?
    print("Bytes : ", num1.nbytes)
    # creating a two-dimensional array using 'np'
    num2 = np.array([[1, 2], [3, 4], [5, 6]], dtype=int)
    print(num2)
    print("Type of the array is : ", num2.dtype)
    print("Dimension of the array is : ", num2.ndim)
    print("Shape of the array is : ", num2.shape)

if __name__ == "__main__":
    main()
'''
Output (from Windows with NumPy 1.26.4; on Linux or macOS the type is typically int64 and Bytes is 32) :
[1 2 3 4]
Type of the array is : int32
Dimension of the array is : 1
Shape of the array is : (4,)
Bytes : 16
[[1 2]
[3 4]
[5 6]]
Type of the array is : int32
Dimension of the array is : 2
Shape of the array is : (3, 2)
'''
Zero dimensional Array:
import numpy as np
data = np.array(10)
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)
'''
Output :
Data in numpy array is : 10
Dimension of this numpy array is : 0
'''
One dimensional Array:
import numpy as np
data = np.array([10, 20, 30, 40, 50])
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)
'''
Output :
Data in numpy array is : [10 20 30 40 50]
Dimension of this numpy array is : 1
'''
Two dimensional Array:
import numpy as np
data = np.array([[10, 20, 30, 40, 50], [60, 70, 80, 90, 100]], dtype=int)
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)
'''
Output :
Data in numpy array is : [[ 10 20 30 40 50]
[ 60 70 80 90 100]]
Dimension of this numpy array is : 2
'''
Some mathematical operations : (min, max, sum, avg, mean, variance etc.)
import numpy as np
data = np.array([10, 20, 30, 40, 50], dtype=int)
print("Data in numpy array is : ",data)
print("Dimension of this numpy array is : ", data.ndim)
print("Minimum number in this One dimensional array is : ", data.min())
print("Maximum number in this One dimensional array is : ", data.max())
print("Sum of all elements in this One dimensional array is : ", data.sum())
'''
Output :
Data in numpy array is : [10 20 30 40 50]
Dimension of this numpy array is : 1
Minimum number in this One dimensional array is : 10
Maximum number in this One dimensional array is : 50
Sum of all elements in this One dimensional array is : 150
'''
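The example above covers min, max and sum. A short sketch of the remaining statistics mentioned in the heading (mean, average, variance, plus standard deviation) on the same data could look like this. Note that var() and std() compute the population variance and standard deviation by default.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50], dtype=int)

mean_value = data.mean()
variance = data.var()         # population variance (divides by n)
std_dev = data.std()          # square root of the variance
average = np.average(data)    # same as the mean when no weights are given

print("Mean :", mean_value)   # Mean : 30.0
print("Variance :", variance) # Variance : 200.0
print("Standard deviation :", std_dev)
print("Average :", average)   # Average : 30.0
```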
Re-shaping an existing Array :
- Reshaping is very helpful when working with large multi-dimensional arrays
- For example, we may need to convert an n-dimensional array into a one-dimensional array, partition it across a distributed system, apply operations at the partition level, and then reshape the result back to its original shape. This is just one use case; Data Science has many like it.
- This information is only background to help us understand why reshaping is used.
import numpy as np
data_set1 = np.array([10, 20, 30, 40, 50, 60], dtype=int)
print(data_set1)
print("Dimension before reshaping : ", data_set1.ndim)
print()
# While reshaping, considering the number of elements in initial data set is very important
# we have six elements and hence we can reshape to (3 * 2) or (2 * 3) or (1 * 6) or (6 * 1)
data_set2 = data_set1.reshape(2, 3)
print(data_set2)
print("Dimension after reshaping : ", data_set2.ndim)
data_set3 = data_set1.reshape(3, 2)
print(data_set3)
print("Dimension after reshaping : ", data_set3.ndim)
'''
Output :
[10 20 30 40 50 60]
Dimension before reshaping : 1
[[10 20 30]
[40 50 60]]
Dimension after reshaping : 2
[[10 20]
[30 40]
[50 60]]
Dimension after reshaping : 2
'''
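The flatten, process and restore idea mentioned earlier can be sketched on a single machine as follows. The distributed partitioning step is not shown, and adding 1 to each element is just a stand-in for real work.

```python
import numpy as np

matrix = np.array([[10, 20, 30], [40, 50, 60]])
original_shape = matrix.shape          # (2, 3)

flat = matrix.flatten()                # 1-D copy: [10 20 30 40 50 60]
flat = flat + 1                        # placeholder for some element-wise work

restored = flat.reshape(original_shape)
print(restored)
print(restored.shape)                  # (2, 3)

# -1 lets NumPy infer one dimension from the total number of elements
inferred = flat.reshape(3, -1)
print(inferred.shape)                  # (3, 2)
```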
Incorrect reshaping :
import numpy as np
data_set1 = np.array([10, 20, 30, 40, 50, 60], dtype=int)
print(data_set1)
print("Dimension before reshaping : ", data_set1.ndim)
print()
data_set2 = data_set1.reshape(3, 3)
'''
Output :
[10 20 30 40 50 60]
Dimension before reshaping : 1
Traceback (most recent call last):
File "d:\GitHub\Python\python_practise\numpy\np_practise.py", line 8, in <module>
data_set2 = data_set1.reshape(3, 3)
^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot reshape array of size 6 into shape (3,3)
'''
Pandas :
Pandas is a powerful open-source Python library used for data analysis and manipulation. It provides data structures and functions for performing efficient operations on data, and it is well suited to tabular data such as spreadsheets and SQL tables.
It is built on top of the NumPy library, which means many NumPy structures are used in Pandas. The data produced by Pandas is often used as input for plotting functions in Matplotlib and for machine learning algorithms.
Here is a list of things that we can do using Pandas.
- Data set cleaning, merging, and joining.
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
- Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
- Powerful group by functionality for performing split-apply-combine operations on data sets.
- Data Visualization.
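Two of the items above, missing-data handling and group by, can be sketched in a few lines. The data frame below is made up for illustration, and filling missing scores with 0 and using a pass mark of 50 are assumptions, not recommendations.

```python
import pandas as pd

# Hypothetical data; None marks a missing score
students = pd.DataFrame({
    "Name": ["Arun", "Anu", "Vynateya", "Ravi"],
    "Location": ["Hyd", "Ban", "Ban", "Hyd"],
    "Score": [80.0, None, 90.0, 70.0],
})

# Missing values are stored as NaN and can be counted with isna()
missing_count = students["Score"].isna().sum()
print(missing_count)   # 1

# Replace missing scores with 0, then insert a derived column
students["Score"] = students["Score"].fillna(0)
students["Passed"] = students["Score"] >= 50

# Split by Location, apply mean to each group, combine the results
mean_by_location = students.groupby("Location")["Score"].mean()
print(mean_by_location)
```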
Creating a simple data frame using Pandas :
import pandas as pd
students = {
"Name" : ["Arun", "Anu", "Vynateya"],
"ID" : ["1", "2", "3"],
"Location" : ["Hyd", "Ban", "Ban"]
}
df = pd.DataFrame(students)
print(df)
'''
Output :
Name ID Location
0 Arun 1 Hyd
1 Anu 2 Ban
2 Vynateya 3 Ban
'''
Manipulating Index :
import pandas as pd
students = {
"Name" : ["Arun", "Anu", "Vynateya"],
"ID" : ["1", "2", "3"],
"Location" : ["Hyd", "Ban", "Ban"]
}
df = pd.DataFrame(students, index=["row1", "row2", "row3"])
print(df)
'''
Output :
Name ID Location
row1 Arun 1 Hyd
row2 Anu 2 Ban
row3 Vynateya 3 Ban
'''
Important points :
- Thus, we can convert many kinds of data sets into tables and perform the required operations. This is why we use Pandas.
- Spark SQL has a similar concept, the DataFrame, which is inspired by Python Pandas
- In Spark 2.x and later, we can convert a Spark DataFrame into a Pandas DataFrame and vice versa
- We will see more about DataFrames when we cover Spark
- There is a library called Py4J that lets Python code call Java objects running in a JVM; PySpark uses this library
Please use Databricks Community Edition for practicing both Python and Scala.
Thanks,
Arun Mathe
Email ID : arunkumar.mathe@gmail.com
Contact ID : 9704117111