AWS : Running spark code using Databricks

In this blog, we are going to understand how to prepare Databricks workbench to read, write data from relational databases in cloud(AWS) environment. Databricks is a company that provides could based platform for data analytics and AI. We have to create an account here which is free of cost.

What is the use case ?

We are basically trying to read data from a MySQL table which is residing inside a MySQL DB created in AWS RDS service. Once we read the table/data, we need to apply a simple transformation and load same table in another SQL DB i.e., MicrosoftSQL which is also residing inside AWS RDS service. So, first we need to create 2 databases, one MySQL and another MSSQL by logging in into AWS account.

Once these 2 databases are ready, we need to install SQL workbench to interact with these relational databases from our local system(Please refer : How to setup SQL workbench blog)

How to create Databricks account ?

Login to https://community.cloud.databricks.com/ using email ID
You will receive an OTP to email, enter it in the prompt and login
Reference screenshot

How to create a cluster with Spark environment (refer above screenshot) ?

Click on compute option
Click on "Create compute" button
Now select some random version of spark and click create compute button
That's all, cluster with spark environment have created

How to open a new note to write Spark code ?

Click on "New" button on the top and then create a new notebook

Scenario 1 : Read table from MySQL

Open SQL workbench, connect to MySQL database inside AWS RDS(refer section "How to open a new workbench and connect to a Database ?" inside my blob "How to setup SQL workbench"

Spark code :

# Below line of code is the connection URL of MySQL database inside AWS RDS service
data="jdbc:mysql://sparkpoc1.ctucw2ay6r46.ap-south-1.rds.amazonaws.com:3306/arundb"

# This is how we connect to a specific table using spark, check query section in spark website : https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html 
df=spark.read.format("jdbc").option("url", data).option("user","admin").option("password","Mypassword.01").option("dbtable","emp").load()

# To display the data that we read using above line of code
df.show()

Scenario 2 : Read table from MySQL and load same table in another database, MSSQL

Spark code :

from pyspark.sql.functions import *
# MySQL URL
data="jdbc:mysql://sparkpoc1.ctucw2ay6r46.ap-south-1.rds.amazonaws.com:3306/arundb"
# reading emp table from MySQL
df=spark.read.format("jdbc").option("url", data).option("user","admin").option("password","Mypassword.01").option("dbtable","emp").load()
# displaying information from data frame
#df.show()
# Transformation
df=df.withColumn("today", current_date())
# MSSQL URL
mssqlurl="jdbc:sqlserver://sparkpoc2.ctucw2ay6r46.ap-south-1.rds.amazonaws.com:1433;trustServerCertificate=true;databaseName=arundb"
# Loading emp table read from MySQL DB to arunemp table in MSSQL table
df.write.format("jdbc").option("url", mssqlurl).option("user","admin").option("password","Mypassword.02").option("dbtable","arun1emp").save()

Scenario 3 : Reading all tables from a DB in MySQl, transform and load it into destination DB MSSQL

Spark code :

from pyspark.sql.functions import *

#tabs=["emp","dept","orders"]

data="jdbc:mysql://walmart2025.cdyg20qqucui.ap-south-1.rds.amazonaws.com:3306/pavanidb"
qry="(SELECT table_name FROM information_schema.tables WHERE table_schema = 'pavanidb') aaa"
all=spark.read.format("jdbc").option("url",data).option("user","myuser").option("password","Mypassword.1").option("dbtable",qry).load()
type(all)
print(all)
tabs = [t[0] for t in all.collect()]

print(tabs)
for t in tabs:
    print("table:",t)
    data="jdbc:mysql://walmart2025.cdyg20qqucui.ap-south-1.rds.amazonaws.com:3306/pavanidb"
    df=spark.read.format("jdbc").option("url",data).option("user","myuser").option("password","Mypassword.1").option("dbtable",t).load()
    df.show(4)
    #df.write.format("jdbc").option("url",mshost).option("user","myuser").option("password","Mypassword.1").option("dbtable",t+"_mysql123").save()

Arun Mathe
Gmail ID : arunkumar.mathe@gmail.com
Contact No : +91 9704117111

DataSphere

Search This Blog

AWS : Running spark code using Databricks

Labels

Comments

Post a Comment

Popular posts from this blog

AWS : Working with Lambda, Glue, S3/Redshift

Python : Python for Spark

Spark Core : Understanding RDD & Partitions in Spark