
Posts

Showing posts from January, 2025

AWS : Working with Lambda, Glue, S3/Redshift

This is one of the important concepts for understanding how an end-to-end pipeline works in AWS. We will see how to continuously monitor a common source like S3/Redshift from Lambda (using Boto3 code), initiate a trigger to start a Glue job (Spark code), and perform some action. Let's assume that AWS Lambda should trigger another AWS service, Glue, as soon as a file is uploaded to an AWS S3 bucket; Lambda should also pass the file information to Glue, so that the Glue job can perform some transformation and upload the transformed data into AWS RDS (MySQL). Understanding the above flow chart: Let's assume one of your clients is uploading files (say .csv/.json) to an AWS storage location, for example S3. As soon as this file is uploaded to S3, we need to initiate a TRIGGER in AWS Lambda using Boto3 code. Once this trigger is initiated, another AWS service called Glue (ETL tool) will start a PySpark job to receive this file information from Lambda, perform so...
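A minimal sketch of the Lambda side of this flow, assuming an S3 event notification invokes the function and that a Glue job already exists; the job name 'transform-job' and the argument names are placeholders, not values from the post. The Glue job would read the bucket/key arguments, transform the file, and load the result into RDS.

import boto3

glue = boto3.client('glue')

# Hypothetical Lambda handler: an S3 upload event triggers a Glue job run with the file details.
def lambda_handler(event, context):
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']

    # 'transform-job' is a placeholder for the Glue (PySpark) job that loads into RDS (MySQL).
    response = glue.start_job_run(
        JobName='transform-job',
        Arguments={
            '--source_bucket': bucket,
            '--source_key': key,
        },
    )
    return {'JobRunId': response['JobRunId']}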

Scala : Scala for Spark

Let's go through some basic concepts in Scala. We can use any one of the below approaches to practice: Terminal (command line), Notebook (Jupyter), or an IDE (Eclipse, IntelliJ etc.). Below are 2 important keywords: val means value, is immutable, and is used when you need a constant (the value won't change); var means variable, is mutable, and is used when you need a variable (the value may change). Scala syntax: val <identifier> : <data_type> = <value / expression>. Ex: val name : String = "Arun" + " " + "Spark" (can have a value or an expression). Java syntax: <data_type> <identifier> = <value / expression>; Ex: String name = "Arun"; Question: In terms of performance, are Java, Python and Scala the same while working on Spark? Answer: Before Spark-2.x, Scala would always give better performance compared with Java & Python, but from Spark-2.x the Spark DataFrame was introduced. You can use this DataFrame from any language, performance...

Spark : Installation

We can use programming languages like Java, Scala and Python to implement Spark. It is also very important to understand the version compatibility of the different software we need to implement Spark. Important points to remember about Scala: Java is an object-oriented programming language but not a functional one; Scala is both object-oriented and functional; Scala is built on top of Java, so we can call .java programs directly from Scala code. Version compatibility: We have to make sure that we are using compatible Scala, Java and Spark combinations while installing them in standalone mode for dev/testing purposes. If you are doing this in the cloud (for example AWS), there is no need to worry; it is taken care of automatically when you create a service like an EMR cluster with the required big data software. Below are the compatible versions: Scala-2.9, Java-1.7+, Spark-1.x; Scala-2.10, Java-1.7+, Spark-1.x; Scala-2.11, Java-1.8+, Spark-1.x, Spark-2.x; Scala-2.12, Java-1.8+, Spark-2.x, Spark-...

AWS : Boto3 (Create, Delete RDS using Python)

The below code deletes an existing RDS instance and also creates a new RDS instance in AWS RDS using the boto3 Python package: import boto3 # Creating a client session for RDS using region name, aws_access_key_id & aws_secret_access_key client = boto3.client('rds', region_name="ap-south-1", aws_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY', aws_access_key_id='YOUR_AWS_ACCESS_KEY_ID') # Deleting an existing instance # The DB instance ID is enough; make sure to skip the final snapshot & delete any automated backups response = client.delete_db_instance(DBInstanceIdentifier='newpoc', SkipFinalSnapshot=True, DeleteAutomatedBackups=True) # To cross-check whether any RDS instance is available response = client.describe_db_instances() print(response) # To create a new RDS instance in AWS # DBInstanceIdentifier is the name of the RDS instance # Engine must be your expected RDBMS name # Provide user name and...
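The excerpt cuts off before the create call; a minimal sketch of that step is below, assuming the same boto3 RDS client. The identifier, engine, instance class, credentials and storage size are illustrative placeholders, not the post's actual values.

import boto3

# Hypothetical sketch: creating a new RDS instance with boto3 (placeholder values throughout).
client = boto3.client('rds', region_name='ap-south-1')

response = client.create_db_instance(
    DBInstanceIdentifier='newpoc',        # name of the RDS instance
    Engine='mysql',                       # expected RDBMS engine
    DBInstanceClass='db.t3.micro',        # instance size
    MasterUsername='admin',               # master user name
    MasterUserPassword='YOUR_PASSWORD',   # master user password
    AllocatedStorage=20,                  # storage in GiB
)
print(response['DBInstance']['DBInstanceStatus'])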

AWS : Boto3 (Accessing AWS using Python)

Boto3 is the Amazon Web Services software development kit for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2. Boto3 is maintained and published by AWS. Please find the latest documentation at: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html Command to install it: pip install boto3 Local storage vs cloud storage: A local file system is block oriented, meaning storage is divided into blocks with sizes in the range of 1-4 KB. A collection of multiple blocks is called a file in local storage. Example: a 10 MB file will occupy almost 2500 blocks (assuming 4 KB per block). We know that we can install software on a local system (indirectly, in blocks). Local system blocks are managed by the operating system. But cloud storage is object-oriented storage, meaning everything is an object. There is no size limit; it is used only to store data, and we can't install software in cloud storage. Cloud storage is managed by users. We need to install either Pyc...
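As a quick illustration of the kind of access the post describes, here is a minimal sketch that connects to S3 with boto3 and lists buckets; the region and credential placeholders are assumptions, not values from the post, and in practice an AWS profile or environment variables are preferable to hard-coded keys.

import boto3

# Minimal sketch: create an S3 client and list buckets (placeholder region and credentials).
s3 = boto3.client(
    's3',
    region_name='ap-south-1',
    aws_access_key_id='YOUR_AWS_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY',
)

for bucket in s3.list_buckets()['Buckets']:
    print(bucket['Name'])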

HIVE : CREATE & DROP database

HIVE is a data warehouse, not a database. Just as it is important to understand when to use a particular tool, it is equally important to understand when NOT to use it. HIVE is designed only for large-scale analytical operations; it is not a good fit for transactional operations. HIVE data is totally de-normalized. HIVE supports JOINs, but we need to avoid them as much as we can to improve performance. The HIVE query language, HQL, is similar to SQL. Let's understand the relation between Hadoop and HIVE: HDFS has folders and files; HIVE has databases and tables. When we create a database in HIVE, it creates a folder in HDFS. When we create a table in HIVE, it creates a folder in HDFS. When we insert records into a HIVE table, those records are saved in HDFS in the form of files. The delimiter is very important while creating a table in HIVE; it can be a comma, tab etc. HIVE can store structured, semi-structured & un-structured data, but it is important t...
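Since the post's topic is creating and dropping a database, here is a minimal sketch of those two statements issued from Python via PyHive; using PyHive is an assumption (the original post may use the Hive CLI or Beeline instead), and the host, port, user and database name are illustrative placeholders.

from pyhive import hive   # pip install pyhive

# Hypothetical connection details; adjust to your HiveServer2 setup.
conn = hive.Connection(host='localhost', port=10000, username='hadoop')
cursor = conn.cursor()

# Creating a database adds a corresponding folder in HDFS (typically under the Hive warehouse directory).
cursor.execute('CREATE DATABASE IF NOT EXISTS retail_db')

cursor.execute('SHOW DATABASES')
print(cursor.fetchall())

# Dropping the database removes it again; add CASCADE to also drop any tables inside it.
cursor.execute('DROP DATABASE IF EXISTS retail_db')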

Hadoop : HIVE Installation

A person sailing in a boat who knows how to swim will always be safer than a person who doesn't. We all know that we use cloud-based platforms like AWS Athena for working with HIVE these days, but it is very important to understand the basics of HIVE: what HIVE is, how to install it, what the installation modes are, HIVE tables, the concepts involved in formatting the data, SerDe etc. Most of the time, the already-implemented HIVE SerDes are good enough for practical use cases, but what if we land in a situation where we have to write our own SerDe in Java? Hence it is good for a data engineer to have this knowledge, even though we use a UI in the cloud to perform the same activities. So let's learn HIVE in depth and carry this knowledge with us to sail further. What is HIVE? Apache Hive is a data warehouse software project. It is built on top of Apache Hadoop to provide data query and analysis. Hive gives an SQL-like inter...