
Hello World!

Hope you are doing great in life.

I have created this space to share information about the technical world. The world is changing and technology is growing rapidly, and all these new technologies are genuinely interesting. We are in the Artificial Intelligence era! Let's take advantage of it by learning more about the technology around us. More knowledge will give us more confidence, and it will obviously improve our clarity of thought.

I will make sure to write these blogs in common terminology, so that things are clear to everyone from a technical background.

Please watch this space for more technical information on the Big Data world, specifically Data Engineering, Data Analytics, NoSQL databases, Hadoop, Hive, Spark, Scala, Python, Artificial Intelligence, Machine Learning, GenAI & Agentic AI.


Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111

Popular posts from this blog

AWS : Working with Lambda, Glue, S3/Redshift

This is one of the important concepts where we will see how an end-to-end pipeline works in AWS. We are going to see how to continuously monitor a common source like S3/Redshift from Lambda (using Boto3 code), initiate a trigger to start a Glue job (Spark code), and perform some action.  Let's assume that AWS Lambda should initiate a trigger to another AWS service, Glue, as soon as a file is uploaded to an AWS S3 bucket; Lambda should also pass this file's information to Glue, so that the Glue job can perform some transformation and load the transformed data into AWS RDS (MySQL). Understanding the above flow chart: let's assume one of your clients is uploading files (say .csv/.json) to some AWS storage location, for example S3. As soon as a file is uploaded to S3, we need to initiate a TRIGGER in AWS Lambda using Boto3 code. Once this trigger is initiated, another AWS service called Glue (an ETL tool) will start a PySpark job to receive this file information from Lambda, perform so...
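
The Lambda side of this flow can be sketched briefly. Below is a minimal, illustrative Boto3 handler; the Glue job name and argument keys (csv-to-rds-transform, --source_bucket, --source_key) are placeholders chosen for this example, not fixed names from the post.

import json
import urllib.parse
import boto3

# Placeholder Glue job name; replace with the job defined in your AWS account
GLUE_JOB_NAME = "csv-to-rds-transform"

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 put event: read the bucket and object key of the uploaded file
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Start the Glue job, passing the file location as job arguments
    response = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={
            "--source_bucket": bucket,
            "--source_key": key,
        },
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"JobRunId": response["JobRunId"]}),
    }

With an S3 event notification configured on the bucket, this handler fires on every upload and hands the file's location to the Glue job, which can then read, transform, and load the data.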

Spark Core : Understanding RDD & Partitions in Spark

Let us see how to create an RDD in Spark. RDD (Resilient Distributed Dataset): we can create an RDD in two ways, from collections or from datasets. Collections are suitable for small amounts of data; we can't use them for large amounts of data. Datasets (text, CSV, JSON, PDF, image, etc.) are meant for huge amounts of data, so when the data is large we should go with the dataset approach. How do we create an RDD? Using a collection: val list = List(1, 2, 3, 4, 5, 6) followed by val rdd = sc.parallelize(list). Here sc is the SparkContext, and the parallelize() method converts the input (a collection in this case) into an RDD. The type of the RDD is based on the values assigned to the collection; if we assign integers, the RDD will be of type Int. Let's see the Scala code below:
# Created an RDD by providing a Collection (List) as input
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:23
# Printing the RDD using the collect() method
scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4...
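
To complement the collection-based example above, here is a minimal sketch of the dataset-based approach in PySpark (the post itself uses the Scala shell; the file path below is only a placeholder).

from pyspark import SparkContext

sc = SparkContext(appName="rdd-from-dataset")

# textFile() creates an RDD from a file instead of an in-memory collection;
# each line of the file becomes one element of the RDD
rdd = sc.textFile("hdfs:///data/input/sample.txt")  # placeholder path

print(rdd.count())   # action: number of lines in the file
print(rdd.take(5))   # action: first 5 lines

sc.stop()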

Spark Core : Introduction & understanding Spark Context

Apache Spark is a free, open-source engine for processing large amounts of data in parallel across multiple computers. It is used for big data workloads like machine learning, graph processing, and big data analytics. Spark can run on top of Hadoop and is aware of how Hadoop works. Programming languages for Spark: Scala, Python, Java, R, and SQL. Spark supports two kinds of operations: transformations and actions. RDD (Resilient Distributed Dataset): the whole of Spark is built on this base concept, and the same two operations, transformations and actions, are what an RDD supports. Features of Spark: distributed, partitioned, replicated. Note that if data were mutable it would be hard to distribute, partition, and replicate, hence Spark requires immutability. Immutability: we can't change the data. By design, Spark is purely meant for analytical operations (OLAP); it supports transactional operations only through some third-party tools. Cacheable: to reuse data we cache it; if information is static, there is no need to recomput...
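
As a small illustration of these ideas, here is a minimal PySpark sketch showing lazy transformations, actions, and caching (the names and values are chosen just for this example).

from pyspark import SparkContext

sc = SparkContext(appName="spark-intro-demo")

# Transformations are lazy: nothing executes until an action is called
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
evens = numbers.filter(lambda n: n % 2 == 0)   # transformation
doubled = evens.map(lambda n: n * 2)           # transformation

# cache() keeps the computed RDD in memory so repeated actions reuse it
doubled.cache()

print(doubled.collect())   # action: [4, 8, 12]
print(doubled.count())     # action: 3, served from the cached RDD

sc.stop()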