
Posts

Showing posts from February, 2025

Spark Core : Understanding RDD & Partitions in Spark

Let us see how to create an RDD in Spark.

RDD (Resilient Distributed Dataset): We can create an RDD in 2 ways.

From Collections
- For small amounts of data
- We can't use it for large amounts of data

From Datasets
- For huge amounts of data
- Text, CSV, JSON, PDF, image etc.
- When data is large we should go with the Dataset approach

How to create an RDD?

Using collections:

val list = List(1, 2, 3, 4, 5, 6)
val rdd = sc.parallelize(list)

- sc is the Spark Context
- The parallelize() method converts the input (a collection in this case) into an RDD
- The type of the RDD is based on the values assigned to the collection; if we assign integers, the RDD will be of type Int

Let's see the Scala code below:

# Created an RDD by providing a Collection (List) as input
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:23

# Printing the RDD using the collect() method
scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4...
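The shell example above is in Scala; as a rough PySpark equivalent, here is a minimal sketch of creating an RDD from a small collection and checking its partitions. The local[*] master, app name, sample list and partition count are illustrative assumptions, not taken from the post.

from pyspark.sql import SparkSession

# Build a local SparkSession; its SparkContext plays the same role as `sc` in spark-shell.
spark = SparkSession.builder.master("local[*]").appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# parallelize() turns an in-memory collection into an RDD; numSlices sets the partition count.
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=3)

print(rdd.getNumPartitions())  # -> 3
print(rdd.collect())           # -> [1, 2, 3, 4, 5]

spark.stop()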

Kafka Structured streaming : Using Nifi, Spark & Snowflake

This is an end-to-end data flow pipeline created using Apache Nifi, Kafka, Spark structured streaming and Snowflake.

Flow of data in this pipeline:

Server (https://randomuser.me/api/) -----> Nifi (using REST API) -----> Kafka (Kafka brokers) -----> Consumer (Spark structured streaming) -----> Snowflake (to store data)

For this project, I have used the online data generation website above (randomuser.me) to collect the data stream; it acts as the Server.
- Configure Nifi to catch this data stream using the InvokeHTTP processor.
- Configure Nifi to send the data stream to the Kafka producer using another processor called PublishKafkaRecord_2_6, so that data flows continuously from the online website to the Kafka producer.
- Using PyCharm/VSS, write the Kafka consumer code that receives this data with Spark structured streaming and stores it in Snowflake (after doing the required transformations if needed). A minimal consumer sketch follows this excerpt.

What knowledge is required to understand this pipeline? We need to ...
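To make the consumer side concrete, here is a minimal, hedged PySpark sketch of reading the Nifi-published records from Kafka with structured streaming. The broker address (localhost:9092), topic name (randomuser) and console sink are assumptions for illustration; the real pipeline would replace the sink with a Snowflake writer (for example via foreachBatch and the Snowflake Spark connector), and running it requires the spark-sql-kafka package on the Spark classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("randomuser-consumer").getOrCreate()

# Subscribe to the topic that Nifi publishes to; broker address and topic name are assumptions.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "randomuser")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; cast the value to a string to get the JSON record.
records = raw.select(col("value").cast("string").alias("json_value"))

# Console sink for a quick sanity check; swap this for a Snowflake writer in the real flow.
query = (records.writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()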

Spark Core : Introduction & understanding Spark Context

Apache Spark is a free, open-source engine for processing large amounts of data in parallel across multiple computers. It is used for big data workloads like machine learning, graph processing and big data analytics. Spark can run on top of Hadoop and is aware of how Hadoop works.

Programming languages for Spark:
- Scala
- Python
- Java
- R
- SQL

Spark supports 2 kinds of operations:
- Transformations
- Actions

RDD (Resilient Distributed Dataset): all of Spark is built on the base concept called RDD. The 2 operations above (Transformations and Actions) are the ones supported by an RDD.

Features of Spark:
- Distributed, Partitioned, Replicated. Note that if data is mutable it is hard to distribute, partition and replicate; hence Spark requires immutability.
- Immutability: we can't change the data. By design, Spark is aimed at analytical operations (OLAP); it does support transactional operations using some 3rd-party tools.
- Cacheable: to reuse data we cache it. If information is static, there is no need to recomput...
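To illustrate the two kinds of RDD operations the post introduces, here is a small PySpark sketch; the sample data, lambda functions and local[*] master are illustrative assumptions rather than anything taken from the post.

from pyspark.sql import SparkSession

# A local SparkSession; its SparkContext is the "Spark Context" the post discusses.
spark = SparkSession.builder.master("local[*]").appName("spark-context-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

doubled = numbers.map(lambda x: x * 2)        # transformation: lazily defines a new RDD
evens = doubled.filter(lambda x: x % 4 == 0)  # another transformation; nothing has run yet

print(evens.collect())  # action: triggers the actual computation -> [4, 8, 12]

spark.stop()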

Python : Python for Spark

Python is a general-purpose programming language that is used for a variety of tasks like web development, data analytics, etc. Python began as a procedural scripting language, and object-oriented and functional programming styles are also supported. We will see what basics we need in Python to play with Spark.

In case you want to practice Spark in a Big Data environment, you can use Databricks.

URL: https://community.cloud.databricks.com
- This is a tool that programmers use in real-world production environments
- Both the Community edition (free version with limited support) and paid versions are available
- Register for the tool online for free and practice

Indentation is very important in Python. We don't use braces in Python like we do in Java; the scope of a block/loop/definition is interpreted based on the indentation of the code.

Correct Indentation:

def greet():
    print("Hello!")  # Indented correctly
    print("Welcome ...
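Since the excerpt cuts off mid-example, here is a small self-contained sketch of the same point: block scope in Python is defined purely by indentation, with no braces. The function name, loop and sample data are illustrative.

# Everything indented under `def` belongs to the function body,
# and everything indented under `for` belongs to the loop body.
def greet(names):
    for name in names:            # the loop header ends with a colon
        print(f"Hello, {name}!")  # two levels of indentation: inside the loop
    print("Done greeting.")       # one level: inside the function, outside the loop

greet(["Alice", "Bob"])

# A mis-indented line is a syntax error, e.g. the following raises IndentationError:
# def broken():
# print("oops")   # not indented under the def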