In this blog, we are going to see how Spark Streaming works, and how to write Spark code that reads streaming data and stores it in another place, let's say AWS RDS.
Spark Architecture :
We have the below layers in the Spark architecture.
- Data Storage (HDFS, HBase, Cassandra, Amazon S3)
- Resource Management (Hadoop Yarn, Apache Mesos, Kubernetes)
- Processing Engine (Spark Core)
- Libraries (Spark Streaming, Spark SQL, GraphX, MLlib)
- APIs (Scala, Python, Java, R)
Spark core :
Spark core is the heart of the Spark architecture. By default, it processes only batch data (historical data), and on top of it we use libraries to perform additional activities. For example, if you need to run SQL queries on top of Spark, Spark core won't support that directly, hence we use Spark SQL.
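For instance, with the Spark SQL library you can register a data frame as a temporary view and query it with SQL. A minimal sketch (the data and names here are made up just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
# Sample data frame, registered as a temporary view so SQL can be run on it
df = spark.createDataFrame([("bob", 36, "hyd"), ("sri", 35, "hyd")], ["name", "age", "city"])
df.createOrReplaceTempView("people")
spark.sql("SELECT city, COUNT(*) AS cnt FROM people GROUP BY city").show()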
We have to understand that Spark has a limitation here: it can't read live incoming traffic record by record. At the very least, we need to let the incoming traffic accumulate for a few seconds, and only then read and process it as a micro-batch. This interval is configurable using a property in the Spark code. We will see more about it in the sections below.
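For example, in PySpark this wait is the batch interval passed while creating the streaming context. A minimal sketch, assuming a SparkSession named spark already exists (we will use exactly this in the code below):

from pyspark.streaming import StreamingContext

# Collect incoming records into a micro-batch every 10 seconds before processing
ssc = StreamingContext(spark.sparkContext, 10)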
Apache Flink is an open-source Apache tool that is a better fit for processing real-time data (live streaming), whereas Spark Streaming follows a micro-batch model and is a better fit for batch-style processing. To keep it simple: use Spark for historical data, and Apache Flink for true live data processing.
More information about Apache Flink at : https://www.tutorialspoint.com/apache_flink/apache_flink_batch_realtime_processing.htm
Let's get into the use case now. Below are the prerequisites :
- Create an EC2 instance to send some live data using the netcat command
- Using Databricks/EMR/local computer, create a Spark environment
- Write Spark code to read incoming data
Use the below command after connecting to the Ubuntu EC2 instance. This acts as the source that will send the live data :
ubuntu@ip-172-31-12-58:~$ nc -lk 1111
bob,36,hyd
sri,35,hyd
thoshi,01,hyd
ashu,34,hyd
ram,60,hyd
Now log in to Databricks, create a Spark environment, and run the below pieces of code :
- To create or get the Spark session
- To set the Streaming context; '10' tells streaming to collect and process the incoming stream every 10 seconds
- To receive the stream from the above EC2 instance in AWS using port number '1111' (this port must be opened in the instance's security group, under the Security tab in the AWS console)
- To print the received stream in the Databricks console
lines.pprint()
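Putting the above steps together, here is a minimal PySpark sketch. The application name and the EC2 hostname are placeholders; replace the hostname with your instance's public DNS:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Create or get the Spark session
spark = SparkSession.builder.appName("LiveFeedToRDS").getOrCreate()

# Streaming context with a 10-second batch interval
ssc = StreamingContext(spark.sparkContext, 10)

# Receive the stream from the EC2 instance on port 1111 (placeholder hostname)
lines = ssc.socketTextStream("ec2-xx-xx-xx-xx.compute-1.amazonaws.com", 1111)

# Print each received micro-batch in the console
lines.pprint()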
- Create a data frame from the RDD using a lambda
- Filter/group the records where city = "hyd"
- Connect to AWS RDS (MySQL)
- Create a table with name livefeb4
- Append records into this table
Spark code for this activity :
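The full code is not reproduced here, so below is a minimal sketch of how these steps could look in PySpark, continuing from the streaming setup above. The RDS endpoint, database name, user, and password are placeholders (assumptions), and the JDBC write assumes the MySQL connector JAR is available on the cluster:

def save_to_rds(time, rdd):
    if rdd.isEmpty():
        return
    # Create a data frame from the RDD using a lambda to split each CSV line
    rows = rdd.map(lambda line: line.split(",")) \
              .map(lambda f: (f[0], int(f[1]), f[2]))
    df = spark.createDataFrame(rows, ["name", "age", "city"])
    # Keep only the records where city = "hyd"
    hyd_df = df.filter(df.city == "hyd")
    # Append the records into the livefeb4 table in AWS RDS (MySQL);
    # the table is created on the first write if it does not exist
    hyd_df.write \
        .format("jdbc") \
        .option("url", "jdbc:mysql://<rds-endpoint>:3306/<database>") \
        .option("dbtable", "livefeb4") \
        .option("user", "<username>") \
        .option("password", "<password>") \
        .mode("append") \
        .save()

lines.foreachRDD(save_to_rds)

# Start the streaming job and keep it running
ssc.start()
ssc.awaitTermination()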
Now, after this job completes, you can see a table with the name livefeb4 created in RDS.
Let's learn more information in coming blogs. Have a great day!
Arun Mathe
Gmail ID : arunkumar.mathe@gmail.com
Contact No : +91 9704117111