
AWS : Athena

What is Athena?

Amazon Athena is a serverless, interactive query service that allows users to analyze data stored in Amazon Simple Storage Service (S3) using standard SQL. To put it simply, it is a kind of workbench for running HIVE-style queries in a more visual way.

Athena is built on open-source technology: AWS based its DDL and table definitions on Apache HIVE (with a Presto-based SQL engine underneath) and released it as a managed service called Athena.


The Athena home page looks as below (log in to AWS and search for Athena):

Once you are on the Athena home page, it shows a prompt on the right-hand side; click Launch to open the Athena query editor (as shown in the above screenshot).

Also, before we start working with Athena, we need to set up a query result location in Amazon S3. Let's understand why: if you remember, HIVE does not have storage of its own; it needs a relational database to store schema information (the metastore) and HDFS to store table data (records). HIVE is just a processing layer on top of Hadoop.

Similarly, Athena needs a query result location in Amazon S3 where the results of every query are saved. So we need to create an S3 bucket and have some folders in it.


Hence I have uploaded asl.csv into my S3 bucket: s3://arun2025/input/asl/ (note that HIVE tables point to folders, not individual files).

We need to set the query result location (as shown in the below screenshot) to store results:


Now come back to the Athena editor.


Now that the environment is ready, let's see how to use Athena:

  1. The first step is to create a database: CREATE DATABASE arundb;

  2. Now create an external table over the uploaded data as shown below (paste the DDL into the editor and click Run). Since the table is EXTERNAL, no data is actually loaded; the table simply points at the S3 folder.
       Example:
       CREATE EXTERNAL TABLE asltab (
         name STRING,
         age  INT,
         city STRING
       )
       ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
       WITH SERDEPROPERTIES (
         "separatorChar" = ",",
         "quoteChar"     = "\""
       )
       LOCATION 's3://arun2025/input/asl/'
       TBLPROPERTIES ("skip.header.line.count" = "1");

        
            Note that on the left-hand side you can see the table information.
        
    3. Now run a query to check that we get results back (one more example follows this list).
        Ex : SELECT * FROM asltab WHERE age > 30;
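
Once the basic SELECT works, any standard SQL can be run against the table. Here is a small sketch of an aggregation (the column names are the ones from the asltab DDL above; how cleanly AVG(age) parses depends on the data, since OpenCSVSerde first reads every field as a string):

        -- count rows per city and compute the average age in each city
        SELECT city,
               COUNT(*) AS people,
               AVG(age) AS avg_age
        FROM asltab
        GROUP BY city
        ORDER BY people DESC;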


One of the main features AWS Athena offers is that we can create a HIVE table from the Athena UI itself; let's see how to do it.

Click the 'Create' button beside the 'Tables and views' section on the left-hand side of the Athena editor and choose 'S3 bucket data'.


It will then take us to the page below.


Now enter a table name, choose an existing database, set a SerDe, and under 'Column details' add the columns one by one, or use the 'Bulk add columns' option to add them all at once.


Enter the column details as comma-separated name/type pairs (see the sketch below) and click 'Create table' to create the HIVE table.
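
As a rough sketch of the 'Bulk add columns' input (using the bank dataset whose DDL appears just below), the text to paste is simply comma-separated pairs of column name and type:

    age int, job string, marital string, education string, default1 string,
    balance int, housing string, loan string, contact string, day string,
    month string, duration string, campaign string, pdays string,
    previous string, poutcome string, y string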

Example (the resulting DDL):

CREATE EXTERNAL TABLE IF NOT EXISTS `arundb`.`s3table` (
  `age` int,
  `job` string,
  `marital` string,
  `education` string,
  `default1` string,
  `balance` int,
  `housing` string,
  `loan` string,
  `contact` string,
  `day` string,
  `month` string,
  `duration` string,
  `campaign` string,
  `pdays` string,
  `previous` string,
  `poutcome` string,
  `y` string
) COMMENT "table from s3"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ';')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://arun2025/input/banktab/'
TBLPROPERTIES (
  'classification' = 'csv',
  'skip.header.line.count' = '1'
);


So the above DDL created a table over the data at s3://arun2025/input/banktab/, and query results are saved under the result location we configured earlier (s3://arun2025/temp/).

Likewise, we can write whatever HIVE queries we need, read data from S3 buckets, analyze it, and have the results saved into the temp location in S3.
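
For example, a quick analysis query on the s3table defined above (the column names come from that DDL; the result set is written to the configured query result location):

    SELECT job,
           COUNT(*) AS customers
    FROM arundb.s3table
    GROUP BY job
    ORDER BY customers DESC;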



Points to remember :
  • DDL statements such as CREATE, ALTER and DROP are free, but queries are charged based on the amount of data they scan, with a 10 MB minimum per query (so SELECT queries are chargeable, and even small scans are billed as 10 MB)
  • More information about pricing at https://aws.amazon.com/athena/pricing/
  • We need to understand more about SerDes; please check the SerDe section below.

SerDe :
  • SerDe is an acronym for Serializer/Deserializer in Apache HIVE
  • It is a framework that tells HIVE how to process records, such as reading data from a table and writing it back to HDFS
  • Users can write their own SerDe for their own formats
  • It is basically a Java class which has the code to read the input data and then write the result
  • I will explain more about SerDe in my HIVE blog which I will post in coming days.
  • But for now, keep in mind that there are enough SerDes available to read all common file formats; we just need to understand which one to use (a small example follows this list)
  • As we are using the cloud, the console has options to select the required SerDe, but it is still important to understand what a SerDe is, just in case we land in a situation where we have to write our own
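
As a hedged illustration of switching SerDes (the path s3://arun2025/input/jsondata/ is hypothetical), a table over JSON files follows the same DDL pattern with the OpenX JSON SerDe swapped in:

    CREATE EXTERNAL TABLE IF NOT EXISTS arundb.jsontab (
      name STRING,
      age  INT,
      city STRING
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://arun2025/input/jsondata/';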

Let's learn more about HIVE in future blogs. Have a great day!


Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111













