
AWS : Athena

What is Athena?

Amazon Athena is a serverless, interactive query service that allows users to analyze data stored in Amazon Simple Storage Service (S3) using standard SQL. To put it simply, it is a kind of workbench for running HIVE-style queries in a more visual way.

Athena is built on open-source technology: AWS based its DDL and table definitions on Apache HIVE (with a Presto-based SQL engine underneath) and released it as a managed service called Athena.


The Athena home page looks as below (log in to AWS and search for Athena):

Once you are on the Athena home page, it shows a prompt on the right-hand side; click Launch to open the Athena query editor (as shown in the above screenshot).

Also, before we start working with Athena, we need to set up a query result location in Amazon S3. Let's understand why: if you remember, HIVE does not have storage of its own; it needs a relational database to store schema information (the metastore) and HDFS to store table data (records). HIVE is just a processing layer on top of Hadoop.

Similarly, Athena needs a query result location in Amazon S3 where the results of every query are saved. So we need to create an S3 bucket and have some folders in it.


Hence I have uploaded asl.csv into my S3 bucket: s3://arun2025/input/asl/ (note that HIVE tables point to folders, not individual files).

We need to set the query result location (as shown in the below screenshot) to store results:


Now come back to the Athena editor.


Now that the environment is ready, let's see how to use Athena:

  1. The first step is to create a database: CREATE DATABASE arundb;

  2. Now create an external table over the uploaded data as shown below (paste the DDL into the editor and click Run). Since the table is EXTERNAL, no data is actually loaded; the table simply points at the S3 folder.
       Example:
       CREATE EXTERNAL TABLE asltab (
         name STRING,
         age  INT,
         city STRING
       )
       ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
       WITH SERDEPROPERTIES (
         "separatorChar" = ",",
         "quoteChar"     = "\""
       )
       LOCATION 's3://arun2025/input/asl/'
       TBLPROPERTIES ("skip.header.line.count" = "1");

        
            Note that on the left-hand side you can see the table information.
        
    3. Now run a query to check that we get results back (one more example follows this list).
        Ex : SELECT * FROM asltab WHERE age > 30;
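
Once the basic SELECT works, any standard SQL can be run against the table. Here is a small sketch of an aggregation (the column names are the ones from the asltab DDL above; how cleanly AVG(age) parses depends on the data, since OpenCSVSerde first reads every field as a string):

        -- count rows per city and compute the average age in each city
        SELECT city,
               COUNT(*) AS people,
               AVG(age) AS avg_age
        FROM asltab
        GROUP BY city
        ORDER BY people DESC;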


One of the main features AWS Athena offers is that we can create a HIVE table from the Athena UI itself; let's see how to do it.

Click the 'Create' button beside the 'Tables and views' section on the left-hand side of the Athena editor and choose 'S3 bucket data'.


It will then take us to the page below.


Now enter a table name, choose an existing database, set a SerDe, and under 'Column details' add the columns one by one, or use the 'Bulk add columns' option to add them all at once.


Enter the column details as comma-separated name/type pairs (see the sketch below) and click 'Create table' to create the HIVE table.
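
As a rough sketch of the 'Bulk add columns' input (using the bank dataset whose DDL appears just below), the text to paste is simply comma-separated pairs of column name and type:

    age int, job string, marital string, education string, default1 string,
    balance int, housing string, loan string, contact string, day string,
    month string, duration string, campaign string, pdays string,
    previous string, poutcome string, y string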

Example (the resulting DDL):

CREATE EXTERNAL TABLE IF NOT EXISTS `arundb`.`s3table` (
  `age` int,
  `job` string,
  `marital` string,
  `education` string,
  `default1` string,
  `balance` int,
  `housing` string,
  `loan` string,
  `contact` string,
  `day` string,
  `month` string,
  `duration` string,
  `campaign` string,
  `pdays` string,
  `previous` string,
  `poutcome` string,
  `y` string
) COMMENT "table from s3"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ';')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://arun2025/input/banktab/'
TBLPROPERTIES (
  'classification' = 'csv',
  'skip.header.line.count' = '1'
);


So the above DDL created a table over the data at s3://arun2025/input/banktab/, and query results are saved under the result location we configured earlier (s3://arun2025/temp/).

Likewise, we can write whatever HIVE queries we need, read data from S3 buckets, analyze it, and have the results saved into the temp location in S3.
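
For example, a quick analysis query on the s3table defined above (the column names come from that DDL; the result set is written to the configured query result location):

    SELECT job,
           COUNT(*) AS customers
    FROM arundb.s3table
    GROUP BY job
    ORDER BY customers DESC;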



Points to remember :
  • DDL statements such as CREATE, ALTER and DROP are free, but queries are charged based on the amount of data they scan, with a 10 MB minimum per query (so SELECT queries are chargeable, and even small scans are billed as 10 MB)
  • More information about pricing at https://aws.amazon.com/athena/pricing/
  • We need to understand more about SerDes; please check the SerDe section below.

SerDe :
  • SerDe is an acronym for Serializer/Deserializer in Apache HIVE
  • It is a framework that tells HIVE how to process records, such as reading data from a table and writing it back to HDFS
  • Users can write their own SerDe for their own formats
  • It is basically a Java class which has the code to read the input data and then write the result
  • I will explain more about SerDe in my HIVE blog which I will post in coming days.
  • But for now, keep in mind that there are enough SerDes available to read all common file formats; we just need to understand which one to use (a small example follows this list)
  • As we are using the cloud, the console has options to select the required SerDe, but it is still important to understand what a SerDe is, just in case we land in a situation where we have to write our own
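
As a hedged illustration of switching SerDes (the path s3://arun2025/input/jsondata/ is hypothetical), a table over JSON files follows the same DDL pattern with the OpenX JSON SerDe swapped in:

    CREATE EXTERNAL TABLE IF NOT EXISTS arundb.jsontab (
      name STRING,
      age  INT,
      city STRING
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://arun2025/input/jsondata/';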

Let's learn more about HIVE in future blogs. Have a great day!


Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111













