HIVE : CREATE & DROP database

HIVE is a data warehouse; it is not a database.

Just as it is important to understand when to use a particular tool, it is equally important to understand when NOT to use it.

  • HIVE is designed only for large-scale analytical operations; it is not a good fit for transactional operations.
  • HIVE data is totally de-normalized.
  • HIVE supports JOINs, but we should avoid them as much as possible to improve performance.
  • HIVE's query language, HQL, is similar to SQL.

Let's understand the relationship between Hadoop and HIVE :

  • HDFS has folders and files
  • HIVE has databases and tables
  • When we create a database in HIVE, it creates a folder in HDFS
  • When we create a table in HIVE, it creates a folder in HDFS
  • When we insert records into a HIVE table, those records are saved in HDFS in the form of files
  • The delimiter is very important while creating a table in HIVE (see the sketch after the table below)
  • The delimiter can be a comma, tab, etc.
  • HIVE can store structured, semi-structured & un-structured data, but it is important to convert un-structured data into a HIVE-understandable format using a SerDe; we will see more about it in further blogs of HIVE


              Table in HIVE                     How it is stored in HDFS as a file

               C1    C2    C3                        11,12,13
               11    12    13                        21,22,23
               21    22    23
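To make this mapping concrete, here is a minimal HiveQL sketch that produces the layout shown above (the names demo_db and demo_table are hypothetical, and I am assuming a comma delimiter to match the file contents above) :

-- hypothetical names, used only to illustrate the HDFS mapping
CREATE DATABASE IF NOT EXISTS demo_db;       -- creates a folder in HDFS

CREATE TABLE demo_db.demo_table (
    c1 INT,
    c2 INT,
    c3 INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';                    -- comma is the delimiter

-- each inserted row becomes one comma-separated line (e.g. 11,12,13)
-- in a file under the table's folder in HDFS (needs Hive 0.14+ for VALUES)
INSERT INTO demo_db.demo_table VALUES (11, 12, 13), (21, 22, 23);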


Note :

Always write your HIVE queries in a notepad and copy-paste them into the HIVE prompt, because the HIVE prompt won't let you go back and fix the query if there is a mistake.
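Alternatively, you can save your queries in a .hql file and run the whole file non-interactively with the Hive CLI's -f option, e.g. hive -f <FILE_NAME>.hql, which avoids re-typing at the prompt.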



Create a Database in HIVE :

Syntax 1 :

CREATE DATABASE IF NOT EXISTS <DB_NAME>;
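For example, the following creates the sample1 database that the DROP examples later in this post refer to :

CREATE DATABASE IF NOT EXISTS sample1;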


Syntax 2 :

CREATE DATABASE IF NOT EXISTS <DB_NAME>
COMMENT 'COMMENT ON DATABASE'
LOCATION 'PATH OF THE DATABASE LOCATION'
WITH DBPROPERTIES('KEY1' = 'VALUE1', 'KEY2' = 'VALUE2', ….);


You can use LOCATION to explicitly specify where in HDFS you want the database to be stored :

Example :

create database IF NOT EXISTS sample3
COMMENT 'sample 3 database'
LOCATION '/hive/sample3'
;



The LOCATION path should be a folder; it won't work if it is a file path, because in HIVE a table is saved as a folder and its records are saved as files.


We can also add some properties as shown below :

Example :

create database IF NOT EXISTS sample4
COMMENT 'sample 4 database'
LOCATION '/hive/sample4'
WITH DBPROPERTIES('key1' = 'value1', 'key2' = 'value2')
;

These properties are useful while working with other systems like Spark and other NoSQL databases, as a way to pass them access information.
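If you need to change these properties after the database is created, Hive's ALTER DATABASE statement can update them (a minimal sketch; the key and value below are placeholders) :

ALTER DATABASE sample4 SET DBPROPERTIES('key1' = 'new_value1');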


Using the EXTENDED keyword while describing a database prints additional information as well, such as the key/value pairs we supplied in the WITH DBPROPERTIES clause while creating the sample4 database.

Describe database EXTENDED sample4; 
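The output will look roughly like the line below (illustrative only; the exact columns, HDFS URI and owner depend on your cluster and Hive version) :

sample4    sample 4 database    hdfs://namenode:8020/hive/sample4    hive    USER    {key1=value1, key2=value2}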



Note :

  • WHEN WE CREATE A DATABASE / TABLE, THE LOCATION PATH MUST BE A FOLDER PATH, NOT A FILE PATH
    • IF THE FOLDER PATH EXISTS, HIVE WILL USE IT
    • IF THE FOLDER PATH DOES NOT EXIST, HIVE WILL CREATE IT
    • IF A FILE EXISTS AT THAT PATH, HIVE WILL THROW AN ERROR



DROP DATABASE in HIVE :

Syntax :

  • DROP DATABASE <DATABASE_NAME>;
  • DROP DATABASE IF EXISTS <DATABASE_NAME>;
  • DROP DATABASE IF EXISTS <DATABASE_NAME> RESTRICT;
  • DROP DATABASE IF EXISTS <DATABASE_NAME> CASCADE;

Examples :

  • DROP database IF EXISTS sample1;
  • DROP database IF EXISTS sample1 RESTRICT; ==> RESTRICT (the default) refuses to drop the database if it still has tables in it
  • DROP database IF EXISTS sample1 CASCADE; ==> This is like a force drop; it drops the database even if there are tables associated with sample1
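As a quick walk-through of RESTRICT vs CASCADE (a hypothetical sequence; t1 is a placeholder table name) :

-- make sample1 non-empty so the difference is visible
CREATE DATABASE IF NOT EXISTS sample1;
CREATE TABLE sample1.t1 (id INT);

DROP DATABASE IF EXISTS sample1 RESTRICT;   -- fails: sample1 still has table t1
DROP DATABASE IF EXISTS sample1 CASCADE;    -- succeeds: drops t1, then sample1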


We will see more about creating tables in the next blog. Have a great day!



Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111


