Skip to main content

HIVE : How to create a database, table and load data

 Please keep below points in mind before creating a table in HIVE.

  • HIVE is not a database but it is a data warehouse which is designed for data analytics(OLAP)
  • HIVE doesn't have any internal storage to store data, it will use HDFS to store table data and RDBMS(like MySQL) to store schema related information. Hence we need to have a RDBMS like MySQL.
  • When we create a database, HIVE will create a folder in HDFS. Later when we create a HIVE table, it will create another folder inside database folder.
  • After creating database, table, once we load/insert data into it then that data will be stored as a file in HDFS and schema information will be stored in MySQL.
  • Delimiter is very important while loading the data, we will see some examples below.
  • As this is analytics, we know what type of data is there(as data is historical) and we will create schema based on the data. Based on the data, we need to create a table.
HOW TO CREATE DATABASE
-----------------------------
CREATE DATABASE IF NOT EXISTS kalyan;

CREATE DATABASE IF NOT EXISTS kalyan
COMMENT 'My database name is kalyan'
LOCATION '/hive/kalyan'
WITH DBPROPERTIES ('key1' = 'value1', 'key2' = 'value2');

DESCRIBE DATABASE kalyan;
DESCRIBE DATABASE EXTENDED kalyan; // It will show detailed table information


HOW TO DROP DATABASE
-----------------------------
DROP DATABASE IF EXISTS kalyan; // To drop an empty database

DROP DATABASE IF EXISTS kalyan CASCADE; // Use to drop a database which has tables

DROP DATABASE IF EXISTS kalyan RESTRICT; 



HOW TO USE DATABASE
-----------------------------
USE kalyan;


HOW TO ALTER DATABASE
-----------------------------
ALTER DATABASE kalyan SET DBPROPERTIES ('key1' = 'value11', 'key3' = 'value3');

ALTER DATABASE kalyan SET OWNER USER hadoop; 

ALTER DATABASE kalyan SET OWNER ROLE hadoop; 


Example of a HIVE table :
  • Always create EXTERNAL tables in production, otherwise we will land into data loss scenarios.
  • Observe that FIELDS(COLUMNS) are terminated by '\t' ; that means columns are tab separated
  • Observe that LINES are terminated by '\n' ; that means lines are separated by new line character
  • We are going to store this as text file
  • If we don't mention a location as below table then by default file will store in default warehouse location mentioned in hive-site.xml (property will be mentioned as below)
    • <property>
    • <name>hive.metastore.warehouse.dir</name>
    • <value>/home/arun/work/warehouse</value>
    • </property>
  • Note that we have used delimiters, hence we need to load data which is in the above format otherwise entire data will be loaded into one FIELD/COLUMN and rest of the columns will be filled with NULL values

CREATE EXTERNAL TABLE IF NOT EXISTS student 
name string comment 'student name', 
id int comment 'student id', 
course string comment 'student course', 
year int comment 'student year'
)
COMMENT 'My table name is student'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'   // By default FIELD separator is "ctrl + a" or in unicode \001
LINES TERMINATED BY '\n'    // By default LINE separator is \n ; so this line is optional
STORED AS TEXTFILE            //  By default it will store a text file ; so this line is also optional
LOCATION '/hive/arun/student'
TBLPROPERTIES ('key1'='value1', 'key2'='value2')
;

Incase if we want to store the data which is "," separated then we have to write as below :

CREATE EXTERNAL TABLE IF NOT EXISTS student 
name string comment 'student name', 
id int comment 'student id', 
course string comment 'student course', 
year int comment 'student year'
)
COMMENT 'My table name is student'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','   // By default FIELD separator is "ctrl + a" or in unicode \001
LINES TERMINATED BY '\n'    // By default LINE separator is \n ; so this line is optional
STORED AS TEXTFILE            //  By default it will store a text file ; so this line is also optional
LOCATION '/hive/arun/student'
TBLPROPERTIES ('key1'='value1', 'key2'='value2')
;

HOW TO KNOW INFORMATION ABOUT TABLE
----------------------------------------------------------------
DESCRIBE tbl_name;

DESCRIBE EXTENDED tbl_name; // To get more information about table

DESCRIBE FORMATTED tbl_name; // To get more information about table in structured format


HOW TO LOAD DATA INTO TABLE
-----------------------------------------------
  • LOAD DATA LOCAL INPATH '/home/arun/work/hive_inputs/student.txt' OVERWRITE INTO TABLE student2;

  • LOAD DATA LOCAL INPATH '/home/arun/work/hive_inputs/student.txt' INTO TABLE student2;

hadoop fs -mkdir -p /home/arun/work/hive_inputs

hadoop fs -put /home/arun/work/hive_inputs/student.txt /home/orienit/work/hive_inputs/student.txt

  • LOAD DATA INPATH '/home/arun/work/hive_inputs/student.txt' OVERWRITE INTO TABLE student2;

hadoop fs -put /home/arun/work/hive_inputs/student.txt /home/orienit/work/hive_inputs/student.txt

  • LOAD DATA INPATH '/home/arun/work/hive_inputs/student.txt' INTO TABLE student2;


Points to remember during data loading : 
  • We can load data in 4 types as shown above
    • local to HDFS (with overwrite - old data will be deleted)
    • local to HDFS(without overwrite - old data will be retained, new data will be appended)
    • HDFS to  HDFS (with overwrite - old data will be deleted)
    • HDFS to HDFS (without overwrite - old data will be retained , new data will be appended)
  • When we load data from local to HDFS then it is a COPY operation
  • When we load data from HDFS to HDFS then it is a MOVE operation

Note :  We don't load data this way in production, we will generally do something like "INSERT OVERWRITE" etc. which we will see in further blogs



Thanks,
Arun Mathe
Email ID : arunkumar.mathe@gmail.com

Comments

Popular posts from this blog

(AI #1) Deep Learning and Neural Networks

I was curious to learn Artificial Intelligence and thinking what is the best place to start learning, and then realized that Deep Learning and Neural Networks is the heart of AI. Hence started diving into AI from this point. Starting from today, I will write continuous blogs on AI, especially Gen AI & Agentic AI. Incase if you are interested on above topics then please watch out this space. What is Artificial Intelligence, Machine Learning & Deep Learning ? AI can be described as the effort to automate intellectual tasks normally performed by Humans. Is this really possible ? For example, when we see an image with our eyes, we will identify it within a fraction of milliseconds. Isn't it ? For a computer, is it possible to do the same within same time limit ? That's the power we are talking about. To be honest, things seems to be far advanced than we actually thing about AI.  BTW, starting from this blog, it is not just a technical journal, we talk about internals here. ...

Spark Core : Understanding RDD & Partitions in Spark

Let us see how to create an RDD in Spark.   RDD (Resilient Distributed Dataset): We can create RDD in 2 ways. From Collections For small amount of data We can't use it for large amount of data From Datasets  For huge amount of data Text, CSV, JSON, PDF, image etc. When data is large we should go with Dataset approach     How to create an RDD ? Using collections val list = List(1, 2, 3, 4, 5, 6) val rdd = sc.parallelize(list) SC is Spark Context parallelize() method will convert input(collection in this case) into RDD Type of RDD will be based on the values assigned to collection, if we assign integers and RDD will be of type int Let's see below Scala code : # Created an RDD by providing a Collection(List) as input scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5)) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:23 # Printing RDD using collect() method scala> rdd.collect() res0: Array[Int] = Array(1, 2, 3, 4...

(AI #3) Deep Learning Foundations - Activation & Loss Functions, Gradient Descent algorithms & Optimization techniques

It is extremely important to have a deep knowledge while designing a machine learning model, otherwise we will end up creating ML models which are of no use. We have to have a clear understanding on certain techniques to confidently build a ML model, train it using "training data", finalize the model and to deploy it in production. So far, from blog #1, #2, we have seen about the fundamentals of Deep Learning and Neural Network, architecture of a Neural Network, internal layers and components etc.  Providing the links of Blogs #1 , #2 below for quick reference. Deep Learning & Neural Networks : https://arunsdatasphere.blogspot.com/2026/01/deep-learning-and-neural-networks.html Building a real world neural network: A practical usecase explained : https://arunsdatasphere.blogspot.com/2026/01/building-real-world-neural-network.html Now let's dive through below concepts/criteria to help gaining confidence on building your ML model: Activation Functions (Forward Propaga...