
HIVE : How to create a database, table and load data

Please keep the following points in mind before creating a table in Hive.

  • Hive is not a database; it is a data warehouse designed for data analytics (OLAP).
  • Hive doesn't have any internal storage of its own: it uses HDFS to store table data and an RDBMS (like MySQL) to store schema-related information. Hence we need an RDBMS such as MySQL.
  • When we create a database, Hive creates a folder in HDFS. Later, when we create a Hive table, it creates another folder inside the database folder (see the example listing just after this list).
  • After creating the database and table, once we load/insert data, the data is stored as files in HDFS and the schema information is stored in MySQL.
  • The delimiter is very important while loading the data; we will see some examples below.
  • As this is analytics, the data is historical, so we already know what type of data is there and we create the table schema based on it.
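
For example, assuming the default warehouse directory is /home/arun/work/warehouse (the value used in the hive-site.xml snippet later in this post), the folders created for a database named kalyan and a table named student can be listed like this:

hadoop fs -ls /home/arun/work/warehouse             # database folder, typically named kalyan.db
hadoop fs -ls /home/arun/work/warehouse/kalyan.db   # table folder named student
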
HOW TO CREATE DATABASE
-----------------------------
CREATE DATABASE IF NOT EXISTS kalyan;

CREATE DATABASE IF NOT EXISTS kalyan
COMMENT 'My database name is kalyan'
LOCATION '/hive/kalyan'
WITH DBPROPERTIES ('key1' = 'value1', 'key2' = 'value2');

DESCRIBE DATABASE kalyan;
DESCRIBE DATABASE EXTENDED kalyan;   -- shows detailed database information, including DBPROPERTIES
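
We can also list the databases that exist, optionally filtering by a name pattern:

SHOW DATABASES;
SHOW DATABASES LIKE 'kal*';   -- only databases whose names match the pattern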


HOW TO DROP DATABASE
-----------------------------
DROP DATABASE IF EXISTS kalyan;   -- drops an empty database

DROP DATABASE IF EXISTS kalyan CASCADE;   -- drops a database together with the tables inside it

DROP DATABASE IF EXISTS kalyan RESTRICT;   -- default behaviour; fails if the database still contains tables



HOW TO USE DATABASE
-----------------------------
USE kalyan;
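
To confirm which database is currently active, Hive provides a built-in function:

SELECT current_database();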


HOW TO ALTER DATABASE
-----------------------------
ALTER DATABASE kalyan SET DBPROPERTIES ('key1' = 'value11', 'key3' = 'value3');

ALTER DATABASE kalyan SET OWNER USER hadoop; 

ALTER DATABASE kalyan SET OWNER ROLE hadoop; 
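
To verify that the properties or owner actually changed, describe the database again:

DESCRIBE DATABASE EXTENDED kalyan;   -- the updated DBPROPERTIES and owner appear in the output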


Example of a Hive table:
  • Always create EXTERNAL tables in production; otherwise dropping a managed table also deletes its data in HDFS, which leads to data loss scenarios.
  • Observe that FIELDS (COLUMNS) are terminated by '\t'; that means columns are tab separated.
  • Observe that LINES are terminated by '\n'; that means rows are separated by the newline character.
  • We are going to store this as a text file.
  • If we don't mention a LOCATION as in the table below, then by default the files are stored under the default warehouse location configured in hive-site.xml (the property looks like this):

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/home/arun/work/warehouse</value>
    </property>

  • Note that we have specified delimiters, so we need to load data in exactly this format; otherwise the entire row ends up in the first FIELD/COLUMN and the remaining columns are filled with NULL values.

CREATE EXTERNAL TABLE IF NOT EXISTS student (
name string comment 'student name',
id int comment 'student id',
course string comment 'student course',
year int comment 'student year'
)
COMMENT 'My table name is student'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'   -- by default the FIELD separator is "ctrl + a" (unicode \001)
LINES TERMINATED BY '\n'    -- by default the LINE separator is \n, so this line is optional
STORED AS TEXTFILE          -- by default data is stored as a text file, so this line is also optional
LOCATION '/hive/arun/student'
TBLPROPERTIES ('key1'='value1', 'key2'='value2')
;
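
For illustration, a student.txt file matching the tab-delimited schema above could look like the following (hypothetical rows; the four fields name, id, course and year are separated by a tab character):

arun	1	hadoop	2015
kalyan	2	spark	2016
venkat	3	hive	2016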

In case we want to load data that is "," (comma) separated, we define the table as below:

CREATE EXTERNAL TABLE IF NOT EXISTS student (
name string comment 'student name',
id int comment 'student id',
course string comment 'student course',
year int comment 'student year'
)
COMMENT 'My table name is student'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','    -- fields are now comma separated instead of the default "ctrl + a" (\001)
LINES TERMINATED BY '\n'    -- by default the LINE separator is \n, so this line is optional
STORED AS TEXTFILE          -- by default data is stored as a text file, so this line is also optional
LOCATION '/hive/arun/student'
TBLPROPERTIES ('key1'='value1', 'key2'='value2')
;
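
The same hypothetical rows in comma-separated form would look like this:

arun,1,hadoop,2015
kalyan,2,spark,2016
venkat,3,hive,2016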

HOW TO KNOW INFORMATION ABOUT TABLE
----------------------------------------------------------------
DESCRIBE tbl_name;

DESCRIBE EXTENDED tbl_name;   -- more information about the table

DESCRIBE FORMATTED tbl_name;  -- more information about the table, in a structured format
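
Hive can also print the full DDL of an existing table, which is a quick way to check the delimiters, location and properties in one shot:

SHOW CREATE TABLE student;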


HOW TO LOAD DATA INTO TABLE
-----------------------------------------------
  • LOAD DATA LOCAL INPATH '/home/arun/work/hive_inputs/student.txt' OVERWRITE INTO TABLE student2;

  • LOAD DATA LOCAL INPATH '/home/arun/work/hive_inputs/student.txt' INTO TABLE student2;

To load from an HDFS path, first copy the file into HDFS:

hadoop fs -mkdir -p /home/arun/work/hive_inputs

hadoop fs -put /home/arun/work/hive_inputs/student.txt /home/arun/work/hive_inputs/student.txt

  • LOAD DATA INPATH '/home/arun/work/hive_inputs/student.txt' OVERWRITE INTO TABLE student2;

Because LOAD DATA INPATH moves the file out of its HDFS location, put it into HDFS again before the next load:

hadoop fs -put /home/arun/work/hive_inputs/student.txt /home/arun/work/hive_inputs/student.txt

  • LOAD DATA INPATH '/home/arun/work/hive_inputs/student.txt' INTO TABLE student2;
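
After loading, a quick sanity check is to query the table; if the delimiter in the file did not match the table definition, you will see NULLs here:

SELECT * FROM student2 LIMIT 5;    -- verify the rows were split into the correct columns
SELECT COUNT(*) FROM student2;     -- verify the expected row count
DESCRIBE FORMATTED student2;       -- the Location field shows where the files landed in HDFS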


Points to remember during data loading:
  • We can load data in the 4 ways shown above:
    • local to HDFS (with OVERWRITE - old data will be deleted)
    • local to HDFS (without OVERWRITE - old data will be retained, new data will be appended)
    • HDFS to HDFS (with OVERWRITE - old data will be deleted)
    • HDFS to HDFS (without OVERWRITE - old data will be retained, new data will be appended)
  • When we load data from local to HDFS, it is a COPY operation (the local file remains).
  • When we load data from HDFS to HDFS, it is a MOVE operation (the source file is moved into the table's location).

Note : We don't load data this way in production; we generally use something like "INSERT OVERWRITE", which we will see in further blogs.
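
As a rough sketch of that production-style approach (assuming a hypothetical staging table named student_staging with the same columns), an INSERT OVERWRITE looks like this:

INSERT OVERWRITE TABLE student2
SELECT name, id, course, year
FROM student_staging;   -- student_staging is a hypothetical source table, used here only for illustration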



Thanks,
Arun Mathe
Email ID : arunkumar.mathe@gmail.com
Contact ID : 9704117111
