
HIVE : How to create a database, table and load data

Please keep the following points in mind before creating a table in Hive.

  • Hive is not a database; it is a data warehouse designed for data analytics (OLAP).
  • Hive doesn't have any internal storage of its own: it uses HDFS to store table data and an RDBMS (like MySQL) to store schema-related information. Hence we need an RDBMS such as MySQL.
  • When we create a database, Hive creates a folder in HDFS. Later, when we create a Hive table, it creates another folder inside the database folder (see the example listing just after this list).
  • After creating the database and table, once we load/insert data, the data is stored as files in HDFS and the schema information is stored in MySQL.
  • The delimiter is very important while loading the data; we will see some examples below.
  • As this is analytics, the data is historical, so we already know what type of data is there and we create the table schema based on it.
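
For example, assuming the default warehouse directory is /home/arun/work/warehouse (the value used in the hive-site.xml snippet later in this post), the folders created for a database named kalyan and a table named student can be listed like this:

hadoop fs -ls /home/arun/work/warehouse             # database folder, typically named kalyan.db
hadoop fs -ls /home/arun/work/warehouse/kalyan.db   # table folder named student
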
HOW TO CREATE DATABASE
-----------------------------
CREATE DATABASE IF NOT EXISTS kalyan;

CREATE DATABASE IF NOT EXISTS kalyan
COMMENT 'My database name is kalyan'
LOCATION '/hive/kalyan'
WITH DBPROPERTIES ('key1' = 'value1', 'key2' = 'value2');

DESCRIBE DATABASE kalyan;
DESCRIBE DATABASE EXTENDED kalyan;   -- shows detailed database information, including DBPROPERTIES
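
We can also list the databases that exist, optionally filtering by a name pattern:

SHOW DATABASES;
SHOW DATABASES LIKE 'kal*';   -- only databases whose names match the pattern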


HOW TO DROP DATABASE
-----------------------------
DROP DATABASE IF EXISTS kalyan;   -- drops an empty database

DROP DATABASE IF EXISTS kalyan CASCADE;   -- drops a database together with the tables inside it

DROP DATABASE IF EXISTS kalyan RESTRICT;   -- default behaviour; fails if the database still contains tables



HOW TO USE DATABASE
-----------------------------
USE kalyan;
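
To confirm which database is currently active, Hive provides a built-in function:

SELECT current_database();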


HOW TO ALTER DATABASE
-----------------------------
ALTER DATABASE kalyan SET DBPROPERTIES ('key1' = 'value11', 'key3' = 'value3');

ALTER DATABASE kalyan SET OWNER USER hadoop; 

ALTER DATABASE kalyan SET OWNER ROLE hadoop; 
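
To verify that the properties or owner actually changed, describe the database again:

DESCRIBE DATABASE EXTENDED kalyan;   -- the updated DBPROPERTIES and owner appear in the output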


Example of a Hive table:
  • Always create EXTERNAL tables in production; otherwise dropping a managed table also deletes its data in HDFS, which leads to data loss scenarios.
  • Observe that FIELDS (COLUMNS) are terminated by '\t'; that means columns are tab separated.
  • Observe that LINES are terminated by '\n'; that means rows are separated by the newline character.
  • We are going to store this as a text file.
  • If we don't mention a LOCATION as in the table below, then by default the files are stored under the default warehouse location configured in hive-site.xml (the property looks like this):

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/home/arun/work/warehouse</value>
    </property>

  • Note that we have specified delimiters, so we need to load data in exactly this format; otherwise the entire row ends up in the first FIELD/COLUMN and the remaining columns are filled with NULL values.

CREATE EXTERNAL TABLE IF NOT EXISTS student (
name string comment 'student name',
id int comment 'student id',
course string comment 'student course',
year int comment 'student year'
)
COMMENT 'My table name is student'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'   -- by default the FIELD separator is "ctrl + a" (unicode \001)
LINES TERMINATED BY '\n'    -- by default the LINE separator is \n, so this line is optional
STORED AS TEXTFILE          -- by default data is stored as a text file, so this line is also optional
LOCATION '/hive/arun/student'
TBLPROPERTIES ('key1'='value1', 'key2'='value2')
;
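
For illustration, a student.txt file matching the tab-delimited schema above could look like the following (hypothetical rows; the four fields name, id, course and year are separated by a tab character):

arun	1	hadoop	2015
kalyan	2	spark	2016
venkat	3	hive	2016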

In case we want to load data that is "," (comma) separated, we define the table as below:

CREATE EXTERNAL TABLE IF NOT EXISTS student (
name string comment 'student name',
id int comment 'student id',
course string comment 'student course',
year int comment 'student year'
)
COMMENT 'My table name is student'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','    -- fields are now comma separated instead of the default "ctrl + a" (\001)
LINES TERMINATED BY '\n'    -- by default the LINE separator is \n, so this line is optional
STORED AS TEXTFILE          -- by default data is stored as a text file, so this line is also optional
LOCATION '/hive/arun/student'
TBLPROPERTIES ('key1'='value1', 'key2'='value2')
;
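
The same hypothetical rows in comma-separated form would look like this:

arun,1,hadoop,2015
kalyan,2,spark,2016
venkat,3,hive,2016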

HOW TO KNOW INFORMATION ABOUT TABLE
----------------------------------------------------------------
DESCRIBE tbl_name;

DESCRIBE EXTENDED tbl_name;   -- more information about the table

DESCRIBE FORMATTED tbl_name;  -- more information about the table, in a structured format
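
Hive can also print the full DDL of an existing table, which is a quick way to check the delimiters, location and properties in one shot:

SHOW CREATE TABLE student;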


HOW TO LOAD DATA INTO TABLE
-----------------------------------------------
  • LOAD DATA LOCAL INPATH '/home/arun/work/hive_inputs/student.txt' OVERWRITE INTO TABLE student2;

  • LOAD DATA LOCAL INPATH '/home/arun/work/hive_inputs/student.txt' INTO TABLE student2;

To load from an HDFS path, first copy the file into HDFS:

hadoop fs -mkdir -p /home/arun/work/hive_inputs

hadoop fs -put /home/arun/work/hive_inputs/student.txt /home/arun/work/hive_inputs/student.txt

  • LOAD DATA INPATH '/home/arun/work/hive_inputs/student.txt' OVERWRITE INTO TABLE student2;

Because LOAD DATA INPATH moves the file out of its HDFS location, put it into HDFS again before the next load:

hadoop fs -put /home/arun/work/hive_inputs/student.txt /home/arun/work/hive_inputs/student.txt

  • LOAD DATA INPATH '/home/arun/work/hive_inputs/student.txt' INTO TABLE student2;
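
After loading, a quick sanity check is to query the table; if the delimiter in the file did not match the table definition, you will see NULLs here:

SELECT * FROM student2 LIMIT 5;    -- verify the rows were split into the correct columns
SELECT COUNT(*) FROM student2;     -- verify the expected row count
DESCRIBE FORMATTED student2;       -- the Location field shows where the files landed in HDFS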


Points to remember during data loading:
  • We can load data in the 4 ways shown above:
    • local to HDFS (with OVERWRITE - old data will be deleted)
    • local to HDFS (without OVERWRITE - old data will be retained, new data will be appended)
    • HDFS to HDFS (with OVERWRITE - old data will be deleted)
    • HDFS to HDFS (without OVERWRITE - old data will be retained, new data will be appended)
  • When we load data from local to HDFS, it is a COPY operation (the local file remains).
  • When we load data from HDFS to HDFS, it is a MOVE operation (the source file is moved into the table's location).

Note : We don't load data this way in production; we generally use something like "INSERT OVERWRITE", which we will see in further blogs.
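
As a rough sketch of that production-style approach (assuming a hypothetical staging table named student_staging with the same columns), an INSERT OVERWRITE looks like this:

INSERT OVERWRITE TABLE student2
SELECT name, id, course, year
FROM student_staging;   -- student_staging is a hypothetical source table, used here only for illustration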



Thanks,
Arun Mathe
Email ID : arunkumar.mathe@gmail.com
Contact ID : 9704117111
