
Hadoop : HIVE Installation

A person sailing in a boat who knows how to swim will always be safer than one who doesn't know how to swim.

Guys, we all know that we use cloud-based platforms like AWS Athena to work with HIVE these days, but it is still very important to understand the basics of HIVE: what HIVE is, how to install it, what the installation modes are, HIVE tables, the concepts involved in formatting the data, SerDe, etc.

Most of the time, the already implemented HIVE SerDes are good enough for practical use cases, but what if we land in a situation where we have to write our own SerDe in Java? So this knowledge is good for a Data Engineer to have, even though we use a cloud UI to perform the same activities. Hence, let's learn HIVE in depth and carry this knowledge with us to sail further.


What is HIVE ?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Without Hive, such SQL-style queries would have to be implemented in the MapReduce Java API to run over distributed data.

Apache HIVE is a distributed, fault-tolerant data warehouse system that enables analytics at massive scale; it facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.

  • Generally, HIVE is built on top of HDFS to process data using SQL-like syntax instead of writing MapReduce programs in Java (a small example is shown right after this list)
  • It uses the HIVE Query Language, HQL (a kind of SQL syntax)
  • It is a data warehouse system (it is not a database)
  • It reads and writes data to/from distributed storage
  • It can deal with huge amounts of data
  • HIVE is not designed for transactions (OLAP, not OLTP)
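
For example, a word-count style aggregation that would otherwise require a Java MapReduce program can be written in HQL as below (the table name 'words' and its column are hypothetical placeholders, just to illustrate the syntax) :

hive> SELECT word, COUNT(*) AS cnt
    > FROM words
    > GROUP BY word;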

HIVE installation (Bare metal/On-Premises) :
  • We need Hadoop and one RDBMS (MySQL/Derby/Teradata etc.) before installing HIVE
  • Because HIVE doesn't have any internal storage of its own
  • It stores the actual data in HDFS and the schema in an RDBMS (the metastore)
  • The metastore is nothing but information about databases, tables, columns, functions etc.

Why is it built like that, with the data in one place and the schema information in a different place ?

Let's say we have a table named arun1 in an RDBMS (MySQL) and we rename it to arun2; the table name changes from arun1 to arun2. But in the background it is not renamed in place: effectively, another table named arun2 is created. There is no real concept called "update" in the software world; it is just a MYTH. We are under the impression that it got updated, but what really happens is that a new table is created. With this knowledge, understand that HDFS won't allow any updates and HIVE doesn't have storage of its own. This is the reason we keep the metastore information in an RDBMS rather than keeping it in HDFS; only the data/records are stored in HDFS.
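
As a tiny illustration (re-using the hypothetical table names from above), renaming a table in HIVE is essentially a metastore change; the data files in HDFS are not rewritten (at most the underlying directory gets renamed) :

hive> ALTER TABLE arun1 RENAME TO arun2;
hive> SHOW TABLES;    -- now lists arun2 instead of arun1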

As per above image, we have 2 types of setup in HIVE:
  • Local/Internal metastore DB
  • Remote/External metastore DB (recommended way even for development)
  • Hadoop has 3 modes of installation
    • Local mode (dev)
    • Pseudo-distributed mode (dev)
    • Cluster mode (prod)
  • The difference between a Local/Internal metastore DB and a Remote/External metastore DB is shown in the below image
  • The below 2 images are 2 different Hadoop environments; we have 3 Data Nodes (DNs) available in each of them.
  • On the left side, with a Local/Internal metastore DB, the metastore DB is installed locally on each machine
  • On the right side, with a Remote/External metastore DB, the metastore DB is installed at a common location
  • Let's say we created a table T1 on the first data node and a table T2 on the second data node with a Local/Internal metastore DB; as the metastores are installed locally, each data node has no information about the metastore or the tables available on the other data nodes
  • But with a Remote/External metastore DB, the metastore DB is installed at a common place, and the entire metastore information is known to all data nodes in HDFS. Hence this mode is recommended even for development purposes.
  • In production, the external approach is a must
  • Understand that we store at most a few MBs in the metastore DB, so an ordinary RDBMS is enough; high-end database configurations are not required here
  • Large amounts of actual data are stored in HDFS in Hadoop
  • Embedded Derby is a programmatic way of running the metastore database; it doesn't exist as a separate database server, and through programs we can insert, delete data etc. (a small configuration sketch is shown right after this list)
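
For reference, here is a rough sketch of how the connection property differs between the two setups in hive-site.xml (the values below are typical examples/placeholders, not taken from this installation; 'mysql-host' stands for wherever the shared MySQL runs) :

<!-- Local/Internal metastore DB : embedded Derby, created next to wherever HIVE is launched -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>

<!-- Remote/External metastore DB : a shared RDBMS such as MySQL (this is what we configure below) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/hive3_mysql_db?createDatabaseIfNotExist=true</value>
</property>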


The actual installation process starts from here :

Note : I have installed VMware on my Windows laptop and installed Ubuntu (a Linux flavor) on top of it to install Hadoop. We need Hadoop and one relational database before installing HIVE.

Use the below command to confirm the version of the installed Linux :

orienit@orienit:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
orienit@orienit:~$

Command to start Hadoop :

orienit@orienit:~$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as orienit in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [orienit]
Starting resourcemanager
Starting nodemanagers
orienit@orienit:~$ 


Use the "jps" command to see if Hadoop is running :

orienit@orienit:~$ jps
3459 ResourceManager
3206 SecondaryNameNode
2887 NameNode
3017 DataNode
3579 NodeManager
6894 Jps

Use the below command to confirm the relational database installation :

orienit@orienit:~$ mysql -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 29
Server version: 8.0.39-0ubuntu0.22.04.1 (Ubuntu)

Copyright (c) 2000, 2024, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>


Command to connect to HIVE :

orienit@orienit:~$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/orienit/work/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/orienit/work/hadoop-3.4.0/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = debad87f-62d7-4ea2-bbb3-3d67a4dd9b0c

Logging initialized using configuration in jar:file:/home/orienit/work/apache-hive-3.1.3-bin/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 662e36b5-685f-45e5-acd7-13eb47f18dc4
hive>


Below are the default databases in MySQL :

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.00 sec)

So, it is confirmed that Hadoop and MySQL are installed. Now, the HIVE installation is pretty easy.

  • Download HIVE from the Apache website : https://dlcdn.apache.org/hive/
  • Copy the downloaded apache-hive-3.1.3-bin.tar.gz file into the work directory /home/orienit/work. Here /home/orienit is my user location and /work is my working directory for Hadoop.

  • Now extract the downloaded archive in the same location as shown in the above screenshot (for example, with the commands shown below)
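         Command (one typical way; this assumes the tarball is already in /home/orienit/work) :
          orienit@orienit:~$ cd /home/orienit/work
          orienit@orienit:~/work$ tar -xzf apache-hive-3.1.3-bin.tar.gz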
  • Update the '~/.bashrc' file with the below changes (copy-paste the below lines into the bashrc file)
         Command : 
          orienit@orienit:~$ gedit ~/.bashrc

          export HIVE_HOME=/home/orienit/work/apache-hive-3.1.3-bin
          export PATH=$HIVE_HOME/bin:$PATH
  • Save and close the terminal and open a new terminal
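  • (Alternatively, run the below command in the current terminal to pick up the changes without reopening it)
            orienit@orienit:~$ source ~/.bashrc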
  • Use the below command to confirm that HIVE_HOME is set correctly (the path itself shows the HIVE version)
            orienit@orienit:~$ echo $HIVE_HOME
            /home/orienit/work/apache-hive-3.1.3-bin
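  • (Optional) The installed HIVE version itself can also be confirmed once the remaining steps below are done; the exact output depends on the build
            orienit@orienit:~$ hive --version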

  • Copy the 'mysql-connector-j-8.0.33.jar' and 'guava-27.0-jre.jar' jar files into the '$HIVE_HOME/lib' folder (the MySQL connector jar lets HIVE connect to the RDBMS metastore; the guava jar is needed for this HIVE version to avoid a library-version conflict with Hadoop 3.x). A sample copy command is shown below.
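         Command (assuming both jars were downloaded to ~/Downloads; adjust the source paths to wherever your jars actually are) :
          orienit@orienit:~$ cp ~/Downloads/mysql-connector-j-8.0.33.jar $HIVE_HOME/lib/
          orienit@orienit:~$ cp ~/Downloads/guava-27.0-jre.jar $HIVE_HOME/lib/
         Note : in many setups the older guava jar bundled with HIVE (e.g. guava-19.0.jar) also has to be removed from $HIVE_HOME/lib to avoid a conflict.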
  • Copy 'hive-site.xml' into the '$HIVE_HOME/conf' folder as shown in the below screenshot

Below are the contents of hive-site.xml :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/home/orienit/work/warehouse</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive3_mysql_db?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>XXXXX</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>

  • hive.metastore.warehouse.dir is the HDFS location where the actual data is stored
  • javax.jdo.option.ConnectionURL is the connection URL to connect to MySQL
  • javax.jdo.option.ConnectionDriverName is the driver class name
  • We need to mention the user name and password to connect to the RDBMS (MySQL in this case)

Create the metastore DB (a one-time step during the first installation) :

schematool --dbType mysql --initSchema

Once we execute the above command, it creates the database hive3_mysql_db (the name we gave in the ConnectionURL) with the default tables required for HIVE to work.
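
(Optional) The metastore connection details and schema version can also be cross-checked with schematool itself; a quick sanity check, output not shown here :

schematool -dbType mysql -info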

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| hive3_mysql_db     |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)

mysql> use hive3_mysql_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> 

mysql> show tables;
+-------------------------------+
| Tables_in_hive3_mysql_db      |
+-------------------------------+
| AUX_TABLE                     |
| BUCKETING_COLS                |
| CDS                           |
| COLUMNS_V2                    |
| COMPACTION_QUEUE              |
| COMPLETED_COMPACTIONS         |
| COMPLETED_TXN_COMPONENTS      |
| CTLGS                         |
| DATABASE_PARAMS               |
| DBS                           |
| DB_PRIVS                      |
| DELEGATION_TOKENS             |
| FUNCS                         |
| FUNC_RU                       |
| GLOBAL_PRIVS                  |
| HIVE_LOCKS                    |
| IDXS                          |
| INDEX_PARAMS                  |
| I_SCHEMA                      |
| KEY_CONSTRAINTS               |
| MASTER_KEYS                   |
| MATERIALIZATION_REBUILD_LOCKS |
| METASTORE_DB_PROPERTIES       |
| MIN_HISTORY_LEVEL             |
| MV_CREATION_METADATA          |
| MV_TABLES_USED                |
| NEXT_COMPACTION_QUEUE_ID      |
| NEXT_LOCK_ID                  |
| NEXT_TXN_ID                   |
| NEXT_WRITE_ID                 |
| NOTIFICATION_LOG              |
| NOTIFICATION_SEQUENCE         |
| NUCLEUS_TABLES                |
| PARTITIONS                    |
| PARTITION_EVENTS              |
| PARTITION_KEYS                |
| PARTITION_KEY_VALS            |
| PARTITION_PARAMS              |
| PART_COL_PRIVS                |
| PART_COL_STATS                |
| PART_PRIVS                    |
| REPL_TXN_MAP                  |
| ROLES                         |
| ROLE_MAP                      |
| RUNTIME_STATS                 |
| SCHEMA_VERSION                |
| SDS                           |
| SD_PARAMS                     |
| SEQUENCE_TABLE                |
| SERDES                        |
| SERDE_PARAMS                  |
| SKEWED_COL_NAMES              |
| SKEWED_COL_VALUE_LOC_MAP      |
| SKEWED_STRING_LIST            |
| SKEWED_STRING_LIST_VALUES     |
| SKEWED_VALUES                 |
| SORT_COLS                     |
| TABLE_PARAMS                  |
| TAB_COL_STATS                 |
| TBLS                          |
| TBL_COL_PRIVS                 |
| TBL_PRIVS                     |
| TXNS                          |
| TXN_COMPONENTS                |
| TXN_TO_WRITE_ID               |
| TYPES                         |
| TYPE_FIELDS                   |
| VERSION                       |
| WM_MAPPING                    |
| WM_POOL                       |
| WM_POOL_TO_TRIGGER            |
| WM_RESOURCEPLAN               |
| WM_TRIGGER                    |
| WRITE_SET                     |
+-------------------------------+
74 rows in set (0.00 sec)


HIVE commands to enter, exit :

   hive start command   : hive
   hive exit command    : exit;
   hive verify command  : show databases;


Finally, it should show as below when you type 'show databases;' at the HIVE prompt :

hive> show databases;
OK
default
Time taken: 1.493 seconds, Fetched: 1 row(s)

You should see the above output to confirm that the HIVE installation succeeded.



Let's create a simple database and see how it works.

Before creating a database, use the below command to see what the DBS table inside the hive3_mysql_db DB contains :

mysql> select * from hive3_mysql_db.DBS;
+-------+-----------------------+---------------------------------------------------+---------+------------+------------+-----------+
| DB_ID | DESC                  | DB_LOCATION_URI                                   | NAME    | OWNER_NAME | OWNER_TYPE | CTLG_NAME |
+-------+-----------------------+---------------------------------------------------+---------+------------+------------+-----------+
|     1 | Default Hive database | hdfs://localhost:8020/home/orienit/work/warehouse | default | public     | ROLE       | hive      |
+-------+-----------------------+---------------------------------------------------+---------+------------+------------+-----------+
1 row in set (0.00 sec)


hive> create database test1;
OK
Time taken: 1.672 seconds
hive> create database test2;
OK
Time taken: 0.192 seconds
hive>
 
hive> show databases;
OK
default
test1
test2
Time taken: 0.695 seconds, Fetched: 3 row(s)


mysql> select * from hive3_mysql_db.DBS;
+-------+-----------------------+------------------------------------------------------------+---------+------------+------------+-----------+
| DB_ID | DESC                  | DB_LOCATION_URI                                            | NAME    | OWNER_NAME | OWNER_TYPE | CTLG_NAME |
+-------+-----------------------+------------------------------------------------------------+---------+------------+------------+-----------+
|     1 | Default Hive database | hdfs://localhost:8020/home/orienit/work/warehouse          | default | public     | ROLE       | hive      |
|     6 | NULL                  | hdfs://localhost:8020/home/orienit/work/warehouse/test1.db | test1   | orienit    | USER       | hive      |
|     7 | NULL                  | hdfs://localhost:8020/home/orienit/work/warehouse/test2.db | test2   | orienit    | USER       | hive      |
+-------+-----------------------+------------------------------------------------------------+---------+------------+------------+-----------+
3 rows in set (0.00 sec)

Observe that HIVE recorded this information in the metastore DB hive3_mysql_db, inside the DBS table in MySQL.
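
As a quick related check (not part of the installation itself), table schemas land in the TBLS table of the same metastore DB. For example, creating a hypothetical table t1 in test1 from the HIVE prompt and then querying TBLS from MySQL would show a row for it :

hive> create table test1.t1 (id int, name string);
mysql> select TBL_ID, DB_ID, TBL_NAME, TBL_TYPE from hive3_mysql_db.TBLS;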



To see the same databases via the browser :
http://localhost:9870/explorer.html#/home/orienit/work/warehouse (this is the warehouse directory that we mentioned in hive-site.xml)
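
The same warehouse directory can also be listed from the command line; a quick alternative to the browser (it should show the test1.db and test2.db folders) :

orienit@orienit:~$ hdfs dfs -ls /home/orienit/work/warehouse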

HIVE installation is completed. 

This was just the HIVE installation; we still need to learn how to create HIVE tables and optimize them, how to balance load in HIVE using bucketing techniques, how to write UDFs (temporary and permanent functions), how to execute HIVE queries in production by creating .hql files, HIVE joins, etc.

Let's see how to deal with the above topics in coming blogs. Have a great day!



Arun Mathe

Gmail ID : arunkumar.mathe@gmail.com

Contact No : +91 9704117111






