🐝A.4) Understanding Apache Hive: A Data Warehouse, Not a Database

 

Understanding Apache Hive: A Data Warehouse, Not a Database

When working with big data, it's crucial to know not only when to use a particular tool but also when not to use it. Apache Hive is a data warehouse built for analytical operations on a large scale. However, it is not suitable for transactional operations.

Key Characteristics of Hive

  • Hive is designed for analytical queries and is not optimized for transactional processing.
  • Hive data is completely de-normalized.
  • Hive supports JOINs, but they should be avoided whenever possible to improve performance.
  • Hive Query Language (HQL) is similar to SQL, making it easy for SQL users to adapt.

Relation Between Hadoop and Hive

Hadoop Component Hive Component
HDFS stores folders and files Hive organizes data into databases and tables
Creating a database in Hive Creates a corresponding folder in HDFS
Creating a table in Hive Creates a corresponding folder in HDFS
Inserting records into a Hive table Saves records in HDFS as files

Example: How Data is Stored in Hive and HDFS

Table in Hive How it is Stored in HDFS
C1 C2 C3 11,12,13
11 12 13 21,22,23
21 22 23

Delimiter Usage in Hive

  • Delimiters (such as commas or tabs) play a crucial role in data storage in Hive.
  • Hive supports structured, semi-structured, and unstructured data, but unstructured data must be converted into a Hive-readable format using SerDe (Serializer/Deserializer).

Writing Hive Queries

Best Practice: Always write Hive queries in a Notepad or text editor before copying them into the Hive prompt. Hive does not allow query modifications once executed.

Creating a Database in Hive

Hive provides two syntax formats for creating a database:

Syntax 1:

CREATE DATABASE IF NOT EXISTS <DB_NAME>;

Syntax 2:

CREATE DATABASE IF NOT EXISTS <DB_NAME>
COMMENT 'COMMENT ON DATABASE'
LOCATION 'PATH OF THE DATABASE LOCATION'
WITH DBPROPERTIES(KEY1 = VALUE1, KEY2 = VALUE2,…);

Example:

CREATE DATABASE IF NOT EXISTS sample3
COMMENT 'sample3 database'
LOCATION '/hive/sample3';

Note: The location path should be a folder path, not a file path. If the folder path exists, Hive will use it; otherwise, it will create a new folder.

Adding Properties to a Database

CREATE DATABASE IF NOT EXISTS sample4
COMMENT 'sample4 database'
LOCATION '/hive/sample4'
WITH DBPROPERTIES('key1' = 'value1', 'key2' = 'value2');

These properties are useful when integrating Hive with other systems like Apache Spark and NoSQL databases.

Viewing Extended Database Information

To view database properties, use the EXTENDED keyword:

DESCRIBE DATABASE EXTENDED sample4;

Dropping a Database in Hive

To delete a database, use the following syntax:

DROP DATABASE <DATABASE_NAME>;
DROP DATABASE IF EXISTS <DATABASE_NAME>;
DROP DATABASE IF EXISTS <DATABASE_NAME> RESTRICT;
DROP DATABASE IF EXISTS <DATABASE_NAME> CASCADE;

Examples:

DROP DATABASE IF EXISTS sample1;
DROP DATABASE IF EXISTS sample1 RESTRICT; -- Prevents dropping if tables exist
DROP DATABASE IF EXISTS sample1 CASCADE;  -- Drops database along with all tables

Final Thoughts

Understanding Hive’s role as a data warehouse rather than a database is essential for working with big data. In the next blog, we will explore table creation in Hive. Stay tuned and happy learning!


Have a great day!

Comments

Popular posts from this blog

🔥Apache Spark Architecture with RDD & DAG

🌐Filtering and Copying Files Dynamically in Azure Data Factory (ADF)

🌐End-to-End ETL Pipeline: MS SQL to MS SQL Using Azure Databricks