🐝A.4) Understanding Apache Hive: A Data Warehouse, Not a Database
Understanding Apache Hive: A Data Warehouse, Not a Database
When working with big data, it's crucial to know not only when to use a particular tool but also when not to use it. Apache Hive is a data warehouse built for analytical operations on a large scale. However, it is not suitable for transactional operations.
Key Characteristics of Hive
- Hive is designed for analytical queries and is not optimized for transactional processing.
- Hive data is completely de-normalized.
- Hive supports JOINs, but they should be avoided whenever possible to improve performance.
- Hive Query Language (HQL) is similar to SQL, making it easy for SQL users to adapt.
Relation Between Hadoop and Hive
Hadoop Component | Hive Component |
---|---|
HDFS stores folders and files | Hive organizes data into databases and tables |
Creating a database in Hive | Creates a corresponding folder in HDFS |
Creating a table in Hive | Creates a corresponding folder in HDFS |
Inserting records into a Hive table | Saves records in HDFS as files |
Example: How Data is Stored in Hive and HDFS
Table in Hive | How it is Stored in HDFS |
---|---|
C1 C2 C3 | 11,12,13 |
11 12 13 | 21,22,23 |
21 22 23 |
Delimiter Usage in Hive
- Delimiters (such as commas or tabs) play a crucial role in data storage in Hive.
- Hive supports structured, semi-structured, and unstructured data, but unstructured data must be converted into a Hive-readable format using SerDe (Serializer/Deserializer).
Writing Hive Queries
Best Practice: Always write Hive queries in a Notepad or text editor before copying them into the Hive prompt. Hive does not allow query modifications once executed.
Creating a Database in Hive
Hive provides two syntax formats for creating a database:
Syntax 1:
CREATE DATABASE IF NOT EXISTS <DB_NAME>;
Syntax 2:
CREATE DATABASE IF NOT EXISTS <DB_NAME>
COMMENT 'COMMENT ON DATABASE'
LOCATION 'PATH OF THE DATABASE LOCATION'
WITH DBPROPERTIES(KEY1 = VALUE1, KEY2 = VALUE2,…);
Example:
CREATE DATABASE IF NOT EXISTS sample3
COMMENT 'sample3 database'
LOCATION '/hive/sample3';
Note: The location path should be a folder path, not a file path. If the folder path exists, Hive will use it; otherwise, it will create a new folder.
Adding Properties to a Database
CREATE DATABASE IF NOT EXISTS sample4
COMMENT 'sample4 database'
LOCATION '/hive/sample4'
WITH DBPROPERTIES('key1' = 'value1', 'key2' = 'value2');
These properties are useful when integrating Hive with other systems like Apache Spark and NoSQL databases.
Viewing Extended Database Information
To view database properties, use the EXTENDED
keyword:
DESCRIBE DATABASE EXTENDED sample4;
Dropping a Database in Hive
To delete a database, use the following syntax:
DROP DATABASE <DATABASE_NAME>;
DROP DATABASE IF EXISTS <DATABASE_NAME>;
DROP DATABASE IF EXISTS <DATABASE_NAME> RESTRICT;
DROP DATABASE IF EXISTS <DATABASE_NAME> CASCADE;
Examples:
DROP DATABASE IF EXISTS sample1;
DROP DATABASE IF EXISTS sample1 RESTRICT; -- Prevents dropping if tables exist
DROP DATABASE IF EXISTS sample1 CASCADE; -- Drops database along with all tables
Final Thoughts
Understanding Hive’s role as a data warehouse rather than a database is essential for working with big data. In the next blog, we will explore table creation in Hive. Stay tuned and happy learning!
Have a great day!
Comments
Post a Comment