🐝A.5) Setting Up and Running Apache Hive on AWS EMR and Databricks
1. SQL Databases and Standards
- All SQL databases follow a common standard (the ANSI SQL standard).
- SQL works only with structured (tabular) data.
- Row-oriented scanning: Traditional databases process data row by row (OLTP systems).
- OLTP (Online Transaction Processing): Used for real-time transaction-based processing (e.g., banking, e-commerce).
- OLAP (Online Analytical Processing): Used for complex analytical queries (e.g., reporting, business intelligence).
2. Next-Generation Database & Hive Overview
- Hive Query Language (HQL): Hive uses HQL, similar to SQL, but designed for big data processing.
- Hive supports only OLAP-style queries (SELECT, GROUP BY, JOIN, built-in functions, etc.); it is not designed for transactional (OLTP) workloads.
- About 90% of the syntax is similar to SQL, but it is not 100% the same.
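For example, a typical analytical query in HQL reads just like SQL. The sketch below assumes a hypothetical `sales` table:

```sql
-- Hypothetical table for illustration: sales(region STRING, amount DOUBLE, sale_date DATE)
SELECT region,
       COUNT(*)    AS num_orders,
       SUM(amount) AS total_revenue
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;
```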
3. Hive Data Processing & File Storage
- Data must be stored in a folder, not as a single file.
- ✅ Correct: `s3://bucket/input/asldata/` (a folder containing multiple CSVs with the same schema)
- ❌ Incorrect: `s3://bucket/input/asldata/asl.csv` (a single file)
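Because Hive tables point at folders, you typically declare an external table over the directory. A minimal sketch (the column names are assumptions for illustration):

```sql
-- LOCATION points at the folder, not at any individual CSV inside it
CREATE EXTERNAL TABLE asl_data (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://bucket/input/asldata/';
```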
4. Hive Configuration & Settings
- Hive configuration file: `hive-site.xml` (stores all Hive settings).
- To enable column headers in the Hive CLI: `set hive.cli.print.header=true;`
- To change the Hive execution engine to Tez (for better performance): `set hive.execution.engine=tez;`
- To make settings permanent, modify `hive-site.xml`; on EMR it is typically located at `/usr/lib/hive/conf/hive-site.xml` (running `whereis hive` helps locate the installation).
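For reference, a permanent setting in `hive-site.xml` is just a property entry. A minimal sketch of the two settings above:

```xml
<!-- e.g., in /usr/lib/hive/conf/hive-site.xml -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>
```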
5. Hive in Development vs. Production
Development & Testing:
- Use `.hql` script files (e.g., `hqlscript.hql`), as sketched below.
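A minimal sketch of such a script (the database and table names are placeholders), typically executed with `hive -f hqlscript.hql`:

```sql
-- hqlscript.hql (placeholder names; run with: hive -f hqlscript.hql)
USE mydb;
SELECT COUNT(*) AS row_count FROM test_table;
```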
Production:
- Use automation tools like:
- Shell scripts
- Oozie
- Apache Airflow
- Control-M
- Autosys
6. Setting Up Hive on AWS EMR (Recommended for Big Data Processing)
Step 1: Create an AWS EMR Cluster
- Log in to AWS Console → Navigate to Amazon EMR.
- Click Create Cluster.
- Select Advanced Options.
- Choose the following applications:
- Hadoop
- Hive
- Spark
- Select an instance type (e.g., m5.xlarge for better performance).
- Set the number of nodes to 1 (for testing) or 3+ (for production).
- Enable SSH access (Port 22) in Security Groups.
- Click Create Cluster and wait for it to start.
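If you prefer the command line, the same cluster can be created with the AWS CLI. A sketch with illustrative values (the cluster name, release label, and key name are assumptions):

```bash
# Illustrative only — adjust the release label, instance type, and key name
aws emr create-cluster \
  --name "hive-demo" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop Name=Hive Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=your-key \
  --use-default-roles
```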
Step 2: Connect to EMR via SSH
Once the cluster is running:
- Open a terminal on Linux/macOS or Git Bash on Windows.
- Connect to the master node using SSH:
```
ssh -i your-key.pem hadoop@<master-node-public-dns>
```
- Replace `your-key.pem` with your AWS key pair file.
- Replace `<master-node-public-dns>` with the value shown in the AWS EMR console under Master Node details.
Step 3: Run Hive Commands in EMR
After logging in:
- Start Hive:
hive
- Run Hive queries:

```sql
CREATE DATABASE mydb;
USE mydb;

CREATE TABLE test_table (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

SHOW TABLES;
```

- Exit Hive when done: `exit;`
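To take the walkthrough one step further, you can load data into the table and query it. A sketch assuming the CSV folder from earlier (the S3 path is a placeholder; note that `LOAD DATA INPATH` moves the source files into the table's location):

```sql
-- Path is a placeholder; LOAD DATA INPATH moves the source files
LOAD DATA INPATH 's3://bucket/input/asldata/' INTO TABLE test_table;

-- Verify the load
SELECT * FROM test_table LIMIT 10;
```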
7. Using Databricks to Run Hive Queries
Step 1: Create a Databricks Workspace
- Log in to Databricks (Azure or AWS).
- Click Create Workspace.
- Create a Cluster with a Databricks Runtime that supports Hive.
Step 2: Run Hive Queries in Databricks
- Open a Databricks Notebook.
- Use the `%sql` magic command before running SQL queries:

```sql
%sql
CREATE DATABASE mydb;
USE mydb;

CREATE TABLE test_table (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

SHOW TABLES;
```
- Run the notebook to see results.
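From there, you can insert a few rows and read them back in another `%sql` cell. A sketch with illustrative values:

```sql
%sql
-- Illustrative sample rows
INSERT INTO test_table VALUES (1, 'alice'), (2, 'bob');
SELECT * FROM test_table;
```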
8. Connecting Hive from Local CMD Using Beeline
If you want to run Hive commands from your local machine without installing Hive, you can connect to AWS EMR using `beeline`.
Step 1: Install Beeline Locally
- On Windows/Linux, install Apache Hive Client.
- Add Hive to the system `PATH`.
Step 2: Connect Beeline to AWS EMR
Run this command:
```
beeline -u jdbc:hive2://<master-node-public-dns>:10000
```
- Replace `<master-node-public-dns>` with your EMR master node's public DNS.
- You can now execute Hive queries from your local terminal.
9. Comparison: AWS EMR vs. Databricks vs. Beeline
| Option | Best For |
|---|---|
| AWS EMR | Running Hive on a Hadoop cluster for big data processing |
| Databricks | Running Hive queries in an interactive cloud notebook |
| Beeline (local CMD) | Running Hive commands locally without a full installation |
10. Conclusion
Apache Hive is a great tool for processing large-scale datasets. If you don’t have Hive installed locally, you can set it up using AWS EMR or Databricks. Additionally, you can use Beeline to run Hive queries from your local machine without needing a full installation. Choose the setup that best fits your needs and start querying big data efficiently!
💡 Next Steps: Try these steps and let me know if you have any questions! Please comment 🚀