🐝A.5) Setting Up and Running Apache Hive on AWS EMR and Databricks
1. SQL Databases and Standards
- All SQL databases follow a common standard (the ANSI SQL standard).
- SQL works only with structured (tabular) data.
- Row-oriented scanning: Traditional databases process data row by row (OLTP systems).
- OLTP (Online Transaction Processing): Used for real-time transaction-based processing (e.g., banking, e-commerce).
- OLAP (Online Analytical Processing): Used for complex analytical queries (e.g., reporting, business intelligence).
2. Next-Generation Database & Hive Overview
- Hive Query Language (HQL): Hive uses HQL, similar to SQL, but designed for big data processing.
- Hive supports only OLAP-style queries (SELECT, GROUP BY, JOIN, built-in functions, etc.); it is not designed for transactional (OLTP) workloads.
- About 90% of the syntax is similar to SQL, but it is not 100% the same.
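For example, a typical analytical query in HQL reads just like SQL. The sketch below assumes a hypothetical `sales` table:

```sql
-- Hypothetical table for illustration: sales(region STRING, amount DOUBLE, sale_date DATE)
SELECT region,
       COUNT(*)    AS num_orders,
       SUM(amount) AS total_revenue
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;
```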
3. Hive Data Processing & File Storage
- Data must be stored in a folder, not as a single file.
- ✅ Correct: `s3://bucket/input/asldata/` (a folder containing multiple CSVs with the same schema)
- ❌ Incorrect: `s3://bucket/input/asldata/asl.csv` (a single file)
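Because Hive tables point at folders, you typically declare an external table over the directory. A minimal sketch (the column names are assumptions for illustration):

```sql
-- LOCATION points at the folder, not at any individual CSV inside it
CREATE EXTERNAL TABLE asl_data (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://bucket/input/asldata/';
```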
4. Hive Configuration & Settings
- Hive configuration file: `hive-site.xml` (stores all Hive settings).
- To enable column headers in the Hive CLI: `set hive.cli.print.header=true;`
- To change the Hive execution engine to Tez (for better performance): `set hive.execution.engine=tez;`
- To make settings permanent, modify `hive-site.xml`; on EMR it is typically located at `/usr/lib/hive/conf/hive-site.xml` (running `whereis hive` helps locate the installation).
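For reference, a permanent setting in `hive-site.xml` is just a property entry. A minimal sketch of the two settings above:

```xml
<!-- e.g., in /usr/lib/hive/conf/hive-site.xml -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>
```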
5. Hive in Development vs. Production
Development & Testing:
- Use `.hql` script files (e.g., `hqlscript.hql`), as sketched below.
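A minimal sketch of such a script (the database and table names are placeholders), typically executed with `hive -f hqlscript.hql`:

```sql
-- hqlscript.hql (placeholder names; run with: hive -f hqlscript.hql)
USE mydb;
SELECT COUNT(*) AS row_count FROM test_table;
```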
Production:
- Use automation tools like:
- Shell scripts
- Oozie
- Apache Airflow
- Control-M
- Autosys
6. Setting Up Hive on AWS EMR (Recommended for Big Data Processing)
Step 1: Create an AWS EMR Cluster
- Log in to AWS Console → Navigate to Amazon EMR.
- Click Create Cluster.
- Select Advanced Options.
- Choose the following applications:
- Hadoop
- Hive
- Spark
- Select an instance type (e.g., m5.xlarge for better performance).
- Set the number of nodes to 1 (for testing) or 3+ (for production).
- Enable SSH access (Port 22) in Security Groups.
- Click Create Cluster and wait for it to start.
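If you prefer the command line, the same cluster can be created with the AWS CLI. A sketch with illustrative values (the cluster name, release label, and key name are assumptions):

```bash
# Illustrative only — adjust the release label, instance type, and key name
aws emr create-cluster \
  --name "hive-demo" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop Name=Hive Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=your-key \
  --use-default-roles
```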
Step 2: Connect to EMR via SSH
Once the cluster is running:
- Open a terminal on Linux/macOS or Git Bash on Windows.
- Connect to the master node using SSH:
```
ssh -i your-key.pem hadoop@<master-node-public-dns>
```
- Replace `your-key.pem` with your AWS key pair file.
- Replace `<master-node-public-dns>` with the value shown in the AWS EMR console under Master Node details.
Step 3: Run Hive Commands in EMR
After logging in:
- Start Hive:
hive
- Run Hive queries:

```sql
CREATE DATABASE mydb;
USE mydb;

CREATE TABLE test_table (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

SHOW TABLES;
```

- Exit Hive when done: `exit;`
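To take the walkthrough one step further, you can load data into the table and query it. A sketch assuming the CSV folder from earlier (the S3 path is a placeholder; note that `LOAD DATA INPATH` moves the source files into the table's location):

```sql
-- Path is a placeholder; LOAD DATA INPATH moves the source files
LOAD DATA INPATH 's3://bucket/input/asldata/' INTO TABLE test_table;

-- Verify the load
SELECT * FROM test_table LIMIT 10;
```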
7. Using Databricks to Run Hive Queries
Step 1: Create a Databricks Workspace
- Log in to Databricks (Azure or AWS).
- Click Create Workspace.
- Create a Cluster with a Databricks Runtime that supports Hive.
Step 2: Run Hive Queries in Databricks
- Open a Databricks Notebook.
- Use the `%sql` magic command before running SQL queries:

```sql
%sql
CREATE DATABASE mydb;
USE mydb;

CREATE TABLE test_table (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

SHOW TABLES;
```
- Run the notebook to see results.
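From there, you can insert a few rows and read them back in another `%sql` cell. A sketch with illustrative values:

```sql
%sql
-- Illustrative sample rows
INSERT INTO test_table VALUES (1, 'alice'), (2, 'bob');
SELECT * FROM test_table;
```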
8. Connecting Hive from Local CMD Using Beeline
If you want to run Hive commands from your local machine without installing Hive, you can connect to AWS EMR using `beeline`.
Step 1: Install Beeline Locally
- On Windows/Linux, install Apache Hive Client.
- Add Hive to the system `PATH`.
Step 2: Connect Beeline to AWS EMR
Run this command:
```
beeline -u jdbc:hive2://<master-node-public-dns>:10000
```
- Replace `<master-node-public-dns>` with your EMR master node's public DNS.
- You can now execute Hive queries from your local terminal.
9. Comparison: AWS EMR vs. Databricks vs. Beeline
| Option | Best For |
|---|---|
| AWS EMR | Running Hive on a Hadoop cluster for big data processing |
| Databricks | Running Hive queries in an interactive cloud notebook |
| Beeline (local CMD) | Running Hive commands locally without a full installation |
10. Conclusion
Apache Hive is a great tool for processing large-scale datasets. If you don’t have Hive installed locally, you can set it up using AWS EMR or Databricks. Additionally, you can use Beeline to run Hive queries from your local machine without needing a full installation. Choose the setup that best fits your needs and start querying big data efficiently!
💡 Next Steps: Try these steps and let me know if you have any questions! Please comment 🚀