How To Install Apache Hadoop on Debian 12
In this tutorial, we will show you how to install Apache Hadoop on Debian 12. Big Data is the backbone of modern data-driven businesses, and Hadoop has emerged as the go-to solution for processing and analyzing massive datasets. If you’re looking to harness the power of Hadoop on a Debian 12 system, you’re in the right place.
This article assumes you have at least basic knowledge of Linux, know how to use the shell, and, most importantly, host your site on your own VPS. The installation is quite simple and assumes you are running in the root account; if not, you may need to add 'sudo' to the commands to get root privileges. I will show you the step-by-step installation of Apache Hadoop on Debian 12 (Bookworm).
Prerequisites
- A server running Debian 12 (Bookworm).
- It’s recommended that you use a fresh OS install to prevent any potential issues.
- SSH access to the server (or just open Terminal if you’re on a desktop).
- An active internet connection. You’ll need an internet connection to download the necessary packages and dependencies for Apache Hadoop.
- A non-root sudo user or access to the root user. We recommend acting as a non-root sudo user, however, as you can harm your system if you're not careful when acting as root.
Install Apache Hadoop on Debian 12 Bookworm
Step 1. Before we install any software, it's important to make sure your system is up to date by running the following apt commands in the terminal:
sudo apt update
sudo apt upgrade
The first command refreshes the package index; the second installs the latest versions of your software packages, so you start from a current system.
Step 2. Installing Java Development Kit (JDK).
Hadoop relies on Java, so make sure you have a JDK installed. Hadoop 3.3.x officially supports Java 8 and 11. Note that Debian 12's repositories ship OpenJDK 17 by default, so if the openjdk-11-jdk package is not available on your system, you may need to install Java 11 from another source (such as Eclipse Temurin) or use openjdk-17-jdk instead:
sudo apt install openjdk-11-jdk
Verify the Java version using the following command:
java --version
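Hadoop also needs to know where Java lives. On Debian 12, the OpenJDK 11 package usually installs to /usr/lib/jvm/java-11-openjdk-amd64, but that path is an assumption you should confirm on your own system, for example:
readlink -f /usr/bin/java | sed 's|/bin/java||'
Note the directory this prints; we will use it as JAVA_HOME later in the configuration.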
Step 3. Preparing the Hadoop Environment
Before diving into the Hadoop installation, it’s a good practice to create a dedicated user for Hadoop and set up the necessary directories:
sudo adduser hadoopuser
Give the new user sudo privileges and add them to the users group:
sudo usermod -aG sudo hadoopuser
sudo usermod -aG users hadoopuser
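To confirm the user ended up in the right groups, you can list its memberships:
groups hadoopuser
You should see both sudo and users in the output.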
Step 4. Installing Hadoop on Debian 12.
Go to the official Apache Hadoop website and download the Hadoop distribution that suits your needs. For this guide, we'll use the Hadoop 3.3.6 binary release:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Ensure the download is not corrupted by verifying the SHA-512 checksum:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum -c hadoop-3.3.6.tar.gz.sha512
Next, create a directory for Hadoop and extract the downloaded archive:
sudo mkdir /opt/hadoop
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /opt/hadoop --strip-components=1
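The Hadoop daemons will run as hadoopuser, so hand the installation directory over to that user:
sudo chown -R hadoopuser:hadoopuser /opt/hadoop
It also helps to put Hadoop's scripts on that user's PATH now, since later steps call commands such as hdfs and start-dfs.sh directly. A minimal sketch, assuming the paths used in this guide (adjust JAVA_HOME to match the readlink output from Step 2): as hadoopuser, add the following lines to ~/.bashrc and reload it with source ~/.bashrc:
export HADOOP_HOME=/opt/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin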
Step 5. Configuring Hadoop.
Hadoop’s configuration is essential for its proper functioning. Let’s delve into the necessary configurations.
A. Understanding the Core Hadoop Configuration Files
Hadoop has several XML configuration files, but we'll primarily focus on four: core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml.
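Besides the XML files, Hadoop reads etc/hadoop/hadoop-env.sh at startup, and the daemons will not start unless JAVA_HOME is set there. Assuming the Debian OpenJDK 11 path from Step 2 (adjust to match your system):
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' | sudo tee -a /opt/hadoop/etc/hadoop/hadoop-env.sh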
B. Editing the core-site.xml
Edit the core-site.xml configuration file:
sudo nano /opt/hadoop/etc/hadoop/core-site.xml
Add the following property within the <configuration> tags:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
C. Editing the hdfs-site.xml
Edit the hdfs-site.xml configuration file:
sudo nano /opt/hadoop/etc/hadoop/hdfs-site.xml
Add the following property:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
D. Configuring yarn-site.xml
Edit the yarn-site.xml configuration file:
sudo nano /opt/hadoop/etc/hadoop/yarn-site.xml
Add the following property:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
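On Hadoop 3.x, the upstream single-node setup guide also whitelists a set of environment variables so that YARN containers inherit them; adding this property alongside the one above is a reasonable precaution:
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
</property>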
E. Configuring mapred-site.xml
Edit the mapred-site.xml configuration file:
sudo nano /opt/hadoop/etc/hadoop/mapred-site.xml
Add the following property:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
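With Hadoop 3.x, MapReduce jobs submitted to YARN can fail with class-not-found errors unless the application classpath is declared as well; the upstream single-node guide adds a property along these lines:
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>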
Step 6. Setting Up SSH Authentication.
Hadoop relies on SSH for secure communication between nodes. Let’s set up SSH keys.
Generate SSH keys for the Hadoop user:
sudo su - hadoopuser
ssh-keygen -t rsa -P ""
Copy the public key to the authorized_keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
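SSH is strict about key file permissions; if the login test below still prompts for a password, tighten them:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys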
Test passwordless SSH connectivity to localhost:
ssh localhost
Step 7. Formatting the Hadoop Distributed File System (HDFS).
Before starting Hadoop services, we need to format the Hadoop Distributed File System (HDFS).
Initialize the NameNode:
hdfs namenode -format
Once the HDFS daemons are up (see Step 8 below), create the necessary directories for HDFS:
hdfs dfs -mkdir -p /user/hadoopuser
hdfs dfs -chown hadoopuser:hadoopuser /user/hadoopuser
You can then verify the HDFS status by browsing the NameNode web interface at http://localhost:9870.
Step 8. Starting Hadoop Services.
It’s time to start the Hadoop services. Start the Hadoop NameNode and DataNode:
start-dfs.sh
Start the ResourceManager and NodeManager:
start-yarn.sh
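The JDK ships a small utility called jps that lists running Java processes, which is a quick way to confirm the daemons came up:
jps
You should see processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.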
To ensure everything is running smoothly, check the Hadoop cluster's status using the ResourceManager web interface at http://localhost:8088.
Step 9. Running a Simple Hadoop Job.
Now, let’s test our Hadoop setup by running a simple MapReduce job.
A. Preparing Input Data
Create an input directory and upload a sample text file:
hdfs dfs -mkdir -p /input
hdfs dfs -put /path/to/your/inputfile.txt /input
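If you don't have a suitable file at hand, you can create a small one first; the filename here is purely illustrative:
echo "hello hadoop hello debian" > ~/inputfile.txt
hdfs dfs -put ~/inputfile.txt /input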
B. Running a MapReduce Job
Run a WordCount example:
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /input /output
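When the job finishes, the results land in /output as part files that you can list and print directly:
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-*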
C. Monitoring Job Progress
Monitor the job progress by visiting the ResourceManager web interface.
Step 10. Troubleshooting Common Issues
While Hadoop is powerful, it can be challenging. Here are some common issues and their resolutions.
A. Diagnosing Hadoop Startup Problems
- Check the logs in /opt/hadoop/logs for error messages (a quick grep example follows this list).
- Ensure that all configuration files are correctly edited.
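For example, to surface problems across all daemon logs at once:
grep -ri "ERROR" /opt/hadoop/logs/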
B. Debugging HDFS Issues
- Verify the HDFS status by browsing the NameNode web interface.
- Check for disk space and permissions issues in the data directories.
C. Handling Resource Allocation Problems
- Adjust the resource allocation in the yarn-site.xml file.
- Monitor resource usage in the ResourceManager web interface.
Congratulations! You have successfully installed Apache Hadoop. Thanks for using this tutorial to install the latest version of Apache Hadoop on Debian 12 Bookworm. For additional help or useful information, we recommend you check the official Hadoop website.