How To Install Apache Hadoop on Debian 12
In this tutorial, we will show you how to install Apache Hadoop on Debian 12. Big Data is the backbone of modern data-driven businesses, and Hadoop has emerged as the go-to solution for processing and analyzing massive datasets. If you’re looking to harness the power of Hadoop on a Debian 12 system, you’re in the right place.
This article assumes you have at least basic knowledge of Linux, know how to use the shell, and, most importantly, host your site on your own VPS. The installation is quite simple and assumes you are running in the root account; if not, you may need to add 'sudo' to the commands to get root privileges. I will show you the step-by-step installation of Apache Hadoop on Debian 12 (Bookworm).
Prerequisites
- A server running Debian 12 (Bookworm).
- It’s recommended that you use a fresh OS install to prevent any potential issues.
- SSH access to the server (or just open Terminal if you’re on a desktop).
- An active internet connection. You’ll need an internet connection to download the necessary packages and dependencies for Apache Hadoop.
- A non-root sudo user or access to the root user. We recommend acting as a non-root sudo user, however, as you can harm your system if you're not careful when acting as root.
Install Apache Hadoop on Debian 12 Bookworm
Step 1. Before we install any software, it's important to make sure your system is up to date by running the following apt commands in the terminal:
sudo apt update
sudo apt upgrade
The first command refreshes the package index; the second installs the latest versions of your software packages, so you start from a current system.
Step 2. Installing Java Development Kit (JDK).
Hadoop relies on Java, so make sure you have a JDK installed. Hadoop 3.3.x officially supports Java 8 and 11. Note that Debian 12's repositories ship OpenJDK 17 by default, so if the openjdk-11-jdk package is not available on your system, you may need to install Java 11 from another source (such as Eclipse Temurin) or use openjdk-17-jdk instead:
sudo apt install openjdk-11-jdk
Verify the Java version using the following command:
java --version
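Hadoop also needs to know where Java lives. On Debian 12, the OpenJDK 11 package usually installs to /usr/lib/jvm/java-11-openjdk-amd64, but that path is an assumption you should confirm on your own system, for example:
readlink -f /usr/bin/java | sed 's|/bin/java||'
Note the directory this prints; we will use it as JAVA_HOME later in the configuration.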
Step 3. Preparing the Hadoop Environment
Before diving into the Hadoop installation, it’s a good practice to create a dedicated user for Hadoop and set up the necessary directories:
sudo adduser hadoopuser
Give the new user sudo privileges and add them to the users group:
sudo usermod -aG sudo hadoopuser
sudo usermod -aG users hadoopuser
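To confirm the user ended up in the right groups, you can list its memberships:
groups hadoopuser
You should see both sudo and users in the output.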
Step 4. Installing Hadoop on Debian 12.
Go to the official Apache Hadoop website and download the Hadoop distribution that suits your needs. For this guide, we'll use the Hadoop 3.3.6 binary release:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Ensure the download is not corrupted by verifying the SHA-512 checksum:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum -c hadoop-3.3.6.tar.gz.sha512
Next, create a directory for Hadoop and extract the downloaded archive:
sudo mkdir /opt/hadoop
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /opt/hadoop --strip-components=1
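The Hadoop daemons will run as hadoopuser, so hand the installation directory over to that user:
sudo chown -R hadoopuser:hadoopuser /opt/hadoop
It also helps to put Hadoop's scripts on that user's PATH now, since later steps call commands such as hdfs and start-dfs.sh directly. A minimal sketch, assuming the paths used in this guide (adjust JAVA_HOME to match the readlink output from Step 2): as hadoopuser, add the following lines to ~/.bashrc and reload it with source ~/.bashrc:
export HADOOP_HOME=/opt/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin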
Step 5. Configuring Hadoop.
Hadoop’s configuration is essential for its proper functioning. Let’s delve into the necessary configurations.
A. Understanding the Core Hadoop Configuration Files
Hadoop has several XML configuration files, but we'll primarily focus on four: core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml.
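Besides the XML files, Hadoop reads etc/hadoop/hadoop-env.sh at startup, and the daemons will not start unless JAVA_HOME is set there. Assuming the Debian OpenJDK 11 path from Step 2 (adjust to match your system):
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' | sudo tee -a /opt/hadoop/etc/hadoop/hadoop-env.sh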
B. Editing the core-site.xml
Edit the core-site.xml configuration file:
sudo nano /opt/hadoop/etc/hadoop/core-site.xml
Add the following property within the <configuration> tags:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
C. Editing the hdfs-site.xml
Edit the hdfs-site.xml configuration file:
sudo nano /opt/hadoop/etc/hadoop/hdfs-site.xml
Add the following property:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
D. Configuring yarn-site.xml
Edit the yarn-site.xml configuration file:
sudo nano /opt/hadoop/etc/hadoop/yarn-site.xml
Add the following property:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
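On Hadoop 3.x, the upstream single-node setup guide also whitelists a set of environment variables so that YARN containers inherit them; adding this property alongside the one above is a reasonable precaution:
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
</property>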
E. Configuring mapred-site.xml
Edit the mapred-site.xml configuration file:
sudo nano /opt/hadoop/etc/hadoop/mapred-site.xml
Add the following property:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
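With Hadoop 3.x, MapReduce jobs submitted to YARN can fail with class-not-found errors unless the application classpath is declared as well; the upstream single-node guide adds a property along these lines:
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>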
Step 6. Setting Up SSH Authentication.
Hadoop relies on SSH for secure communication between nodes. Let’s set up SSH keys.
Generate SSH keys for the Hadoop user:
sudo su - hadoopuser
ssh-keygen -t rsa -P ""
Copy the public key to the authorized_keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
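SSH is strict about key file permissions; if the login test below still prompts for a password, tighten them:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys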
Test passwordless SSH connectivity to localhost:
ssh localhost
Step 7. Formatting the Hadoop Distributed File System (HDFS).
Before starting Hadoop services, we need to format the Hadoop Distributed File System (HDFS).
Initialize the NameNode:
hdfs namenode -format
Once the HDFS daemons are up (see Step 8 below), create the necessary directories for HDFS:
hdfs dfs -mkdir -p /user/hadoopuser
hdfs dfs -chown hadoopuser:hadoopuser /user/hadoopuser
You can then verify the HDFS status by browsing the NameNode web interface at http://localhost:9870.
Step 8. Starting Hadoop Services.
It’s time to start the Hadoop services. Start the Hadoop NameNode and DataNode:
start-dfs.sh
Start the ResourceManager and NodeManager:
start-yarn.sh
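The JDK ships a small utility called jps that lists running Java processes, which is a quick way to confirm the daemons came up:
jps
You should see processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.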
To ensure everything is running smoothly, check the Hadoop cluster's status using the ResourceManager web interface at http://localhost:8088.
Step 9. Running a Simple Hadoop Job.
Now, let’s test our Hadoop setup by running a simple MapReduce job.
A. Preparing Input Data
Create an input directory and upload a sample text file:
hdfs dfs -mkdir -p /input
hdfs dfs -put /path/to/your/inputfile.txt /input
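If you don't have a suitable file at hand, you can create a small one first; the filename here is purely illustrative:
echo "hello hadoop hello debian" > ~/inputfile.txt
hdfs dfs -put ~/inputfile.txt /input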
B. Running a MapReduce Job
Run a WordCount example:
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /input /output
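When the job finishes, the results land in /output as part files that you can list and print directly:
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-*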
C. Monitoring Job Progress
Monitor the job progress by visiting the ResourceManager web interface.
Step 10. Troubleshooting Common Issues
While Hadoop is powerful, it can be challenging. Here are some common issues and their resolutions.
A. Diagnosing Hadoop Startup Problems
- Check the logs in /opt/hadoop/logs for error messages (a quick grep example follows this list).
- Ensure that all configuration files are correctly edited.
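For example, to surface problems across all daemon logs at once:
grep -ri "ERROR" /opt/hadoop/logs/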
B. Debugging HDFS Issues
- Verify the HDFS status by browsing the NameNode web interface.
- Check for disk space and permissions issues in the data directories.
C. Handling Resource Allocation Problems
- Adjust the resource allocation in the yarn-site.xml file.
- Monitor resource usage in the ResourceManager web interface.
Congratulations! You have successfully installed Apache Hadoop. Thanks for using this tutorial to install the latest version of Apache Hadoop on Debian 12 Bookworm. For additional help or useful information, we recommend you check the official Hadoop website.