
How To Install Apache Hadoop on Debian 13


Apache Hadoop is the industry-standard open-source framework for storing and processing massive datasets across distributed systems, and if you want to run it on the latest stable Debian release, this guide covers exactly how to install Apache Hadoop on Debian 13 “Trixie” from scratch. Debian 13, released on August 9, 2025, ships with Linux Kernel 6.12 and APT 3.0, making it a rock-solid base for data infrastructure workloads. By the end of this tutorial, you will have a fully working Apache Hadoop 3.4.x single-node cluster with HDFS, YARN, and MapReduce all verified and running. Whether you are a sysadmin setting up a test environment, a developer exploring big data pipelines, or a student learning distributed systems, this guide gives you every command you need with clear explanations of what each one actually does.

What Is Apache Hadoop and Why Does It Matter?

Apache Hadoop is an open-source, Java-based distributed computing platform that lets you store and process large datasets across clusters of commodity hardware. It was originally built on Google’s MapReduce and GFS research papers, and today it powers data pipelines at companies processing petabytes of data every day.

Hadoop has four main components that work together:

  • HDFS (Hadoop Distributed File System): Stores data across nodes in blocks with automatic replication for fault tolerance.
  • YARN (Yet Another Resource Negotiator): Manages CPU and memory resources across the cluster and schedules application jobs.
  • MapReduce: A parallel processing model that runs compute tasks close to where data lives, reducing network overhead.
  • Hadoop Common: Shared utilities and libraries that support the other three modules.

Understanding this architecture matters before you install anything. When you run start-dfs.sh, you are launching the NameNode and DataNode processes. When you run start-yarn.sh, you are launching the ResourceManager and NodeManager. Knowing what each process does helps you diagnose problems fast instead of guessing.

Why Debian 13 Trixie Is a Strong Choice for Hadoop

Debian 13 “Trixie” brings real improvements relevant to server workloads. Kernel 6.12 includes the EEVDF scheduler, which handles CPU-intensive jobs more efficiently than the older CFS scheduler — directly useful for MapReduce workloads. APT 3.0 with the new Solver3 backend resolves Java package dependencies more cleanly, which reduces the chance of broken installs.

Debian also applies enhanced security hardening including ROP, COP, and JOP attack mitigations on amd64 and arm64 architectures. For a framework like Hadoop that will often run on production servers, that security baseline matters. Debian’s long-term support cycle and minimal base install also mean less bloat competing for the resources Hadoop needs.

Prerequisites

Before you start, make sure you have the following ready:

  • A machine or virtual machine running Debian 13 “Trixie” — a fresh install is strongly recommended to avoid configuration conflicts.
  • Minimum hardware: 2 CPU cores, 4 GB RAM, 20 GB free disk space (8 GB RAM recommended for stable YARN operation).
  • A user account with sudo privileges.
  • Active internet connection to download Java, SSH packages, and the Hadoop binary.
  • Basic comfort with the Linux terminal and a text editor such as nano or vim.
  • No existing Hadoop installation — leftover configs from a previous install will cause silent daemon failures.

This guide sets up a pseudo-distributed single-node cluster, which is the standard starting point before scaling to multi-node production clusters.
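Before moving on, a quick sketch like the following (assuming a standard GNU coreutils environment, which Debian 13 provides) can confirm the machine meets the minimums above:

```shell
# Print the hardware facts that matter for the minimums: 2 cores, 4 GB RAM, 20 GB free disk
echo "CPU cores: $(nproc)"
echo "RAM (MB):  $(free -m | awk '/^Mem:/{print $2}')"
echo "Free on /: $(df -h --output=avail / | tail -1)"
```

If any number falls below the minimum, fix the hardware allocation before continuing; undersized RAM in particular shows up later as YARN containers being killed.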

Step 1: Update Your System

Always start with a full system update. This prevents dependency conflicts and ensures you are installing packages against the latest repository state.

sudo apt update && sudo apt upgrade -y

The apt update command refreshes your local package index from Debian’s repositories. The apt upgrade -y command installs all pending upgrades without prompting. Running both together before any new software installation is a fundamental Linux server hygiene practice.

After the update completes, confirm your running kernel version:

uname -r

You should see a 6.12.x kernel version. If you do not, reboot the system to load the updated kernel before proceeding.

Step 2: Install Java (OpenJDK 11)

Hadoop is a Java application. It will not start without a compatible JDK installed on the system. OpenJDK 11 is the recommended version for Apache Hadoop 3.4.x — Java 8 is approaching end-of-life and Java 17+ introduces module restrictions that can cause classpath issues with older Hadoop internals.

Install OpenJDK 11:

sudo apt install openjdk-11-jdk -y

Verify the installation:

java -version

Expected output:

openjdk version "11.0.x" 2024-xx-xx
OpenJDK Runtime Environment (build 11.0.x+x-Debian-x)
OpenJDK 64-Bit Server VM (build 11.0.x+x-Debian-x, mixed mode, sharing)

Now find the exact path to your Java installation:

readlink -f $(which java)

The output will look like /usr/lib/jvm/java-11-openjdk-amd64/bin/java. Your JAVA_HOME is everything before /bin/java, so in this case: /usr/lib/jvm/java-11-openjdk-amd64. Write this down — you will use it in Steps 4 and 6.
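If you prefer not to copy the path by hand, the same value can be derived with shell parameter expansion — a small sketch that assumes java resolves to the usual .../bin/java layout:

```shell
# Resolve the real java binary, then strip the trailing /bin/java to get JAVA_HOME
JAVA_BIN=$(readlink -f "$(which java)")
export JAVA_HOME="${JAVA_BIN%/bin/java}"
echo "$JAVA_HOME"
```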

Step 3: Install SSH and rsync

Hadoop uses SSH to launch daemons across nodes. Even on a single-node cluster, the startup scripts SSH into localhost to launch each daemon. If SSH is not working correctly, your HDFS daemons will silently fail to start with no obvious error message.

Install OpenSSH server, client, and rsync:

sudo apt install openssh-server openssh-client rsync -y

Enable and start the SSH service:

sudo systemctl enable ssh && sudo systemctl start ssh

Verify SSH is running:

sudo systemctl status ssh

You should see active (running) in the output. The rsync package is used by Hadoop to sync configuration and data blocks between nodes during cluster operations and rolling upgrades.

Step 4: Create a Dedicated Hadoop User

Running Hadoop as root is a security risk. A dedicated system user isolates Hadoop processes and their file permissions from the rest of the system.

Create the hadoop user:

sudo adduser hadoop
sudo usermod -aG sudo hadoop

Switch to the new user:

su - hadoop

Now generate an SSH key pair for passwordless localhost authentication. Hadoop’s startup scripts require this to function:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

The -P "" flag sets an empty passphrase. This is intentional — Hadoop’s scripts run non-interactively and cannot handle passphrase prompts.

Add the public key to the authorized keys file:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

The chmod 0600 command restricts the file so only the owner can read and write it. SSH will refuse to use authorized_keys files with looser permissions.

Test that passwordless SSH works:

ssh localhost

If you connect without being asked for a password and see a shell prompt, SSH is configured correctly. Type exit to return to the hadoop user session.
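For a non-interactive check that is handy in scripts, BatchMode makes SSH fail immediately instead of falling back to a password prompt. This is an optional sketch, not something Hadoop itself requires:

```shell
# BatchMode=yes refuses interactive authentication, so this fails fast
# (non-zero exit) if key-based login to localhost is broken
ssh -o BatchMode=yes -o ConnectTimeout=5 localhost 'echo SSH OK'
```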

Step 5: Download and Install Apache Hadoop

Download the Hadoop Binary

You will download the official pre-built binary from the Apache Hadoop project. Do not compile from source for a standard installation — it adds unnecessary complexity and build-time dependencies.

wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz

Verify the Checksum

Always verify the download against the official SHA-512 checksum published on the Apache Hadoop downloads page. This confirms the file was not corrupted or tampered with during download:

sha512sum hadoop-3.4.1.tar.gz

Compare the output against the checksum listed on https://hadoop.apache.org/releases.html. If the hashes do not match, delete the file and re-download it.
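The comparison can also be automated. Apache publishes a .sha512 file alongside each release artifact; assuming it uses the `hash  filename` format that sha512sum -c expects (confirm on the downloads page), one command does the verification:

```shell
# Fetch the published checksum file and verify the tarball against it;
# sha512sum -c exits non-zero and prints FAILED on a mismatch
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz.sha512
sha512sum -c hadoop-3.4.1.tar.gz.sha512
```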

Extract and Move to /opt/hadoop

sudo mkdir /opt/hadoop
sudo tar -xzvf hadoop-3.4.1.tar.gz -C /opt/hadoop --strip-components=1

The --strip-components=1 flag removes the top-level directory from the archive so files extract directly into /opt/hadoop.

sudo chown -R hadoop:hadoop /opt/hadoop
mkdir /opt/hadoop/logs

Step 6: Configure Apache Hadoop on Debian 13

This is the most critical section of the entire Apache Hadoop on Debian 13 setup. Four XML files and one shell script control how Hadoop runs. A single typo in any of these files will prevent daemons from starting.

All configuration files live in /opt/hadoop/etc/hadoop/.

Configure Environment Variables

Open the hadoop user’s .bashrc:

nano ~/.bashrc

Add the following block at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Save and apply:

source ~/.bashrc

Run hadoop version to confirm the PATH is working. You should see output like Hadoop 3.4.1.

Edit hadoop-env.sh

Hadoop’s own startup scripts read hadoop-env.sh independently of .bashrc. Set JAVA_HOME here explicitly to prevent “JAVA_HOME is not set” errors at startup:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add this line:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Configure core-site.xml

This file tells Hadoop where the HDFS NameNode runs:

nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Configure hdfs-site.xml

This file sets the HDFS replication factor and storage paths. For a single-node cluster, replication is set to 1:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>

Create those directories now:

mkdir -p /home/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/hadoop/hadoopdata/hdfs/datanode

Configure mapred-site.xml

This file tells MapReduce to use YARN as its execution framework:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Configure yarn-site.xml

This enables the MapReduce shuffle service on the NodeManager:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
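Since a single typo in these files prevents daemons from starting, it is worth syntax-checking all four before going further. This sketch assumes xmllint is available (on Debian it comes from the libxml2-utils package):

```shell
# Validate each edited config file is well-formed XML;
# xmllint --noout prints nothing on success and an error on failure
for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
  xmllint --noout "$HADOOP_HOME/etc/hadoop/$f" && echo "$f OK"
done
```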

Step 7: Format the HDFS NameNode

Before you start Hadoop for the first time, you must format the HDFS NameNode. This initializes the filesystem metadata on disk.

hdfs namenode -format

Important warning: Only run this command once. Re-formatting a NameNode that already has data destroys all HDFS metadata permanently. If you ever need to wipe and start fresh, delete the namenode and datanode directories first, then re-format.

A successful format will print a line containing Storage directory ... has been successfully formatted near the end of the output. If you see errors about JAVA_HOME, go back to Step 6 and verify hadoop-env.sh.
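If you do need to wipe and start fresh, the sequence below (using this guide’s directory paths) avoids leaving the DataNode with a stale cluster ID:

```shell
# Destroys ALL HDFS data -- only for a deliberate fresh start
stop-dfs.sh                                      # stop HDFS daemons first
rm -rf /home/hadoop/hadoopdata/hdfs/namenode/*   # clear NameNode metadata
rm -rf /home/hadoop/hadoopdata/hdfs/datanode/*   # clear DataNode blocks
hdfs namenode -format                            # both sides now get matching cluster IDs
```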

Step 8: Start Hadoop Services and Verify

Start HDFS

start-dfs.sh

This command starts the NameNode, DataNode, and SecondaryNameNode processes. You will see SSH connections to localhost as each daemon launches.

Start YARN

start-yarn.sh

This starts the ResourceManager and NodeManager.

Verify All Daemons Are Running

jps

Expected output:

12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps

If any of these five processes is missing, check the logs immediately:

grep -i error $HADOOP_HOME/logs/*.log

Access the Web UIs

Open a browser and verify the dashboards are live:

  • HDFS NameNode UI: http://localhost:9870 — filesystem health, block reports, DataNode list.
  • YARN ResourceManager UI: http://localhost:8088 — cluster resources, application history.
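The same health checks can be run from the terminal with Hadoop’s standard CLI tools, which is useful on headless servers:

```shell
# CLI equivalents of the two web dashboards
hdfs dfsadmin -report   # capacity, live/dead DataNodes, block info
yarn node -list         # registered NodeManagers and their state
```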

If you are on a remote server, use SSH port forwarding:

ssh -L 9870:localhost:9870 -L 8088:localhost:8088 hadoop@your-server-ip

Step 9: Run a Test MapReduce Job

Validate your full Apache Hadoop on Debian 13 setup by running Hadoop’s built-in WordCount example. This test exercises HDFS, YARN, and MapReduce together in a single job.

Create an HDFS input directory and upload sample files:

hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/input

Run the WordCount job:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output

Monitor the job in the YARN UI at http://localhost:8088. You will see the job move through Map and Reduce phases and reach 100% completion.

View the results:

hdfs dfs -cat /user/hadoop/output/part-r-00000

A working output here confirms your entire Hadoop stack — HDFS storage, YARN scheduling, and MapReduce processing — is functioning correctly end to end.

Step 10: Stop and Restart Hadoop Services

Always shut Hadoop down cleanly. Killing processes without using the stop scripts can leave HDFS in an inconsistent state and corrupt block metadata.

Stop YARN first, then HDFS:

stop-yarn.sh
stop-dfs.sh

To start Hadoop after a system reboot, run the start scripts in the opposite order of the stop sequence — HDFS first, then YARN:

start-dfs.sh
start-yarn.sh

Hadoop daemons do not start automatically after a reboot by default. If you want them to start on boot, create a simple systemd service file or add the start commands to a startup script.
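As a sketch of the systemd approach, a minimal oneshot unit (hypothetical name hadoop.service; paths and JAVA_HOME assume this guide’s layout) could look like this:

```shell
# Write a hypothetical hadoop.service unit that runs the start/stop
# scripts as the hadoop user at boot, then enable it
sudo tee /etc/systemd/system/hadoop.service > /dev/null <<'EOF'
[Unit]
Description=Apache Hadoop single-node cluster
After=network.target ssh.service

[Service]
Type=oneshot
RemainAfterExit=yes
User=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/hadoop/sbin/start-dfs.sh
ExecStart=/opt/hadoop/sbin/start-yarn.sh
ExecStop=/opt/hadoop/sbin/stop-yarn.sh
ExecStop=/opt/hadoop/sbin/stop-dfs.sh

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable hadoop
```

Type=oneshot with RemainAfterExit=yes fits here because the start scripts launch the daemons and exit; systemd then treats the service as active until the stop commands run.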

Troubleshooting Common Errors

Even with careful configuration, a few errors come up repeatedly on Hadoop installations. Here is how to resolve the most common ones:

  1. “JAVA_HOME is not set” at startup: hadoop-env.sh either has no JAVA_HOME line or points to a wrong path. Verify the correct path with update-alternatives --config java, then set JAVA_HOME explicitly in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
  2. DataNode does not appear in jps output: The most common cause is a cluster ID mismatch after accidentally re-running hdfs namenode -format. Fix this by deleting the DataNode data directory (/home/hadoop/hadoopdata/hdfs/datanode/*) and restarting HDFS without re-formatting.
  3. SSH connection refused on localhost: The SSH service is not running. Start it with sudo systemctl start ssh and verify with sudo systemctl status ssh. Also confirm that ~/.ssh/authorized_keys has chmod 0600 permissions.
  4. Port 9000 or 9870 already in use: Another process is occupying the port. Identify it with ss -tulnp | grep 9000 and either kill the conflicting process or change Hadoop’s port in core-site.xml.
  5. Permission denied on HDFS file operations: The HDFS directory does not belong to the hadoop user. Fix it with:
    hdfs dfs -chown -R hadoop /user/hadoop

When in doubt, always check $HADOOP_HOME/logs/ first. Hadoop’s log files are detailed and almost always identify the exact failure within the first few lines of the error stack.

Congratulations! You have successfully installed Apache Hadoop. Thanks for using this tutorial to install the latest version of Apache Hadoop on Debian 13 “Trixie”. For additional help or useful information, we recommend you check the official Hadoop website.

