How To Install Apache Hadoop on Debian 13

Apache Hadoop is the industry-standard open-source framework for storing and processing massive datasets across distributed systems, and if you want to run it on the latest stable Debian release, this guide covers exactly how to install Apache Hadoop on Debian 13 “Trixie” from scratch. Debian 13, released on August 9, 2025, ships with Linux Kernel 6.12 and APT 3.0, making it a rock-solid base for data infrastructure workloads. By the end of this tutorial, you will have a fully working Apache Hadoop 3.4.x single-node cluster with HDFS, YARN, and MapReduce all verified and running. Whether you are a sysadmin setting up a test environment, a developer exploring big data pipelines, or a student learning distributed systems, this guide gives you every command you need with clear explanations of what each one actually does.
What Is Apache Hadoop and Why Does It Matter?
Apache Hadoop is an open-source, Java-based distributed computing platform that lets you store and process large datasets across clusters of commodity hardware. It was originally built on Google’s MapReduce and GFS research papers, and today it powers data pipelines at companies processing petabytes of data every day.
Hadoop has four main components that work together:
- HDFS (Hadoop Distributed File System): Stores data across nodes in blocks with automatic replication for fault tolerance.
- YARN (Yet Another Resource Negotiator): Manages CPU and memory resources across the cluster and schedules application jobs.
- MapReduce: A parallel processing model that runs compute tasks close to where data lives, reducing network overhead.
- Hadoop Common: Shared utilities and libraries that support the other three modules.
Understanding this architecture matters before you install anything. When you run start-dfs.sh, you are launching the NameNode and DataNode processes. When you run start-yarn.sh, you are launching the ResourceManager and NodeManager. Knowing what each process does helps you diagnose problems fast instead of guessing.
Why Debian 13 Trixie Is a Strong Choice for Hadoop
Debian 13 “Trixie” brings real improvements relevant to server workloads. Kernel 6.12 includes the EEVDF scheduler, which handles CPU-intensive jobs more efficiently than the older CFS scheduler — directly useful for MapReduce workloads. APT 3.0 with the new Solver3 backend resolves Java package dependencies more cleanly, which reduces the chance of broken installs.
Debian also applies enhanced security hardening including ROP, COP, and JOP attack mitigations on amd64 and arm64 architectures. For a framework like Hadoop that will often run on production servers, that security baseline matters. Debian’s long-term support cycle and minimal base install also mean less bloat competing for the resources Hadoop needs.
Prerequisites
Before you start, make sure you have the following ready:
- A machine or virtual machine running Debian 13 “Trixie” — a fresh install is strongly recommended to avoid configuration conflicts.
- Minimum hardware: 2 CPU cores, 4 GB RAM, 20 GB free disk space (8 GB RAM recommended for stable YARN operation).
- A user account with sudo privileges.
- Active internet connection to download Java, SSH packages, and the Hadoop binary.
- Basic comfort with the Linux terminal and a text editor such as nano or vim.
- No existing Hadoop installation — leftover configs from a previous install will cause silent daemon failures.
This guide sets up a pseudo-distributed single-node cluster, which is the standard starting point before scaling to multi-node production clusters.
Step 1: Update Your System
Always start with a full system update. This prevents dependency conflicts and ensures you are installing packages against the latest repository state.
sudo apt update && sudo apt upgrade -y
The apt update command refreshes your local package index from Debian’s repositories. The apt upgrade -y command installs all pending upgrades without prompting. Running both together before any new software installation is a fundamental Linux server hygiene practice.
After the update completes, confirm your running kernel version:
uname -r
You should see a 6.12.x kernel version. If you do not, reboot the system to load the updated kernel before proceeding.
Step 2: Install Java (OpenJDK 11)
Hadoop is a Java application. It will not start without a compatible JDK installed on the system. OpenJDK 11 is the recommended version for Apache Hadoop 3.4.x — Java 8 is approaching end-of-life and Java 17+ introduces module restrictions that can cause classpath issues with older Hadoop internals.
Install OpenJDK 11:
sudo apt install openjdk-11-jdk -y
Verify the installation:
java -version
Expected output:
openjdk version "11.0.x" 2024-xx-xx
OpenJDK Runtime Environment (build 11.0.x+x-Debian-x)
OpenJDK 64-Bit Server VM (build 11.0.x+x-Debian-x, mixed mode, sharing)
Now find the exact path to your Java installation:
readlink -f $(which java)
The output will look like /usr/lib/jvm/java-11-openjdk-amd64/bin/java. Your JAVA_HOME is everything before /bin/java, so in this case: /usr/lib/jvm/java-11-openjdk-amd64. Write this down — you will use it in Steps 4 and 6.
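If you would rather not copy the path by hand, the shell can derive JAVA_HOME for you by stripping the trailing /bin/java. A small sketch: the hard-coded path below mirrors the example output above, and on a live system you would assign JAVA_BIN from readlink -f $(which java) instead:

```shell
# Derive JAVA_HOME by stripping the trailing /bin/java from the binary path.
# On a live system: JAVA_BIN=$(readlink -f "$(which java)")
JAVA_BIN=/usr/lib/jvm/java-11-openjdk-amd64/bin/java
JAVA_HOME=${JAVA_BIN%/bin/java}
echo "$JAVA_HOME"   # prints /usr/lib/jvm/java-11-openjdk-amd64
```

The ${var%pattern} expansion removes the shortest matching suffix, which is exactly the /bin/java tail you need to drop.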
Step 3: Install SSH and rsync
Hadoop uses SSH to launch daemons across nodes. Even on a single-node cluster, the startup scripts SSH into localhost to launch each daemon. If SSH is not working correctly, your HDFS daemons will silently fail to start with no obvious error message.
Install OpenSSH server, client, and rsync:
sudo apt install openssh-server openssh-client rsync -y
Enable and start the SSH service:
sudo systemctl enable ssh && sudo systemctl start ssh
Verify SSH is running:
sudo systemctl status ssh
You should see active (running) in the output. The rsync package is used by Hadoop to sync configuration and data blocks between nodes during cluster operations and rolling upgrades.
Step 4: Create a Dedicated Hadoop User
Running Hadoop as root is a security risk. A dedicated system user isolates Hadoop processes and their file permissions from the rest of the system.
Create the hadoop user:
sudo adduser hadoop
sudo usermod -aG sudo hadoop
Switch to the new user:
su - hadoop
Now generate an SSH key pair for passwordless localhost authentication. Hadoop’s startup scripts require this to function:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
The -P "" flag sets an empty passphrase. This is intentional — Hadoop’s scripts run non-interactively and cannot handle passphrase prompts.
Add the public key to the authorized keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
The chmod 0600 command restricts the file so only the owner can read and write it. SSH will refuse to use authorized_keys files with looser permissions.
Test that passwordless SSH works:
ssh localhost
If you connect without being asked for a password and see a shell prompt, SSH is configured correctly. Type exit to return to the hadoop user session.
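For scripted setups, you can run the same test non-interactively. This is a hypothetical helper, not part of Hadoop: BatchMode=yes makes ssh fail immediately instead of falling back to a password prompt, which is exactly the failure mode Hadoop's startup scripts would hit:

```shell
# Returns success only if key-based (passwordless) SSH works; BatchMode
# disables password prompts so a broken key setup fails fast.
check_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

if check_ssh localhost; then
  echo "passwordless SSH OK"
else
  echo "passwordless SSH FAILED"
fi
```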
Step 5: Download and Install Apache Hadoop
Download the Hadoop Binary
You will download the official pre-built binary from the Apache Hadoop project. Do not compile from source for a standard installation — it adds unnecessary complexity and build-time dependencies.
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
Verify the Checksum
Always verify the download against the official SHA-512 checksum published on the Apache Hadoop downloads page. This confirms the file was not corrupted or tampered with during download:
sha512sum hadoop-3.4.1.tar.gz
Compare the output against the checksum listed on https://hadoop.apache.org/releases.html. If the hashes do not match, delete the file and re-download it.
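The comparison can also be automated with sha512sum -c. The sketch below demonstrates the flow on a dummy file so it is safe to run anywhere; for the real download you would fetch the published hadoop-3.4.1.tar.gz.sha512 from the same Apache directory and point sha512sum -c at it (assuming the published file uses the standard "HASH  FILENAME" format that sha512sum emits):

```shell
# Demonstrate checksum verification on a dummy file. For the real tarball:
#   wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz.sha512
#   sha512sum -c hadoop-3.4.1.tar.gz.sha512
printf 'sample payload\n' > sample.tar.gz
sha512sum sample.tar.gz > sample.tar.gz.sha512   # stand-in for the published file
sha512sum -c sample.tar.gz.sha512 && echo "checksum OK"
rm -f sample.tar.gz sample.tar.gz.sha512
```

sha512sum -c exits non-zero on any mismatch, so it slots cleanly into provisioning scripts.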
Extract and Move to /opt/hadoop
sudo mkdir /opt/hadoop
sudo tar -xzvf hadoop-3.4.1.tar.gz -C /opt/hadoop --strip-components=1
The --strip-components=1 flag removes the top-level directory from the archive so files extract directly into /opt/hadoop.
sudo chown -R hadoop:hadoop /opt/hadoop
mkdir /opt/hadoop/logs
Step 6: Configure Apache Hadoop on Debian 13
This is the most critical section of the entire Apache Hadoop on Debian 13 setup. Four XML files and one shell script control how Hadoop runs. A single typo in any of these files will prevent daemons from starting.
All configuration files live in /opt/hadoop/etc/hadoop/.
Configure Environment Variables
Open the hadoop user’s .bashrc:
nano ~/.bashrc
Add the following block at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Save and apply:
source ~/.bashrc
Run hadoop version to confirm the PATH is working. You should see output like Hadoop 3.4.1.
Edit hadoop-env.sh
Hadoop’s own startup scripts read hadoop-env.sh independently of .bashrc. Set JAVA_HOME here explicitly to prevent JAVA_HOME is not set errors at startup:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this line:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Configure core-site.xml
This file tells Hadoop where the HDFS NameNode runs:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Configure hdfs-site.xml
This file sets the HDFS replication factor and storage paths. For a single-node cluster, replication is set to 1:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
Create those directories now:
mkdir -p /home/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/hadoop/hadoopdata/hdfs/datanode
Configure mapred-site.xml
This file tells MapReduce to use YARN as its execution framework:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Configure yarn-site.xml
This enables the MapReduce shuffle service on the NodeManager:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
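Because a single malformed tag will stop daemons from starting, it is worth confirming that every file still parses before moving on. A quick sketch using Python's standard-library XML parser (python3 ships with Debian, so no extra packages are needed); the sample file here stands in for the real ones, which you would check with the commented glob:

```shell
# Check that a config file is well-formed XML. On your node, replace the
# sample path with: "$HADOOP_HOME"/etc/hadoop/*-site.xml
cat > /tmp/sample-site.xml <<'EOF'
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF

for f in /tmp/sample-site.xml; do
  python3 -c 'import sys, xml.dom.minidom as m; m.parse(sys.argv[1])' "$f" \
    && echo "$f: well-formed"
done
```

Any malformed file makes the parser raise an error naming the offending line, which is far faster than digging the same information out of daemon logs.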
Step 7: Format the HDFS NameNode
Before you start Hadoop for the first time, you must format the HDFS NameNode. This initializes the filesystem metadata on disk.
hdfs namenode -format
Important warning: Only run this command once. Re-formatting a NameNode that already has data destroys all HDFS metadata permanently. If you ever need to wipe and start fresh, delete the namenode and datanode directories first, then re-format.
A successful format will print a line containing Storage directory ... has been successfully formatted near the end of the output. If you see errors about JAVA_HOME, go back to Step 6 and verify hadoop-env.sh.
Step 8: Start Hadoop Services and Verify
Start HDFS
start-dfs.sh
This command starts the NameNode, DataNode, and SecondaryNameNode processes. You will see SSH connections to localhost as each daemon launches.
Start YARN
start-yarn.sh
This starts the ResourceManager and NodeManager.
Verify All Daemons Are Running
jps
Expected output:
12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps
If any of these five processes is missing, check the logs immediately:
grep -i error $HADOOP_HOME/logs/*.log
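To make the jps check scriptable, you can diff the reported daemons against the expected set. This is a hypothetical helper, not a Hadoop tool; it reads jps output from stdin, so it can be exercised with canned input as well as on a live node:

```shell
# List which of the five expected daemons appear in jps output (stdin).
check_daemons() {
  running=$(cat)
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    if echo "$running" | grep -qw "$d"; then
      echo "$d: running"
    else
      echo "$d: MISSING"
    fi
  done
}

# On a live node: jps | check_daemons
# Demo with canned output:
printf '4301 NameNode\n4402 DataNode\n5100 Jps\n' | check_daemons
```

Note the -w flag: without it, grep for NameNode would also match SecondaryNameNode and mask a missing daemon.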
Access the Web UIs
Open a browser and verify the dashboards are live:
- HDFS NameNode UI: http://localhost:9870 — filesystem health, block reports, DataNode list.
- YARN ResourceManager UI: http://localhost:8088 — cluster resources, application history.
If you are on a remote server, use SSH port forwarding:
ssh -L 9870:localhost:9870 -L 8088:localhost:8088 hadoop@your-server-ip
Step 9: Run a Test MapReduce Job
Validate your full Apache Hadoop on Debian 13 setup by running Hadoop’s built-in WordCount example. This test exercises HDFS, YARN, and MapReduce together in a single job.
Create an HDFS input directory and upload sample files:
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/input
Run the WordCount job:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output
Monitor the job in the YARN UI at http://localhost:8088. You will see the job move through Map and Reduce phases and reach 100% completion.
View the results:
hdfs dfs -cat /user/hadoop/output/part-r-00000
A working output here confirms your entire Hadoop stack — HDFS storage, YARN scheduling, and MapReduce processing — is functioning correctly end to end. Note that MapReduce refuses to overwrite an existing output directory; if you want to re-run the job, delete it first:
hdfs dfs -rm -r /user/hadoop/output
Step 10: Stop and Restart Hadoop Services
Always shut Hadoop down cleanly. Killing processes without using the stop scripts can leave HDFS in an inconsistent state and corrupt block metadata.
Stop YARN first, then HDFS:
stop-yarn.sh
stop-dfs.sh
To start Hadoop again after a system reboot, run the start scripts in the opposite order from shutdown — HDFS first, then YARN:
start-dfs.sh
start-yarn.sh
Hadoop daemons do not start automatically after a reboot by default. If you want them to start on boot, create a simple systemd service file or add the start commands to a startup script.
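If you do want Hadoop to come up on boot, a unit file along these lines is one way to do it. Treat this as a sketch under assumptions, not a tested production unit: the paths match the layout used in this guide, and Type=oneshot with RemainAfterExit=yes suits the start scripts, which launch the daemons and then exit. Save it as /etc/systemd/system/hadoop.service, then run sudo systemctl daemon-reload && sudo systemctl enable hadoop.

```ini
# /etc/systemd/system/hadoop.service -- sketch, adjust paths to your setup
[Unit]
Description=Apache Hadoop single-node cluster
After=network.target ssh.service

[Service]
Type=oneshot
RemainAfterExit=yes
User=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop
ExecStart=/opt/hadoop/sbin/start-dfs.sh
ExecStart=/opt/hadoop/sbin/start-yarn.sh
ExecStop=/opt/hadoop/sbin/stop-yarn.sh
ExecStop=/opt/hadoop/sbin/stop-dfs.sh

[Install]
WantedBy=multi-user.target
```

systemd only allows multiple ExecStart= lines with Type=oneshot, which is why that type is used here; the paired ExecStop= lines preserve the clean shutdown order described above.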
Troubleshooting Common Errors
Even with careful configuration, a few errors come up repeatedly on Hadoop installations. Here is how to resolve the most common ones:
- JAVA_HOME is not set at startup: hadoop-env.sh either has no JAVA_HOME line or points to a wrong path. Verify the correct path with update-alternatives --config java, then set JAVA_HOME explicitly in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
- DataNode does not appear in jps output: The most common cause is a cluster ID mismatch after accidentally re-running hdfs namenode -format. Fix this by deleting the DataNode data directory (/home/hadoop/hadoopdata/hdfs/datanode/*) and restarting HDFS without re-formatting.
- SSH connection refused on localhost: The SSH service is not running. Start it with sudo systemctl start ssh and verify with sudo systemctl status ssh. Also confirm that ~/.ssh/authorized_keys has chmod 0600 permissions.
- Port 9000 or 9870 already in use: Another process is occupying the port. Identify it with ss -tulnp | grep 9000 and either kill the conflicting process or change Hadoop’s port in core-site.xml.
- Permission denied on HDFS file operations: The HDFS directory does not belong to the hadoop user. Fix it with: hdfs dfs -chown -R hadoop /user/hadoop
When in doubt, always check $HADOOP_HOME/logs/ first. Hadoop’s log files are detailed and almost always identify the exact failure within the first few lines of the error stack.
Congratulations! You have successfully installed Apache Hadoop. Thanks for using this tutorial to install the latest version of Apache Hadoop on Debian 13 “Trixie”. For additional help or useful information, we recommend you check the official Hadoop website.