How To Install Apache Hadoop on Ubuntu 20.04 LTS

In this tutorial, we will show you how to install Apache Hadoop on Ubuntu 20.04 LTS. Apache Hadoop is an open-source framework for distributed storage and distributed processing of big data on clusters of commodity hardware. Rather than relying on hardware to deliver high availability, the library is designed to detect and handle failures at the application layer, so it can deliver a highly available service on top of a cluster of computers, each of which may be prone to failure.

This article assumes you have at least basic knowledge of Linux, know how to use the shell, and run your own server or VPS. The installation is quite simple and assumes you are running as the root account; if not, you may need to add 'sudo' to the commands to get root privileges. I will walk you through the step-by-step installation of Apache Hadoop on Ubuntu 20.04 (Focal Fossa). You can follow the same instructions for Ubuntu 18.04, 16.04, and other Debian-based distributions such as Linux Mint.

Install Apache Hadoop on Ubuntu 20.04 LTS Focal Fossa

Step 1. First, make sure that all your system packages are up-to-date by running the following apt commands in the terminal.

sudo apt update
sudo apt upgrade

Step 2. Installing Java.

To run Hadoop, you need Java installed on your machine. Hadoop 3.3 runs on Java 8 or Java 11, and on Ubuntu 20.04 the default-jdk package installs OpenJDK 11. To install it, use the following command:

sudo apt install default-jdk default-jre

Once installed, you can verify the installed version of Java with the following command:

java -version

Step 3. Create Hadoop User.

First, create a new group and user named hadoop. The later steps run sudo as this user, so also add it to the sudo group:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
sudo usermod -aG sudo hadoop

Next, switch to the hadoop user, generate an SSH key pair, and authorize the public key for logins to localhost:

su - hadoop
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

After that, verify passwordless SSH with the following command:

ssh localhost

Once you can log in without a password, type exit to close the test session and proceed to the next step.
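
If the login still prompts for a password, file permissions are a common culprit. The following is a minimal sketch, assuming GNU stat (as shipped on Ubuntu), that checks whether the .ssh directory and authorized_keys carry the modes 700 and 600. The function name check_ssh_perms is a hypothetical helper, not part of OpenSSH.

```shell
# Sketch: check that the SSH directory and authorized_keys file have
# the modes 700 and 600. check_ssh_perms is a hypothetical helper name.
check_ssh_perms() {
    ssh_dir="${1:-$HOME/.ssh}"
    # stat -c '%a' prints the octal permission bits (GNU coreutils).
    dir_mode=$(stat -c '%a' "$ssh_dir") || return 1
    key_mode=$(stat -c '%a' "$ssh_dir/authorized_keys") || return 1
    [ "$dir_mode" = "700" ] && [ "$key_mode" = "600" ]
}

# Example use (commented out so that sourcing this file has no side effects):
# check_ssh_perms || echo "check permissions on ~/.ssh"
#
# A non-interactive login test; BatchMode makes ssh fail instead of prompting:
# ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true && echo "passwordless SSH works"
```

The BatchMode variant is handy in scripts because it returns a non-zero exit status rather than waiting for a password.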

Step 4. Installing Apache Hadoop on Ubuntu 20.04.

Now download the latest stable version of Apache Hadoop. At the time of writing, this is version 3.3.0. If the link below no longer resolves, older releases are kept under archive.apache.org/dist/hadoop/common/:

su - hadoop
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
tar -xvzf hadoop-3.3.0.tar.gz
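
Before extracting a downloaded archive, it is worth comparing its SHA-512 digest with the one published alongside the release. The sketch below assumes you have copied the expected hex digest from the project's download page; verify_sha512 is a hypothetical helper name, and the exact layout of the published checksum file is not assumed here.

```shell
# Sketch: compare a file's SHA-512 digest with an expected hex string.
# verify_sha512 FILE EXPECTED_HEX  (hypothetical helper)
verify_sha512() {
    # sha512sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha512sum "$1" | awk '{print $1}')
    # Normalize the expected digest to lowercase before comparing.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example use (the digest is a placeholder you must supply yourself):
# verify_sha512 hadoop-3.3.0.tar.gz "<expected digest>" && echo "checksum OK"
```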

Next, move the extracted directory to /usr/local/hadoop and create a directory for logs:

sudo mv hadoop-3.3.0 /usr/local/hadoop
sudo mkdir /usr/local/hadoop/logs

Change the ownership of the Hadoop directory to the hadoop user:

sudo chown -R hadoop:hadoop /usr/local/hadoop

Step 5. Configure Apache Hadoop.

Set up the environment variables. Edit the ~/.bashrc file and append the following values at the end of the file:

nano ~/.bashrc

Add the following lines:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Apply the environment variables to the currently running session:

source ~/.bashrc
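
To confirm that the new variables took effect, you can check that the Hadoop directories are on the search path. This is a small sketch; path_contains is a hypothetical helper that tests whether a directory appears as a whole entry in a colon-separated path string.

```shell
# Sketch: succeed if directory $1 is an entry of the colon-separated list $2.
# path_contains is a hypothetical helper name.
path_contains() {
    # Wrapping both sides in colons avoids partial matches such as
    # /usr/local/hadoop/bin matching /usr/local/hadoop/bin2.
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use after sourcing ~/.bashrc:
# path_contains "$HADOOP_HOME/bin" "$PATH" && path_contains "$HADOOP_HOME/sbin" "$PATH" \
#     && echo "Hadoop directories are on PATH"
```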

Next, define the Java environment variables in hadoop-env.sh, which holds settings for YARN, HDFS, MapReduce, and related Hadoop projects:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add the following lines (the JAVA_HOME path is where Ubuntu 20.04 installs OpenJDK 11 on amd64):

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"
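
The JAVA_HOME path depends on which JDK package and architecture you have installed. As a rough sketch, you can derive it from the resolved location of the java binary. The helper name java_home_from_binary is hypothetical, and the approach assumes the binary lives at <JAVA_HOME>/bin/java, which holds for OpenJDK 11 on Ubuntu but not for every layout (Java 8 packages, for example, resolve to a jre/bin/java path).

```shell
# Sketch: derive a JAVA_HOME candidate from the full path of the java binary.
# java_home_from_binary is a hypothetical helper name.
java_home_from_binary() {
    # Strip the trailing /bin/java; other paths are printed unchanged.
    printf '%s\n' "${1%/bin/java}"
}

# Example use: resolve the symlink chain behind /usr/bin/java first.
# java_home_from_binary "$(readlink -f "$(command -v java)")"
```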

You can now verify the Hadoop version using the following command:

hadoop version

Step 6. Configure core-site.xml File.

Open the core-site.xml file in a text editor:

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following lines:

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://0.0.0.0:9000</value>
      <description>The default file system URI</description>
   </property>
</configuration>

Step 7. Configure hdfs-site.xml File.

Use the following command to open the hdfs-site.xml file for editing:

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following lines:

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///home/hadoop/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///home/hadoop/hdfs/datanode</value>
   </property>
</configuration>
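
Hadoop typically creates the NameNode and DataNode directories itself, but creating them in advance surfaces permission problems early. A minimal sketch, assuming the /home/hadoop/hdfs base path used in the configuration above; make_hdfs_dirs is a hypothetical helper name.

```shell
# Sketch: create the local storage directories referenced in hdfs-site.xml.
# make_hdfs_dirs [BASE]  (hypothetical helper; BASE defaults to ~/hdfs)
make_hdfs_dirs() {
    base="${1:-$HOME/hdfs}"
    # -p creates missing parents and is a no-op if the directories exist.
    mkdir -p "$base/namenode" "$base/datanode"
}

# Example use, run as the hadoop user so that the ownership is correct:
# make_hdfs_dirs /home/hadoop/hdfs
```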

Step 8. Configure mapred-site.xml File.

Use the following command to open the mapred-site.xml file:

sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following lines:

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Step 9. Configure yarn-site.xml File.

Open the yarn-site.xml file in a text editor:

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following lines:

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
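
With all four files edited, it can help to read a value back and confirm that it was saved as intended. The sketch below is a deliberately simple awk-based reader, not an XML parser: it assumes each <name> and <value> sits on its own line, as in the snippets in this tutorial. The function name get_property is hypothetical. On a working installation, the hdfs getconf -confKey command is the more robust choice.

```shell
# Sketch: print the <value> that follows a given <name> in a Hadoop config file.
# get_property FILE KEY  (hypothetical helper; assumes one element per line)
get_property() {
    awk -v key="$2" '
        # Remember that the requested property name has been seen.
        index($0, "<name>" key "</name>") { found = 1; next }
        # On the next line containing <value>, strip the tags and print.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# Example use:
# get_property "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" dfs.replication
# get_property "$HADOOP_HOME/etc/hadoop/mapred-site.xml" mapreduce.framework.name
```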

Step 10. Format HDFS NameNode.

Now switch to the hadoop user and format the HDFS NameNode with the following commands:

su - hadoop
hdfs namenode -format
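
Formatting an existing NameNode erases its file system metadata, so it should only be done once on a fresh installation. As a hedged sketch, a script can check for the current/VERSION file that a formatted NameNode directory contains before deciding to format. The helper name namenode_formatted is hypothetical, and the directory path is the one configured in hdfs-site.xml.

```shell
# Sketch: succeed if the given NameNode directory already looks formatted.
# namenode_formatted DIR  (hypothetical helper)
namenode_formatted() {
    # A formatted NameNode directory contains current/VERSION.
    [ -f "$1/current/VERSION" ]
}

# Example use:
# if namenode_formatted /home/hadoop/hdfs/namenode; then
#     echo "NameNode already formatted, skipping"
# else
#     hdfs namenode -format
# fi
```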

Step 11. Start the Hadoop Cluster.

Now start the NameNode and DataNode with the following command:

start-dfs.sh

Then, start the YARN ResourceManager and NodeManager:

start-yarn.sh

Watch the output to confirm that the script starts a DataNode on each worker node in turn (here, only localhost). To check that all services started successfully, use the 'jps' command:

jps
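
On a single-node setup like this one, jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. The sketch below reads jps output on standard input and prints any of those five that are absent. The function name missing_daemons is hypothetical, and it assumes the usual "<pid> <name>" output format.

```shell
# Sketch: read jps output on stdin and print the expected daemons not found.
# missing_daemons is a hypothetical helper name.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Match the second column exactly so that "NameNode" does not
        # also match "SecondaryNameNode".
        printf '%s\n' "$jps_output" \
            | awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }' \
            || printf '%s\n' "$daemon"
    done
}

# Example use:
# jps | missing_daemons
```

An empty result means all five daemons are running. Any name printed points to the log file to inspect under /usr/local/hadoop/logs.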

Step 12. Accessing Apache Hadoop.

The default port number 9870 gives you access to the Hadoop NameNode UI:

http://your-server-ip:9870

The default port 9864 is used to access individual DataNodes directly from your browser:

http://your-server-ip:9864

The YARN Resource Manager is accessible on port 8088:

http://your-server-ip:8088
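
If you have no browser at hand, the three interfaces can also be probed from the command line. The following sketch builds the URLs for the ports listed above. The function name ui_urls is hypothetical, and the commented loop assumes curl is installed.

```shell
# Sketch: print the web UI URLs for a given host, one per line.
# ui_urls HOST  (hypothetical helper)
ui_urls() {
    # 9870: NameNode UI, 9864: DataNode UI, 8088: YARN Resource Manager
    for port in 9870 9864 8088; do
        printf 'http://%s:%s\n' "$1" "$port"
    done
}

# Example use: print the HTTP status code returned by each interface.
# ui_urls localhost | while read -r url; do
#     printf '%s -> %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
# done
```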

Congratulations! You have successfully installed Hadoop. Thanks for using this tutorial to install Apache Hadoop on your Ubuntu 20.04 LTS Focal Fossa system. For additional help or useful information, we recommend checking the official Apache Hadoop website.
