In this tutorial, we will show you how to install Apache Hadoop on Debian 9. For those of you who didn’t know, Apache Hadoop is an open-source framework for distributed storage and distributed processing of big data on clusters of commodity hardware. Hadoop stores data in the Hadoop Distributed File System (HDFS), processes that data with MapReduce, and uses YARN to provide an API for requesting and allocating resources in the cluster.
This article assumes you have at least basic knowledge of Linux, know how to use the shell, and most importantly, host your site on your own VPS. The installation is quite simple and assumes you are running as the root account; if not, you may need to prepend ‘sudo’ to the commands to get root privileges. I will walk you through the step-by-step installation of Apache Hadoop on a Debian 9 (Stretch) server.
Install Apache Hadoop on Debian 9 Stretch
Step 1. Before we install any software, it’s important to make sure your system is up to date by running these following apt-get commands in the terminal:
apt-get update
apt-get upgrade
Step 2. Installing Java (OpenJDK).
Apache Hadoop requires Java version 8 or above. You can choose to install either OpenJDK or Oracle JDK:
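On Debian 9 Stretch, OpenJDK 8 is available from the default repositories; a minimal install might look like the following (the `openjdk-8-jdk` package name is an assumption for Stretch — Oracle JDK requires a separate manual download):

```shell
# Install OpenJDK 8 from the Debian repositories
apt-get install openjdk-8-jdk

# Verify that Java is installed and on the PATH
java -version
```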
root@debian:~# java -version
java version "1.8.0_192"
Java(TM) SE Runtime Environment (build 1.8.0_192-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)
Step 3. Installing Apache Hadoop on Debian 9.
To avoid security issues, we recommend setting up a new Hadoop user group and user account to handle all Hadoop-related activities. Run the following commands:
sudo addgroup hadoopgroup
sudo adduser --ingroup hadoopgroup hadoopuser
After creating the user, you also need to set up key-based SSH for this account. To do this, execute the following commands:
su - hadoopuser
ssh-keygen -t rsa -P ""
cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
chmod 600 /home/hadoopuser/.ssh/authorized_keys
ssh-copy-id -i ~/.ssh/id_rsa.pub slave-1
ssh slave-1
Next, download the latest stable version of Apache Hadoop. At the time of writing this article, it is version 3.1.1:
wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar xzf hadoop-3.1.1.tar.gz
mv hadoop-3.1.1 hadoop
Step 4. Setting Up the Apache Hadoop Environment.
Set up the environment variables. Edit the ~/.bashrc file and append the following values at the end of the file:
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Apply environmental variables to the current running session:
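The usual way to do this is to source the file (assuming the exports above were appended to ~/.bashrc):

```shell
# Re-read ~/.bashrc so the new HADOOP_* variables take effect in the current shell
source ~/.bashrc

# Quick sanity check: this should print /home/hadoop/hadoop
echo "$HADOOP_HOME"
```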
Now edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set JAVA_HOME environment variable:
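For example, with the Stretch OpenJDK 8 package the line might look like this (the exact path is an assumption for Debian 9; you can confirm yours with `readlink -f $(which java)`):

```shell
# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Path assumes the Debian 9 openjdk-8-jdk package; adjust for your JDK location
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```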
Hadoop has many configuration files, which need to be configured according to the requirements of your Hadoop infrastructure. Let’s start with the configuration for a basic Hadoop single-node cluster setup:
Edit the $HADOOP_HOME/etc/hadoop/core-site.xml file:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Edit the $HADOOP_HOME/etc/hadoop/mapred-site.xml file:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Edit the $HADOOP_HOME/etc/hadoop/yarn-site.xml file:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
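Since hdfs-site.xml points at directories under /home/hadoop/hadoopdata, make sure they exist before formatting the NameNode (the paths below match the configuration above):

```shell
# Create the NameNode and DataNode storage directories referenced in hdfs-site.xml
mkdir -p /home/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/hadoop/hadoopdata/hdfs/datanode
```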
Now format the NameNode using the following command. Do not forget to check that the storage directories exist:
hdfs namenode -format
To start all Hadoop services, use the following commands:
cd $HADOOP_HOME/sbin/
start-dfs.sh
start-yarn.sh
Observe the output to confirm that it starts the DataNode on the slave nodes one by one. To check that all services started correctly, use the ‘jps‘ command:
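On a healthy single-node setup, jps should list roughly the following daemons (the process IDs below are purely illustrative):

```shell
jps
# Expected daemons on a single-node cluster (PIDs will differ on your system):
# 2354 NameNode
# 2498 DataNode
# 2684 SecondaryNameNode
# 2937 ResourceManager
# 3075 NodeManager
# 3412 Jps
```

If any daemon is missing, check the logs under $HADOOP_HOME/logs for the failing service.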
Step 5. Setup Firewall for Apache Hadoop.
Allow Apache Hadoop through the firewall:
ufw allow 9870/tcp
ufw allow 8088/tcp
ufw reload
Step 6. Accessing Apache Hadoop.
By default, Apache Hadoop 3.x serves the ResourceManager UI on HTTP port 8088 and the NameNode UI on port 9870. Open your favorite browser and navigate to http://yourdomain.com:9870 or http://server-ip:9870.
Congratulations! You have successfully installed Apache Hadoop. Thanks for using this tutorial for installing Apache Hadoop on Debian 9 Stretch systems. For additional help or useful information, we recommend you check the official Apache Hadoop website.