-Using Ubuntu 20.04 & Hadoop 3.3.0-
*inspiration taken from https://www.digitalocean.com/community/tutorials/how-to-spin-up-a-hadoop-cluster-with-digitalocean-droplets but changes have been made to reflect updates to Ubuntu and Hadoop*
Initial Configuration (on each node)
Open a terminal on each node and run the following commands:
sudo apt-get update && sudo apt-get -y dist-upgrade
sudo adduser hadoop
sudo usermod -aG sudo hadoop
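If you want to confirm that the new “hadoop” user was added to the sudo group, you can check its group membership; the output should include “sudo”:
groups hadoop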
On each node, change the hostname to something unique. For my setup, I used “namenode” on the name node, and “datanode” on the worker node. Use the following command on each respective node to do this, replacing “<nodename>” with the name you choose:
sudo hostnamectl set-hostname <nodename>
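For example, on my name node this was:
sudo hostnamectl set-hostname namenode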
Reboot each node using the following command:
sudo reboot
Log in to the newly created “hadoop” user. Open a terminal and run the following command to edit the “hosts” file. Comment out the “localhost” entries by adding a preceding “#”, and add the IP addresses and respective hostnames of your Hadoop nodes. See the example below.
sudo nano /etc/hosts
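For example, assuming placeholder private IPs of 10.0.0.1 for the name node and 10.0.0.2 for the worker node, the edited file might look like this:
# 127.0.0.1 localhost
# 127.0.1.1 namenode

# replace these with your nodes' actual IP addresses and hostnames
10.0.0.1 namenode
10.0.0.2 datanode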
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following commands:
sudo apt-get -y install openjdk-8-jdk
sudo apt install openssh-server openssh-client -y
mkdir my-hadoop-install && cd my-hadoop-install
wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
tar xvzf hadoop-3.3.0.tar.gz
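If you want to confirm the Java installation before continuing, check the version; assuming no other JDK is installed, it should report an OpenJDK 1.8 build:
java -version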
Hadoop Environment Configuration (on each node)
Open a terminal on each node and run the following command:
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hadoop-env.sh
Add the following lines anywhere in the file, making sure they aren’t commented out (no preceding “#”):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER="hadoop"
export HDFS_DATANODE_USER="hadoop"
export HDFS_SECONDARYNAMENODE_USER="hadoop"
export YARN_RESOURCEMANAGER_USER="hadoop"
export YARN_NODEMANAGER_USER="hadoop"
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following commands:
source ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hadoop-env.sh
sudo mkdir -p /usr/local/hadoop/hdfs/data
sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs/data
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/core-site.xml
Between “<configuration>” and “</configuration>”, add the following, replacing “master-server-ip” with your master/namenode server’s IP address. Do NOT replace it with each respective server’s own IP, which is what the DigitalOcean guide recommends.
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master-server-ip:9000</value>
</property>
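For reference, using the placeholder IP 10.0.0.1 as the master/namenode address, the finished section of core-site.xml would look roughly like this:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- 10.0.0.1 is a placeholder; use your master/namenode server's IP -->
    <value>hdfs://10.0.0.1:9000</value>
  </property>
</configuration>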
Setup Passwordless SSH
On the master/namenode server, run the following command, and then press “Enter” three times.
ssh-keygen
Run the following command and copy the entire output onto your clipboard.
cat ~/.ssh/id_rsa.pub
Run the following command on both the master/namenode server and any worker nodes, and paste in the output from the previous command.
nano ~/.ssh/authorized_keys
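Alternatively, as long as password login is still enabled on the worker node(s), ssh-copy-id can append the key for you (the placeholder below is a worker IP):
ssh-copy-id hadoop@hadoop-worker-01-server-ip
Either way, make sure the key also ends up in the master’s own authorized_keys file, since the start scripts SSH into the master itself.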
On the master/namenode server, run the following command:
nano ~/.ssh/config
Edit the file using the following format, replacing the placeholder values with your respective master/namenode and worker node IPs.
Host hadoop-master-server-ip
    HostName hadoop-master-server-ip
    User hadoop
    IdentityFile ~/.ssh/id_rsa

Host hadoop-worker-01-server-ip
    HostName hadoop-worker-01-server-ip
    User hadoop
    IdentityFile ~/.ssh/id_rsa
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.
SSH into your worker node(s) from your master/namenode server, replacing “hadoop-worker-01-server-ip” with your respective IP(s).
ssh hadoop@hadoop-worker-01-server-ip
Reply to the prompt with “yes”, and then log out by typing the following (admittedly self-explanatory) command:
logout
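You can also confirm that the master can SSH into itself without a password, since the start scripts will do the same when launching daemons (again, replace the placeholder with your master/namenode server’s IP):
ssh hadoop@hadoop-master-server-ip
logout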
Configure the Master Node
On the master node, run the following command:
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hdfs-site.xml
Between “<configuration>” and “</configuration>”, add the following:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/hdfs/data</value>
</property>
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/mapred-site.xml
Between “<configuration>” and “</configuration>”, add the following, replacing “hadoop-master-server-ip” with your master/namenode server’s IP address.
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>hadoop-master-server-ip:54311</value>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/yarn-site.xml
Between “<configuration>” and “</configuration>”, add the following, replacing “hadoop-master-server-ip” with your master/namenode server’s IP address.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop-master-server-ip</value>
</property>
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/masters
Type in your master/namenode server’s IP address.
hadoop-master-server-ip
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/workers
Type in your worker nodes’ IP addresses, one per line, below the “localhost” entry.
localhost
hadoop-worker-01-server-ip
hadoop-worker-02-server-ip
hadoop-worker-03-server-ip
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.
Configure the Worker Node(s)
On the worker node(s), run the following command:
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hdfs-site.xml
Between “<configuration>” and “</configuration>”, add the following.
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/hdfs/data</value>
</property>
Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.
Starting up Hadoop
On the master/namenode server, run the following commands:
cd ~/my-hadoop-install/hadoop-3.3.0/
sudo ./bin/hdfs namenode -format
sudo ./sbin/start-dfs.sh
./sbin/start-yarn.sh
Verify Functionality
On each of your nodes, run the following command to ensure Hadoop processes are running:
jps
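Because “localhost” is listed in the workers file, the master runs worker daemons as well. Its jps output should look something like the listing below (the process IDs are placeholders and will differ); the worker nodes should show at least DataNode, NodeManager, and Jps.
12001 NameNode
12245 SecondaryNameNode
12530 DataNode
12811 ResourceManager
13097 NodeManager
13382 Jps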
On the master/namenode, visit the following URL in a web browser, replacing “hadoop-master-server-ip” with your own master/namenode server’s IP address:
http://hadoop-master-server-ip:9870
Click on “Datanodes” on the menu bar, and ensure that all of your worker nodes’ IP addresses are showing up on the web GUI.