Configuring Hadoop on Ubuntu 20.04

    -Using Ubuntu 20.04 & Hadoop 3.3.0-

    *inspiration taken from https://www.digitalocean.com/community/tutorials/how-to-spin-up-a-hadoop-cluster-with-digitalocean-droplets but changes have been made to reflect updates to Ubuntu and Hadoop*


    Initial Configuration (on each node)

    Open a terminal on each node and run the following commands:

    sudo apt-get update && sudo apt-get -y dist-upgrade
    sudo adduser hadoop
    sudo usermod -aG sudo hadoop

    On each node, change the hostname to something unique. For my setup, I used “namenode” on the name node, and “datanode” on the worker node. Use the following command on each respective node to do this, replacing “<nodename>” with the name you choose:

    sudo hostnamectl set-hostname <nodename>

    Reboot each node using the following command:

    sudo reboot

    Log in as the newly created “hadoop” user. Open a terminal and run the following command to edit the “hosts” file. Comment out the “localhost” entries by adding a preceding “#”, and add the IP addresses and respective hostnames of your Hadoop nodes. See the example below.

    sudo nano /etc/hosts
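    After your edits, the file should look roughly like the sketch below, using the placeholder names from this guide (substitute your own IP addresses and hostnames):

    # 127.0.0.1       localhost
    # 127.0.1.1       <original-hostname>
    hadoop-master-server-ip    namenode
    hadoop-worker-01-server-ip datanode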

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following commands:

    sudo apt-get -y install openjdk-8-jdk
    sudo apt install openssh-server openssh-client -y
    mkdir my-hadoop-install && cd my-hadoop-install
    wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
    tar xvzf hadoop-3.3.0.tar.gz

    Hadoop Environment Configuration (on each node)

    Open a terminal on each node and run the following command:

    nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hadoop-env.sh

    Add the following lines anywhere in the file, making sure they aren’t commented out (no preceding “#”):

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HDFS_NAMENODE_USER="hadoop"
    export HDFS_DATANODE_USER="hadoop"
    export HDFS_SECONDARYNAMENODE_USER="hadoop"
    export YARN_RESOURCEMANAGER_USER="hadoop"
    export YARN_NODEMANAGER_USER="hadoop"
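
    The JAVA_HOME path above is the default install location for the amd64 OpenJDK 8 package on Ubuntu; if you want to confirm it on your system before saving, list the installed JVMs:

    ls /usr/lib/jvm/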

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following commands:

    source ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hadoop-env.sh
    sudo mkdir -p /usr/local/hadoop/hdfs/data
    sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs/data
    nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/core-site.xml

    Between “<configuration>” and “</configuration>”, add the following, replacing “hadoop-master-server-ip” with your master/namenode server’s IP address. Unlike the DigitalOcean guide, do NOT use each server’s own IP here; every node should point at the master/namenode.

    <property>
            <name>fs.defaultFS</name>
            <value>hdfs://hadoop-master-server-ip:9000</value>
    </property>
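
    For reference, the finished core-site.xml should end up looking roughly like this (again with the placeholder IP swapped for your own). Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.

    <configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop-master-server-ip:9000</value>
        </property>
    </configuration>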

    Set Up Passwordless SSH

    On the master/namenode server, run the following command, and then press “Enter” three times.

    ssh-keygen

    Run the following command and copy the entire output onto your clipboard.

    cat ~/.ssh/id_rsa.pub

    Run the following command on both the master/namenode server and any worker nodes, and paste in the output from the previous command.

    nano ~/.ssh/authorized_keys
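
    Alternatively, if password authentication is still enabled on the worker node(s), ssh-copy-id can append the key for you instead of pasting it by hand (placeholder IP as elsewhere in this guide):

    ssh-copy-id hadoop@hadoop-worker-01-server-ip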

    On the master/namenode server, run the following command:

    nano ~/.ssh/config

    Edit the file using the following format, replacing “hadoop-master-server-ip” and “hadoop-worker-01-server-ip” with your respective master/namenode and worker node IPs.

    Host hadoop-master-server-ip
        HostName hadoop-master-server-ip
        User hadoop
        IdentityFile ~/.ssh/id_rsa
    
    Host hadoop-worker-01-server-ip
        HostName hadoop-worker-01-server-ip
        User hadoop
        IdentityFile ~/.ssh/id_rsa

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.

    SSH into your worker node(s) from your master/namenode server, replacing “hadoop-worker-01-server-ip” with your respective IP(s).

    ssh hadoop@hadoop-worker-01-server-ip

    Reply to the prompt with “yes”, and then log out by typing the following (admittedly self-explanatory) command:

    logout

    Configure the Master Node

    On the master node, run the following command:

    nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hdfs-site.xml

    Between “<configuration>” and “</configuration>”, add the following:

    <property>
            <name>dfs.replication</name>
            <value>3</value>
    </property>
    <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:

    nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/mapred-site.xml

    Between “<configuration>” and “</configuration>”, add the following, replacing “hadoop-master-server-ip” with your master/namenode server’s IP address.

    <property>
            <name>mapreduce.jobtracker.address</name>
            <value>hadoop-master-server-ip:54311</value>
    </property>
    <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
    </property>

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:

    nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/yarn-site.xml

    Between “<configuration>” and “</configuration>”, add the following, replacing “hadoop-master-server-ip” with your master/namenode server’s IP address.

    <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
    </property>
    <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop-master-server-ip</value>
    </property>

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:

    nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/masters

    Type in your master/namenode server’s IP address.

    hadoop-master-server-ip

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:

    nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/workers

    Type in the IP address of each worker node, one per line, below the “localhost” entry.

    localhost
    hadoop-worker-01-server-ip
    hadoop-worker-02-server-ip
    hadoop-worker-03-server-ip

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.

    Configure the Worker Node(s)

    On the worker node(s), run the following command:

    nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hdfs-site.xml

    Between “<configuration>” and “</configuration>”, add the following.

    <property>
            <name>dfs.replication</name>
            <value>3</value>
    </property>
    <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>

    Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.

    Starting up Hadoop

    On the master/namenode server, run the following commands:

    cd ~/my-hadoop-install/hadoop-3.3.0/
    sudo ./bin/hdfs namenode -format
    sudo ./sbin/start-dfs.sh
    ./sbin/start-yarn.sh
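
    When you later need to shut the cluster down, the matching stop scripts live in the same sbin directory (mirroring the sudo usage above):

    ./sbin/stop-yarn.sh
    sudo ./sbin/stop-dfs.sh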

    Verify Functionality

    On each of your nodes, run the following command to ensure Hadoop processes are running:

    jps
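
    On the master/namenode, the output should include entries similar to the sketch below (process IDs will differ, and because “localhost” is listed in the workers file the master also runs a DataNode and NodeManager); worker nodes should show DataNode and NodeManager:

    12345 NameNode
    12446 DataNode
    12557 SecondaryNameNode
    12702 ResourceManager
    12815 NodeManager
    13004 Jps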

    On the master/namenode, visit the following URL in a web browser, replacing “hadoop-master-server-ip” with your own master/namenode server’s IP address:

    http://hadoop-master-server-ip:9870

    Click on “Datanodes” on the menu bar, and ensure that all of your worker nodes’ IP addresses are showing up on the web GUI.
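
    As an optional final check, you can create and list a directory in HDFS from the master/namenode (run from the hadoop-3.3.0 directory; the /user/hadoop path here is just an example):

    ./bin/hdfs dfs -mkdir -p /user/hadoop
    ./bin/hdfs dfs -ls /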