Goal
Our goal is to install Hadoop on a cluster; at the end, we will verify that the installation went well by launching the HDFS and YARN services.
Introduction
Here we present the actions to perform on each node of the cluster to install Hadoop successfully. We also provide a script that automates these actions, so that the user only has to edit a few configuration files and (happily) launch the script on each computer in the cluster. The script is open source and available on this GitHub repository. To understand what the script does and how to fill in the requested configuration files, we suggest that readers wishing to use it first take the time to read the installation procedure below before returning to the script. The repository also contains a readme.md file explaining how to use it.
Installation procedure
On each node in the cluster, you have to:
- Create a hadoop user; we will use the same credentials on all nodes.
- Add the hostname-to-IP-address mappings of all the other nodes to the /etc/hosts file.
- Install an SSH client and server, and allow passwordless connection to the host from any other node.
- Install Java, ideally version 8 or 7
- Download Hadoop from the official website and extract it into an appropriate directory (e.g. /usr/local/); later on we will refer to this directory as HADOOP_PARENT_DIR. The full path of the Hadoop home directory will instead be referred to as HADOOP_HOME.
- Set the hadoop user as the owner of the hadoop home directory.
- Configure Hadoop. This just means editing a few configuration files, which you will find in the $HADOOP_HOME/etc/hadoop/ directory. The files to be edited are core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, workers, and hadoop-env.sh:
- core-site.xml Copy and paste the following
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
This file contains a property specifying that the HDFS namenode process is hosted on the node whose hostname is master, and listens on port 9000.
- hdfs-site.xml Copy and paste the following
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- namenode storage dir property -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/hdfs/namenode</value>
</property>
<!-- datanode storage dir property -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/hdfs/datanode</value>
</property>
</configuration>
In this file we set the HDFS replication factor (here 3) and specify where each node stores its HDFS data: the /home/hadoop/data/hdfs/namenode folder for the namenode, and the /home/hadoop/data/hdfs/datanode folder for the datanodes.
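These storage directories must exist and be writable by the hadoop user before HDFS is started. As a sketch, a small helper (hypothetical, not part of the original procedure; the base path matches the values in hdfs-site.xml above) could create them on each node:

```shell
# create_hdfs_dirs: create the namenode and datanode storage directories
# under the given base directory (values match hdfs-site.xml above).
create_hdfs_dirs() {
  base="$1"
  mkdir -p "$base/namenode" "$base/datanode"
}

# On each node, as the hadoop user:
# create_hdfs_dirs /home/hadoop/data/hdfs
```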
- mapred-site.xml Copy and paste the following
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
</property>
</configuration>
This file specifies the amount of RAM to allocate to MapReduce applications on each node. We keep the same configuration on every node; the exact values to use depend on the specs of the machine and on its workload besides Hadoop. The configuration above was adopted on an existing cluster whose datanodes have 4 GB of RAM.
- yarn-site.xml Copy and paste the following
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
For more information on how to choose the memory values when editing the mapred-site.xml and yarn-site.xml files, I suggest reading this post.
- workers Probably the simplest configuration file: we just list the hostnames of the nodes hosting datanodes in our cluster, one hostname per line.
- hadoop-env.sh We only have to set the JAVA_HOME environment variable at the end of the file.
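For example, assuming a JDK installed under /usr/lib/jvm/java-8-openjdk-amd64 (this path is an assumption; check yours, e.g. with readlink -f "$(which java)"), the appended line could look like:

```shell
# Appended at the end of $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# (the JDK path below is an assumption; adjust it to your installation)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```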
8. The last step is to update the hadoop user's bash profile, that is, to edit the /home/hadoop/.bashrc file. We have to:
- Set and export the JAVA_HOME environment variable
- Set the PDSH_RCMD_TYPE environment variable to ssh
- Set the HADOOP_HOME environment variable
- Add $HADOOP_HOME/bin and $HADOOP_HOME/sbin to the PATH environment variable, then export it.
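Putting the four items above together, the appended section of /home/hadoop/.bashrc could look like the following sketch (the Java and Hadoop paths are assumptions; adjust them to your setup):

```shell
# Appended to /home/hadoop/.bashrc (the paths below are assumptions)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # your JDK path
export PDSH_RCMD_TYPE=ssh                            # make pdsh use ssh
export HADOOP_HOME=/usr/local/hadoop                 # your Hadoop home
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```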
9. Finished! To check that everything went well, you can launch HDFS and YARN by typing the following commands on your master node. Note that hdfs namenode -format should only be run once, when setting up the cluster for the first time: it erases any existing HDFS data.
hdfs namenode -format
start-dfs.sh
start-yarn.sh
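If the commands succeed, the Hadoop daemons should be visible in the output of jps. The hypothetical helper below (not part of the original procedure; the daemon list assumes a master that does not also run a datanode) sketches one way to check that the expected master-side daemons are present:

```shell
# check_daemons: return 0 if every expected master daemon name appears
# in the given jps output, 1 otherwise (hypothetical helper).
check_daemons() {
  jps_output="$1"
  for d in NameNode SecondaryNameNode ResourceManager; do
    if ! echo "$jps_output" | grep -q "$d"; then
      echo "missing: $d"
      return 1
    fi
  done
  echo "all master daemons running"
}

# Typical usage on the master node:
# check_daemons "$(jps)"
```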