Doing the Install: Cloudera CDH5.12 on CentOS7.3

by Alex McLintock and Alan Duval of Alephant.co.uk, Sept 2017

 

This documentation has been put together to detail the process that one might undertake to install CDH5.12 on CentOS7.3. As such, we follow a particular path through Cloudera's latest install documentation, noting throughout which documents we are referencing, and which decisions we have taken.

We assume that you have already read the introduction - Installing Cloudera Hadoop on CentOS/RedHat - and Preparing to install Cloudera CDH, or have done this yourself from the Cloudera documentation.

Once you have started to install CDH, you need some way of testing that it is working, so please also read Smoke-Testing Cloudera CDH install.

 

URL: http://www.alephant.co.uk/Installing_CDH_on_Hadoop-1-Intro

URL: http://www.alephant.co.uk/Installing_CDH_on_Hadoop-2-Prep

URL: http://www.alephant.co.uk/Installing_CDH_on_Hadoop-4-Testing 

 

Overview

Note that you should ideally be looking at the latest version of the Cloudera docs. 

Note - If you encounter any issues not mentioned in this document, bear in mind that the search box on Cloudera's site tends to default to the version that you have most recently selected. If you somehow end up on a URL containing enterprise/5-8-x/topics or similar (this is highly likely if you use Google rather than Cloudera's search box), you can replace the version number in the URL (e.g. 5-8-x or 5-4-x) with the word latest, and the URL will still work. We have yet to find a page for which this is not true. You cannot, however, replace the version number with 5-12-x. We think this is unhelpful: once the next version comes out, any URLs pointing to the documentation will potentially be incorrect. It would be better to have URLs for both 5-12-x AND latest.
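The version swap described above is mechanical enough to script. The following is a minimal sketch, assuming the versioned part of the path always looks like enterprise/N-N-x/ as in the examples above. The function name to_latest_url is our own invention, not something Cloudera provides.

```shell
# Rewrite a versioned Cloudera docs URL (e.g. .../enterprise/5-8-x/...)
# so that it points at .../enterprise/latest/... instead.
# to_latest_url is a hypothetical helper name.
to_latest_url() {
  printf '%s\n' "$1" |
    sed -E 's#/enterprise/[0-9]+-[0-9]+-x/#/enterprise/latest/#'
}

to_latest_url 'https://www.cloudera.com/documentation/enterprise/5-8-x/topics/installation_reqts.html'
```

A URL that already contains latest passes through unchanged, so it is safe to run on any link you have been given.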

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/installation_reqts.html

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/installation_installation.html

 

Chosen Installation path

We are conducting Installation Path B - Installation Using Cloudera Manager Parcels or Packages.

This requires us to set up

  1. the machines and networking first,
  2. OPTIONALLY a local repository of the externally stored software,
  3. pre-requisites for Cloudera Manager,
  4. Cloudera Manager itself
  5. everything else (in CDH) - using Cloudera Manager

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html

 

Follow Install Path B (we recommend)

We are following Cloudera's Install Path B documentation.

In this case we are installing all the required software via Cloudera Manager.

In essence, we install Cloudera Manager Server on one box (assuming we are not opting for a High Availability setup). We then install the Cloudera Manager Agents on all boxes in the cluster (plus any edge nodes which talk to the cluster). We are NOT installing via tarballs as that was deprecated a few versions ago, though it remains in the documentation.

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html

 

Single User Mode?

!- Consider: are we installing Cloudera as a single user (e.g. 'cdh') and NOT root? 

If so then we need to go about Configuring Single User Mode.

Note - only do this if you have to for security reasons. For us installing and running as the normal user is adequate.

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/install_singleuser_reqts.html#xd_583c10bfdbd326ba--69adf108-1492ec0ce48--7ade

 

Install and Configure External Databases (probably necessary)

Oozie, Hive Metastore, and Cloudera Manager all need databases to store their info.

The Cloudera documentation is not, we think, 100% clear on this... nor is it foolproof. In some places it reads as though you need to have Cloudera Manager installed before setting up the external databases, whereas other pages say you need the databases first. This may be because, if you are setting up a demonstration or proof-of-concept system, you CAN use the PostgreSQL database that is embedded in Cloudera Manager.

PostgreSQL is perfectly adequate for many production needs, so I (Alex) was originally expecting this to be fine. However, other Cloudera documentation says that the embedded PostgreSQL database is not adequate for production. In fact, if you use Cloudera Manager without configuring an external database, you may find that it does NOT connect to the internal embedded one.

 

Note - For some reason Cloudera recommend that the databases run on the same boxes as the services which need the databases. This presumably cuts down on network traffic and reduces latency, but does not sound like an easy system to maintain.

 

External Database

In Preparing a Cloudera Manager Server External Database the available choices are MariaDB, MySQL, Oracle, and (an external) Postgres database.

!- Our solution is to create one (or more) MySQL databases to act as data stores for lots of tools. This is one of those cases where we had to follow the docs closely; the relevant page is linked below.

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_installing_configuring_dbs.html#concept_i2r_m3m_hn

 

MySQL

Our documentation broadly follows Cloudera's Configuring and Starting the MySQL Server

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_mysql.html#cmig_topic_5_5_2

 

As user root on your Cloudera Manager machine 
 

# make sure we can fetch files with wget
yum install wget
# fetch meta information about the mysql community code
wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
# install the file you just fetched
rpm -ivh mysql-community-release-el7-5.noarch.rpm
yum update
# Install the database server
yum install mysql-server
systemctl start mysqld

 

Reconfigure MySQL for Cloudera's purposes

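Stop MySQL, move the old InnoDB log files to a backup location, back up the existing configuration, and write in Cloudera's preferred configuration.

systemctl stop mysqld

# /backup_location is a placeholder: use any existing directory outside /var/lib/mysql
mv /var/lib/mysql/ib_logfile* /backup_location

# keep a copy of the original configuration before editing it
cp /etc/my.cnf /etc/my.cnf.orig
vi /etc/my.cnf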

 

Here is a version of my.cnf supplied by Cloudera:

[mysqld]
transaction-isolation = READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
# symbolic-links = 0
key_buffer_size = 32M
max_allowed_packet = 32M
thread_stack = 256K
thread_cache_size = 64
query_cache_limit = 8M
query_cache_size = 64M
query_cache_type = 1
max_connections = 550
#expire_logs_days = 10
#max_binlog_size = 100M
#log_bin should be on a disk with enough free space. Replace '/var/lib/mysql/mysql_binary_log' with an appropriate path for your system
#and chown the specified folder to the mysql user.
log_bin=/var/lib/mysql/mysql_binary_log
# For MySQL version 5.1.8 or later. For older versions, reference MySQL documentation for configuration help.
binlog_format = mixed
read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M
# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 4G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
sql_mode=STRICT_ALL_TABLES

 

Note - Make sure MySQL starts when the machine boots, and then follow instructions below to secure the installation (set root password, remove test databases and users, but DO allow root access from other hosts)

 

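# make sure MySQL starts when the machine boots
systemctl enable mysqld
systemctl start mysqld
# or alternatively
service mysqld start
# now tidy up the initial install
sudo /usr/bin/mysql_secure_installation
# but do not disallow root login remotely (answer n to that question)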
 

Note - From now on you will need to remember the root password for the MySQL install - which is of course not necessarily the same as the root password for your box.

 

Creating databases, users and passwords in MySQL

The following pairs of commands create the databases, users and passwords that CDH requires for these services: Activity Monitor (amon), Reports Manager (rman), Hive Metastore Server (metastore), Sentry Server (sentry), Cloudera Navigator Audit Server (nav), and Cloudera Navigator Metadata Server (navms). Run them from the mysql command-line client, logged in as the MySQL root user (mysql -u root -p).

Note - You can copy and paste the whole lot, and hit enter.

 

create database amon DEFAULT CHARACTER SET utf8;
grant all on amon.* TO 'amon'@'%' IDENTIFIED BY 'amon_password';
create database rman DEFAULT CHARACTER SET utf8;
grant all on rman.* TO 'rman'@'%' IDENTIFIED BY 'rman_password';
create database metastore DEFAULT CHARACTER SET utf8;
grant all on metastore.* TO 'hive'@'%' IDENTIFIED BY 'hive_password';
create database sentry DEFAULT CHARACTER SET utf8;
grant all on sentry.* TO 'sentry'@'%' IDENTIFIED BY 'sentry_password';
create database nav DEFAULT CHARACTER SET utf8;
grant all on nav.* TO 'nav'@'%' IDENTIFIED BY 'nav_password';
create database navms DEFAULT CHARACTER SET utf8;
grant all on navms.* TO 'navms'@'%' IDENTIFIED BY 'navms_password';

 

Note - Cloudera's documentation only lists the database requirements for the above services; however, the same process is required for Oozie and Hue. We chose to follow exactly the same process, right down to <servicename>_password, and at the same time, as there doesn't appear to be any particular reason to do this later. You may wish to generate more secure passwords for each of the services now, to save having to change them later. We were only installing a test cluster, so we weren't too concerned about this.

 

create database oozie DEFAULT CHARACTER SET utf8;
grant all on oozie.* TO 'oozie'@'%' IDENTIFIED BY 'oozie_password'; 
create database hue DEFAULT CHARACTER SET utf8; 
grant all on hue.* TO 'hue'@'%' IDENTIFIED BY 'hue_password';
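If you would rather not use <servicename>_password, the statements can be generated with random passwords instead. This is only a sketch under our own assumptions: gen_password and emit_db_sql are hypothetical helper names, the passwords are drawn from /dev/urandom, and the SQL has the same create/grant shape as the statements above.

```shell
# Generate a random alphanumeric password of the given length (default 24).
gen_password() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "${1:-24}"
}

# Print the create/grant pair for one service.
# Arguments: database name, user name, password.
emit_db_sql() {
  printf "create database %s DEFAULT CHARACTER SET utf8;\n" "$1"
  printf "grant all on %s.* TO '%s'@'%%' IDENTIFIED BY '%s';\n" "$1" "$2" "$3"
}

# Example: print the statements for oozie and hue with fresh passwords.
for svc in oozie hue; do
  emit_db_sql "$svc" "$svc" "$(gen_password)"
done
```

Save the output somewhere safe before pasting it into the mysql client, since the generated passwords are not stored anywhere else and the services will need them later.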

 

Now Install Cloudera Manager Server!

Everything to this point has been about setting up the OS, software, and settings that Cloudera Manager relies on. Now we can Install Cloudera Manager Server Software! ...after we've installed Java.

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html#concept_qyv_bt1_v5

 

Install Java on the Cloudera Manager host

For us we picked our 01 machine as the Cloudera Manager host.

!- The Cloudera documentation suggests that you have two choices here - either install Oracle Java via Cloudera's own repository, or install it yourself.

?- The big problem here is that we can no longer find Oracle Java on Cloudera's repository even though the documentation says to install it from there! Can any of our readers point to where this repository now is?  

# This will fail!!!
yum install oracle-j2sdk1.7

We mostly followed Cloudera's page on Installing the Oracle JDK to manually install Oracle Java JDK 8 on every machine in the cluster.  You might choose a different Java, but that is your problem.

!- In theory Cloudera Manager could install Java itself - but personally I (Alex) would not choose that option.

?- You can try it if you want - let us know how you get on.

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_jdk_installation.html#topic_29_1

 

Manual Install

We found this third-party walk-through on How To Install Oracle Java On Fedora And CentOS useful.

Find the latest Java SE Development Kit 8 Downloads and download using a web-browser, and ensure the file is in a relevant directory (put it in /tmp for example).

 

URL: http://www.linuxandubuntu.com/home/how-to-install-oracle-java-78-on-fedora-and-centos

URL: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

 

You may need to do this on some other machine and transfer the file to all the other cluster machines when finished.

 

Once transferred to all the machines, use rpm to install it on each machine:

rpm -Uvh /path/jdk-version-linux-architecture.rpm
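With more than a couple of machines, the transfer and install steps get repetitive. The sketch below prints the commands for each host rather than running them, so you can review them first. It assumes root ssh access to each host and /tmp as the staging directory; jdk_rollout_commands is a hypothetical helper name, and the rpm path is the same placeholder as above.

```shell
# Print (do not run) the copy and install commands for each host.
# Arguments: path to the JDK rpm, then one or more host names.
jdk_rollout_commands() {
  jdk_rpm=$1
  shift
  for jdk_host in "$@"; do
    printf 'scp %s root@%s:/tmp/\n' "$jdk_rpm" "$jdk_host"
    printf 'ssh root@%s rpm -Uvh /tmp/%s\n' "$jdk_host" "$(basename "$jdk_rpm")"
  done
}

# Review the output first; pipe it to sh once you are happy with it.
jdk_rollout_commands /path/jdk-version-linux-architecture.rpm fleet02 fleet03
```

Printing first and executing second is a deliberate choice here: a typo in a host name or path shows up on screen before anything is copied.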

 

Now you can

...Install Cloudera Manager Server Software!

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html#concept_qyv_bt1_v5

 

Install Cloudera Manager from the Cloudera Manager repository

This needs to be done on the Cloudera Manager machine and any other controller machines, e.g. 01, 02, but not the worker nodes (e.g. fleet[03-05]).

curl https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/cloudera-manager.repo  > /etc/yum.repos.d/cloudera-manager.repo
yum update
yum install cloudera-manager-daemons cloudera-manager-server

 

Now you can set up the Cloudera Manager database and fill it with some tables. We did not create this database by hand earlier: when the script below is given the -u and -p options, it uses those admin credentials to create the database and user for you.

Note - if you see a "file not found" when running the commands above, you may have copied and pasted some funny character (e.g. a typographic dash instead of a plain hyphen, which look much the same), depending upon where you copied it from (we had this problem when copying from Cloudera's page). Check the URL in a web browser and maybe copy and paste it from there.

Run the scm_prepare_database.sh script to create the important tables in the database Cloudera Manager is going to use. The general syntax is:

/usr/share/cmf/schema/scm_prepare_database.sh database-type [options] database-name username password

 

Note - You can also run scm_prepare_database.sh without options to see the syntax. The command we actually ran was:

/usr/share/cmf/schema/scm_prepare_database.sh mysql -h localhost -u root -p  scmdb scmuser scmpass

 

Enter the MySQL root password when requested. (The '-p' says that you want to enter a password, but does not put it on the command line.) Here scmdb, scmuser and scmpass are the database name, user name and password that Cloudera Manager itself will use.

 

Get and install MySQL connector for Java

Java programs need a JDBC driver to talk to a specific database. The MySQL connector for Java is one of these. It is not supplied directly with Java or MySQL, but you can download it from the MySQL website (MySQL is now owned by Oracle).

On every box in the cluster:

mkdir -p /usr/share/java

You need to download the MySQL connector for Java (mysql-connector-java.jar) and put it there. You can download it from MySQL, but it comes inside a tar.gz or zip file.

(Actual commands for unpacking an archive are left to the reader.)

 

Note - There may be other ways of getting this file which work just as well. Be aware that the file in the archive contains the version number. We copied the versioned file something like this:

 

cd /usr/share/java
cp mysql-connector-java-5.1.43-bin.jar mysql-connector-java.jar
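For anyone who wants the unpack-and-copy step spelled out, here is one possible sketch. It assumes the archive was downloaded to /tmp and unpacks into a directory named after itself (check with tar tzf first); install_connector_jar is a hypothetical helper name of ours. It copies whichever versioned jar it finds to the unversioned name, so the version number does not need to be typed twice.

```shell
# Copy the versioned connector jar found in a directory to the
# unversioned name mysql-connector-java.jar in a destination directory.
# Arguments: directory holding the unpacked jar, destination directory.
install_connector_jar() {
  for cj_jar in "$1"/mysql-connector-java-*-bin.jar; do
    [ -f "$cj_jar" ] || continue
    mkdir -p "$2" && cp "$cj_jar" "$2/mysql-connector-java.jar"
    return
  done
  echo "no mysql-connector-java-*-bin.jar found in $1" >&2
  return 1
}

# Typical use (archive name and location are assumptions):
#   tar xzf /tmp/mysql-connector-java-5.1.43.tar.gz -C /tmp
#   install_connector_jar /tmp/mysql-connector-java-5.1.43 /usr/share/java
```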

 

Cloudera will copy this file several times to where it is needed.

 

Install Cloudera Manager agents (optional here)

If you like installing things manually, you can follow Cloudera's section "(Optional) Manually Install the Oracle JDK, Cloudera Manager Agent, and CDH and Managed Service Packages", and do something similar on all hosts in the cluster.

 

Note - "You can install the Cloudera Manager agent manually on all hosts, or Cloudera Manager can install the Agents in a later step. To use Cloudera Manager to install the agents, skip this section."

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html#cmig_topic_6_6_3

 

# Optional - you can get Cloudera Manager to do this later if you like...
curl https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/cloudera-manager.repo  > /etc/yum.repos.d/cloudera-manager.repo
yum update 
yum install cloudera-manager-agent cloudera-manager-daemons


Note - We are doing this on the Cloudera Manager machine as well if it is running any CDH services at all. There should not be a problem with using yum to install something more than once. Obviously, we don't need to download the same repo file multiple times on the same machine so you can skip that if you know what you are doing. 

Note - if you get a "file not found" when doing this, then you have probably copied and pasted some funny character. Check the URL in a web browser and maybe copy and paste it from there.

If you have installed them yourself you now need to tell each of the agents where to find the Cloudera Manager Server which controls them. 

 

vi  /etc/cloudera-scm-agent/config.ini
# change 
server_host=localhost
# to 
server_host=fleet01

 

Alternatively, you can do this in one line, as below, replacing fleet01 with whatever machine you are installing Cloudera Manager onto.

sed -i.bak '/server_host=localhost/c\server_host=fleet01' /etc/cloudera-scm-agent/config.ini

We will assume that the port is still the default port and does not need changing.
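The one-liner above only matches a line that still reads server_host=localhost. As a slightly more forgiving sketch (set_server_host is a hypothetical helper name of ours), the function below replaces whatever value is currently set, keeps a .bak copy, and prints the resulting line so you can confirm the change.

```shell
# Point a Cloudera Manager agent config file at the given server host,
# whatever the current value is, keeping a .bak copy of the original.
# Arguments: path to config.ini, Cloudera Manager Server host name.
set_server_host() {
  sed -i.bak "s/^server_host=.*/server_host=$2/" "$1" &&
    grep '^server_host=' "$1"
}

# On a real agent host this would be:
#   set_server_host /etc/cloudera-scm-agent/config.ini fleet01
```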

 

And, finally, on all boxes where you just installed Cloudera Manager agents, you need to start those agents with this command 

service cloudera-scm-agent start

 

Note - Troubleshooting Installation and Upgrade Problems may be useful for any issues you encounter here.

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_troubleshooting.html

 

Start Cloudera Manager Server

"When you Start the Cloudera Manager Server and Agents, Cloudera Manager assumes you are not already running HDFS and MapReduce." If you are, then stop them.

Frankly, I (Alex) don't understand why this is a problem for Cloudera Manager. Ambari does not have this problem.

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html#id_znh_q4m_25

 

Note - If required, there is useful documentation on Stopping CDH Services Using the Command Line and Configuring init to Start Hadoop System Services.

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_services_stop.html#topic_27_3

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_init_configure.html#topic_27_2

 

Run this command on the Cloudera Manager Server host (fleet01 for us):

service cloudera-scm-server start

 

And wait several minutes (tea/coffee optional)...
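If you would rather have the shell do the waiting, a small retry loop can poll until the admin console answers. This is a sketch, not something Cloudera provides: wait_until is a hypothetical helper name, and the example assumes curl is installed and that the server listens on the default port, 7180.

```shell
# Retry a command until it succeeds.
# Arguments: maximum attempts, seconds between attempts, then the command.
wait_until() {
  wu_tries=$1
  wu_delay=$2
  shift 2
  while [ "$wu_tries" -gt 0 ]; do
    "$@" && return 0
    wu_tries=$((wu_tries - 1))
    [ "$wu_tries" -gt 0 ] && sleep "$wu_delay"
  done
  return 1
}

# Example (fleet01 is our Cloudera Manager host; 7180 is the default port):
#   wait_until 60 10 curl -sf -o /dev/null http://fleet01:7180/
```

The example polls every 10 seconds for up to 10 minutes and returns non-zero if the console never responds, which makes it usable in a larger install script.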

 

Start and Log into the Cloudera Manager Admin Console

With Cloudera Manager Server started, you can now Start and Log into the Cloudera Manager Admin Console.

Note - The Cloudera Manager Server URL takes the form http://Server host:port, where Server host is the fully qualified domain name (FQDN) or IP address of the host where the Cloudera Manager Server is installed, and port is the port configured for the Cloudera Manager Server. The default port is 7180.

In a web browser, enter http://Server host:7180 and log in with the default credentials:

Username: admin

Password: admin

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html#id_znh_q4m_25

 

Cloudera License Options

Once you have accessed the Admin Console, you need to accept the Cloudera Manager End User License Terms and Conditions, and choose the license you will be operating under.

This is a bit frustrating to Alex because he thought we had done away with this sort of thing (licensing vs. open source).

 

Name of License | Type | Note
Cloudera Express | Free/No license required | Limited in features. Can be upgraded to full license later on
Cloudera Enterprise Enterprise Data Hub Edition Trial | 60 days trial only. NO renewal | Only useful for temporary clusters, but can be upgraded to full license later on
Cloudera Enterprise - Basic Edition | basic license - typically annual subscription | The simplest license which gets you Cloudera Support
Cloudera Enterprise - Flex Edition | Possibly replaced by three other licenses now | One of three options which gives you more support in specific fields
Cloudera Enterprise Enterprise Data Hub Edition | complete license - typically annual subscription | Support for everything available in CDH.


See Managing Licenses for more info.

!- For us the best version is "Cloudera Enterprise Enterprise Data Hub Edition Trial," but you will select the appropriate license for your requirements.

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ag_licenses.html#cmug_topic_13_7 

 

Tell Cloudera Manager which hosts are in the cluster

Choose Cloudera Manager Hosts by telling Cloudera Manager which hosts it looks after in the cluster - i.e. all the ones you installed an agent on.

For us it was fleet[01-05].mycompany.com
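The bracketed range is shorthand for five host names. As a small sketch (expand_hosts is a hypothetical helper of ours, not part of Cloudera Manager), the function below prints the full list, which can be handy for feeding the same hosts to ssh loops or for checking DNS before you start.

```shell
# Expand a numbered host range into one host name per line.
# Arguments: prefix, first number, last number, suffix.
expand_hosts() {
  eh_i=$2
  while [ "$eh_i" -le "$3" ]; do
    printf '%s%02d%s\n' "$1" "$eh_i" "$4"
    eh_i=$((eh_i + 1))
  done
}

# Our cluster: fleet01.mycompany.com through fleet05.mycompany.com
expand_hosts fleet 1 5 .mycompany.com
```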

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html#id_rtd_dtm_25

 

Next

To some extent Cloudera Manager can guess which machines each individual service (or role) should be run on. You can read about Cluster Hosts and Role Assignments and the five suggested layouts for clusters, depending on how big your hardware collection is... 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_host_allocations.html

 

Once you have started to install CDH, you need some way of testing that it is working, so please read Smoke-Testing Cloudera CDH install.

URL: http://www.alephant.co.uk/Installing_CDH_on_Hadoop-4-Testing

 

?- If you would like us to include an article in this series on Host Allocations, let us know in the comments.
