by Alex McLintock and Alan Duval of Alephant.co.uk, Sept 2017
This documentation has been put together to detail the process that one might undertake to install CDH5.12 on CentOS7.3. As such, we follow a particular path through Cloudera's latest install documentation, noting throughout which documents we are referencing, and which decisions we have taken.
We assume that you have already read the introduction - Installing Cloudera Hadoop on CentOS/RedHat - and Preparing to install Cloudera CDH, or have done this yourself from the Cloudera documentation.
Once you have started to install CDH you need some way of testing that it is working, so please also read Smoke-Testing Cloudera CDH install.
Note that you should ideally be looking at the latest version of the Cloudera docs.
- Configuration Requirements for Cloudera Manager, Cloudera Navigator, and CDH 5
- Installing Cloudera Manager and CDH (This talks about the installation paths A, B and C.)
Note - If you encounter any issues not mentioned in this document, the search box on Cloudera tends to default to the version that you have most recently selected. If you somehow end up on a URL with enterprise/5-8-x/topics or similar (this is highly likely if you use Google rather than Cloudera's search box), you can replace the version number in the URL (e.g. 5-8-x or 5-4-x) with the word latest, and the URL will be correct. We have yet to find a page that this is not true for. You cannot, however, replace these with 5-12-x. We think this is unhelpful, as once the next version comes out, any URLs pointing to the documentation will potentially be incorrect (seems like it would be better to have URLs for both 5-12-x AND latest).
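The URL substitution described in the note above can be sketched as a small shell function. This is only an illustration: the function name to_latest_url and the example URL are our own assumptions, and only the enterprise/5-8-x/topics fragment comes from the note.

```shell
# Rewrite a version-pinned Cloudera docs URL so it points at "latest".
# to_latest_url is a hypothetical helper name, not a Cloudera tool.
to_latest_url() {
  # Replace path segments such as /enterprise/5-8-x/ or /enterprise/5-4-x/
  printf '%s\n' "$1" | sed -E 's#/enterprise/[0-9]+-[0-9]+-x/#/enterprise/latest/#'
}

# Illustrative URL only; substitute the page you actually landed on.
to_latest_url "https://www.cloudera.com/documentation/enterprise/5-8-x/topics/installation.html"
```

URLs that already use latest pass through unchanged, so it should be safe to run on any docs link you have bookmarked.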
Chosen Installation path
This requires us to set up
- the machines and networking first,
- OPTIONALLY set up a local repository of the externally stored software,
- pre-requisites for Cloudera Manager,
- Cloudera Manager itself
- everything else (in CDH) - using Cloudera Manager
Follow Install Path B (we recommend)
We are following Cloudera's Install Path B documentation.
In this case we are installing all the required software via Cloudera Manager.
In essence, we install Cloudera Manager Server on one box (assuming we are not opting for a High Availability setup). We then install the Cloudera Manager Agents on all boxes in the cluster (plus any edge nodes which talk to the cluster). We are NOT installing via tarballs as that was deprecated a few versions ago, though it remains in the documentation.
Single User Mode?
!- Consider: are we installing Cloudera as a single user (e.g. 'cdh') and NOT root?
If so then we need to go about Configuring Single User Mode.
Note - only do this if you have to for security reasons. For us installing and running as the normal user is adequate.
Install and Configure External Databases (probably necessary)
Oozie, Hive Metastore, and Cloudera Manager all need databases to store their info.
The Cloudera documentation is not, we think, 100% clear on this... nor is it idiot proof. In some places it reads like you need to have Cloudera Manager installed before setting up the external databases, whereas other pages say you need the databases first. This may be because, if you are setting up a demonstration or proof-of-concept system you CAN use the Postgres database that is embedded in Cloudera Manager.
PostgreSQL is perfectly adequate for many production needs so I (Alex) was originally expecting this to be fine. However, other Cloudera documentation says that the PostgreSQL database is not adequate for production. In fact if you use Cloudera Manager without configuring an external database you may find that it does NOT connect to the internal embedded one.
Note - For some reason Cloudera recommend that the databases run on the same boxes as the services which need the databases. This presumably cuts down on network traffic and reduces latency, but does not sound like an easy system to maintain.
In Preparing a Cloudera Manager Server External Database the available choices are MariaDB, MySQL, Oracle, and (an external) Postgres database.
!- Our solution is to create one (or more) MySQL databases to act as data stores for lots of tools. This is one of those cases where we had to follow the docs, conveniently titled .
Our documentation broadly follows Cloudera's Configuring and Starting the MySQL Server
As user root on your Cloudera Manager machine
# make sure we can fetch files with wget
yum install wget
# fetch meta information about the mysql community code
wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
# install the file you just fetched
rpm -ivh mysql-community-release-el7-5.noarch.rpm
yum update
# Install the database server
yum install mysql-server
systemctl start mysqld
Reconfigure MySQL for Cloudera's purposes
Stop MySQL, move the old InnoDB log files to a backup location, and write in Cloudera's preferred configuration.
systemctl stop mysqld
mv /var/lib/mysql/ib_logfile* /backup_location
vi /etc/my.cnf
Here is a version of my.cnf supplied by Cloudera....
[mysqld]
transaction-isolation = READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
# symbolic-links = 0
key_buffer_size = 32M
max_allowed_packet = 32M
thread_stack = 256K
thread_cache_size = 64
query_cache_limit = 8M
query_cache_size = 64M
query_cache_type = 1
max_connections = 550
#expire_logs_days = 10
#max_binlog_size = 100M
#log_bin should be on a disk with enough free space. Replace '/var/lib/mysql/mysql_binary_log' with an appropriate path for your system
#and chown the specified folder to the mysql user.
log_bin=/var/lib/mysql/mysql_binary_log
# For MySQL version 5.1.8 or later. For older versions, reference MySQL documentation for configuration help.
binlog_format = mixed
read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M
# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 4G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
sql_mode=STRICT_ALL_TABLES
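After hand-editing a long configuration like the one above it is easy to drop a line. As a rough sanity check, assuming the file lives at /etc/my.cnf, something like the following could confirm that a few of the settings are present. The function name check_mycnf and the choice of which keys to test are our own, not anything Cloudera supplies.

```shell
# Check that a my.cnf contains a handful of the settings from Cloudera's sample.
# check_mycnf is a hypothetical helper; the key list is just a sample, extend as needed.
check_mycnf() {
  cnf=${1:-/etc/my.cnf}
  status=0
  for key in transaction-isolation binlog_format innodb_flush_method innodb_log_file_size; do
    # Match "key = value" at the start of a line, ignoring commented-out lines
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$cnf"; then
      echo "missing: ${key}"
      status=1
    fi
  done
  if [ "$status" -eq 0 ]; then
    echo "all expected settings present"
  fi
  return "$status"
}

# Usage (on the database host):
#   check_mycnf /etc/my.cnf
```

This only checks that the keys exist, not that the values suit your hardware (innodb_buffer_pool_size in particular depends on how much memory the box has).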
Note - Make sure MySQL starts when the machine boots, and then follow instructions below to secure the installation (set root password, remove test databases and users, but DO allow root access from other hosts)
# make sure MySQL starts at boot
systemctl enable mysqld
systemctl start mysqld
# or alternatively
service mysqld start
# now tidy up the initial install
sudo /usr/bin/mysql_secure_installation
# but do not disallow root login remotely
Note - From now on you will need to remember the root password for the MySQL install - which is of course not necessarily the same as the root password for your box.
Creating databases, users and passwords in MySQL
The following pairs of commands create the databases, users and passwords that CDH requires: Activity Monitor (amon), Reports Manager (rman), Hive Metastore Server (metastore), Sentry Server (sentry), Cloudera Navigator Audit Server (nav), and Cloudera Navigator Metadata Server (navms).
Note - Log in to the mysql client as the MySQL root user first (mysql -u root -p). You can then copy and paste the whole lot, and hit enter.
create database amon DEFAULT CHARACTER SET utf8;
grant all on amon.* TO 'amon'@'%' IDENTIFIED BY 'amon_password';
create database rman DEFAULT CHARACTER SET utf8;
grant all on rman.* TO 'rman'@'%' IDENTIFIED BY 'rman_password';
create database metastore DEFAULT CHARACTER SET utf8;
grant all on metastore.* TO 'hive'@'%' IDENTIFIED BY 'hive_password';
create database sentry DEFAULT CHARACTER SET utf8;
grant all on sentry.* TO 'sentry'@'%' IDENTIFIED BY 'sentry_password';
create database nav DEFAULT CHARACTER SET utf8;
grant all on nav.* TO 'nav'@'%' IDENTIFIED BY 'nav_password';
create database navms DEFAULT CHARACTER SET utf8;
grant all on navms.* TO 'navms'@'%' IDENTIFIED BY 'navms_password';
Note - Cloudera documentation only lists the database requirements for the above services; however, the same process is required for oozie and hue. We chose to follow exactly the same process, right down to <servicename>_password, and at the same time, as there doesn't appear to be any particular reason to do this later. You may wish to generate more secure passwords for each of the services now to save having to change them later. We were only installing a test cluster, so we weren't too concerned about this.
create database oozie DEFAULT CHARACTER SET utf8;
grant all on oozie.* TO 'oozie'@'%' IDENTIFIED BY 'oozie_password';
create database hue DEFAULT CHARACTER SET utf8;
grant all on hue.* TO 'hue'@'%' IDENTIFIED BY 'hue_password';
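Since every service gets the same pair of statements, the SQL above could also be generated rather than pasted. The following is a minimal sketch under our own assumptions: gen_cdh_sql is a hypothetical helper name, and it reproduces the test-cluster <user>_password convention used above, which you would want to replace on anything other than a throwaway cluster.

```shell
# Print the create/grant statement pair for each database:user combination.
# gen_cdh_sql is a hypothetical helper; note that the metastore database belongs to user hive.
gen_cdh_sql() {
  for pair in amon:amon rman:rman metastore:hive sentry:sentry nav:nav navms:navms oozie:oozie hue:hue; do
    db=${pair%%:*}    # text before the colon
    user=${pair##*:}  # text after the colon
    printf "create database %s DEFAULT CHARACTER SET utf8;\n" "$db"
    printf "grant all on %s.* TO '%s'@'%%' IDENTIFIED BY '%s_password';\n" "$db" "$user" "$user"
  done
}

# Usage: review the output first, then feed it to the mysql client.
#   gen_cdh_sql
#   gen_cdh_sql | mysql -u root -p
```

Printing the statements before piping them into mysql gives you a chance to check them, which is the main reason for generating text rather than executing directly.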
Now Install Cloudera Manager Server!
Everything to this point has been about setting up the OS, software, and settings that Cloudera Manager relies on. Now we can Install Cloudera Manager Server Software! ...after we've installed Java.
Install Java on the Cloudera Manager host
For us we picked our 01 machine as the Cloudera Manager host.
!- The Cloudera documentation suggests that you have two choices here - either install Oracle Java via Cloudera's own repository, or install it yourself.
?- The big problem here is that we can no longer find Oracle Java on Cloudera's repository even though the documentation says to install it from there! Can any of our readers point to where this repository now is?
# This will fail!!!
yum install oracle-j2sdk1.7
We mostly followed Cloudera's page on Installing the Oracle JDK to manually install Oracle Java JDK 8 on every machine in the cluster. You might choose a different Java, but that is your problem.
!- In theory Cloudera Manager could install Java itself - but personally I (Alex) would not choose that option.
?- You can try it if you want - let us know how you get on.
We found this third-party walk-through on How To Install Oracle Java On Fedora And CentOS useful.
Find the latest Java SE Development Kit 8 Downloads and download using a web-browser, and ensure the file is in a relevant directory (put it in /tmp for example).
You may need to do this on some other machine and transfer the file to all the other cluster machines when finished.
Once transferred to all the machines, use rpm to install it on each machine:
rpm -Uvh /path/jdk-version-linux-architecture.rpm
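If you have more than a couple of machines, the copy-then-install steps above can be looped over the hosts. This is a sketch only: it assumes root SSH access between the machines and host names like ours (fleet02, fleet03, ...), and the helper names run and push_jdk are hypothetical. By default it only prints the commands it would run, so you can check them before setting DRY_RUN=0.

```shell
# Print a command (dry run, the default) or execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy the JDK RPM to /tmp on each host, then install it there with rpm.
push_jdk() {
  rpm_path=$1
  shift
  rpm_name=$(basename "$rpm_path")
  for host in "$@"; do
    run scp "$rpm_path" "root@${host}:/tmp/"
    run ssh "root@${host}" rpm -Uvh "/tmp/${rpm_name}"
  done
}

# Dry run: the RPM file name is a placeholder, use the file you actually downloaded.
push_jdk /tmp/jdk-version-linux-architecture.rpm fleet02 fleet03
```

Running with DRY_RUN=0 in the environment executes the same commands for real.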
Now you can move on to installing Cloudera Manager itself.
Install Cloudera Manager from the Cloudera Manager repository
This needs to be done on the Cloudera Manager machine and any other controller machines, e.g. 01, 02, but not the worker nodes (e.g. fleet[03-05]).
curl https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/cloudera-manager.repo > /etc/yum.repos.d/cloudera-manager.repo
yum update
yum install cloudera-manager-daemons cloudera-manager-server
Note - if you see a "file not found" error when fetching the repo file, you may have copied and pasted a look-alike character (e.g. a typographic dash instead of a plain hyphen, which look the same), depending upon where you copied the command from (we had this problem when copying from Cloudera's page). Check the URL in a web browser and maybe copy and paste it from there.
Now you can fill in the database that Cloudera Manager will use with some tables.
Run the scm_prepare_database.sh script to actually fill in important tables in the database Cloudera Manager is trying to use:
/usr/share/cmf/schema/scm_prepare_database.sh database-type [options] database-name username password
Note - You can also run scm_prepare_database.sh without options to see the syntax. We ran it like this:
/usr/share/cmf/schema/scm_prepare_database.sh mysql -h localhost -u root -p scmdb scmuser scmpass
Enter the MySQL root password when requested. (The '-p' says that you want to enter a password, but does not enter it on the command line.)
Get and install MySQL connector for Java
All Java programs need drivers to get them to talk to specific databases. The MySQL connector for Java is one of these. It is not supplied directly with Java or MySQL, but you can download it from the MySQL Organisation (now owned by Oracle).
On every box in the cluster:
You need to download the MySQL connector for Java (mysql-connector-java.jar) and put it in /usr/share/java. You can download it from MySQL, but it comes inside a tar.gz or zip file.
(Actual commands for unpacking an archive are left to the reader.)
Note - There may be other ways of getting this file which work just as well. Be aware that the file in the archive contains the version number. We copied the versioned file something like this:
cd /usr/share/java
cp mysql-connector-java-5.1.43-bin.jar mysql-connector-java.jar
Cloudera will copy this file several times to where it is needed.
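Because the version number in the jar name changes between releases, the copy step can be written so that it does not hard-code 5.1.43. A minimal sketch, assuming you have already unpacked the archive into some directory; the function name install_connector is hypothetical, and it simply picks the last matching mysql-connector-java-*-bin.jar in sort order.

```shell
# Copy the versioned connector jar to the unversioned name Cloudera expects.
# install_connector is a hypothetical helper: $1 = directory holding the unpacked jar,
# $2 = destination directory (defaults to /usr/share/java).
install_connector() {
  src_dir=$1
  dest_dir=${2:-/usr/share/java}
  jar=$(ls "$src_dir"/mysql-connector-java-*-bin.jar 2>/dev/null | sort | tail -n 1)
  if [ -z "$jar" ]; then
    echo "no connector jar found in $src_dir" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && cp "$jar" "$dest_dir/mysql-connector-java.jar"
}

# Usage (as root, after unpacking the download; the source path is a placeholder):
#   install_connector /tmp/mysql-connector-java-5.1.43
```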
Install Cloudera Manager agents (optional here)
If you like installing things manually you can (Optional) Manually Install the Oracle JDK, Cloudera Manager Agent, and CDH and Managed Service Packages, and do something similar on all hosts in the cluster.
Note - "You can install the Cloudera Manager agent manually on all hosts, or Cloudera Manager can install the Agents in a later step. To use Cloudera Manager to install the agents, skip this section."
# Optional - you can get Cloudera manager to do this later if you like...
curl https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/cloudera-manager.repo > /etc/yum.repos.d/cloudera-manager.repo
yum update
yum install cloudera-manager-agent cloudera-manager-daemons
Note - We are doing this on the Cloudera Manager machine as well if it is running any CDH services at all. There should not be a problem with using yum to install something more than once. Obviously, we don't need to download the same repo file multiple times on the same machine so you can skip that if you know what you are doing.
Note - if you get a "file not found" when doing this, then probably you have cut and pasted some funny character. Check the URL in a web browser and maybe cut and paste it from there.
If you have installed them yourself you now need to tell each of the agents where to find the Cloudera Manager Server which controls them.
vi /etc/cloudera-scm-agent/config.ini
# change
server_host=localhost
# to
server_host=fleet01
Alternatively, you can try to do this on one line, as below, where fleet01 is replaced by whatever machine you are installing Cloudera Manager onto.
sed -i.bak '/server_host=localhost/c\
We will assume that the port is still the default port and does not need changing.
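One way to script the server_host change on every agent box is a sed substitution wrapped in a function. This is our own sketch rather than a command from Cloudera's documentation: set_server_host is a hypothetical name, and it uses a plain s/// substitution and keeps a .bak copy of the original file.

```shell
# Point a Cloudera Manager agent at its server by rewriting the server_host line.
# set_server_host is a hypothetical helper: $1 = server host name,
# $2 = config file (defaults to the agent's config.ini).
set_server_host() {
  host=$1
  ini=${2:-/etc/cloudera-scm-agent/config.ini}
  # -i.bak edits in place and leaves the original as config.ini.bak
  sed -i.bak "s/^server_host=.*/server_host=${host}/" "$ini"
}

# Usage (as root on each agent box):
#   set_server_host fleet01
```

Only the server_host line is touched, so the port setting stays at whatever the file already contains.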
And, finally, on all boxes where you just installed Cloudera Manager agents, you need to start those agents with this command
service cloudera-scm-agent start
Note - Troubleshooting Installation and Upgrade Problems may be useful for any issues you encounter here.
Start Cloudera Manager Server
"When you Start the Cloudera Manager Server and Agents, Cloudera Manager assumes you are not already running HDFS and MapReduce." If you are, then stop them.
Frankly, I (Alex) don't understand why this is a problem for Cloudera Manager. Ambari does not have this problem.
Run this command on the Cloudera Manager Server host (fleet01 for us):
service cloudera-scm-server start
And wait several minutes (tea/coffee optional)...
Start and Log into the Cloudera Manager Admin Console
With Cloudera Manager Server started, you can now Start and Log into the Cloudera Manager Admin Console.
Note - The Cloudera Manager Server URL takes the following form http://Server host:port, where Server host is the fully qualified domain name (FQDN) or IP address of the host where the Cloudera Manager Server is installed, and port is the port configured for the Cloudera Manager Server. The default port is 7180.
In a web browser, enter http://Server host:7180, where Server host is the FQDN or IP address of the host where the Cloudera Manager Server is running.
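Since the server can take several minutes to come up, you may prefer to poll the Admin Console URL from a shell rather than refreshing a browser. The sketch below assumes curl is installed and that the default port 7180 is in use; cm_url and wait_for_cm are hypothetical helper names.

```shell
# Build the Admin Console URL from a host and an optional port (default 7180).
cm_url() {
  printf 'http://%s:%s\n' "$1" "${2:-7180}"
}

# Poll the URL until something answers, or give up after a number of tries.
# Arguments: host, port (default 7180), tries (default 30), delay in seconds (default 10).
wait_for_cm() {
  url=$(cm_url "$1" "${2:-7180}")
  tries=${3:-30}
  delay=${4:-10}
  while [ "$tries" -gt 0 ]; do
    # curl exits 0 as soon as it receives any HTTP response
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "Cloudera Manager is answering at $url"
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  echo "gave up waiting for $url" >&2
  return 1
}

# Print the URL we would open for our Cloudera Manager host.
cm_url fleet01.mycompany.com
```

Calling wait_for_cm fleet01.mycompany.com after starting the server should return once the login page is reachable.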
Cloudera License Options
Once you have accessed the Admin Console, you need to accept the Cloudera Manager End User License Terms and Conditions, and choose the license you will be operating under.
This is a bit frustrating to Alex because he thought we had done away with this sort of thing (licensing vs. open source).
| Name of License | | |
| Cloudera Express | Free/No license required | Limited in features. Can be upgraded to full license later on |
| Cloudera Enterprise Enterprise Data Hub Edition Trial | 60 days trial only. NO renewal | Only useful for temporary clusters, but can be upgraded to full license later on |
| Cloudera Enterprise - Basic Edition | basic license - typically annual subscription | The simplest license which gets you Cloudera Support |
| Cloudera Enterprise - Flex Edition | Possibly replaced by three other licenses now | One of three options which gives you more support in specific fields |
| Cloudera Enterprise Enterprise Data Hub Edition | complete license - typically annual subscription | Support for everything available in CDH. |
See Managing Licenses for more info.
!- For us the best version is "Cloudera Enterprise Enterprise Data Hub Edition Trial," but you will select the appropriate license for your requirements.
Tell Cloudera Manager which hosts are in the cluster
Choose Cloudera Manager Hosts by telling Cloudera Manager which hosts it looks after in the cluster - i.e. all the ones you installed an agent on.
For us it was fleet[01-05].mycompany.com
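The bracket pattern is shorthand for a numbered range of hosts. If you need the same list outside Cloudera Manager (for an ssh loop, say), it can be expanded in the shell. A small sketch, where expand_hosts is a hypothetical helper name and the two-digit zero padding matches our naming scheme:

```shell
# Expand a numbered host range such as fleet[01-05].mycompany.com.
# Arguments: prefix, first number, last number, suffix.
# Pass the numbers without leading zeros (1 and 5, not 01 and 05).
expand_hosts() {
  prefix=$1
  i=$2
  last=$3
  suffix=$4
  while [ "$i" -le "$last" ]; do
    printf '%s%02d%s\n' "$prefix" "$i" "$suffix"
    i=$((i + 1))
  done
}

# Prints fleet01.mycompany.com through fleet05.mycompany.com, one per line.
expand_hosts fleet 1 5 .mycompany.com
```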
To some extent Cloudera Manager can guess which machines each individual service (or role) should be run on. You can read about Cluster Hosts and Role Assignments and the five suggested layouts for clusters, depending on how big your hardware collection is...
Once you have started to install CDH you need some way of testing that it is working, so please read Smoke-Testing Cloudera CDH install.
?- If you would like us to include an article in this series on Host Allocations, let us know in the comments.