Google Cloud Images

We provide a pre-built Cloudbreak Deployer cloud image for the Google Cloud Platform (GCP). You can launch the latest Cloudbreak Deployer image from the Google Developers Console.

Alternatively, instead of using the pre-built cloud image for GCP, you can install Cloudbreak Deployer on your own VM. See the installation page for more information.

Please make sure you have added the following ports to your firewall rules:

Cloudbreak Deployer GCP Image Details

Import Cloudbreak Deployer Image

You can import the latest Cloudbreak Deployer image on the Google Developers Console with the help of the Google Cloud Shell.

Just click on the Activate Google Cloud Shell icon in the top right corner of the page:


Images are global resources, so you can use them across zones and projects.

You can create your own Cloudbreak Deployer (cbd) instance from the imported image on the Google Developers Console.

Minimum and recommended VM requirements: 4 GB RAM, 10 GB disk, 2 cores

Google Setup

Cloudbreak Deployer Highlights

Setup Cloudbreak Deployer

You should already have the Cloudbreak Deployer, either from the GCP Cloud Images or by installing it manually on your own VM. (The minimum instance type suitable for Cloudbreak is n1-standard-2.)

If you installed it on your own VM, check the Initialize your Profile section here before starting the provisioning.

There are several ways to connect to the previously created cbd VM.
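For example, one way is to SSH in from your workstation with the gcloud CLI (a sketch; the instance name and zone below are placeholders for your own values):

gcloud compute ssh my-cbd-instance --zone us-central1-a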

Open the cloudbreak-deployment directory:

cd /var/lib/cloudbreak-deployment

This is the directory of the configuration files and the supporting binaries for Cloudbreak Deployer.

Initialize your Profile

First initialize cbd by creating a Profile file:

cbd init

This creates a Profile file in the current directory. Open the Profile file and check the PUBLIC_IP variable. Setting it is mandatory, because it is used to access the Cloudbreak UI (called Uluwatu). The cbd tool tries to guess it during initialization; if it cannot determine the IP address, set the appropriate value yourself.
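A minimal Profile might look like this (the address below is only illustrative; use your VM's own public IP):

export PUBLIC_IP=104.196.10.20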

Start Cloudbreak Deployer

To start the Cloudbreak application use the following command. This will start all the Docker containers and initialize the application.

cbd start

The first run takes a while, because all the necessary Docker images have to be downloaded.

The cbd start command includes the cbd generate command which applies the following steps:

Validate the started Cloudbreak Deployer

After the cbd start command finishes, the following are worth checking:

   cbd doctor

If a cbd update is needed, please check the related documentation: Cloudbreak Deployer Update. Note that most cbd commands require root permissions.
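If an update is required, it is typically performed from the same deployment directory (a sketch; see the linked documentation for the full procedure):

   cbd update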

   cbd logs cloudbreak

Cloudbreak should start within a minute - you should see a line like this: `Started CloudbreakApplication in 36.823 seconds`

Provisioning Prerequisites

Creating a Google Cloud Service Account

Follow the instructions in Google Cloud's documentation to create a Service account and Generate a new P12 key.

Make sure that at API level (APIs and auth menu) you have enabled:

If you have just enabled these APIs, you may have to wait about 10 minutes for the provider to activate them.

When creating GCP credentials in Cloudbreak you will have to provide the email address of your Service Account (from the Service accounts page of your Google Cloud Platform Permissions) and the Project ID (from the Dashboard of your Google Cloud Platform Home) where the service account is created. You'll also have to upload the generated P12 file and provide an OpenSSH formatted public key that will be used as an SSH key.

Generate a new SSH key

All the instances created by Cloudbreak are configured to allow key-based SSH, so you'll need to provide an SSH public key that can be used later to SSH onto the instances in the clusters you'll create with Cloudbreak. You can use one of your existing keys or you can generate a new one.

To generate a new SSH keypair:

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
# Creates a new ssh key, using the provided email as a label
# Generating public/private rsa key pair.
# Enter file in which to save the key (/Users/you/.ssh/id_rsa): [Press enter]
You'll be asked to enter a passphrase, but you can leave it empty.

# Enter passphrase (empty for no passphrase): [Type a passphrase]
# Enter same passphrase again: [Type passphrase again]

After you enter a passphrase the keypair is generated. The output should look something like below.

# Your identification has been saved in /Users/you/.ssh/id_rsa.
# Your public key has been saved in /Users/you/.ssh/id_rsa.pub.
# The key fingerprint is:
# 01:0f:f4:3b:ca:85:sd:17:sd:7d:sd:68:9d:sd:a2:sd your_email@example.com

Later you'll need to pass the .pub file's contents to Cloudbreak and use the private part to SSH to the instances.
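To display the public key that you will later paste into Cloudbreak (assuming the default file location from above):

cat ~/.ssh/id_rsa.pub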

Provisioning via Browser

You can log into the Cloudbreak application at http://<PUBLIC_IP>:3000.

The main goal of the Cloudbreak UI is to make it easy to create clusters on your own cloud provider account. This description details the GCP setup - if you'd like to use a different cloud provider, check out its manual.

This document explains the four steps that need to be followed to create Cloudbreak clusters from the UI:

IMPORTANT Make sure that you have a sufficient quota (CPU, network, etc.) for the requested cluster size.

Setting up GCP credentials

Cloudbreak works by connecting to your GCP account through so-called credentials, and then uses these credentials to create resources on your behalf. The credentials can be configured on the manage credentials panel of the Cloudbreak Dashboard.

To create a new GCP credential follow these steps:

  1. Fill out the new credential Name
    • Only lowercase alphanumeric characters can be used (min 5, max 100 characters)
  2. Copy your GCP project ID to the Project Id field
  3. Copy your GCP Service Account email address to the Service Account Email Address field
  4. Upload your GCP Service Account private key (generated p12 Key) to the Service Account Private (p12) Key field
  5. Copy your SSH public key to the SSH public key field
    • The SSH public key must be in OpenSSH format, and its private key pair can be used later to SSH onto every instance of every cluster you create with this credential.
    • The SSH username for the GCP instances is cloudbreak.

Any other parameter is optional here.

Public in account means that all the users belonging to your account will be able to use this credential to create clusters, but they cannot delete it.


Infrastructure templates

After your GCP account is linked to Cloudbreak you can start creating resource templates that describe your clusters' infrastructure:

When you create one of the above resources, Cloudbreak does not make any requests to GCP. Resources are only created on GCP after the create cluster button has been pushed. These templates are saved to Cloudbreak's database and can be reused with multiple clusters to describe the infrastructure.

Templates

Templates describe the instances of your cluster - the instance type and the attached volumes. A typical setup is to combine multiple templates in a cluster for the different types of nodes. For example, you may want to attach multiple large disks to the datanodes or have memory-optimized instances for Spark nodes.

The instance templates can be configured on the manage templates panel on the Cloudbreak Dashboard.

If Public in account is checked, all the users belonging to your account will be able to use this resource to create clusters, but they cannot delete it.

Networks

Your clusters can be created in their own networks or in one of your already existing ones. If you choose an existing network, it is possible to create a new subnet within it. The subnet's IP range must be defined in the Subnet (CIDR) field using the general CIDR notation. You can read more about GCP Networks and Subnet networks.

Default GCP Network

If you don't want to create or use your custom network, you can use the default-gcp-network for all your Cloudbreak clusters. It will create a new network with a 10.0.0.0/16 subnet every time a cluster is created.

Custom GCP Network

If you'd like to deploy a cluster to a custom network you'll have to create a new network template on the manage networks panel.

You have the following options:

IMPORTANT Please make sure the subnet defined here doesn't overlap with any of your already deployed subnets in the network, because validation only happens after the cluster creation starts.

If you use an existing subnet, make sure you have enough room within your network space for the new instances. The provided subnet CIDR will be ignored and a proper CIDR range will be used.

If Public in account is checked, all the users belonging to your account will be able to use this network template to create clusters, but they cannot delete it.

NOTE The new networks are created on GCP only after the cluster provisioning starts with the selected network template.


Security groups

Security group templates are very similar to firewalls on GCP. They describe the allowed inbound traffic to the instances in the cluster. Currently only one security group template can be selected for a Cloudbreak cluster, and all the instances have a public IP address, so all the instances in the cluster will belong to the same security group. This may change in a later release.

Default Security Group

You can also use the two pre-defined security groups in Cloudbreak.

only-ssh-and-ssl: all ports are locked down except for SSH and gateway HTTPS (you can't access Hadoop services outside of the virtual network):

all-services-port: all Hadoop services and SSH, gateway HTTPS are accessible by default:

Custom Security Group

You can define your own security group by adding all the ports, protocols and CIDR ranges you'd like to use. The rules defined here don't need to include the internal rules; those are automatically added by Cloudbreak to the security group on GCP.

IMPORTANT Ports 443 and 22 need to be open in every security group, otherwise Cloudbreak won't be able to communicate with the provisioned cluster.
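For example, a minimal custom security group satisfying this rule could contain only the following entries (the CIDR range is illustrative; narrow it to your own address range where possible):

0.0.0.0/0   TCP   22    (SSH)
0.0.0.0/0   TCP   443   (gateway HTTPS)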

If Public in account is checked, all the users belonging to your account will be able to use this security group template to create clusters, but they cannot delete it.

NOTE The security groups are created on GCP only after the cluster provisioning starts with the selected security group template.


Defining cluster services

Blueprints

Blueprints are your declarative definition of a Hadoop cluster. These are the same blueprints that are used by Ambari.

You can use the 3 default blueprints pre-defined in Cloudbreak or create your own. Blueprints can be added from a file or a URL (an example blueprint), or the whole JSON can be written into the JSON text box.

The host groups in the JSON will be mapped to a set of instances when starting the cluster. Besides this the services and components will also be installed on the corresponding nodes. Blueprints can be modified later from the Ambari UI.

NOTE It is not necessary to define every configuration in the blueprint. If a configuration is missing, Ambari fills it in with a default value.

If Public in account is checked, all the users belonging to your account will be able to use this blueprint to create clusters, but they cannot delete or modify it.


A blueprint can be exported from a running Ambari cluster and reused in Cloudbreak with slight modifications. There is no automatic way to modify an exported blueprint and make it instantly usable in Cloudbreak; the modifications have to be done manually. When the blueprint is exported, some configurations are hardcoded (for example domain names and memory configurations) that won't be applicable to the Cloudbreak cluster.

Cluster customization

Sometimes it can be useful to define custom scripts, so-called Recipes, in Cloudbreak that run during cluster creation and add some additional functionality.

For example, it can be a service you'd like to install that is not supported by Ambari, or a script that automatically downloads some data to the necessary nodes. The most notable example is the Ranger setup:

To learn more about these and to check out the Ranger recipe, take a look at the Cluster customization documentation.

Cluster deployment

After all the cluster resources are configured you can deploy a new HDP cluster.

Here is a basic flow for cluster creation on Cloudbreak Web UI:


Configure Cluster tab

Setup Network and Security tab

Choose Blueprint tab

Add File System tab

Review and Launch tab

Cloudbreak uses the Google Cloud Platform to create the resources - you can check the resources created by Cloudbreak on the Compute Engine page of the Google Cloud Platform console.

Besides these, you can check the progress on the Cloudbreak Web UI itself by opening the new cluster's Event History.

Advanced options

There are some advanced features when deploying a new cluster; these are the following:

Availability Zone You can restrict the instances to a specific availability zone. It may be useful if you're using reserved instances.

Minimum cluster size The provisioning strategy to use if the cloud provider cannot allocate all the requested nodes.

Validate blueprint This is selected by default. Cloudbreak validates the Ambari blueprint in this case.

Shipyard enabled cluster This is selected by default. Cloudbreak will start a Shipyard container which helps you to manage your containers.

Config recommendation strategy Strategy for how configuration recommendations will be applied. Recommended configurations are gathered from the response of the stack advisor.

Start LDAP and configure SSSD Enables the System Security Services Daemon configuration.

Cluster termination

You can terminate running or stopped clusters with the terminate button in the cluster details.

IMPORTANT Always use Cloudbreak to terminate the cluster. If that fails for some reason, try to delete the GCP instances first. Instances are started in an instance group, so they may be restarted if you terminate an instance manually!

Sometimes Cloudbreak cannot synchronize its state with the cluster state at the cloud provider and the cluster can't be terminated. In this case the Forced termination option can help to terminate the cluster on the Cloudbreak side. If this happens:

  1. Check the related resources on the Google Cloud Platform
  2. If needed, remove the remaining resources there manually


Interactive mode / Cloudbreak Shell

The goal of the Cloudbreak Shell (Cloudbreak CLI) is to provide an interactive command-line tool which supports:

Start Cloudbreak Shell

To start the Cloudbreak CLI use the following commands:

   cd cloudbreak-deployment
   cbd start
   cbd util cloudbreak-shell

The first run takes a while, because all the necessary Docker images have to be downloaded.

This will launch the Cloudbreak shell inside a Docker container; it is then ready to use.

IMPORTANT You have to copy every file you'd like to use in the shell into the cbd working directory. For example, if your cbd working directory is ~/cloudbreak-deployment, then copy your blueprint JSON, public SSH key file, etc. there. You can then refer to these files by name from the shell.

Autocomplete and hints

Cloudbreak Shell helps you with hint messages from the very beginning, for example:

cloudbreak-shell>hint
Hint: Add a blueprint with the 'blueprint add' command or select an existing one with 'blueprint select'
cloudbreak-shell>

Beyond this, you can use autocompletion (double-TAB) as well:

cloudbreak-shell>credential create --
credential create --AWS          credential create --AZURE        credential create --EC2          credential create --GCP          credential create --OPENSTACK

Provisioning via CLI

Setting up GCP credential

Cloudbreak works by connecting to your GCP account through so-called credentials, and then uses these credentials to create resources on your behalf. Credentials can be configured with the following command, for example:

credential create --GCP --description "sample description" --name my-gcp-credential --projectId <your gcp projectid> --serviceAccountId <your GCP service account mail address> --serviceAccountPrivateKeyPath /files/mykey.p12 --sshKeyString "ssh-rsa AAAAB3***etc."

NOTE Cloudbreak does not use your cloud user details - it is built around the concept of GCP Service Account Credentials. You should already have a valid GCP service account. You can find further details here.

Alternative ways to provide the SSH key:
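Two variants appear in this guide: --sshKeyString (the inline key shown above) and --sshKeyFile (the path of a public key file in the cbd working directory). A sketch of the file-based form:

credential create --GCP --description "sample description" --name my-gcp-credential --projectId <your gcp projectid> --serviceAccountId <your GCP service account mail address> --serviceAccountPrivateKeyPath /files/mykey.p12 --sshKeyFile id_rsa.pub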

You can check whether the credential was created successfully:

credential list

You can switch between your existing credentials:

credential select --name my-gcp-credential

Infrastructure templates

After your GCP account is linked to Cloudbreak you can start creating resource templates that describe your clusters' infrastructure:

When you create one of the above resources, Cloudbreak does not make any requests to GCP. Resources are only created on GCP after the cluster create command has been applied. These templates are saved to Cloudbreak's database and can be reused with multiple clusters to describe the infrastructure.

Templates

Templates describe the instances of your cluster - the instance type and the attached volumes. A typical setup is to combine multiple templates in a cluster for the different types of nodes. For example, you may want to attach multiple large disks to the datanodes or have memory-optimized instances for Spark nodes.

A template can be used repeatedly to create identical copies of the same stack (or to be used as a foundation to start a new stack). Templates can be configured with the following command, for example:

template create --GCP --name my-gcp-template --instanceType n1-standard-2 --volumeCount 2 --volumeSize 100

Other available options:

--volumeType The default is pd-standard (HDD); the other allowed value is pd-ssd (SSD).

--publicInAccount If it is true, all the users belonging to your account will be able to use this template to create clusters, but they cannot delete it.

You can check whether the template was created successfully:

template list

Networks

Your clusters can be created in their own networks or in one of your already existing ones. If you choose an existing network, it is possible to create a new subnet within it. The subnet's IP range must be defined in the Subnet (CIDR) field using the general CIDR notation. You can read more about GCP Networks and Subnet networks.

Default GCP Network

If you don't want to create or use your custom network, you can use the default-gcp-network for all your Cloudbreak clusters. It will create a new network with a 10.0.0.0/16 subnet every time a cluster is created.

Custom GCP Network

If you'd like to deploy a cluster to a custom network, you'll have to apply the following command:

network create --GCP --name my-gcp-network --description "sample description"

Other available options:

--networkId The Virtual Network Identifier of your network. This is an optional value and must be an ID of an existing GCP virtual network. If the identifier is provided, the subnet CIDR will be ignored and the existing network's CIDR range will be used.

--publicInAccount If it is true, all the users belonging to your account will be able to use this network template to create clusters, but they cannot delete it.

--subnet The subnet that will be used by the cluster (it will be created during provisioning).

--subnetId If you have an existing subnet in the network, you can specify its id here and the cluster will use that existing subnet.

IMPORTANT Please make sure the subnet defined here doesn't overlap with any of your already deployed subnets in the network, because validation only happens after the cluster creation starts.

If you use an existing subnet, make sure you have enough room within your network space for the new instances. The provided subnet CIDR will be ignored and a proper CIDR range will be used.

You can check whether the network was created successfully:

network list

NOTE The new networks are created on GCP only after the cluster provisioning starts with the selected network template.

Defining cluster services

Blueprints

Blueprints are your declarative definition of a Hadoop cluster. These are the same blueprints that are used by Ambari.

You can use the 3 default blueprints pre-defined in Cloudbreak or create your own. Blueprints can be added from a file or a URL (an example blueprint).

The host groups in the JSON will be mapped to a set of instances when starting the cluster. Besides this the services and components will also be installed on the corresponding nodes. Blueprints can be modified later from the Ambari UI.

NOTE It is not necessary to define every configuration in the blueprint. If a configuration is missing, Ambari fills it in with a default value.

blueprint add --name my-blueprint --description "sample description" --file <the path of the blueprint>

Other available options:

--url The URL of the blueprint.

--publicInAccount If it is true, all the users belonging to your account will be able to use this blueprint to create clusters, but they cannot delete it.
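For example, adding the same blueprint from a URL instead of a file (a sketch using the options above):

blueprint add --name my-blueprint --description "sample description" --url <the url of the blueprint>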

You can check whether the blueprint was created successfully:

blueprint list

A blueprint can be exported from a running Ambari cluster and reused in Cloudbreak with slight modifications. There is no automatic way to modify an exported blueprint and make it instantly usable in Cloudbreak; the modifications have to be done manually. When the blueprint is exported, some configurations are hardcoded (for example domain names and memory configurations) that won't be applicable to the Cloudbreak cluster.

Cluster customization

Sometimes it can be useful to define custom scripts, so-called Recipes, in Cloudbreak that run during cluster creation and add some additional functionality.

For example, it can be a service you'd like to install that is not supported by Ambari, or a script that automatically downloads some data to the necessary nodes. The most notable example is the Ranger setup:

To learn more about these and to check out the Ranger recipe, take a look at the Cluster customization documentation.

Metadata show

You can check the stack metadata with:

stack metadata --name mygcpstack --instancegroup master

Other available options:

--id Select the stack by its id instead of its name.

--outputType Modify the output format of the command (RAW or JSON).
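For example, to print the metadata of the master instance group as JSON (a sketch combining the options above):

stack metadata --name mygcpstack --instancegroup master --outputType JSON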

Cluster deployment

After all the cluster resources are configured you can deploy a new HDP cluster. The following sub-sections show you a basic flow for cluster creation with Cloudbreak Shell.

Select credential

Select one of your previously created GCP credentials:

credential select --name my-gcp-credential

Select blueprint

Select one of your previously created blueprints which fits your needs:

blueprint select --name multi-node-hdfs-yarn

Configure instance groups

You must configure the instance groups before provisioning. An instance group defines a group of nodes with a specified template. Usually we create an instance group for each host group in the blueprint.

instancegroup configure --instanceGroup cbgateway --nodecount 1 --templateName minviable-gcp
instancegroup configure --instanceGroup master --nodecount 1 --templateName minviable-gcp
instancegroup configure --instanceGroup slave_1 --nodecount 1 --templateName minviable-gcp

Another available option:

--templateId The id of the template

Select network

Select one of your previously created networks which fits your needs, or a default one:

network select --name default-gcp-network

Select security group

Select one of your previously created security groups which fits your needs, or a default one:

securitygroup select --name all-services-port

Create stack / Create cloud infrastructure

A stack is the running cloud infrastructure that is created based on the resources configured earlier (credential, instance groups, network, security group). As with the API or the UI, the new cluster will use your templates and launch your cloud stack on GCP. Use the following command to create a stack to be used with your Hadoop cluster:

stack create --name mygcpstack --region us-central1

The infrastructure is created asynchronously; the state of the stack can be checked with the stack show command. If it reports AVAILABLE, the virtual machines and the corresponding infrastructure are running at the cloud provider.
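For example, after the create command returns you can check the state of the currently selected stack:

stack show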

Another available option is:

--wait - in this case the create command will return only after the process has finished.

Create a Hadoop cluster / Cloud provisioning

You are almost done! One more command and your Hadoop cluster is starting! Cloud provisioning is done once the cluster is up and running. The new cluster will use your selected blueprint and install your custom Hadoop cluster with the selected components and services.

cluster create --description "my first cluster"

Another available option is --wait - in this case the create command will return only after the process has finished.

You are done! There are several ways to check the progress during the infrastructure creation and provisioning:


         cluster show


Stop cluster

You can stop your existing cluster and then its stack if you want to suspend work on it.

Select a stack, for example by its name:

stack select --name my-stack

Another available option to define a stack is its --id.

Always stop the cluster first, then the stack. Apply the following commands to stop the previously selected stack:

cluster stop
stack stop

Restart cluster

Select the stack that you would like to restart, then apply:

stack start

After the stack has successfully restarted, you can restart the related cluster as well:

cluster start

Upscale cluster

If you need more instances in your infrastructure, you can upscale your selected stack:

stack node --ADD --instanceGroup host_group_slave_1 --adjustment 6

Another available option is --withClusterUpScale - this also triggers a cluster upscale after the stack upscale. You can also upscale the related cluster separately if you want:

cluster node --ADD --hostgroup host_group_slave_1 --adjustment 6

Downscale cluster

You can also reduce the number of instances in your infrastructure. After you have selected your stack:

cluster node --REMOVE  --hostgroup host_group_slave_1 --adjustment -2

Another available option is --withStackDownScale - this also triggers a stack downscale after the cluster downscale. You can also downscale the related stack separately if you want:

stack node --REMOVE  --instanceGroup host_group_slave_1 --adjustment -2

Cluster termination

You can terminate running or stopped clusters with:

stack terminate --name mygcpstack

Another available option is --wait - in this case the terminate command will return only after the process has finished.

IMPORTANT Always use Cloudbreak to terminate the cluster. If that fails for some reason, try to delete the GCP instances first. Instances are started in an instance group, so they may be restarted if you terminate an instance manually!

Sometimes Cloudbreak cannot synchronize its state with the cluster state at the cloud provider and the cluster can't be terminated. In this case the Forced termination option on the Cloudbreak Web UI can help to terminate the cluster on the Cloudbreak side. If this happens:

  1. Check the related resources on the Google Cloud Platform
  2. If needed, remove the remaining resources there manually

Silent mode

With Cloudbreak Shell you can also execute script files. A script file contains shell commands and can be executed with the script Cloudbreak shell command:

script <your script file>

or with the cbd util cloudbreak-shell-quiet command:

cbd util cloudbreak-shell-quiet < example.sh
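For example, example.sh could contain the same commands you would otherwise type interactively (a hypothetical two-line script):

credential select --name my-gcp-credential
stack show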

IMPORTANT You have to copy every file you'd like to use in the shell into the cbd working directory. For example, if your cbd working directory is ~/cloudbreak-deployment, then copy your script file there.

Example

The following example creates a Hadoop cluster with the hdp-small-default blueprint on n1-standard-4 instances with 2x100 GB attached disks, on the default-gcp-network network, using the all-services-port security group. You should copy your SSH public key file (named id_rsa.pub) and your generated GCP service account private key (named gcp.p12) into your cbd working directory, and replace the <...> parts with your GCP credential details.

credential create --GCP --description "my credential" --name my-gcp-credential --projectId <your gcp projectid> --serviceAccountId <your GCP service account mail address> --serviceAccountPrivateKeyPath gcp.p12 --sshKeyFile id_rsa.pub
credential select --name my-gcp-credential
template create --GCP --name gcptemplate --description gcp-template --instanceType n1-standard-4 --volumeSize 100 --volumeCount 2
blueprint select --name hdp-small-default
instancegroup configure --instanceGroup cbgateway --nodecount 1 --templateName gcptemplate
instancegroup configure --instanceGroup host_group_master_1 --nodecount 1 --templateName gcptemplate
instancegroup configure --instanceGroup host_group_master_2 --nodecount 1 --templateName gcptemplate
instancegroup configure --instanceGroup host_group_master_3 --nodecount 1 --templateName gcptemplate
instancegroup configure --instanceGroup host_group_client_1  --nodecount 1 --templateName gcptemplate
instancegroup configure --instanceGroup host_group_slave_1 --nodecount 3 --templateName gcptemplate
network select --name default-gcp-network
securitygroup select --name all-services-port
stack create --name my-first-stack --region us-central1
cluster create --description "My first cluster"

Congratulations! Your cluster should now be up and running this way as well. To learn more about Cloudbreak and provisioning, we have some interesting insights for you.
