Configuring the Azure Kubernetes infrastructure with Bash scripts and Ansible tasks

This article is the fourth in a series about a solution my friend Hervé Leclerc (Alter Way CTO) and I have built to automate the deployment of a Kubernetes cluster on Azure:

· Kubernetes and Microsoft Azure

· Programmatic scripting of Kubernetes deployment

· Provisioning the Azure Kubernetes infrastructure with a declarative template

· Configuring the Azure Kubernetes infrastructure with Bash scripts and Ansible tasks

· Automating the Kubernetes UI dashboard configuration

· Using Kubernetes…

As we have seen previously, ARM script extensions can be used to execute commands on the provisioned machines. To automate the required setup on each machine, we considered different options. We could manage the cluster configuration with a collection of scripts, but it is more efficient to keep a declarative logic rather than a purely programmatic approach. What we need is a consistent, reliable, and secure way to manage the configuration of the Kubernetes cluster environment without the extra complexity of a heavyweight automation framework. That’s why we chose Ansible.

Ansible’s role in the Kubernetes configuration

Ansible is a configuration management solution based on a state-driven resource model that describes the desired state of the target, not the path to reach that state. Ansible’s role is to bring the environment to the desired state and to allow reliable, repeatable tasks without the failure risks associated with ad-hoc scripting. So, in order to handle the Kubernetes configuration, we provisioned a dedicated Ansible controller whose role is to enact the configuration on the different types of nodes.

[Image: the Ansible controller enacting the configuration on the different types of Kubernetes nodes]

Ansible playbooks

Ansible automation jobs are declared with a simple description language (YAML), both human-readable and machine-parsable, in what is called an Ansible playbook.

Ansible playbook implementation

Playbooks are the entry point of an Ansible provisioning. They contain information about the systems where the provisioning should be executed, as well as the directives or steps that should be executed. They can orchestrate the steps of any ordered process, even if different steps must bounce back and forth between sets of machines in a particular order.

Each playbook is composed of one or more “plays” in a list, whose goal is to map a group of hosts to well-defined roles (in our Kubernetes cluster: “master”, “node”, “etcd”) represented by tasks that call Ansible modules. By composing a playbook of multiple “plays”, it is possible to orchestrate multi-machine deployments. In our context, the Ansible playbook orchestrates the different slices of the Kubernetes infrastructure topology, with very detailed control over how many machines to handle at a time. So one of the critical tasks of our implementation was to create these playbooks.

When Hervé and I began to work on this automated Ansible CentOS Kubernetes deployment, we first checked whether there was an existing asset we could build the solution upon. And we found exactly what we were looking for with the “Deployment automation of Kubernetes on CentOS 7” repository on GitHub (https://github.com/xuant/ansible-kubernetes-centos).

Note: Hervé forked this Git repository to enable further modifications (https://github.com/herveleclerc/ansible-kubernetes-centos.git) and I made a submodule ["ansible-kubernetes-centos"] from this fork in order to clone this Git repository and keep it as a subdirectory of our main Git repository (keeping the commits separate).
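For reference, adding the fork as a submodule of the main repository looks roughly like this (the destination directory name is an assumption based on the submodule name mentioned above):

git submodule add https://github.com/herveleclerc/ansible-kubernetes-centos.git ansible-kubernetes-centos
git commit -m "Add ansible-kubernetes-centos submodule"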

Here's what the main playbook (https://github.com/herveleclerc/ansible-kubernetes-centos/blob/master/integrated-wait-deploy.yml) currently looks like:

 

---
- hosts: cluster
  become: yes
  gather_facts: false

  tasks:
  - name: "Wait for port 3333 to be ready"
    local_action: wait_for port=3333 host="{{ inventory_hostname }}" state=started connect_timeout=2 timeout=600

- hosts: cluster
  become: yes
  roles:
    - common
    - kubernetes

- hosts: etcd
  become: yes
  roles:
    - etcd

- hosts: masters
  become: yes
  roles:
    - master

- hosts: minions
  become: yes
  roles:
    - minion

- hosts: masters
  become: yes
  roles:
    - flannel-master

- hosts: minions
  become: yes
  roles:
    - flannel-minion

 

Ansible playbook deployment

Ansible allows you to act as another user (through the “become: yes” declaration), different from the user logged into the machine (the remote user). This is done using existing privilege escalation tools such as sudo. As the playbook is launched through the ARM custom script extension, the default “tty” requirement has to be disabled for the sudo users (and re-enabled after the playbook execution). The corresponding code that triggers the “integrated-wait-deploy.yml” playbook deployment is the following function:

function deploy()
{
  sed -i 's/Defaults requiretty/Defaults !requiretty/g' /etc/sudoers

  ansible-playbook -i "${ANSIBLE_HOST_FILE}" integrated-wait-deploy.yml | tee -a /tmp/deploy-"${LOG_DATE}".log

  sed -i 's/Defaults !requiretty/Defaults requiretty/g' /etc/sudoers
}

 

Solution environment setup

The Kubernetes nodes and the Ansible controller should have the same level of OS packages.

Ansible controller and nodes software environment

The software environment of the virtual machines involved in this Kubernetes cluster is set up using the open source EPEL (Extra Packages for Enterprise Linux) repository, a repository of add-on packages that complements the Fedora-based Red Hat Enterprise Linux (RHEL) and its compatible spinoffs, such as CentOS and Scientific Linux.

Current EPEL version is the following:

EPEL_REPO=https://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-8.noarch.rpm

 

Please note that a new EPEL release is sometimes published without any guarantee that the previous one remains available. So a limitation of our current template is the need to update this EPEL URL whenever a new version of the packages hosted on the EPEL repository is published.

The version of the EPEL repository is specified in the “ansible-config.sh” and “node-config.sh” scripts and installed with the following function:

function install_epel_repo()
{
   rpm -iUvh "${EPEL_REPO}"
}
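One possible mitigation of the URL retention issue (a sketch on our side, not what the current template does) is to fall back on the “epel-release” package available in the CentOS “extras” repository when the pinned RPM URL is no longer available:

function install_epel_repo()
{
  # Try the pinned EPEL RPM first; if the URL has been retired,
  # fall back to the epel-release package from the CentOS "extras" repository.
  rpm -iUvh "${EPEL_REPO}" || yum -y install epel-release
}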

 

Ansible controller setup

Ansible is set up with a simple clone from GitHub followed by an installation of the corresponding code.

Here is the code of the corresponding function:

function install_ansible()
{
  rm -rf ansible

  git clone https://github.com/ansible/ansible.git --depth 1

  cd ansible || error_log "Unable to cd to ansible directory"

  git submodule update --init --recursive

  make install
}

 

The “make install” step first requires the “Development Tools” group to be installed.

function install_required_groups()
{
  until yum -y group install "Development Tools"
  do
    log "Lock detected on VM init Try again..." "N"
    sleep 2
  done
}

 

Ansible configuration

Ansible configuration is defined in two files: the “hosts” inventory file and the “ansible.cfg” file. The Ansible control path has to be shortened to avoid errors with long host names, long user names or deeply nested home directories. The corresponding code for the “ansible.cfg” creation is the following:

function configure_ansible()
{
  rm -rf /etc/ansible
  mkdir -p /etc/ansible
  cp examples/hosts /etc/ansible/.
  printf "[defaults]\ndeprecation_warnings=False\n\n" >> "${ANSIBLE_CONFIG_FILE}"
  printf "[defaults]\nhost_key_checking = False\n\n" >> "${ANSIBLE_CONFIG_FILE}"
  echo $'[ssh_connection]\ncontrol_path = ~/.ssh/ansible-%%h-%%r' >> "${ANSIBLE_CONFIG_FILE}"
  printf "\npipelining = True\n" >> "${ANSIBLE_CONFIG_FILE}"
}

Ansible connection management

Ansible represents the machines it manages in a simple INI file that organizes the Kubernetes nodes in groups corresponding to their role. In our solution we created the “etcd”, “masters” and “minions” groups, completed by a “cluster” group that gathers all the nodes.

Here's what the plain text inventory file (“/etc/ansible/hosts”) looks like:

[Image: the “/etc/ansible/hosts” inventory file with its [masters], [minions], [etcd] and [cluster] groups]

This inventory has to be dynamically created depending on the cluster size requested in the ARM template.

Inventory based on the Kubernetes nodes’ dynamically generated private IPs

In our first ARM template implementation, each Kubernetes node launches a short “first-boot.sh” custom script in order to specify its dynamically generated private IP and its role in the cluster.

privateIP=$1
role=$2
FACTS=/etc/ansible/facts

mkdir -p $FACTS
echo "${privateIP},${role}" > $FACTS/private-ip-role.fact

chmod 755 /etc/ansible
chmod 755 /etc/ansible/facts
chmod a+r $FACTS/private-ip-role.fact

exit 0
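For illustration, the ARM extension invokes this script with the node’s private IP and its role as arguments, along these lines (hypothetical values):

sh first-boot.sh "10.0.1.4" "masters"   # writes "10.0.1.4,masters" to /etc/ansible/facts/private-ip-role.fact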

 

In this version, the “deploy.sh” script makes SSH requests to the nodes to get their private IPs in order to create the inventory file. Here is an extract of the corresponding function:

function get_private_ip()
{
  let numberOfMasters=$numberOfMasters-1

  for i in $(seq 0 $numberOfMasters)
  do
    let j=4+$i
    su - "${sshu}" -c "ssh -l ${sshu} ${subnetMasters3}.${j} cat $FACTS/private-ip-role.fact" >> /tmp/hosts.inv
  done

  #...and so on for etcd, and minions nodes
}

 

Inventory based on the Kubernetes nodes’ static ARM-computed IPs

In our second ARM implementation (the “linked” template composed of several ones), private IPs are computed directly in the ARM template. The following code is extracted from “KubeMasterNodes.json”:

  "resources": [

    {

      "apiVersion": "[parameters('apiVersion')]",

      "type": "Microsoft.Network/networkInterfaces",

      "name": "[concat(parameters('kubeMastersNicName'), copyindex())]",

      "location": "[resourceGroup().location]",

      "copy": {

        "name": "[variables('kubeMastersNetworkInterfacesCopy')]",

        "count": "[parameters('numberOfMasters')]"

      },

      "properties": {

        "ipConfigurations": [

          {

            "name": "MastersIpConfig",

            "properties": {

              "privateIPAllocationMethod": "Static",

"privateIPAddress" : "[concat(parameters('kubeMastersSubnetRoot'), '.',add(copyindex(),4) )]" ,

              "subnet": {

                "id": "[parameters('kubeMastersSubnetRef')]"

              },

              "loadBalancerBackendAddressPools": [

                {

                  "id": "[parameters('kubeMastersLbBackendPoolID')]"

            }

              ],

              "loadBalancerInboundNatRules": [

                {

                  "id": "[concat(parameters('kubeMastersLbID'),'/inboundNatRules/SSH-', copyindex())]"

                }

              ]

            }

          }

     ]

      }

    },

For example, with “kubeMastersSubnetRoot” set to “10.0.1”, the first master (copyindex() = 0) gets the address 10.0.1.4, the second 10.0.1.5, and so on. In the “ansible-config.sh” custom script file, the following function creates the inventory:

function create_inventory()
{
  masters=""
  etcd=""
  minions=""

  for i in $(cat /tmp/hosts.inv)
  do
    ip=$(echo "$i"|cut -f1 -d,)
    role=$(echo "$i"|cut -f2 -d,)

    if [ "$role" = "masters" ]; then
      masters=$(printf "%s\n%s" "${masters}" "${ip} ansible_user=${ANSIBLE_USER} ansible_ssh_private_key_file=/home/${ANSIBLE_USER}/.ssh/idgen_rsa")
    elif [ "$role" = "minions" ]; then
      minions=$(printf "%s\n%s" "${minions}" "${ip} ansible_user=${ANSIBLE_USER} ansible_ssh_private_key_file=/home/${ANSIBLE_USER}/.ssh/idgen_rsa")
    elif [ "$role" = "etcd" ]; then
      etcd=$(printf "%s\n%s" "${etcd}" "${ip} ansible_user=${ANSIBLE_USER} ansible_ssh_private_key_file=/home/${ANSIBLE_USER}/.ssh/idgen_rsa")
    fi
  done

  printf "[cluster]%s\n" "${masters}${minions}${etcd}" >> "${ANSIBLE_HOST_FILE}"
  printf "[masters]%s\n" "${masters}" >> "${ANSIBLE_HOST_FILE}"
  printf "[minions]%s\n" "${minions}" >> "${ANSIBLE_HOST_FILE}"
  printf "[etcd]%s\n" "${etcd}" >> "${ANSIBLE_HOST_FILE}"
}

 

As you may have noticed, this inventory completes each IP address with the Ansible user and the path of the SSH private key.
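With hypothetical IP addresses, one node per group and “ansible” as ${ANSIBLE_USER}, the generated inventory would look something like this (a sketch, not actual deployment output):

[cluster]
10.0.1.4 ansible_user=ansible ansible_ssh_private_key_file=/home/ansible/.ssh/idgen_rsa
10.0.2.4 ansible_user=ansible ansible_ssh_private_key_file=/home/ansible/.ssh/idgen_rsa
10.0.3.4 ansible_user=ansible ansible_ssh_private_key_file=/home/ansible/.ssh/idgen_rsa
[masters]
10.0.1.4 ansible_user=ansible ansible_ssh_private_key_file=/home/ansible/.ssh/idgen_rsa
[minions]
10.0.2.4 ansible_user=ansible ansible_ssh_private_key_file=/home/ansible/.ssh/idgen_rsa
[etcd]
10.0.3.4 ansible_user=ansible ansible_ssh_private_key_file=/home/ansible/.ssh/idgen_rsa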

Ansible security model

Ansible does not require any remote agent to deliver modules to remote systems and to execute tasks. It respects the security model of the system under management by running with user-supplied credentials and, as we have seen, includes support for “sudo”. To secure the remote configuration management, Ansible can also rely on its default transport layer, OpenSSH, with key-based authentication, which is more secure than password authentication. This is the way we decided to secure our deployment.

In our second ARM implementation, the SSH key generation is done with a simple call to “ssh-keygen” (in the first one, we used existing keys).

function generate_sshkeys()
{
  echo -e 'y\n'|ssh-keygen -b 4096 -f idgen_rsa -t rsa -q -N ''
}

Then the configuration is done by copying the generated key files into the .ssh folder of the “root” and “Ansible” users and by changing their permissions.

 

function ssh_config()
{
  printf "Host *\n user %s\n StrictHostKeyChecking no\n" "${ANSIBLE_USER}" >> "/home/${ANSIBLE_USER}/.ssh/config"

  cp idgen_rsa "/home/${ANSIBLE_USER}/.ssh/idgen_rsa"
  cp idgen_rsa.pub "/home/${ANSIBLE_USER}/.ssh/idgen_rsa.pub"
  chmod 700 "/home/${ANSIBLE_USER}/.ssh"
  chown -R "${ANSIBLE_USER}:" "/home/${ANSIBLE_USER}/.ssh"
  chmod 400 "/home/${ANSIBLE_USER}/.ssh/idgen_rsa"
  chmod 644 "/home/${ANSIBLE_USER}/.ssh/idgen_rsa.pub"
}

 

We also have to define a way to propagate the SSH keys between the Ansible controller and the Kubernetes nodes. At a minimum, the SSH public key has to be stored in the .ssh/authorized_keys file on all the computers Ansible has to log in to, while the private key is kept on the computer Ansible logs in from (in fact we automated the copy of both the public and private keys, because being able to SSH from any machine to the others was quite useful during the development phase…). In order to do this, we use Azure Storage.
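On each node, authorizing the fetched public key then boils down to something like this (a sketch of the idea, not the actual node configuration code):

cat idgen_rsa.pub >> "/home/${ANSIBLE_USER}/.ssh/authorized_keys"
chmod 600 "/home/${ANSIBLE_USER}/.ssh/authorized_keys"
chown "${ANSIBLE_USER}:" "/home/${ANSIBLE_USER}/.ssh/authorized_keys"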

SSH Key files sharing

The “ansible-config.sh” file launched on the Ansible controller generates the SSH keys and stores them as blobs in an Azure storage container with the “WriteSSHToPrivateStorage.py” Python script.

function put_sshkeys()
{
  log "Push ssh keys to Azure Storage" "N"
  python WriteSSHToPrivateStorage.py "${STORAGE_ACCOUNT_NAME}" "${STORAGE_ACCOUNT_KEY}" idgen_rsa
  error_log "Unable to write idgen_rsa to storage account ${STORAGE_ACCOUNT_NAME}"
  python WriteSSHToPrivateStorage.py "${STORAGE_ACCOUNT_NAME}" "${STORAGE_ACCOUNT_KEY}" idgen_rsa.pub
  error_log "Unable to write idgen_rsa.pub to storage account ${STORAGE_ACCOUNT_NAME}"
}

Here is the corresponding “WriteSSHToPrivateStorage.py” Python script:

import sys,os
from azure.storage.blob import BlockBlobService
from azure.storage.blob import ContentSettings

block_blob_service = BlockBlobService(account_name=str(sys.argv[1]), account_key=str(sys.argv[2]))
block_blob_service.create_container('keys')

block_blob_service.create_blob_from_path(
    'keys',
    str(sys.argv[3]),
    os.path.join(os.getcwd(),str(sys.argv[3])),
    content_settings=ContentSettings(content_type='text')
)

The “node-config.sh” file launched on each Kubernetes node gets the SSH key files from Azure Storage with the “ReadSSHToPrivateStorage.py” Python script, whose code is the following:

import sys,os

from azure.storage.blob import BlockBlobService
blob_service = BlockBlobService(account_name=str(sys.argv[1]), account_key=str(sys.argv[2]))

blob = blob_service.get_blob_to_path(
    'keys',
    str(sys.argv[3]),
    os.path.join(os.getcwd(),str(sys.argv[3])),
    max_connections=8
)

These two Python scripts require the corresponding Python modules to be installed on the Ansible controller and on the Kubernetes nodes. One easy way to do this is to first install “pip”, a package management system used to install and manage software packages written in Python. Here is the code corresponding to this setup.

function install_required_packages()
{
  until yum install -y git python2-devel python-pip libffi-devel libssl-dev openssl-devel
  do
    sleep 2
  done
}

function install_python_modules()
{
  pip install PyYAML jinja2 paramiko
  pip install --upgrade pip
  pip install azure-storage
}
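As a quick sanity check (our suggestion, not part of the original scripts), you can verify that the azure-storage module is importable before calling the Python scripts:

python -c "import azure.storage.blob; print('azure-storage OK')"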

But providing the technical mechanism to share the SSH key files between the Kubernetes nodes and the Ansible controller is not sufficient. We also have to define how they interact with each other.

Synchronization between Ansible controller and nodes

One of the complex requirements of the solution we propose is the way we handle the synchronization of the different configuration tasks.

ARM “dependsOn” and SSH key files pre-upload solution

In our first ARM template implementation, this is easily handled through the “dependsOn” attribute in the ARM template. This solution requires all the nodes to be deployed before launching the Ansible playbook with a custom “deploy.sh” script extension. This script is deployed on the Ansible controller with the “fileUris” settings (“scriptBlobUrl”). The other fileUris entries correspond to the SSH key files used for the Ansible secure connection management (and another token file used for monitoring through Slack).

In this solution, the synchronization between the Ansible controller and the nodes is quite easy, but it requires the publication of the SSH key files in an Azure storage account (or an Azure Key Vault) before launching the template deployment, so that the Ansible controller can download them before running the playbooks. If these files are not uploaded before launching the deployment, it will fail. Here are the corresponding files displayed with the Azure Management Studio tool.

[Image: the custom script and SSH key files stored in the Azure storage account, displayed with Azure Management Studio]
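For reference, this pre-upload can itself be scripted, for instance with the Azure CLI (a sketch with placeholder account, key and container names):

az storage container create --name keys --account-name <storageAccount> --account-key <storageKey>
az storage blob upload --container-name keys --name idgen_rsa --file idgen_rsa --account-name <storageAccount> --account-key <storageKey>
az storage blob upload --container-name keys --name idgen_rsa.pub --file idgen_rsa.pub --account-name <storageAccount> --account-key <storageKey>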

 

The following code is extracted from “azuredeploy.json” and corresponds to the Ansible extension provisioning.

[Image: extract of “azuredeploy.json” showing the Ansible controller custom script extension with its “fileUris” settings]

Netcat synchronization and automated Python SSH key files sharing through Azure Storage

The second ARM implementation (the “linked” template composed of several ones) avoids pre-publishing the keys in a vault by generating them directly during the deployment. While this solution may appear more elegant and more secure, by dedicating a generated SSH key to the Ansible configuration management instead of reusing existing keys, it adds a significant level of complexity, because it now requires handling the configuration of the Ansible controller and of the Kubernetes nodes in parallel.

The bash script on the Ansible controller generates the SSH keys but has to wait for the nodes to fetch these keys and declare them as authorized for logon before launching the Ansible playbook. A solution to this kind of scenario can be implemented with Netcat, a networking utility which reads and writes data across TCP/IP network connections, and which can be used directly or easily driven by other programs and scripts. The way we synchronized the “config-ansible.sh” and “config-node.sh” scripts is described in the following schema.

[Diagram: synchronization between “config-ansible.sh” and “config-node.sh”, each node starting a Netcat listener on port 3333 that the controller’s playbook waits for]

So the first machines which have to wait are the nodes. As they are waiting to get a file, it is quite simple to have them make this request periodically, in pull mode, until the file is there (with a limit on the number of attempts). The corresponding code extracted from “config-node.sh” is the following.

function get_sshkeys()
{
    c=0;

    sleep 80
    until python GetSSHFromPrivateStorage.py "${STORAGE_ACCOUNT_NAME}" "${STORAGE_ACCOUNT_KEY}" idgen_rsa
    do
        log "Fails to Get idgen_rsa key trying again ..." "N"
        sleep 80
        let c=${c}+1
        if [ "${c}" -gt 5 ]; then
           log "Timeout to get idgen_rsa key exiting ..." "1"
           exit 1
        fi
    done
    python GetSSHFromPrivateStorage.py "${STORAGE_ACCOUNT_NAME}" "${STORAGE_ACCOUNT_KEY}" idgen_rsa.pub
}

In order to launch the Netcat server in the background (with the “&” operator at the end of the line) and log out from the “config-node.sh” extension script session without the process being killed, the job is executed with the “nohup” command (which stands for “no hang up”).

To make sure that the Netcat server will not interfere with foreground commands and will continue to run after the extension script logs out, it is also required to define the way input and output are handled. The standard output is redirected to the “/tmp/nohup.log” file (with “> /tmp/nohup.log”) and the standard error is redirected to stdout (with the “2>&1” file descriptor redirection), so this file contains both the output and the error messages of the Netcat server launch. Closing the standard input prevents the process from reading anything from it (and from waiting to be brought to the foreground to get this data). On Linux, running a job with “nohup” automatically closes its input, so we could have skipped this part, but in the end the command I decided to use is this one.

function start_nc()
{
  log "Pause script for Control VM..." "N"
  nohup nc -l 3333 </dev/null >/tmp/nohup.log 2>&1 &
}
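To check manually that a node has reached this point, one can probe the port from the controller with plain bash (a simple check on our side, not part of the deployment scripts; 10.0.1.4 is a hypothetical node IP):

timeout 2 bash -c '</dev/tcp/10.0.1.4/3333' 2>/dev/null && echo "node is ready"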

The “start_nc()” function is the last one called in the “config-node.sh” file, which is executed on each cluster node. On the other side, the “ansible-config.sh” file launched on the Ansible controller executes without any pause and launches the Ansible playbook. It is the role of this playbook to wait for the availability of the Netcat server (which shows that the “config-node.sh” script has terminated successfully). This is easily done with the simple task we have already seen. This task is executed for every host belonging to the Ansible “cluster” group, i.e. all the Kubernetes cluster nodes.

- hosts: cluster
  become: yes
  gather_facts: false

  tasks:
  - name: "Wait for port 3333 to be ready"
    local_action: wait_for port=3333 host="{{ inventory_hostname }}" state=started connect_timeout=2 timeout=600

One last important point is how we handle the monitoring of the automated tasks.

Monitoring the automated configuration

Monitoring this part of the deployment means being able to monitor both the custom script extension and the Ansible playbook execution.

Custom script logs

One way to monitor both the script extension and the Ansible automated configuration is to connect to the provisioned VM and check the “stdout” and “errout” files in the waagent (the Microsoft Azure Linux Agent that manages the interaction of the Linux VM with the Azure Fabric Controller) directory: “/var/lib/waagent/Microsoft.OSTCExtensions.CustomScriptForLinux-1.4.1.0/download/0”
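For example, following the execution live from the VM looks like this (the extension version in the path may differ on your deployment):

cd /var/lib/waagent/Microsoft.OSTCExtensions.CustomScriptForLinux-1.4.1.0/download/0
sudo tail -f stdout errout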

[Image: the “stdout” and “errout” files of the custom script extension in the waagent download directory]

A more pleasant way is to monitor the custom script execution directly from Slack.

[Image: custom script execution logs displayed in real time in a Slack channel]

In order to get the output of the “ansible-config.sh” and “node-config.sh” deployment custom scripts in real time, you just have to define an “Incoming WebHooks” integration in your Slack team project.

[Image: configuration of the “Incoming WebHooks” integration in Slack]

This gives you a WebHook URL which can be called directly from those bash scripts in order to send JSON payloads to it.

LOG_URL=https://hooks.slack.com/services/xxxxxxxx/yyyyyyy/zzzzzzzzzzzzzzzzzzzzzz

 

Then it’s quite easy to integrate this in a log function such as this one:

function log()
{
  #...
  mess="$(date) - $(hostname): $1 $x"

  payload="payload={\"icon_emoji\":\":cloud:\",\"text\":\"$mess\"}"
  curl -s -X POST --data-urlencode "$payload" "$LOG_URL" > /dev/null 2>&1

  echo "$(date) : $1"
}
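A typical call then looks like this (the message is hypothetical; the second argument feeds the part of the function elided above as “#...”, as seen in the calls throughout these scripts):

log "Ansible playbook deployment started" "N"   # posts the message to Slack and echoes it locally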

Ansible logs

Ansible logs may also be read from the playbook execution output, which is piped to a tee command in order to write it both to the terminal and to the /tmp/deploy-"${LOG_DATE}".log file.

[Image: Ansible playbook execution output in the deployment log]

Once again it is possible to use Slack to get the Ansible logs in real time. In order to do that, we used an existing Slack Ansible plugin (https://github.com/traveloka/slack-ansible-plugin), which I added as a submodule of our GitHub solution (https://github.com/stephgou/slack-ansible-plugin). Once it’s done, we just have to get the Slack incoming WebHook token in order to set the SLACK_TOKEN and SLACK_CHANNEL variables.

Please note that, for practical reasons, I used base64 encoding in order to avoid handling the VM extension “fileUris” parameters outside of GitHub, because GitHub forbids token archiving. This implementation choice is definitely not a best practice. An alternative could be to put the token file in a vault or a storage account and copy it from “config-ansible.sh”.
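For illustration, the ENCODED_SLACK value consumed by the function below could be produced like this (a sketch; the token value is obviously fake):

ENCODED_SLACK=$(echo -n "xoxb-fake-slack-token" | base64)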

The corresponding code is extracted from the following function:

 

function ansible_slack_notification()
{
  encoded=$ENCODED_SLACK
  token=$(base64 -d -i <<<"$encoded")

  export SLACK_TOKEN="$token"
  export SLACK_CHANNEL="ansible"

  mkdir -p "/usr/share/ansible_plugins/callback_plugins"
  cd "$CWD" || error_log "unable to back with cd .."
  cd "$local_kub8/$slack_repo"
  pip install -r requirements.txt
  cp slack-logger.py /usr/share/ansible_plugins/callback_plugins/slack-logger.py
}

Thanks to this plugin, it is now possible to get the Ansible logs directly in Slack.

[Image: Ansible task logs posted by the plugin to the Slack “ansible” channel]

Conclusion

The combination of ARM templates, custom script extension files and Ansible playbooks enabled us to install the required system packages, to set up and configure Ansible, to deploy the required Python modules (for example to share the SSH key files through Azure Storage), to synchronize the configuration of our nodes and to execute the Ansible playbooks that finalize the Kubernetes deployment configuration.

Next time we’ll see how to automate the Kubernetes UI dashboard configuration… Stay tuned!