Friday 12 July 2019

OVS + DPDK Bond + VLAN + Network Namespace

I was recently investigating a bug in an OpenStack deployment. Guest were failing to receive metadata on boot. Digging into the deployment revealed that the service providing the metadata was running inside a network namespace, the namespace was attached to an ovs bridge using a tap device, the tap device's ovs port was associated with a specific vlan id, the ovs bridge was in turn using dpdk and for external network access two network cards were bonded together using an ovs dpdk bond. After grabbing a cup of tea to calm my nerves I started reading a fair amount of documentation and running a few tests. As far as I could tell everything was configured as it should be.

To be able to file a bug I needed to be able to reproduce this setup on the latest version of each piece of software, preferably without needing a full OpenStack deployment. Then I could start removing layers of complexity until I had this simplest reproducer of the bug.

Although no one single part of the process of reproducing this setup was particularly difficult it did involve a fair few moving parts and below I run through them (mainly for my benefit when next week I've forgotten everything I did).

DPDK

I was lucky enough to have access to a server with two dpdk compatible network cards which I could deploy using maas. The server had also been setup to have hugepages created on install. This was done by creating a custom maas tag and assigning it to the server:

2 ubuntu@maas:~⟫ maas maas tag read dpdk
Success.
Machine-readable output follows:
{
    "definition": "",
    "name": "dpdk",
    "resource_uri": "/MAAS/api/2.0/tags/dpdk/",
    "kernel_opts": "hugepages=103117 iommu=pt intel_iommu=on",
    "comment": "DPDK enabled machines"
}

After installing the development release of Ubuntu (eoan) on the server it was time to install the dpdk and ovs packages.

ubuntu@node-licetus:~$ lsb_release -a                                                             
No LSB modules are available.                                                                              
Distributor ID: Ubuntu                                                                             
Description:    Ubuntu Eoan Ermine (development branch)                                               
Release:        19.10                                                                                  
Codename:       eoan                                                                                   
ubuntu@node-licetus:~$ sudo apt-get -q install -y dpdk openvswitch-switch-dpdk          

Update the system to use openvswitch-switch-dpdk for ovs-vswitchd. 

ubuntu@node-licetus:~$ sudo update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk
update-alternatives: using /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in manual mode

dpdk references the network cards via their PCI address.  One way to find the PCI address of a network card, given its MAC address, is to examine the web of files and symbolic links in the /sys filesystem.

ubuntu@node-licetus:~$ grep -E 'a0:36:9f:dd:31:bc|a0:36:9f:dd:31:be' /sys/class/net/*/address
/sys/class/net/enp3s0f0/address:a0:36:9f:dd:31:bc
/sys/class/net/enp3s0f1/address:a0:36:9f:dd:31:be

ubuntu@node-licetus:~$ ls -ld /sys/class/net/enp3s0f0 /sys/class/net/enp3s0f1 | awk '{print $NF}' | awk 'BEGIN {FS="/"} {print $6}'
0000:03:00.0
0000:03:00.1
To switch the network cards from being kernel managed to being managed by dpdk the /etc/dpdk/interfaces is updated and dpdk restarted. When this is done the network cards will disappears from tools like ip.

root@node-licetus:~# ip -br addr show | grep enp
enp3s0f0         UP             fe80::a236:9fff:fedd:31bc/64 
enp3s0f1         UP             fe80::a236:9fff:fedd:31be/64 
root@node-licetus:~# echo "pci 0000:03:00.0 vfio-pci
> pci 0000:03:00.1 vfio-pci" >> /etc/dpdk/interfaces
root@node-licetus:~# systemctl restart dpdk
root@node-licetus:~# ip -br addr show | grep enp
root@node-licetus:~#


DPDK enabled OVS

There are a few global settings which need to be applied when using ovs with dpdk. The first relates to hugepages. Hugepages need to be allocated per NUMA node. First check that the hugepages have been created as requested by the kernel option specified in maas:

root@node-licetus:~# grep -i hugepages_ /proc/meminfo
HugePages_Total:   103117
HugePages_Free:    103117
HugePages_Rsvd:        0
HugePages_Surp:        0

Now see how many NUMA nodes there are:

# ls -ld /sys/devices/system/node/node* | wc -l
2

I chose to allocate 4096MB to each of the NUMA nodes. This is done via the dpdk-socket-mem option which takes a comma delimited list of hugepage numbers as its value:

root@node-licetus:~# ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="4096,4096"
root@node-licetus:~#



Next white list the network cards in ovs via their PCI addresses:



root@node-licetus:~# ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="--pci-whitelist 0000:03:00.0 --pci-whitelist 0000:03:00.1"
root@node-licetus:~# 


Finally restart openvswitch-switch and check the log:

root@node-licetus:~# systemctl restart openvswitch-switch
root@node-licetus:~#
root@node-licetus:~# grep --color -E 'PCI|DPDK|ovs-vswitchd' /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T12:19:51.475Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T12:19:51.496Z|00007|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
2019-07-11T13:07:54.806Z|00009|dpdk|ERR|DPDK not supported in this copy of Open vSwitch.
2019-07-11T13:08:17.961Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T13:08:17.969Z|00007|dpdk|INFO|Using DPDK 18.11.0
2019-07-11T13:08:17.969Z|00008|dpdk|INFO|DPDK Enabled - initializing...
2019-07-11T13:08:17.969Z|00011|dpdk|INFO|Per port memory for DPDK devices disabled.
2019-07-11T13:08:17.969Z|00012|dpdk|INFO|EAL ARGS: ovs-vswitchd --pci-whitelist 0000:03:00.0 --pci-whitelist 0000:03:00.1 --socket-mem 4096,4096 --socket-limit 4096,4096 -l 0.
2019-07-11T13:08:26.915Z|00019|dpdk|INFO|EAL: PCI device 0000:03:00.0 on NUMA socket 0
2019-07-11T13:08:27.600Z|00023|dpdk|INFO|EAL: PCI device 0000:03:00.1 on NUMA socket 0
2019-07-11T13:08:28.090Z|00026|dpdk|INFO|DPDK Enabled - initialized
2019-07-11T13:08:28.097Z|00051|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0

Bridge with DPDK Bonded NICs

As per  the OVS docs when creating the bridge the datapath_type needs to be set to netdev to tell ovs run in userspace mode . 

root@node-licetus:~# ovs-vsctl -- add-br br-test
root@node-licetus:~# ovs-vsctl -- set bridge br-test datapath_type=netdev

Now create the bond device and attach it to the bridge:

root@node-licetus:~# ovs-vsctl --may-exist add-bond br-test dpdk-bond0 dpdk-nic1 dpdk-nic2 \
>           -- set Interface dpdk-nic1 type=dpdk options:dpdk-devargs=0000:03:00.0 \
>           -- set Interface dpdk-nic2 type=dpdk options:dpdk-devargs=0000:03:00.1

ovs-vsctl seems quite happy to create the bond even if there is a problem so its worth taking a moment to check the device exists in the bridge without any errors:

root@node-licetus:~# ovs-vsctl show
181b55d1-999a-464b-adf4-d80ca1790988
    Bridge br-test
        Port br-test
            Interface br-test
                type: internal
        Port "dpdk-bond0"
            Interface "dpdk-nic1"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.0"}
            Interface "dpdk-nic2"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.1"}
    ovs_version: "2.11.0"

Part of my testing was to use jumbo frames so the final step is to set the mtu on the dpdk devices:

root@node-licetus:~# ovs-vsctl set Interface dpdk-nic1 mtu_request=9000
root@node-licetus:~# ovs-vsctl set Interface dpdk-nic2 mtu_request=9000
root@node-licetus:~# 

Tap in Network Namespace Attached to a Bridge.

Create a tap device called tap1 in the bridge:

root@node-licetus:~# ovs-vsctl add-port br-test tap1 -- set Interface tap1 type=internal
root@node-licetus:~# 

Create a network namespace called ns1 and place the tap1 into it.

root@node-licetus:~# ip netns add ns1
root@node-licetus:~# ip link set tap1 netns ns1


Bring up tap1 and assign it an IP address:

root@node-licetus:~# ip netns exec ns1 ip link set dev tap1 up
root@node-licetus:~# ip netns exec ns1 ip link set dev lo up
root@node-licetus:~# ip netns exec ns1 ip addr add 172.20.0.1/24 dev tap1
root@node-licetus:~# ip netns exec ns1 ip link set dev tap1 mtu 9000

Finally the network the tap needs to be on is vlan 2933 which is delivered to the network cards as part of a vlan trunk. To assign tap1 to the vlan with id 2933 the tap port needs to be tagged.

root@node-licetus:~# ovs-vsctl set port tap1 tag=2933
root@node-licetus:~#

Testing

That really is it. Below is the resulting bridge and tap interface:

root@node-licetus:~# ovs-vsctl show
181b55d1-999a-464b-adf4-d80ca1790988
    Bridge br-test
        Port "tap1"
            tag: 2933
            Interface "tap1"
                type: internal
        Port br-test
            Interface br-test
                type: internal
        Port "dpdk-bond0"
            Interface "dpdk-nic1"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.0"}
            Interface "dpdk-nic2"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.1"}
    ovs_version: "2.11.0"

root@node-licetus:~# ip netns exec ns1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
12: tap1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 0a:96:ad:cd:2e:7f brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.1/24 scope global tap1
       valid_lft forever preferred_lft forever
    inet6 fe80::896:adff:fecd:2e7f/64 scope link 
       valid_lft forever preferred_lft forever

From another machine that has access to vlan 2933 I can ping the tap device:

ubuntu@node-husband:~$ ping -c3 172.20.0.1
PING 172.20.0.1 (172.20.0.1) 56(84) bytes of data.
64 bytes from 172.20.0.1: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 172.20.0.1: icmp_seq=2 ttl=64 time=0.146 ms
64 bytes from 172.20.0.1: icmp_seq=3 ttl=64 time=0.144 ms

--- 172.20.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2046ms
rtt min/avg/max/mdev = 0.144/0.168/0.215/0.034 ms

Final Thoughts

If you want to use dpdk with OpenStack then the OpenStack charms make it easy. The charms look after all of the above and much more.

If you want a more complete guide to dpdk try here The new simplicity to consume dpdk

Tuesday 16 April 2019

Quick OpenStack Charm Test with Zaza

I was investigating a bug recently that appeared to be a race condition causing RabbitMQ server to occasionally fail to cluster when the charm was upgraded. To attempt to reproduce the bug I decided to use Python library zaza.

Firstly I created a tests & bundles directory:

$ mkdir -p tests/bundles/

Then added a bundle in tests/bundles/ha.yaml :

series: xenial
applications:
  rabbitmq-server:
    charm: cs:rabbitmq-server
    constraints: mem=1G
    num_units: 3


Once the deployment is complete I needed to test upgrading the charm so the next step was to define a test in tests/tests_rabbit_upgrade.py to do this:



#!/usr/bin/env python3

import unittest
import zaza.model

class UpgradeTest(unittest.TestCase):

    def test_upgrade(self):
        zaza.model.upgrade_charm(
            'rabbitmq-server',
            switch='cs:~openstack-charmers-next/rabbitmq-server-343')
        zaza.model.block_until_all_units_idle()

The last step is to tell Zaza what to do in a test run and this is done in the tests/tests.yaml.

tests:
  - tests.tests_rabbit_upgrade.UpgradeTest
configure:
  - zaza.charm_tests.noop.setup.basic_setup
smoke_bundles:
  - ha


This tells Zaza which bundle to run for smoke tests and which test(s) to run. There is no configuration step needed but at the moment zaza expects one so it is pointed at a no-op method. Obviously you need to have zaza installed which can be done in a virtualenv:

$ virtualenv -q -ppython3 venv3                                                                                                   
Already using interpreter /usr/bin/python3                                                                                                                      
$ source venv3/bin/activate                                                                                                      
(venv3) $ pip install git+https://github.com/openstack-charmers/zaza.git#egg=zaza

Zaza will generate a new model for each run and does not bring the new model into focus so it is safe to kick of the test run as many times as you like in parallel.

(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[1] 4768
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[2] 4790
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[3] 4795
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[4] 4811


All four deployments and tests ran in parallel. Below is an example output:

$ cat /tmp/run.ed5fe80c-6045-11e9-83b5-77a3728f6280 
Controller: gnuoy-serverstack

Model              Cloud/Region             Type       Status      Machines  Cores  Access  Last connection
controller         serverstack/serverstack  openstack  available          1      4  admin   just now
default            serverstack/serverstack  openstack  available          0      -  admin   2018-11-13
zaza-310115a7b9f9  serverstack/serverstack  openstack  available          0      -  admin   just now
zaza-98856c9f8ec5  serverstack/serverstack  openstack  available          0      -  admin   just now
zaza-9c013d3f9ec9  serverstack/serverstack  openstack  available          0      -  admin   never connected

INFO:root:Deploying bundle './tests/bundles/ha.yaml' on to 'zaza-310115a7b9f9' model
Resolving charm: cs:rabbitmq-server
Executing changes:
- upload charm cs:rabbitmq-server-85 for series xenial
- deploy application rabbitmq-server on xenial using cs:rabbitmq-server-85
- add unit rabbitmq-server/0 to new machine 0
- add unit rabbitmq-server/1 to new machine 1
- add unit rabbitmq-server/2 to new machine 2
Deploy of bundle completed.
INFO:root:Waiting for environment to settle
INFO:root:Waiting for a unit to appear
INFO:root:Waiting for all units to be idle
INFO:root:Checking workload status of rabbitmq-server/0
INFO:root:Checking workload status message of rabbitmq-server/0
INFO:root:Checking workload status of rabbitmq-server/1
INFO:root:Checking workload status message of rabbitmq-server/1
INFO:root:Checking workload status of rabbitmq-server/2
INFO:root:Checking workload status message of rabbitmq-server/2
INFO:root:OK
INFO:root:## Running Test tests.tests_rabbit_upgrade.UpgradeTest ##
test_upgrade (tests.tests_rabbit_upgrade.UpgradeTest) ... ok

----------------------------------------------------------------------
Ran 1 test in 87.692s

OK

Finally, because the --keep-model switch was used all four models are still available for any post test inspection:

$ juju list-models
Controller: gnuoy-serverstack

Model              Cloud/Region             Type       Status     Machines  Cores  Access  Last connection
controller         serverstack/serverstack  openstack  available         1      4  admin   just now
default*           serverstack/serverstack  openstack  available         0      -  admin   2018-11-13
zaza-310115a7b9f9  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago
zaza-98856c9f8ec5  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago
zaza-9c013d3f9ec9  serverstack/serverstack  openstack  available         3      6  admin   5 minutes ago
zaza-a1b9f911850b  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago


To read more about zaza head over to https://zaza.readthedocs.io/en/latest/

Saturday 13 April 2019

OpenStack Automated Instance Recovery

Who needs pets? It's all about the cattle in the cloud world, right? Unfortunately, reality has a habit of stepping on even the best laid plans. There is often one old legacy app (or more) which proves somewhat reticent to being turned into a self-healing, auto-load balancing cloud entity. In this case it does seems reasonable that if the app becomes unavailable the infrastructure should make some attempt to bring it back again. In the OpenStack world this is where projects like Masakari come in.

Masakari, when used in conjunction with masakari-monitors and Pacemaker, provides a service which detects the failure of a guest, or an entire compute node, and attempts to bring it back on line. The Masakari service provides an api and executor which reacts to notifications of a failure. Masakari-monitors, as the name suggests, detects failures and tells masakari about them. One mechanism it uses to detect failure is to monitor pacemaker, and when pacemaker reports a fellow compute node is down, masakari-monitors reports that to the Masakari service.

Masakari also provides two other recovery mechanisms. It provides a mechanism for detecting the failure of an individual guest and restarting that guest in situ. Finally, it also provides a mechanism for restarting an operating system process if it stops; this mechanism is not covered here and the charm is configured to disable it.

Below is a diagram from the Masakari Wiki showing the plumbing:


Introducing Pacemaker Remote

As mentioned above, Masakari monitors detect a compute node failure by querying the locally running cluster software. A hacluster charm already exists which will deploy corosync and pacemaker, and initial tests showed that this worked in a small environment. Unfortunately, the corosync/pacemaker is not designed to scale much past 16 nodes (see pacemaker documentation ) and a limit of 16 compute nodes is not acceptable in most clouds. 

This is where pacemaker remote comes in. A pacemaker remote node can run resources but does not run the full cluster stack in this case corosync is not installed on remote nodes and so the pacemaker remote nodes do not participate in the constant chatter needed to keep the cluster XML definitions up to date and consistent across the nodes.

A new Pacemaker Remote charm was developed which relates to an existing cluster and can run resources from that cluster. In this example the pacemaker remotes do not actually run any resources, they are just used as a mechanism of querying the state of all the nodes in the cluster. 

Unfortunately due to Bug #1728527 masakari-monitors fails to query pacemaker-remote properly.  A  patch has been proposed upstream. The patch is currently applied to the masakari-monitors package in stein in the Ubuntu Cloud Archive.


Charm Architecture

The masakari api service is deployed using a new masakari charm. This charm behaves in the same way as the other OpenStack API service charms. It expects relations with MySQL/Percona, RabbitMQ and Keystone. It can also be related to Vault to secure the endpoints with https. The only real difference to the other API charms is that it must be deployed in an ha configuration which means multiple units with an hacluster relation. The pacemaker-remote and masakari-monitors charms are both subordinates that need to be running on the compute nodes:


It might be expected that the masakari-monitors would have a direct relation to the masakari charm. In fact it does a lookup for the service in the catalogue so it only needs credentials which it obtains by having an identity-credentials relation with keystone.

STONITH

For a guest to be failed over to another compute host it must be using some form of shared storage. As a result, the scenario where a compute node has lost network access to its peers but continues to have access to shared storage must be considered. Masakari monitors on a peer compute node registers the compute node has gone and notifies the masakari api server. This in turn triggers the masakari engine to instigate a failover of that guest via nova. Assuming that nova concurs that the compute node has gone, it will attempt to bring it up on another node. At this point there are two guests both trying to write to the same shared storage which will likely lead to data corruption.

The solution is to enable stonith in pacemaker. When the cluster detects a node has disappeared, it runs a stonith plugin to power the compute node off. To enable stonith the hacluster charm now ships with a maas plugin that the stonith resource can use. The maas details are provided by the existing hacluster maas_url and maas_credentials config options. This allows the stonith resource to power off a node via the maas api which has the added advantage of abstracting the charm away from the individual power management system being used in this particular deployment.


Adding Masakari to a deployment

Below is an example of using an overlay to add masakari to an existing deployment. Obviously there is lots of configuration in the yaml below which is specific to each deployment and will need updating:

machines:
  '0':
    series: bionic
  '1':
    series: bionic
  '2':
    series: bionic
  '3':
    series: bionic
relations:
- - nova-compute:juju-info
  - masakari-monitors:container
- - masakari:ha
  - hacluster:ha
- - keystone:identity-credentials
  - masakari-monitors:identity-credentials
- - nova-compute:juju-info
  - pacemaker-remote:juju-info
- - hacluster:pacemaker-remote
  - pacemaker-remote:pacemaker-remote
- - masakari:identity-service
  - keystone:identity-service
- - masakari:shared-db
  - mysql:shared-db
- - masakari:amqp
  - rabbitmq-server:amqp
series: bionic
applications:
  masakari-monitors:
    charm: cs:~gnuoy/masakari-monitors-9
    series: bionic
  hacluster:
    charm: cs:~gnuoy/hacluster-38
    options:
      corosync_transport: unicast
      cluster_count: 3
      maas_url: http://10.0.0.205/MAAS
      maas_credentials: 3UC4:zSfGfk:Kaka2VUAD37zGF
  pacemaker-remote:
    charm: cs:~gnuoy/pacemaker-remote-10
    options:
      enable-stonith: True
      enable-resources: False
  masakari:
    charm: cs:~gnuoy/masakari-4
    series: bionic
    num_units: 3
    options:
      openstack-origin: cloud:bionic-rocky/proposed
      vip: '10.0.0.236 10.70.0.236 10.80.0.236'
    bindings:
      public: public
      admin: admin
      internal: internal
      shared-db: internal
      amqp: internal
    to:
    - 'lxd:1'
    - 'lxd:2'
    - 'lxd:3'

To add it to the existing model:

$ juju deploy base.yaml --overlay masakari-overlay.yaml --map-machines=0=0,1=1,2=2,3=3

Hacluster Resources

Each Pacemaker remote node has a corresponding resource which runs in the main cluster. The status of the Pacemaker nodes can be seen via crm status

$ sudo crm status
Stack: corosync
Current DC: juju-f0373f-1-lxd-2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Fri Apr 12 12:57:27 2019
Last change: Fri Apr 12 10:03:49 2019 by root via cibadmin on juju-f0373f-1-lxd-2

6 nodes configured
10 resources configured

Online: [ juju-f0373f-1-lxd-2 juju-f0373f-2-lxd-1 juju-f0373f-3-lxd-3 ]
RemoteOnline: [ frank-colt model-crow tidy-goose ]

Full list of resources:

 Resource Group: grp_masakari_vips
     res_masakari_29a0f9d_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
     res_masakari_321f78b_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
     res_masakari_578b519_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
 Clone Set: cl_res_masakari_haproxy [res_masakari_haproxy]
     Started: [ juju-f0373f-1-lxd-2 juju-f0373f-2-lxd-1 juju-f0373f-3-lxd-3 ]
 tidy-goose     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 frank-colt     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 model-crow     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 st-maas-5937691        (stonith:external/maas):        Started juju-f0373f-1-lxd-2


The output above shows that the three pacemaker-remote nodes (frank-colt, model-crow & tidy-goose) are online. It also shows each remote nodes corresponding ocf::pacemaker:remote resource and where that resource is running.

$ sudo crm configure show tidy-goose
primitive tidy-goose ocf:pacemaker:remote \
        params server=10.0.0.161 reconnect_interval=60 \
        op monitor interval=30s

By default the cluster setup by the hacluster charm has symmetric-cluster set to true. This means when a cluster resource is defined it is eligible to run on any node in the cluster. If a node should not run any given resource then the node has to opt out. In the masakari deployment the pacemaker-remote nodes are joined to the masakari api cluster. This would mean that the VIP used for accessing the api service and haproxy clone set would be eligible to run on the nova-compute nodes which would break. Given that there may be a large number of pacemaker-remote nodes (one per compute node) and there are likely to be exactly three masakari api units, it makes sense to switch the cluster to be being an opt-in cluster.  To achieve this symmetric-cluster is set to false and location rules are created allowing the VIP and haproxy set to run on the api units. Below is an example of these location rules:


$ sudo crm configure show
...
location loc-grp_masakari_vips-juju-f0373f-1-lxd-2 grp_masakari_vips 0: juju-f0373f-1-lxd-2
location loc-grp_masakari_vips-juju-f0373f-2-lxd-1 grp_masakari_vips 0: juju-f0373f-2-lxd-1
location loc-grp_masakari_vips-juju-f0373f-3-lxd-3 grp_masakari_vips 0: juju-f0373f-3-lxd-3


Each rule permits the grp_masakari_vips to run on that node, the score of 0 in the rule shows that all nodes are equal, it is not preferred that the resource run on one node rather than another.

The clones require one additional trick, they attempt to run everywhere regardless of the symmetric-cluster setting. To limit them to the api nodes location, rules are needed as before. So an additional setting called clone-max is applied to the clone resource limiting the number of places it should run.

clone cl_res_masakari_haproxy res_masakari_haproxy \
        meta clone-max=3


Finally the stonith configuration is also managed as a resource, as can be seen from the crm status at the start of the section where st-maas-5937691 is up and running.

$ sudo crm configure show st-maas-5937691
primitive st-maas-5937691 stonith:external/maas \
        params url="http://10.0.0.205/MAAS" apikey="3UC4:fGfk:37zGF" \
        hostnames="frank-colt model-crow tidy-goose" \
        op monitor interval=25 start-delay=25 timeout=25


As can be seen above, there is one stonith resource for all three nodes and the resource contains all the information needed to interact with the maas api.

Configuring Masakari

In Masakari the compute nodes are grouped into failover segments. In the event of a failure, guests are moved onto other nodes within the same segment. Which compute node is chosen to house the evacuated guests is determined by the recovery method of that segment. 

'AUTO' Recovery Method

With auto recovery the guests are relocated to any of the available nodes in the same segment. The problem with this approach is that there is no guarantee that resources will be available to accommodate guests from a failed compute node.

To configure a group of compute hosts for auto recovery, first create a segment with the recovery method set to auto:

$ openstack segment create segment1 auto COMPUTE
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| created_at      | 2019-04-12T13:59:50.000000           |
| updated_at      | None                                 |
| uuid            | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| name            | segment1                             |
| description     | None                                 |
| id              | 1                                    |
| service_type    | COMPUTE                              |
| recovery_method | auto                                 |
+-----------------+--------------------------------------+

Next the hypervisors need to be added into the segment, these should be referenced by their unqualified hostname:

$ openstack segment host create tidy-goose COMPUTE SSH 691b8ef3-7481-48b2-afb6-908a98c8a768             
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2019-04-12T14:18:24.000000           |
| updated_at          | None                                 |
| uuid                | 11b85c9d-2b97-4b83-b773-0e9565e407b5 |
| name                | tidy-goose                           |
| type                | COMPUTE                              |
| control_attributes  | SSH                                  |
| reserved            | False                                |
| on_maintenance      | False                                |
| failover_segment_id | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
+---------------------+--------------------------------------+

Repeat above for all remaining hypervisors:

$ openstack segment host list 691b8ef3-7481-48b2-afb6-908a98c8a768
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 75afadbb-67cc-47b2-914e-e3bf848028e4 | frank-colt | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| 11b85c9d-2b97-4b83-b773-0e9565e407b5 | tidy-goose | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| f1e9b0b4-3ac9-4f07-9f83-5af2f9151109 | model-crow | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+

'RESERVED_HOST' Recovery Method

With reserved_host recovery compute hosts are allocated as reserved which allows an operator to guarantee there is sufficient capacity available for any guests in need of evacuation.

Firstly create a segment with the reserved_host recovery method:

$ openstack segment create segment1 reserved_host COMPUTE -c uuid -f value
2598f8aa-3612-4731-9716-e126ca6cc280

Add a host using the --reserved switch to indicate that it will act as a standby:

$ openstack segment host create model-crow --reserved True COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280

Add the remaining hypervisors as before:

$ openstack segment host create frank-colt COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280
$ openstack segment host create tidy-goose COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280

Listing the segment hosts shows that model-crow is a reserved host:

$ openstack segment host list 2598f8aa-3612-4731-9716-e126ca6cc280
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 4769e08c-ed52-440a-866e-832b977aa5e2 | tidy-goose | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| 90aedbd2-e03b-4dbd-b330-a1c848f300df | frank-colt | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| c77574cc-b6e7-440e-9c86-84e91981f15e | model-crow | COMPUTE | SSH                | True     | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+

Finally disable the reserved host in nova so that it remains available for failover:

$ openstack compute service set --disable model-crow nova-compute
$ openstack compute service list
+----+----------------+---------------------+----------+----------+-------+----------------------------+
| ID | Binary         | Host                | Zone     | Status   | State | Updated At                 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+
|  1 | nova-scheduler | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T10:59:10.000000 |
|  5 | nova-conductor | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T10:59:08.000000 |
|  7 | nova-compute   | tidy-goose          | nova     | enabled  | up    | 2019-04-13T10:59:11.000000 |
|  8 | nova-compute   | frank-colt          | nova     | enabled  | up    | 2019-04-13T10:59:05.000000 |
|  9 | nova-compute   | model-crow          | nova     | disabled | up    | 2019-04-13T10:59:12.000000 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+

When a compute node failure is detected, masakari will disable the failed node and enable the reserve node in nova. After simulating a failure of frank-colt the service list now looks like this:

$ openstack compute service list
+----+----------------+---------------------+----------+----------+-------+----------------------------+
| ID | Binary         | Host                | Zone     | Status   | State | Updated At                 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+
|  1 | nova-scheduler | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T11:05:20.000000 |
|  5 | nova-conductor | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T11:05:28.000000 |
|  7 | nova-compute   | tidy-goose          | nova     | enabled  | up    | 2019-04-13T11:05:21.000000 |
|  8 | nova-compute   | frank-colt          | nova     | disabled | down  | 2019-04-13T11:03:56.000000 |
|  9 | nova-compute   | model-crow          | nova     | enabled  | up    | 2019-04-13T11:05:22.000000 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+

Since the reserved host has now been enabled and is hosting evacuated guests, masakari has removed the reserved flag from it. Masakari has also placed the failed node in maintenance mode.

$ openstack segment host list 2598f8aa-3612-4731-9716-e126ca6cc280
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 4769e08c-ed52-440a-866e-832b977aa5e2 | tidy-goose | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| 90aedbd2-e03b-4dbd-b330-a1c848f300df | frank-colt | COMPUTE | SSH                | False    | True           | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| c77574cc-b6e7-440e-9c86-84e91981f15e | model-crow | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+

 ‘AUTO_PRIORITY’ and ‘RH_PRIORITY’ Recovery Methods

These methods appear to chain the previous methods together. So, auto_priority attempts to move the guest using the auto method first and if that fails it tries the reserved_host method. rh_priority does the same thing but in the reverse order. See Pike Release Note for details.

Individual Instance Recovery

Finally, to use the masakari feature which reacts to a single guest failing rather than a whole hypervisor, the guest(s) need to be marked with a small piece of metadata:

$ openstack server set --property HA_Enabled=True server_120419134342




Testing Instance Failure

The simplest scenario to test is single guest recovery. It is worth noting that the whole stack is quite good at detecting intentional shutdown and will do nothing if it detects it. So to test masakari, processes need to be shutdown in a disorderly fashion, in this case sending a SIGKILL to the guests qemu process:

root@model-crow:~# pgrep -f qemu-system-x86_64
733213
root@model-crow:~# pkill -f -9 qemu-system-x86_64; sleep 10; pgrep -f qemu-system-x86_64
733851

The guest was killed and then restarted. Check the masakari instance monitor log to see what happened:

2019-04-12 14:30:56.269 189712 INFO masakarimonitors.instancemonitor.libvirt_handler.callback [-] Libvirt Event: type=VM, hostname=model-crow, uuid=4ce60f57-e8af-4a9a-b3f5-852d428c6890, time=2019-04-12 14:30:56.269703, event_id=LIFECYCLE, detail=STOPPED_FAILED)
2019-04-12 14:30:56.270 189712 INFO masakarimonitors.ha.masakari [-] Send a notification. 
{'notification':
{'type': 'VM',
     'hostname': 'model-crow',
     'generated_time': datetime.datetime(2019, 4, 12, 14, 30, 56, 269703),
     'payload': {
         'event': 'LIFECYCLE',
         'instance_uuid': '4ce60f57-e8af-4a9a-b3f5-852d428c6890',
         'vir_domain_event': 'STOPPED_FAILED'}}}
2019-04-12 14:30:56.695 189712 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=VM, hostname=model-crow, generated_time=2019-04-12T14:30:56.269703, 
payload={
   'event': 'LIFECYCLE',
   'instance_uuid': '4ce60f57-e8af-4a9a-b3f5-852d428c6890',
    'vir_domain_event': 'STOPPED_FAILED'}, id=4,
    notification_uuid=8c29844b-a79c-45ad-a7c7-2931fa263dab,
    source_host_uuid=f1e9b0b4-3ac9-4f07-9f83-5af2f9151109,
    status=new, created_at=2019-04-12T14:30:56.655930,
    updated_at=None)



The log shows the instance monitor spotting the guest going down and informing the masakari api service - note that the instance monitor does not start the guest itself. It is the masakari engine which resides on the maskari api units that performs the recovery:

# tail -11 /var/log/masakari/masakari-engine.log
2019-04-12 14:30:56.696 39396 INFO masakari.engine.manager [req-9b1465ed-5261-4f6b-b091-e1fb91a60583 33d5abf329d24757aa972c6a06a75e96 6f6b9d4f6c314fbba2753d7210052cd2 - - -] Processing notification 8c29844b-a79c-45ad-a7c7-2931fa263dab of type: VM
2019-04-12 14:30:56.723 39396 INFO masakari.compute.nova [req-c8d5cd4a-2f47-4677-af4e-9939067768dc masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:57.309 39396 INFO masakari.compute.nova [req-698f61d9-9765-4109-97f4-e56f5685a251 masakari - - - -] Call stop server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:57.780 39396 INFO masakari.compute.nova [req-1b9f58c1-de57-488f-904d-be244ce8d990 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:58.406 39396 INFO masakari.compute.nova [req-ddb32f52-d907-40c8-b790-bbe156f38878 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:59.127 39396 INFO masakari.compute.nova [req-b7b28eaa-4559-4fc4-9a89-e565694038d4 masakari - - - -] Call start server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:59.692 39396 INFO masakari.compute.nova [req-18fd6893-cbc9-4b15-b413-bf780b9583e7 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:00.698 39396 INFO masakari.compute.nova [req-1674a605-48af-4986-bc4b-1bc80111bf4d masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:01.700 39396 INFO masakari.compute.nova [req-86ba7d2d-5fd9-40e3-86e8-2cf506f02217 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:02.703 39396 INFO masakari.compute.nova [req-f4e19958-7c04-4c29-bf02-de98064729b2 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:03.262 39396 INFO masakari.engine.manager [req-f4e19958-7c04-4c29-bf02-de98064729b2 masakari - - - -] Notification 8c29844b-a79c-45ad-a7c7-2931fa263dab exits with status: finished.

Testing Compute Node failure

In the event of a compute node failing, pacemaker should spot this. The masakari host monitor periodically checks the node status as reported by pacemaker and in the event of a failure a notification is sent to the masakari api. Pacemaker should run the stonith resource to power off the node and masakari should move guests that were running on the compute node on to another available node.

The first thing to do is to check what hypervisor the guest is running on:

$ openstack server show server_120419134342 -c OS-EXT-SRV-ATTR:host -f value
model-crow


On that hypervisor disable systemd recovery for both the pacemaker_remote and nova-compte:

root@model-crow:~# sed -i -e 's/^Restart=.*/Restart=no/g' /lib/systemd/system/pacemaker_remote.service 
root@model-crow:~# sed -i -e 's/^Restart=.*/Restart=no/g' /lib/systemd/system/nova-compute.service
root@model-crow:~# systemctl daemon-reload

Now kill the processes:

root@model-crow:~# pkill -9 -f /usr/bin/nova-compute
root@model-crow:~# pkill -9 -f pacemaker_remoted


In the case of a compute node failure it is the compute nodes peers who should detect the failure and post the notification to masakari:

2019-04-12 15:21:29.464 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] Works on pacemaker-remote.
2019-04-12 15:21:29.643 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'model-crow' is 'offline'.
2019-04-12 15:21:29.644 886030 INFO masakarimonitors.ha.masakari [-] Send a notification. {'notification': {'type': 'COMPUTE_HOST', 'hostname': 'model-crow', 'generated_time': datetime.datetime(2019, 4, 12, 15, 21, 29, 644323), 'payload': {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}}}
2019-04-12 15:21:30.746 886030 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=COMPUTE_HOST, hostname=model-crow, generated_time=2019-04-12T15:21:29.644323, payload={'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}, id=5, notification_uuid=2f62f9dd-7edf-406c-aa16-0a4e8e3a3726, source_host_uuid=f1e9b0b4-3ac9-4f07-9f83-5af2f9151109, status=new, created_at=2019-04-12T15:21:30.706730, updated_at=None)
2019-04-12 15:21:30.747 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'tidy-goose' is 'online'.


As before, check the masakari engine has responded:

$ tail -n12  /var/log/masakari/masakari-engine.log
2019-04-12 14:30:22.004 31553 INFO masakari.compute.nova [req-ec98fbd0-717b-41db-883b-3d46eea82a0e masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:22.556 31553 INFO masakari.engine.manager [req-ec98fbd0-717b-41db-883b-3d46eea82a0e masakari - - - -] Notification cee025d2-9736-4238-91e2-378661373f9d exits with status: finished.
2019-04-12 15:21:30.744 31553 INFO masakari.engine.manager [req-7713b994-c112-4e7f-905d-ef2cc497ca9a 33d5abf329d24757aa972c6a06a75e96 6f6b9d4f6c314fbba2753d7210052cd2 - - -] Processing notification 2f62f9dd-7edf-406c-aa16-0a4e8e3a3726 of type: COMPUTE_HOST
2019-04-12 15:21:30.827 31553 INFO masakari.compute.nova [req-8abbebe8-a6fb-48f1-90fc-113ad68d1c38 masakari - - - -] Disable nova-compute on model-crow
2019-04-12 15:21:31.252 31553 INFO masakari.engine.drivers.taskflow.host_failure [req-8abbebe8-a6fb-48f1-90fc-113ad68d1c38 masakari - - - -] Sleeping 30 sec before starting recovery thread until nova recognizes the node down.
2019-04-12 15:22:01.257 31553 INFO masakari.compute.nova [req-658b10fc-cd2c-4b28-8f1c-21e9da5ca14e masakari - - - -] Fetch Server list on model-crow
2019-04-12 15:22:02.309 31553 INFO masakari.compute.nova [req-9c8c65c4-0a69-4cc5-90db-5aef4f35f22a masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:02.963 31553 INFO masakari.compute.nova [req-71a4c4a6-2950-416a-916c-b075eff51b4c masakari - - - -] Call lock server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:03.872 31553 INFO masakari.compute.nova [req-71a4c4a6-2950-416a-916c-b075eff51b4c masakari - - - -] Call evacuate command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890 on host None
2019-04-12 15:22:04.473 31553 INFO masakari.compute.nova [req-8363a1bf-0214-4db4-b45a-c720c4113830 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:05.221 31553 INFO masakari.compute.nova [req-f7410918-02e7-4bff-afdd-81a1c24c6ee6 masakari - - - -] Call unlock server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:05.783 31553 INFO masakari.engine.manager [req-f7410918-02e7-4bff-afdd-81a1c24c6ee6 masakari - - - -] Notification 2f62f9dd-7edf-406c-aa16-0a4e8e3a3726 exits with status: finished.


Now check that the guest has moved:

$ openstack server show server_120419134342 -c OS-EXT-SRV-ATTR:host -f value
frank-colt

And finally that the stonith resource ran:

root@juju-f0373f-2-lxd-1:~# tail -f  /var/log/syslog
Apr 12 15:21:39 juju-f0373f-1-lxd-2 external/maas[201018]: info: Performing power reset on model-crow
Apr 12 15:21:42 juju-f0373f-1-lxd-2 external/maas[201018]: info: model-crow is in power state unknown
Apr 12 15:21:42 juju-f0373f-1-lxd-2 external/maas[201018]: info: Powering off model-crow