Friday, 12 July 2019

OVS + DPDK Bond + VLAN + Network Namespace

I was recently investigating a bug in an OpenStack deployment. Guests were failing to receive metadata on boot. Digging into the deployment revealed that the service providing the metadata was running inside a network namespace. The namespace was attached to an ovs bridge using a tap device, and the tap device's ovs port was associated with a specific vlan id. The ovs bridge was in turn using dpdk, and for external network access two network cards were bonded together using an ovs dpdk bond. After grabbing a cup of tea to calm my nerves I started reading a fair amount of documentation and running a few tests. As far as I could tell everything was configured as it should be.

To be able to file a bug I needed to be able to reproduce this setup on the latest version of each piece of software, preferably without needing a full OpenStack deployment. Then I could start removing layers of complexity until I had the simplest reproducer of the bug.

Although no single part of the process of reproducing this setup was particularly difficult, it did involve a fair few moving parts, so below I run through them (mainly for my own benefit when, next week, I've forgotten everything I did).

DPDK

I was lucky enough to have access to a server with two dpdk compatible network cards which I could deploy using maas. The server had also been set up to have hugepages created on install. This was done by creating a custom maas tag and assigning it to the server:

2 ubuntu@maas:~⟫ maas maas tag read dpdk
Success.
Machine-readable output follows:
{
    "definition": "",
    "name": "dpdk",
    "resource_uri": "/MAAS/api/2.0/tags/dpdk/",
    "kernel_opts": "hugepages=103117 iommu=pt intel_iommu=on",
    "comment": "DPDK enabled machines"
}

After installing the development release of Ubuntu (eoan) on the server it was time to install the dpdk and ovs packages.

ubuntu@node-licetus:~$ lsb_release -a                                                             
No LSB modules are available.                                                                              
Distributor ID: Ubuntu                                                                             
Description:    Ubuntu Eoan Ermine (development branch)                                               
Release:        19.10                                                                                  
Codename:       eoan                                                                                   
ubuntu@node-licetus:~$ sudo apt-get -q install -y dpdk openvswitch-switch-dpdk          

Update the system to use openvswitch-switch-dpdk for ovs-vswitchd. 

ubuntu@node-licetus:~$ sudo update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk
update-alternatives: using /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in manual mode

dpdk references the network cards via their PCI address.  One way to find the PCI address of a network card, given its MAC address, is to examine the web of files and symbolic links in the /sys filesystem.

ubuntu@node-licetus:~$ grep -E 'a0:36:9f:dd:31:bc|a0:36:9f:dd:31:be' /sys/class/net/*/address
/sys/class/net/enp3s0f0/address:a0:36:9f:dd:31:bc
/sys/class/net/enp3s0f1/address:a0:36:9f:dd:31:be

ubuntu@node-licetus:~$ ls -ld /sys/class/net/enp3s0f0 /sys/class/net/enp3s0f1 | awk '{print $NF}' | awk 'BEGIN {FS="/"} {print $6}'
0000:03:00.0
0000:03:00.1
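
The same lookup can be sketched in Python. This is only an illustration and not part of the original procedure: the helper names are made up here, and the code assumes that the resolved sysfs path has the PCI address as the component immediately before `net`, which is what the awk pipeline above relies on too.

```python
import os
import re

# Matches a full PCI address such as 0000:03:00.0 (domain:bus:device.function).
PCI_ADDRESS = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def pci_address_from_sysfs(path):
    """Return the PCI address embedded in a resolved sysfs net device path.

    Assumes the layout .../<pci-address>/net/<ifname>. Returns None when the
    path does not look like that, for example for a virtual device.
    """
    parts = path.split("/")
    for i in range(len(parts) - 1):
        if parts[i + 1] == "net" and PCI_ADDRESS.match(parts[i]):
            return parts[i]
    return None


def pci_address_of(ifname):
    """Resolve /sys/class/net/<ifname> and extract its PCI address (hypothetical helper)."""
    resolved = os.path.realpath(os.path.join("/sys/class/net", ifname))
    return pci_address_from_sysfs(resolved)
```

On the server used here, `pci_address_of("enp3s0f0")` would be expected to give the same address as the shell pipeline, provided it is run before the card is handed over to dpdk.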

To switch the network cards from being kernel managed to being managed by dpdk, the /etc/dpdk/interfaces file is updated and dpdk restarted. When this is done the network cards will disappear from tools like ip.

root@node-licetus:~# ip -br addr show | grep enp
enp3s0f0         UP             fe80::a236:9fff:fedd:31bc/64 
enp3s0f1         UP             fe80::a236:9fff:fedd:31be/64 
root@node-licetus:~# echo "pci 0000:03:00.0 vfio-pci
> pci 0000:03:00.1 vfio-pci" >> /etc/dpdk/interfaces
root@node-licetus:~# systemctl restart dpdk
root@node-licetus:~# ip -br addr show | grep enp
root@node-licetus:~#


DPDK enabled OVS

There are a few global settings which need to be applied when using ovs with dpdk. The first relates to hugepages. Hugepages need to be allocated per NUMA node. First check that the hugepages have been created as requested by the kernel option specified in maas:

root@node-licetus:~# grep -i hugepages_ /proc/meminfo
HugePages_Total:   103117
HugePages_Free:    103117
HugePages_Rsvd:        0
HugePages_Surp:        0

Now see how many NUMA nodes there are:

# ls -ld /sys/devices/system/node/node* | wc -l
2

I chose to allocate 4096MB to each of the NUMA nodes. This is done via the dpdk-socket-mem option, which takes a comma-delimited list of memory sizes in MB, one per NUMA node, to be pre-allocated from the hugepages:

root@node-licetus:~# ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="4096,4096"
root@node-licetus:~#
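
As a rough sanity check on the numbers, the sketch below builds the dpdk-socket-mem value and estimates how many hugepages it consumes, treating each entry as megabytes per NUMA node. The function names are made up for illustration, and the hugepage size is an assumption: it is not shown above, so check the Hugepagesize line in /proc/meminfo on your own system.

```python
def socket_mem_value(mb_per_node, numa_nodes):
    """Build the comma-delimited dpdk-socket-mem value, one entry per NUMA node."""
    return ",".join(str(mb_per_node) for _ in range(numa_nodes))


def hugepages_consumed(mb_per_node, numa_nodes, hugepage_size_kb):
    """Estimate the number of hugepages used, rounding up on each node.

    hugepage_size_kb is an assumption supplied by the caller, e.g. 2048 for
    2MB pages or 1048576 for 1GB pages.
    """
    per_node_kb = mb_per_node * 1024
    pages_per_node = -(-per_node_kb // hugepage_size_kb)  # ceiling division
    return pages_per_node * numa_nodes


# Two NUMA nodes with 4096MB each, as configured above.
print(socket_mem_value(4096, 2))
print(hugepages_consumed(4096, 2, hugepage_size_kb=2048))
```

Comparing the estimate with HugePages_Total gives a quick indication of whether the requested memory can actually be satisfied.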



Next, whitelist the network cards in ovs via their PCI addresses:



root@node-licetus:~# ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="--pci-whitelist 0000:03:00.0 --pci-whitelist 0000:03:00.1"
root@node-licetus:~# 


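Finally, restart openvswitch-switch and check the log. The earlier "DPDK not supported" error dates from before the switch to the dpdk build of ovs-vswitchd; the entries that matter are the ones after the latest restart: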

root@node-licetus:~# systemctl restart openvswitch-switch
root@node-licetus:~#
root@node-licetus:~# grep --color -E 'PCI|DPDK|ovs-vswitchd' /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T12:19:51.475Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T12:19:51.496Z|00007|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
2019-07-11T13:07:54.806Z|00009|dpdk|ERR|DPDK not supported in this copy of Open vSwitch.
2019-07-11T13:08:17.961Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T13:08:17.969Z|00007|dpdk|INFO|Using DPDK 18.11.0
2019-07-11T13:08:17.969Z|00008|dpdk|INFO|DPDK Enabled - initializing...
2019-07-11T13:08:17.969Z|00011|dpdk|INFO|Per port memory for DPDK devices disabled.
2019-07-11T13:08:17.969Z|00012|dpdk|INFO|EAL ARGS: ovs-vswitchd --pci-whitelist 0000:03:00.0 --pci-whitelist 0000:03:00.1 --socket-mem 4096,4096 --socket-limit 4096,4096 -l 0.
2019-07-11T13:08:26.915Z|00019|dpdk|INFO|EAL: PCI device 0000:03:00.0 on NUMA socket 0
2019-07-11T13:08:27.600Z|00023|dpdk|INFO|EAL: PCI device 0000:03:00.1 on NUMA socket 0
2019-07-11T13:08:28.090Z|00026|dpdk|INFO|DPDK Enabled - initialized
2019-07-11T13:08:28.097Z|00051|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
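
The log check can also be scripted. The sketch below is an assumption-laden illustration rather than an official tool: it relies on the pipe-delimited layout seen in the log above (timestamp, sequence number, module, level, message) and on the exact message strings shown there, and the function names are made up.

```python
def parse_vswitchd_line(line):
    """Split an ovs-vswitchd log line into its five pipe-delimited fields."""
    timestamp, seq, module, level, message = line.rstrip("\n").split("|", 4)
    return {
        "timestamp": timestamp,
        "seq": seq,
        "module": module,
        "level": level,
        "message": message,
    }


def dpdk_state(lines):
    """Return 'initialized', 'unsupported' or 'unknown'.

    The most recent matching dpdk message wins, so an old error from before
    a restart is overridden by a later successful initialisation.
    """
    state = "unknown"
    for line in lines:
        if line.count("|") < 4:
            continue  # not in the expected format, ignore it
        entry = parse_vswitchd_line(line)
        if entry["module"] != "dpdk":
            continue
        if entry["message"] == "DPDK Enabled - initialized":
            state = "initialized"
        elif "DPDK not supported" in entry["message"]:
            state = "unsupported"
    return state
```

Feeding it the lines of /var/log/openvswitch/ovs-vswitchd.log, for example via `open(...).readlines()`, should report the state after the latest restart.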

Bridge with DPDK Bonded NICs

As per the OVS docs, when creating the bridge the datapath_type needs to be set to netdev to tell ovs to run in userspace mode.

root@node-licetus:~# ovs-vsctl -- add-br br-test
root@node-licetus:~# ovs-vsctl -- set bridge br-test datapath_type=netdev

Now create the bond device and attach it to the bridge:

root@node-licetus:~# ovs-vsctl --may-exist add-bond br-test dpdk-bond0 dpdk-nic1 dpdk-nic2 \
>           -- set Interface dpdk-nic1 type=dpdk options:dpdk-devargs=0000:03:00.0 \
>           -- set Interface dpdk-nic2 type=dpdk options:dpdk-devargs=0000:03:00.1

ovs-vsctl seems quite happy to create the bond even if there is a problem, so it's worth taking a moment to check the device exists in the bridge without any errors:

root@node-licetus:~# ovs-vsctl show
181b55d1-999a-464b-adf4-d80ca1790988
    Bridge br-test
        Port br-test
            Interface br-test
                type: internal
        Port "dpdk-bond0"
            Interface "dpdk-nic1"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.0"}
            Interface "dpdk-nic2"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.1"}
    ovs_version: "2.11.0"

Part of my testing was to use jumbo frames so the final step is to set the mtu on the dpdk devices:

root@node-licetus:~# ovs-vsctl set Interface dpdk-nic1 mtu_request=9000
root@node-licetus:~# ovs-vsctl set Interface dpdk-nic2 mtu_request=9000
root@node-licetus:~# 

Tap in Network Namespace Attached to a Bridge

Create a tap device called tap1 in the bridge:

root@node-licetus:~# ovs-vsctl add-port br-test tap1 -- set Interface tap1 type=internal
root@node-licetus:~# 

Create a network namespace called ns1 and place tap1 into it:

root@node-licetus:~# ip netns add ns1
root@node-licetus:~# ip link set tap1 netns ns1


Bring up tap1 and assign it an IP address:

root@node-licetus:~# ip netns exec ns1 ip link set dev tap1 up
root@node-licetus:~# ip netns exec ns1 ip link set dev lo up
root@node-licetus:~# ip netns exec ns1 ip addr add 172.20.0.1/24 dev tap1
root@node-licetus:~# ip netns exec ns1 ip link set dev tap1 mtu 9000

Finally, the network the tap needs to be on is vlan 2933, which is delivered to the network cards as part of a vlan trunk. To assign tap1 to the vlan with id 2933 the tap port needs to be tagged:

root@node-licetus:~# ovs-vsctl set port tap1 tag=2933
root@node-licetus:~#
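
For repeat runs it can help to have the namespace steps in one place. The sketch below only builds the command lines used in this section, in the same order, and returns them as argument lists; the function name and its parameters are made up for illustration, and nothing is executed.

```python
def namespace_port_commands(bridge, port, namespace, cidr, vlan, mtu):
    """Return the commands from this section as argument lists, in order."""
    in_ns = ["ip", "netns", "exec", namespace]
    return [
        # Create the internal port on the bridge.
        ["ovs-vsctl", "add-port", bridge, port,
         "--", "set", "Interface", port, "type=internal"],
        # Create the namespace and move the port into it.
        ["ip", "netns", "add", namespace],
        ["ip", "link", "set", port, "netns", namespace],
        # Bring the devices up, then set the address and MTU.
        in_ns + ["ip", "link", "set", "dev", port, "up"],
        in_ns + ["ip", "link", "set", "dev", "lo", "up"],
        in_ns + ["ip", "addr", "add", cidr, "dev", port],
        in_ns + ["ip", "link", "set", "dev", port, "mtu", str(mtu)],
        # Tag the ovs port with the vlan id.
        ["ovs-vsctl", "set", "port", port, "tag={}".format(vlan)],
    ]


for argv in namespace_port_commands("br-test", "tap1", "ns1",
                                    "172.20.0.1/24", 2933, 9000):
    print(" ".join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` as root, but printing them first makes it easy to compare against the transcript above.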

Testing

That really is it. Below is the resulting bridge and tap interface:

root@node-licetus:~# ovs-vsctl show
181b55d1-999a-464b-adf4-d80ca1790988
    Bridge br-test
        Port "tap1"
            tag: 2933
            Interface "tap1"
                type: internal
        Port br-test
            Interface br-test
                type: internal
        Port "dpdk-bond0"
            Interface "dpdk-nic1"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.0"}
            Interface "dpdk-nic2"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.1"}
    ovs_version: "2.11.0"

root@node-licetus:~# ip netns exec ns1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
12: tap1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 0a:96:ad:cd:2e:7f brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.1/24 scope global tap1
       valid_lft forever preferred_lft forever
    inet6 fe80::896:adff:fecd:2e7f/64 scope link 
       valid_lft forever preferred_lft forever

From another machine that has access to vlan 2933 I can ping the tap device:

ubuntu@node-husband:~$ ping -c3 172.20.0.1
PING 172.20.0.1 (172.20.0.1) 56(84) bytes of data.
64 bytes from 172.20.0.1: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 172.20.0.1: icmp_seq=2 ttl=64 time=0.146 ms
64 bytes from 172.20.0.1: icmp_seq=3 ttl=64 time=0.144 ms

--- 172.20.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2046ms
rtt min/avg/max/mdev = 0.144/0.168/0.215/0.034 ms

Final Thoughts

If you want to use dpdk with OpenStack then the OpenStack charms make it easy. The charms look after all of the above and much more.

If you want a more complete guide to dpdk, try "The new simplicity to consume dpdk".

Tuesday, 16 April 2019

Quick OpenStack Charm Test with Zaza

I was investigating a bug recently that appeared to be a race condition causing RabbitMQ server to occasionally fail to cluster when the charm was upgraded. To attempt to reproduce the bug I decided to use the Python library zaza.

Firstly I created a tests & bundles directory:

$ mkdir -p tests/bundles/

Then added a bundle in tests/bundles/ha.yaml :

series: xenial
applications:
  rabbitmq-server:
    charm: cs:rabbitmq-server
    constraints: mem=1G
    num_units: 3


Once the deployment was complete I needed to test upgrading the charm, so the next step was to define a test in tests/tests_rabbit_upgrade.py to do this:



#!/usr/bin/env python3

import unittest
import zaza.model

class UpgradeTest(unittest.TestCase):

    def test_upgrade(self):
        zaza.model.upgrade_charm(
            'rabbitmq-server',
            switch='cs:~openstack-charmers-next/rabbitmq-server-343')
        zaza.model.block_until_all_units_idle()

The last step is to tell Zaza what to do in a test run and this is done in the tests/tests.yaml.

tests:
  - tests.tests_rabbit_upgrade.UpgradeTest
configure:
  - zaza.charm_tests.noop.setup.basic_setup
smoke_bundles:
  - ha


This tells Zaza which bundle to run for smoke tests and which test(s) to run. There is no configuration step needed but at the moment zaza expects one so it is pointed at a no-op method. Obviously you need to have zaza installed which can be done in a virtualenv:

$ virtualenv -q -ppython3 venv3                                                                                                   
Already using interpreter /usr/bin/python3                                                                                                                      
$ source venv3/bin/activate                                                                                                      
(venv3) $ pip install git+https://github.com/openstack-charmers/zaza.git#egg=zaza

Zaza will generate a new model for each run and does not bring the new model into focus, so it is safe to kick off the test run as many times as you like in parallel.

(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[1] 4768
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[2] 4790
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[3] 4795
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[4] 4811
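
The same fan-out can be driven from Python. This is a minimal sketch, not something zaza provides: the function names are made up, and it assumes functest-run-suite is on the PATH (i.e. the virtualenv is active). It mirrors the shell above by giving each run its own log file named with a version 1 UUID.

```python
import subprocess
import uuid


def plan_runs(count, log_dir="/tmp"):
    """Return (argv, log_path) pairs, one per parallel test run."""
    argv = ["functest-run-suite", "--smoke", "--keep-model"]
    return [(list(argv), "{}/run.{}".format(log_dir, uuid.uuid1()))
            for _ in range(count)]


def launch(plan):
    """Start every planned run in the background and return the processes."""
    procs = []
    for argv, log_path in plan:
        # The child keeps its own copy of the file descriptor, so the
        # parent can close its handle as soon as the process has started.
        with open(log_path, "wb") as log:
            procs.append(subprocess.Popen(argv, stdout=log,
                                          stderr=subprocess.STDOUT))
    return procs
```

Calling `launch(plan_runs(4))` and then `wait()` on each returned process is roughly equivalent to the four backgrounded shell commands.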


All four deployments and tests ran in parallel. Below is an example output:

$ cat /tmp/run.ed5fe80c-6045-11e9-83b5-77a3728f6280 
Controller: gnuoy-serverstack

Model              Cloud/Region             Type       Status      Machines  Cores  Access  Last connection
controller         serverstack/serverstack  openstack  available          1      4  admin   just now
default            serverstack/serverstack  openstack  available          0      -  admin   2018-11-13
zaza-310115a7b9f9  serverstack/serverstack  openstack  available          0      -  admin   just now
zaza-98856c9f8ec5  serverstack/serverstack  openstack  available          0      -  admin   just now
zaza-9c013d3f9ec9  serverstack/serverstack  openstack  available          0      -  admin   never connected

INFO:root:Deploying bundle './tests/bundles/ha.yaml' on to 'zaza-310115a7b9f9' model
Resolving charm: cs:rabbitmq-server
Executing changes:
- upload charm cs:rabbitmq-server-85 for series xenial
- deploy application rabbitmq-server on xenial using cs:rabbitmq-server-85
- add unit rabbitmq-server/0 to new machine 0
- add unit rabbitmq-server/1 to new machine 1
- add unit rabbitmq-server/2 to new machine 2
Deploy of bundle completed.
INFO:root:Waiting for environment to settle
INFO:root:Waiting for a unit to appear
INFO:root:Waiting for all units to be idle
INFO:root:Checking workload status of rabbitmq-server/0
INFO:root:Checking workload status message of rabbitmq-server/0
INFO:root:Checking workload status of rabbitmq-server/1
INFO:root:Checking workload status message of rabbitmq-server/1
INFO:root:Checking workload status of rabbitmq-server/2
INFO:root:Checking workload status message of rabbitmq-server/2
INFO:root:OK
INFO:root:## Running Test tests.tests_rabbit_upgrade.UpgradeTest ##
test_upgrade (tests.tests_rabbit_upgrade.UpgradeTest) ... ok

----------------------------------------------------------------------
Ran 1 test in 87.692s

OK

Finally, because the --keep-model switch was used all four models are still available for any post test inspection:

$ juju list-models
Controller: gnuoy-serverstack

Model              Cloud/Region             Type       Status     Machines  Cores  Access  Last connection
controller         serverstack/serverstack  openstack  available         1      4  admin   just now
default*           serverstack/serverstack  openstack  available         0      -  admin   2018-11-13
zaza-310115a7b9f9  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago
zaza-98856c9f8ec5  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago
zaza-9c013d3f9ec9  serverstack/serverstack  openstack  available         3      6  admin   5 minutes ago
zaza-a1b9f911850b  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago


To read more about zaza head over to https://zaza.readthedocs.io/en/latest/

Saturday, 13 April 2019

OpenStack Automated Instance Recovery

Who needs pets? It's all about the cattle in the cloud world, right? Unfortunately, reality has a habit of stepping on even the best laid plans. There is often one old legacy app (or more) which proves somewhat resistant to being turned into a self-healing, auto-load balancing cloud entity. In this case it does seem reasonable that if the app becomes unavailable the infrastructure should make some attempt to bring it back again. In the OpenStack world this is where projects like Masakari come in.

Masakari, when used in conjunction with masakari-monitors and Pacemaker, provides a service which detects the failure of a guest, or an entire compute node, and attempts to bring it back on line. The Masakari service provides an api and executor which reacts to notifications of a failure. Masakari-monitors, as the name suggests, detects failures and tells masakari about them. One mechanism it uses to detect failure is to monitor pacemaker, and when pacemaker reports a fellow compute node is down, masakari-monitors reports that to the Masakari service.

Masakari also provides two other recovery mechanisms. It provides a mechanism for detecting the failure of an individual guest and restarting that guest in situ. Finally, it also provides a mechanism for restarting an operating system process if it stops; this mechanism is not covered here and the charm is configured to disable it.

Below is a diagram from the Masakari Wiki showing the plumbing:


Introducing Pacemaker Remote

As mentioned above, Masakari monitors detect a compute node failure by querying the locally running cluster software. A hacluster charm already exists which will deploy corosync and pacemaker, and initial tests showed that this worked in a small environment. Unfortunately, the corosync/pacemaker stack is not designed to scale much past 16 nodes (see the pacemaker documentation), and a limit of 16 compute nodes is not acceptable in most clouds.

This is where pacemaker remote comes in. A pacemaker remote node can run resources but does not run the full cluster stack. In this case corosync is not installed on the remote nodes, so they do not participate in the constant chatter needed to keep the cluster XML definitions up to date and consistent across the nodes.

A new Pacemaker Remote charm was developed which relates to an existing cluster and can run resources from that cluster. In this example the pacemaker remotes do not actually run any resources; they are just used as a mechanism for querying the state of all the nodes in the cluster.

Unfortunately, due to Bug #1728527, masakari-monitors fails to query pacemaker-remote properly. A patch has been proposed upstream. The patch is currently applied to the masakari-monitors package in stein in the Ubuntu Cloud Archive.


Charm Architecture

The masakari api service is deployed using a new masakari charm. This charm behaves in the same way as the other OpenStack API service charms. It expects relations with MySQL/Percona, RabbitMQ and Keystone. It can also be related to Vault to secure the endpoints with https. The only real difference to the other API charms is that it must be deployed in an ha configuration which means multiple units with an hacluster relation. The pacemaker-remote and masakari-monitors charms are both subordinates that need to be running on the compute nodes:


It might be expected that the masakari-monitors would have a direct relation to the masakari charm. In fact it does a lookup for the service in the catalogue so it only needs credentials which it obtains by having an identity-credentials relation with keystone.

STONITH

For a guest to be failed over to another compute host it must be using some form of shared storage. As a result, the scenario where a compute node has lost network access to its peers but continues to have access to shared storage must be considered. Masakari monitors on a peer compute node registers that the compute node has gone and notifies the masakari api server. This in turn triggers the masakari engine to instigate a failover of the guests on that node via nova. Assuming that nova concurs that the compute node has gone, it will attempt to bring each guest up on another node. At this point there are two copies of the guest both trying to write to the same shared storage, which will likely lead to data corruption.

The solution is to enable stonith in pacemaker. When the cluster detects a node has disappeared, it runs a stonith plugin to power the compute node off. To enable stonith the hacluster charm now ships with a maas plugin that the stonith resource can use. The maas details are provided by the existing hacluster maas_url and maas_credentials config options. This allows the stonith resource to power off a node via the maas api which has the added advantage of abstracting the charm away from the individual power management system being used in this particular deployment.


Adding Masakari to a deployment

Below is an example of using an overlay to add masakari to an existing deployment. Obviously there is lots of configuration in the yaml below which is specific to each deployment and will need updating:

machines:
  '0':
    series: bionic
  '1':
    series: bionic
  '2':
    series: bionic
  '3':
    series: bionic
relations:
- - nova-compute:juju-info
  - masakari-monitors:container
- - masakari:ha
  - hacluster:ha
- - keystone:identity-credentials
  - masakari-monitors:identity-credentials
- - nova-compute:juju-info
  - pacemaker-remote:juju-info
- - hacluster:pacemaker-remote
  - pacemaker-remote:pacemaker-remote
- - masakari:identity-service
  - keystone:identity-service
- - masakari:shared-db
  - mysql:shared-db
- - masakari:amqp
  - rabbitmq-server:amqp
series: bionic
applications:
  masakari-monitors:
    charm: cs:~gnuoy/masakari-monitors-9
    series: bionic
  hacluster:
    charm: cs:~gnuoy/hacluster-38
    options:
      corosync_transport: unicast
      cluster_count: 3
      maas_url: http://10.0.0.205/MAAS
      maas_credentials: 3UC4:zSfGfk:Kaka2VUAD37zGF
  pacemaker-remote:
    charm: cs:~gnuoy/pacemaker-remote-10
    options:
      enable-stonith: True
      enable-resources: False
  masakari:
    charm: cs:~gnuoy/masakari-4
    series: bionic
    num_units: 3
    options:
      openstack-origin: cloud:bionic-rocky/proposed
      vip: '10.0.0.236 10.70.0.236 10.80.0.236'
    bindings:
      public: public
      admin: admin
      internal: internal
      shared-db: internal
      amqp: internal
    to:
    - 'lxd:1'
    - 'lxd:2'
    - 'lxd:3'

To add it to the existing model:

$ juju deploy base.yaml --overlay masakari-overlay.yaml --map-machines=0=0,1=1,2=2,3=3

Hacluster Resources

Each Pacemaker remote node has a corresponding resource which runs in the main cluster. The status of the Pacemaker nodes can be seen via crm status:

$ sudo crm status
Stack: corosync
Current DC: juju-f0373f-1-lxd-2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Fri Apr 12 12:57:27 2019
Last change: Fri Apr 12 10:03:49 2019 by root via cibadmin on juju-f0373f-1-lxd-2

6 nodes configured
10 resources configured

Online: [ juju-f0373f-1-lxd-2 juju-f0373f-2-lxd-1 juju-f0373f-3-lxd-3 ]
RemoteOnline: [ frank-colt model-crow tidy-goose ]

Full list of resources:

 Resource Group: grp_masakari_vips
     res_masakari_29a0f9d_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
     res_masakari_321f78b_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
     res_masakari_578b519_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
 Clone Set: cl_res_masakari_haproxy [res_masakari_haproxy]
     Started: [ juju-f0373f-1-lxd-2 juju-f0373f-2-lxd-1 juju-f0373f-3-lxd-3 ]
 tidy-goose     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 frank-colt     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 model-crow     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 st-maas-5937691        (stonith:external/maas):        Started juju-f0373f-1-lxd-2


The output above shows that the three pacemaker-remote nodes (frank-colt, model-crow & tidy-goose) are online. It also shows each remote node's corresponding ocf::pacemaker:remote resource and where that resource is running.

$ sudo crm configure show tidy-goose
primitive tidy-goose ocf:pacemaker:remote \
        params server=10.0.0.161 reconnect_interval=60 \
        op monitor interval=30s

By default the cluster set up by the hacluster charm has symmetric-cluster set to true. This means that when a cluster resource is defined it is eligible to run on any node in the cluster. If a node should not run a given resource then the node has to opt out. In the masakari deployment the pacemaker-remote nodes are joined to the masakari api cluster. This would mean that the VIP used for accessing the api service and the haproxy clone set would be eligible to run on the nova-compute nodes, which would break. Given that there may be a large number of pacemaker-remote nodes (one per compute node) and there are likely to be exactly three masakari api units, it makes sense to switch the cluster to being an opt-in cluster. To achieve this symmetric-cluster is set to false and location rules are created allowing the VIP and haproxy set to run on the api units. Below is an example of these location rules:


$ sudo crm configure show
...
location loc-grp_masakari_vips-juju-f0373f-1-lxd-2 grp_masakari_vips 0: juju-f0373f-1-lxd-2
location loc-grp_masakari_vips-juju-f0373f-2-lxd-1 grp_masakari_vips 0: juju-f0373f-2-lxd-1
location loc-grp_masakari_vips-juju-f0373f-3-lxd-3 grp_masakari_vips 0: juju-f0373f-3-lxd-3


Each rule permits grp_masakari_vips to run on that node. The score of 0 in each rule means that all nodes are equal: there is no preference for the resource to run on one node rather than another.
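
To make the pattern concrete, the sketch below generates rules of the same shape for an arbitrary resource and list of nodes. The function is made up for illustration, and the `loc-<resource>-<node>` naming is simply inferred from the output above rather than taken from the charm code.

```python
def location_rules(resource, nodes, score=0):
    """Return one crm location rule per node allowing resource to run there.

    With symmetric-cluster=false a resource may only run on nodes that have
    such a rule; an equal score for every node expresses no preference.
    """
    return [
        "location loc-{res}-{node} {res} {score}: {node}".format(
            res=resource, node=node, score=score)
        for node in nodes
    ]


api_units = ["juju-f0373f-1-lxd-2", "juju-f0373f-2-lxd-1", "juju-f0373f-3-lxd-3"]
for rule in location_rules("grp_masakari_vips", api_units):
    print(rule)
```

Only the api units are passed in, so the pacemaker-remote nodes never receive a rule and stay ineligible for the VIPs.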

The clones require one additional trick because they attempt to run everywhere regardless of the symmetric-cluster setting. To limit them to the api nodes, location rules are needed as before. In addition, a setting called clone-max is applied to the clone resource, limiting the number of places it should run.

clone cl_res_masakari_haproxy res_masakari_haproxy \
        meta clone-max=3


Finally the stonith configuration is also managed as a resource, as can be seen from the crm status at the start of the section where st-maas-5937691 is up and running.

$ sudo crm configure show st-maas-5937691
primitive st-maas-5937691 stonith:external/maas \
        params url="http://10.0.0.205/MAAS" apikey="3UC4:fGfk:37zGF" \
        hostnames="frank-colt model-crow tidy-goose" \
        op monitor interval=25 start-delay=25 timeout=25


As can be seen above, there is one stonith resource for all three nodes and the resource contains all the information needed to interact with the maas api.

Configuring Masakari

In Masakari the compute nodes are grouped into failover segments. In the event of a failure, guests are moved onto other nodes within the same segment. Which compute node is chosen to house the evacuated guests is determined by the recovery method of that segment. 

'AUTO' Recovery Method

With auto recovery the guests are relocated to any of the available nodes in the same segment. The problem with this approach is that there is no guarantee that resources will be available to accommodate guests from a failed compute node.

To configure a group of compute hosts for auto recovery, first create a segment with the recovery method set to auto:

$ openstack segment create segment1 auto COMPUTE
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| created_at      | 2019-04-12T13:59:50.000000           |
| updated_at      | None                                 |
| uuid            | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| name            | segment1                             |
| description     | None                                 |
| id              | 1                                    |
| service_type    | COMPUTE                              |
| recovery_method | auto                                 |
+-----------------+--------------------------------------+

Next the hypervisors need to be added to the segment. These should be referenced by their unqualified hostname:

$ openstack segment host create tidy-goose COMPUTE SSH 691b8ef3-7481-48b2-afb6-908a98c8a768             
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2019-04-12T14:18:24.000000           |
| updated_at          | None                                 |
| uuid                | 11b85c9d-2b97-4b83-b773-0e9565e407b5 |
| name                | tidy-goose                           |
| type                | COMPUTE                              |
| control_attributes  | SSH                                  |
| reserved            | False                                |
| on_maintenance      | False                                |
| failover_segment_id | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
+---------------------+--------------------------------------+

Repeat the above for all remaining hypervisors:

$ openstack segment host list 691b8ef3-7481-48b2-afb6-908a98c8a768
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 75afadbb-67cc-47b2-914e-e3bf848028e4 | frank-colt | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| 11b85c9d-2b97-4b83-b773-0e9565e407b5 | tidy-goose | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| f1e9b0b4-3ac9-4f07-9f83-5af2f9151109 | model-crow | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+

'RESERVED_HOST' Recovery Method

With reserved_host recovery, some compute hosts are allocated as reserved, which allows an operator to guarantee there is sufficient capacity available for any guests in need of evacuation.

Firstly create a segment with the reserved_host recovery method:

$ openstack segment create segment1 reserved_host COMPUTE -c uuid -f value
2598f8aa-3612-4731-9716-e126ca6cc280

Add a host using the --reserved switch to indicate that it will act as a standby:

$ openstack segment host create model-crow --reserved True COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280

Add the remaining hypervisors as before:

$ openstack segment host create frank-colt COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280
$ openstack segment host create tidy-goose COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280

Listing the segment hosts shows that model-crow is a reserved host:

$ openstack segment host list 2598f8aa-3612-4731-9716-e126ca6cc280
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 4769e08c-ed52-440a-866e-832b977aa5e2 | tidy-goose | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| 90aedbd2-e03b-4dbd-b330-a1c848f300df | frank-colt | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| c77574cc-b6e7-440e-9c86-84e91981f15e | model-crow | COMPUTE | SSH                | True     | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+

Finally disable the reserved host in nova so that it remains available for failover:

$ openstack compute service set --disable model-crow nova-compute
$ openstack compute service list
+----+----------------+---------------------+----------+----------+-------+----------------------------+
| ID | Binary         | Host                | Zone     | Status   | State | Updated At                 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+
|  1 | nova-scheduler | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T10:59:10.000000 |
|  5 | nova-conductor | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T10:59:08.000000 |
|  7 | nova-compute   | tidy-goose          | nova     | enabled  | up    | 2019-04-13T10:59:11.000000 |
|  8 | nova-compute   | frank-colt          | nova     | enabled  | up    | 2019-04-13T10:59:05.000000 |
|  9 | nova-compute   | model-crow          | nova     | disabled | up    | 2019-04-13T10:59:12.000000 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+

When a compute node failure is detected, masakari will disable the failed node and enable the reserve node in nova. After simulating a failure of frank-colt the service list now looks like this:

$ openstack compute service list
+----+----------------+---------------------+----------+----------+-------+----------------------------+
| ID | Binary         | Host                | Zone     | Status   | State | Updated At                 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+
|  1 | nova-scheduler | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T11:05:20.000000 |
|  5 | nova-conductor | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T11:05:28.000000 |
|  7 | nova-compute   | tidy-goose          | nova     | enabled  | up    | 2019-04-13T11:05:21.000000 |
|  8 | nova-compute   | frank-colt          | nova     | disabled | down  | 2019-04-13T11:03:56.000000 |
|  9 | nova-compute   | model-crow          | nova     | enabled  | up    | 2019-04-13T11:05:22.000000 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+

Since the reserved host has now been enabled and is hosting evacuated guests, masakari has removed the reserved flag from it. Masakari has also placed the failed node in maintenance mode.

$ openstack segment host list 2598f8aa-3612-4731-9716-e126ca6cc280
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 4769e08c-ed52-440a-866e-832b977aa5e2 | tidy-goose | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| 90aedbd2-e03b-4dbd-b330-a1c848f300df | frank-colt | COMPUTE | SSH                | False    | True           | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| c77574cc-b6e7-440e-9c86-84e91981f15e | model-crow | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
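
To make the bookkeeping explicit, here is a small Python model of the state changes visible in the listings above. It is only a sketch of the observed behaviour, not masakari code: the Host class and the fail_over function are names I have made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical record mirroring the columns shown in the listings."""
    name: str
    nova_enabled: bool = True
    reserved: bool = False
    on_maintenance: bool = False


def fail_over(hosts, failed_name):
    """Apply the observed state changes and return the standby's name."""
    failed = hosts[failed_name]
    failed.nova_enabled = False      # nova-compute is disabled on the failed node
    failed.on_maintenance = True     # the failed node goes into maintenance mode

    standby = next((h for h in hosts.values() if h.reserved), None)
    if standby is None:
        return None                  # no reserved host left to take the guests
    standby.nova_enabled = True      # the reserved host is enabled in nova
    standby.reserved = False         # and loses its reserved flag
    return standby.name


hosts = {
    "tidy-goose": Host("tidy-goose"),
    "frank-colt": Host("frank-colt"),
    "model-crow": Host("model-crow", nova_enabled=False, reserved=True),
}
target = fail_over(hosts, "frank-colt")
print(target, hosts["frank-colt"], hosts["model-crow"])
```

One consequence the model makes obvious: once the reserved host has been used there is no standby left, so a new one needs to be flagged before the next failure.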

'AUTO_PRIORITY' and 'RH_PRIORITY' Recovery Methods

These methods appear to chain the previous methods together. So, auto_priority attempts to move the guest using the auto method first and if that fails it tries the reserved_host method. rh_priority does the same thing but in the reverse order. See Pike Release Note for details.
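
As I understand it, the chaining amounts to a simple fallback. The Python sketch below shows the idea; the chain helper, RecoveryError and the two stand-in strategy functions are hypothetical and are not taken from masakari's implementation.

```python
class RecoveryError(Exception):
    """Raised by a strategy that could not recover the guest."""


attempts = []  # records the order in which strategies were tried


def auto(guest):
    # Stand-in: pretend no enabled host has room for the guest.
    attempts.append("auto")
    raise RecoveryError("no capacity on any enabled host")


def reserved_host(guest):
    # Stand-in: pretend the reserved host takes the guest.
    attempts.append("reserved_host")
    return "model-crow"


def chain(*strategies):
    """Build a recovery method that tries each strategy in turn."""
    def run(guest):
        errors = []
        for strategy in strategies:
            try:
                return strategy(guest)
            except RecoveryError as exc:
                errors.append(str(exc))
        raise RecoveryError("; ".join(errors))
    return run


auto_priority = chain(auto, reserved_host)
rh_priority = chain(reserved_host, auto)

print(auto_priority("server_120419134342"), attempts)
```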

Individual Instance Recovery

Finally, to use the masakari feature which reacts to a single guest failing rather than a whole hypervisor, the guest(s) need to be marked with a small piece of metadata:

$ openstack server set --property HA_Enabled=True server_120419134342




Testing Instance Failure

The simplest scenario to test is single guest recovery. It is worth noting that the whole stack is quite good at detecting an intentional shutdown and will do nothing if it detects one. So to test masakari, processes need to be shut down in a disorderly fashion, in this case by sending a SIGKILL to the guest's qemu process:

root@model-crow:~# pgrep -f qemu-system-x86_64
733213
root@model-crow:~# pkill -f -9 qemu-system-x86_64; sleep 10; pgrep -f qemu-system-x86_64
733851

The guest was killed and then restarted. Check the masakari instance monitor log to see what happened:

2019-04-12 14:30:56.269 189712 INFO masakarimonitors.instancemonitor.libvirt_handler.callback [-] Libvirt Event: type=VM, hostname=model-crow, uuid=4ce60f57-e8af-4a9a-b3f5-852d428c6890, time=2019-04-12 14:30:56.269703, event_id=LIFECYCLE, detail=STOPPED_FAILED)
2019-04-12 14:30:56.270 189712 INFO masakarimonitors.ha.masakari [-] Send a notification. 
{'notification':
{'type': 'VM',
     'hostname': 'model-crow',
     'generated_time': datetime.datetime(2019, 4, 12, 14, 30, 56, 269703),
     'payload': {
         'event': 'LIFECYCLE',
         'instance_uuid': '4ce60f57-e8af-4a9a-b3f5-852d428c6890',
         'vir_domain_event': 'STOPPED_FAILED'}}}
2019-04-12 14:30:56.695 189712 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=VM, hostname=model-crow, generated_time=2019-04-12T14:30:56.269703, 
payload={
   'event': 'LIFECYCLE',
   'instance_uuid': '4ce60f57-e8af-4a9a-b3f5-852d428c6890',
    'vir_domain_event': 'STOPPED_FAILED'}, id=4,
    notification_uuid=8c29844b-a79c-45ad-a7c7-2931fa263dab,
    source_host_uuid=f1e9b0b4-3ac9-4f07-9f83-5af2f9151109,
    status=new, created_at=2019-04-12T14:30:56.655930,
    updated_at=None)



The log shows the instance monitor spotting the guest going down and informing the masakari api service. Note that the instance monitor does not start the guest itself; it is the masakari engine, which resides on the masakari api units, that performs the recovery:

# tail -11 /var/log/masakari/masakari-engine.log
2019-04-12 14:30:56.696 39396 INFO masakari.engine.manager [req-9b1465ed-5261-4f6b-b091-e1fb91a60583 33d5abf329d24757aa972c6a06a75e96 6f6b9d4f6c314fbba2753d7210052cd2 - - -] Processing notification 8c29844b-a79c-45ad-a7c7-2931fa263dab of type: VM
2019-04-12 14:30:56.723 39396 INFO masakari.compute.nova [req-c8d5cd4a-2f47-4677-af4e-9939067768dc masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:57.309 39396 INFO masakari.compute.nova [req-698f61d9-9765-4109-97f4-e56f5685a251 masakari - - - -] Call stop server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:57.780 39396 INFO masakari.compute.nova [req-1b9f58c1-de57-488f-904d-be244ce8d990 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:58.406 39396 INFO masakari.compute.nova [req-ddb32f52-d907-40c8-b790-bbe156f38878 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:59.127 39396 INFO masakari.compute.nova [req-b7b28eaa-4559-4fc4-9a89-e565694038d4 masakari - - - -] Call start server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:59.692 39396 INFO masakari.compute.nova [req-18fd6893-cbc9-4b15-b413-bf780b9583e7 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:00.698 39396 INFO masakari.compute.nova [req-1674a605-48af-4986-bc4b-1bc80111bf4d masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:01.700 39396 INFO masakari.compute.nova [req-86ba7d2d-5fd9-40e3-86e8-2cf506f02217 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:02.703 39396 INFO masakari.compute.nova [req-f4e19958-7c04-4c29-bf02-de98064729b2 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:03.262 39396 INFO masakari.engine.manager [req-f4e19958-7c04-4c29-bf02-de98064729b2 masakari - - - -] Notification 8c29844b-a79c-45ad-a7c7-2931fa263dab exits with status: finished.
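
The gap between the first and last of those log lines gives a rough recovery time. Below is a minimal Python sketch for pulling that out of the engine log, assuming the line format shown above; the function name and the regular expression are my own.

```python
import re
from datetime import datetime

# Matches the timestamp plus either the "Processing" or the "exits" message.
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+) .*?"
    r"(?:Processing notification (?P<start>[0-9a-f-]{36})"
    r"|Notification (?P<end>[0-9a-f-]{36}) exits with status)"
)


def recovery_times(lines):
    """Return {notification_uuid: seconds from 'Processing' to 'exits'}."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        if match.group("start"):
            started[match.group("start")] = stamp
        elif match.group("end") in started:
            uuid = match.group("end")
            durations[uuid] = (stamp - started[uuid]).total_seconds()
    return durations


sample = [
    "2019-04-12 14:30:56.696 39396 INFO masakari.engine.manager "
    "[req-9b1465ed-5261-4f6b-b091-e1fb91a60583 33d5abf329d24757aa972c6a06a75e96 "
    "6f6b9d4f6c314fbba2753d7210052cd2 - - -] Processing notification "
    "8c29844b-a79c-45ad-a7c7-2931fa263dab of type: VM",
    "2019-04-12 14:31:03.262 39396 INFO masakari.engine.manager "
    "[req-f4e19958-7c04-4c29-bf02-de98064729b2 masakari - - - -] Notification "
    "8c29844b-a79c-45ad-a7c7-2931fa263dab exits with status: finished.",
]
times = recovery_times(sample)
print(times)
```

For the run above that works out at roughly six and a half seconds from notification to the guest being back.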

Testing Compute Node Failure

In the event of a compute node failing, pacemaker should spot this. The masakari host monitor periodically checks the node status as reported by pacemaker and in the event of a failure a notification is sent to the masakari api. Pacemaker should run the stonith resource to power off the node and masakari should move guests that were running on the compute node on to another available node.

The first thing to do is to check what hypervisor the guest is running on:

$ openstack server show server_120419134342 -c OS-EXT-SRV-ATTR:host -f value
model-crow


On that hypervisor disable systemd recovery for both the pacemaker_remote and nova-compute services:

root@model-crow:~# sed -i -e 's/^Restart=.*/Restart=no/g' /lib/systemd/system/pacemaker_remote.service 
root@model-crow:~# sed -i -e 's/^Restart=.*/Restart=no/g' /lib/systemd/system/nova-compute.service
root@model-crow:~# systemctl daemon-reload

Now kill the processes:

root@model-crow:~# pkill -9 -f /usr/bin/nova-compute
root@model-crow:~# pkill -9 -f pacemaker_remoted


In the case of a compute node failure it is the compute node's peers that should detect the failure and post the notification to masakari:

2019-04-12 15:21:29.464 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] Works on pacemaker-remote.
2019-04-12 15:21:29.643 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'model-crow' is 'offline'.
2019-04-12 15:21:29.644 886030 INFO masakarimonitors.ha.masakari [-] Send a notification. {'notification': {'type': 'COMPUTE_HOST', 'hostname': 'model-crow', 'generated_time': datetime.datetime(2019, 4, 12, 15, 21, 29, 644323), 'payload': {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}}}
2019-04-12 15:21:30.746 886030 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=COMPUTE_HOST, hostname=model-crow, generated_time=2019-04-12T15:21:29.644323, payload={'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}, id=5, notification_uuid=2f62f9dd-7edf-406c-aa16-0a4e8e3a3726, source_host_uuid=f1e9b0b4-3ac9-4f07-9f83-5af2f9151109, status=new, created_at=2019-04-12T15:21:30.706730, updated_at=None)
2019-04-12 15:21:30.747 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'tidy-goose' is 'online'.


As before, check the masakari engine has responded:

$ tail -n12  /var/log/masakari/masakari-engine.log
2019-04-12 14:30:22.004 31553 INFO masakari.compute.nova [req-ec98fbd0-717b-41db-883b-3d46eea82a0e masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:22.556 31553 INFO masakari.engine.manager [req-ec98fbd0-717b-41db-883b-3d46eea82a0e masakari - - - -] Notification cee025d2-9736-4238-91e2-378661373f9d exits with status: finished.
2019-04-12 15:21:30.744 31553 INFO masakari.engine.manager [req-7713b994-c112-4e7f-905d-ef2cc497ca9a 33d5abf329d24757aa972c6a06a75e96 6f6b9d4f6c314fbba2753d7210052cd2 - - -] Processing notification 2f62f9dd-7edf-406c-aa16-0a4e8e3a3726 of type: COMPUTE_HOST
2019-04-12 15:21:30.827 31553 INFO masakari.compute.nova [req-8abbebe8-a6fb-48f1-90fc-113ad68d1c38 masakari - - - -] Disable nova-compute on model-crow
2019-04-12 15:21:31.252 31553 INFO masakari.engine.drivers.taskflow.host_failure [req-8abbebe8-a6fb-48f1-90fc-113ad68d1c38 masakari - - - -] Sleeping 30 sec before starting recovery thread until nova recognizes the node down.
2019-04-12 15:22:01.257 31553 INFO masakari.compute.nova [req-658b10fc-cd2c-4b28-8f1c-21e9da5ca14e masakari - - - -] Fetch Server list on model-crow
2019-04-12 15:22:02.309 31553 INFO masakari.compute.nova [req-9c8c65c4-0a69-4cc5-90db-5aef4f35f22a masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:02.963 31553 INFO masakari.compute.nova [req-71a4c4a6-2950-416a-916c-b075eff51b4c masakari - - - -] Call lock server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:03.872 31553 INFO masakari.compute.nova [req-71a4c4a6-2950-416a-916c-b075eff51b4c masakari - - - -] Call evacuate command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890 on host None
2019-04-12 15:22:04.473 31553 INFO masakari.compute.nova [req-8363a1bf-0214-4db4-b45a-c720c4113830 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:05.221 31553 INFO masakari.compute.nova [req-f7410918-02e7-4bff-afdd-81a1c24c6ee6 masakari - - - -] Call unlock server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:05.783 31553 INFO masakari.engine.manager [req-f7410918-02e7-4bff-afdd-81a1c24c6ee6 masakari - - - -] Notification 2f62f9dd-7edf-406c-aa16-0a4e8e3a3726 exits with status: finished.


Now check that the guest has moved:

$ openstack server show server_120419134342 -c OS-EXT-SRV-ATTR:host -f value
frank-colt

And finally that the stonith resource ran:

root@juju-f0373f-2-lxd-1:~# tail -f  /var/log/syslog
Apr 12 15:21:39 juju-f0373f-1-lxd-2 external/maas[201018]: info: Performing power reset on model-crow
Apr 12 15:21:42 juju-f0373f-1-lxd-2 external/maas[201018]: info: model-crow is in power state unknown
Apr 12 15:21:42 juju-f0373f-1-lxd-2 external/maas[201018]: info: Powering off model-crow

Wednesday, 29 November 2017

Enabling QoS with OpenStack Charms

The next charm release will include the option to enable the QoS plugin for neutron. QoS can be enabled via the 'enable-qos' setting in the neutron-api charm.

Enabling QoS

The charm encapsulates all the work of setting up QoS so all that is needed to enable it is:

$ juju config neutron-api enable-qos=True

Testing QoS On a Guest

In the test below the traffic speed between two OpenStack guests is measured with iperf, with and without a QoS bandwidth policy applied to the guest 'qos-test-1'. Below are the two guests used in the test.

$ nova list
+--------------------------------------+------------+--------+------------+-------------+------------------------------------+
| ID                                   | Name       | Status | Task State | Power State | Networks                           |
+--------------------------------------+------------+--------+------------+-------------+------------------------------------+
| 0b8fcb80-8210-4034-918a-a253f2f810bf | qos-test-1 | ACTIVE | -          | Running     | private=192.168.21.8, 10.5.150.8   |
| ee7d2a15-fb20-448e-81c2-cedd52bc45c3 | qos-test-2 | ACTIVE | -          | Running     | private=192.168.21.12, 10.5.150.12 |
+--------------------------------------+------------+--------+------------+-------------+------------------------------------+

The 192.168 addresses are on the internal project network and the 10.5 addresses are floating IPs which have been assigned to the guests.

The first step is to create a QoS policy that rules can be added to:

$ neutron qos-policy-create bw-limiter
Created a new policy:
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| created_at      | 2017-09-19T16:27:31Z                 |
| description     |                                      |
| id              | 5440ddb6-b93c-44d6-b0a8-b3b36e552db3 |
| name            | bw-limiter                           |
| project_id      | 7a6b7bcc7284483f88dc44f765cad5df     |
| revision_number | 1                                    |
| rules           |                                      |
| shared          | False                                |
| tenant_id       | 7a6b7bcc7284483f88dc44f765cad5df     |
| updated_at      | 2017-09-19T16:27:31Z                 |
+-----------------+--------------------------------------+

Now add a rule to the policy:

$ neutron qos-bandwidth-limit-rule-create bw-limiter --max-kbps 3000 --max-burst-kbps 300
Created a new bandwidth_limit_rule:
+----------------+--------------------------------------+
| Field          | Value                                |
+----------------+--------------------------------------+
| id             | 9d0bd218-d04e-4444-83e5-35f6e9ec8b44 |
| max_burst_kbps | 300                                  |
| max_kbps       | 3000                                 |
+----------------+--------------------------------------+

qos-test-1's IP address on the internal network is 192.168.21.8 and the corresponding port will have the QoS rule applied to it. List the ports and find the one that corresponds to the guest's IP address:


$ neutron port-list
+--------------------------------------+------+-------------------+--------------------------------------------------------------------------------------+
| id                                   | name | mac_address       | fixed_ips                                                                            |
+--------------------------------------+------+-------------------+--------------------------------------------------------------------------------------+
| 735b2928-578d-491e-9fff-3e45bd82237e |      | fa:16:3e:35:01:2d | {"subnet_id": "e76ed2d3-2f2e-4ca5-9c5c-503cd17c0e91", "ip_address": "192.168.21.8"}  |
| bb97d685-a891-4492-93b7-ea7b6caa367e |      | fa:16:3e:03:0a:4c | {"subnet_id": "e76ed2d3-2f2e-4ca5-9c5c-503cd17c0e91", "ip_address": "192.168.21.12"} |
| bf8aac5f-959f-4382-9b4d-10ce8aded036 |      | fa:16:3e:55:60:28 | {"subnet_id": "e76ed2d3-2f2e-4ca5-9c5c-503cd17c0e91", "ip_address": "192.168.21.2"}  |
| c07f336f-72ee-4ed3-849a-7181bd2a0492 |      | fa:16:3e:cc:a5:b1 | {"subnet_id": "3dc53883-8ca9-43f6-bb19-ea4742bd9357", "ip_address": "10.5.150.4"}    |
| c6c04b68-a1e3-48f1-a2af-7cfabf989c97 |      | fa:16:3e:7d:4d:d9 | {"subnet_id": "e76ed2d3-2f2e-4ca5-9c5c-503cd17c0e91", "ip_address": "192.168.21.1"}  |
| d75d8ee3-71c1-41f9-90ed-f379f0256f1d |      | fa:16:3e:ff:2b:2d | {"subnet_id": "3dc53883-8ca9-43f6-bb19-ea4742bd9357", "ip_address": "10.5.150.12"}   |
| dd33eb79-aaf7-4135-aa4b-28abfbb9fb50 |      | fa:16:3e:f7:6a:4d | {"subnet_id": "3dc53883-8ca9-43f6-bb19-ea4742bd9357", "ip_address": "10.5.150.8"}    |
| ea056cbc-be5b-4223-a7b3-e548aca8a2cb |      | fa:16:3e:6d:dc:94 | {"subnet_id": "3dc53883-8ca9-43f6-bb19-ea4742bd9357", "ip_address": "10.5.150.9"}    |
+--------------------------------------+------+-------------------+--------------------------------------------------------------------------------------+


Apply the policy to the port:

$ neutron port-update 735b2928-578d-491e-9fff-3e45bd82237e --qos-policy bw-limiter
Updated port: 735b2928-578d-491e-9fff-3e45bd82237e

On qos-test-2 start the iperf server:

ubuntu@qos-test-2:~$ iperf -s -p 8080
------------------------------------------------------------
Server listening on TCP port 8080
TCP window size: 85.3 KByte (default)
------------------------------------------------------------

On qos-test-1 (the guest with the QoS rule applied) run iperf, connecting to qos-test-2:

ubuntu@qos-test-1:~$ iperf -c 192.168.21.12 -p 8080
------------------------------------------------------------
Client connecting to 192.168.21.12, TCP port 8080
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.21.8 port 49704 connected with 192.168.21.12 port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.2 sec  2.12 MBytes  1.60 Mbits/sec

That test achieved a bandwidth of 1.6 Mbits/s. It is worth running the test a few times to validate the rate. Now rerun the test with the QoS policy removed. To remove the policy from the port:

$ neutron port-update 735b2928-578d-491e-9fff-3e45bd82237e --no-qos-policy
Updated port: 735b2928-578d-491e-9fff-3e45bd82237e

Rerunning the test:

ubuntu@qos-test-1:~$ iperf -c 192.168.21.12 -p 8080
------------------------------------------------------------
Client connecting to 192.168.21.12, TCP port 8080
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.21.8 port 49705 connected with 192.168.21.12 port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   403 MBytes   338 Mbits/sec

This time the test achieved a bandwidth of 338 Mbits/s.
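
The bandwidth column in the iperf output can be cross-checked from the transfer size and the interval. The Python sketch below assumes that iperf counts a MByte as 1024 * 1024 bytes and an Mbit as 1,000,000 bits, which is consistent with the figures in the two runs above; the function name is my own.

```python
def iperf_mbits_per_sec(transfer_mbytes, interval_sec):
    """Convert an iperf 'Transfer' figure and interval into Mbits/sec."""
    bits = transfer_mbytes * 1024 * 1024 * 8   # MBytes are binary (assumed)
    return bits / interval_sec / 1_000_000     # Mbits are decimal (assumed)


# Transfer and interval taken from the two client runs above.
limited = iperf_mbits_per_sec(2.12, 11.2)
unlimited = iperf_mbits_per_sec(403, 10.0)
print(f"with policy: {limited:.2f} Mbits/sec")
print(f"without policy: {unlimited:.0f} Mbits/sec")
```

The small difference from the 1.60 Mbits/sec that iperf printed for the first run is down to iperf working from unrounded byte counts and timings.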

Testing QoS on a Network

The QoS policy can also be applied to the router port to affect all guests on a given network. In the previous example the last test removed the QoS policy from qos-test-1's port but just in case double check it is gone:


$ neutron port-show 735b2928-578d-491e-9fff-3e45bd82237e | grep qos_policy_id
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
| qos_policy_id         |                                                                                     |

In the example below there is an iperf server running on an external host at 10.5.0.9. Collect iperf performance figures from both guests to the external host:


ubuntu@qos-test-1:~$ iperf -c 10.5.0.9 -p 8080                                                                                                                                            
------------------------------------------------------------
Client connecting to 10.5.0.9, TCP port 8080
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 192.168.21.8 port 34654 connected with 10.5.0.9 port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   321 MBytes   269 Mbits/sec
ubuntu@qos-test-1:~$ 

and the second guest:


ubuntu@qos-test-2:~$ iperf -c 10.5.0.9 -p 8080
------------------------------------------------------------
Client connecting to 10.5.0.9, TCP port 8080
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 192.168.21.12 port 59839 connected with 10.5.0.9 port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   287 MBytes   241 Mbits/sec
ubuntu@qos-test-2:~$

Now get the external IP of the project router, find the matching port and apply the QoS policy to it:


$ neutron router-list
+--------------------------------------+-----------------+------------------------------------------------------------------------+-------------+-------+
| id                                   | name            | external_gateway_info                                                  | distributed | ha    |
+--------------------------------------+-----------------+------------------------------------------------------------------------+-------------+-------+
| 51a15d70-5917-4dc2-9ab4-03c0f03e4fd7 | provider-router | {"network_id": "73c17d70-0298-4a0f-9158-f40bd73ee92e", "enable_snat":  | False       | False |
|                                      |                 | true, "external_fixed_ips": [{"subnet_id":                             |             |       |
|                                      |                 | "3dc53883-8ca9-43f6-bb19-ea4742bd9357", "ip_address": "10.5.150.4"}]}  |             |       |
+--------------------------------------+-----------------+------------------------------------------------------------------------+-------------+-------+

$ neutron port-list | grep 10.5.150.4
| c07f336f-72ee-4ed3-849a-7181bd2a0492 |      | fa:16:3e:cc:a5:b1 | {"subnet_id": "3dc53883-8ca9-43f6-bb19-ea4742bd9357", "ip_address": "10.5.150.4"}    |

$ neutron port-update c07f336f-72ee-4ed3-849a-7181bd2a0492 --qos-policy bw-limiter
Updated port: c07f336f-72ee-4ed3-849a-7181bd2a0492

Rerunning the benchmarks:


ubuntu@qos-test-1:~$ iperf -c 10.5.0.9 -p 8080
------------------------------------------------------------
Client connecting to 10.5.0.9, TCP port 8080
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 192.168.21.8 port 34655 connected with 10.5.0.9 port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.1 sec  2.38 MBytes  1.80 Mbits/sec


ubuntu@qos-test-2:~$ iperf -c 10.5.0.9 -p 8080
------------------------------------------------------------
Client connecting to 10.5.0.9, TCP port 8080
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 192.168.21.12 port 59840 connected with 10.5.0.9 port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.8 sec  2.25 MBytes  1.75 Mbits/sec

Under the Hood

Connect to the neutron-gateway and find the ovs port corresponding to the external router port. The tap device name starts with the first characters of the neutron port ID:

$ juju ssh neutron-gateway/0
$ sudo su -
# ovs-vsctl show | grep "tapc07f336f"
"tapc07f336f-72"
            Interface "tapc07f336f-72"

Looking at the interface settings on the port shows the QoS rules (ingress_policing_rate and ingress_policing_burst):

# ovs-vsctl list interface tapc07f336f-72
_uuid               : f2d0313d-a945-493f-a268-b70885a134c3
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : full
error               : []
external_ids        : {attached-mac="fa:16:3e:cc:a5:b1", iface-id="c07f336f-72ee-4ed3-849a-7181bd2a0492", iface-status=active}
ifindex             : 9
ingress_policing_burst: 300
ingress_policing_rate: 3000
lacp_current        : []
link_resets         : 1
link_speed          : 10000000000
link_state          : up
lldp                : {}
mac                 : []
mac_in_use          : "c2:fd:5c:d0:f1:4b"
mtu                 : 1500
mtu_request         : []
name                : "tapc07f336f-72"
ofport              : 3
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {collisions=0, rx_bytes=654378013, rx_crc_err=0, rx_dropped=0, rx_errors=0, rx_frame_err=0, rx_over_err=0, rx_packets=88605, tx_bytes=6777887, tx_dropped=0, tx_errors=0, tx_packets=98736}
status              : {driver_name=veth, driver_version="1.0", firmware_version=""}
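
To compare those interface settings with the neutron rule without reading the whole listing, the output can be parsed into a dictionary. This is a minimal Python sketch, assuming the "key : value" layout shown above; parse_ovs_record is a hypothetical helper and the sample text is cut down to a few of the fields from the listing.

```python
def parse_ovs_record(text):
    """Parse 'key : value' lines from 'ovs-vsctl list' output into a dict."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as MACs contain colons.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record


sample = """\
ingress_policing_burst: 300
ingress_policing_rate: 3000
mac_in_use          : "c2:fd:5c:d0:f1:4b"
mtu                 : 1500
"""

record = parse_ovs_record(sample)
# The values set earlier with qos-bandwidth-limit-rule-create.
rule = {"max_kbps": 3000, "max_burst_kbps": 300}
matches = (
    int(record["ingress_policing_rate"]) == rule["max_kbps"]
    and int(record["ingress_policing_burst"]) == rule["max_burst_kbps"]
)
print(matches)
```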



Wednesday, 23 September 2015

PC Power Control with a Raspberry Pi and Maas

I recently decided to set up a small cluster of computers at home to be managed by Juju and MAAS. The computers are in the attic, which meant that finger-based power management was going to quickly lose its appeal. Many of my friends and colleagues have enviable home computer setups, with power control being done elegantly by iLO/LOM/Intel AMT or some such. None of my old tin boasted anything as grand as that. I could have used wake-on-lan as both MAAS and my old machines support it, but it doesn't provide a reliable way to power machines off (they don't always have shell access) or to check their current power state (they may not be on the network). What I'd have loved to do was build a robot but that would be costly and I'd have to design a bespoke rig for each piece of hardware since my old kit is not in neat NUC form. In the end I decided to get a Raspberry Pi Model B to do the job. Unfortunately, I know next to nothing about electronics and do not own a soldering iron so my solution had to be solder free.

Research

I found this blog (minus tampering with the power feed between power supply and mother board) and this blog (but extended to control multiple machines) both very helpful and they inspired the final design.

The Basic Design

ATX motherboards expose pins that a computer case uses to wire in the reset button, power button and power LEDs. I removed the connections to the case and instead wired the power and reset pins to relays controlled by the Pi, and wired the pins controlling the power LED into one of the PiFace's input ports.

The prototype had jumpers cabled directly from the pins to the Pi but I wanted to be able to unplug the computers from the Pi and have them some distance apart, so I used Ethernet cables to connect the Pi to the computers' pins. Essentially the solution looks like this...

The Raspberry Pi

I used a Raspberry Pi Model B with a PiFace hat. I had the PiFace in a drawer so I used that for the prototype as it comes with two relays.

Setting up the Relays

The relays on the PiFace worked but I needed two relays per PC. To get more relays I bought an Andoer 5V Active Low 8 Channel Road Relay Module Control Board. Oh, and by the way, one RJ45 jack cost twice as much as the 8 relay Arduino board.

Connecting the relay board to the Pi was straightforward. I attached the ground pin to the Raspberry Pi ground and the 7 remaining pins to the 7 PiFace output pins. The relay board also came with a jumper which I put across the VCC-JD and VCC pins for reasons.

At the PC End

The PC part of the puzzle was simple. I used jumper cables with a female end and used them to connect the motherboard pins to a depressingly expensive RJ45 jack. The connections on the RJ45 jack are numbered, below is how I wired them to the motherboard.

Colour                 Green  Orange  Red    Yellow  Blue   Purple
ATX Pin                Power  Power   Reset  Reset   LED -  LED +
Ethernet Cable Number  1      3       6      2       5      4


Raspberry Pi End 

The ethernet cable from the PC was again terminated with a RJ45 jack whose connections were wired into the Pi as below:

Colour                 Green    Orange   Red      Yellow   Blue    Purple
Raspberry Pi Location  Relay 1  Relay 1  Relay 2  Relay 2  GPIO 1  Ground
Ethernet Cable Number  1        3        6        2        5       4


Finished Hardware



  • The breadboard in the picture was used for prototyping but was not used in the final design. 
  • The colours of the jumpers may not match those in the tables above because some got damaged and were replaced.
  • I had problems with the second relay so it was not used.

 

Software

Once everything was connected, MAAS needed a way to remotely control the relays and to check the state of the input pins. I began by writing a Python library and then added a REST interface that MAAS could call. Finally I installed gunicorn to run the server.

cd ~
apt-get install -y python-pifacecommon python-pifacedigitalio bzr gunicorn python-yaml
bzr branch lp:~gnuoy/+junk/pimaaspower
sudo cp pimaaspower/pipower /etc/gunicorn.d/
sudo service gunicorn restart
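
The control logic behind such a library is small: read the power LED pin to get the current state and briefly close a relay to "press" a button. The Python sketch below shows the idea under stated assumptions. FakeBoard stands in for the PiFace so that it runs without hardware, and the class names, method names and press durations are hypothetical rather than taken from the pimaaspower code. The long press for power off relies on the usual ATX behaviour of forcing the machine off when the button is held for several seconds.

```python
import time


class FakeBoard:
    """Stand-in for the PiFace relays and inputs (hypothetical)."""

    def __init__(self):
        self.relays = {}   # relay number -> True while the relay is closed
        self.inputs = {}   # input pin -> True while the power LED is lit
        self.pulses = []   # relay numbers in the order they were closed

    def set_relay(self, number, closed):
        self.relays[number] = closed
        if closed:
            self.pulses.append(number)

    def read_input(self, pin):
        return self.inputs.get(pin, False)


class Computer:
    """One managed PC: a power relay, a reset relay and an LED state pin."""

    def __init__(self, board, power_relay, reset_relay, state_pin,
                 short_press=0.5, long_press=5.0):
        self.board = board
        self.power_relay = power_relay
        self.reset_relay = reset_relay
        self.state_pin = state_pin
        self.short_press = short_press   # seconds; assumed value
        self.long_press = long_press     # seconds; assumed value

    def _press(self, relay, hold_seconds):
        # Close the relay for a moment, like a finger on the button.
        self.board.set_relay(relay, True)
        time.sleep(hold_seconds)
        self.board.set_relay(relay, False)

    def is_on(self):
        return self.board.read_input(self.state_pin)

    def power_on(self):
        if not self.is_on():             # pressing while on would shut it down
            self._press(self.power_relay, self.short_press)

    def power_off(self):
        if self.is_on():                 # a long press forces the machine off
            self._press(self.power_relay, self.long_press)


board = FakeBoard()
pc = Computer(board, power_relay=0, reset_relay=1, state_pin=0,
              short_press=0, long_press=0)
pc.power_on()
print(board.pulses)
```

Checking the LED before pressing matters because the power button is a toggle: without the check, asking for a machine that is already on to power on could end up switching it off.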

MAAS

MAAS knows about a number of different power management interfaces and it was fairly straightforward to plug a new one in, although because it involves editing files managed by the MAAS package these changes will need reapplying after a package update. I believe that making power types more pluggable in MAAS is in the pipeline.

  • Firstly, add a new template at /etc/maas/templates/power/pipower.template; the content is here.
  • Add an entry to JSON_POWER_TYPE_PARAMETERS in /usr/lib/python2.7/dist-packages/provisioningserver/power_schema.py
 
    {
        'name': 'pipower',
        'description': 'Pipower',
        'fields': [
            make_json_field('node_name', "Node Name"),
            make_json_field('power_address', "Power Address"),
            make_json_field('state_pin', "Power State Pin Number"),
            make_json_field('reset_relay', "Reset Relay Number"),
            make_json_field('power_relay', "Power Relay Number"),
        ],
    }
  • Tell MAAS that this power type supports querying the power state (unlike wake-on-lan). Edit /usr/lib/python2.7/dist-packages/provisioningserver/rpc/power.py and add 'pipower' to QUERY_POWER_TYPES
  • Restart the cluster daemon: sudo service maas-clusterd restart
  • Edit the nodes registered in MAAS and set the following:


  1. Power Type: 'Pipower'. 
  2. Node name: Can be anything, but it makes sense to use the same node name that MAAS is using as it makes debugging easier.
  3. Power Address: http://<pipower ip>:8000/power/api/v1.0/computer
  4. Power state Pin: The number of the PiFace Input port for the power LED
  5. Reset Relay number: The number of the relay controlling the reset switch
  6. Power Relay number: The number of the relay controlling the power switch
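
As a quick sanity check before saving a node, the settings above can be compared against the field names from the pipower entry. This is only a sketch: the function name, the example node name and the example IP address are made up for illustration and are not part of MAAS.

```python
# Field names taken from the pipower entry in JSON_POWER_TYPE_PARAMETERS.
PIPOWER_FIELDS = (
    "node_name",
    "power_address",
    "state_pin",
    "reset_relay",
    "power_relay",
)


def missing_pipower_fields(params):
    """Return the pipower fields that are absent or blank in params."""
    missing = []
    for name in PIPOWER_FIELDS:
        value = params.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Hypothetical node settings; 192.0.2.10 is a documentation-only address.
example = {
    "node_name": "node1",
    "power_address": "http://192.0.2.10:8000/power/api/v1.0/computer",
    "state_pin": "1",
    "reset_relay": "2",
    "power_relay": "1",
}

print(missing_pipower_fields(example))  # []
```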

Scaling

There are eight relays available and each computer to be managed uses two of them, so the system will support four machines. However, the reset relay is not strictly needed as MAAS never uses it, which means you could support eight machines.
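
The arithmetic, as a tiny sketch (the function name is mine):

```python
def max_machines(relays, use_reset_relay=True):
    """Number of machines a board can manage given its relay count."""
    relays_per_machine = 2 if use_reset_relay else 1
    return relays // relays_per_machine


print(max_machines(8))                         # 4
print(max_machines(8, use_reset_relay=False))  # 8
```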

End Result

I can now have MAAS provision the machines without me being physically present. So, for example, I can use Juju to fire up a fully functioning OpenStack cloud with two commands.

juju bootstrap
juju-deployer -c deploy.yaml trusty-kilo

In the background Juju will request as many physical machines as it needs from MAAS. MAAS will bring those machines online, install Ubuntu and hand them back to Juju. Juju then uses the OpenStack charms to install and configure the OpenStack services, all while I'm downstairs deciding which biscuit to have with my cup of tea. Whenever I'm finished I can tear down the environment.

juju destroy-environment maas

MAAS will power the machines off and they'll be put back in the pool ready for the next deployment.

NOTE: I adapted an existing bundle from the Juju charm store to fit OpenStack on my diminutive cluster. The bundle is here

Wednesday, 8 July 2015

Neutron Router High Availability? As easy as "juju set"

The Juju charms for deploying OpenStack have just had their three-monthly update (the 15.04 release). The charms now support Neutron's new Layer 3 High Availability feature, which uses the Virtual Router Redundancy Protocol (VRRP). When enabled, this feature allows Neutron to quickly fail a router over to another Neutron gateway if the primary node hosting the router is lost. The feature was introduced in Juno and marked as experimental, so I would recommend only using it with deployments >= Kilo.

Enabling Router HA:


L3 HA in Kilo requires that DVR and L2 Population are disabled, so to enable it in the charms:
juju set neutron-api enable-l3ha=True
juju set neutron-api enable-dvr=False
juju set neutron-api l2-population=False

The number of L3 agents that will run standby routers can also be configured:
juju set neutron-api max-l3-agents-per-router=2
juju set neutron-api min-l3-agents-per-router=2
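
The constraints above can be written down as a small check. This is a sketch of the rules described in this post, not code from the charms: the function name is mine, and the min <= max check is my own assumption about what a sensible configuration looks like.

```python
def check_l3ha_config(config):
    """Return a list of problems with neutron-api options when L3 HA is enabled."""
    problems = []
    if not config.get("enable-l3ha", False):
        return problems  # L3 HA is off, nothing to check

    # Rules from the text: DVR and L2 population must be disabled in Kilo.
    if config.get("enable-dvr", False):
        problems.append("enable-dvr must be False when enable-l3ha is True")
    if config.get("l2-population", False):
        problems.append("l2-population must be False when enable-l3ha is True")

    # Assumption: the minimum agent count should not exceed the maximum.
    lowest = config.get("min-l3-agents-per-router")
    highest = config.get("max-l3-agents-per-router")
    if lowest is not None and highest is not None and lowest > highest:
        problems.append("min-l3-agents-per-router exceeds max-l3-agents-per-router")
    return problems


config = {
    "enable-l3ha": True,
    "enable-dvr": False,
    "l2-population": False,
    "max-l3-agents-per-router": 2,
    "min-l3-agents-per-router": 2,
}
print(check_l3ha_config(config))  # []
```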

Creating an HA enabled router:


The charms switch on router HA by default once enable-l3ha has been enabled.
$ neutron router-create ha-router
Created a new router:
+-----------------------+--------------------------------------+
| Field                 | Value                                |
+-----------------------+--------------------------------------+
| admin_state_up        | True                                 |
| distributed           | False                                |
| external_gateway_info |                                      |
| ha                    | True                                 |
| id                    | 64ff0665-5600-433c-b2d8-33509ce88eb1 |
| name                  | test2                                |
| routes                |                                      |
| status                | ACTIVE                               |
| tenant_id             | 8e8b1426508f42aeaff783180d7b2ef4     |
+-----------------------+--------------------------------------+
/!\ Currently a router cannot be switched in and out of HA mode
$ neutron router-update 64ff0665-5600-433c-b2d8-33509ce88eb1  --ha=False
400-{u'NeutronError': {u'message': u'Cannot update read-only attribute ha', u'type': u'HTTPBadRequest', u'detail': u''}}

Under the hood:

Below is a worked example which follows the creation of an HA enabled router and shows the components created implicitly by Neutron. In this environment the following networks have already been created:
$ neutron net-list
+--------------------------------------+---------+------------------------------------------------------+
| id                                   | name    | subnets                                              |
+--------------------------------------+---------+------------------------------------------------------+
| 32ba54bc-804e-489e-8903-b8dc0ed535f7 | private | a3ed1cc4-3451-418f-a412-80ad8cca2ec4 192.168.21.0/24 |
| c9a3bc24-6390-4220-b136-bc0edf1fe2f2 | ext_net | 76098d4d-bfa4-4f96-89e0-78c851d80dac 10.5.0.0/16     |
+--------------------------------------+---------+------------------------------------------------------+

$ neutron subnet-list 
+--------------------------------------+----------------+-----------------+----------------------------------------------------+
| id                                   | name           | cidr            | allocation_pools                                   |
+--------------------------------------+----------------+-----------------+----------------------------------------------------+
| 76098d4d-bfa4-4f96-89e0-78c851d80dac | ext_net_subnet | 10.5.0.0/16     | {"start": "10.5.150.0", "end": "10.5.200.254"}     |
| a3ed1cc4-3451-418f-a412-80ad8cca2ec4 | private_subnet | 192.168.21.0/24 | {"start": "192.168.21.2", "end": "192.168.21.254"} |
+--------------------------------------+----------------+-----------------+----------------------------------------------------+

In this environment there are three neutron-gateways:
$ juju status neutron-gateway --format=short

- neutron-gateway/0: 10.5.29.216 (started)
- neutron-gateway/1: 10.5.29.217 (started)
- neutron-gateway/2: 10.5.29.218 (started)

With their corresponding L3 agents:
$ neutron agent-list | grep "L3 agent"
| 28f227d8-e620-4478-ba36-856fb0409393 | L3 agent           | juju-lytrusty-machine-7  | :-)   | True           |
| 8d439f33-e4f8-4784-a617-5b3328bab9e3 | L3 agent           | juju-lytrusty-machine-6  | :-)   | True           |
| bdc00c2a-77c0-45c3-ab8a-ceca3319832d | L3 agent           | juju-lytrusty-machine-8  | :-)   | True           |

There is no router defined yet so the only network namespace present is the dhcp namespace for the private network:
$ juju run --service neutron-gateway --format=yaml "ip netns list"
- MachineId: "6"
  Stdout: ""
  UnitId: neutron-gateway/0
- MachineId: "7"
  Stdout: |
    qdhcp-32ba54bc-804e-489e-8903-b8dc0ed535f7
  UnitId: neutron-gateway/1
- MachineId: "8"
  Stdout: ""
  UnitId: neutron-gateway/2

Creating a router will add a qrouter-$ROUTERID netns to two of the gateway nodes (since min-l3-agents-per-router=2 and max-l3-agents-per-router=2)

$ neutron router-create ha-router
Created a new router:
+-----------------------+--------------------------------------+
| Field                 | Value                                |
+-----------------------+--------------------------------------+
| admin_state_up        | True                                 |
| distributed           | False                                |
| external_gateway_info |                                      |
| ha                    | True                                 |
| id                    | 192ba483-c060-4ee2-86ad-fe38ea280c93 |
| name                  | ha-router                            |
| routes                |                                      |
| status                | ACTIVE                               |
| tenant_id             | 8e8b1426508f42aeaff783180d7b2ef4     |
+-----------------------+--------------------------------------+

Neutron has assigned this router to two of the three agents:

$ ROUTER_ID="192ba483-c060-4ee2-86ad-fe38ea280c93"
$ neutron l3-agent-list-hosting-router $ROUTER_ID
+--------------------------------------+-------------------------+----------------+-------+
| id                                   | host                    | admin_state_up | alive |
+--------------------------------------+-------------------------+----------------+-------+
| 28f227d8-e620-4478-ba36-856fb0409393 | juju-lytrusty-machine-7 | True           | :-)   |
| bdc00c2a-77c0-45c3-ab8a-ceca3319832d | juju-lytrusty-machine-8 | True           | :-)   |
+--------------------------------------+-------------------------+----------------+-------+
A netns for the new router will have been created on neutron-gateway/1 and neutron-gateway/2:
$  juju run --service neutron-gateway --format=yaml "ip netns list"
- MachineId: "6"
  Stdout: ""
  UnitId: neutron-gateway/0
- MachineId: "7"
  Stdout: |
    qrouter-192ba483-c060-4ee2-86ad-fe38ea280c93
    qdhcp-32ba54bc-804e-489e-8903-b8dc0ed535f7
  UnitId: neutron-gateway/1
- MachineId: "8"
  Stdout: |
    qrouter-192ba483-c060-4ee2-86ad-fe38ea280c93
  UnitId: neutron-gateway/2
A keepalived process is spawned in each of the qrouter netns and these processes communicate over a dedicated network which is created implicitly when the HA enabled router is added.
$ neutron net-list
+--------------------------------------+----------------------------------------------------+-------------------------------------------------------+
| id                                   | name                                               | subnets                                               |
+--------------------------------------+----------------------------------------------------+-------------------------------------------------------+
| 32ba54bc-804e-489e-8903-b8dc0ed535f7 | private                                            | a3ed1cc4-3451-418f-a412-80ad8cca2ec4 192.168.21.0/24  |
| af9cad57-b4fe-465d-b439-b72aaec16309 | HA network tenant 8e8b1426508f42aeaff783180d7b2ef4 | f0cb279b-36fe-43dc-a03b-8eb8b99e7f0b 169.254.192.0/18 |
| c9a3bc24-6390-4220-b136-bc0edf1fe2f2 | ext_net                                            | 76098d4d-bfa4-4f96-89e0-78c851d80dac 10.5.0.0/16      |
+--------------------------------------+----------------------------------------------------+-------------------------------------------------------+

Neutron creates a dedicated interface in the qrouter netns for this traffic.

$ neutron port-list
+--------------------------------------+-------------------------------------------------+-------------------+--------------------------------------------------------------------------------------+
| id                                   | name                                            | mac_address       | fixed_ips                                                                            |
+--------------------------------------+-------------------------------------------------+-------------------+--------------------------------------------------------------------------------------+
| 72326e9b-67e8-403a-80c3-4bac9748cdb6 |                                                 | fa:16:3e:aa:2c:96 | {"subnet_id": "a3ed1cc4-3451-418f-a412-80ad8cca2ec4", "ip_address": "192.168.21.2"}  |
| 89c47030-f849-41ed-96e6-a36a3a696eeb | HA port tenant 8e8b1426508f42aeaff783180d7b2ef4 | fa:16:3e:d4:fc:a1 | {"subnet_id": "f0cb279b-36fe-43dc-a03b-8eb8b99e7f0b", "ip_address": "169.254.192.1"} |
| 9ce2b6ac-9983-4ffd-ae97-6400682021c8 | HA port tenant 8e8b1426508f42aeaff783180d7b2ef4 | fa:16:3e:a5:76:e9 | {"subnet_id": "f0cb279b-36fe-43dc-a03b-8eb8b99e7f0b", "ip_address": "169.254.192.2"} |
+--------------------------------------+-------------------------------------------------+-------------------+--------------------------------------------------------------------------------------+

$  juju run --unit neutron-gateway/1,neutron-gateway/2 --format=yaml "ip netns exec qrouter-$ROUTER_ID ip addr list | grep  ha-"
- MachineId: "7"
  Stdout: |
    2: ha-89c47030-f8:  mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        inet 169.254.192.1/18 brd 169.254.255.255 scope global ha-89c47030-f8
  UnitId: neutron-gateway/1
- MachineId: "8"
  Stdout: |
    2: ha-9ce2b6ac-99:  mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        inet 169.254.192.2/18 brd 169.254.255.255 scope global ha-9ce2b6ac-99
        inet 169.254.0.1/24 scope global ha-9ce2b6ac-99
  UnitId: neutron-gateway/2
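
The addressing in the output above can be double-checked with Python's ipaddress module. The values are copied from the listings. The extra 169.254.0.1/24 address seen on neutron-gateway/2 sits outside the HA subnet; I'm assuming it is the virtual address that keepalived moves between nodes, but that is my reading rather than something shown here.

```python
import ipaddress

# HA network and port addresses as shown in the net-list and port-list output.
ha_net = ipaddress.ip_network("169.254.192.0/18")
ha_ports = [
    ipaddress.ip_address("169.254.192.1"),
    ipaddress.ip_address("169.254.192.2"),
]

print(ha_net.is_link_local)                        # True
print(ha_net.broadcast_address)                    # 169.254.255.255
print(ha_net.num_addresses)                        # 16384
print(all(ip in ha_net for ip in ha_ports))        # True
print(ipaddress.ip_address("169.254.0.1") in ha_net)  # False
```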

Keepalived writes out its state to /var/lib/neutron/ha_confs/$ROUTER_ID/state, which can be queried to find out who is currently the master:

$ juju run --unit  neutron-gateway/1,neutron-gateway/2 "cat /var/lib/neutron/ha_confs/$ROUTER_ID/state"
- MachineId: "7"
  Stdout: backup
  UnitId: neutron-gateway/1
- MachineId: "8"
  Stdout: master
  UnitId: neutron-gateway/2
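
If this check is needed often, the results can be reduced to a single answer. The sketch below assumes the juju run output has already been decoded into a list of dicts with the same keys as shown above; the function name is mine.

```python
def find_master(results):
    """Return the UnitId whose keepalived state file reads 'master'."""
    masters = [
        entry["UnitId"]
        for entry in results
        if entry.get("Stdout", "").strip() == "master"
    ]
    if len(masters) != 1:
        # Zero or several masters means something is wrong, so fail loudly.
        raise RuntimeError("expected exactly one master, found: %s" % masters)
    return masters[0]


# Values copied from the juju run output above.
results = [
    {"MachineId": "7", "Stdout": "backup", "UnitId": "neutron-gateway/1"},
    {"MachineId": "8", "Stdout": "master", "UnitId": "neutron-gateway/2"},
]
print(find_master(results))  # neutron-gateway/2
```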

Plugging the router into the networks:

$ neutron router-gateway-set $ROUTER_ID c9a3bc24-6390-4220-b136-bc0edf1fe2f2
Set gateway for router 192ba483-c060-4ee2-86ad-fe38ea280c93
$ neutron router-interface-add $ROUTER_ID a3ed1cc4-3451-418f-a412-80ad8cca2ec4
Added interface 4ffe673c-b528-4891-b9ec-3ebdcfc146e2 to router 192ba483-c060-4ee2-86ad-fe38ea280c93.

The router now has an IP on the private subnet which will be managed by keepalived:

$ neutron router-show $ROUTER_ID                            
+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                                                                                                  |
+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                                                                                                   |
| distributed           | False                                                                                                                                                                                  |
| external_gateway_info | {"network_id": "c9a3bc24-6390-4220-b136-bc0edf1fe2f2", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "76098d4d-bfa4-4f96-89e0-78c851d80dac", "ip_address": "10.5.150.0"}]} |
| ha                    | True                                                                                                                                                                                   |
| id                    | 192ba483-c060-4ee2-86ad-fe38ea280c93                                                                                                                                                   |
| name                  | ha-router                                                                                                                                                                              |
| routes                |                                                                                                                                                                                        |
| status                | ACTIVE                                                                                                                                                                                 |
| tenant_id             | 8e8b1426508f42aeaff783180d7b2ef4                                                                                                                                                       |
+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Since neutron-gateway/2 is currently the master, it will have the router IP (10.5.150.0) in its netns:
$ juju run --unit neutron-gateway/1,neutron-gateway/2 --format=yaml "ip netns exec qrouter-192ba483-c060-4ee2-86ad-fe38ea280c93 ip addr list | grep  10.5.150"
- MachineId: "7"
  ReturnCode: 1
  Stdout: ""
  UnitId: neutron-gateway/1
- MachineId: "8"
  Stdout: |2
        inet 10.5.150.0/16 scope global qg-288da587-97
  UnitId: neutron-gateway/2

$ ping -c2 10.5.150.0
PING 10.5.150.0 (10.5.150.0) 56(84) bytes of data.
64 bytes from 10.5.150.0: icmp_seq=1 ttl=64 time=0.756 ms
64 bytes from 10.5.150.0: icmp_seq=2 ttl=64 time=0.487 ms

--- 10.5.150.0 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.487/0.621/0.756/0.136 ms

Finally, shutting down neutron-gateway/2 will trigger the router IP to flip over to neutron-gateway/1:

$ juju run --unit neutron-gateway/2 "shutdown -h now"
$ juju run --unit  neutron-gateway/1 "cat /var/lib/neutron/ha_confs/$ROUTER_ID/state"
master
$ juju run --unit neutron-gateway/1 --format=yaml "ip netns exec qrouter-192ba483-c060-4ee2-86ad-fe38ea280c93 ip addr list | grep  10.5.150"
- MachineId: "7"
  Stdout: |2
        inet 10.5.150.0/16 scope global qg-288da587-97
  UnitId: neutron-gateway/1

$ ping -c2 10.5.150.0
PING 10.5.150.0 (10.5.150.0) 56(84) bytes of data.
64 bytes from 10.5.150.0: icmp_seq=1 ttl=64 time=0.359 ms
64 bytes from 10.5.150.0: icmp_seq=2 ttl=64 time=0.497 ms

--- 10.5.150.0 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.359/0.428/0.497/0.069 ms