Monday, 14 March 2022

OpenStack OVS Debugging Walk-through

I recently needed to investigate broken connectivity in an OpenStack deployment after an OVS upgrade. This post is a walk-through of that process. In this setup the missing connectivity is between the Octavia units and the amphora units they manage, but there is nothing Octavia-specific about the debugging session. At the start all I knew was that the OpenStack Xena Octavia tests were failing. On closer inspection of the test runs I could see that the amphorae were stuck in a BOOTING state:

$ openstack loadbalancer amphora list -c id -c status -c role
+--------------------------------------+-----------+--------+
| id                                   | status    | role   |
+--------------------------------------+-----------+--------+
| c7dc61df-c7c7-4bce-9d79-216118a0f001 | BOOTING   | BACKUP |
| e64930ad-1cf0-4073-9827-45302a3aac52 | BOOTING   | MASTER |
+--------------------------------------+-----------+--------+

The corresponding guest units appeared to be up and available:

$ openstack server list --all-projects -c Name -c Status -c Networks
+----------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------+
| Name                                         | Status | Networks                                                                                                       |
+----------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------+
| amphora-e64930ad-1cf0-4073-9827-45302a3aac52 | ACTIVE | lb-mgmt-net=fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf; private=192.168.0.59; private_lb_fip_network=10.42.0.175  |
| amphora-c7dc61df-c7c7-4bce-9d79-216118a0f001 | ACTIVE | lb-mgmt-net=fc00:5896:ff23:44dc:f816:3eff:fe7e:15d1; private=192.168.0.194; private_lb_fip_network=10.42.0.231 |
+----------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------+

However, looking in the Octavia logs revealed these errors:

2022-03-12 09:04:19.832 78382 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. 
  Retrying.: requests.exceptions.ConnectTimeout: 
  HTTPSConnectionPool(host='fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf', port=9443): 
    Max retries exceeded with url: // 
    (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fcc3436dca0>,
                                   'Connection to fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf timed out.
                                   (connect timeout=10.0)'))

Testing from the Octavia unit confirms the lack of connectivity:

$ ping -c2 fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf
PING fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf(fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf) 56 data bytes

--- fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1027ms

$ nc -zvw2 fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf 9443
nc: connect to fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf port 9443 (tcp) timed out: Operation now in progress

At this point the issue could be anywhere between the Octavia unit and the guest. I assumed the guest was the most likely culprit. To check this I set up a tcpdump on the tap device which connects the guest to the virtual network. To find the tap device, first find the hypervisor hosting the guest:

$ openstack server show 6bdbe8b9-2765-48aa-a48f-4057e8038d9c -c name -c 'OS-EXT-SRV-ATTR:hypervisor_hostname' 
+-------------------------------------+-----------------------------------------------------+
| Field                               | Value                                               |
+-------------------------------------+-----------------------------------------------------+
| OS-EXT-SRV-ATTR:hypervisor_hostname | juju-7091df-zaza-d0217f8a78e8-8.project.serverstack |
| name                                | amphora-e64930ad-1cf0-4073-9827-45302a3aac52        |
+-------------------------------------+-----------------------------------------------------+

Then find the port ID corresponding to the IP address we are trying to connect to:

$ openstack port list -cID -c'Fixed IP Addresses' | grep 'fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf'
| 93fff561-bfda-46b4-9d61-5f4ea8559a15 | ip_address='fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf', subnet_id='e4f5ce87-759a-465a-9d89-0e43066a6c6a' |

Now that the port is known we can connect to the hypervisor (juju-7091df-zaza-d0217f8a78e8-8) and search for the tap device corresponding to the port ID:

$ ip a | grep tap | grep 93fff561
25: tap93fff561-bf:  mtu 1458 qdisc fq_codel master ovs-system state UNKNOWN group default qlen 1000

Any traffic to the guest using the IP associated with the port will pass through this tap device. If we set off a ping on the Octavia unit then we can use tcpdump on the tap device to check whether the icmpv6 echo packets are successfully arriving.

$ sudo tcpdump -i tap93fff561-bf -l 'icmp6[icmp6type]=icmp6-echo'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap93fff561-bf, link-type EN10MB (Ethernet), capture size 262144 bytes

0 packets captured
0 packets received by filter
0 packets dropped by kernel

The packets are not arriving so the break must be earlier on. Connectivity from the Octavia unit to the guest goes through a gre tunnel between the Octavia unit and the hypervisor hosting the guest. The next step is to set up listeners at either end of the gre tunnel. In this case the IP address of the Octavia unit is 172.20.0.139 and the hypervisor is 172.20.0.209. This allows us to find the corresponding gre interface in the ovs br-tun bridge. Looking at the output below from the hypervisor there are four gre tunnels, and gre-ac14008b is the tunnel to the Octavia unit since its remote_ip option corresponds to the IP of the Octavia unit.

$ sudo ovs-vsctl show | grep -A3 "Port gre"
        Port gre-ac14008b
            Interface gre-ac14008b
                type: gre
                options: {df_default="true", egress_pkt_mark="0", in_key=flow, local_ip="172.20.0.209", out_key=flow, remote_ip="172.20.0.139"}
        Port gre-ac140070
            Interface gre-ac140070
                type: gre
                options: {df_default="true", egress_pkt_mark="0", in_key=flow, local_ip="172.20.0.209", out_key=flow, remote_ip="172.20.0.112"}
        Port gre-ac1400f4
            Interface gre-ac1400f4
                type: gre
                options: {df_default="true", egress_pkt_mark="0", in_key=flow, local_ip="172.20.0.209", out_key=flow, remote_ip="172.20.0.244"}
        Port gre-ac1400df
            Interface gre-ac1400df
                type: gre
                options: {df_default="true", egress_pkt_mark="0", in_key=flow, local_ip="172.20.0.209", out_key=flow, remote_ip="172.20.0.223"}
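The port names in this output look like the tunnel type followed by the remote IPv4 address written as eight hex digits (172.20.0.139 becomes ac 14 00 8b). Assuming that naming pattern holds across the deployment, a small helper can predict which port to look for on either end. The function name tunnel_port_name is made up for this sketch:

```shell
# Sketch: build the expected tunnel port name from the tunnel type and the
# remote IPv4 address. The "<type>-<hex ip>" pattern is inferred from the
# ovs-vsctl output above and is an assumption, not a documented guarantee.
tunnel_port_name() {
    # $1 = tunnel type (e.g. gre), $2 = remote IPv4 address.
    # The command substitution is left unquoted on purpose so that the
    # four octets become four separate printf arguments.
    printf '%s-%02x%02x%02x%02x\n' "$1" $(echo "$2" | tr '.' ' ')
}

tunnel_port_name gre 172.20.0.139   # hypervisor side, tunnel to the Octavia unit
tunnel_port_name gre 172.20.0.209   # Octavia side, tunnel to the hypervisor
```

For the two addresses in this post it prints gre-ac14008b and gre-ac1400d1, which are the interfaces found by reading the ovs-vsctl output by hand.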

Repeating the process above reveals that the interface on the Octavia unit which handles the other end of the tunnel is called gre-ac1400d1. With this information we can set up tcpdump processes on both ends of the gre tunnel to check for traffic.

On the hypervisor:

$ sudo ovs-tcpdump -i gre-ac14008b -l 'icmp6[icmp6type]=icmp6-echo'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on migre-ac14008b, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

On the Octavia unit:

$ sudo ovs-tcpdump -i gre-ac1400d1 -l 'icmp6[icmp6type]=icmp6-echo'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on migre-ac1400d1, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

The packets are not managing to leave the Octavia host; in fact they are not even making it to the gre interface on the Octavia host. Before investigating further it's worth looking at a diagram showing the devices between o-hm0 and the tap device associated with the guest.

This is starting to look like an issue with port security. In an ML2 OVS deployment port security is enforced by openflow rules in the br-int ovs bridge. To check this we can disable port security and retest. In this deployment the port on the Octavia unit is named octavia-health-manager-octavia-2-listen-port:

$ openstack port show octavia-health-manager-octavia-2-listen-port --fit-width -c name -c fixed_ips -c port_security_enabled -c security_group_ids
+-----------------------+--------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                  |
+-----------------------+--------------------------------------------------------------------------------------------------------+
| fixed_ips             | ip_address='fc00:5896:ff23:44dc:f816:3eff:fea8:7ef6', subnet_id='e4f5ce87-759a-465a-9d89-0e43066a6c6a' |
| name                  | octavia-health-manager-octavia-2-listen-port                                                           |
| port_security_enabled | True                                                                                                   |
| security_group_ids    | c01a6b7c-55e6-49ba-a141-d67401c42b6d                                                                   |
+-----------------------+--------------------------------------------------------------------------------------------------------+

To disable port security the security group needs to be removed too. Both can be done in a single command:

$ openstack port set --disable-port-security --no-security-group octavia-health-manager-octavia-2-listen-port

Then retest:

$ ping -c2  fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf
PING fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf(fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf) 56 data bytes
64 bytes from fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf: icmp_seq=1 ttl=64 time=2.34 ms
64 bytes from fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf: icmp_seq=2 ttl=64 time=1.45 ms

--- fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.454/1.895/2.337/0.441 ms

So removing port security fixes the issue. This strongly implies that the openflow rules are blocking the packets or failing to forward them to the gre interface. To continue debugging the issue we need to re-enable port security:

$ openstack port set --security-group lb-health-mgr-sec-grp --enable-port-security octavia-health-manager-octavia-2-listen-port

The next obvious place to look is the security group rules of the security group associated with the port.

$ openstack security group rule list lb-health-mgr-sec-grp
+--------------------------------------+-------------+-----------+-----------+------------+-----------+-----------------------+----------------------+
| ID                                   | IP Protocol | Ethertype | IP Range  | Port Range | Direction | Remote Security Group | Remote Address Group |
+--------------------------------------+-------------+-----------+-----------+------------+-----------+-----------------------+----------------------+
| 2579171b-e2b7-43aa-b18e-1ca50bf1e2ce | None        | IPv6      | ::/0      |            | egress    | None                  | None                 |
| 632a4117-83bd-4d08-8bc7-de8424f92b6a | ipv6-icmp   | IPv6      | ::/0      |            | ingress   | None                  | None                 |
| be9c3312-dd15-4f1b-93bf-17a5f1917e0d | None        | IPv4      | 0.0.0.0/0 |            | egress    | None                  | None                 |
| eba21f3a-e90b-4b05-b729-20eb54ce047e | udp         | IPv6      | ::/0      | 5555:5555  | ingress   | None                  | None                 |
+--------------------------------------+-------------+-----------+-----------+------------+-----------+-----------------------+----------------------+

The first rule in the list allows any IPv6 egress traffic, so the echo request should not be blocked. Since the security group rules look ok we should now look at the flows associated with br-int in case there are any clues there. The complete set of flows associated with br-int can be seen by running sudo ovs-ofctl dump-flows br-int (since there are 84 of them I have omitted listing them here). By default dump-flows includes additional information such as the number of packets which have matched a rule. This allows us to see how many of those rules are actually being hit:

$ sudo ovs-ofctl dump-flows br-int |  grep -v n_packets=0, | wc -l
21
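Rather than only counting the flows that have been hit, it can help to list them with the busiest first. The filter below is a sketch that works purely on the text of the dump-flows output; active_flows is a made-up name:

```shell
# Sketch: read `ovs-ofctl dump-flows` output on stdin, drop flows that have
# never matched a packet and list the rest with the busiest flow first.
active_flows() {
    grep -v 'n_packets=0,' |
        sed -n 's/.*n_packets=\([0-9][0-9]*\),.*/\1 &/p' |
        sort -rn |
        cut -d' ' -f2-
}

# Example (needs a host running OVS):
#   sudo ovs-ofctl dump-flows br-int | active_flows | head -5
```

The sed step copies the packet count to the front of each line so that sort can order on it, and cut removes it again so the flows are printed unchanged.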

This is still a lot of rules to go through. Luckily ovs provides a trace utility which allows you to send a theoretical packet through the flows and see how it would be treated. To use it you need to collect information about the source and destination:

IN_PORT="o-hm0"
BRIDGE="br-int"
HM0_SRC="fc00:5896:ff23:44dc:f816:3eff:fea8:7ef6"
DST_IP="fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf"
HM0_MAC="fa:16:3e:a8:7e:f6"
DST_MAC="fe:16:3e:0d:9a:cf"
sudo ovs-appctl ofproto/trace $BRIDGE in_port="$IN_PORT",icmp6,dl_src=$HM0_MAC,dl_dst=$DST_MAC,ipv6_src=$HM0_SRC,ipv6_dst=$DST_IP

The output is verbose but crucially at the end the packet has been transferred to br-tun and you can see it being sent out of the tunnels:

...
22. dl_vlan=1, priority 1, cookie 0xfc585d7201b94e91
    pop_vlan
    set_field:0x2f1->tun_id
    output:2
     -> output to kernel tunnel
    output:3
     -> output to kernel tunnel
    output:5
     -> output to kernel tunnel

This is strange because the trace utility seems to be indicating that the packets should be sent down the gre tunnels but this is not what we are seeing in practice. Since I manually constructed the test packet it is possible I omitted some data which would cause the packet to be handled differently (such as icmp_type). To construct a more authentic packet to test with we can use tcpdump again to capture a packet and then pass this through the trace utility. To do this set off a continuous ping again, and run tcpdump, this time against the o-hm0 interface, saving the captured packets to a file.

$ sudo tcpdump -i o-hm0 'icmp6[icmp6type]=icmp6-echo' -w ping.pcap

The packets in the file need to be converted to hex before they can be fed into the trace utility:

$ sudo ovs-pcap ping.pcap > ping.hex

We can now grab the first packet in the file and feed it through the trace utility:

$ IN_PORT="o-hm0"
$ BRIDGE="br-int"
$ PACKET=$(head -1 ping.hex)                                   
$ sudo ovs-appctl ofproto/trace $BRIDGE in_port="$IN_PORT" $PACKET
...
20. dl_vlan=1,dl_dst=fa:16:3e:0d:9a:cf, priority 2, cookie 0xfc585d7201b94e91
    pop_vlan
    set_field:0x2f1->tun_id
    output:5
     -> output to kernel tunnel

Once again the packet is correctly transferred to br-tun and out of a tunnel. In this case rather than flooding the packet out of all the gre tunnels it has been sent down the tunnel that corresponds to the correct hypervisor. Earlier we identified the interface as gre-ac1400d1 and we can see that is port 5:

$ sudo ovs-vsctl list interface | grep -B1 -E "ofport.*\s5"
name                : gre-ac1400d1
ofport              : 5
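The captured packet carries dl_dst=fa:16:3e:0d:9a:cf, whereas the hand-built trace used fe:16:3e:0d:9a:cf. Both IPv6 addresses in this post look like EUI-64 addresses derived from the port MAC (HM0_SRC and HM0_MAC correspond in exactly that way), so the MAC can be recovered from the address instead of being typed by hand. The sketch below assumes that derivation; eui64_to_mac is a made-up name:

```shell
# Sketch: recover the MAC address embedded in an EUI-64 style IPv6 address.
# Assumes the address was formed from the port's MAC (as SLAAC does) and
# that the interface ID (the last four groups) is not shortened with "::".
eui64_to_mac() {
    # Split on ':' and keep the last four groups (the 64-bit interface ID).
    set -- $(echo "$1" | tr ':' ' ')
    shift $(( $# - 4 ))
    # Zero-pad each group to four hex digits and join them into 16 digits.
    iid=$(printf '%04x%04x%04x%04x' "0x$1" "0x$2" "0x$3" "0x$4")
    # Drop the ff:fe filler from the middle to get the last five MAC bytes.
    rest=$(echo "$iid" | sed 's/^..\(..\)\(..\)fffe\(..\)\(..\)\(..\)$/\1:\2:\3:\4:\5/')
    # Flip the universal/local bit (0x02) of the first byte.
    first=$(echo "$iid" | cut -c1-2)
    printf '%02x:%s\n' $(( 0x$first ^ 2 )) "$rest"
}

eui64_to_mac fc00:5896:ff23:44dc:f816:3eff:fea8:7ef6   # o-hm0
eui64_to_mac fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf   # amphora port
```

The first call prints fa:16:3e:a8:7e:f6, which matches HM0_MAC, and the second prints fa:16:3e:0d:9a:cf, the dl_dst seen in the captured packet. The one-byte difference in the hand-built DST_MAC may be why that packet was flooded out of every tunnel while the captured one was sent down a single tunnel.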

At this point it seems that the flows are permitting the packets in theory but not in practice. This strongly suggests that the problem is not the rules themselves but the way ovs is handling them. However, investigating ovs itself is likely to be a long process involving compiling ovs, so I really want to make sure I have ruled out any other factors. The next thing to try is manually adding promiscuous flow rules to see if that fixes connectivity. In this deployment port 2 is the patch cable between br-int and br-tun and the host port o-hm0 is port 3:

$ sudo ovs-vsctl list interface | grep -B1 -E "ofport.*\s(2|3)"
name                : patch-tun
ofport              : 2
--
name                : o-hm0
ofport              : 3
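One caveat with the grep pattern used here: nothing anchors the end of the number, so a pattern for port 5 would also match ports 50 to 59. A slightly stricter lookup can be sketched with awk, working on the same ovs-vsctl output; iface_by_ofport is a made-up name:

```shell
# Sketch: read `ovs-vsctl list interface` output on stdin and print the name
# of the interface whose ofport equals $1. Relies on the name line coming
# before the ofport line in each record, as in the output above.
iface_by_ofport() {
    awk -v want="$1" '
        $1 == "name"                 { name = $3; gsub(/"/, "", name) }
        $1 == "ofport" && $3 == want { print name }
    '
}

# Example (needs a host running OVS):
#   sudo ovs-vsctl list interface | iface_by_ofport 5
```

Matching on the whole first field also avoids picking up the ofport_request column by accident.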

The commands below add rules that allow any icmp6 traffic on both ports:

sudo ovs-ofctl add-flow br-int table=0,priority=96,icmp6,in_port=2,actions=NORMAL
sudo ovs-ofctl add-flow br-int table=0,priority=96,icmp6,in_port=3,actions=NORMAL

Once again connectivity is restored:

$ ping -c2  fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf
PING fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf(fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf) 56 data bytes
64 bytes from fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf: icmp_seq=1 ttl=64 time=1.49 ms
64 bytes from fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf: icmp_seq=2 ttl=64 time=1.34 ms

--- fc00:5896:ff23:44dc:f816:3eff:fe0d:9acf ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.342/1.416/1.490/0.074 ms

Now remove the rules again.

$ sudo ovs-ofctl --strict del-flows br-int table=0,priority=96,icmp6,in_port=2
$ sudo ovs-ofctl --strict del-flows br-int table=0,priority=96,icmp6,in_port=3

At this point it seems clear that the issue is with ovs. I did do one other test before investigating ovs itself, and that was to dump the flows from a working deployment with ovs 2.15 (using the handy --no-stats option to remove the packet and byte counts). I then upgraded, dumped the flows again and diffed the two sets. Other than the cookie values changing and the order of some rules with the same priority changing, the rules were the same.
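Both kinds of noise in that diff (cookie values and ordering) can be removed before comparing. The filter below is a sketch of one way to do that; normalize_flows and the file names in the example are made up:

```shell
# Sketch: make two `ovs-ofctl --no-stats dump-flows` captures comparable by
# removing the cookie field and sorting, so that cookie changes and
# reordering of otherwise identical rules do not show up in the diff.
normalize_flows() {
    sed 's/cookie=0x[0-9a-f]*, *//' | sort
}

# Example, with before.flows and after.flows as hypothetical capture files:
#   normalize_flows < before.flows > before.norm
#   normalize_flows < after.flows  > after.norm
#   diff before.norm after.norm
```

Sorting discards the original rule order, which is acceptable here because only rules of equal priority had moved.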

Friday, 23 April 2021

Controlling service interrupting events in the OpenStack Charms

The new deferred service event feature is arriving in the 21.04 OpenStack charm release. This will allow an operator to stop services from being restarted in some of the charms. This means interruptions to the data plane can be tightly controlled.

Managing deferred service events

The deferred service event feature is off by default but can be enabled by updating the enable-auto-restarts charm config option.

$ juju config neutron-gateway enable-auto-restarts=False

Triggering a deferred service restart via a charm change

Changing the neutron-gateway charm's 'debug' option causes neutron.conf to be updated, and a change to neutron.conf normally triggers a restart of the neutron services. However, when auto restarts are disabled the charm updates neutron.conf but does not restart the neutron services. Instead it lets the operator know, via the workload status, that a restart is needed.

$ juju config neutron-gateway debug=True
$ juju status neutron-gateway
Model              Controller              Cloud/Region             Version  SLA          Timestamp
zaza-cfafc581b686  gnuoy-serverstack-nons  serverstack/serverstack  2.8.8    unsupported  10:02:19Z

App              Version  Status  Scale  Charm            Store  Channel  Rev  OS      Message
neutron-gateway  15.3.2   active      1  neutron-gateway  local            65  ubuntu  Unit is ready. Services queued for restart: neutron-dhcp-agent, neutron-l3-agent, neutron-metadata-agent, neutron-metering-agent, neutron-openvswitch-agent

Unit                Workload  Agent  Machine  Public address  Ports  Message
neutron-gateway/0*  active    idle   5        172.20.0.37            Unit is ready. Services queued for restart: neutron-dhcp-agent, neutron-l3-agent, neutron-metadata-agent, neutron-metering-agent, neutron-openvswitch-agent

Machine  State    DNS          Inst id                               Series  AZ    Message
5        started  172.20.0.37  9cc5c808-9c85-4b23-aaca-ded6ba666d33  bionic  nova  ACTIVE

Triggering a deferred hook

There are some occasions when it is not safe for a hook to run at all if the charm is deferring events. For example, suppose the rabbitmq-server charm were to switch from plain text mode to TLS. If the rabbit daemon is not restarted then it will continue to run without TLS, so the clients cannot be told to switch to TLS as they would no longer be able to connect. In this case it is not safe to update the rabbitmq config without restarting the service, because the service may get restarted for an unexpected reason like a server restart. If an unexpected restart happens rabbit will flip to the new config and the clients will be left trying to talk plain text to a TLS-only service. To avoid this the charm may defer running the entire hook. If this happens it will also be visible in the workload status message.

$ juju config neutron-openvswitch disable-mlockall=False
$ juju status neutron-openvswitch/0
Model              Controller              Cloud/Region             Version  SLA          Timestamp
zaza-cfafc581b686  gnuoy-serverstack-nons  serverstack/serverstack  2.8.8    unsupported  10:44:12Z

App                  Version  Status  Scale  Charm                Store       Channel  Rev  OS      Message
neutron-openvswitch  15.3.2   active      1  neutron-openvswitch  charmstore           433  ubuntu  Unit is ready. Hooks skipped due to disabled auto restarts: config-changed
nova-compute         20.5.0   active      1  nova-compute         charmstore           539  ubuntu  Unit is ready

Unit                      Workload  Agent  Machine  Public address  Ports  Message
nova-compute/0*           active    idle   7        172.20.0.6             Unit is ready
  neutron-openvswitch/0*  active    idle            172.20.0.6             Unit is ready. Hooks skipped due to disabled auto restarts: config-changed

Machine  State    DNS         Inst id                               Series  AZ    Message
7        started  172.20.0.6  f160add9-ec68-4658-9688-da6dc7cb8c44  bionic  nova  ACTIVE

Triggering a deferred service restart via package change

The charms also ensure that package updates do not trigger restarts of key services. This still applies when the package update happens outside of a charm hook or action. If the update does happen outside of the charm then the next update-status hook will spot that a restart is needed and display that in the workload status message.

$ juju run --unit neutron-gateway/0 "dpkg-reconfigure openvswitch-switch; ./hooks/update-status"
active
active
active
active
active
invoke-rc.d: policy-rc.d denied execution of restart.
$ juju status neutron-gateway
Model              Controller              Cloud/Region             Version  SLA          Timestamp
zaza-cfafc581b686  gnuoy-serverstack-nons  serverstack/serverstack  2.8.8    unsupported  10:26:46Z

App              Version  Status  Scale  Charm            Store  Channel  Rev  OS      Message
neutron-gateway  15.3.2   active      1  neutron-gateway  local            65  ubuntu  Unit is ready. Services queued for restart: openvswitch-switch

Unit                Workload  Agent  Machine  Public address  Ports  Message
neutron-gateway/0*  active    idle   5        172.20.0.37            Unit is ready. Services queued for restart: openvswitch-switch

Machine  State    DNS          Inst id                               Series  AZ    Message
5        started  172.20.0.37  9cc5c808-9c85-4b23-aaca-ded6ba666d33  bionic  nova  ACTIVE

Triggering a deferred service restart via OpenStack upgrade

Perhaps the most interesting scenario is actually an OpenStack upgrade. In this case the package update is triggered by updating the charm's openstack-origin option. With deferred service events enabled the long-running upgrade will complete without interrupting access to guests:

$ juju run --unit neutron-gateway/0 "pgrep ovs-vswitchd; dpkg -l | grep neutron-common"
30718
ii  neutron-common                       2:15.3.2-0ubuntu1~cloud2                                    all          Neutron is a virtual network service for Openstack - common
$ juju config neutron-gateway openstack-origin
cloud:bionic-train
$ juju config neutron-gateway openstack-origin=cloud:bionic-ussuri
$ juju run --unit neutron-gateway/0 "pgrep ovs-vswitchd; dpkg -l | grep neutron-common"
30718
ii  neutron-common                       2:16.3.0-0ubuntu3~cloud0                                    all          Neutron is a virtual network service for Openstack - common
$ juju status neutron-gateway/0
Model              Controller              Cloud/Region             Version  SLA          Timestamp
zaza-cfafc581b686  gnuoy-serverstack-nons  serverstack/serverstack  2.8.8    unsupported  14:13:04Z

App              Version  Status  Scale  Charm            Store  Channel  Rev  OS      Message
neutron-gateway  16.3.0   active      1  neutron-gateway  local            65  ubuntu  Unit is ready. Services queued for restart: neutron-dhcp-agent, neutron-dhcp-agent.service, neutron-l3-agent, neutron-l3-agent.service, neutron-metadata-agent, neutron-metadata-agent.service, neutron-metering-agent, neutron-metering-agent.service, neutron-openvswitch-agent, neutron-openvswitch-agent.service, openvswitch-switch

Unit                Workload  Agent  Machine  Public address  Ports  Message
neutron-gateway/0*  active    idle   5        172.20.0.37            Unit is ready. Services queued for restart: neutron-dhcp-agent, neutron-dhcp-agent.service, neutron-l3-agent, neutron-l3-agent.service, neutron-metadata-agent, neutron-metadata-agent.service, neutron-metering-agent, neutron-metering-agent.service, neutron-openvswitch-agent, neutron-openvswitch-agent.service, openvswitch-switch

Machine  State    DNS          Inst id                               Series  AZ    Message
5        started  172.20.0.37  9cc5c808-9c85-4b23-aaca-ded6ba666d33  bionic  nova  ACTIVE

Running a service restart

The charms provide a restart-services action which accepts a deferred-only option. When the action is run with deferred-only=True the charm will check which services are in need of a restart and restart them. For example, to clear all deferred restarts:

$ juju run-action neutron-gateway/0 restart-services deferred-only=True --wait
unit-neutron-gateway-0:
  UnitId: neutron-gateway/0
  id: "238"
  results:
    Stdout: |
      active
      active
      active
      active
      active
  status: completed
  timing:
    completed: 2021-04-23 10:07:19 +0000 UTC
    enqueued: 2021-04-23 10:06:42 +0000 UTC
    started: 2021-04-23 10:06:45 +0000 UTC

Note: If a service is restarted manually then the charm's workload status message will be updated when the next hook runs.

Running a deferred hook

The charms provide a run-deferred-hooks action which will run any hooks which have been deferred. Any service restarts that are marked as deferred will be restarted as part of running this action.

$ juju run-action neutron-openvswitch/0 run-deferred-hooks  --wait

Showing details of deferred events

The charms provide a show-deferred-events action. This will list the events that have been deferred with some extra detail.

$ juju run-action neutron-gateway/0 show-deferred-events  --wait;
unit-neutron-gateway-0:
  UnitId: neutron-gateway/0
  id: "256"
  results:
    output: |
      hooks: []
      restarts:
      - 1619173568 openvswitch-switch                       Package update
      - 1619175335 openvswitch-switch                       Package update
      - '1619181884 neutron-dhcp-agent                       File(s) changed: /etc/neutron/dhcp_agent.ini,
        /etc/neutron/neutron.conf'
      - '1619181884 neutron-l3-agent                         File(s) changed: /etc/neutron/neutron.conf'
      - '1619181884 neutron-metadata-agent                   File(s) changed: /etc/neutron/neutron.conf'
      - '1619181884 neutron-metering-agent                   File(s) changed: /etc/neutron/neutron.conf'
      - '1619181884 neutron-openvswitch-agent                File(s) changed: /etc/neutron/neutron.conf'
  status: completed
  timing:
    completed: 2021-04-23 12:44:57 +0000 UTC
    enqueued: 2021-04-23 12:44:56 +0000 UTC
    started: 2021-04-23 12:44:56 +0000 UTC

Under the hood

Recording deferred events

When a charm or package needs to restart a service but cannot, this is recorded in a file in /var/lib/policy-rc.d. These files have the following format:

# cat /var/lib/policy-rc.d/charm-neutron-gateway-6df8252a-a422-11eb-a3e0-fa163e25ff5d.deferred 
{
    action: restart,
    policy_requestor_name: neutron-gateway,
    policy_requestor_type: charm,
    reason: Package update,
    service: openvswitch-switch,
    timestamp: 1619175335}

This shows that the deferred action was a restart against the openvswitch-switch service. The time at which the request was made is recorded in seconds since the epoch and can be converted using the date command:

$ date -d @1619175335
Fri 23 Apr 11:55:35 BST 2021

The file also shows that the restart was requested because a package was updated. Finally the policy_requestor_name and policy_requestor_type keys show that the neutron-gateway charm is requesting that restarts of the service are denied.

These files are read by the update-status hook. The charm checks the timestamp in the file against the start time of the service. If the service was restarted after the timestamp in the file the file is removed and that deferred event is considered to be complete. Otherwise the events are summarised in the workload status message.

This means that deferred restarts can be cleared by restarting the service manually, by removing the deferred event file or by running the restart-services action mentioned earlier.
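The timestamp comparison the charm makes can be sketched as follows. This illustrates the rule as described in this post and is not the charm's actual code; restart_still_deferred is a made-up name and the service start time in the example is invented:

```shell
# Sketch of the comparison described above, not the charm's actual code.
# $1 = timestamp from the .deferred file, $2 = time the service last started,
# both in seconds since the epoch. Succeeds while the restart is still owed.
restart_still_deferred() {
    [ "$2" -le "$1" ]
}

# Values here are illustrative: the timestamp comes from the file shown
# earlier, the start time is made up.
if restart_still_deferred 1619175335 1619170000; then
    echo "openvswitch-switch: restart still deferred"
else
    echo "openvswitch-switch: restart has happened, event can be cleared"
fi
```

On a systemd host the service start time is exposed as the ActiveEnterTimestamp property (systemctl show -p ActiveEnterTimestamp openvswitch-switch); whether the charm reads it that way is not stated here.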

Integration with packaging

The charm makes use of the policy-rc.d interface. When a package wishes to interact with a service it runs /usr/sbin/policy-rc.d with the name of the service and the action it wishes to take. The return code of the script tells the packaging system whether the action was permitted or not. The charm ships its own implementation of the policy-rc.d script. This script decides whether an action is permitted by examining policy files in /etc/policy-rc.d. These policy files list which actions against which services are denied.

# cat /etc/policy-rc.d/charm-neutron-gateway.policy

# Managed by juju
blocked_actions:
  neutron-dhcp-agent: [restart, stop, try-restart]
  neutron-l3-agent: [restart, stop, try-restart]
  neutron-metadata-agent: [restart, stop, try-restart]
  neutron-metering-agent: [restart, stop, try-restart]
  neutron-openvswitch-agent: [restart, stop, try-restart]
  openvswitch-switch: [restart, stop, try-restart]
  ovs-vswitchd: [restart, stop, try-restart]
  ovsdb-server: [restart, stop, try-restart]
policy_requestor_name: neutron-gateway
policy_requestor_type: charm

The charm that wrote the policy file is indicated by the policy_requestor_name key and the blocked_actions key lists which actions are blocked for each service.
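To make the lookup concrete, here is a minimal sketch of how a policy file in this format could be checked for a given service and action. It is not the script the charm ships, and policy_check is a made-up name. The return value 101 is the policy-rc.d code for an action forbidden by policy:

```shell
# Sketch of a policy lookup, not the script the charm ships.
# $1 = service, $2 = action, $3 = policy file in the format shown above.
# Returns 101 (the policy-rc.d code for "action forbidden by policy") when
# the action is listed for the service, 0 otherwise.
policy_check() {
    blocked=$(sed -n "s/^  $1: \[\(.*\)]\$/\1/p" "$3" | tr -d ' ')
    case ",$blocked," in
        *",$2,"*) return 101 ;;
        *)        return 0 ;;
    esac
}

# Example (path as in the listing above):
#   policy_check openvswitch-switch restart /etc/policy-rc.d/charm-neutron-gateway.policy
#   echo $?
```

Wrapping the list in commas before matching means that a request for start is not mistaken for the blocked restart or try-restart actions.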

Thursday, 13 August 2020

Raspberry Pi 4 Juju Controller

I have a MAAS cluster at home to test out Juju, OpenStack, Kubernetes etc. The cluster is really a collection of machines that have been carefully selected for their cheapness and ability to fit in a small space under a spiral staircase. Since its inception I have used an old laptop with a broken screen as a Juju controller. Unfortunately the laptop did not take kindly to having tea spilt on it a few months ago and so I was forced to replace it. I was feeling flush at the time so decided to go for a Raspberry Pi 4. Unfortunately I failed to spot that the Raspberry Pi 4 does not actually network boot like a conventional machine and so I wouldn't be able to enlist it in MAAS. Fortunately while working on something else I spotted that Juju has introduced the idea of managing multiple clouds with one controller. This gave me the idea of setting up a controller on the Pi and then adding MAAS as an additional cloud to it.

Prepare the Pi

Burn an Ubuntu image onto the SD card; I went for the Ubuntu 20.04.1 64-bit server OS for arm64. Give the Pi a static IP address. I haven't tested it but I'm sure the Pi needs to have a wired network connection rather than wireless, otherwise access to containers via the macvlan interface will break.

I believe it would be possible to set the controller up using the Manual cloud type, but I prefer the idea of the controller being in its own LXD container so I went for that approach. To configure LXD, initialise it and then disable IPv6 as it doesn't currently work with Juju.

ubuntu@pi-controller-host:~$ lxd init --auto
ubuntu@pi-controller-host:~$ lxc network set lxdbr0 ipv6.address none

The controller needs to have access to the network the MAAS nodes are going to be deployed on. The primary interface on the LXD containers could be switched to be macvlan rather than bridged, but unfortunately the LXD host would then lose access to the containers, because a host cannot talk directly to macvlan interfaces created on top of its own NIC. A neat solution is to give the LXD container two NICs: one on the bridged network, which allows the host (pi-controller-host in my case) access to the containers, and one macvlan interface, which gives the wider network access to the containers.

This Pi is only going to be used as a controller so I make these network changes in the default lxd profile.

ubuntu@pi-controller-host:~$ cat two-nics.yaml
config:
  user.network-config: |
    version: 1
    config:
      - type: physical
        name: eth0
        subnets:
          - type: dhcp
            ipv4: true
      - type: physical
        name: eth1
        subnets:
          - type: dhcp
            ipv4: true
description: Default LXD profile
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  eth1:
    name: eth1
    nictype: macvlan
    parent: eth0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
ubuntu@pi-controller-host:~$ lxc profile edit default < two-nics.yaml

Install Juju Controller

Install the juju client. In this case I needed to do some testing with Juju 2.8.2 so I went for the candidate channel but normally I'd go for the latest/stable channel.

ubuntu@pi-controller-host:~$ sudo snap install --classic  --channel=2.8/candidate juju                                                                                                                                                                                               
juju (2.8/candidate) 2.8.2 from Canonical✓ installed

Next create the Juju controller.

ubuntu@pi-controller-host:~$ juju bootstrap localhost pi-lxd-controller
Creating Juju controller "pi-lxd-controller" on localhost/localhost
Looking for packaged Juju agent version 2.8.2 for arm64                                                                              
No packaged binary found, preparing local Juju agent binary                                      
To configure your system to better support LXD containers, please see: https://github.com/lxc/lxd/blob/master/doc/production-setup.md
Launching controller instance(s) on localhost/localhost...                                         
 - juju-b84c1e-0 (arch=arm64)                                                                     
Installing Juju agent on bootstrap instance                       
Fetching Juju Dashboard 0.2.0                                                                  
Waiting for address                                                                              
Attempting to connect to 10.119.95.251:22                                                          
Attempting to connect to 172.16.0.20:22                                                         
Connected to 10.119.95.251                                                                                                
Running machine configuration script...                                                                             
Bootstrap agent now started                                                                         
Contacting Juju controller at 10.119.95.251 to verify accessibility...                          
                                                                                                 
Bootstrap complete, controller "pi-lxd-controller" is now available                             
Controller machines are in the "controller" model                                               
Initial model "default" added

The Juju controller is now set up to deploy applications into local LXD containers. But the idea here is to use this controller for deployments to MAAS. To achieve this we need to add the MAAS cloud to the controller:

ubuntu@pi-controller-host:~$ juju add-cloud --controller pi-lxd-controller     
Cloud Types                                                                   
  lxd                                                                              
  maas                                                                             
  manual                                                                            
  openstack                                                                         
  vsphere                                                                             
                                                                                                       
Select cloud type: maas                                                                    
                                                                                             
Enter a name for your maas cloud: homemaas                                                                                     

Enter the API endpoint url: http://172.16.0.2:5240/MAAS

Cloud "" added to controller "pi-lxd-controller".
WARNING loading credentials: credentials for cloud homemaas not found
To upload a credential to the controller for cloud "homemaas", use
* 'add-model' with --credential option or
* 'add-credential -c homemaas'.

Giving the controller an api key for the MAAS environment is a separate step:

ubuntu@pi-controller-host:~$ juju add-credential --controller pi-lxd-controller homemaas
Using cloud "homemaas" from the controller to verify credentials.
Enter credential name: admin

Regions
  default

Select region [any region, credential is not region specific]:

Using auth-type "oauth1".

Enter maas-oauth:

Controller credential "admin" for user "admin" for cloud "homemaas" on controller "pi-lxd-controller" added.
For more information, see ‘juju show-credential homemaas admin’.

Since I opted for a candidate version of Juju I need to tell Juju where to look for the agent code:

ubuntu@pi-controller-host:~$ juju model-defaults homemaas agent-stream=devel

Deploying to MAAS

A Juju model is specific to a cloud and at this point the controller can deploy to two different clouds: LXD and MAAS. To differentiate between the two when adding a model the cloud needs to be specified:

ubuntu@pi-controller-host:~$ juju add-model maas-model homemaas    
Added 'maas-model' model on homemaas/default with credential 'admin' for user 'admin'

I can now deploy applications using MAAS via my LXD controller on the Pi:

ubuntu@pi-controller-host:~$ juju deploy ubuntu
Located charm "cs:ubuntu-15".
Deploying charm "cs:ubuntu-15".
ubuntu@pi-controller-host:~$ juju status
Model       Controller         Cloud/Region      Version  SLA          Timestamp
maas-model  pi-lxd-controller  homemaas/default  2.8.2    unsupported  09:05:39Z

App     Version  Status  Scale  Charm   Store       Rev  OS      Notes
ubuntu  18.04    active      1  ubuntu  jujucharms   15  ubuntu

Unit       Workload  Agent  Machine  Public address  Ports  Message
ubuntu/0*  active    idle   0        172.16.0.104           ready

Machine  State    DNS           Inst id   Series  AZ       Message
0        started  172.16.0.104  warm-dog  bionic  default  Deployed

Accessing Controller Remotely

I tend to keep all my deployment config on the MAAS server so it makes sense to be able to administer the Juju deployment from there. To set this up I needed to create a new user with Juju and then use that user on the MAAS server.

ubuntu@pi-controller-host:~$ juju add-user fawkesadmin                
User "fawkesadmin" added                                                                 
Please send this command to fawkesadmin:                                        
    juju register MGwTC2Zhd2tlc2FkbWluMCgTEzVERYLONGSTRING
                                                                                 
"fawkesadmin" has not been granted access to any models. You can use "juju grant" to grant access.

Next I need to grant that user superuser perms on the controller:

ubuntu@pi-controller-host:~$ juju grant fawkesadmin superuser -c pi-lxd-controller

Finally, on the MAAS server I can register that controller:

liam@fawkes:~$ juju register MGwTC2Zhd2tlc2FkbWluMCgTEzVERYLONGSTRING
Enter a new password:                                
Confirm password:                                                                                
Enter a name for this controller [pi-lxd-controller]:                                                                                                                    
Initial password successfully set for fawkesadmin.         
                                                                                                                            
Welcome, fawkesadmin. You are now logged into "pi-lxd-controller".                                  
                                                                                                                                
There are 3 models available. Use "juju switch" to select                                                                        
one of them:                                                                                                                
  - juju switch admin/controller                                              
  - juju switch admin/default                                                                                                    
  - juju switch admin/maas-model

I now need to add the MAAS creds the fawkesadmin user is going to use (I used the same API key as before):

liam@fawkes:~$ juju add-credential -c pi-lxd-controller homemaas                   
Using cloud "homemaas" from the controller to verify credentials.
Enter credential name: admin                                                                                                                           
                                                 
Regions                                                                   
  default                                                                          
                                                                       
Select region [any region, credential is not region specific]:                   
                                                                                                                            
Using auth-type "oauth1".                                                                          
                                                                   
Enter maas-oauth:                                                             
                                                                          
Controller credential "admin" for user "fawkesadmin" for cloud "homemaas" on controller "pi-lxd-controller" added.
For more information, see ‘juju show-credential homemaas admin’.

Friday, 12 July 2019

OVS + DPDK Bond + VLAN + Network Namespace

I was recently investigating a bug in an OpenStack deployment. Guests were failing to receive metadata on boot. Digging into the deployment revealed that the service providing the metadata was running inside a network namespace. The namespace was attached to an ovs bridge using a tap device, and the tap device's ovs port was associated with a specific vlan id. The ovs bridge was in turn using dpdk, and for external network access two network cards were bonded together using an ovs dpdk bond. After grabbing a cup of tea to calm my nerves I started reading a fair amount of documentation and running a few tests. As far as I could tell everything was configured as it should be.

To be able to file a bug I needed to be able to reproduce this setup on the latest version of each piece of software, preferably without needing a full OpenStack deployment. Then I could start removing layers of complexity until I had the simplest reproducer of the bug.

Although no one single part of the process of reproducing this setup was particularly difficult it did involve a fair few moving parts and below I run through them (mainly for my benefit when next week I've forgotten everything I did).

DPDK

I was lucky enough to have access to a server with two dpdk compatible network cards which I could deploy using maas. The server had also been set up to have hugepages created on install. This was done by creating a custom maas tag and assigning it to the server:

ubuntu@maas:~⟫ maas maas tag read dpdk
Success.
Machine-readable output follows:
{
    "definition": "",
    "name": "dpdk",
    "resource_uri": "/MAAS/api/2.0/tags/dpdk/",
    "kernel_opts": "hugepages=103117 iommu=pt intel_iommu=on",
    "comment": "DPDK enabled machines"
}

After installing the development release of Ubuntu (eoan) on the server it was time to install the dpdk and ovs packages.

ubuntu@node-licetus:~$ lsb_release -a                                                             
No LSB modules are available.                                                                              
Distributor ID: Ubuntu                                                                             
Description:    Ubuntu Eoan Ermine (development branch)                                               
Release:        19.10                                                                                  
Codename:       eoan                                                                                   
ubuntu@node-licetus:~$ sudo apt-get -q install -y dpdk openvswitch-switch-dpdk

Update the system to use openvswitch-switch-dpdk for ovs-vswitchd.

ubuntu@node-licetus:~$ sudo update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk
update-alternatives: using /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in manual mode

dpdk references the network cards via their PCI address. One way to find the PCI address of a network card, given its MAC address, is to examine the web of files and symbolic links in the /sys filesystem.

ubuntu@node-licetus:~$ grep -E 'a0:36:9f:dd:31:bc|a0:36:9f:dd:31:be' /sys/class/net/*/address
/sys/class/net/enp3s0f0/address:a0:36:9f:dd:31:bc
/sys/class/net/enp3s0f1/address:a0:36:9f:dd:31:be

ubuntu@node-licetus:~$ ls -ld /sys/class/net/enp3s0f0 /sys/class/net/enp3s0f1 | awk '{print $NF}' | awk 'BEGIN {FS="/"} {print $6}'
0000:03:00.0
0000:03:00.1
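The same lookup can be sketched in Python. Rather than relying on the PCI address always being the sixth path component, this version takes the last PCI-style component before the net directory. The function names are my own, and the example link target used in the usage note (including the upstream bridge address 0000:00:02.0) is made up for illustration; only the final address matches the output above.

```python
import os
import re

# A PCI address has the form domain:bus:device.function, e.g. 0000:03:00.0
PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def pci_address_from_link(link_target):
    """Return the PCI address of the NIC from a /sys/class/net/<if> link target."""
    parts = link_target.split("/")
    if "net" in parts:
        # Only look at the components that come before the "net" directory.
        parts = parts[:parts.index("net")]
    candidates = [p for p in parts if PCI_ADDR.match(p)]
    # The last match is the NIC itself; earlier matches are upstream bridges.
    return candidates[-1] if candidates else None


def pci_address(ifname):
    """Resolve the symlink for ifname on a live system (needs /sys)."""
    return pci_address_from_link(os.readlink(os.path.join("/sys/class/net", ifname)))
```

For a link target such as ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0 the helper returns 0000:03:00.0, and for a virtual device with no PCI parent it returns None.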

To switch the network cards from being kernel managed to being managed by dpdk, the /etc/dpdk/interfaces file is updated and dpdk restarted. When this is done the network cards will disappear from tools like ip.

root@node-licetus:~# ip -br addr show | grep enp
enp3s0f0         UP             fe80::a236:9fff:fedd:31bc/64 
enp3s0f1         UP             fe80::a236:9fff:fedd:31be/64 
root@node-licetus:~# echo "pci 0000:03:00.0 vfio-pci
> pci 0000:03:00.1 vfio-pci" >> /etc/dpdk/interfaces
root@node-licetus:~# systemctl restart dpdk
root@node-licetus:~# ip -br addr show | grep enp
root@node-licetus:~#
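The echo with >> above will add the same lines again if it is re-run. A small sketch of generating the entries and appending only the missing ones is shown below. The "pci <address> <driver>" line format is taken from the transcript above; the function names are hypothetical and the code works on text rather than writing to /etc/dpdk/interfaces directly.

```python
def dpdk_interface_lines(pci_addresses, driver="vfio-pci"):
    """Build one 'pci <address> <driver>' line per network card."""
    return ["pci {} {}".format(addr, driver) for addr in pci_addresses]


def append_missing(existing_text, new_lines):
    """Return existing_text with any lines not already present appended."""
    present = set(line.strip() for line in existing_text.splitlines())
    missing = [line for line in new_lines if line not in present]
    if not missing:
        return existing_text
    # Make sure the new entries start on their own line.
    sep = "" if existing_text.endswith("\n") or not existing_text else "\n"
    return existing_text + sep + "\n".join(missing) + "\n"
```

Passing the current contents of the file and the output of dpdk_interface_lines(["0000:03:00.0", "0000:03:00.1"]) to append_missing gives the text to write back, and running it a second time leaves the text unchanged.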

DPDK enabled OVS

There are a few global settings which need to be applied when using ovs with dpdk. The first relates to hugepages. Hugepages need to be allocated per NUMA node. First check that the hugepages have been created as requested by the kernel option specified in maas:

root@node-licetus:~# grep -i hugepages_ /proc/meminfo
HugePages_Total:   103117
HugePages_Free:    103117
HugePages_Rsvd:        0
HugePages_Surp:        0

Now see how many NUMA nodes there are:

# ls -ld /sys/devices/system/node/node* | wc -l
2

I chose to allocate 4096MB to each of the NUMA nodes. This is done via the dpdk-socket-mem option which takes a comma delimited list of memory sizes in MB, one per NUMA node, as its value:

root@node-licetus:~# ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="4096,4096"
root@node-licetus:~#
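To make the relationship between the NUMA node count and the option value explicit, here is a minimal sketch that counts the nodes the same way as the ls command above and builds the matching ovs-vsctl arguments. The function names are my own, and it assumes every node gets the same amount of memory, as in this setup.

```python
import glob
import os


def count_numa_nodes(base="/sys/devices/system/node"):
    """Count node0, node1, ... directories (needs a live Linux system)."""
    return len(glob.glob(os.path.join(base, "node[0-9]*")))


def socket_mem_value(numa_nodes, mb_per_node):
    """Build the comma delimited value, one entry in MB per NUMA node."""
    return ",".join(str(mb_per_node) for _ in range(numa_nodes))


def socket_mem_command(numa_nodes, mb_per_node):
    """Return the ovs-vsctl argument list that sets dpdk-socket-mem."""
    value = socket_mem_value(numa_nodes, mb_per_node)
    return ["ovs-vsctl", "set", "Open_vSwitch", ".",
            "other_config:dpdk-socket-mem={}".format(value)]
```

With two NUMA nodes and 4096MB each, socket_mem_value(2, 4096) gives 4096,4096, which is the value used above. The argument list can be handed to subprocess.run on the host.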

Next whitelist the network cards in ovs via their PCI addresses:

root@node-licetus:~# ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="--pci-whitelist 0000:03:00.0 --pci-whitelist 0000:03:00.1"
root@node-licetus:~#

Finally restart openvswitch-switch and check the log:

root@node-licetus:~# systemctl restart openvswitch-switch
root@node-licetus:~#

root@node-licetus:~# grep --color -E 'PCI|DPDK|ovs-vswitchd' /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T12:19:51.475Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T12:19:51.496Z|00007|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
2019-07-11T13:07:54.806Z|00009|dpdk|ERR|DPDK not supported in this copy of Open vSwitch.
2019-07-11T13:08:17.961Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T13:08:17.969Z|00007|dpdk|INFO|Using DPDK 18.11.0
2019-07-11T13:08:17.969Z|00008|dpdk|INFO|DPDK Enabled - initializing...
2019-07-11T13:08:17.969Z|00011|dpdk|INFO|Per port memory for DPDK devices disabled.
2019-07-11T13:08:17.969Z|00012|dpdk|INFO|EAL ARGS: ovs-vswitchd --pci-whitelist 0000:03:00.0 --pci-whitelist 0000:03:00.1 --socket-mem 4096,4096 --socket-limit 4096,4096 -l 0.
2019-07-11T13:08:26.915Z|00019|dpdk|INFO|EAL: PCI device 0000:03:00.0 on NUMA socket 0
2019-07-11T13:08:27.600Z|00023|dpdk|INFO|EAL: PCI device 0000:03:00.1 on NUMA socket 0
2019-07-11T13:08:28.090Z|00026|dpdk|INFO|DPDK Enabled - initialized
2019-07-11T13:08:28.097Z|00051|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
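The log above contains both the earlier failure (before the alternatives switch) and the later successful start, so it is the most recent dpdk message that matters. A small sketch of checking that automatically is below. It assumes the pipe-separated layout visible in the log lines above (timestamp, sequence number, module, level, message); the function names and the status strings it returns are my own.

```python
def parse_vlog_line(line):
    """Split one ovs-vswitchd log line into its five pipe-separated fields."""
    timestamp, seq, module, level, message = line.rstrip("\n").split("|", 4)
    return {"timestamp": timestamp, "seq": seq, "module": module,
            "level": level, "message": message}


def dpdk_status(lines):
    """Return 'initialized', 'unsupported' or 'unknown' from the latest dpdk message."""
    status = "unknown"
    for line in lines:
        if line.count("|") < 4:
            continue  # not a log record in the expected layout
        record = parse_vlog_line(line)
        if record["module"] != "dpdk":
            continue
        if "DPDK Enabled - initialized" in record["message"]:
            status = "initialized"
        elif "DPDK not supported" in record["message"]:
            status = "unsupported"
    return status


# Three lines copied from the log output above.
SAMPLE_LOG = [
    "2019-07-11T13:07:54.806Z|00009|dpdk|ERR|DPDK not supported in this copy of Open vSwitch.",
    "2019-07-11T13:08:17.969Z|00008|dpdk|INFO|DPDK Enabled - initializing...",
    "2019-07-11T13:08:28.090Z|00026|dpdk|INFO|DPDK Enabled - initialized",
]
```

Running dpdk_status over SAMPLE_LOG reports initialized, whereas the first line on its own reports unsupported.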

Bridge with DPDK Bonded NICs

As per the OVS docs, when creating the bridge the datapath_type needs to be set to netdev to tell ovs to run in userspace mode.

root@node-licetus:~# ovs-vsctl -- add-br br-test
root@node-licetus:~# ovs-vsctl -- set bridge br-test datapath_type=netdev

Now create the bond device and attach it to the bridge:

root@node-licetus:~# ovs-vsctl --may-exist add-bond br-test dpdk-bond0 dpdk-nic1 dpdk-nic2 \
>           -- set Interface dpdk-nic1 type=dpdk options:dpdk-devargs=0000:03:00.0 \
>           -- set Interface dpdk-nic2 type=dpdk options:dpdk-devargs=0000:03:00.1

ovs-vsctl seems quite happy to create the bond even if there is a problem, so it's worth taking a moment to check the device exists in the bridge without any errors:

root@node-licetus:~# ovs-vsctl show
181b55d1-999a-464b-adf4-d80ca1790988
    Bridge br-test
        Port br-test
            Interface br-test
                type: internal
        Port "dpdk-bond0"
            Interface "dpdk-nic1"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.0"}
            Interface "dpdk-nic2"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.1"}
    ovs_version: "2.11.0"

Part of my testing was to use jumbo frames so the final step is to set the mtu on the dpdk devices:

root@node-licetus:~# ovs-vsctl set Interface dpdk-nic1 mtu_request=9000
root@node-licetus:~# ovs-vsctl set Interface dpdk-nic2 mtu_request=9000
root@node-licetus:~#

Tap in Network Namespace Attached to a Bridge

Create a tap device called tap1 in the bridge:

root@node-licetus:~# ovs-vsctl add-port br-test tap1 -- set Interface tap1 type=internal
root@node-licetus:~#

Create a network namespace called ns1 and place tap1 into it:

root@node-licetus:~# ip netns add ns1
root@node-licetus:~# ip link set tap1 netns ns1

Bring up tap1 and assign it an IP address:

root@node-licetus:~# ip netns exec ns1 ip link set dev tap1 up
root@node-licetus:~# ip netns exec ns1 ip link set dev lo up
root@node-licetus:~# ip netns exec ns1 ip addr add 172.20.0.1/24 dev tap1

root@node-licetus:~# ip netns exec ns1 ip link set dev tap1 mtu 9000

Finally the network the tap needs to be on is vlan 2933 which is delivered to the network cards as part of a vlan trunk. To assign tap1 to the vlan with id 2933 the tap port needs to be tagged.

root@node-licetus:~# ovs-vsctl set port tap1 tag=2933
root@node-licetus:~#

Testing

That really is it. Below is the resulting bridge and tap interface:

root@node-licetus:~# ovs-vsctl show
181b55d1-999a-464b-adf4-d80ca1790988
    Bridge br-test
        Port "tap1"
            tag: 2933
            Interface "tap1"
                type: internal
        Port br-test
            Interface br-test
                type: internal
        Port "dpdk-bond0"
            Interface "dpdk-nic1"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.0"}
            Interface "dpdk-nic2"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.1"}
    ovs_version: "2.11.0"

root@node-licetus:~# ip netns exec ns1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
12: tap1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 0a:96:ad:cd:2e:7f brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.1/24 scope global tap1
       valid_lft forever preferred_lft forever
    inet6 fe80::896:adff:fecd:2e7f/64 scope link 
       valid_lft forever preferred_lft forever

From another machine that has access to vlan 2933 I can ping the tap device:

ubuntu@node-husband:~$ ping -c3 172.20.0.1
PING 172.20.0.1 (172.20.0.1) 56(84) bytes of data.
64 bytes from 172.20.0.1: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 172.20.0.1: icmp_seq=2 ttl=64 time=0.146 ms
64 bytes from 172.20.0.1: icmp_seq=3 ttl=64 time=0.144 ms

--- 172.20.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2046ms
rtt min/avg/max/mdev = 0.144/0.168/0.215/0.034 ms

Final Thoughts

If you want to use dpdk with OpenStack then the OpenStack charms make it easy. The charms look after all of the above and much more.

If you want a more complete guide to dpdk, try "The new simplicity to consume dpdk".

Tuesday, 16 April 2019

Quick OpenStack Charm Test with Zaza

I was investigating a bug recently that appeared to be a race condition causing RabbitMQ server to occasionally fail to cluster when the charm was upgraded. To attempt to reproduce the bug I decided to use the Python library zaza.

Firstly I created a tests & bundles directory:

$ mkdir -p tests/bundles/

Then I added a bundle in tests/bundles/ha.yaml:

series: xenial
applications:
  rabbitmq-server:
    charm: cs:rabbitmq-server
    constraints: mem=1G
    num_units: 3

Once the deployment was complete I needed to test upgrading the charm, so the next step was to define a test in tests/tests_rabbit_upgrade.py to do this:

#!/usr/bin/env python3

import unittest
import zaza.model

class UpgradeTest(unittest.TestCase):

    def test_upgrade(self):
        zaza.model.upgrade_charm(
            'rabbitmq-server',
            switch='cs:~openstack-charmers-next/rabbitmq-server-343')
        zaza.model.block_until_all_units_idle()

The last step is to tell Zaza what to do in a test run, which is done in tests/tests.yaml:

tests:
  - tests.tests_rabbit_upgrade.UpgradeTest
configure:
  - zaza.charm_tests.noop.setup.basic_setup
smoke_bundles:
  - ha

This tells Zaza which bundle to run for smoke tests and which test(s) to run. There is no configuration step needed but at the moment zaza expects one so it is pointed at a no-op method. Obviously you need to have zaza installed which can be done in a virtualenv:

$ virtualenv -q -ppython3 venv3                                                                                                   
Already using interpreter /usr/bin/python3                                                                                                                      
$ source venv3/bin/activate                                                                                                      
(venv3) $ pip install git+https://github.com/openstack-charmers/zaza.git#egg=zaza

Zaza will generate a new model for each run and does not bring the new model into focus, so it is safe to kick off the test run as many times as you like in parallel.

(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[1] 4768
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[2] 4790
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[3] 4795
(venv3) $ functest-run-suite --smoke --keep-model &> /tmp/run.$(uuid -v1) &
[4] 4811
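If you would rather start the runs from Python than from the shell, the sketch below does roughly the same thing as the four commands above: each run gets its own log file named with a version 1 UUID. The command and its flags are the ones used above; the function names are my own and it assumes functest-run-suite is on the PATH of the active virtualenv.

```python
import os
import subprocess
import uuid


def build_run(log_dir, run_id=None):
    """Return the command and log file path for one smoke run."""
    run_id = run_id or str(uuid.uuid1())
    command = ["functest-run-suite", "--smoke", "--keep-model"]
    return command, os.path.join(log_dir, "run.{}".format(run_id))


def launch_runs(count, log_dir="/tmp"):
    """Start count runs in parallel, each writing to its own log file."""
    procs = []
    for _ in range(count):
        command, log_path = build_run(log_dir)
        # The child keeps its own copy of the file descriptor, so the
        # parent can close the file as soon as the process has started.
        with open(log_path, "w") as log:
            procs.append(subprocess.Popen(
                command, stdout=log, stderr=subprocess.STDOUT))
    return procs
```

Calling launch_runs(4) and then wait() on each returned process is equivalent to backgrounding the four shell commands and waiting for them to finish.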

All four deployments and tests ran in parallel. Below is an example output:

$ cat /tmp/run.ed5fe80c-6045-11e9-83b5-77a3728f6280 
Controller: gnuoy-serverstack

Model              Cloud/Region             Type       Status      Machines  Cores  Access  Last connection
controller         serverstack/serverstack  openstack  available          1      4  admin   just now
default            serverstack/serverstack  openstack  available          0      -  admin   2018-11-13
zaza-310115a7b9f9  serverstack/serverstack  openstack  available          0      -  admin   just now
zaza-98856c9f8ec5  serverstack/serverstack  openstack  available          0      -  admin   just now
zaza-9c013d3f9ec9  serverstack/serverstack  openstack  available          0      -  admin   never connected

INFO:root:Deploying bundle './tests/bundles/ha.yaml' on to 'zaza-310115a7b9f9' model
Resolving charm: cs:rabbitmq-server
Executing changes:
- upload charm cs:rabbitmq-server-85 for series xenial
- deploy application rabbitmq-server on xenial using cs:rabbitmq-server-85
- add unit rabbitmq-server/0 to new machine 0
- add unit rabbitmq-server/1 to new machine 1
- add unit rabbitmq-server/2 to new machine 2
Deploy of bundle completed.
INFO:root:Waiting for environment to settle
INFO:root:Waiting for a unit to appear
INFO:root:Waiting for all units to be idle
INFO:root:Checking workload status of rabbitmq-server/0
INFO:root:Checking workload status message of rabbitmq-server/0
INFO:root:Checking workload status of rabbitmq-server/1
INFO:root:Checking workload status message of rabbitmq-server/1
INFO:root:Checking workload status of rabbitmq-server/2
INFO:root:Checking workload status message of rabbitmq-server/2
INFO:root:OK
INFO:root:## Running Test tests.tests_rabbit_upgrade.UpgradeTest ##
test_upgrade (tests.tests_rabbit_upgrade.UpgradeTest) ... ok

----------------------------------------------------------------------
Ran 1 test in 87.692s

OK

Finally, because the --keep-model switch was used all four models are still available for any post test inspection:

$ juju list-models
Controller: gnuoy-serverstack

Model              Cloud/Region             Type       Status     Machines  Cores  Access  Last connection
controller         serverstack/serverstack  openstack  available         1      4  admin   just now
default*           serverstack/serverstack  openstack  available         0      -  admin   2018-11-13
zaza-310115a7b9f9  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago
zaza-98856c9f8ec5  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago
zaza-9c013d3f9ec9  serverstack/serverstack  openstack  available         3      6  admin   5 minutes ago
zaza-a1b9f911850b  serverstack/serverstack  openstack  available         3      6  admin   6 minutes ago

To read more about zaza head over to https://zaza.readthedocs.io/en/latest/

Saturday, 13 April 2019

OpenStack Automated Instance Recovery

Who needs pets? It's all about the cattle in the cloud world, right? Unfortunately, reality has a habit of stepping on even the best laid plans. There is often one old legacy app (or more) which proves somewhat reticent to being turned into a self-healing, auto-load balancing cloud entity. In this case it does seem reasonable that if the app becomes unavailable the infrastructure should make some attempt to bring it back again. In the OpenStack world this is where projects like Masakari come in.

Masakari, when used in conjunction with masakari-monitors and Pacemaker, provides a service which detects the failure of a guest, or an entire compute node, and attempts to bring it back on line. The Masakari service provides an api and executor which reacts to notifications of a failure. Masakari-monitors, as the name suggests, detects failures and tells masakari about them. One mechanism it uses to detect failure is to monitor pacemaker, and when pacemaker reports a fellow compute node is down, masakari-monitors reports that to the Masakari service.

Masakari also provides two other recovery mechanisms. It provides a mechanism for detecting the failure of an individual guest and restarting that guest in situ. Finally, it also provides a mechanism for restarting an operating system process if it stops; this mechanism is not covered here and the charm is configured to disable it.

Below is a diagram from the Masakari Wiki showing the plumbing:

Introducing Pacemaker Remote

As mentioned above, Masakari monitors detect a compute node failure by querying the locally running cluster software. A hacluster charm already exists which will deploy corosync and pacemaker, and initial tests showed that this worked in a small environment. Unfortunately, corosync/pacemaker is not designed to scale much past 16 nodes (see the pacemaker documentation) and a limit of 16 compute nodes is not acceptable in most clouds.

This is where pacemaker remote comes in. A pacemaker remote node can run resources but does not run the full cluster stack. In this case corosync is not installed on remote nodes, so the pacemaker remote nodes do not participate in the constant chatter needed to keep the cluster XML definitions up to date and consistent across the nodes.

A new Pacemaker Remote charm was developed which relates to an existing cluster and can run resources from that cluster. In this example the pacemaker remotes do not actually run any resources, they are just used as a mechanism of querying the state of all the nodes in the cluster.

Unfortunately due to Bug #1728527 masakari-monitors fails to query pacemaker-remote properly. A patch has been proposed upstream. The patch is currently applied to the masakari-monitors package in stein in the Ubuntu Cloud Archive.

Charm Architecture

The masakari api service is deployed using a new masakari charm. This charm behaves in the same way as the other OpenStack API service charms. It expects relations with MySQL/Percona, RabbitMQ and Keystone. It can also be related to Vault to secure the endpoints with https. The only real difference to the other API charms is that it must be deployed in an ha configuration which means multiple units with an hacluster relation. The pacemaker-remote and masakari-monitors charms are both subordinates that need to be running on the compute nodes:

It might be expected that the masakari-monitors would have a direct relation to the masakari charm. In fact it does a lookup for the service in the catalogue so it only needs credentials which it obtains by having an identity-credentials relation with keystone.

STONITH

For a guest to be failed over to another compute host it must be using some form of shared storage. As a result, the scenario where a compute node has lost network access to its peers but continues to have access to shared storage must be considered. Masakari-monitors on a peer compute node registers that the compute node has gone and notifies the masakari api server. This in turn triggers the masakari engine to instigate a failover of the guests on that node via nova. Assuming that nova concurs that the compute node has gone, it will attempt to bring each guest up on another node. At this point there are two copies of the guest both trying to write to the same shared storage, which will likely lead to data corruption.

The solution is to enable stonith in pacemaker. When the cluster detects a node has disappeared, it runs a stonith plugin to power the compute node off. To enable stonith the hacluster charm now ships with a maas plugin that the stonith resource can use. The maas details are provided by the existing hacluster maas_url and maas_credentials config options. This allows the stonith resource to power off a node via the maas api which has the added advantage of abstracting the charm away from the individual power management system being used in this particular deployment.

Adding Masakari to a deployment

Below is an example of using an overlay to add masakari to an existing deployment. Much of the configuration in the yaml below is specific to each deployment and will need updating:

machines:
  '0':
    series: bionic
  '1':
    series: bionic
  '2':
    series: bionic
  '3':
    series: bionic
relations:
- - nova-compute:juju-info
  - masakari-monitors:container
- - masakari:ha
  - hacluster:ha
- - keystone:identity-credentials
  - masakari-monitors:identity-credentials
- - nova-compute:juju-info
  - pacemaker-remote:juju-info
- - hacluster:pacemaker-remote
  - pacemaker-remote:pacemaker-remote
- - masakari:identity-service
  - keystone:identity-service
- - masakari:shared-db
  - mysql:shared-db
- - masakari:amqp
  - rabbitmq-server:amqp
series: bionic
applications:
  masakari-monitors:
    charm: cs:~gnuoy/masakari-monitors-9
    series: bionic
  hacluster:
    charm: cs:~gnuoy/hacluster-38
    options:
      corosync_transport: unicast
      cluster_count: 3
      maas_url: http://10.0.0.205/MAAS
      maas_credentials: 3UC4:zSfGfk:Kaka2VUAD37zGF
  pacemaker-remote:
    charm: cs:~gnuoy/pacemaker-remote-10
    options:
      enable-stonith: True
      enable-resources: False
  masakari:
    charm: cs:~gnuoy/masakari-4
    series: bionic
    num_units: 3
    options:
      openstack-origin: cloud:bionic-rocky/proposed
      vip: '10.0.0.236 10.70.0.236 10.80.0.236'
    bindings:
      public: public
      admin: admin
      internal: internal
      shared-db: internal
      amqp: internal
    to:
    - 'lxd:1'
    - 'lxd:2'
    - 'lxd:3'

To add it to the existing model:

$ juju deploy base.yaml --overlay masakari-overlay.yaml --map-machines=0=0,1=1,2=2,3=3

Hacluster Resources

Each Pacemaker remote node has a corresponding resource which runs in the main cluster. The status of the Pacemaker nodes can be seen via crm status:

$ sudo crm status
Stack: corosync
Current DC: juju-f0373f-1-lxd-2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Fri Apr 12 12:57:27 2019
Last change: Fri Apr 12 10:03:49 2019 by root via cibadmin on juju-f0373f-1-lxd-2

6 nodes configured
10 resources configured

Online: [ juju-f0373f-1-lxd-2 juju-f0373f-2-lxd-1 juju-f0373f-3-lxd-3 ]
RemoteOnline: [ frank-colt model-crow tidy-goose ]

Full list of resources:

 Resource Group: grp_masakari_vips
     res_masakari_29a0f9d_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
     res_masakari_321f78b_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
     res_masakari_578b519_vip   (ocf::heartbeat:IPaddr2):       Started juju-f0373f-2-lxd-1
 Clone Set: cl_res_masakari_haproxy [res_masakari_haproxy]
     Started: [ juju-f0373f-1-lxd-2 juju-f0373f-2-lxd-1 juju-f0373f-3-lxd-3 ]
 tidy-goose     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 frank-colt     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 model-crow     (ocf::pacemaker:remote):        Started juju-f0373f-1-lxd-2
 st-maas-5937691        (stonith:external/maas):        Started juju-f0373f-1-lxd-2

The output above shows that the three pacemaker-remote nodes (frank-colt, model-crow & tidy-goose) are online. It also shows each remote node's corresponding ocf::pacemaker:remote resource and where that resource is running.

$ sudo crm configure show tidy-goose
primitive tidy-goose ocf:pacemaker:remote \
        params server=10.0.0.161 reconnect_interval=60 \
        op monitor interval=30s
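
If the node lists need to be checked from a script rather than by eye, the Online and RemoteOnline lines can be pulled out of the crm status output. The following is a minimal sketch rather than anything the charms ship; the function name parse_node_lists is my own and it only understands the two line formats shown in the output above.

```python
import re

# Matches lines such as "RemoteOnline: [ frank-colt model-crow tidy-goose ]".
# Indented lines (for example the clone set's "Started: [ ... ]") are ignored.
NODE_LINE = re.compile(r"^(Online|RemoteOnline):\s*\[\s*(.*?)\s*\]\s*$")


def parse_node_lists(crm_status_text):
    """Return a dict mapping 'Online'/'RemoteOnline' to a list of node names."""
    nodes = {"Online": [], "RemoteOnline": []}
    for line in crm_status_text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            nodes[match.group(1)] = match.group(2).split()
    return nodes


sample = """\
Online: [ juju-f0373f-1-lxd-2 juju-f0373f-2-lxd-1 juju-f0373f-3-lxd-3 ]
RemoteOnline: [ frank-colt model-crow tidy-goose ]
     Started: [ juju-f0373f-1-lxd-2 juju-f0373f-2-lxd-1 juju-f0373f-3-lxd-3 ]
"""
print(parse_node_lists(sample)["RemoteOnline"])
```

In practice the text would come from running sudo crm status on one of the masakari api units.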

By default the cluster set up by the hacluster charm has symmetric-cluster set to true. This means when a cluster resource is defined it is eligible to run on any node in the cluster. If a node should not run any given resource then the node has to opt out. In the masakari deployment the pacemaker-remote nodes are joined to the masakari api cluster. This would mean that the VIP used for accessing the api service and the haproxy clone set would be eligible to run on the nova-compute nodes, which would break the api service. Given that there may be a large number of pacemaker-remote nodes (one per compute node) and there are likely to be exactly three masakari api units, it makes sense to switch the cluster to being an opt-in cluster. To achieve this symmetric-cluster is set to false and location rules are created allowing the VIP and haproxy set to run on the api units. Below is an example of these location rules:

$ sudo crm configure show
...
location loc-grp_masakari_vips-juju-f0373f-1-lxd-2 grp_masakari_vips 0: juju-f0373f-1-lxd-2
location loc-grp_masakari_vips-juju-f0373f-2-lxd-1 grp_masakari_vips 0: juju-f0373f-2-lxd-1
location loc-grp_masakari_vips-juju-f0373f-3-lxd-3 grp_masakari_vips 0: juju-f0373f-3-lxd-3

Each rule permits the grp_masakari_vips group to run on that node. The score of 0 in each rule shows that all nodes are equal: there is no preference for the resource to run on one node rather than another.
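
The naming pattern of the rules above is regular enough to generate. This is only an illustration of the pattern seen in the crm configure show output, not the code the hacluster charm uses; the function name location_rules is hypothetical.

```python
def location_rules(resource, nodes, score=0):
    """Build one opt-in location rule per node for the given resource.

    The rule format copies the crm configure output shown above:
    location loc-<resource>-<node> <resource> <score>: <node>
    """
    return [
        f"location loc-{resource}-{node} {resource} {score}: {node}"
        for node in nodes
    ]


api_units = [
    "juju-f0373f-1-lxd-2",
    "juju-f0373f-2-lxd-1",
    "juju-f0373f-3-lxd-3",
]
for rule in location_rules("grp_masakari_vips", api_units):
    print(rule)
```

Because every rule uses the same score, adding a fourth api unit would just mean one more rule of the same shape.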

The clones require one additional trick: they attempt to run everywhere regardless of the symmetric-cluster setting. To limit them to the api nodes, location rules are needed as before, and an additional setting called clone-max is applied to the clone resource, limiting the number of places it should run.

clone cl_res_masakari_haproxy res_masakari_haproxy \
        meta clone-max=3

Finally the stonith configuration is also managed as a resource, as can be seen from the crm status at the start of the section where st-maas-5937691 is up and running.

$ sudo crm configure show st-maas-5937691
primitive st-maas-5937691 stonith:external/maas \
        params url="http://10.0.0.205/MAAS" apikey="3UC4:fGfk:37zGF" \
        hostnames="frank-colt model-crow tidy-goose" \
        op monitor interval=25 start-delay=25 timeout=25

As can be seen above, there is one stonith resource for all three nodes and the resource contains all the information needed to interact with the maas api.

Configuring Masakari

In Masakari the compute nodes are grouped into failover segments. In the event of a failure, guests are moved onto other nodes within the same segment. Which compute node is chosen to house the evacuated guests is determined by the recovery method of that segment.

'AUTO' Recovery Method

With auto recovery the guests are relocated to any of the available nodes in the same segment. The problem with this approach is that there is no guarantee that resources will be available to accommodate guests from a failed compute node.

To configure a group of compute hosts for auto recovery, first create a segment with the recovery method set to auto:

$ openstack segment create segment1 auto COMPUTE
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| created_at      | 2019-04-12T13:59:50.000000           |
| updated_at      | None                                 |
| uuid            | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| name            | segment1                             |
| description     | None                                 |
| id              | 1                                    |
| service_type    | COMPUTE                              |
| recovery_method | auto                                 |
+-----------------+--------------------------------------+

Next the hypervisors need to be added into the segment. These should be referenced by their unqualified hostname:

$ openstack segment host create tidy-goose COMPUTE SSH 691b8ef3-7481-48b2-afb6-908a98c8a768             
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2019-04-12T14:18:24.000000           |
| updated_at          | None                                 |
| uuid                | 11b85c9d-2b97-4b83-b773-0e9565e407b5 |
| name                | tidy-goose                           |
| type                | COMPUTE                              |
| control_attributes  | SSH                                  |
| reserved            | False                                |
| on_maintenance      | False                                |
| failover_segment_id | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
+---------------------+--------------------------------------+

Repeat the above for all remaining hypervisors, then list the hosts in the segment:

$ openstack segment host list 691b8ef3-7481-48b2-afb6-908a98c8a768
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 75afadbb-67cc-47b2-914e-e3bf848028e4 | frank-colt | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| 11b85c9d-2b97-4b83-b773-0e9565e407b5 | tidy-goose | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
| f1e9b0b4-3ac9-4f07-9f83-5af2f9151109 | model-crow | COMPUTE | SSH                | False    | False          | 691b8ef3-7481-48b2-afb6-908a98c8a768 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
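
With more than a handful of hypervisors, typing the host create command for each one gets tedious. The sketch below only builds the command lines, using the same argument order as the command shown above, and strips any domain so the unqualified hostname is used. The function name and the frank-colt.maas FQDN are made up for the example.

```python
import shlex


def segment_host_commands(segment_uuid, hostnames):
    """Return one 'openstack segment host create' argv list per hypervisor."""
    commands = []
    for hostname in hostnames:
        # Masakari expects the unqualified hostname, so drop any domain part.
        short_name = hostname.split(".", 1)[0]
        commands.append([
            "openstack", "segment", "host", "create",
            short_name, "COMPUTE", "SSH", segment_uuid,
        ])
    return commands


segment = "691b8ef3-7481-48b2-afb6-908a98c8a768"
# "frank-colt.maas" is a hypothetical fully qualified name.
for argv in segment_host_commands(segment, ["frank-colt.maas", "model-crow"]):
    print(shlex.join(argv))
```

Each list could be passed to subprocess.run once the printed commands have been checked.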

'RESERVED_HOST' Recovery Method

With reserved_host recovery, compute hosts are allocated as reserved, which allows an operator to guarantee there is sufficient capacity available for any guests in need of evacuation.

Firstly create a segment with the reserved_host recovery method:

$ openstack segment create segment1 reserved_host COMPUTE -c uuid -f value

2598f8aa-3612-4731-9716-e126ca6cc280

Add a host using the --reserved switch to indicate that it will act as a standby:

$ openstack segment host create model-crow --reserved True COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280

Add the remaining hypervisors as before:

$ openstack segment host create frank-colt COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280
$ openstack segment host create tidy-goose COMPUTE SSH 2598f8aa-3612-4731-9716-e126ca6cc280

Listing the segment hosts shows that model-crow is a reserved host:

$ openstack segment host list 2598f8aa-3612-4731-9716-e126ca6cc280
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 4769e08c-ed52-440a-866e-832b977aa5e2 | tidy-goose | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| 90aedbd2-e03b-4dbd-b330-a1c848f300df | frank-colt | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| c77574cc-b6e7-440e-9c86-84e91981f15e | model-crow | COMPUTE | SSH                | True     | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+

Finally disable the reserved host in nova so that it remains available for failover:

$ openstack compute service set --disable model-crow nova-compute

$ openstack compute service list
+----+----------------+---------------------+----------+----------+-------+----------------------------+
| ID | Binary         | Host                | Zone     | Status   | State | Updated At                 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+
|  1 | nova-scheduler | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T10:59:10.000000 |
|  5 | nova-conductor | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T10:59:08.000000 |
|  7 | nova-compute   | tidy-goose          | nova     | enabled  | up    | 2019-04-13T10:59:11.000000 |
|  8 | nova-compute   | frank-colt          | nova     | enabled  | up    | 2019-04-13T10:59:05.000000 |
|  9 | nova-compute   | model-crow          | nova     | disabled | up    | 2019-04-13T10:59:12.000000 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+

When a compute node failure is detected, masakari will disable the failed node and enable the reserve node in nova. After simulating a failure of frank-colt the service list now looks like this:

$ openstack compute service list
+----+----------------+---------------------+----------+----------+-------+----------------------------+
| ID | Binary         | Host                | Zone     | Status   | State | Updated At                 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+
|  1 | nova-scheduler | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T11:05:20.000000 |
|  5 | nova-conductor | juju-44b912-3-lxd-3 | internal | enabled  | up    | 2019-04-13T11:05:28.000000 |
|  7 | nova-compute   | tidy-goose          | nova     | enabled  | up    | 2019-04-13T11:05:21.000000 |
|  8 | nova-compute   | frank-colt          | nova     | disabled | down  | 2019-04-13T11:03:56.000000 |
|  9 | nova-compute   | model-crow          | nova     | enabled  | up    | 2019-04-13T11:05:22.000000 |
+----+----------------+---------------------+----------+----------+-------+----------------------------+

Since the reserved host has now been enabled and is hosting evacuated guests, masakari has removed the reserved flag from it. Masakari has also placed the failed node in maintenance mode.

$ openstack segment host list 2598f8aa-3612-4731-9716-e126ca6cc280
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name       | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
| 4769e08c-ed52-440a-866e-832b977aa5e2 | tidy-goose | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| 90aedbd2-e03b-4dbd-b330-a1c848f300df | frank-colt | COMPUTE | SSH                | False    | True           | 2598f8aa-3612-4731-9716-e126ca6cc280 |
| c77574cc-b6e7-440e-9c86-84e91981f15e | model-crow | COMPUTE | SSH                | False    | False          | 2598f8aa-3612-4731-9716-e126ca6cc280 |
+--------------------------------------+------------+---------+--------------------+----------+----------------+--------------------------------------+
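
The state changes seen in the tables above can be summarised in a few lines. This is a toy model of the observed before and after states only; it makes no claim about how masakari implements the switch internally, and the function and dictionary names are my own.

```python
def fail_over_to_reserved_host(failed, segment_hosts, nova_enabled):
    """Update both dicts in place and return the reserved host that was used.

    segment_hosts maps hostname -> {"reserved": bool, "on_maintenance": bool}
    nova_enabled maps hostname -> whether nova-compute is enabled there
    """
    # The failed node is disabled in nova and put into maintenance.
    nova_enabled[failed] = False
    segment_hosts[failed]["on_maintenance"] = True
    # The first reserved host is enabled and loses its reserved flag.
    for name, host in segment_hosts.items():
        if name != failed and host["reserved"]:
            nova_enabled[name] = True
            host["reserved"] = False
            return name
    return None


segment_hosts = {
    "tidy-goose": {"reserved": False, "on_maintenance": False},
    "frank-colt": {"reserved": False, "on_maintenance": False},
    "model-crow": {"reserved": True, "on_maintenance": False},
}
nova_enabled = {"tidy-goose": True, "frank-colt": True, "model-crow": False}

used = fail_over_to_reserved_host("frank-colt", segment_hosts, nova_enabled)
print(used, segment_hosts, nova_enabled)
```

One consequence worth noting is that after a failover the segment has no reserved host left until an operator marks one again.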

'AUTO_PRIORITY' and 'RH_PRIORITY' Recovery Methods

These methods appear to chain the previous methods together. So, auto_priority attempts to move the guest using the auto method first and if that fails it tries the reserved_host method. rh_priority does the same thing but in the reverse order. See Pike Release Note for details.
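
My understanding of the chaining can be written down as a small sketch. It reflects the description above rather than the masakari source, and the recover function and the callables passed to it are hypothetical.

```python
# Order in which the two basic methods are attempted for each recovery method.
ATTEMPT_ORDER = {
    "auto": ["auto"],
    "reserved_host": ["reserved_host"],
    "auto_priority": ["auto", "reserved_host"],
    "rh_priority": ["reserved_host", "auto"],
}


def recover(method, strategies):
    """Try each basic method in order; return the one that worked, or None.

    strategies maps "auto" and "reserved_host" to callables that return
    True when the evacuation succeeded.
    """
    for name in ATTEMPT_ORDER[method]:
        if strategies[name]():
            return name
    return None


# Pretend that there is no spare capacity for auto but a reserved host exists.
strategies = {"auto": lambda: False, "reserved_host": lambda: True}
print(recover("auto_priority", strategies))
print(recover("rh_priority", strategies))
```

Both calls end up using the reserved host here; the difference is that auto_priority only gets there after the auto attempt has failed.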

Individual Instance Recovery

Finally, to use the masakari feature which reacts to a single guest failing rather than a whole hypervisor, the guest(s) need to be marked with a small piece of metadata:

$ openstack server set --property HA_Enabled=True server_120419134342

Testing Instance Failure

The simplest scenario to test is single guest recovery. It is worth noting that the whole stack is quite good at detecting intentional shutdown and will do nothing if it detects it. So to test masakari, processes need to be shut down in a disorderly fashion, in this case by sending a SIGKILL to the guest's qemu process:

root@model-crow:~# pgrep -f qemu-system-x86_64
733213
root@model-crow:~# pkill -f -9 qemu-system-x86_64; sleep 10; pgrep -f qemu-system-x86_64
733851

The guest was killed and then restarted. Check the masakari instance monitor log to see what happened:

2019-04-12 14:30:56.269 189712 INFO masakarimonitors.instancemonitor.libvirt_handler.callback [-] Libvirt Event: type=VM, hostname=model-crow, uuid=4ce60f57-e8af-4a9a-b3f5-852d428c6890, time=2019-04-12 14:30:56.269703, event_id=LIFECYCLE, detail=STOPPED_FAILED)
2019-04-12 14:30:56.270 189712 INFO masakarimonitors.ha.masakari [-] Send a notification.

{'notification':
{'type': 'VM',
     'hostname': 'model-crow',
     'generated_time': datetime.datetime(2019, 4, 12, 14, 30, 56, 269703),
     'payload': {
         'event': 'LIFECYCLE',
         'instance_uuid': '4ce60f57-e8af-4a9a-b3f5-852d428c6890',
         'vir_domain_event': 'STOPPED_FAILED'}}}
2019-04-12 14:30:56.695 189712 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=VM, hostname=model-crow, generated_time=2019-04-12T14:30:56.269703,

payload={
   'event': 'LIFECYCLE',
   'instance_uuid': '4ce60f57-e8af-4a9a-b3f5-852d428c6890',
    'vir_domain_event': 'STOPPED_FAILED'}, id=4,
    notification_uuid=8c29844b-a79c-45ad-a7c7-2931fa263dab,
    source_host_uuid=f1e9b0b4-3ac9-4f07-9f83-5af2f9151109,
    status=new, created_at=2019-04-12T14:30:56.655930,
    updated_at=None)

The log shows the instance monitor spotting the guest going down and informing the masakari api service. Note that the instance monitor does not start the guest itself. It is the masakari engine, which resides on the masakari api units, that performs the recovery:

# tail -11 /var/log/masakari/masakari-engine.log

2019-04-12 14:30:56.696 39396 INFO masakari.engine.manager [req-9b1465ed-5261-4f6b-b091-e1fb91a60583 33d5abf329d24757aa972c6a06a75e96 6f6b9d4f6c314fbba2753d7210052cd2 - - -] Processing notification 8c29844b-a79c-45ad-a7c7-2931fa263dab of type: VM
2019-04-12 14:30:56.723 39396 INFO masakari.compute.nova [req-c8d5cd4a-2f47-4677-af4e-9939067768dc masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:57.309 39396 INFO masakari.compute.nova [req-698f61d9-9765-4109-97f4-e56f5685a251 masakari - - - -] Call stop server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:57.780 39396 INFO masakari.compute.nova [req-1b9f58c1-de57-488f-904d-be244ce8d990 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:58.406 39396 INFO masakari.compute.nova [req-ddb32f52-d907-40c8-b790-bbe156f38878 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:59.127 39396 INFO masakari.compute.nova [req-b7b28eaa-4559-4fc4-9a89-e565694038d4 masakari - - - -] Call start server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:59.692 39396 INFO masakari.compute.nova [req-18fd6893-cbc9-4b15-b413-bf780b9583e7 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:00.698 39396 INFO masakari.compute.nova [req-1674a605-48af-4986-bc4b-1bc80111bf4d masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:01.700 39396 INFO masakari.compute.nova [req-86ba7d2d-5fd9-40e3-86e8-2cf506f02217 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:02.703 39396 INFO masakari.compute.nova [req-f4e19958-7c04-4c29-bf02-de98064729b2 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:31:03.262 39396 INFO masakari.engine.manager [req-f4e19958-7c04-4c29-bf02-de98064729b2 masakari - - - -] Notification 8c29844b-a79c-45ad-a7c7-2931fa263dab exits with status: finished.
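
Reading the engine log above, the recovery looks like a stop followed by a start, with get server calls in between to poll the state. The sketch below mimics that sequence against a fake client; FakeNova and its method names are invented for the example and are not the real nova client API.

```python
import time


class FakeNova:
    """Hypothetical stand-in for a nova client that records each call."""

    def __init__(self):
        self.status = "ACTIVE"
        self.calls = []

    def get_status(self, uuid):
        self.calls.append("get")
        return self.status

    def stop(self, uuid):
        self.calls.append("stop")
        self.status = "SHUTOFF"

    def start(self, uuid):
        self.calls.append("start")
        self.status = "ACTIVE"


def wait_for(nova, uuid, wanted, attempts=10, delay=0.0):
    """Poll the server status until it matches, giving up after `attempts`."""
    for _ in range(attempts):
        if nova.get_status(uuid) == wanted:
            return True
        time.sleep(delay)
    return False


def recover_instance(nova, uuid):
    """Stop then start the guest, polling after each step as the log does."""
    nova.get_status(uuid)
    nova.stop(uuid)
    if not wait_for(nova, uuid, "SHUTOFF"):
        return "error"
    nova.start(uuid)
    return "finished" if wait_for(nova, uuid, "ACTIVE") else "error"


nova = FakeNova()
print(recover_instance(nova, "4ce60f57-e8af-4a9a-b3f5-852d428c6890"), nova.calls)
```

The fake changes state immediately, so it polls once per step, whereas the real log shows several get server calls while nova catches up.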

Testing Compute Node failure

In the event of a compute node failing, pacemaker should spot this. The masakari host monitor periodically checks the node status as reported by pacemaker and in the event of a failure a notification is sent to the masakari api. Pacemaker should run the stonith resource to power off the node and masakari should move guests that were running on the compute node on to another available node.

The first thing to do is to check what hypervisor the guest is running on:

$ openstack server show server_120419134342 -c OS-EXT-SRV-ATTR:host -f value
model-crow

On that hypervisor disable systemd recovery for both the pacemaker_remote and nova-compute services:

root@model-crow:~# sed -i -e 's/^Restart=.*/Restart=no/g' /lib/systemd/system/pacemaker_remote.service 
root@model-crow:~# sed -i -e 's/^Restart=.*/Restart=no/g' /lib/systemd/system/nova-compute.service
root@model-crow:~# systemctl daemon-reload

Now kill the processes:

root@model-crow:~# pkill -9 -f /usr/bin/nova-compute
root@model-crow:~# pkill -9 -f pacemaker_remoted

In the case of a compute node failure it is the compute node's peers that should detect the failure and post the notification to masakari:

2019-04-12 15:21:29.464 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] Works on pacemaker-remote.
2019-04-12 15:21:29.643 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'model-crow' is 'offline'.
2019-04-12 15:21:29.644 886030 INFO masakarimonitors.ha.masakari [-] Send a notification. {'notification': {'type': 'COMPUTE_HOST', 'hostname': 'model-crow', 'generated_time': datetime.datetime(2019, 4, 12, 15, 21, 29, 644323), 'payload': {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}}}
2019-04-12 15:21:30.746 886030 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=COMPUTE_HOST, hostname=model-crow, generated_time=2019-04-12T15:21:29.644323, payload={'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}, id=5, notification_uuid=2f62f9dd-7edf-406c-aa16-0a4e8e3a3726, source_host_uuid=f1e9b0b4-3ac9-4f07-9f83-5af2f9151109, status=new, created_at=2019-04-12T15:21:30.706730, updated_at=None)
2019-04-12 15:21:30.747 886030 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'tidy-goose' is 'online'.
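
The notification body in the host monitor log above has a simple shape. As a rough sketch, assuming the node states have already been read from pacemaker into a dict, the notifications for offline nodes could be built like this. The function name is mine and the real monitor does more than this, so treat it as an illustration of the payload only.

```python
from datetime import datetime


def host_notifications(node_status, generated_time):
    """Build a COMPUTE_HOST notification for every node reported offline.

    node_status maps hostname -> "online" or "offline", as in the log above.
    The payload values are copied from the logged notification.
    """
    notifications = []
    for hostname, status in sorted(node_status.items()):
        if status != "offline":
            continue
        notifications.append({
            "notification": {
                "type": "COMPUTE_HOST",
                "hostname": hostname,
                "generated_time": generated_time,
                "payload": {
                    "event": "STOPPED",
                    "cluster_status": "OFFLINE",
                    "host_status": "NORMAL",
                },
            }
        })
    return notifications


status = {"model-crow": "offline", "tidy-goose": "online"}
when = datetime(2019, 4, 12, 15, 21, 29, 644323)
for item in host_notifications(status, when):
    print(item)
```

Only model-crow produces a notification here, which matches the log where tidy-goose is simply reported as online.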

As before, check the masakari engine has responded:

$ tail -n12  /var/log/masakari/masakari-engine.log
2019-04-12 14:30:22.004 31553 INFO masakari.compute.nova [req-ec98fbd0-717b-41db-883b-3d46eea82a0e masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 14:30:22.556 31553 INFO masakari.engine.manager [req-ec98fbd0-717b-41db-883b-3d46eea82a0e masakari - - - -] Notification cee025d2-9736-4238-91e2-378661373f9d exits with status: finished.
2019-04-12 15:21:30.744 31553 INFO masakari.engine.manager [req-7713b994-c112-4e7f-905d-ef2cc497ca9a 33d5abf329d24757aa972c6a06a75e96 6f6b9d4f6c314fbba2753d7210052cd2 - - -] Processing notification 2f62f9dd-7edf-406c-aa16-0a4e8e3a3726 of type: COMPUTE_HOST
2019-04-12 15:21:30.827 31553 INFO masakari.compute.nova [req-8abbebe8-a6fb-48f1-90fc-113ad68d1c38 masakari - - - -] Disable nova-compute on model-crow
2019-04-12 15:21:31.252 31553 INFO masakari.engine.drivers.taskflow.host_failure [req-8abbebe8-a6fb-48f1-90fc-113ad68d1c38 masakari - - - -] Sleeping 30 sec before starting recovery thread until nova recognizes the node down.
2019-04-12 15:22:01.257 31553 INFO masakari.compute.nova [req-658b10fc-cd2c-4b28-8f1c-21e9da5ca14e masakari - - - -] Fetch Server list on model-crow
2019-04-12 15:22:02.309 31553 INFO masakari.compute.nova [req-9c8c65c4-0a69-4cc5-90db-5aef4f35f22a masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:02.963 31553 INFO masakari.compute.nova [req-71a4c4a6-2950-416a-916c-b075eff51b4c masakari - - - -] Call lock server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:03.872 31553 INFO masakari.compute.nova [req-71a4c4a6-2950-416a-916c-b075eff51b4c masakari - - - -] Call evacuate command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890 on host None
2019-04-12 15:22:04.473 31553 INFO masakari.compute.nova [req-8363a1bf-0214-4db4-b45a-c720c4113830 masakari - - - -] Call get server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:05.221 31553 INFO masakari.compute.nova [req-f7410918-02e7-4bff-afdd-81a1c24c6ee6 masakari - - - -] Call unlock server command for instance 4ce60f57-e8af-4a9a-b3f5-852d428c6890
2019-04-12 15:22:05.783 31553 INFO masakari.engine.manager [req-f7410918-02e7-4bff-afdd-81a1c24c6ee6 masakari - - - -] Notification 2f62f9dd-7edf-406c-aa16-0a4e8e3a3726 exits with status: finished.
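
The host failure flow in the engine log above can be condensed into the following sketch: disable the compute service, wait for nova to notice, then lock, evacuate and unlock each guest. RecordingNova and its method names are hypothetical and exist only so the sequence can be run without a cloud.

```python
import time


class RecordingNova:
    """Hypothetical stand-in for a nova client; it only records calls."""

    def __init__(self, servers_by_host):
        self.servers_by_host = servers_by_host
        self.calls = []

    def disable_compute_service(self, host):
        self.calls.append(("disable", host))

    def list_servers(self, host):
        self.calls.append(("list", host))
        return list(self.servers_by_host.get(host, []))

    def lock(self, uuid):
        self.calls.append(("lock", uuid))

    def evacuate(self, uuid, target_host=None):
        self.calls.append(("evacuate", uuid, target_host))

    def unlock(self, uuid):
        self.calls.append(("unlock", uuid))


def recover_host(nova, failed_host, wait_seconds=30, sleep=time.sleep):
    """Follow the order of operations seen in the masakari engine log."""
    nova.disable_compute_service(failed_host)
    sleep(wait_seconds)  # the log shows a 30 second pause at this point
    evacuated = []
    for uuid in nova.list_servers(failed_host):
        nova.lock(uuid)
        # The log shows "on host None", i.e. no target host is named.
        nova.evacuate(uuid, target_host=None)
        nova.unlock(uuid)
        evacuated.append(uuid)
    return evacuated


guest = "4ce60f57-e8af-4a9a-b3f5-852d428c6890"
nova = RecordingNova({"model-crow": [guest]})
# Skip the real pause so the example runs instantly.
print(recover_host(nova, "model-crow", sleep=lambda seconds: None))
```

Passing the sleep function in as a parameter is just a convenience for trying the sequence without waiting.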

Now check that the guest has moved:

$ openstack server show server_120419134342 -c OS-EXT-SRV-ATTR:host -f value
frank-colt

And finally that the stonith resource ran:

root@juju-f0373f-2-lxd-1:~# tail -f  /var/log/syslog
Apr 12 15:21:39 juju-f0373f-1-lxd-2 external/maas[201018]: info: Performing power reset on model-crow
Apr 12 15:21:42 juju-f0373f-1-lxd-2 external/maas[201018]: info: model-crow is in power state unknown
Apr 12 15:21:42 juju-f0373f-1-lxd-2 external/maas[201018]: info: Powering off model-crow