Friday 23 April 2021

Controlling service interrupting events in the OpenStack Charms

The new deferred service event feature is arriving in the 21.04 OpenStack charm release. It allows an operator to prevent services from being restarted by some of the charms, which means interruptions to the data plane can be tightly controlled.

Managing deferred service events

The deferred service event feature is off by default. It is enabled by setting the enable-auto-restarts charm config option to False:

$ juju config neutron-gateway enable-auto-restarts=False

Triggering a deferred service restart via a charm change

Changing the neutron-gateway charm's 'debug' option causes neutron.conf to be updated, and a change to neutron.conf normally triggers a restart of the neutron services. When auto restarts are disabled, however, the charm updates neutron.conf but does not restart the neutron services. Instead it lets the operator know, via the workload status, that a restart is needed.

$ juju config neutron-gateway debug=True
$ juju status neutron-gateway
Model              Controller              Cloud/Region             Version  SLA          Timestamp
zaza-cfafc581b686  gnuoy-serverstack-nons  serverstack/serverstack  2.8.8    unsupported  10:02:19Z

App              Version  Status  Scale  Charm            Store  Channel  Rev  OS      Message
neutron-gateway  15.3.2   active      1  neutron-gateway  local            65  ubuntu  Unit is ready. Services queued for restart: neutron-dhcp-agent, neutron-l3-agent, neutron-metadata-agent, neutron-metering-agent, neutron-openvswitch-agent

Unit                Workload  Agent  Machine  Public address  Ports  Message
neutron-gateway/0*  active    idle   5        172.20.0.37            Unit is ready. Services queued for restart: neutron-dhcp-agent, neutron-l3-agent, neutron-metadata-agent, neutron-metering-agent, neutron-openvswitch-agent

Machine  State    DNS          Inst id                               Series  AZ    Message
5        started  172.20.0.37  9cc5c808-9c85-4b23-aaca-ded6ba666d33  bionic  nova  ACTIVE
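
The decision the charm makes here can be sketched in Python. This is a minimal illustration under stated assumptions, not the charms' actual code: RESTART_MAP, services_for_changes and plan are hypothetical names, the mapping of files to services is inferred from the status output above and from the show-deferred-events output later in this post, the neutron.conf contents are illustrative, and file contents are passed in as bytes rather than read from disk.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical restart map: services to restart when a given file changes.
RESTART_MAP: Dict[str, List[str]] = {
    "/etc/neutron/neutron.conf": [
        "neutron-dhcp-agent",
        "neutron-l3-agent",
        "neutron-metadata-agent",
        "neutron-metering-agent",
        "neutron-openvswitch-agent",
    ],
    "/etc/neutron/dhcp_agent.ini": ["neutron-dhcp-agent"],
}


def digest(content: bytes) -> str:
    """SHA-256 of a file's rendered content (a missing file is treated as empty)."""
    return hashlib.sha256(content).hexdigest()


def services_for_changes(before: Dict[str, bytes], after: Dict[str, bytes],
                         restart_map: Dict[str, List[str]]) -> List[str]:
    """Services whose files changed, without duplicates, in restart-map order."""
    services: List[str] = []
    for path, path_services in restart_map.items():
        if digest(before.get(path, b"")) != digest(after.get(path, b"")):
            for service in path_services:
                if service not in services:
                    services.append(service)
    return services


def plan(before: Dict[str, bytes], after: Dict[str, bytes],
         restart_map: Dict[str, List[str]],
         enable_auto_restarts: bool) -> Tuple[str, List[str]]:
    """Return ("restart" | "defer" | "noop", services) for a config change."""
    services = services_for_changes(before, after, restart_map)
    if not services:
        return "noop", []
    return ("restart" if enable_auto_restarts else "defer"), services


# Illustrative contents: flipping the 'debug' option changes neutron.conf only.
before = {"/etc/neutron/neutron.conf": b"[DEFAULT]\ndebug = False\n"}
after = {"/etc/neutron/neutron.conf": b"[DEFAULT]\ndebug = True\n"}
print(plan(before, after, RESTART_MAP, enable_auto_restarts=False))
```

Comparing content digests rather than modification times means that re-rendering a file with identical content queues nothing; on a real unit the digests would be taken from the files on disk before and after rendering.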

Triggering a deferred hook

There are some occasions when it is not safe for a hook to run at all while the charm is deferring events. Consider the rabbitmq-server charm switching from plain text mode to TLS. If the rabbit daemon is not restarted, it will continue to run without TLS, and the clients cannot be told to switch to TLS because they would then be unable to connect. Nor is it safe to update the rabbitmq config without restarting the service, because the service may be restarted for an unexpected reason, such as a server reboot. If that happens, rabbit will flip to the new config and the clients will be left trying to talk plain text to a TLS-only service. To avoid this, the charm may defer running the entire hook. When this happens it is also visible in the workload status message. For example, changing the neutron-openvswitch charm's disable-mlockall option causes its config-changed hook to be deferred:

$ juju config neutron-openvswitch disable-mlockall=False
$ juju status neutron-openvswitch/0
Model              Controller              Cloud/Region             Version  SLA          Timestamp
zaza-cfafc581b686  gnuoy-serverstack-nons  serverstack/serverstack  2.8.8    unsupported  10:44:12Z

App                  Version  Status  Scale  Charm                Store       Channel  Rev  OS      Message
neutron-openvswitch  15.3.2   active      1  neutron-openvswitch  charmstore           433  ubuntu  Unit is ready. Hooks skipped due to disabled auto restarts: config-changed
nova-compute         20.5.0   active      1  nova-compute         charmstore           539  ubuntu  Unit is ready

Unit                      Workload  Agent  Machine  Public address  Ports  Message
nova-compute/0*           active    idle   7        172.20.0.6             Unit is ready
  neutron-openvswitch/0*  active    idle            172.20.0.6             Unit is ready. Hooks skipped due to disabled auto restarts: config-changed

Machine  State    DNS         Inst id                               Series  AZ    Message
7        started  172.20.0.6  f160add9-ec68-4658-9688-da6dc7cb8c44  bionic  nova  ACTIVE
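
Operators automating around this feature may want to pick the queued services and skipped hooks out of the workload status message. A minimal sketch, assuming the message wording stays exactly as in the examples above (it is not a documented, stable interface) and using a hypothetical helper name, parse_deferred_status. The message itself could be taken, for example, from the unit's workload-status in `juju status --format=json` output.

```python
import re
from typing import Dict, List

# Labels copied from the workload status messages shown above.
LABELS = {
    "restarts": "Services queued for restart",
    "hooks": "Hooks skipped due to disabled auto restarts",
}


def parse_deferred_status(message: str) -> Dict[str, List[str]]:
    """Extract queued service restarts and skipped hooks from a status message."""
    result: Dict[str, List[str]] = {key: [] for key in LABELS}
    for key, label in LABELS.items():
        # A comma-separated list of names such as neutron-dhcp-agent.service.
        match = re.search(re.escape(label) + r": ([\w.-]+(?:, [\w.-]+)*)", message)
        if match:
            # Drop a sentence-ending full stop if one follows the last item.
            result[key] = [item.rstrip(".") for item in match.group(1).split(", ")]
    return result


status = "Unit is ready. Hooks skipped due to disabled auto restarts: config-changed"
print(parse_deferred_status(status))  # {'restarts': [], 'hooks': ['config-changed']}
```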

Triggering a deferred service restart via package change

The charms also ensure that package updates do not trigger restarts of key services, even when the package update happens outside of a charm hook or action. In that case the next update-status hook will detect that a restart is needed and display it in the workload status message. In the example below, update-status is run by hand straight after the package is reconfigured rather than waiting for the next scheduled run:

$ juju run --unit neutron-gateway/0 "dpkg-reconfigure openvswitch-switch; ./hooks/update-status"
active
active
active
active
active
invoke-rc.d: policy-rc.d denied execution of restart.
$ juju status neutron-gateway
Model              Controller              Cloud/Region             Version  SLA          Timestamp
zaza-cfafc581b686  gnuoy-serverstack-nons  serverstack/serverstack  2.8.8    unsupported  10:26:46Z

App              Version  Status  Scale  Charm            Store  Channel  Rev  OS      Message
neutron-gateway  15.3.2   active      1  neutron-gateway  local            65  ubuntu  Unit is ready. Services queued for restart: openvswitch-switch

Unit                Workload  Agent  Machine  Public address  Ports  Message
neutron-gateway/0*  active    idle   5        172.20.0.37            Unit is ready. Services queued for restart: openvswitch-switch

Machine  State    DNS          Inst id                               Series  AZ    Message
5        started  172.20.0.37  9cc5c808-9c85-4b23-aaca-ded6ba666d33  bionic  nova  ACTIVE

Triggering a deferred service restart via OpenStack upgrade

Perhaps the most interesting scenario is an OpenStack upgrade. Here the package update is triggered by changing the charm's openstack-origin option. With deferred service events enabled, the long-running upgrade completes without interrupting access to guests. In the example below the ovs-vswitchd process ID is the same (30718) before and after the upgrade, even though neutron-common moves from 15.3.2 to 16.3.0:

$ juju run --unit neutron-gateway/0 "pgrep ovs-vswitchd; dpkg -l | grep neutron-common"
30718
ii  neutron-common                       2:15.3.2-0ubuntu1~cloud2                                    all          Neutron is a virtual network service for Openstack - common
$ juju config neutron-gateway openstack-origin
cloud:bionic-train
$ juju config neutron-gateway openstack-origin=cloud:bionic-ussuri
$ juju run --unit neutron-gateway/0 "pgrep ovs-vswitchd; dpkg -l | grep neutron-common"
30718
ii  neutron-common                       2:16.3.0-0ubuntu3~cloud0                                    all          Neutron is a virtual network service for Openstack - common
$ juju status neutron-gateway/0
Model              Controller              Cloud/Region             Version  SLA          Timestamp
zaza-cfafc581b686  gnuoy-serverstack-nons  serverstack/serverstack  2.8.8    unsupported  14:13:04Z

App              Version  Status  Scale  Charm            Store  Channel  Rev  OS      Message
neutron-gateway  16.3.0   active      1  neutron-gateway  local            65  ubuntu  Unit is ready. Services queued for restart: neutron-dhcp-agent, neutron-dhcp-agent.service, neutron-l3-agent, neutron-l3-agent.service, neutron-metadata-agent, neutron-metadata-agent.service, neutron-metering-agent, neutron-metering-agent.service, neutron-openvswitch-agent, neutron-openvswitch-agent.service, openvswitch-switch

Unit                Workload  Agent  Machine  Public address  Ports  Message
neutron-gateway/0*  active    idle   5        172.20.0.37            Unit is ready. Services queued for restart: neutron-dhcp-agent, neutron-dhcp-agent.service, neutron-l3-agent, neutron-l3-agent.service, neutron-metadata-agent, neutron-metadata-agent.service, neutron-metering-agent, neutron-metering-agent.service, neutron-openvswitch-agent, neutron-openvswitch-agent.service, openvswitch-switch

Machine  State    DNS          Inst id                               Series  AZ    Message
5        started  172.20.0.37  9cc5c808-9c85-4b23-aaca-ded6ba666d33  bionic  nova  ACTIVE
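
That the upgrade left ovs-vswitchd running can be checked mechanically from the two `juju run` outputs above: the PID must be unchanged while the neutron-common version changes. A small sketch, assuming the output looks like the example (PID lines from pgrep, then the `dpkg -l` row with the version in its third column); parse_check and upgraded_without_restart are hypothetical names.

```python
from typing import Optional, Set, Tuple


def parse_check(output: str) -> Tuple[Set[int], Optional[str]]:
    """Return (ovs-vswitchd PIDs, neutron-common version) from the command output."""
    pids: Set[int] = set()
    version: Optional[str] = None
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 1 and fields[0].isdigit():
            pids.add(int(fields[0]))  # a pgrep result line
        elif len(fields) >= 3 and fields[0] == "ii" and fields[1] == "neutron-common":
            version = fields[2]  # dpkg -l columns: status, name, version, ...
    return pids, version


def upgraded_without_restart(before: str, after: str) -> bool:
    """True if the package version changed but the process IDs did not."""
    pids_before, ver_before = parse_check(before)
    pids_after, ver_after = parse_check(after)
    return (bool(pids_before) and pids_before == pids_after
            and ver_before is not None and ver_after is not None
            and ver_before != ver_after)


before = ("30718\nii  neutron-common  2:15.3.2-0ubuntu1~cloud2  all  "
          "Neutron is a virtual network service for Openstack - common\n")
after = ("30718\nii  neutron-common  2:16.3.0-0ubuntu3~cloud0  all  "
         "Neutron is a virtual network service for Openstack - common\n")
print(upgraded_without_restart(before, after))  # True
```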

Running a service restart

The charms provide a restart-services action which accepts a deferred-only option. When the action is run with deferred-only=True, the charm checks which services need a restart and restarts only those. For example, to clear all deferred restarts:

$ juju run-action neutron-gateway/0 restart-services deferred-only=True --wait
unit-neutron-gateway-0:
  UnitId: neutron-gateway/0
  id: "238"
  results:
    Stdout: |
      active
      active
      active
      active
      active
  status: completed
  timing:
    completed: 2021-04-23 10:07:19 +0000 UTC
    enqueued: 2021-04-23 10:06:42 +0000 UTC
    started: 2021-04-23 10:06:45 +0000 UTC

Note: If a service is restarted manually, the charm's workload status message will be updated when the next hook runs.

Running a deferred hook

The charms provide a run-deferred-hooks action which runs any hooks that have been deferred. Any service restarts that are marked as deferred are also carried out as part of running this action.

$ juju run-action neutron-openvswitch/0 run-deferred-hooks --wait

Showing details of deferred events

The charms provide a show-deferred-events action, which lists the events that have been deferred along with some extra detail. For each deferred restart it shows when it was requested, the service and the reason.

$ juju run-action neutron-gateway/0 show-deferred-events --wait
unit-neutron-gateway-0:
  UnitId: neutron-gateway/0
  id: "256"
  results:
    output: |
      hooks: []
      restarts:
      - 1619173568 openvswitch-switch                       Package update
      - 1619175335 openvswitch-switch                       Package update
      - '1619181884 neutron-dhcp-agent                       File(s) changed: /etc/neutron/dhcp_agent.ini,
        /etc/neutron/neutron.conf'
      - '1619181884 neutron-l3-agent                         File(s) changed: /etc/neutron/neutron.conf'
      - '1619181884 neutron-metadata-agent                   File(s) changed: /etc/neutron/neutron.conf'
      - '1619181884 neutron-metering-agent                   File(s) changed: /etc/neutron/neutron.conf'
      - '1619181884 neutron-openvswitch-agent                File(s) changed: /etc/neutron/neutron.conf'
  status: completed
  timing:
    completed: 2021-04-23 12:44:57 +0000 UTC
    enqueued: 2021-04-23 12:44:56 +0000 UTC
    started: 2021-04-23 12:44:56 +0000 UTC
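
The restart entries in this output follow a simple "timestamp service reason" layout. The sketch below assumes the `restarts` list has already been loaded from the action's YAML output into Python strings (a YAML parser would also join the wrapped neutron-dhcp-agent entry). It converts each entry into a record and keeps only the most recent request per service. DeferredRestart, parse_restart_entry and latest_by_service are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, NamedTuple


class DeferredRestart(NamedTuple):
    when: datetime  # time the restart was requested, in UTC
    service: str
    reason: str


def parse_restart_entry(entry: str) -> DeferredRestart:
    """Split '<epoch> <service> <reason...>' into a record."""
    timestamp, service, reason = entry.split(maxsplit=2)
    when = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return DeferredRestart(when, service, reason)


def latest_by_service(entries: Iterable[str]) -> Dict[str, DeferredRestart]:
    """Keep the most recent deferred restart for each service."""
    latest: Dict[str, DeferredRestart] = {}
    for record in map(parse_restart_entry, entries):
        current = latest.get(record.service)
        if current is None or record.when > current.when:
            latest[record.service] = record
    return latest


entries = [
    "1619173568 openvswitch-switch                       Package update",
    "1619175335 openvswitch-switch                       Package update",
    "1619181884 neutron-l3-agent                         File(s) changed: /etc/neutron/neutron.conf",
]
for service, restart in sorted(latest_by_service(entries).items()):
    print(restart.when.isoformat(), service, restart.reason)
```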

Under the hood

Recording deferred events

When a charm or package needs to restart a service but is not permitted to, the request is recorded in a file in /var/lib/policy-rc.d. These files have the following format:

# cat /var/lib/policy-rc.d/charm-neutron-gateway-6df8252a-a422-11eb-a3e0-fa163e25ff5d.deferred 
{
    action: restart,
    policy_requestor_name: neutron-gateway,
    policy_requestor_type: charm,
    reason: Package update,
    service: openvswitch-switch,
    timestamp: 1619175335}

This shows that the deferred action was a restart of the openvswitch-switch service. The timestamp records when the request was made, in seconds since the epoch, and can be converted using the date command:

$ date -d @1619175335
Fri 23 Apr 11:55:35 BST 2021

The file also shows that the restart was requested because a package was updated. Finally, the policy_requestor_name and policy_requestor_type keys show that it is the neutron-gateway charm that is requesting that restarts of the service be denied.
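
The deferred event file shown above appears to be a YAML flow mapping, so a YAML library such as PyYAML would be the natural way to read it. To stay within the Python standard library, the sketch below parses only the flat form shown here, assuming unquoted values that contain no commas, and converts the timestamp as the date command does. parse_deferred_event and requested_at are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Dict, Union

Value = Union[str, int]


def parse_deferred_event(text: str) -> Dict[str, Value]:
    """Parse a flat '{key: value, ...}' mapping.

    Assumes unquoted scalar values without commas, as in the example file;
    this is not a general YAML parser.
    """
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a {...} mapping")
    event: Dict[str, Value] = {}
    for pair in body[1:-1].split(","):
        if not pair.strip():
            continue
        key, _, raw = pair.partition(":")  # split on the first colon only
        value = raw.strip()
        event[key.strip()] = int(value) if value.isdigit() else value
    return event


def requested_at(event: Dict[str, Value]) -> datetime:
    """Convert the epoch timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(int(event["timestamp"]), tz=timezone.utc)


SAMPLE = """{
    action: restart,
    policy_requestor_name: neutron-gateway,
    policy_requestor_type: charm,
    reason: Package update,
    service: openvswitch-switch,
    timestamp: 1619175335}"""

print(requested_at(parse_deferred_event(SAMPLE)))  # 2021-04-23 10:55:35+00:00 (11:55:35 BST)
```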

These files are read by the update-status hook. The charm compares the timestamp in each file with the start time of the service. If the service was restarted after the timestamp in the file, the file is removed and that deferred event is considered complete. Otherwise, the outstanding events are summarised in the workload status message.

This means that deferred restarts can be cleared by restarting the service manually, by removing the deferred event file, or by running the restart-services action described earlier.
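
The completion check that update-status performs can be sketched as a pure function. This is an illustration, not the charms' code: is_complete, outstanding and summarise are hypothetical names, events are dictionaries like the deferred event file above, and service start times are supplied by the caller (on a systemd host one possible source is `systemctl show --property=ActiveEnterTimestamp <service>`). Removing the files of completed events is left out.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Mapping

Event = Dict[str, Any]


def is_complete(event: Event, service_started_at: datetime) -> bool:
    """A deferred event is complete once its service started after the request."""
    requested = datetime.fromtimestamp(int(event["timestamp"]), tz=timezone.utc)
    return service_started_at > requested


def outstanding(events: List[Event],
                start_times: Mapping[str, datetime]) -> List[Event]:
    """Events still awaiting a restart; unknown start times count as outstanding."""
    pending: List[Event] = []
    for event in events:
        started = start_times.get(event["service"])
        if started is None or not is_complete(event, started):
            pending.append(event)
    return pending


def summarise(events: List[Event]) -> str:
    """Build a status fragment like the one in the workload status message."""
    services = sorted({event["service"] for event in events})
    return "Services queued for restart: " + ", ".join(services) if services else ""


event = {"service": "openvswitch-switch", "timestamp": 1619175335}
started = {"openvswitch-switch": datetime(2021, 4, 23, 10, 0, tzinfo=timezone.utc)}
print(summarise(outstanding([event], started)))
```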

Integration with packaging

The charm makes use of the policy-rc.d interface. When a package wants to act on a service, tools such as invoke-rc.d run /usr/sbin/policy-rc.d with the name of the service and the action to be taken, and the script's exit code tells the packaging system whether that action is permitted. The charm ships its own implementation of the policy-rc.d script, which decides whether an action is permitted by examining policy files in /etc/policy-rc.d. These policy files list which actions are denied for which services.

# cat /etc/policy-rc.d/charm-neutron-gateway.policy 
# Managed by juju
blocked_actions:
  neutron-dhcp-agent: [restart, stop, try-restart]
  neutron-l3-agent: [restart, stop, try-restart]
  neutron-metadata-agent: [restart, stop, try-restart]
  neutron-metering-agent: [restart, stop, try-restart]
  neutron-openvswitch-agent: [restart, stop, try-restart]
  openvswitch-switch: [restart, stop, try-restart]
  ovs-vswitchd: [restart, stop, try-restart]
  ovsdb-server: [restart, stop, try-restart]
policy_requestor_name: neutron-gateway
policy_requestor_type: charm

The charm that wrote the policy file is indicated by the policy_requestor_name key, and the blocked_actions key lists which actions are blocked for each service.
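
The decision the policy-rc.d script makes can be sketched as follows. This assumes the policy files have already been loaded into dictionaries shaped like the one above, that the script is asked about a single action at a time, and that it uses the policy-rc.d interface's exit codes 0 (action allowed) and 101 (action forbidden by policy). decide and is_blocked are hypothetical names, not the charm's implementation.

```python
from typing import Any, Dict, Iterable

ACTION_ALLOWED = 0      # policy-rc.d exit code: action allowed
ACTION_FORBIDDEN = 101  # policy-rc.d exit code: action forbidden by policy

Policy = Dict[str, Any]


def is_blocked(policies: Iterable[Policy], service: str, action: str) -> bool:
    """True if any loaded policy file lists `action` as blocked for `service`."""
    for policy in policies:
        blocked = policy.get("blocked_actions") or {}
        if action in blocked.get(service, []):
            return True
    return False


def decide(policies: Iterable[Policy], service: str, action: str) -> int:
    """Exit code the script would return for this service and action."""
    return ACTION_FORBIDDEN if is_blocked(policies, service, action) else ACTION_ALLOWED


# The neutron-gateway policy file shown above, as a Python dictionary.
NEUTRON_GATEWAY_POLICY: Policy = {
    "blocked_actions": {
        service: ["restart", "stop", "try-restart"]
        for service in (
            "neutron-dhcp-agent", "neutron-l3-agent", "neutron-metadata-agent",
            "neutron-metering-agent", "neutron-openvswitch-agent",
            "openvswitch-switch", "ovs-vswitchd", "ovsdb-server",
        )
    },
    "policy_requestor_name": "neutron-gateway",
    "policy_requestor_type": "charm",
}

print(decide([NEUTRON_GATEWAY_POLICY], "openvswitch-switch", "restart"))  # 101
print(decide([NEUTRON_GATEWAY_POLICY], "openvswitch-switch", "start"))    # 0
```

Checking every loaded policy file means that a service is protected if any charm on the machine has asked for its restarts to be denied.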