Friday 12 July 2019

OVS + DPDK Bond + VLAN + Network Namespace

I was recently investigating a bug in an OpenStack deployment. Guests were failing to receive metadata on boot. Digging into the deployment revealed the following: the service providing the metadata was running inside a network namespace; the namespace was attached to an ovs bridge using a tap device; the tap device's ovs port was associated with a specific vlan id; the ovs bridge was in turn using dpdk; and, for external network access, two network cards were bonded together using an ovs dpdk bond. After grabbing a cup of tea to calm my nerves I started reading a fair amount of documentation and running a few tests. As far as I could tell everything was configured as it should be.

To be able to file a bug I needed to be able to reproduce this setup on the latest version of each piece of software, preferably without needing a full OpenStack deployment. Then I could start removing layers of complexity until I had the simplest reproducer of the bug.

Although no single part of the process of reproducing this setup was particularly difficult, it did involve a fair few moving parts, and below I run through them (mainly for my own benefit when, next week, I've forgotten everything I did).

DPDK

I was lucky enough to have access to a server with two dpdk-compatible network cards which I could deploy using maas. The server had also been set up to have hugepages created on install. This was done by creating a custom maas tag and assigning it to the server:

2 ubuntu@maas:~⟫ maas maas tag read dpdk
Success.
Machine-readable output follows:
{
    "definition": "",
    "name": "dpdk",
    "resource_uri": "/MAAS/api/2.0/tags/dpdk/",
    "kernel_opts": "hugepages=103117 iommu=pt intel_iommu=on",
    "comment": "DPDK enabled machines"
}

After installing the development release of Ubuntu (eoan) on the server it was time to install the dpdk and ovs packages.

ubuntu@node-licetus:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu Eoan Ermine (development branch)
Release:        19.10
Codename:       eoan
ubuntu@node-licetus:~$ sudo apt-get -q install -y dpdk openvswitch-switch-dpdk

Update the system so that ovs-vswitchd is provided by the dpdk-enabled build from openvswitch-switch-dpdk:

ubuntu@node-licetus:~$ sudo update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk
update-alternatives: using /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in manual mode

dpdk references the network cards via their PCI address.  One way to find the PCI address of a network card, given its MAC address, is to examine the web of files and symbolic links in the /sys filesystem.

ubuntu@node-licetus:~$ grep -E 'a0:36:9f:dd:31:bc|a0:36:9f:dd:31:be' /sys/class/net/*/address
/sys/class/net/enp3s0f0/address:a0:36:9f:dd:31:bc
/sys/class/net/enp3s0f1/address:a0:36:9f:dd:31:be

ubuntu@node-licetus:~$ ls -ld /sys/class/net/enp3s0f0 /sys/class/net/enp3s0f1 | awk '{print $NF}' | awk 'BEGIN {FS="/"} {print $6}'
0000:03:00.0
0000:03:00.1
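
The same address can also be read from the interface's device symlink, which avoids counting path components with awk. This is a minimal sketch rather than what was run on the server; the function name pci_address_of and its optional sysfs-root argument are made up for illustration.

```shell
# Print the PCI address backing a network interface.
# $1: interface name, $2: sysfs root (hypothetical hook for testing, default /sys)
pci_address_of() {
    iface=$1
    sysfs_root=${2:-/sys}
    link="$sysfs_root/class/net/$iface/device"
    # Virtual interfaces (lo, bridges, taps) have no device link.
    [ -e "$link" ] || return 1
    # The link resolves to the PCI device directory, whose last path
    # component is the PCI address, e.g. 0000:03:00.0.
    basename "$(readlink -f "$link")"
}
```

For example, pci_address_of enp3s0f0 should print the address that dpdk expects for that card, provided the card is still bound to its kernel driver.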

To switch the network cards from being kernel managed to being managed by dpdk, the /etc/dpdk/interfaces file is updated and dpdk restarted. When this is done the network cards will disappear from tools like ip.

root@node-licetus:~# ip -br addr show | grep enp
enp3s0f0         UP             fe80::a236:9fff:fedd:31bc/64 
enp3s0f1         UP             fe80::a236:9fff:fedd:31be/64 
root@node-licetus:~# echo "pci 0000:03:00.0 vfio-pci
> pci 0000:03:00.1 vfio-pci" >> /etc/dpdk/interfaces
root@node-licetus:~# systemctl restart dpdk
root@node-licetus:~# ip -br addr show | grep enp
root@node-licetus:~#
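
Appending with >> adds a duplicate entry each time the command is repeated. A small guard avoids that. This is only a sketch: add_dpdk_interface is a hypothetical helper, and it assumes the "pci <address> <driver>" line format used above.

```shell
# Append "pci <address> <driver>" to the interfaces file unless an entry
# for that PCI address already exists.
# $1: file, $2: PCI address, $3: driver (defaults to vfio-pci)
add_dpdk_interface() {
    file=$1
    addr=$2
    driver=${3:-vfio-pci}
    if [ -f "$file" ] &&
       awk -v a="$addr" '$1 == "pci" && $2 == a { found = 1 }
                         END { exit (found ? 0 : 1) }' "$file"
    then
        return 0   # already listed, nothing to do
    fi
    printf 'pci %s %s\n' "$addr" "$driver" >> "$file"
}
```

Calling add_dpdk_interface /etc/dpdk/interfaces 0000:03:00.0 twice leaves a single line for that card in the file.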


DPDK enabled OVS

There are a few global settings which need to be applied when using ovs with dpdk. The first relates to hugepages. Hugepages need to be allocated per NUMA node. First check that the hugepages have been created as requested by the kernel option specified in maas:

root@node-licetus:~# grep -i hugepages_ /proc/meminfo
HugePages_Total:   103117
HugePages_Free:    103117
HugePages_Rsvd:        0
HugePages_Surp:        0
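
The counters above are page counts, not sizes. The page size is on a separate Hugepagesize line in /proc/meminfo, which the grep above does not match. A rough sketch for turning the two into a total follows; hugepage_total_mib is a hypothetical helper and its file argument exists only to make it testable.

```shell
# Print the total hugepage memory in MiB.
# $1: meminfo-format file (defaults to /proc/meminfo)
hugepage_total_mib() {
    meminfo=${1:-/proc/meminfo}
    awk '$1 == "HugePages_Total:" { pages = $2 }
         $1 == "Hugepagesize:"    { size_kb = $2 }   # reported in kB
         END { printf "%d\n", pages * size_kb / 1024 }' "$meminfo"
}
```

The total needs to be at least as large as the memory later handed to ovs across all NUMA nodes.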

Now see how many NUMA nodes there are:

# ls -ld /sys/devices/system/node/node* | wc -l
2

I chose to allocate 4096MB to each of the NUMA nodes. This is done via the dpdk-socket-mem option, which takes a comma-delimited list of memory sizes in MB, one per NUMA node, to pre-allocate from hugepages:

root@node-licetus:~# ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="4096,4096"
root@node-licetus:~#
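
On a machine with a different number of NUMA nodes the value needs one entry per node. A sketch for building it, assuming the same amount of memory on every node (socket_mem_value is a hypothetical helper):

```shell
# Build a dpdk-socket-mem value with the same amount of memory per node.
# $1: number of NUMA nodes, $2: memory per node in MB
socket_mem_value() {
    nodes=$1
    mb=$2
    value=""
    i=0
    while [ "$i" -lt "$nodes" ]; do
        value="${value:+$value,}$mb"   # comma only between entries
        i=$((i + 1))
    done
    printf '%s\n' "$value"
}

# Possible usage (needs root and a running ovs, so left commented out):
#   nodes=$(ls -d /sys/devices/system/node/node[0-9]* | wc -l)
#   ovs-vsctl set Open_vSwitch . \
#       other_config:dpdk-socket-mem="$(socket_mem_value "$nodes" 4096)"
```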



Next, whitelist the network cards in ovs via their PCI addresses:



root@node-licetus:~# ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="--pci-whitelist 0000:03:00.0 --pci-whitelist 0000:03:00.1"
root@node-licetus:~# 
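
With more than a couple of cards the dpdk-extra string gets tedious to type. The sketch below builds it from a list of PCI addresses; pci_whitelist_args is a hypothetical helper. It emits the --pci-whitelist flag used with DPDK 18.11 here; I believe later DPDK releases renamed this option, so check the version in use.

```shell
# Build the EAL whitelist arguments for a list of PCI addresses.
# $@: PCI addresses
pci_whitelist_args() {
    args=""
    for addr in "$@"; do
        args="${args:+$args }--pci-whitelist $addr"
    done
    printf '%s\n' "$args"
}

# Possible usage (commented out because it changes the ovs configuration):
#   ovs-vsctl set Open_vSwitch . \
#       other_config:dpdk-extra="$(pci_whitelist_args 0000:03:00.0 0000:03:00.1)"
```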


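Finally, restart openvswitch-switch and check the log. Note that the "DPDK not supported" error below was logged at 13:07:54, before the daemon reopened its log at 13:08:17; the lines after that show dpdk initializing successfully: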

root@node-licetus:~# systemctl restart openvswitch-switch
root@node-licetus:~#
root@node-licetus:~# grep --color -E 'PCI|DPDK|ovs-vswitchd' /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T12:19:51.475Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T12:19:51.496Z|00007|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
2019-07-11T13:07:54.806Z|00009|dpdk|ERR|DPDK not supported in this copy of Open vSwitch.
2019-07-11T13:08:17.961Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-11T13:08:17.969Z|00007|dpdk|INFO|Using DPDK 18.11.0
2019-07-11T13:08:17.969Z|00008|dpdk|INFO|DPDK Enabled - initializing...
2019-07-11T13:08:17.969Z|00011|dpdk|INFO|Per port memory for DPDK devices disabled.
2019-07-11T13:08:17.969Z|00012|dpdk|INFO|EAL ARGS: ovs-vswitchd --pci-whitelist 0000:03:00.0 --pci-whitelist 0000:03:00.1 --socket-mem 4096,4096 --socket-limit 4096,4096 -l 0.
2019-07-11T13:08:26.915Z|00019|dpdk|INFO|EAL: PCI device 0000:03:00.0 on NUMA socket 0
2019-07-11T13:08:27.600Z|00023|dpdk|INFO|EAL: PCI device 0000:03:00.1 on NUMA socket 0
2019-07-11T13:08:28.090Z|00026|dpdk|INFO|DPDK Enabled - initialized
2019-07-11T13:08:28.097Z|00051|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
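
Because the log keeps entries from earlier runs, what matters is the most recent dpdk status line rather than whether an error appears anywhere in the file. A sketch of that check follows, assuming the pipe-delimited log format shown above; dpdk_log_status is a hypothetical helper.

```shell
# Report the most recent DPDK status recorded in the ovs-vswitchd log:
# "initialized", "error" or "unknown".
# $1: log file (defaults to /var/log/openvswitch/ovs-vswitchd.log)
dpdk_log_status() {
    log=${1:-/var/log/openvswitch/ovs-vswitchd.log}
    # Fields: timestamp | sequence | module | level | message
    awk -F'|' '$3 == "dpdk" && $4 == "ERR" { status = "error" }
               $3 == "dpdk" && $5 == "DPDK Enabled - initialized" { status = "initialized" }
               END { print (status == "" ? "unknown" : status) }' "$log"
}
```

Later lines override earlier ones, so a stale error from a previous start does not mask a successful initialization.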

Bridge with DPDK Bonded NICs

As per the OVS docs, when creating the bridge the datapath_type needs to be set to netdev to tell ovs to run in userspace mode:

root@node-licetus:~# ovs-vsctl -- add-br br-test
root@node-licetus:~# ovs-vsctl -- set bridge br-test datapath_type=netdev

Now create the bond device and attach it to the bridge:

root@node-licetus:~# ovs-vsctl --may-exist add-bond br-test dpdk-bond0 dpdk-nic1 dpdk-nic2 \
>           -- set Interface dpdk-nic1 type=dpdk options:dpdk-devargs=0000:03:00.0 \
>           -- set Interface dpdk-nic2 type=dpdk options:dpdk-devargs=0000:03:00.1
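
The command follows a regular pattern: one member interface per card, each with its own "set Interface" clause. The sketch below generates the command for any number of PCI addresses. bond_command is a hypothetical helper; it only prints the command, and it assumes the dpdk-nicN naming used above.

```shell
# Print an ovs-vsctl command that bonds the given DPDK cards.
# $1: bridge, $2: bond port name, remaining arguments: PCI addresses
bond_command() {
    bridge=$1
    bond=$2
    shift 2
    members=""
    sets=""
    n=1
    for addr in "$@"; do
        nic="dpdk-nic$n"
        members="$members $nic"
        sets="$sets -- set Interface $nic type=dpdk options:dpdk-devargs=$addr"
        n=$((n + 1))
    done
    printf 'ovs-vsctl --may-exist add-bond %s %s%s%s\n' \
        "$bridge" "$bond" "$members" "$sets"
}
```

Running bond_command br-test dpdk-bond0 0000:03:00.0 0000:03:00.1 prints the same command as above on a single line, ready to be reviewed before it is run.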

ovs-vsctl seems quite happy to create the bond even if there is a problem, so it's worth taking a moment to check that the device exists in the bridge without any errors:

root@node-licetus:~# ovs-vsctl show
181b55d1-999a-464b-adf4-d80ca1790988
    Bridge br-test
        Port br-test
            Interface br-test
                type: internal
        Port "dpdk-bond0"
            Interface "dpdk-nic1"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.0"}
            Interface "dpdk-nic2"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.1"}
    ovs_version: "2.11.0"
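
When something is wrong with an interface, ovs-vsctl show prints an error: line beneath it. That is easy to miss in a long listing, so a small filter can help. This is a sketch; ovs_show_errors is a hypothetical helper that reads the listing on stdin.

```shell
# Read "ovs-vsctl show" output on stdin, print "<interface>: <error>" for
# each interface that reports an error, and return non-zero if any did.
ovs_show_errors() {
    awk '$1 == "Interface" { iface = $2; gsub(/"/, "", iface) }
         $1 == "error:"    { sub(/^[ \t]*error:[ \t]*/, "")
                             print iface ": " $0
                             bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Possible usage:
#   ovs-vsctl show | ovs_show_errors || echo "bridge has interface errors"
```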

Part of my testing was to use jumbo frames, so the final step is to set the mtu on the dpdk devices:

root@node-licetus:~# ovs-vsctl set Interface dpdk-nic1 mtu_request=9000
root@node-licetus:~# ovs-vsctl set Interface dpdk-nic2 mtu_request=9000
root@node-licetus:~# 

Tap in Network Namespace Attached to a Bridge

Create a tap device called tap1 in the bridge:

root@node-licetus:~# ovs-vsctl add-port br-test tap1 -- set Interface tap1 type=internal
root@node-licetus:~# 

Create a network namespace called ns1 and move tap1 into it:

root@node-licetus:~# ip netns add ns1
root@node-licetus:~# ip link set tap1 netns ns1


Bring up tap1 and assign it an IP address:

root@node-licetus:~# ip netns exec ns1 ip link set dev tap1 up
root@node-licetus:~# ip netns exec ns1 ip link set dev lo up
root@node-licetus:~# ip netns exec ns1 ip addr add 172.20.0.1/24 dev tap1
root@node-licetus:~# ip netns exec ns1 ip link set dev tap1 mtu 9000

Finally, the tap needs to be on vlan 2933, which is delivered to the network cards as part of a vlan trunk. To assign tap1 to the vlan with id 2933, the tap port needs to be tagged:

root@node-licetus:~# ovs-vsctl set port tap1 tag=2933
root@node-licetus:~#
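
The steps in this section can be gathered into one function so the setup is repeatable. The following is a sketch and not the exact sequence run above: the run and attach_tap_to_namespace helpers are hypothetical, the vlan tag is set when the port is added rather than afterwards, and setting DRY_RUN=1 prints the commands instead of executing them.

```shell
# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: bridge, $2: port, $3: namespace, $4: address/prefix, $5: mtu, $6: vlan id
attach_tap_to_namespace() {
    bridge=$1 port=$2 ns=$3 cidr=$4 mtu=$5 vlan=$6
    # Create the internal port already tagged with the vlan id.
    run ovs-vsctl --may-exist add-port "$bridge" "$port" tag="$vlan" \
        -- set Interface "$port" type=internal
    # Create the namespace and move the port into it.
    run ip netns add "$ns"
    run ip link set "$port" netns "$ns"
    # Configure the devices from inside the namespace.
    run ip netns exec "$ns" ip link set dev lo up
    run ip netns exec "$ns" ip link set dev "$port" mtu "$mtu"
    run ip netns exec "$ns" ip addr add "$cidr" dev "$port"
    run ip netns exec "$ns" ip link set dev "$port" up
}
```

With DRY_RUN=1 set, attach_tap_to_namespace br-test tap1 ns1 172.20.0.1/24 9000 2933 prints the seven commands for review.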

Testing

That really is it. Below is the resulting bridge and tap interface:

root@node-licetus:~# ovs-vsctl show
181b55d1-999a-464b-adf4-d80ca1790988
    Bridge br-test
        Port "tap1"
            tag: 2933
            Interface "tap1"
                type: internal
        Port br-test
            Interface br-test
                type: internal
        Port "dpdk-bond0"
            Interface "dpdk-nic1"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.0"}
            Interface "dpdk-nic2"
                type: dpdk
                options: {dpdk-devargs="0000:03:00.1"}
    ovs_version: "2.11.0"

root@node-licetus:~# ip netns exec ns1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
12: tap1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 0a:96:ad:cd:2e:7f brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.1/24 scope global tap1
       valid_lft forever preferred_lft forever
    inet6 fe80::896:adff:fecd:2e7f/64 scope link 
       valid_lft forever preferred_lft forever

From another machine that has access to vlan 2933 I can ping the tap device:

ubuntu@node-husband:~$ ping -c3 172.20.0.1
PING 172.20.0.1 (172.20.0.1) 56(84) bytes of data.
64 bytes from 172.20.0.1: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 172.20.0.1: icmp_seq=2 ttl=64 time=0.146 ms
64 bytes from 172.20.0.1: icmp_seq=3 ttl=64 time=0.144 ms

--- 172.20.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2046ms
rtt min/avg/max/mdev = 0.144/0.168/0.215/0.034 ms
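
The ping above uses the default 56-byte payload, so it does not exercise jumbo frames. To test the 9000-byte mtu, the payload has to fill the frame and fragmentation has to be prohibited. A sketch of the arithmetic, assuming IPv4 without options (max_icmp_payload is a hypothetical helper):

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet.
# $1: mtu in bytes
max_icmp_payload() {
    mtu=$1
    # Subtract the 20-byte IPv4 header (no options) and the 8-byte ICMP header.
    echo $((mtu - 20 - 8))
}

# Possible usage from the other machine (Linux iputils ping; -M do sets the
# don't-fragment flag so an undersized path mtu shows up as an error):
#   ping -M do -c3 -s "$(max_icmp_payload 9000)" 172.20.0.1
```

This only succeeds if every hop on vlan 2933, including the sending machine's interface, is also configured for jumbo frames.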

Final Thoughts

If you want to use dpdk with OpenStack then the OpenStack charms make it easy. The charms look after all of the above and much more.

If you want a more complete guide to dpdk, try "The new simplicity to consume dpdk".