Virtio-Net on Multicore ARM System

Has anyone else done work with the ARM virtio-net connection on a multicore configuration?

For anyone wondering about our speeds: we did some work to get data into and out of a VM’s memory space faster, since the default configuration limits the channel to roughly 40 Mbit/s.

Here is the basic setup:

VM0: 192.168.1.100 - Core 1
VM1: 192.168.1.101 - Core 2

VM0 runs iperf3 -s and VM1 runs iperf3 -c 192.168.1.100. With this basic setup, we get these errors:

camkes_virtqueue_buffer_alloc@virtqueue.c:32 Error: ran out of memory
camkes_virtqueue_driver_scatter_send_buffer@virtqueue.c:191 Error: could not allocate virtqueue buffer
tx_virtqueue_forward@virtio_net_virtqueue.c:82 Unknown error while enqueuing available buffer for dest 0:0:0:0:0:1.
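
For context on why the allocation fails: each outgoing frame has to be carved out of the fixed shared-memory region backing the connection before it can be put on the ring, so a sender that outruns the receiver exhausts the pool. A minimal sketch of that failure mode (the slot size and the names here are illustrative, not the actual global-components allocator):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed pool carved out of the *_send_shmem_size region (32768 bytes by
 * default). Once every slot is in flight, allocation fails until the peer
 * hands buffers back, which is the "ran out of memory" print above. */
#define SHMEM_SIZE  32768
#define SLOT_SIZE   512                      /* assumed per-buffer granularity */
#define NUM_SLOTS   (SHMEM_SIZE / SLOT_SIZE)

static uint8_t shmem[SHMEM_SIZE];
static bool in_use[NUM_SLOTS];

static void *slot_alloc(void)
{
    for (size_t i = 0; i < NUM_SLOTS; i++) {
        if (!in_use[i]) {
            in_use[i] = true;
            return &shmem[i * SLOT_SIZE];
        }
    }
    return NULL;    /* pool exhausted: the sender has outrun the receiver */
}

static void slot_free(void *buf)
{
    size_t i = (size_t)((uint8_t *)buf - shmem) / SLOT_SIZE;
    in_use[i] = false;
}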

While the iperf test works, the constant prints are rather annoying, and the bitrate tanks.

We were able to remove those errors by increasing the shared memory and queue size:

diff --git a/components/VM_Arm/configurations/vswitch_connections.h b/components/VM_Arm/configurations/vswitch_connections.h
index 952c054..48cde0b 100644
--- a/components/VM_Arm/configurations/vswitch_connections.h
+++ b/components/VM_Arm/configurations/vswitch_connections.h
@@ -99,8 +99,8 @@
     vm##base_id.ether_##target_id##_recv_id = idx * 2 + 1;                               \
     vm##base_id.ether_##target_id##_recv_attributes = VAR_STRINGIZE(target_id##base_id); \
     vm##base_id.ether_##target_id##_recv_badge = CONNECTION_BADGE;                       \
-    vm##base_id.ether_##target_id##_send_shmem_size = 32768;                             \
-    vm##base_id.ether_##target_id##_recv_shmem_size = 32768;
+    vm##base_id.ether_##target_id##_send_shmem_size = 32768 * 16;                        \
+    vm##base_id.ether_##target_id##_recv_shmem_size = 32768 * 16;

 // Add macaddress to virtqueue mapping. Called per connection per vm
 #define __ADD_MACADDR_MAPPING(base_id, vm_id, idx) \
@@ -153,7 +153,8 @@

 #define VM_CONNECTION_CONFIG(to_end, topology) \
     topology(__CONFIG_EXPAND_PERVM)            \
-        to_end##_topology = [topology(__CONFIG_EXPAND_TOPOLOGY)];
+        to_end##_topology = [topology(__CONFIG_EXPAND_TOPOLOGY)]; \
+        topology##_conn.queue_length = 256 * 16;

 #define VM_CONNECTION_INIT_HANDLER                                                                          \
{                                                                                                       \

At this point, the system throughput has drastically improved:

root@xilinx-zcu102-2021_1:~# iperf3 -c 192.168.1.100
Connecting to host 192.168.1.100, port 5201
Accepted connection from 192.168.1.101, port 40592
[  5] local 192.168.1.101 port 40594 connected to 192.168.1.100 port 5201
[  5] local 192.168.1.100 port 5201 connected to 192.168.1.101 port 40594
[ ID] Interval           Transfer     Bitrate
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  62.4 MBytes   523 Mbits/sec
[  5]   0.00-1.01   sec  63.4 MBytes   529 Mbits/sec    0    327 KBytes
[  5]   1.00-2.00   sec  63.1 MBytes   529 Mbits/sec
[  5]   1.01-2.00   sec  63.4 MBytes   535 Mbits/sec   58    297 KBytes
[  5]   2.00-3.00   sec  63.7 MBytes   535 Mbits/sec
[  5]   2.00-3.01   sec  64.1 MBytes   535 Mbits/sec    8    286 KBytes

However, the test doesn’t run for long before the throughput drops to zero.

[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes       
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes       
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes

This time it seems to have happened because the server side is stuck in the virtio_net_notify_free_send loop. But we’ve seen similar behavior before where both VMs remain responsive and active, can still ping each other, and can even run the iperf3 test in the reverse direction without issue.
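
The spin looks like the free-send path polling a ring index that never advances. A small watchdog along these lines (names are hypothetical, not the actual virtio_net_notify_free_send body) is one way to tell whether the peer has genuinely stopped making progress or whether its index update simply isn’t visible on this core:

#include <stdbool.h>
#include <stdint.h>

#define SPIN_LIMIT 1000000u

/* Returns true if the peer advanced the index within SPIN_LIMIT polls,
 * false if it appears wedged. volatile forces a fresh read each iteration. */
static bool ring_made_progress(volatile uint32_t *idx, uint32_t last_seen)
{
    for (uint32_t spins = 0; spins < SPIN_LIMIT; spins++) {
        if (*idx != last_seen) {
            return true;
        }
    }
    return false;
}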

So at this point, I did another round of stability fixes: removing excess memcpy calls, checking that data hasn’t queued beyond a packet size in the virtqueues, and so on. I can get the iperf3 test to run for days at high throughput when the server’s output is piped to /dev/null, but eventually the system fails in the same way: throughput drops to zero.
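
One of those checks, roughly (the limit and the name are assumptions on my part): reject anything larger than a single Ethernet frame before it leaves the virtqueue, since a longer “frame” means data from consecutive packets has run together in the ring.

#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAME_LEN 1518   /* 1500-byte MTU plus Ethernet header and FCS */

/* Guard of this kind around the virtqueue read path: a "frame" longer than
 * one Ethernet frame indicates the queue contents have been corrupted or
 * concatenated, so drop it instead of forwarding garbage to the guest. */
static bool frame_len_is_sane(size_t len)
{
    return len > 0 && len <= MAX_FRAME_LEN;
}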

Other things I’ve tried:

  • Changing BLOCK_SIZE to 2048 (greater than the MTU) and replacing the scatter calls with standard calls
  • Calculating a “sum-of-bytes” checksum in the virtio_net_emul layer and validating the checksum upon virtqueue read (sketched after this list)
  • Modifying the template to expose the frame caps for the shared memory region, allowing for cache operations
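
The checksum mentioned above is about as simple as it sounds; a sketch (the function name and where it gets called are mine, not the upstream virtio_net_emul code):

#include <stddef.h>
#include <stdint.h>

/* Sum-of-bytes checksum computed when a frame enters the emulation layer and
 * re-computed when it is read back out of the virtqueue; a mismatch points at
 * corruption inside the shared-memory ring rather than in the guest network
 * stack. */
static uint32_t sum_of_bytes(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++) {
        sum += p[i];
    }
    return sum;
}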

So I guess my question is this: has anyone else worked with the virtio-net interface on a multicore system? It seems like the existing implementation was designed with a single-core system in mind. Does anyone have advice or insight on how to rearchitect it to handle multicore better?

Hi @chrisguikema, are you able to make your test code available anywhere?

Is the looping in virtio_net_notify_free_send caused by data corruption in the vqueue structure? That could happen if there aren’t sufficient memory synchronization barriers in the virtqueue operations to allow one of the ends to be moved onto a different core “transparently”.
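
To make that concrete, the usual pattern is a release barrier on the producer between writing the buffer and publishing the new ring index, paired with an acquire barrier on the consumer before it touches the data; without them, two VCPUs on different cores can observe the index update before the buffer contents. A minimal single-producer/single-consumer sketch (layout and names are illustrative, not the libvirtqueue structures; no full-ring check, for brevity):

#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 256

struct ring {
    _Atomic uint32_t head;      /* published by the producer */
    void *slot[RING_SIZE];
};

/* Producer: write the slot first, then publish the index with release
 * semantics so the slot contents are visible before the index moves. */
static void ring_push(struct ring *r, void *buf)
{
    uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    r->slot[h % RING_SIZE] = buf;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
}

/* Consumer: acquire-load the index so that, once a new value is seen on
 * another core, the corresponding slot write is guaranteed to be visible. */
static void *ring_pop(struct ring *r, uint32_t *tail)
{
    uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (h == *tail) {
        return NULL;            /* ring is empty */
    }
    return r->slot[(*tail)++ % RING_SIZE];
}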

@kent-mcleod2 I’ll have to check if I’m able to push that code publicly for review.

One thing I haven’t tried yet is adding mutexes so that both sides cannot access the virtqueues simultaneously. I’ll put that together and see whether it works.
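
A minimal sketch of that idea, assuming a spare word in the shared region can hold the lock (C11 atomics over shared, cacheable memory; the CAmkES-provided lock interfaces would be the alternative):

#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-set spinlock placed in the shared virtqueue region so that both
 * ends serialize their ring updates regardless of which core they run on.
 * Illustrative only. */
typedef struct {
    _Atomic bool locked;
} vq_lock_t;

static void vq_lock(vq_lock_t *l)
{
    bool expected = false;
    while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, true,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        expected = false;       /* CAS failed; reset expected and retry */
    }
}

static void vq_unlock(vq_lock_t *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}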