Has anyone else done work with the ARM virtio-net connection on a multicore configuration?
For anyone wondering about our speeds: we did some work to get data in and out of a VM’s memory space faster, since the default configuration limits the channel to roughly 40 Mbit/s.
Here is the basic setup:
VM0: 192.168.1.100 - Core 1
VM1: 192.168.1.101 - Core 2
VM0 runs iperf3 -s, VM1 runs iperf3 -c 192.168.1.100. From this basic setup, we get these errors:
camkes_virtqueue_buffer_alloc@virtqueue.c:32 Error: ran out of memory
camkes_virtqueue_driver_scatter_send_buffer@virtqueue.c:191 Error: could not allocate virtqueue buffer
tx_virtqueue_forward@virtio_net_virtqueue.c:82 Unknown error while enqueuing available buffer for dest 0:0:0:0:0:1.
While the iperf test works, the constant prints are rather annoying, and the bitrate tanks.
We were able to remove those errors by increasing the shared memory and queue size:
diff --git a/components/VM_Arm/configurations/vswitch_connections.h b/components/VM_Arm/configurations/vswitch_connections.h
index 952c054..48cde0b 100644
--- a/components/VM_Arm/configurations/vswitch_connections.h
+++ b/components/VM_Arm/configurations/vswitch_connections.h
@@ -99,8 +99,8 @@
vm##base_id.ether_##target_id##_recv_id = idx * 2 + 1; \
vm##base_id.ether_##target_id##_recv_attributes = VAR_STRINGIZE(target_id##base_id); \
vm##base_id.ether_##target_id##_recv_badge = CONNECTION_BADGE; \
- vm##base_id.ether_##target_id##_send_shmem_size = 32768; \
- vm##base_id.ether_##target_id##_recv_shmem_size = 32768;
+ vm##base_id.ether_##target_id##_send_shmem_size = 32768 * 16; \
+ vm##base_id.ether_##target_id##_recv_shmem_size = 32768 * 16;
// Add macaddress to virtqueue mapping. Called per connection per vm
#define __ADD_MACADDR_MAPPING(base_id, vm_id, idx) \
@@ -153,7 +153,8 @@
#define VM_CONNECTION_CONFIG(to_end, topology) \
topology(__CONFIG_EXPAND_PERVM) \
- to_end##_topology = [topology(__CONFIG_EXPAND_TOPOLOGY)];
+ to_end##_topology = [topology(__CONFIG_EXPAND_TOPOLOGY)]; \
+ topology##_conn.queue_length = 256 * 16;
#define VM_CONNECTION_INIT_HANDLER \
{ \
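In case it helps anyone reproducing this: my understanding, which may well be off, is that camkes_virtqueue_buffer_alloc carves its buffers out of that per-direction shmem region, so the region needs to back roughly as many in-flight buffers as queue_length allows, which is why both numbers got scaled by 16 together. A back-of-the-envelope check, with an assumed allocation unit since BLOCK_SIZE isn't defined in vswitch_connections.h:

/* Illustrative arithmetic only -- BLOCK_SIZE here is an assumed per-buffer
 * allocation unit from our build, not a value from vswitch_connections.h. */
#include <stdio.h>

#define BLOCK_SIZE   128              /* assumed virtqueue allocation unit */
#define SHMEM_SIZE   (32768 * 16)     /* send/recv shmem size after patch  */
#define QUEUE_LENGTH (256 * 16)       /* queue_length after patch          */

int main(void)
{
    unsigned buffers = SHMEM_SIZE / BLOCK_SIZE;
    printf("shmem can back %u buffers for %u queue slots\n",
           buffers, QUEUE_LENGTH);
    if (buffers < QUEUE_LENGTH) {
        /* The "ran out of memory" print comes back once the queue can post
         * more descriptors than the pool can supply. */
        printf("warning: pool smaller than queue depth\n");
    }
    return 0;
}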
At this point, the system throughput has drastically improved:
Client (VM1, 192.168.1.101):
root@xilinx-zcu102-2021_1:~# iperf3 -c 192.168.1.100
Connecting to host 192.168.1.100, port 5201
[  5] local 192.168.1.101 port 40594 connected to 192.168.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.01   sec  63.4 MBytes   529 Mbits/sec    0    327 KBytes
[  5]   1.01-2.00   sec  63.4 MBytes   535 Mbits/sec   58    297 KBytes
[  5]   2.00-3.01   sec  64.1 MBytes   535 Mbits/sec    8    286 KBytes

Server (VM0, 192.168.1.100):
Accepted connection from 192.168.1.101, port 40592
[  5] local 192.168.1.100 port 5201 connected to 192.168.1.101 port 40594
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  62.4 MBytes   523 Mbits/sec
[  5]   1.00-2.00   sec  63.1 MBytes   529 Mbits/sec
[  5]   2.00-3.00   sec  63.7 MBytes   535 Mbits/sec
However, the test doesn’t run for long before the throughput drops to zero:
[ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes
[ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
[ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes
[ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
[ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
This time it seems to have happened because the server side is stuck in the virtio_net_notify_free_send loop. But we’ve seen similar behavior before where both VMs remain responsive and active, can still ping each other, and can even run the iperf3 test in the reverse direction without any problem.
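To make that failure mode concrete, this is roughly the shape of the loop I mean, with my own names and a deliberately simplified ring rather than the actual camkes-vm code. The interesting part for multicore is the index published by the other core: if the consumer never observes it advancing (or observes it before the corresponding buffer slots are visible), nothing gets reclaimed and the sender’s free pool eventually empties for good.

/* Hypothetical sketch of a TX-complete drain loop (names are mine, not the
 * camkes-vm implementation). */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8

struct used_ring {
    _Atomic uint32_t used_idx;        /* written by the peer VM's core     */
    uint32_t         last_seen_idx;   /* private to this core              */
    void            *buf[RING_SIZE];  /* buffers handed back for reuse     */
};

/* Reclaim every buffer the peer has marked as used; returns the number of
 * buffers freed back to the pool. */
static unsigned drain_used(struct used_ring *r)
{
    unsigned freed = 0;
    /* Acquire pairs with the peer's release store of used_idx, so the slot
     * contents written before that store are visible here. */
    uint32_t avail = atomic_load_explicit(&r->used_idx, memory_order_acquire);

    while (r->last_seen_idx != avail) {
        void *b = r->buf[r->last_seen_idx % RING_SIZE];
        (void)b;                      /* real code would return b to the pool */
        r->last_seen_idx++;
        freed++;
    }
    return freed;
}

int main(void)
{
    struct used_ring r = { 0 };
    atomic_store_explicit(&r.used_idx, 3, memory_order_release);
    printf("reclaimed %u buffers\n", drain_used(&r));
    return 0;
}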
So at this point I did another round of stability fixes: removing excess memcpy calls, checking that data hasn’t queued beyond a packet size in the virtqueues, and so on. With the server’s printing piped to /dev/null, I can get the iperf3 test to run for days at high throughput, but eventually the system crashes in the same way: throughput drops to zero.
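The “hasn’t queued beyond a packet size” check boils down to a guard like the following when gathering a scattered frame back out of the virtqueue (again, my own simplified names, not the real virtio_net code):

/* Hypothetical sanity check: refuse descriptor chains whose total length
 * exceeds one Ethernet frame instead of copying past the staging buffer. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define MAX_FRAME_LEN 1514   /* assumed Ethernet MTU + header, no VLAN tag */

struct chain_seg {
    const void *data;
    size_t      len;
};

/* Copy a descriptor chain into frame[], or return -1 if the chain claims
 * more bytes than a single frame can hold (corrupted length field, or data
 * from two frames run together). */
static ssize_t gather_frame(const struct chain_seg *segs, size_t nsegs,
                            uint8_t frame[MAX_FRAME_LEN])
{
    size_t total = 0;
    for (size_t i = 0; i < nsegs; i++) {
        if (total + segs[i].len > MAX_FRAME_LEN) {
            return -1;   /* would overflow: drop and log rather than copy */
        }
        memcpy(frame + total, segs[i].data, segs[i].len);
        total += segs[i].len;
    }
    return (ssize_t)total;
}

int main(void)
{
    uint8_t frame[MAX_FRAME_LEN];
    uint8_t payload[64] = { 0xde, 0xad };
    struct chain_seg segs[] = { { payload, sizeof payload } };
    printf("gathered %zd bytes\n", gather_frame(segs, 1, frame));
    return 0;
}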
Other things I’ve tried:
- Changing BLOCK_SIZE to 2048 (greater than the MTU) and replacing the scatter calls with standard calls
- Calculating a “sum-of-bytes” checksum in the virtio_net_emul layer and validating it upon virtqueue read (a sketch of what I mean is below this list)
- Modifying the template to expose the frame caps for the shared memory region, allowing for cache operations
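For the checksum experiment, this is the kind of thing I mean, with my own names rather than the actual virtio_net_emul hooks: the sender appends a running byte sum to the frame before it is enqueued, and the receiver recomputes it right after the copy out of shared memory, so a mismatch points at the shared-memory path rather than the guest network stack.

/* Illustrative "sum of bytes" trailer check; names are mine, not the
 * virtio_net_emul API. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t byte_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Sender side: stash the checksum in a trailer after the payload. */
static size_t append_checksum(uint8_t *buf, size_t len)
{
    uint32_t sum = byte_sum(buf, len);
    memcpy(buf + len, &sum, sizeof sum);
    return len + sizeof sum;
}

/* Receiver side: 1 on match, 0 if something corrupted the frame on its way
 * through the shared-memory path. */
static int verify_checksum(const uint8_t *buf, size_t total_len)
{
    uint32_t stored;
    size_t payload_len = total_len - sizeof stored;
    memcpy(&stored, buf + payload_len, sizeof stored);
    return byte_sum(buf, payload_len) == stored;
}

int main(void)
{
    uint8_t frame[64 + sizeof(uint32_t)] = "test payload";
    size_t len = append_checksum(frame, 64);
    printf("checksum %s\n", verify_checksum(frame, len) ? "ok" : "MISMATCH");
    return 0;
}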
So I guess my question is this: has anyone else worked with the virtio-net interface on a multicore system? It seems like the existing implementation was designed with a single-core system in mind. Does anyone have advice or insight on how to rearchitect the system to handle multicore better?