
Slow Data Rate #113

Open
turjanica-AS opened this issue Feb 10, 2022 · 28 comments

Comments

@turjanica-AS

So I have a 2 NT partition setup and am trying to send data using ntb_transport/ntb_netdev. While running qperf, or even a ftp transfer, I am only getting around 15-16 MB/s.
I'm not sure what is causing such a slow data rate.
I'm on Ubuntu 20.04 kernel 5.13.

lsgunth (Collaborator) commented Feb 10, 2022

That's an order of magnitude slower than I'd expect, even at its slowest.

Are you using a DMA engine, assuming your hardware supports one? Without that, the software has to copy over IO memory with memcpy, and there are lots of things that could slow that down.

Also, I know that setting up and using ntb_msi improves the speed significantly, but you should be able to get much higher without it.
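A quick way to check the DMA side of this before digging further (a hedged sketch; both paths are the standard locations for the dmaengine class and the ntb_transport module parameters):

ls /sys/class/dma/                                  # empty output means no dmaengine provider is registered
cat /sys/module/ntb_transport/parameters/use_dma    # must be Y for ntb_transport to even request a channel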

@turjanica-AS (Author)

I did not have the dma drivers running previously. But I loaded them and still got 16 MB/s.

I ran the ntb_perf tool and got ~60Gb/s (non DMA) and ~113Gb/s (DMA).

So it's something to do with the ntb_netdev driver when trying to do TCP/IP?

lsgunth (Collaborator) commented Feb 16, 2022

Yeah, I'm not sure. I remember it being a bit sensitive to message sizes, especially without the ntb_msi improvement. I never used qperf, but is it possible it's running with very small messages? The testing I did was with iperf. I do remember doing some tuning to increase the MTU and send larger packets, which helped some.
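The kind of tuning described here might look like the following (a hedged sketch: the eth0 interface name comes from the dmesg line quoted later in the thread, and the MTU value assumes the default 64 KiB transport MTU from which ntb_netdev derives its maximum):

ip link set dev eth0 mtu 65000       # adjust the interface name and value to your setup
iperf3 -c 192.1.1.11 -l 1M -t 30     # larger application writes over a longer run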

epilmore commented Feb 16, 2022 via email

epilmore commented Feb 16, 2022 via email

jborz27 (Collaborator) commented Feb 16, 2022

> I did not have the dma drivers running previously. But I loaded them and still got 16 MB/s.
>
> I ran the ntb_perf tool and got ~60Gb/s (non DMA) and ~113Gb/s (DMA).
>
> So it's something to do with the ntb_netdev driver when trying to do TCP/IP?

Did you also enable DMA on ntb_transport? Either way, in my experience I found ntb_perf more useful since it doesn't have the TCP/IP layer, which has inefficiencies that need to be accounted for. For example, the TCP window size and options like zero-copy sends can make a difference in performance.
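To rule out those TCP-layer effects when comparing against ntb_perf, an iperf3 run with explicit window, zero-copy, and parallel-stream options is one way to probe them (a hedged sketch; the address is the one used elsewhere in the thread and the values are illustrative):

iperf3 -s                                  # on the receiving host
iperf3 -c 192.1.1.11 -w 4M -Z -P 4 -t 30   # larger window, sendfile zero-copy, 4 parallel streams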

@turjanica-AS (Author)

Thanks, that's all very helpful.

I have run both qperf and iperf3 with the same ~16 MB/s results.

I updated my parameters:
use_dma: N -> Y
copy_bytes: 1024 -> 0
max_mw_size: 0 -> 0x80000000000000000

But no change in speed, still getting 16 MB/s.

Just as a note, I am trying to use ntb_netdev for the TCP/IP stack because we want that capability so existing applications will not have to be changed. And yes, I realize ntb_perf uses different parameters than ntb_transport.
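These are ntb_transport module parameters; use_dma in particular is only consulted when the transport queue pair is created, so the simplest way to apply a change is to reload the modules with the parameters set (a hedged sketch; assumes the drivers are built as loadable modules):

modprobe -r ntb_netdev ntb_transport
modprobe ntb_transport use_dma=1 copy_bytes=0
modprobe ntb_netdev
cat /sys/module/ntb_transport/parameters/use_dma    # confirm the parameter actually took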

lsgunth (Collaborator) commented Feb 17, 2022

What do you see in cat /sys/kernel/debug/ntb_transport/<pci_addr>/qp0/stats?

epilmore commented Feb 17, 2022 via email

@turjanica-AS (Author)

> On Thu, Feb 17, 2022 at 12:59 PM accipitersystems-Turjanica @.***> wrote: Thanks that's all very helpful. I have run both qperf and iperf3 with the same ~16MB/s results. I updated my parameters use_dma: N->Y copy_bytes: 1024->0 max_mw_size:0->0x80000000000000000 But no changes in speed, still getting 16MB/s. Just as a note, I am trying to use ntb_netdev for the TCP/IP stack as we want that capability so existing applications will not have to be changed. And yes I realize ntb_perf will be using different parameters than ntb_transport.
> I'm assuming that value for max_mw_size is a typo since it is 68-bits long! Maybe try running traceroute on your IP interface for the NTB. Eric

Yes, sorry, a typo, it's only supposed to be 15 0s, not 16.
Traceroute shows:
Server1: traceroute to 192.1.1.11 (192.1.1.11), 30 hops max, 60 byte packets
1 192.1.1.11 (192.1.1.11) 0.972 ms 0.957 ms 8.229 ms
Server2: traceroute to 192.1.1.10 (192.1.1.10), 30 hops max, 60 byte packets
1 192.1.1.10 (192.1.1.10) 0.715 ms 0.699 ms 6.648 ms

@turjanica-AS (Author)

> What do you see in cat /sys/kernel/debug/ntb_transport/<pci_addr>/qp0/stats?

Below is the output of stats.

root@carson-server1:/sys/kernel/debug/ntb_transport/0000:01:00.0/qp0# cat stats

NTB QP stats:

rx_bytes - 1473583413
rx_pkts - 23085
rx_memcpy - 23085
rx_async - 0
rx_ring_empty - 28521
rx_err_no_buf - 0
rx_err_oflow - 0
rx_err_ver - 0
rx_buff - 0x00000000ed26dcee
rx_index - 6
rx_max_entry - 7
rx_alloc_entry - 100

tx_bytes - 790495
tx_pkts - 11850
tx_memcpy - 11850
tx_async - 0
tx_ring_full - 0
tx_err_no_buf - 0
tx_mw - 0x00000000218a732e
tx_index (H) - 6
RRI (T) - 5
tx_max_entry - 7
free tx - 6

Using TX DMA - No
Using RX DMA - No
QP Link - Up
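The telling lines here are rx_memcpy/tx_memcpy matching rx_pkts/tx_pkts and the two "Using TX/RX DMA - No" flags: every byte is being copied by the CPU. One way to watch these counters live while a transfer runs (a hedged sketch using the debugfs path shown above):

watch -n1 'grep -E "memcpy|async|DMA" /sys/kernel/debug/ntb_transport/0000:01:00.0/qp0/stats'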

lsgunth (Collaborator) commented Feb 17, 2022

So, you're still not using a dma engine. It's all being memcpy'd.... Perhaps there is no dma engine? Do you see anything in /sys/class/dma? What kind of CPU do you have?

@turjanica-AS (Author)

So yes, I had forgotten to load the DMA engine driver. Although with it loaded, the stats still show No.

root@carson-server1:/sys/module/ntb_transport/parameters# cat /sys/kernel/debug/ntb_transport/0000:01:00.0/qp0/stats

NTB QP stats:

rx_bytes - 19618
rx_pkts - 128
rx_memcpy - 128
rx_async - 0
rx_ring_empty - 256
rx_err_no_buf - 0
rx_err_oflow - 0
rx_err_ver - 0
rx_buff - 0x0000000063c68030
rx_index - 2
rx_max_entry - 7
rx_alloc_entry - 100

tx_bytes - 21965
tx_pkts - 130
tx_memcpy - 130
tx_async - 0
tx_ring_full - 0
tx_err_no_buf - 0
tx_mw - 0x00000000409f3664
tx_index (H) - 4
RRI (T) - 3
tx_max_entry - 7
free tx - 6

Using TX DMA - No
Using RX DMA - No
QP Link - Up

root@carson-server1:/sys/module/ntb_transport/parameters# ls /sys/class/dma
dma0chan0 dma0chan1 dma1chan0 dma1chan1 dma2chan0 dma2chan1 dma2chan2 dma2chan3

root@carson-server2:/sys/module/ntb_transport/parameters# ls /sys/class/dma
dma0chan0 dma0chan1 dma1chan0 dma1chan1
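DMA channels are visible on both hosts, so the next question is which device provides them. A hedged way to correlate the channels with their parent devices (the lspci pattern assumes the Switchtec endpoint is listed by name):

ls -l /sys/class/dma/            # each entry is a symlink whose target shows the providing device's sysfs path
lspci -nn | grep -i switchtec    # the endpoint the switchtec-dma driver should have bound to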

epilmore commented Feb 17, 2022 via email

@turjanica-AS (Author)

> On Thu, Feb 17, 2022 at 1:56 PM accipitersystems-Turjanica @.***> wrote: So yes, I had forgotten to load the dma engine. Although with it loaded it still shows No.
> The DMA engine is acquired when the ntb_transport QP is created. You'll want to bring down the network interface, and possibly just unload and reload the ntb_netdev module, on both ends. This will cause the QP to get recreated and it should pick up the now-present DMA engine. Eric
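A rough sketch of the sequence Eric describes, run on both hosts (the out-of-tree DMA module name and the eth0 interface name are assumptions):

ip link set dev eth0 down
modprobe -r ntb_netdev ntb_transport
modprobe switchtec_dma                # DMA provider must be registered before the QP is recreated
modprobe ntb_transport use_dma=1
modprobe ntb_netdev
dmesg | grep -i 'DMA memcpy'          # expect "Using DMA memcpy for TX" and "for RX"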

So I just went back and took everything down to try and bring it back up in the proper order. One server looks good; dmesg shows "switchtec 0000:01:00.0: Using DMA memcpy for TX" and likewise for RX. But when I try to load ntb_transport on the second server, it gets stuck and shows a dmesg error:
Software Queue-Pair Transport over NTB, version 4
BUG: unable to handle page fault for address: ffffbc8dc1e37074
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 100000067 P4D 1000000067 PUD 1001d6067 PMD 119c53067 PTE 0
Oops: 0000 [#1] SMP NOPTI
CPU: 3 PID: 191 Comm: modprobe Tainted: G OE 5.13.0-30-generic #33~20.04.1-Ubuntu

There is also a scenario where the error shows up but it still allows me to start the network interface; however, pings do not go through. That error comes right after "switchtec: eth0 created" and happens when I load ntb_netdev on the other server.

CPU: 7 PID: 172 Comm: kworker/7:1 Tainted: G OE

@turjanica-AS (Author)

> So, you're still not using a dma engine. It's all being memcpy'd.... Perhaps there is no dma engine? Do you see anything in /sys/class/dma? What kind of CPU do you have?

i7-11700 CPU.

Yes I was able to see something in /sys/class/dma. My response to epilmore shows more about what I'm seeing in dmesg now.

lsgunth (Collaborator) commented Feb 18, 2022

What device provides the DMA engines? Is it part of the CPU, or is this something else? I thought only Xeon CPUs had DMA engines in them, but I could be wrong about that.

@turjanica-AS (Author)

> What device provides the DMA engines? Is it part of the CPU, or is this something else? I thought only Xeon CPUs had DMA engines in them, but I could be wrong about that.

I'm using a Switchtec PFX chip, which has a hardware DMA engine; the switchtec-dma kernel module exports access to the Switchtec DMA engine to the upper-layer host software.

jborz27 (Collaborator) commented Feb 18, 2022

Yeah, it should be the Switchtec DMA. That's what ntb_perf uses when the Switchtec DMA driver is loaded.

@turjanica-AS (Author)

Yeah, which is strange, since ntb_perf has no issues but ntb_netdev/ntb_transport does.

Another wrinkle to the error shown above: the one server shows that error, and the other server shows the same dmesg line repeated...

switchtec 0000:01:00.0: Remote version = 0

Hundreds of them, so is something messed up with switchtec-kernel?

jborz27 (Collaborator) commented Feb 18, 2022

That log originates from ntb_transport.c:

/* query the remote side for its info */
val = ntb_spad_read(ndev, VERSION);
dev_dbg(&pdev->dev, "Remote version = %d\n", val);
if (val != NTB_TRANSPORT_VERSION)
	goto out;
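"Remote version = 0" most likely means the local side read 0 from the peer's VERSION scratchpad, i.e. the peer's ntb_transport never got far enough to write it, which would fit the crash during modprobe on the other host. A hedged way to check from the shell while the message repeats (assuming the debugfs path from earlier in the thread and that the queue was created):

grep 'QP Link' /sys/kernel/debug/ntb_transport/0000:01:00.0/qp0/stats   # stays Down until both sides finish link setup
dmesg | grep -iE 'ntb|switchtec' | tail -n 20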

lsgunth (Collaborator) commented Feb 18, 2022

Oh, hmm. Rather sounds like a bug in the switchtec-dma module. I'm not all that familiar with it. Maybe post the full BUG, including the full traceback?

epilmore commented Feb 18, 2022 via email

@turjanica-AS (Author)

> On Fri, Feb 18, 2022 at 2:04 PM Logan Gunthorpe @.***> wrote: Oh, hmm. Rather sounds like a bug in the switchtec-dma module. I'm not all that familiar with it. Maybe post the full BUG, including the full traceback?
> I don't think the Switchtec DMA will be generally compatible with ntb_transport usage. The Switchtec DMA requires that the Source and Destination addresses have the exact same word-byte alignment. If your Source/Dest are always on (at least) word boundaries, then you should be good to go, but if not, then the DMA engine will not be happy. In our usage of Switchtec DMA, we've only been able to leverage it for usages where there are "block" transfers happening, where the Source/Dest addresses are definitely at least word aligned. In ntb_netdev/ntb_transport usage, you don't really have a guarantee that both Source and Dest addresses will be aligned equally, i.e. (S=0x0,D=0x0), (S=0x1,D=0x1), (S=0x2,D=0x2), (S=0x3,D=0x3). Eric

Hmm, then is there something you would suggest for moving data using TCP/IP, if netdev/transport doesn't play well?

Not sure if this also answers the question of why I'm getting 16 MB/s using non-DMA netdev/transport.

If you would like, here are the outputs and what I was doing before the error. I did the full setup on Host 1 first, then started Host 2. Host 1 eventually freezes; the mouse moves but it won't take any input.
Netdev_DMAChannel_Setup_Remote_Version_Host2.txt
Netdev_DMAChannel_walkthrough_Host1.txt

epilmore commented Feb 19, 2022 via email

lsgunth (Collaborator) commented Feb 19, 2022

memcpy can perform worse when dealing with uncached IO memory, as in this application. I remember, a long time ago, having abysmal performance because the kernel had optimize-for-size set and memcpy was therefore copying one byte at a time, which meant one TLP on the PCI bus per byte. Not efficient.
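A hedged way to check for that particular build option on the running kernel (the config path is the usual Ubuntu location):

grep -E 'CONFIG_CC_OPTIMIZE_FOR_(SIZE|PERFORMANCE)' /boot/config-$(uname -r)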

@turjanica-AS (Author)

> On Fri, Feb 18, 2022 at 2:46 PM accipitersystems-Turjanica @.***> wrote: Hmm then is there something you would suggest to move data using TCP/IP, if netdev/transport doesn't play well.
> Across NTB, ntb_netdev/ntb_transport is your only option short of writing your own version, although that won't solve your problem anyway. The issue is not so much the fault of ntb_netdev/ntb_transport, but rather the nature of data going through the general Linux netdev (TCP/IP) stack, i.e. skb's. Most DMA engines nowadays don't have this alignment restriction that the Switchtec DMA does. If your CPU is an Intel, then you might have IOAT available. Or if it is AMD, it also comes with some embedded DMA engines as part of its Crypto engine. I presume you have a host bus adapter card that connects your host to the Switchtec switch? Possibly any DMA engines available on that card?
> Not sure if this also answers the question of why I'm getting 16MB/s using non-dma netdev/transport.
> If a DMA engine is not available, then the data is moved by the CPU with good old fashioned memcpy(). Depending on the platform, memcpy() could be optimized to leverage possible vector registers that effectively allow more data to be moved per CPU operation versus just a simple 4 or 8 bytes per "store" operation. I don't recall what CPU you are using. Even for a CPU memcpy, 16MB/s seems low, but I haven't measured it lately to know what a reasonable range is to expect. Eric

So would the solution be to modify switchtec-dma to make it act like most current DMA engines? That seems like less hassle than writing a new netdev/transport.

Our CPU is an i7-11700, and we are using the Switchtec development board with the ADP_EDGE adapters, so there is no DMA engine on the NICs. This dev board is being used until our HW NICs are built, but those NICs will just have a PFX on them.
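For reference, a hedged way to look for the alternative DMA providers Eric mentions (IOAT on Intel server parts, the ptdma crypto-engine DMA on AMD); on a desktop i7 these will most likely not be present:

lsmod | grep -E 'ioatdma|ptdma'
modprobe ioatdma 2>/dev/null; modprobe ptdma 2>/dev/null
ls /sys/class/dma/               # any newly registered channels would show up here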

epilmore commented Feb 21, 2022 via email
