Slow Data Rate #113
That's an order of magnitude slower than I'd expect, even at its slowest. Are you using a DMA engine, assuming your hardware supports one? Without that, the software has to copy over IO memory with memcpy, and there are lots of things that could slow that down. Also, I know that if you set up and use ntb_msi it improves the speed significantly, but you should be able to get much higher without it.
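For example, a quick way to check whether any DMA engine is registered at all (device names will vary by platform):
ls /sys/class/dma/
dmesg | grep -i dma
If /sys/class/dma/ comes up empty, everything is going through memcpy.
|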
I did not have the dma drivers running previously. But I loaded them and still got 16 MB/s. I ran the ntb_perf tool and got ~60Gb/s (non DMA) and ~113Gb/s (DMA). So it's something to do with the ntb_netdev driver when trying to do TCP/IP? |
Yeah, I'm not sure. I remember it being a bit sensitive to message sizes, especially without the ntb_msi improvement. I never used qperf, but is it possible it's running with very small messages? The testing I did was with iperf. I do remember doing some tuning to increase the MTU and send larger packets, which helped some.
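For example, something along these lines is roughly what I mean (interface name and sizes are only illustrative; the usable MTU is capped by what ntb_transport advertises):
ip link set dev eth0 mtu 16384
iperf -c <peer_ip> -l 64K -t 30
|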
On Wed, Feb 16, 2022 at 8:35 AM accipitersystems-Turjanica ***@***.***> wrote:
I did not have the dma drivers running previously. But I loaded them and still got 16 MB/s.
I ran the ntb_perf tool and got ~60Gb/s (non DMA) and ~113Gb/s (DMA).
So it's something to do with the ntb_netdev driver when trying to do TCP/IP?
I would suggest looking at the ntb_transport module parameter "copy_bytes" in /sys/module/ntb_transport/parameters/copy_bytes. You may also want to look at /sys/module/ntb_transport/parameters/transport_mtu. The "copy_bytes" parameter defines a threshold below which the data will be moved with a simple memcpy; above the threshold it will utilize the DMA engine. The ntb_netdev module utilizes ntb_transport to implement the QPs used for communication.
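For example, to check the current values (and, if the sysfs permissions allow it, lower the threshold so more transfers take the DMA path):
cat /sys/module/ntb_transport/parameters/copy_bytes
cat /sys/module/ntb_transport/parameters/transport_mtu
echo 0 > /sys/module/ntb_transport/parameters/copy_bytes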
Eric
|
I should also point out that ntb_perf does NOT utilize ntb_transport, and so is governed by its own parameters with respect to memcpy vs DMA when moving data.
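For example, you can see exactly which knobs ntb_perf exposes with:
ls /sys/module/ntb_perf/parameters/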
Eric
|
Did you also enable DMA on ntb_transport? Either way, in my experience I found ntb_perf more useful since it doesn't have the TCP/IP layer, which has inefficiencies that need to be accounted for. For example, the TCP window size and options like zero-copy buffers can make a difference in performance.
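For example, with iperf3 (the window size here is just a starting point to experiment with):
iperf3 -c <peer_ip> -w 4M -Z -t 30
|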
Thanks, that's all very helpful. I have run both qperf and iperf3 with the same ~16 MB/s results. I updated my parameters, but no change in speed; still getting 16 MB/s. Just as a note, I am trying to use ntb_netdev for the TCP/IP stack because we want that capability so existing applications will not have to be changed. And yes, I realize ntb_perf uses different parameters than ntb_transport. |
What do you see in cat /sys/kernel/debug/ntb_transport/<pci_addr>/qp0/stats? |
On Thu, Feb 17, 2022 at 12:59 PM accipitersystems-Turjanica ***@***.***> wrote:
Thanks that's all very helpful.
I have run both qperf and iperf3 with the same ~16MB/s results.
I updated my parameters
use_dma: N->Y
copy_bytes: 1024->0
max_mw_size:0->0x80000000000000000
But no changes in speed, still getting 16MB/s.
Just as a note, I am trying to use ntb_netdev for the TCP/IP stack as we want that capability so existing applications will not have to be changed. And yes I realize ntb_perf will be using different parameters than ntb_transport.
I'm assuming that value for max_mw_size is a typo, since it is 68 bits long!
Maybe try running traceroute on your IP interface for the NTB.
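Also, if you want those settings applied consistently on both hosts, they can be given at module load time, e.g. (values taken from your list above):
modprobe -r ntb_netdev ntb_transport
modprobe ntb_transport use_dma=1 copy_bytes=0
modprobe ntb_netdev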
Eric
|
Yes, sorry, a typo; it's only supposed to be 15 zeros, not 16. |
Below is the output of stats.
root@carson-server1:/sys/kernel/debug/ntb_transport/0000:01:00.0/qp0# cat stats
NTB QP stats:
rx_bytes - 1473583413
tx_bytes - 790495
Using TX DMA - No
|
So, you're still not using a dma engine. It's all being memcpy'd.... Perhaps there is no dma engine? Do you see anything in /sys/class/dma? What kind of CPU do you have? |
So yes, I had forgotten to load the dma engine. Although with it loaded it still shows No.
root@carson-server1:/sys/module/ntb_transport/parameters# cat /sys/kernel/debug/ntb_transport/0000:01:00.0/qp0/stats
NTB QP stats:
rx_bytes - 19618
tx_bytes - 21965
Using TX DMA - No
root@carson-server1:/sys/module/ntb_transport/parameters# ls /sys/class/dma
root@carson-server2:/sys/module/ntb_transport/parameters# ls /sys/class/dma
|
On Thu, Feb 17, 2022 at 1:56 PM accipitersystems-Turjanica ***@***.***> wrote:
So yes, I had forgotten to load the dma engine. Although with it loaded it still shows No.
The DMA engine is acquired when the ntb_transport QP is created. You'll want to bring down the network interface, and possibly just unload and reload the ntb_netdev module, on both ends. This will cause the QP to get recreated, and it should pick up the now-present DMA engine.
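Roughly, on both hosts (the interface name is whatever ntb_netdev created; eth0 in your case):
ip link set dev eth0 down
modprobe -r ntb_netdev
modprobe ntb_netdev
ip link set dev eth0 up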
Eric
|
So I just went back and took everything down to try to bring it back up in the proper order. One server looks good; dmesg shows "switchtec 0000:01:00.0: Using DMA memcpy for TX" and the same for RX. But when I try to load ntb_transport on the second server, it gets stuck and shows a dmesg error.... There is also a scenario where the error shows but it still allows me to start the network interface, just not send pings. The error comes right after "switchtec: eth0 created" and happens when I load ntb_netdev on the other server. The trace starts with: CPU: 7 PID: 172 Comm: kworker/7:1 Tainted: G OE |
i7-11700 CPU. Yes I was able to see something in /sys/class/dma. My response to epilmore shows more about what I'm seeing in dmesg now. |
What device provides the DMA engines? Is it part of the CPU, or is this something else? I thought only Xeon CPUs had DMA engines in them, but I could be wrong about that. |
I'm using a PFX switchtec chip which has the hardware DMA engine, and the switchtec-dma kernel module exports the access to the Switchtec DMA engine to the upper layer host software. |
Yeah, it should be the Switchtec DMA. That's what ntb_perf uses when the Switchtec DMA driver is loaded. |
Yeah, which is strange: ntb_perf is having no issues but ntb_netdev/ntb_transport is. Another wrinkle to the error I showed above: the one server shows that error and the other server shows a repeat of the dmesg message... switchtec 0000:01:00.0: Remote version = 0 ...hundreds of them, so is something messed up with the switchtec-kernel? |
That log originates from ntb_transport.c.
|
Oh, hmm. Rather sounds like a bug in the switchtec-dma module. I'm not all that familiar with it. Maybe post the full BUG, including the full traceback? |
On Fri, Feb 18, 2022 at 2:04 PM Logan Gunthorpe ***@***.***> wrote:
Oh, hmm. Rather sounds like a bug in the switchtec-dma module. I'm not all that familiar with it. Maybe post the full BUG, including the full traceback?
I don't think the Switchtec DMA will be generally compatible with ntb_transport usage. The Switchtec DMA requires that the Source and Destination addresses have the exact same word-byte alignment. If your Source/Dest are always on (at least) word boundaries, then you should be good to go, but if not, then the DMA engine will not be happy. In our usage of Switchtec DMA, we've only been able to leverage it for usages where there are "block" transfers happening, where the Source/Dest addresses are definitely at least word aligned. In ntb_netdev/ntb_transport usage, you don't really have a guarantee that both Source and Dest addresses will be aligned equally, i.e. (S=0x0,D=0x0), (S=0x1,D=0x1), (S=0x2,D=0x2), (S=0x3,D=0x3).
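Just to illustrate the constraint (addresses made up, and assuming a 4-byte word):
src=0x1001; dst=0x2003
echo "src offset $((src & 0x3)), dst offset $((dst & 0x3))"
That prints "src offset 1, dst offset 3" - mismatched low bits, so that pair could not be handed to the Switchtec DMA, whereas a memcpy doesn't care.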
Eric
|
Hmm, then is there something you would suggest for moving data using TCP/IP, if netdev/transport doesn't play well? Not sure if this also answers the question of why I'm getting 16 MB/s using non-DMA netdev/transport. If you would like, here are the outputs and what I was doing before the error. I did the full setup on Host 1 first, then started Host 2. Host 1 eventually freezes; the mouse moves but it won't take any input. |
On Fri, Feb 18, 2022 at 2:46 PM accipitersystems-Turjanica ***@***.***> wrote:
Hmm then is there something you would suggest to move data using TCP/IP, if netdev/transport doesn't play well.
Across NTB, ntb_netdev/ntb_transport is your only option short of writing your own version, although that won't solve your problem anyway. The issue is not so much the fault of ntb_netdev/ntb_transport, but rather the nature of data going through the general Linux netdev (TCP/IP) stack, i.e. skb's. Most DMA engines nowadays don't have this alignment restriction that the Switchtec DMA does.
If your CPU is an Intel, then you might have IOAT available. Or if it is AMD, it also comes with some embedded DMA engines as part of its Crypto engine. I presume you have a host bus adapter card that connects your host to the Switchtec switch? Possibly any DMA engines available on that card?
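For example, to see whether an IOAT (or any other) DMA provider actually shows up (ioatdma is the usual module name on Intel, but whether your CPU has the hardware is another question):
modprobe ioatdma
ls /sys/class/dma/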
Not sure if this also answers the question of why I'm getting 16MB/s using non-dma netdev/transport.
If a DMA engine is not available, then the data is moved by the CPU with good old fashioned memcpy(). Depending on the platform, memcpy() could be optimized to leverage possible vector registers that effectively allow more data to be moved per CPU operation versus just a simple 4 or 8 bytes per "store" operation. I don't recall what CPU you are using. Even for a CPU memcpy, 16 MB/s seems low, but I haven't measured it lately to know what a reasonable range is to expect.
Eric
|
memcpy can perform worse when dealing with uncached IO memory, as in this application. I remember, a long time ago, having abysmal performance because the kernel had optimize-for-size set, and memcpy was therefore copying one byte at a time, which meant one TLP on the PCI bus per byte. Not efficient.
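For example, a quick check of your kernel config (the path is the usual Ubuntu location):
grep -E 'CONFIG_CC_OPTIMIZE_FOR_(SIZE|PERFORMANCE)' /boot/config-$(uname -r)
|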
So would the solution be to modify switchtec-dma, to make it act like most current DMA engines? That seems like less hassle than writing a new netdev/transport. Our CPU is an i7-11700, and we are using the Switchtec development board with the ADP_EDGE adapters, so there is no DMA engine on the NICs. This dev board is being used until our HW NICs are built, but those NICs will just have a PFX on them. |
On Mon, Feb 21, 2022 at 9:00 AM accipitersystems-Turjanica ***@***.***> wrote:
So would the solution be to modify switchtec-dma? To then make it act like most current dma engines. That seems like the less hassle than writing new netdev/transport.
Our CPU is a i7-11700, and we are using the switchtec development board with the ADP_EDGE adapters, so no dma engine on the NICS. This dev board is being used until our HW NICs are built, but those NICs will just have a PFX on them.
Modifying switchtec-dma will not help because it is a hardware limitation. Rewriting ntb_netdev/ntb_transport will not help because the issue is generally in the Linux netdev stack, and rewriting that is not practical. Furthermore, the necessary changes to conform to the Switchtec DMA alignment requirements would likely hamper overall performance anyway.
BTW, I do NOT claim to be an expert on the intimate details of the Linux netdev stack. It may be possible that there is a knob that might force a data alignment on SKBuffs such that they could satisfy the Switchtec DMA hardware restrictions, but I'm not aware of what that knob is or whether one even exists.
Bottom line, I think you may be screwed. Since your server does not have built-in DMA engines, if all the stuff you're doing is PCIe Gen3, you can maybe consider the Dolphin PXH832 host bus adapter. You could probably enable the PLX DMA engines on that device and use those (the plx_dma driver is in Linux courtesy of Logan!). Sorry, the NTB stuff is cool and interesting, but to really derive the benefit, you need somebody to actually push the data down the wire!
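For example, you can confirm the driver is there before committing to hardware (assuming your distro kernel builds it as a module):
modinfo plx_dma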
Unless somebody else has some bright ideas!
Eric
|
So I have a 2-NT-partition setup and am trying to send data using ntb_transport/ntb_netdev. While running qperf, or even an FTP transfer, I am only getting around 15-16 MB/s.
I'm not sure what is causing such a slow data rate.
I'm on Ubuntu 20.04, kernel 5.13.
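For reference, this is roughly how I'm measuring (the peer address is a placeholder; qperf runs with no arguments on the other host):
qperf <peer_ip> tcp_bw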