Improve field count typical case performance #120
Conversation
The tightest upper bound one can specify on the number of fields in a struct is `sizeof(type) * CHAR_BIT`. So this was previously used when performing a binary search for the field count. This upper bound is extremely loose when considering a typical large struct, which is more likely to contain a relatively small number of relatively large fields rather than the other way around. The binary search range being multiple orders of magnitude larger than necessary wouldn't have been a significant issue if each test was cheap, but they're not. Testing a field count of N costs O(N) memory and time. As a result, the initial few steps of the binary search may be prohibitively expensive. The primary optimization introduced by these changes is to use unbounded binary search, a.k.a. exponential search, instead of the typically loosely bounded binary search. This produces a tight upper bound (within 2x) on the field count to then perform the binary search with. As an upside of this change, the compiler-specific limit placed on the upper bound on the field count to stay within compiler limits could be removed.
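For illustration, here is a minimal sketch of the search strategy only, not the actual Boost.PFR template machinery. The constexpr predicate `fits(n)` is a stand-in for the real "is the type aggregate-initializable from n arguments" check, which in the library costs O(n) compiler memory and time per instantiation; all names below are made up for the example.

```cpp
// Minimal sketch of the search strategy only -- not the Boost.PFR template
// machinery. `fits(n)` stands in for "is the type aggregate-initializable
// from n arguments"; in the library each such check costs O(n) compiler
// memory and time, which is why keeping the probed values small matters.
#include <cstddef>
#include <iostream>

constexpr std::size_t real_field_count = 5;  // hypothetical struct with 5 fields

constexpr bool fits(std::size_t n) { return n <= real_field_count; }

// Exponential search: keep doubling until the check fails; the failing value
// is an upper bound that is at most 2x the true field count.
constexpr std::size_t find_upper_bound(std::size_t n = 1) {
    return fits(n) ? find_upper_bound(n * 2) : n;
}

// Ordinary binary search inside the tight range produced above.
// Invariant: fits(lo) is true, fits(hi) is false.
constexpr std::size_t binary_search(std::size_t lo, std::size_t hi) {
    if (hi - lo <= 1) return lo;
    const std::size_t mid = lo + (hi - lo) / 2;
    return fits(mid) ? binary_search(mid, hi) : binary_search(lo, mid);
}

int main() {
    constexpr std::size_t upper = find_upper_bound();
    constexpr std::size_t count = binary_search(0, upper);  // fits(0) is trivially true
    std::cout << count << '\n';  // prints 5
}
```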
This issue was originally noticed when some source files in a large project seemed to be consuming a suspiciously large amount of memory (and also time). After some digging, the culprit was eventually nailed down as field count detection. Before these changes, compiling one such source file with clang 14 peaked at 1.5 GB of memory usage. After these changes, compiling the same file peaked at 617 MB. These numbers include the memory usage of compiling everything else as well, which is why the after-fix number is still relatively large. The memory usage attributable just to field counting probably went from something like 1 GB to something at least one or two orders of magnitude less.
The PR fails on msvc-14.1 with… I'd recommend trying another approach: just change the… Something like that:
In the last CI run, 15 tasks failed with a "compiler is out of heap space" error. With the jobs running in parallel, it's hard to determine which tasks failed due to their own excessive memory usage and which were well-behaved but victims of running while another task consumed all the available memory. I pushed a temporary commit that I believe should disable testing in parallel. Could you please approve the CI build? Hopefully this will give me enough info to be able to diagnose the real issue, after which I can revert the CI config change.
Regarding your suggestion, maybe I'm not understanding it correctly, but I don't see how it would help with performance. The crux of the performance issue is that the check of whether a type is constructible with N arguments costs O(N) memory and time. My proposed changes aim to minimize the overall cost by minimizing the sum of all N checked. They effectively replace the loosely bounded binary search with an exponential search that first establishes an upper bound within 2x of the true field count.
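As a rough illustration with hypothetical numbers (not measurements): for a 1024-byte struct with 8 fields, a binary search bounded by sizeof(type) * CHAR_BIT = 8192 probes roughly N = 4096, 2048, 1024, 512, ..., so the summed N is on the order of 8000 before the true count is pinned down. Exponential search instead probes N = 1, 2, 4, 8, 16 (sum 31) and then binary-searches inside (8, 16], for a summed N of roughly 60, about two orders of magnitude less work.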
The excessive compile time/memory usage issues that were causing failures have been fixed. The central issue was that the…
@apolukhin Is there anything more you'd like me to address? The AppVeyor build succeeds now, and does so roughly 5% faster than before despite there being 3 new tests.
Sorry, I've misread you the first time. So here's how I see your changes (please correct me if I'm wrong):

T is default-constructible: you do an exponential search for the upper bound of the field count. It takes log(fields_count) + 1 steps. After that you do a binary search that takes log(log(fields_count) + 1) steps. The final complexity is log(fields_count) + 1 + log(log(fields_count) + 1).

The current implementation starts the binary search from sizeof(type) * CHAR_BIT. However, that CHAR_BIT multiplication is not necessary - we do not need to know the exact count of bitfields. Instead, the binary search could work with sizeof(type) + 1, and if we get the maximum value, then the type has bitfields.

Here comes the math. Your algorithm is better when log(sizeof(type) + 1) > log(fields_count) + 1 + log(log(fields_count) + 1), where sizeof(type) is equal to fields_count * avg_field_size. It gives us fields_count * avg_field_size + 1 > fields_count * 2 * (log(fields_count) + 1), which is roughly avg_field_size > 2 * (log(fields_count) + 1).

For a fields count of 16 the avg_field_size should be about 10 to make your algorithm better. For a fields count of 256 the avg_field_size should be about 18 to make your algorithm better. For aggregates of ints, chronos, pointers or size_ts the existing algorithm performs better. For aggregates of strings and vectors your algorithm performs better than the existing one. I'd rather call it a tie. But the CHAR_BIT multiplication should be removed.

T is not default-constructible: your approach is definitely superior.

I'm worried about the cases when the whole type is not aggregate-initializable, because in that case I think your algorithm would run until the RAM is exhausted and no diagnostic would be provided. Probably that is the reason why the GitHub CI fails. I'm also worried about compiler idiosyncrasies. Not all the compilers are listed in CI, so I'd rather stick to the existing, well-tested algorithm if it does not make a noticeable difference. Here's the plan:
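For reference, the quoted thresholds follow directly from the simplified condition avg_field_size > 2 * (log2(fields_count) + 1): for 16 fields, 2 * (log2(16) + 1) = 2 * 5 = 10, and for 256 fields, 2 * (log2(256) + 1) = 2 * 9 = 18.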
Your understanding of the approach used in these changes is correct: exponential search followed by binary search. However, your performance analysis only counts "steps". Critically, it does not factor in the cost of checking at each step whether the type is constructible with N arguments, which is O(N) memory and time.

Factoring this into the default-constructible case, the exponential search worst case costs O(1 + 2 + 4 + ... + fields_count + 2 * fields_count) = O(4 * fields_count) to establish bounds separated only by a factor of 2. The best case costs O(1 + 2 + 4 + ... + fields_count + 1) = O(2 * fields_count).

Without exponential search and ignoring the possibility of bitfields, we start with a binary search over [0, sizeof(T)]. To reach bounds separated only by a factor of 2, in the best case of an average field size in [1, 2] bytes, this only requires one check costing O(sizeof(T) / 2). This ranges from O(field_count / 2) to O(field_count), which is 1/4 to 1/2 of exponential search's best case cost. However, the favorable comparison fades rather quickly. Once the average field size passes 4 bytes, reaching bounds separated only by a factor of 2 requires (at least) three checks costing O(sizeof(T) / 2 + sizeof(T) / 4 + sizeof(T) / 8) = O(7/8 * sizeof(T)). The field count must be less than sizeof(T) / 4, so in terms of field count, the cost is at least O(7/8 * 4 * field_count) = O(7/2 * field_count). This is already nearly the worst case cost of exponential search, and it only gets worse for larger average field sizes. In the worst case of a single field, the cost is O(sizeof(T) / 2 + sizeof(T) / 4 + sizeof(T) / 8 + ... + 1) = O(sizeof(T)). This is sizeof(T) / 4 times exponential search's worst case cost. The cost factor is nearly unbounded as it simply grows with sizeof(T).

Considering that the best case cost factor of using binary search only is 1/4 and the worst case cost factor is nearly unbounded, I believe that alone should be enough reason to use exponential search. Additionally, it seems that exponential search may be faster in "average" use, as evidenced by the AppVeyor builds with these changes being roughly 5% faster than builds without. And as a reminder, really poor performance cases with binary search aren't just a theoretical possibility that won't happen in practice. I went through the effort to make and propose these changes specifically because I ran into such a poorly performing case where these changes made a huge difference (#120 (comment)).

Regarding the latest CI failure, I haven't had much time to look into it yet. I was hoping that it wouldn't be necessary to do this because it feels unclean, but perhaps it's wise to add a sanity check to the exponential search to halt it if it somehow exceeds…
It should not take O(N). Please check that your compiler supports the builtin and that it is properly detected in https://github.com/boostorg/pfr/blob/28bd7f541f7a7632f4fe52e30d06a121e7cb1f65/include/boost/pfr/detail/make_integer_sequence.hpp
Even if the builtin is available and properly detected, checking whether a type is constructible with N arguments still costs O(N). Here's a demonstration of recent versions of clang, gcc, and msvc all running out of resources (whether self-limited or host-limited time or memory) trying to evaluate such a check with a large N.
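For context, the kind of check involved looks roughly like the following simplified sketch (not the exact Boost.PFR code, and the names are made up here). Testing "constructible from N arguments" forces the compiler to materialize an argument pack of size N, which is where the O(N) cost per probe comes from.

```cpp
// Simplified sketch of the detection idiom (not the exact Boost.PFR code).
// Testing whether T is aggregate-initializable from N arguments requires the
// compiler to build an argument pack of size N, so every probe costs O(N)
// compile time and memory, and a huge N can exhaust the compiler.
#include <cstddef>
#include <utility>

struct ubiq {
    template <class U>
    constexpr operator U() const noexcept;  // convertible to anything; never defined
};

template <class T, std::size_t... I,
          class = decltype(T{ (static_cast<void>(I), ubiq{})... })>
constexpr bool constructible_with_n_impl(std::index_sequence<I...>, long) { return true; }

template <class T, std::size_t... I>
constexpr bool constructible_with_n_impl(std::index_sequence<I...>, int) { return false; }

template <class T, std::size_t N>
constexpr bool constructible_with_n() {
    return constructible_with_n_impl<T>(std::make_index_sequence<N>{}, 1L);
}

struct point { int x; int y; int z; };

static_assert(constructible_with_n<point, 3>(), "three initializers work");
static_assert(!constructible_with_n<point, 4>(), "four initializers do not");
```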
This could happen for a type with a constructor accepting a parameter pack. This also prevents unbounded growth in case something goes wrong with the logic and the search should have already stopped (or never started).
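For example (an illustrative type, not one from the test suite), a variadic constructor makes the "constructible from N arguments" probe succeed for every N, so without a cap the search would never terminate on its own:

```cpp
// Illustrative only: a non-aggregate type whose variadic constructor accepts
// any number of arguments. A "constructible from N arguments" probe succeeds
// for every N, so a search that only stops on the first failure would keep
// doubling until compiler limits or memory are hit; an explicit cap (such as
// the sizeof-based bound discussed above) lets it stop and emit a diagnostic.
struct greedy {
    template <class... Args>
    greedy(Args&&...) {}
};
```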
I noticed that someone else (@rressi-at-globus) appears to have discovered these changes and found them useful enough to make their own attempt to get them upstreamed with #179. I resynchronized my branch with…
I re-ran the tests that I previously ran in #120 (comment) to once again demonstrate that the cost of
@apolukhin: Here are direct responses to your comment, #120 (comment). I apologize for not doing this earlier.
You were right, there was a problem. I added tests for a default-constructible type accepting 0 or more arguments and a non-default-constructible type accepting 1 or more arguments in e80f7aa. These tests should now pass with the changes from dd1ae1c.
It does make a noticeable difference for large types. Practical evidence:
Theoretical evidence:
As you're aware, the current implementation performs binary search with an upper bound of sizeof(type) * CHAR_BIT. The proposed implementation would perform binary search with an upper bound that is no more than twice the field count. This would not carry any unnecessary significant cost. In this regard, removing the multiplication by CHAR_BIT… There is however a preexisting case of concern: large types that are not aggregate-initializable in environments that do not provide…
I have exported this contribution as a patch for the vcpkg's port of this component. The patch is currently targeting boost-pfr version 1.85.0, and is under review here:
Still fails the tests
I'm looking at the output of the failed runs now, but there seems to be a lot of unstructured text to comb through... Any tips as to what to look at?
Thank you for your code review, @apolukhin. I have read all of your comments. I replied to the 3 comments where I wasn't sure how to proceed. All of the other comments make sense and I will address them ASAP.
Verified ok targeting both x86 and x86-64 for:
- clang 3.7.1
- clang 19.1.0
- gcc 8.2
- gcc 14.2
- msvc 19.20
- msvc 19.40
This reverts commit a6d0c5f.

Conflicts:
- include/boost/pfr/detail/fields_count.hpp
…econditions"

This reverts commit 096445f.

Conflicts:
- include/boost/pfr/detail/fields_count.hpp
Looks almost perfect!
Added a few comments
@@ -30,6 +31,12 @@ import std;

namespace boost { namespace pfr { namespace detail {

///////////////////// min without including <algorithm>
template <class T>
constexpr const T &min(const T &a, const T &b) {
Just remove all the references and pass std::size_t by value
The current signature and behavior match std::min (1). In a constant-evaluated context, which should be the case for all operations in this file, there will be no runtime difference. If you still want this to pass and return by value, let me know and I will make the change. But if doing so, I believe I should also un-template the function and make it strictly accept and return std::size_t so the function can't be accidentally invoked with some other type that may be incorrect to pass/return by value.
I have replied to your new comments and will await your responses. I have also pushed a simple fix for what seems to be the only remaining issue identified by CI: Address violation of Boost min/max guidelines.
Awesome work! Many thanks for it!
And thank you for hanging in there with me through the long process of ensuring that these changes meet the high standards expected of a widely-used library. Although I hope it never proves necessary, if any concerns are raised about this set of changes in the future, feel free to contact me for support.
* Start the upper bound fields search from `4` fields, to avoid a slow startup on typical workloads
* Inline the `fields_count_binary_search_unbounded` function to reduce template instantiation depth by 1
* Renamed `min` to `min_of_size_t` to avoid weird syntax
* Applied the idea of better error reporting from #120
* Do not start the fields count computation if one of the static asserts failed. That speeds up error reporting in edge cases
* Use the `std::*_t` versions of traits as they are faster in some implementations
* Rewrite the binary search to simplify it and to avoid degradation to linear search on types that have a constructor from a variadic pack
* Remove default template parameters to simplify code

As a result, the whole test suite now runs 10%-25% faster on MSVC, ~20% faster on Clang, and 7%-20% faster on GCC.
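A sketch of what the renamed helper presumably looks like (my guess from the description above, not the committed code), taking `std::size_t` by value as requested in review and avoiding both `<algorithm>` and the Windows min/max macro pitfalls covered by the Boost min/max guidelines:

```cpp
#include <cstddef>

namespace boost { namespace pfr { namespace detail {

// size_t-only minimum: no <algorithm> include, and a name that cannot
// collide with the min/max macros from <windows.h>.
constexpr std::size_t min_of_size_t(std::size_t a, std::size_t b) noexcept {
    return b < a ? b : a;
}

}}} // namespace boost::pfr::detail
```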