Perf[BMQ,MQB]: stable GC history preventing allocations #498

Open · 678098 wants to merge 1 commit into main from 241103_queue_history_GC_opt
Conversation

Collaborator

@678098 678098 commented Nov 3, 2024

Problem overview

Consider a cluster processing 100k PUT msgs/s at its limit, meaning it can barely handle this rate of PUTs without NACKs, and producers generate exactly this rate of messages.
Consider a rollover or any other lag causing the QUEUE dispatcher thread to stall for 100 ms.
After this time, there will be 10k pending PUT messages waiting in the QUEUE dispatcher queue:
[Put1]...[Put10000]

Now consider a GC call executed before this block of PUTs. It cleaned some items from the queue history, but not all of them, and another GC call was scheduled to the QUEUE dispatcher:
[Put1]...[Put10000][gcHistory]

To execute this gcHistory, we first need to process all the PUTs ahead of it (they are earlier in the queue).
So we process 10k PUTs and add them to the queue history. Since in this scenario producers generate 100k msgs/s and the cluster can process 100k msgs/s and no more, by the time we have processed Put1-Put10000, there will be another 10k PUTs in the queue:
[gcHistory][Put10001]...[Put20000]

So when we finally execute gcHistory, we remove at most 1k messages from the history (k_GC_MESSAGES_BATCH_SIZE = 1000). But in the same interval we add 10k new items to the queue history and receive another 10k PUTs into the QUEUE dispatcher queue, delaying the next gcHistory.

This process continues indefinitely. We constantly allocate new items in the queue history items pool, and we eventually upsize the number of buckets in the ordered hashmap, causing a performance lag and a full rehash of the hashtable.
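
To make the failure mode concrete, below is a minimal, self-contained C++ sketch of a batched, scheduled history GC of this kind. All names here (QueueHistorySketch, d_enqueueToDispatcher, and so on) are hypothetical illustrations, not the actual BlazingMQ code.

#include <deque>
#include <functional>

// Hypothetical sketch of a batched history GC running on the QUEUE
// dispatcher thread.  It erases at most k_GC_MESSAGES_BATCH_SIZE expired
// items per invocation; if more remain, the follow-up call is enqueued
// behind every PUT already waiting in the dispatcher queue.
struct QueueHistorySketch {
    static const int      k_GC_MESSAGES_BATCH_SIZE = 1000;

    std::deque<long long> d_history;      // insertion times, oldest first
    long long             d_retentionNs;  // how long items must be kept

    // Stand-in for enqueueing a callback to the QUEUE dispatcher queue.
    std::function<void(std::function<void()>)> d_enqueueToDispatcher;

    void gcHistory(long long nowNs)
    {
        int erased = 0;
        while (erased < k_GC_MESSAGES_BATCH_SIZE && !d_history.empty() &&
               nowNs - d_history.front() > d_retentionNs) {
            d_history.pop_front();
            ++erased;
        }
        if (erased == k_GC_MESSAGES_BATCH_SIZE) {
            // Batch limit reached with expired items still left.  The
            // follow-up GC lands behind the ~10k queued PUTs, each of which
            // adds one more item to the history before it runs.
            d_enqueueToDispatcher([this, nowNs]() { gcHistory(nowNs); });
        }
    }
};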

20M items in the queue history after a 5-minute run at 80k msgs/s:
[screenshot: queue history metrics after the 5-minute run]

3GB allocated for queue history (Handles):

:::::::::: :::::::::: ALLOCATORS >> Last snapshot was 1.00 m ago.
                                 | Bytes Allocated| -delta-| Max Bytes Allocated| Allocations| -delta-| Deallocations| -delta-
---------------------------------+----------------+--------+--------------------+------------+--------+--------------+--------
allocators                       |   7,424,269,408|  86,800|       7,424,336,192| 106,779,796| 315,271|   103,623,032| 315,254
  *direct*                       |           1,584|        |               1,584|          10|        |             1|        
  Application                    |   7,423,728,624|  87,632|       7,423,748,064|  84,002,540| 243,085|    80,846,206| 243,063
    *direct*                     |           9,600|        |              11,440|       1,570|        |         1,544|        
    ClusterCatalog               |   3,426,265,408|     528|       3,426,279,424|  17,366,654|  39,631|    17,365,735|  39,630
      *direct*                   |           3,776|        |               3,776|          21|        |             3|        
      cl6_dc3                    |   3,426,261,632|     528|       3,426,275,648|  17,366,633|  39,631|    17,365,732|  39,630
        *direct*                 |       1,019,184|        |           1,056,640|   4,229,079|        |     4,228,883|        
        StorageManager           |   3,382,069,840|     528|       3,382,083,680|  13,137,171|  39,631|    13,136,714|  39,630
          *direct*               |           7,920|        |               7,920|          41|        |            12|        
          Partition1             |   3,361,992,272|     528|       3,362,006,112|  13,136,758|  39,631|    13,136,583|  39,630
            *direct*             |           5,824|     528|              19,664|  13,136,547|  39,631|    13,136,522|  39,630
            QueueStorage         |   3,163,421,248|        |       3,163,421,408|         139|        |            40|        
              *direct*           |           2,096|        |               2,256|          15|        |             4|        
              Handles            |   3,059,921,280|        |       3,059,921,280|          63|        |            20|     

Alternative 1

Set k_GC_MESSAGES_BATCH_SIZE = 1000000:
[screenshot: queue history metrics with the increased batch size]

Alternative 2

Inserting into the history is the only operation that can increase its size, and if we know that GC is required, we can enforce GC on any insert operation (a sketch in code follows the list below). The workflow is:

  • We execute the scheduled gcHistory and find that we cleaned 1000 items from the history, but more remain. We set the d_requireGC flag on the history.
  • Next, a PUT event arrives. We insert into the history on PUT. insert checks the d_requireGC flag and executes GC for another batch of items.
  • This continues for one or more PUTs, but eventually the GC finishes because there are no more items to collect. We unset the d_requireGC flag, so the following PUTs do not trigger GC.
  • The situation stays this way until the next scheduler call possibly sets d_requireGC again.
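
A minimal C++ sketch of this workflow, again with hypothetical names: the real history is an ordered hashmap rather than a deque, and the insert-time batch size shown here is an arbitrary placeholder, not a value from the actual change.

#include <deque>

// Hypothetical sketch of Alternative 2: insert() amortizes the GC over
// incoming PUTs whenever the scheduled gcHistory could not finish its work
// and left d_requireGC set.
struct QueueHistorySketch {
    static const int      k_INSERT_GC_MESSAGES_BATCH_SIZE = 10;

    std::deque<long long> d_history;      // insertion times, oldest first
    long long             d_retentionNs;  // how long items must be kept
    bool                  d_requireGC;    // set by the scheduled gcHistory

    void insert(long long nowNs)
    {
        if (d_requireGC) {
            int erased = 0;
            while (erased < k_INSERT_GC_MESSAGES_BATCH_SIZE &&
                   !d_history.empty() &&
                   nowNs - d_history.front() > d_retentionNs) {
                d_history.pop_front();
                ++erased;
            }
            if (erased < k_INSERT_GC_MESSAGES_BATCH_SIZE) {
                // Ran out of expired items: stop paying the GC cost on PUTs
                // until the scheduler flags the history again.
                d_requireGC = false;
            }
        }
        d_history.push_back(nowNs);  // the PUT itself adds one history item
    }
};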

So we pay some CPU time on PUT operations to do the GC, but in return:

  • We save CPU time on allocating elements in the history items pool.
  • We save CPU on rehashing the history hashmap when it grows to a bigger scale.
  • We occupy less memory, and memory growth is no longer uncontrolled.
  • We save some CPU on enqueuing the gcHistory callback to the same dispatcher queue.

[screenshot: queue history metrics with the proposed change]

Comparison

An 80k msgs/s priority stress test shows the same results as a 90k msgs/s test with the proposed change. This means that with this change we can actually hold up better at higher throughputs.

[screenshots: priority stress test results]

@678098 678098 requested a review from a team as a code owner November 3, 2024 02:18
@678098 678098 force-pushed the 241103_queue_history_GC_opt branch from 992dd4c to adcdc17 Compare November 3, 2024 02:20

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 354 of commit adcdc17 has completed with FAILURE

@678098 678098 force-pushed the 241103_queue_history_GC_opt branch from adcdc17 to 310abe9 Compare November 3, 2024 04:06
@678098 678098 force-pushed the 241103_queue_history_GC_opt branch from 310abe9 to 1196b50 Compare November 3, 2024 22:21

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 355 of commit 1196b50 has completed with FAILURE

@chrisbeard
Contributor

Before, performing gc on history was always sort of best-effort and done in the background. I think I agree on approach 1 not being ideal: it doesn't get at the core of the problem, which is the mismatched pacing of PUTs and gc; gc is still too periodic. Approach 2 is interesting; I hadn't considered moving gc into the critical path.

With approach 2, if (and only if) background gc can't keep the history limited to the configured time, the broker starts doing gc on the critical data path when processing incoming PUTs. When doing that gc, k_INSERT_GC_MESSAGES_BATCH_SIZE controls how aggressive the gc is, and since a PUT is only one message, any value > 1 will eventually reduce history to the desired size?

I'm curious whether you've got any measurements of the gc time; do we have a rough number for how long it takes? I suspect it's pretty cheap. It'd be interesting to see whether a smaller gc batch on PUTs helps at all (in terms of latency, maybe).

@678098
Collaborator Author

678098 commented Nov 5, 2024

I hadn't considered moving gc into the critical path.

Actually, we already do gc from the same QUEUE dispatcher thread, just more rarely.

and since a PUT is only one message, any value > 1 will eventually reduce history to the desired size?

That is correct. However, there should also be some overhead from calling gc, making some initial checks, entering the loop, etc. I need to check whether it's inlined or not.
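
(To spell out the arithmetic: while expired items remain, each PUT inserts exactly one item into the history and the insert-time GC removes up to k_INSERT_GC_MESSAGES_BATCH_SIZE of them, so the net change per PUT is at most 1 - k_INSERT_GC_MESSAGES_BATCH_SIZE; any batch size of 2 or more therefore shrinks the backlog by at least one item per PUT.)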

I'm curious whether you've got any measurements of the gc time; do we have a rough number for how long it takes? I suspect it's pretty cheap.

Yes, I think it is cheap, but I will also add benchmarks.
