Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

excess CPU time taken (Chrome, Brave) #112

Closed
ctm opened this issue Mar 22, 2020 · 10 comments
Closed

excess CPU time taken (Chrome, Brave) #112

ctm opened this issue Mar 22, 2020 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@ctm
Copy link
Owner

ctm commented Mar 22, 2020

Grizz was having a lot of trouble with his client yesterday, where it just got laggy and eventually he had to "reboot". I don't know what he meant by that and will need to seek clarification, but at minimum I believe he restarted his browser and there's.a chance he restarted his OS.

I didn't have that problem, per-se, but there have been a few times where I've been playing and my fan has come on and sure enough Activity Monitor shows a Brave process running at 100% of CPU. In every case when I've force-quit that process, my GMail tab instantly shows that it died. However, I haven't had this problem with GMail when I haven't been playing games on mb2, so I suspect Grizz's issue and mine are essentially the same ones (Brave is a Chrome variant).

What makes this especially weird is that this morning when I got up I had a GMail process consuming a bunch of CPU time and not only did I not have any window/tab running the mb2 client, I hadn't had one up in hours (since I had been sleeping), although my computer had been sleeping too.

I don't recall this ever happening back when we were doing Sunday 2pm play-testing. Since then I've upgraded to newer versions of Yew and I've also moved from std-web to web-sys. I suspect that a change to my stack is responsible for this, but right now I don't even know for sure that it's even mb2 related (although Grizz's issue is mb2 related!).

This one is going to be a PITA to track down, however if it is a problem with my stack then it's quite possible someone else will hit it and it will get resolved all by itself. If not, even though it will be a PITA, I believe that will just be because it will be wall-clock time consuming, but not that it will require a lot of mental work, per-se. Either way, although this is a huge issue, it's not yet a huge priority.

@ctm ctm added the bug Something isn't working label Mar 22, 2020
@ctm ctm self-assigned this Mar 22, 2020
@ctm
Copy link
Owner Author

ctm commented Mar 22, 2020

There's a small chance this is relevant. I hesitate to even mention it, but the involvement of WebSockets has me perking up my ears, but the age of that ticket tells me to just keep it in mind until I know more.

@ctm
Copy link
Owner Author

ctm commented Mar 22, 2020

GrizzB: Version 80.0.3987.149 (Official Build) (32-bit)
deadhead: Thanks!
GrizzB: Windows 10 64 bit
GrizzB: Wehn it happens first there is lag getting echoback or any response
deadhead: How much memory?
deadhead: 32-bit is interesting ...
GrizzB: gets progressively worse - completely locking the interface
deadhead: what do you do to get things back?
deadhead: close the window?
GrizzB: could be running out of memory only 8 gig and I keep too much open
deadhead: quit the browser?
GrizzB: kill the game window oftern works
GrizzB: sometimes have to log back in
GrizzB: don;t think i ever hav to kill the browser
GrizzB: is a thinkpad with ssd

@ctm
Copy link
Owner Author

ctm commented Mar 22, 2020

Even though this is ostensibly about CPU, I just played a tourney and used Chrome's Task Manager to monitor memory and sure enough, we have a leak. Additionally, right at the end, I got a panic similar to what I reported in #105 but it was right as one of Chrome's fields ticked over from 1.9 GB to 2.0 GB. WASM is 32 bits and it's not inconceivable that 2 GB is a magic number.

So we may have three separate bugs, one with excess CPU consumption, one at the end of tourneys and one being a memory leak, but the memory leak is pretty much guaranteed to make the game an unpleasant experience for everyone, so that's what I'll be concentrating on.

For the record, although I forgot to copy his info verbatim, Quick was using a 32 GB water cooled machine (Alienware, I think) and was using the latest Chrome and he was getting behavior similar to Grizz, where he started getting a lot of lag. I had no perceptible lag up until the panic. However, I'm running MacOS, Quick and Grizz are running Windows.

Quick also mentioned a different memory stat:

Quick: committed mem went up from 4 to 11GB
Quick: it is all getting consumed by the game popup
Quick: killed the game popup and used mem went back to what it was before

I wasn't running Activity Monitor, so I don't know if I had similar memory bloat.

All of this is trivially reproducible, although it takes time when we play for real. If I knew my web development tools better there's a good chance it handed us the answer on a platter. However, I can read up on the available tools and I can also write some test code.

@ctm
Copy link
Owner Author

ctm commented Mar 22, 2020

Oh, and I have a hunch that the memory leak has to do with the addition of text to my view, because the consumption appeared to be accelerating. At the time of the crash, the hand-history had 1,358 lines in it. I don't keep track of how many lines are generated per action, but my guess is the average would be about 5. As it is, I saved a copy of the buffer (currently in /tmp) and could even use it as a test case (not that it holds anything special that Lorem Ipsum doesn't).

@ctm
Copy link
Owner Author

ctm commented Mar 23, 2020

My hunch was wrong. I wrote a little test program that sends a bunch of lines to a div and there's essentially no memory growth. I then tried routing them through a websocket echo server and no memory loss. FWIW, in my real code I'm using serde_cbor, but it doesn't have any unsafe code, so I don't think that's an issue.

At this point, my inability to get geiger to work with wasm32-none-none is a little annoying because that might help me.

It can't be user input, because I was observing a table after I bust and the memory was climbing as Grizz repeatedly raised the two remaining players on vacation. At that point all I'm doing is receiving data from the socket, deserializing it, translating it to text, putting it in the window and scrolling.

Translating it to text is done without any unsafe code. I guess scrolling could be it, although that seems exceedingly unlikely.

Wait. Now that I think about it, the way I scroll could actually be it. I wouldn't think so, but who knows and it's easy to test.

@ctm
Copy link
Owner Author

ctm commented Mar 23, 2020

There are a ton of little differences between my sample program that doesn't leak and my table code. Each one seems unlikely, but something is causing the leak. In no particular order, here are the differences I'm finding by doing a cursory examination of my table code:

  • CborMinimal
  • it's running in a pop-up
  • there are delays between when the packets are sent
  • the server is actix-web (vs. tungstenite)
  • send_binary vs. send
  • wss vs. ws
  • log is its own method
  • log has the indentation hacks
  • log has the non-breaking space hack
  • view_data is its own method

Of the above, send_binary, wss and actix-web seem like the most likely. It should be trivial to not use wss and to not use send_binary and CborMinimal.

I think my next step should be to temporarily put my little test program aside and just make a branch of mb2 where I hack up the lobby and the server so that as soon as a connection is made the server sends a bunch of lobby messages and we display them. I haven't seen memory issues in the lobby, but the lobby really doesn't get many messages compared to a game.

Once that's going, we can go back from wss to ws, Json from CBorMinimal, and text vs. binary. OTOH, if I can't get the lobby leaking memory, I'll need to explore the possibility that the table being in a pop-up matters. One thing that is a little strange is that Chrome's Task Manager doesn't show the table as a separate entity, like it does different tabs.

Anyway, if I go any further tonight I won't be able to easily able to stop, so I'll do all of the above first thing tomorrow.

@ctm
Copy link
Owner Author

ctm commented Mar 23, 2020

I couldn't sleep, so I hacked away and I have a modified lobby that leaks memory, but it no longer does most of the things it used to. However, it still:

  • uses actix-web instead of tungstenite
  • sends encoded (albeit by JSON, not CBOR) packets

I modified my toy program to be very much like my modified lobby/server in that it simply sends one command to the "echo server", an then the echo server then echos it back 1,500 times.

I also tried commenting out the code in the lobby that pushd the line to my log_data and when I did, there was no memory leak, but when I made it so the actix-web server only sent back one chat line after receiving the log in but made it so that when we got that chat line we pushed it to log_data 1,500 times, we also didn't have a memory leak. So it's not just the log_data having 1,500 lines in it (which I didn't think was the case).

There may be other differences other than the two bullet items above, but I can probably test each of them without too much difficulty, although I'm pretty tired right now.

@ctm
Copy link
Owner Author

ctm commented Mar 23, 2020

Ugh. Turns out it was wee-alloc. That was a difference, but I didn't notice it until I went to remove yew-router from Cargo.toml, not because I was suspicious of yew-router, but because I was running out of differences.

My guess is I can make my tiny self-contained example show this problem and then I can make it even more minimal (for example, I doubt that it's necessary to bounce messages through a WebSocket, but who knows, wee_alloc must work in most cases, or someone else would have already discovered this issue).

However, before I do anything else, I'm going to get a new client out.

@ctm
Copy link
Owner Author

ctm commented Mar 23, 2020

Deployed and slightly tested on https://devctm.com:8081

However, simply adding wee_alloc to my small test program did not invoke the leak, so there's something else necessary as well. I have a hunch. I'll keep this issue open until I figure out what that other piece is.

@ctm
Copy link
Owner Author

ctm commented Mar 23, 2020

I created yew_wee_alloc_leak_demo and reported it.

@ctm ctm closed this as completed Mar 23, 2020
@ctm ctm mentioned this issue Oct 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant