
Uncontrolled process memory size during test #59

Open
inducer opened this issue Feb 11, 2021 · 50 comments

@inducer
Owner

inducer commented Feb 11, 2021

Continued from https://gitlab.tiker.net/inducer/pytential/-/issues/131.

@inducer
Owner Author

inducer commented Feb 11, 2021

Unlike for mirgecom, jemalloc doesn't seem to work wonders for pytential. This is memory use according to memory-profiler for a sequential run of python -m pytest:

[plot: memory use over a sequential pytest run]

I'm glad it's enough to unbreak the CI (#56, #58), but the problem doesn't appear totally solved.

Using the machinery I made for mirgecom, I'll try to do a running total of CL buffer memory allocated and see if there's any growth there.

@alexfikl
Collaborator

alexfikl commented Feb 11, 2021

jemalloc seems to have some knobs to make it release memory more aggressively. There's a dirty_decay_ms and a muzzy_decay_ms. It should be easy to see what they do by setting

export MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000'

(the default is 10000, i.e. 10s; smaller values make it release memory sooner)
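To try it end to end (a sketch; the library path is illustrative and matches the one used later in this thread):

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so
export MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000'
python -m pytest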

Found this pretty useful too, but no idea how up to date it is.

@inducer
Owner Author

inducer commented Feb 12, 2021

Using this method for instrumenting memory allocation in pocl, I get the following plot of allocated memory (by pocl) over time:
[plot: pocl-allocated memory over time]

This seems to support the assertion that, as with mirgecom, the underlying cause of the growth in allocated memory might be memory fragmentation, with the allocator consequently requesting more and more memory from the OS. In particular, it appears that, at least from the point of view of OpenCL, all memory that is allocated is also freed.

For reproducibility:

PYOPENCL_TEST=port:pthread POCL_DEBUG=memory ~/shared/with-my-pocl python -m pytest  --tb=native -rxsw --capture=no . 2>out    

run with 48efe3b and pocl a2d016c8.

@inducer
Owner Author

inducer commented Feb 20, 2021

Interestingly, replacing pocl's allocator for global memory with jemalloc doesn't seem to do much:

[plot: memory use with jemalloc backing pocl's global-memory allocator]

If anything, it releases even less memory. Maybe @alexfikl is right, and this comes down to knob twiddling.

For reproducibility:

  • Jemalloc configured with
    ./autogen.sh --prefix=$HOME/shared/pack/jemalloc-build --with-jemalloc-prefix=jemalloc_ --disable-cxx --with-install-suffix=_prefixed --disable-initial-exec-tls
    
  • pocl/pocl@a2d016c with this patch
  • Run with
    PYOPENCL_TEST=port:pthread mprof run ~/shared/with-my-pocl python -m pytest --tb=native -rxsw   
    
    where with-my-pocl is this script.

@inducer
Owner Author

inducer commented Feb 20, 2021

As for things that are cheap to do, I discovered another allocator to try.

@inducer
Owner Author

inducer commented Feb 21, 2021

mimalloc (used globally) is not an improvement:
[plot: memory use with mimalloc]

@inducer
Owner Author

inducer commented Feb 22, 2021

Tried jemalloc with the flags given by @alexfikl:

MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000' PYOPENCL_TEST=port:pthread LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so mprof run python -m pytest --tb=native -rxsw 

Not really all that different from glibc or vanilla jemalloc:
[plot: memory use with the tuned jemalloc flags]

@alexfikl
Collaborator

Just to add more random data to this. I ran

mprof run --python -T 0.01 pytest test/test_cost_model.py test/test_scalar_int_eq.py test/test_global_qbx.py test/test_target_specific_qbx.py

This only runs 4 of the test files, not the whole suite, so it should probably be taken with a grain of salt.

[plot: mprof memory profiles of the four runs]

The different lines are

  • (black) is the default run.
  • (blue) should be the run with export MALLOC_TRIM_THRESHOLD_=1024 added (the glibc default is 128 * 1024).
  • (red) should be jemalloc.
  • (green) should be jemalloc with export MALLOC_CONF='abort:true,dirty_decay_ms:1000,muzzy_decay_ms:1000', i.e. a bit more aggressive than setting them to 5000.

It does seem like changing these options does not have a big effect vs. default jemalloc :\

@alexfikl
Collaborator

I've also tried adding

def teardown_function(func):
    # collect Python-level garbage first...
    import gc
    gc.collect()

    # ...then ask glibc to return freed heap memory to the OS
    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)

at the top of those tests to force freeing as much memory as possible, but it didn't seem to improve much. However, it does seem to deallocate memory towards the end there, so memory use may not continue growing as all the tests run.

[plot: mprof memory profiles of the four runs]

  • (black) is again the baseline.
  • (blue) is default jemalloc.
  • (red) is jemalloc with the snippet.
  • (green) is default malloc with the snippet.

@alexfikl
Collaborator

Went ahead and ran the whole test suite with the snippet from the previous comment added to all files.

def teardown_function(func):
    import gc
    gc.collect()

    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)
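As an aside, rather than copying the snippet into every file, an equivalent hook could live in a shared conftest.py as an autouse fixture. A minimal sketch (hypothetical, not what was run here):

import ctypes
import gc

import pytest


@pytest.fixture(autouse=True)
def trim_malloc():
    yield  # run the test first
    gc.collect()  # drop Python-side garbage
    ctypes.CDLL("libc.so.6").malloc_trim(0)  # return freed memory to the OS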

[plot: mprof memory profiles of the two runs]

  • (black) is a run with a cleared cache for code generation.
  • (blue) is just another run with cached code.

@inducer
Owner Author

inducer commented Feb 22, 2021

Thanks! One additional bit of information: The Python GC already gets run by pyopencl (I think) between tests.

@alexfikl
Collaborator

Thanks! One additional bit of information: The Python GC already gets run by pyopencl.

Ah, that explains why I didn't see much of a difference when removing the collect at some point!

@inducer
Owner Author

inducer commented Mar 25, 2021

Had an OpenCL-related theory for something that could be contributing: inducer/pyopencl#450

@inducer
Owner Author

inducer commented Jul 19, 2023

python test_layer_pot.py

with this diff

diff --git a/test/test_layer_pot.py b/test/test_layer_pot.py
index 409eede..a67bb78 100644
--- a/test/test_layer_pot.py
+++ b/test/test_layer_pot.py
@@ -614,15 +614,17 @@ def test_3d_jump_relations(actx_factory, relation, visualize=False):
 # }}}
 
 
+def run_em_all():
+    import gc
+    for i in range(10):
+        test_off_surface_eval(_acf, True)
+        gc.collect()
+
+
 # You can test individual routines by typing
 # $ python test_layer_pot.py 'test_routine()'
 
 if __name__ == "__main__":
-    import sys
-    if len(sys.argv) > 1:
-        exec(sys.argv[1])
-    else:
-        from pytest import main
-        main([__file__])
+    run_em_all()
 
 # vim: fdm=marker

seems to be a microcosm of this issue that runs in manageable time. Memory usage goes up in a straight line, when... there's no reason for that, IMO:

[plot: memory use of the test_off_surface_eval loop]

@inducer
Owner Author

inducer commented Jul 19, 2023

The allocations seem to scale with the size of the problem (in this case, set by changing npoints=3000 on line 116), indicating that it's at least some amount of CL memory that's going unfreed. Interestingly, some deallocations do turn up in the plot at that size, but there's still a big bunch of leakage:

[plot: memory use at the larger problem size]

@inducer
Owner Author

inducer commented Jul 19, 2023

I'm using the code in this diff to hunt the leaks. One (possibly minor?) contributor appears to be ProcessLogger, which spawns a bunch of threads whose resources appear never to be reaped.
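(That diff isn't reproduced here. As a rough, hypothetical sketch of the general approach, one can type-count the objects the GC can see after each iteration and diff against a baseline:)

import gc
from collections import Counter

def object_census():
    gc.collect()  # drop garbage so only survivors are counted
    return Counter(type(o).__name__ for o in gc.get_objects())

baseline = object_census()
for i in range(10):
    test_off_surface_eval(_acf, True)  # one iteration, as in run_em_all
    census = object_census()
    print((census - baseline).most_common(20))  # types whose counts grew
    baseline = census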

@inducer
Owner Author

inducer commented Jul 19, 2023

That was a reference cycle, but not a significant contributor: inducer/pytools#181.

@inducer
Owner Author

inducer commented Jul 19, 2023

Beyond that, it's mostly a bunch of pyopencl "stuff" that stays alive:

<class 'list'>, <class 'list'>, <class 'set'>, <class 'pyopencl._cl.Context'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'pyopencl.scan.GenericScanKernel'>, <class 'list'>, <class 'pyopencl._cl.Device'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltFinalUpdateKernelInfo'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl._cl.Platform'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'tuple'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'tuple'>, <class 'list'>, <class 'tuple'>, <class 'list'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, ...`

@inducer
Owner Author

inducer commented Jul 19, 2023

Hm. Calling pyopencl.tools.clear_first_arg_caches in each loop makes most of those go away (along with substantially flattening the mprof curve), so this is probably as expected. One thought is that running pytest in parallel may mean that pyopencl.tools.clear_first_arg_caches is not getting called as intended. (from here)
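Concretely, the run_em_all loop from the earlier diff, amended along those lines (a sketch):

def run_em_all():
    import gc
    import pyopencl.tools

    for i in range(10):
        test_off_surface_eval(_acf, True)
        pyopencl.tools.clear_first_arg_caches()  # drop context-keyed caches
        gc.collect()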

@inducer
Owner Author

inducer commented Jul 19, 2023

Irritatingly, even with zero growth of Python-side object allocation, the test_off_surface_eval loop still manages to lose about 20MB over 30 runs.

@inducer
Owner Author

inducer commented Jul 19, 2023

OK, it's flat after switching to jemalloc, so that's probably not our problem. Switching to the interaction of pytest with PyOpenCL's first-arg caches as the prime suspect.

@inducer
Owner Author

inducer commented Jul 19, 2023

Well. So I removed that return from the middle of test_off_surface_eval, and suddenly we're leaking a bunch of loopy stuff. This may just be the caches though, will keep digging.

@inducer
Owner Author

inducer commented Jul 20, 2023

Observations:

  • New LoopKernels show up in new_objects in each iteration. They're all for the FMM translations. Why are they there?
  • Does the array context's caching of einsum programs work?
  • Loopy's kernel executor mechanism is a tangle of circular references. It needs to be untangled, to avoid having it keep CL contexts alive. I wonder if it'd be better to introduce a separate object holding a reference to a CL context and a loopy kernel, and having that be callable. (A sketch of this idea follows this list.)
  • clear_first_arg_caches is getting called, from here.
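A hypothetical sketch of the idea from the third bullet above, i.e. a single small callable that owns the references to the context and the kernel, with no back-pointers to form cycles:

from dataclasses import dataclass

import loopy as lp
import pyopencl as cl


@dataclass(frozen=True)
class BoundKernel:
    # the only strong references live here; nothing points back, so no cycles
    context: cl.Context
    t_unit: lp.TranslationUnit

    def __call__(self, queue: cl.CommandQueue, **kwargs):
        # loopy executables are invoked with a queue as the first argument
        return self.t_unit(queue, **kwargs)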

@inducer
Owner Author

inducer commented Jul 21, 2023

inducer/loopy#797 (plus small changes across the stack) should make it so that loopy caches don't keep contexts alive.

@inducer
Owner Author

inducer commented Jul 21, 2023

[plot: memory use; blue with inducer/loopy#797, black before]

Somewhat depressingly, inducer/loopy#797 doesn't seem to have made a decisive difference. Blue is with the changes, black is before.

@inducer
Owner Author

inducer commented Jul 22, 2023

Been on the hunt for other caches that keep things alive. They contribute, but they're all small potatoes:

[plot: memory use after clearing additional caches]

@inducer
Owner Author

inducer commented Jul 22, 2023

I've also printed lists of Python objects that, according to gc, survive each test. Here's a flavor:

138590, 62 diff, 119 new
[(<class 'builtin_function_or_method'>, 16), (<class 'weakref.ReferenceType'>, 15), (<class 'list'>, 14), (<class 'dict'>, 14), (<class 'cell'>, 10), (<class 'tuple'>, 8), (<class 'generator'>, 6), (<class 'method'>, 5), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class '_pytest.reports.TestReport'>, 3), (<class 'list_reverseiterator'>, 2), (<class 'function'>, 2), (<class 'pytools.persistent_dict._LinkedList'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'set'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class '_io.StringIO'>, 1), (<class 'pytools.persistent_dict._LRUCache'>, 1), (<class 'contextlib._GeneratorContextManager'>, 1)]
.
138728, 138 diff, 195 new
[(<class 'function'>, 39), (<class 'tuple'>, 31), (<class 'dict'>, 25), (<class 'cell'>, 22), (<class 'list'>, 11), (<class 'method'>, 8), (<class 'property'>, 8), (<class 'generator'>, 6), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'pytools.persistent_dict._LRUCache'>, 4), (<class 'type'>, 4), (<class 'weakref.ReferenceType'>, 4), (<class 'getset_descriptor'>, 4), (<class '_pytest.reports.TestReport'>, 2), (<class 'set'>, 1), (<class 'builtin_function_or_method'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'warnings.WarningMessage'>, 1), (<class 'list_reverseiterator'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class '_io.StringIO'>, 1), (<class 'module'>, 1), (<class 'logging.Logger'>, 1), (<class '_frozen_importlib_external.SourceFileLoader'>, 1), (<class '_frozen_importlib.ModuleSpec'>, 1), (<class 'DeprecationWarning'>, 1)]
l_inf error: 0.0025745079426662495
.
138820, 92 diff, 157 new
[(<class 'list'>, 49), (<class 'tuple'>, 20), (<class 'dict'>, 15), (<class 'cell'>, 13), (<class 'method'>, 10), (<class 'pytools.persistent_dict._LinkedList'>, 7), (<class 'pytools.persistent_dict._LRUCache'>, 5), (<class 'pymbolic.primitives.Variable'>, 4), (<class 'function'>, 3), (<class 'pymbolic.primitives.Product'>, 3), (<class 'set'>, 2), (<class 'list_reverseiterator'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class 'generator'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'pymbolic.primitives.Call'>, 2), (<class 'pymbolic.primitives.Subscript'>, 2), (<class '_pytest.terminal.WarningReport'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'sumpy.expansion.LinearPDEBasedExpansionTermsWrangler'>, 1), (<class 'sumpy.expansion.CSEMatVecOperator'>, 1), (<class '_pytest.reports.TestReport'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class 'sumpy.kernel.LaplaceKernel'>, 1), (<class 'pymbolic.primitives.Quotient'>, 1), (<class 'pymbolic.primitives.Sum'>, 1)]
.
139330, 510 diff, 568 new
[(<class 'tuple'>, 83), (<class 'function'>, 79), (<class 'dict'>, 64), (<class 'weakref.ReferenceType'>, 40), (<class 'cell'>, 36), (<class 'builtin_function_or_method'>, 31), (<class 'list'>, 28), (<class 'getset_descriptor'>, 20), (<class 'wrapper_descriptor'>, 13), (<class 'method_descriptor'>, 13), (<class 'pymbolic.primitives.Variable'>, 12), (<class 'set'>, 11), (<class 'module'>, 10), (<class 'type'>, 10), (<class '_frozen_importlib.ModuleSpec'>, 10), (<class 'method'>, 9), (<class '_frozen_importlib_external.SourceFileLoader'>, 8), (<class 'frozenset'>, 7), (<class 'pyopencl.tools._SimpleTextTemplate'>, 6), (<class 'pytools.log_process'>, 6), (<class 'property'>, 5), (<class 'pytential.qbx.target_specific.impl.Enum'>, 5), (<class 'pytools.persistent_dict._LinkedList'>, 5), (<class 'pymbolic.primitives.Call'>, 4), (<class 'pymbolic.primitives.Subscript'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class 'generator'>, 3), (<class 'staticmethod'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3)]
.
139453, 123 diff, 178 new
[(<class 'builtin_function_or_method'>, 33), (<class 'weakref.ReferenceType'>, 32), (<class 'tuple'>, 18), (<class 'dict'>, 14), (<class 'list'>, 12), (<class 'method'>, 9), (<class 'pymbolic.primitives.Variable'>, 6), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class 'generator'>, 5), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class 'set'>, 3), (<class '_pytest.terminal.WarningReport'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3), (<class '_pytest.reports.TestReport'>, 2), (<class 'function'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class 'cell'>, 2), (<class 'pymbolic.primitives.Call'>, 2), (<class 'frozenset'>, 2), (<class 'pymbolic.primitives.Subscript'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'pymbolic.primitives.Sum'>, 1), (<class 'pymbolic.primitives.Product'>, 1), (<class 'list_reverseiterator'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1)]
.
139482, 29 diff, 94 new
[(<class 'list'>, 15), (<class 'dict'>, 11), (<class 'cell'>, 10), (<class 'method'>, 9), (<class 'tuple'>, 8), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class '_pytest.reports.TestReport'>, 3), (<class 'generator'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3), (<class 'list_reverseiterator'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'set'>, 1), (<class 'functools.partial'>, 1), (<class 'function'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1)]

I'm increasingly coming to the conclusion that these leaks might not be Python-visible---these objects are unlikely to explain hundreds of megabytes, I think.

@inducer
Owner Author

inducer commented Jul 22, 2023

Further confirmation of this is that tracemalloc (what a nice tool!) seems to see none of the offending allocations.
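For the record, the kind of check tracemalloc makes possible (a sketch; run_em_all is the loop from the earlier diff):

import tracemalloc

tracemalloc.start(25)  # record up to 25 frames per allocation
before = tracemalloc.take_snapshot()
run_em_all()
after = tracemalloc.take_snapshot()

# top Python-visible growth sites, if any
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)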

@isuruf
Collaborator

isuruf commented Jul 22, 2023

Can you try with SUMPY_FFT_BACKEND=loopy?

@inducer
Owner Author

inducer commented Jul 22, 2023

Thanks for following along! 😀

SUMPY_FFT_BACKEND=loopy takes longer, but otherwise doesn't seem to make a big difference (it's the black curve):

[plot: memory use with SUMPY_FFT_BACKEND=loopy (black curve)]

@alexfikl
Collaborator

alexfikl commented Jul 22, 2023

Does something like https://github.com/bloomberg/memray give additional useful information? I've never actually tried it.

EDIT: Possibly even https://github.com/bloomberg/pytest-memray?

@inducer
Owner Author

inducer commented Jul 22, 2023

I was trying https://github.com/plasma-umass/scalene just now, but it seemed to incur a fairly significant slowdown? Not sure. I guess I'll switch to memray for a bit?

@inducer
Owner Author

inducer commented Jul 22, 2023

I've already cut pytest out of the picture and written some code to manually call the tests, to make sure that's not at fault. (Yes, I'm that deep into conspiracy territory.) Memray just crashed initially, but that seems to be due to the Intel OpenCL ICD. If I disable it, things at least seem to run.

@inducer
Owner Author

inducer commented Jul 22, 2023

sigh It seems memray gets confused by GPUs and simply reports the 600 GiB of mapped GPU memory as memory consumption. 🤦 (Yes, yes, I'm rebuilding pocl without GPUs.)

@inducer
Owner Author

inducer commented Jul 22, 2023

[Nvm all this, see below]

🎉 Memray did something useful!

Here's the relevant flamegraph: https://gist.github.com/inducer/2aaa320fdf49b3d5af651cbc28d5ee4d

It seems like most of the leaks are in pocl, with vkfft contributing a measly 100 MB. I'll recompile pocl with debug to try and get more precise info.

@inducer
Owner Author

inducer commented Jul 23, 2023

[Nvm all this, see below]

Here's the updated flamegraph with Pocl in debug mode: https://gist.github.com/inducer/bfaeb3a85025f3e7a747cba930e2477e

One thing that's not clear: the bottom of most stack traces shows memray. Does that mean that memray's allocations dominate, and that is what we've learned? Or is the presence of those stack frames just an artifact that we should ignore?

@inducer
Owner Author

inducer commented Jul 23, 2023

[Nvm all this, see below]

Continuing on the assumption that the "memray intercept stuff" is not the actual source of memory consumption.

If that's the case, then that flamegraph implicates two specific mallocs in pocl:

Hmm... both of those bits of code look non-leaky to me, but both of them also return memory to a caller... so there's probably more going on.

@inducer
Owner Author

inducer commented Jul 23, 2023

[Nvm all this, see below]

The first of those sites also leaks when running https://gist.github.com/inducer/6c2a088eefeffd86cdd456e5aa244bec, a slightly modified PyOpenCL demo, on an empty cache, where it seems to leak 256 MB over the course of the runtime.

@inducer
Owner Author

inducer commented Jul 24, 2023

[Nvm all this, see below]

I've been staring at this for a while: both the PyOpenCL demo and the full-scale test with test_layer_pot.py show multiple instances of temp_buf escaping this function. I just can't for the life of me figure out how.

@inducer
Owner Author

inducer commented Jul 24, 2023

Never mind all that. TIL that memray flamegraph has a --leaks option that you need to use if you'd like to see leaks. 🤦 (along with PYTHONMALLOC=malloc when gathering)
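So the gather/render workflow that actually shows leaks looks something like this (output filename illustrative):

PYTHONMALLOC=malloc memray run -o run.bin test_layer_pot.py
memray flamegraph --leaks run.bin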

@inducer
Owner Author

inducer commented Jul 25, 2023

The latest from memray: There is no leak! Here's the flamegraph: https://ssl.tiker.net/nextcloud/s/ayQRNNcx8MBexkS

So maybe it's time to come to terms with the idea that the leak, if there is one, isn't visible to malloc? 🤷

@inducer
Owner Author

inducer commented Jul 27, 2023

#214 and the PRs linked therein are where this journey ends, for now.

@inducer
Owner Author

inducer commented Jul 28, 2023

Here's a multi-process memory profile, from mprof, made with:

PYOPENCL_TEST=port:cpu mprof run -M python -m pytest -rsxw --durations=10 -n 4

[plot: per-process memory use of the parallel pytest run]

It looks like it's mostly one subprocess/one test that hoovers up most of that memory...

@inducer
Owner Author

inducer commented Jul 28, 2023

So... maybe all that leak-hunting was not as useless as I had feared. Here's an mprof of a single-process run:

 PYOPENCL_TEST=port:cpu mprof run --python pytest -rsxw --durations=10

[plot: memory use of the single-process run]

Noticeably, the one 12G bump still exists.

@inducer
Owner Author

inducer commented Jul 29, 2023

It looks like that 12G thing is test_skeletonize_by_proxy_convergence. Memray told me. 🎉

Flamegraph: https://ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa

@alexfikl
Collaborator

It looks like that 12G thing is test_skeletonize_by_proxy_convergence. Memray told me. 🎉

Flamegraph: ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa

Well that's quite impressive. I'll try to look at it over the weekend. It definitely should not do that :\

@alexfikl
Collaborator

alexfikl commented Jul 29, 2023

Hm, tried looking only at the skeletonization tests with

mprof run python -m pytest -v -s -m 'not slowtest' test/test_linalg_skeletonization.py

and there's definitely a big bump in memory due to that test. It's mainly due to constructing the full dense matrix here (a slightly bigger one than the one that is in test_matrix.py):

mat = MatrixBuilder(
    actx,
    dep_expr=sym_u,
    other_dep_exprs=[],
    dep_discr=density_discr,
    places=places,
    context=case.knl_concrete_kwargs,
    _weighted=wrangler.weighted_sources)(expr)

Locally it only bumps up to about 2GB, not 12GB, so I'm not sure what that's about. However, that matrix construction makes up most of the memory use, and the error computation (which uses that matrix to compute a bunch of block-wise matrix norms and a big SVD) makes up most of the runtime. I'll continue digging into it.

[plot: memory use of the skeletonization tests]

(blue is local and black is koelsch)

@alexfikl
Collaborator

@inducer What did you run to get the results from #59 (comment)? I tried running on koelsch and it looks intimidatingly reasonable:

[plot: memory use on koelsch]

From the command on the plot, it seems like you're also running the slowtest tests? There are some 3D tests in the skeletonization suite that use ALL the memory, so maybe those are showing up? They shouldn't actually be run on the CI, though.

@inducer
Owner Author

inducer commented Jul 30, 2023

I did not use -k 'not slowtest'. 🤔 Have you pulled PyOpenCL? That's the only difference between CI and these runs that I can think of, though I don't think inducer/pyopencl@9931e14 would make all that difference (between the CIs, which are presumably dying because of excessive memory allocation, and your run).

Also: let me rerun with the slow tests excluded.

@alexfikl
Collaborator

alexfikl commented Jul 30, 2023

I did not use -k 'not slowtest'. 🤔 Have you pulled PyOpenCL? That's the only difference between CI and these runs that I can think of, though I don't think inducer/pyopencl@9931e14 would make all that difference.

Yeah, all of my runs have been on the latest git across the whole stack, with a mamba update --all from a few days ago.

EDIT: And I think I cleared all the disk caches last week too.
