
Uncontrolled process memory size during test #59

Open
inducer opened this issue Feb 11, 2021 · 50 comments

@inducer
Owner

inducer commented Feb 11, 2021

Continued from https://gitlab.tiker.net/inducer/pytential/-/issues/131.

@inducer
Owner Author

inducer commented Feb 11, 2021

Unlike for mirgecom, jemalloc doesn't seem to work wonders for pytential. This is memory use according to memory-profiler for a sequential run of python -m pytest:

[plot: memory use over a sequential pytest run]

I'm glad it's enough to unbreak the CI (#56, #58), but the problem doesn't appear totally solved.

Using the machinery I made for mirgecom, I'll try to do a running total of CL buffer memory allocated and see if there's any growth there.

@alexfikl
Collaborator

alexfikl commented Feb 11, 2021

jemalloc seems to have some knobs to make it release memory more aggressively. There's a dirty_decay_ms and a muzzy_decay_ms. It should be easy to see what they do by setting

export MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000'

(the default is 10000, i.e. 10s; smaller values make it release memory sooner)
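To try it end to end (a sketch; the library path is illustrative and matches the one used later in this thread):

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so
export MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000'
python -m pytest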

Found this pretty useful too, but no idea how up to date it is.

@inducer
Owner Author

inducer commented Feb 12, 2021

Using this method for instrumenting memory allocation in pocl, I get the following plot of allocated memory (by pocl) over time:
[plot: pocl-allocated memory over time]

This seems to support the assertion that, as with mirgecom, the underlying cause of the growth in allocated memory might be memory fragmentation, with the allocator consequently requesting more and more memory from the OS. In particular, it appears that, at least from the point of view of OpenCL, all memory that is allocated is also freed.

For reproducibility:

PYOPENCL_TEST=port:pthread POCL_DEBUG=memory ~/shared/with-my-pocl python -m pytest  --tb=native -rxsw --capture=no . 2>out    

run with 48efe3b and pocl a2d016c8.

@inducer
Owner Author

inducer commented Feb 20, 2021

Interestingly, replacing pocl's allocator for global memory with jemalloc doesn't seem to do much:

[plot: memory use with jemalloc backing pocl's global-memory allocator]

If anything, it releases even less memory. Maybe @alexfikl is right, and this comes down to knob twiddling.

For reproducibility:

  • Jemalloc configured with
    ./autogen.sh --prefix=$HOME/shared/pack/jemalloc-build --with-jemalloc-prefix=jemalloc_ --disable-cxx --with-install-suffix=_prefixed --disable-initial-exec-tls
    
  • pocl/pocl@a2d016c with this patch
  • Run with
    PYOPENCL_TEST=port:pthread mprof run ~/shared/with-my-pocl python -m pytest --tb=native -rxsw   
    
    where with-my-pocl is this script.

@inducer
Owner Author

inducer commented Feb 20, 2021

As for things that are cheap to do, I discovered another allocator to try.

@inducer
Owner Author

inducer commented Feb 21, 2021

mimalloc (used globally) is not an improvement:
[plot: memory use with mimalloc]

@inducer
Owner Author

inducer commented Feb 22, 2021

Tried jemalloc with the flags given by @alexfikl:

MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000' PYOPENCL_TEST=port:pthread LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so mprof run python -m pytest --tb=native -rxsw 

Not really all that different from glibc or vanilla jemalloc:
[plot: memory use with the tuned jemalloc flags]

@alexfikl
Collaborator

Just to add more random data to this. I ran

mprof run --python -T 0.01 pytest test/test_cost_model.py test/test_scalar_int_eq.py test/test_global_qbx.py test/test_target_specific_qbx.py

This only runs 4 of the test files, not the whole suite, so it should probably be taken with a grain of salt.

[plot: mprof memory profiles of the four runs]

The different lines are

  • (black) is the default run.
  • (blue) should be the run with export MALLOC_TRIM_THRESHOLD_=1024 added (the glibc default is 128 * 1024).
  • (red) should be jemalloc.
  • (green) should be jemalloc with export MALLOC_CONF='abort:true,dirty_decay_ms:1000,muzzy_decay_ms:1000', i.e. a bit more aggressive than setting them to 5000.

It does seem like changing these options does not have a big effect vs. default jemalloc :\

@alexfikl
Collaborator

I've also tried adding

def teardown_function(func):
    # collect Python-level garbage first...
    import gc
    gc.collect()

    # ...then ask glibc to return freed heap memory to the OS
    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)

at the top of those tests to force freeing as much memory as possible, but it didn't seem to improve much. However, it does seem to deallocate memory towards the end there, so memory use may not continue growing as all the tests run.

[plot: mprof memory profiles of the four runs]

  • (black) is again the baseline.
  • (blue) is default jemalloc.
  • (red) is jemalloc with the snippet.
  • (green) is default malloc with the snippet.

@alexfikl
Collaborator

Went ahead and ran the whole test suite with the snippet from the previous comment added to all files.

def teardown_function(func):
    import gc
    gc.collect()

    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)
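As an aside, rather than copying the snippet into every file, an equivalent hook could live in a shared conftest.py as an autouse fixture. A minimal sketch (hypothetical, not what was run here):

import ctypes
import gc

import pytest


@pytest.fixture(autouse=True)
def trim_malloc():
    yield  # run the test first
    gc.collect()  # drop Python-side garbage
    ctypes.CDLL("libc.so.6").malloc_trim(0)  # return freed memory to the OS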

[plot: mprof memory profiles of the two runs]

  • (black) is a run with a cleared cache for code generation.
  • (blue) is just another run with cached code.

@inducer
Owner Author

inducer commented Feb 22, 2021

Thanks! One additional bit of information: The Python GC already gets run by pyopencl (I think) between tests.

@alexfikl
Collaborator

Thanks! One additional bit of information: The Python GC already gets run by pyopencl.

Ah, that explains why I didn't see much of a difference when removing the collect at some point!

@inducer
Owner Author

inducer commented Mar 25, 2021

Had an OpenCL-related theory for something that could be contributing: inducer/pyopencl#450

@inducer
Owner Author

inducer commented Jul 19, 2023

python test_layer_pot.py

with this diff

diff --git a/test/test_layer_pot.py b/test/test_layer_pot.py
index 409eede..a67bb78 100644
--- a/test/test_layer_pot.py
+++ b/test/test_layer_pot.py
@@ -614,15 +614,17 @@ def test_3d_jump_relations(actx_factory, relation, visualize=False):
 # }}}
 
 
+def run_em_all():
+    import gc
+    for i in range(10):
+        test_off_surface_eval(_acf, True)
+        gc.collect()
+
+
 # You can test individual routines by typing
 # $ python test_layer_pot.py 'test_routine()'
 
 if __name__ == "__main__":
-    import sys
-    if len(sys.argv) > 1:
-        exec(sys.argv[1])
-    else:
-        from pytest import main
-        main([__file__])
+    run_em_all()
 
 # vim: fdm=marker

seems to be a microcosm of this issue that runs in manageable time. Memory usage goes up in a straight line, when... there's no reason for that, IMO:

[plot: memory use of the test_off_surface_eval loop]

@inducer
Owner Author

inducer commented Jul 19, 2023

The allocations seem to scale with the size of the problem (in this case, set by changing npoints=3000 on line 116), indicating that it's at least some amount of CL memory that's going unfreed. Interestingly, some deallocations do turn up in the plot at that size, but there's still a big bunch of leakage:

[plot: memory use at the larger problem size]

@inducer
Owner Author

inducer commented Jul 19, 2023

I'm using the code in this diff to hunt the leaks. One (possibly minor?) contributor appears to be ProcessLogger, which spawns a bunch of threads whose resources appear never to be reaped.
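(That diff isn't reproduced here. As a rough, hypothetical sketch of the general approach, one can type-count the objects the GC can see after each iteration and diff against a baseline:)

import gc
from collections import Counter

def object_census():
    gc.collect()  # drop garbage so only survivors are counted
    return Counter(type(o).__name__ for o in gc.get_objects())

baseline = object_census()
for i in range(10):
    test_off_surface_eval(_acf, True)  # one iteration, as in run_em_all
    census = object_census()
    print((census - baseline).most_common(20))  # types whose counts grew
    baseline = census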

@inducer
Owner Author

inducer commented Jul 19, 2023

That was a reference cycle, but not a significant contributor: inducer/pytools#181.

@inducer
Owner Author

inducer commented Jul 19, 2023

Beyond that, it's mostly a bunch of pyopencl "stuff" that stays alive:

<class 'list'>, <class 'list'>, <class 'set'>, <class 'pyopencl._cl.Context'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'pyopencl.scan.GenericScanKernel'>, <class 'list'>, <class 'pyopencl._cl.Device'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltFinalUpdateKernelInfo'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl._cl.Platform'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'tuple'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'tuple'>, <class 'list'>, <class 'tuple'>, <class 'list'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, ...`

@inducer
Owner Author

inducer commented Jul 19, 2023

Hm. Calling pyopencl.tools.clear_first_arg_caches in each loop makes most of those go away (along with substantially flattening the mprof curve), so this is probably as expected. One thought is that running pytest in parallel may mean that pyopencl.tools.clear_first_arg_caches is not getting called as intended. (from here)
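Concretely, the run_em_all loop from the earlier diff, amended along those lines (a sketch):

def run_em_all():
    import gc
    import pyopencl.tools

    for i in range(10):
        test_off_surface_eval(_acf, True)
        pyopencl.tools.clear_first_arg_caches()  # drop context-keyed caches
        gc.collect()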

@inducer
Owner Author

inducer commented Jul 19, 2023

Irritatingly, even with zero growth of Python-side object allocation, the test_off_surface_eval loop still manages to lose about 20MB over 30 runs.

@inducer
Owner Author

inducer commented Jul 19, 2023

OK, it's flat after switching to jemalloc, so that's probably not our problem. Switching to the interaction of pytest with PyOpenCL's first-arg caches as the prime suspect.

@inducer
Owner Author

inducer commented Jul 19, 2023

Well. So I removed that return from the middle of test_off_surface_eval, and suddenly we're leaking a bunch of loopy stuff. This may just be the caches though, will keep digging.

@inducer
Owner Author

inducer commented Jul 20, 2023

Observations:

  • New LoopKernels show up in new_objects in each iteration. They're all for the FMM translations. Why are they there?
  • Does the array context's caching of einsum programs work?
  • Loopy's kernel executor mechanism is a tangle of circular references. It needs to be untangled, to avoid having it keep CL contexts alive. I wonder if it'd be better to introduce a separate object holding a reference to a CL context and a loopy kernel, and having that be callable. (A sketch of this idea follows this list.)
  • clear_first_arg_caches is getting called, from here.
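A hypothetical sketch of the idea from the third bullet above, i.e. a single small callable that owns the references to the context and the kernel, with no back-pointers to form cycles:

from dataclasses import dataclass

import loopy as lp
import pyopencl as cl


@dataclass(frozen=True)
class BoundKernel:
    # the only strong references live here; nothing points back, so no cycles
    context: cl.Context
    t_unit: lp.TranslationUnit

    def __call__(self, queue: cl.CommandQueue, **kwargs):
        # loopy executables are invoked with a queue as the first argument
        return self.t_unit(queue, **kwargs)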

@inducer
Owner Author

inducer commented Jul 21, 2023

inducer/loopy#797 (plus small changes across the stack) should make it so that loopy caches don't keep contexts alive.

@inducer
Owner Author

inducer commented Jul 21, 2023

[plot: memory use; blue with inducer/loopy#797, black before]

Somewhat depressingly, inducer/loopy#797 doesn't seem to have made a decisive difference. Blue is with the changes, black is before.

@inducer
Owner Author

inducer commented Jul 22, 2023

Been on the hunt for other caches that keep things alive. They contribute, but they're all small potatoes:

[plot: memory use after clearing additional caches]

@inducer
Owner Author

inducer commented Jul 22, 2023

I've also printed lists of Python objects that, according to gc, survive each test. Here's a flavor:

138590, 62 diff, 119 new
[(<class 'builtin_function_or_method'>, 16), (<class 'weakref.ReferenceType'>, 15), (<class 'list'>, 14), (<class 'dict'>, 14), (<class 'cell'>, 10), (<class 'tuple'>, 8), (<class 'generator'>, 6), (<class 'method'>, 5), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class '_pytest.reports.TestReport'>, 3), (<class 'list_reverseiterator'>, 2), (<class 'function'>, 2), (<class 'pytools.persistent_dict._LinkedList'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'set'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class '_io.StringIO'>, 1), (<class 'pytools.persistent_dict._LRUCache'>, 1), (<class 'contextlib._GeneratorContextManager'>, 1)]
.
138728, 138 diff, 195 new
[(<class 'function'>, 39), (<class 'tuple'>, 31), (<class 'dict'>, 25), (<class 'cell'>, 22), (<class 'list'>, 11), (<class 'method'>, 8), (<class 'property'>, 8), (<class 'generator'>, 6), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'pytools.persistent_dict._LRUCache'>, 4), (<class 'type'>, 4), (<class 'weakref.ReferenceType'>, 4), (<class 'getset_descriptor'>, 4), (<class '_pytest.reports.TestReport'>, 2), (<class 'set'>, 1), (<class 'builtin_function_or_method'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'warnings.WarningMessage'>, 1), (<class 'list_reverseiterator'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class '_io.StringIO'>, 1), (<class 'module'>, 1), (<class 'logging.Logger'>, 1), (<class '_frozen_importlib_external.SourceFileLoader'>, 1), (<class '_frozen_importlib.ModuleSpec'>, 1), (<class 'DeprecationWarning'>, 1)]
l_inf error: 0.0025745079426662495
.
138820, 92 diff, 157 new
[(<class 'list'>, 49), (<class 'tuple'>, 20), (<class 'dict'>, 15), (<class 'cell'>, 13), (<class 'method'>, 10), (<class 'pytools.persistent_dict._LinkedList'>, 7), (<class 'pytools.persistent_dict._LRUCache'>, 5), (<class 'pymbolic.primitives.Variable'>, 4), (<class 'function'>, 3), (<class 'pymbolic.primitives.Product'>, 3), (<class 'set'>, 2), (<class 'list_reverseiterator'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class 'generator'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'pymbolic.primitives.Call'>, 2), (<class 'pymbolic.primitives.Subscript'>, 2), (<class '_pytest.terminal.WarningReport'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'sumpy.expansion.LinearPDEBasedExpansionTermsWrangler'>, 1), (<class 'sumpy.expansion.CSEMatVecOperator'>, 1), (<class '_pytest.reports.TestReport'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class 'sumpy.kernel.LaplaceKernel'>, 1), (<class 'pymbolic.primitives.Quotient'>, 1), (<class 'pymbolic.primitives.Sum'>, 1)]
.
139330, 510 diff, 568 new
[(<class 'tuple'>, 83), (<class 'function'>, 79), (<class 'dict'>, 64), (<class 'weakref.ReferenceType'>, 40), (<class 'cell'>, 36), (<class 'builtin_function_or_method'>, 31), (<class 'list'>, 28), (<class 'getset_descriptor'>, 20), (<class 'wrapper_descriptor'>, 13), (<class 'method_descriptor'>, 13), (<class 'pymbolic.primitives.Variable'>, 12), (<class 'set'>, 11), (<class 'module'>, 10), (<class 'type'>, 10), (<class '_frozen_importlib.ModuleSpec'>, 10), (<class 'method'>, 9), (<class '_frozen_importlib_external.SourceFileLoader'>, 8), (<class 'frozenset'>, 7), (<class 'pyopencl.tools._SimpleTextTemplate'>, 6), (<class 'pytools.log_process'>, 6), (<class 'property'>, 5), (<class 'pytential.qbx.target_specific.impl.Enum'>, 5), (<class 'pytools.persistent_dict._LinkedList'>, 5), (<class 'pymbolic.primitives.Call'>, 4), (<class 'pymbolic.primitives.Subscript'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class 'generator'>, 3), (<class 'staticmethod'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3)]
.
139453, 123 diff, 178 new
[(<class 'builtin_function_or_method'>, 33), (<class 'weakref.ReferenceType'>, 32), (<class 'tuple'>, 18), (<class 'dict'>, 14), (<class 'list'>, 12), (<class 'method'>, 9), (<class 'pymbolic.primitives.Variable'>, 6), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class 'generator'>, 5), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class 'set'>, 3), (<class '_pytest.terminal.WarningReport'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3), (<class '_pytest.reports.TestReport'>, 2), (<class 'function'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class 'cell'>, 2), (<class 'pymbolic.primitives.Call'>, 2), (<class 'frozenset'>, 2), (<class 'pymbolic.primitives.Subscript'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'pymbolic.primitives.Sum'>, 1), (<class 'pymbolic.primitives.Product'>, 1), (<class 'list_reverseiterator'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1)]
.
139482, 29 diff, 94 new
[(<class 'list'>, 15), (<class 'dict'>, 11), (<class 'cell'>, 10), (<class 'method'>, 9), (<class 'tuple'>, 8), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class '_pytest.reports.TestReport'>, 3), (<class 'generator'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3), (<class 'list_reverseiterator'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'set'>, 1), (<class 'functools.partial'>, 1), (<class 'function'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1)]

I'm increasingly coming to the conclusion that these leaks might not be Python-visible---these objects are unlikely to explain hundreds of megabytes, I think.

@inducer
Owner Author

inducer commented Jul 22, 2023

Further confirmation of this is that tracemalloc (what a nice tool!) seems to see none of the offending allocations.
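For the record, the kind of check tracemalloc makes possible (a sketch; run_em_all is the loop from the earlier diff):

import tracemalloc

tracemalloc.start(25)  # record up to 25 frames per allocation
before = tracemalloc.take_snapshot()
run_em_all()
after = tracemalloc.take_snapshot()

# top Python-visible growth sites, if any
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)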

@isuruf
Collaborator

isuruf commented Jul 22, 2023

Can you try with SUMPY_FFT_BACKEND=loopy?

@inducer
Owner Author

inducer commented Jul 22, 2023

Thanks for following along! 😀

SUMPY_FFT_BACKEND=loopy takes longer, but otherwise doesn't seem to make a big difference (it's the black curve):

[plot: memory use with SUMPY_FFT_BACKEND=loopy (black curve)]

@alexfikl
Collaborator

alexfikl commented Jul 22, 2023

Does something like https://github.com/bloomberg/memray give additional useful information? I've never actually tried it.

EDIT: Possibly even https://github.com/bloomberg/pytest-memray?

@inducer
Owner Author

inducer commented Jul 22, 2023

I was trying https://github.com/plasma-umass/scalene just now, but it seemed to incur a fairly significant slowdown? Not sure. I guess I'll switch to memray for a bit?

@inducer
Owner Author

inducer commented Jul 22, 2023

I've already cut pytest out of the picture and written some code to manually call the tests, to make sure that's not at fault. (Yes, I'm that deep into conspiracy territory.) Memray just crashed initially, but that seems to be due to the Intel OpenCL ICD. If I disable it, things at least seem to run.

@inducer
Owner Author

inducer commented Jul 22, 2023

sigh It seems memray gets confused by GPUs and simply reports the 600 GiB of mapped GPU memory as memory consumption. 🤦 (Yes, yes, I'm rebuilding pocl without GPUs.)

@inducer
Owner Author

inducer commented Jul 22, 2023

[Nvm all this, see below]

🎉 Memray did something useful!

Here's the relevant flamegraph: https://gist.github.com/inducer/2aaa320fdf49b3d5af651cbc28d5ee4d

It seems like most of the leaks are in pocl, with vkfft contributing a measly 100 MB. I'll recompile pocl with debug to try and get more precise info.

@inducer
Owner Author

inducer commented Jul 23, 2023

[Nvm all this, see below]

Here's the updated flamegraph with Pocl in debug mode: https://gist.github.com/inducer/bfaeb3a85025f3e7a747cba930e2477e

One thing that's not clear: the bottom of most stack traces shows memray. Does that mean that memray's allocations dominate, and that is what we've learned? Or is the presence of those stack frames just an artifact that we should ignore?

@inducer
Owner Author

inducer commented Jul 23, 2023

[Nvm all this, see below]

Continuing on the assumption that the "memray intercept stuff" is not the actual source of memory consumption.

If that's the case, then that flamegraph implicates two specific mallocs in pocl:

Hmm... both of those bits of code look non-leaky to me, but both of them also return memory to a caller... so there's probably more going on.

@inducer
Owner Author

inducer commented Jul 23, 2023

[Nvm all this, see below]

The first of those sites also leaks when running https://gist.github.com/inducer/6c2a088eefeffd86cdd456e5aa244bec, a slightly modified PyOpenCL demo, on an empty cache, where it seems to leak 256 MB over the course of the runtime.

@inducer
Owner Author

inducer commented Jul 24, 2023

[Nvm all this, see below]

I've been staring at this for a while: both the PyOpenCL demo and the full-scale test with test_layer_pot.py show multiple instances of temp_buf escaping this function. I just can't for the life of me figure out how.

@inducer
Owner Author

inducer commented Jul 24, 2023

Never mind all that. TIL that memray flamegraph has a --leaks option that you need to use if you'd like to see leaks. 🤦 (along with PYTHONMALLOC=malloc when gathering)
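So the gather/render workflow that actually shows leaks looks something like this (output filename illustrative):

PYTHONMALLOC=malloc memray run -o run.bin test_layer_pot.py
memray flamegraph --leaks run.bin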

@inducer
Owner Author

inducer commented Jul 25, 2023

The latest from memray: There is no leak! Here's the flamegraph: https://ssl.tiker.net/nextcloud/s/ayQRNNcx8MBexkS

So maybe it's time to come to terms with the idea that the leak, if there is one, isn't visible to malloc? 🤷

@inducer
Owner Author

inducer commented Jul 27, 2023

#214 and the PRs linked therein are where this journey ends, for now.

@inducer
Owner Author

inducer commented Jul 28, 2023

Here's a multi-process memory profile, from mprof, made with:

PYOPENCL_TEST=port:cpu mprof run -M python -m pytest -rsxw --durations=10 -n 4

[plot: per-process memory use of the parallel pytest run]

It looks like it's mostly one subprocess/one test that hoovers up most of that memory...

@inducer
Owner Author

inducer commented Jul 28, 2023

So... maybe all that leak-hunting was not as useless as I had feared. Here's an mprof of a single-process run:

 PYOPENCL_TEST=port:cpu mprof run --python pytest -rsxw --durations=10

[plot: memory use of the single-process run]

Noticeably, the one 12G bump still exists.

@inducer
Owner Author

inducer commented Jul 29, 2023

It looks like that 12G thing is test_skeletonize_by_proxy_convergence. Memray told me. 🎉

Flamegraph: https://ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa

@alexfikl
Collaborator

It looks like that 12G thing is test_skeletonize_by_proxy_convergence. Memray told me. 🎉

Flamegraph: ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa

Well that's quite impressive. I'll try to look at it over the weekend. It definitely should not do that :\

@alexfikl
Collaborator

alexfikl commented Jul 29, 2023

Hm, tried looking only at the skeletonization tests with

mprof run python -m pytest -v -s -m 'not slowtest' test/test_linalg_skeletonization.py

and there's definitely a big bump in memory due to that test. It's mainly due to constructing the full dense matrix here (a slightly bigger one than the one that is in test_matrix.py):

mat = MatrixBuilder(
    actx,
    dep_expr=sym_u,
    other_dep_exprs=[],
    dep_discr=density_discr,
    places=places,
    context=case.knl_concrete_kwargs,
    _weighted=wrangler.weighted_sources)(expr)

Locally it only bumps up to about 2GB, not 12GB, so I'm not sure what that's about. However, that matrix construction makes up most of the memory use, and the error computation (which uses that matrix to compute a bunch of block-wise matrix norms and a big SVD) makes up most of the runtime. I'll continue digging into it.

[plot: memory use of the skeletonization tests]

(blue is local and black is koelsch)

@alexfikl
Collaborator

@inducer What did you run to get the results from #59 (comment)? I tried running on koelsch and it looks intimidatingly reasonable:

[plot: memory use on koelsch]

From the command on the plot, it seems like you're also running the slowtest tests? There are some 3D tests in the skeletonization suite that use ALL the memory, so maybe those are showing up? They shouldn't actually be run on the CI, though.

@inducer
Owner Author

inducer commented Jul 30, 2023

I did not use -k 'not slowtest'. 🤔 Have you pulled PyOpenCL? That's the only difference between CI and these runs that I can think of, though I don't think inducer/pyopencl@9931e14 would make all that difference (between the CIs, which are presumably dying because of excessive memory allocation, and your run).

Also: let me rerun with the slow tests excluded.

@alexfikl
Collaborator

alexfikl commented Jul 30, 2023

I did not use -k 'not slowtest'. 🤔 Have you pulled PyOpenCL? That's the only difference between CI and these runs that I can think of, though I don't think inducer/pyopencl@9931e14 would make all that difference.

Yeah, all of my runs have been on the latest git across the whole stack, with a mamba update --all from a few days ago.

EDIT: And I think I cleared all the disk caches last week too.
