Pocl memory leaks (not just) during CI #212
Appears to be fixed by #211. |
Hmm... interesting. Are you both running the same thing and getting different results? |
I'm running it again; it's just taking a long time. I'll update as soon as the data is in. |
Seems like it. I'm running pytest with |
Which OpenCL runtime are you using? |
We're both running with pocl. Our environments also look pretty similar:

diff --git a/conda_list_matt.txt b/conda_list_mike.txt
index e5d90c2..3d98384 100644
--- a/conda_list_matt.txt
+++ b/conda_list_mike.txt
@@ -1,4 +1,4 @@
#
# Name Version Build Channel
alabaster 0.7.12 pypi_0 pypi
@@ -73,9 +73,9 @@ pocl 1.6 h4841041_1 conda-forge
psutil 5.8.0 pypi_0 pypi
pudb 2020.1 pypi_0 pypi
py 1.10.0 pypi_0 pypi
-pybind11 2.6.2 pypi_0 pypi
pycodestyle 2.6.0 pypi_0 pypi
pyflakes 2.2.0 pypi_0 pypi
+pygments 2.7.4 pypi_0 pypi
pymbolic 2020.2 dev_0 <develop>
pymetis 2020.1 py38hb45949e_2 conda-forge
pyopencl 2021.1 py38h73a25cc_0 conda-forge
@@ -84,6 +84,7 @@ pytato 2020.1.1 dev_0 <develop>
pytest 6.2.2 pypi_0 pypi
pytest-pudb 0.7.0 pypi_0 pypi
python 3.8.6 h624753d_5_cpython conda-forge
+python-dateutil 2.8.1 pypi_0 pypi
python_abi 3.8 1_cp38 conda-forge
pytools 2021.1 pyhd3deb0d_0 conda-forge
pytz 2021.1 pypi_0 pypi

Not sure what's up with the pybind11 difference. |
That shouldn't matter. pybind11 is only needed when bindings are built from source. (i.e. it's not used at runtime by anything) |
I'm seeing similar leaky behavior to @MTCam's when I run on lassen. Even with the diffusion operator call commented out and the RHS set to 0, the memory usage still grows pretty quickly (black line below). Interestingly, if I comment out the timestepping loop, most of the growth disappears (blue line). If I add more test cases (orders 4 and 5), it keeps growing; the same happens if I instead increase the number of timesteps in each test (50 to 150). |
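(A minimal sketch of how RSS-over-time curves like these can be collected, assuming psutil, which appears in the environment list above; the helper name, sampling interval, and output format are illustrative choices, not what was actually used for the plots:)

```python
# Hypothetical RSS logger (not the script used for the plots in this thread):
# samples the resident set size of the current process in a background thread.
import threading
import time

import psutil


def start_rss_logger(outfile="rss_log.txt", interval=0.5):
    proc = psutil.Process()
    t_start = time.time()

    def sample():
        with open(outfile, "w") as f:
            while True:
                rss_mib = proc.memory_info().rss / 2**20
                f.write(f"{time.time() - t_start:.1f} {rss_mib:.1f}\n")
                f.flush()
                time.sleep(interval)

    threading.Thread(target=sample, daemon=True).start()


# Call start_rss_logger() once, before entering the timestepping loop.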
Wow. I would not have expected that. I wonder why the default allocator is so sluggish/dysfunctional in returning memory to the system. |
Would it make sense to change the default allocator to the MemoryPool in array contexts? Other people have probably hit the same issue. |
IMO no. If we do that, we won't know what the problem is, we'll just have papered over it (so it could come back any moment). |
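(To illustrate the suggestion, here is a sketch assuming pyopencl's MemoryPool/ImmediateAllocator and meshmode's PyOpenCLArrayContext; it is shown only as an illustration, not a recommendation, given the objection above:)

```python
# Sketch: wiring a pyopencl MemoryPool into an array context. The pool
# recycles buffers instead of releasing them to the CL implementation on
# every free, which would hide (not fix) the underlying allocator behavior.
import pyopencl as cl
import pyopencl.tools as cl_tools

from meshmode.array_context import PyOpenCLArrayContext

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

allocator = cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue))
actx = PyOpenCLArrayContext(queue, allocator=allocator)
```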
Here's some valgrind output. It measures a single execution of https://gist.github.com/inducer/2956323608953cbf7b3246fe5a426d33 |
(@majosm: Stuck your valgrind output in a gist, to keep this issue manageable. Hope that's OK with you.) |
I think I found a manageable reproducer without pytest that takes only a few seconds to run. I'm on f89f563, and I've applied this diff:

diff --git a/test/test_euler.py b/test/test_euler.py
index 4e86e45..8121a8b 100644
--- a/test/test_euler.py
+++ b/test/test_euler.py
@@ -882,3 +882,9 @@ def test_isentropic_vortex(actx_factory, order):
eoc_rec.order_estimate() >= order - 0.5
or eoc_rec.max_error() < 1e-11
)
+
+
+if __name__ == "__main__":
+    import meshmode
+    for i in range(6):
+        test_inviscid_flux(meshmode._acf, 20, 3)

Using that (pocl 1.6, built from source), running it under memory-profiler's mprof gives me the first plot, whereas using the Intel OpenCL runtime gives the second. |
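(As an aside, memory-profiler can also be driven from Python rather than via the mprof command line; a sketch, assuming it is run from the test/ directory so that test_euler is importable:)

```python
# Sketch: sample memory usage of the reproducer via memory_profiler's Python
# API (the plots above were produced with the mprof command-line tool).
import meshmode
from memory_profiler import memory_usage

from test_euler import test_inviscid_flux


def run():
    for _ in range(6):
        test_inviscid_flux(meshmode._acf, 20, 3)


# One RSS sample (in MiB) every 0.1 s for the duration of run().
samples = memory_usage(run, interval=0.1)
print(f"peak RSS: {max(samples):.1f} MiB")
```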
Patching pocl/pocl@a2d016c with this diff:

diff --git a/lib/CL/devices/basic/basic.c b/lib/CL/devices/basic/basic.c
index 5022970a..a912aced 100644
--- a/lib/CL/devices/basic/basic.c
+++ b/lib/CL/devices/basic/basic.c
@@ -212,7 +212,6 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
void *b = NULL;
cl_mem_flags flags = mem_obj->flags;
unsigned i;
- POCL_MSG_PRINT_MEMORY (" mem %p, dev %d\n", mem_obj, device->dev_id);
/* check if some driver has already allocated memory for this mem_obj
in our global address space, and use that*/
for (i = 0; i < mem_obj->context->num_devices; ++i)
@@ -231,6 +230,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
return CL_SUCCESS;
}
}
+ POCL_MSG_PRINT_MEMORY (" mem_alloc %p, dev %d, size %d\n", mem_obj, device->dev_id, mem_obj->size);
/* memory for this global memory is not yet allocated -> do it */
if (flags & CL_MEM_USE_HOST_PTR)
@@ -241,7 +241,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
}
else
{
- POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
+ // POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
b = pocl_aligned_malloc_global_mem (device, MAX_EXTENDED_ALIGNMENT,
mem_obj->size);
if (b==NULL)
@@ -273,6 +273,8 @@ pocl_basic_free (cl_device_id device, cl_mem memobj)
{
cl_mem_flags flags = memobj->flags;
+ POCL_MSG_PRINT_MEMORY (" mem_free %p, dev %d, size %d\n", memobj, device->dev_id, memobj->size);
+
/* allocation owner executes freeing */
if (flags & CL_MEM_USE_HOST_PTR ||
memobj->shared_mem_allocation_owner != device)

Using this log reader on the resulting log gives this plot. So it seems that pocl is at least trying to return memory to the operating system...? |
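(A rough sketch of what a reader for the mem_alloc/mem_free log lines added by this patch could look like; the actual script used above may differ:)

```python
# Hypothetical reader for a pocl log produced with the patch above: tracks
# the total number of bytes pocl currently has outstanding, event by event.
import re

import matplotlib.pyplot as plt

alloc_re = re.compile(r"mem_alloc .*size (\d+)")
free_re = re.compile(r"mem_free .*size (\d+)")


def outstanding_bytes(logfile):
    total = 0
    series = []
    with open(logfile) as f:
        for line in f:
            m = alloc_re.search(line)
            if m:
                total += int(m.group(1))
                series.append(total)
                continue
            m = free_re.search(line)
            if m:
                total -= int(m.group(1))
                series.append(total)
    return series


series = outstanding_bytes("pocl.log")
plt.plot([b / 2**20 for b in series])
plt.xlabel("alloc/free event")
plt.ylabel("outstanding pocl allocations [MiB]")
plt.show()
```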
A potentially interesting finding that corroborates the observation that the memory consumption is system-dependent: it seems that this has everything to do with the implementation of malloc in use. Using the excellent jemalloc gives:

With my libc's allocator, by contrast, you can pretty much see the fragmentation happening. This is plotting the returned addresses vs. time: observe how the allocator starts returning memory in kind of a wild order, and then the allocation addresses shoot higher and higher. This may be an interaction between the allocator and our allocation pattern: pocl allocating big chunks and small administrative things in about equal proportion, and the Python runtime allocating lots of tiny things. The libc allocator (glibc 2.31-9 from Debian in my case) may simply not be handling this load well.

To get a memory usage log as clean as Intel CL's, I suppose we could try to use a separate allocator for CL buffers. jemalloc is liberally licensed, open source, and, conveniently, already in conda-forge for all platforms that we care about.

Bonus data point: using the malloc implementation from Intel's TBB gives: |
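(When swapping in an allocator like this, e.g. via LD_PRELOAD, it's easy to end up silently still on libc malloc; a quick Linux-only sanity check, as a sketch:)

```python
# Sketch (Linux only): check which malloc implementations are mapped into
# the current process, to verify that a preloaded jemalloc/TBB malloc
# actually took effect.
def loaded_allocators():
    hits = set()
    with open("/proc/self/maps") as f:
        for line in f:
            for name in ("jemalloc", "tbbmalloc", "tcmalloc"):
                if name in line:
                    hits.add(name)
    return hits or {"system libc malloc only"}


print(loaded_allocators())
```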
On my Intel (Core i7) Mac (Catalina 10.15.4), after:
edit: I almost have to believe it wasn't using a different allocator here. This looks identical to the other tests I ran. (#246) |
I think you might be right. Quoting from https://github.com/jemalloc/jemalloc/blob/041145c272711b55f91aa42128b108674a12fd91/INSTALL.md:
AFAICT, the conda-forge feedstock does not override this, so it's not a surprise that this didn't do much on a Mac. We'll need a different solution, then. Could you try with |
Heh, wut? 🤔 |
Definitely different! edit: But at least something got returned? |
Yikes! That looks worse. |
I tried using tracemalloc with the zero-RHS case using DOFArray (code here). The DOFArray version definitely does more small memory allocations than the equivalent numpy code. The numpy version does not exhibit the memory growth issue. A large number of allocations are actually coming from a deprecation warning in numpy that gets triggered by the pyopencl array constructor. Unfortunately, fixing this doesn't seem to help with the memory deallocation issue. Would it be possible/worthwhile to remove some of these other allocations? I'm not sure how much they would come into play in a lazy evaluation context. |
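(For reference, the basic tracemalloc recipe looks roughly like this; run_zero_rhs_steps is a placeholder stub standing in for the loop being profiled, and the linked code likely differs in detail:)

```python
# Minimal tracemalloc sketch: show which source lines own the most
# currently-allocated memory at the time of the snapshot.
import tracemalloc


def run_zero_rhs_steps():
    # Placeholder: stand-in for the zero-RHS timestepping loop being profiled.
    return [bytearray(1024) for _ in range(1000)]


tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

data = run_zero_rhs_steps()

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```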
Thanks! I've actually never used tracemalloc, so I found this very informative.
Kind of confused what these are. We're allocating 25806 numpy arrays? For what? Are those the object arrays? (but they're mighty big for that) Or is this the warning message? (Is the
What are these? Any idea? Some more thoughts:
|
What's being allocated is something related to the warning (not sure if it's the message, or the deprecation warning object, or what). It looks like the size is the sum of all the allocations.
Also seems to be related to the warnings. All of these go away if I replace the two instances of
I don't think it does. Sounds like it just instruments the memory allocation code on the Python side.
From what I understand, tracemalloc only displays what's currently allocated at the time of the snapshot (and also only those objects that were allocated after tracing was started). |
Hmm, I get the same 4 allocations with numpy after disabling GC. And the reported sizes for the 3 large allocations seem odd, too -- they're listed as averaging 416 KiB, which is 1/3 of a DOF array. Here are some results for a single step (code here): |
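(One way to isolate per-step growth from whatever is merely alive at snapshot time is to diff two snapshots with the cyclic GC disabled, roughly as below; take_one_step is a placeholder stub:)

```python
# Sketch: compare tracemalloc snapshots taken around a single step, with the
# cyclic garbage collector disabled so collection timing doesn't blur things.
import gc
import tracemalloc

leak = []


def take_one_step():
    # Placeholder for one RK step / RHS evaluation; appends to simulate growth.
    leak.append(bytearray(1 << 20))


gc.disable()
tracemalloc.start()

take_one_step()
before = tracemalloc.take_snapshot()
take_one_step()
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)

gc.enable()
```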
Here are some recent updates, and here is the same experiment using new-world grudge (inducer/grudge#74). |
This diffusion test that uses over-integration seems to require a great deal of memory in CI, causing a subsequent crash of pytest without error messages in #95. Here's a plot of the memory usage of pytest with and without the diffusion accuracy test. The blue data has the diffusion accuracy test disabled. When the diffusion accuracy test is disabled, CI in #95 runs to completion with all tests passing. When the diffusion test is enabled, pytest crashes inexplicably in the next suite/file.