
Pocl memory leaks (not just) during CI #212

Open · MTCam opened this issue Feb 3, 2021 · 36 comments

@MTCam (Member) commented Feb 3, 2021

This diffusion test, which uses over-integration, seems to require a great deal of memory in CI, causing a subsequent crash of pytest without error messages in #95.

Here's a plot of the memory usage of pytest with and without the diffusion accuracy test; the blue data has the diffusion accuracy test disabled. With the test disabled, CI in #95 runs to completion with all tests passing. With it enabled, pytest crashes inexplicably in the next suite/file.

Screen Shot 2021-02-03 at 8 33 05 AM
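
For reference, memory-vs-time traces like this can be gathered from Python via memory-profiler's API (the same package whose mprof command shows up later in this thread). A minimal sketch, not the exact measurement setup used here:

# Hedged sketch: sample the resident memory of a pytest run using
# memory-profiler's Python API; `mprof run` does the same from the shell.
from memory_profiler import memory_usage

def run_tests():
    import pytest
    pytest.main(["test_diffusion.py"])

# Sample RSS (in MiB) every 10 ms while run_tests() executes.
samples = memory_usage((run_tests, (), {}), interval=0.01)
print(f"peak: {max(samples):.1f} MiB over {len(samples)} samples")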

@MTCam (Member, Author) commented Feb 3, 2021

Appears to be fixed by #211.

@inducer (Contributor) commented Feb 3, 2021

#211 manages to paper over this, sure, but I wouldn't call this a fix. #150 reports a similar memory growth issue. I propose we devote some time to finding out what's going on. There's no reason why memory use should grow like this.

@majosm (Collaborator) commented Feb 3, 2021

I'm seeing different behavior on my laptop:

Figure_2

This is with my old diffusion dev environment though. I'm currently re-running with a fresh emirge.

@majosm (Collaborator) commented Feb 3, 2021

Hmmm, nope. No memory blowup with fresh emirge either:

Figure_1

@inducer (Contributor) commented Feb 3, 2021

Hmm... interesting. Are you both running the same thing and getting different results?

@MTCam (Member, Author) commented Feb 3, 2021

I'm running it again; it's just taking a long time. I'll update as soon as the data is in.

@majosm (Collaborator) commented Feb 3, 2021

Seems like it. I'm running pytest with -sv, but I don't see why that would make a difference. I can try a run without just to make sure.

@MTCam (Member, Author) commented Feb 3, 2021

Seems like it. I'm running pytest with -sv, but I don't see why that would make a difference. I can try a run without just to make sure.

A little redundant, but I just wanted to make sure I run exactly the same thing. This was the diffusion test only, on fresh emirge/master.

Screen Shot 2021-02-03 at 11 27 16 AM

@matthiasdiener (Member) commented:

Which OpenCL runtime are you using?

@majosm (Collaborator) commented Feb 3, 2021

We're both running with <pyopencl.Device 'pthread' on 'Portable Computing Language'>.
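
(As an aside, a quick way to double-check which runtime/device pyopencl picks up, assuming the default context-creation path; just a sketch:)

# Sketch: print the OpenCL platform/device that pyopencl selects;
# create_some_context() honors the PYOPENCL_CTX environment variable.
import pyopencl as cl

ctx = cl.create_some_context()
for dev in ctx.devices:
    print(dev, "|", dev.platform.name, dev.platform.version)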

Our environments also look pretty similar:

diff --git a/conda_list_matt.txt b/conda_list_mike.txt
index e5d90c2..3d98384 100644
--- a/conda_list_matt.txt
+++ b/conda_list_mike.txt
@@ -1,4 +1,4 @@
 #
 # Name                    Version                   Build  Channel
 alabaster                 0.7.12                   pypi_0    pypi
@@ -73,9 +73,9 @@ pocl                      1.6                  h4841041_1    conda-forge
 psutil                    5.8.0                    pypi_0    pypi
 pudb                      2020.1                   pypi_0    pypi
 py                        1.10.0                   pypi_0    pypi
-pybind11                  2.6.2                    pypi_0    pypi
 pycodestyle               2.6.0                    pypi_0    pypi
 pyflakes                  2.2.0                    pypi_0    pypi
+pygments                  2.7.4                    pypi_0    pypi
 pymbolic                  2020.2                    dev_0    <develop>
 pymetis                   2020.1           py38hb45949e_2    conda-forge
 pyopencl                  2021.1           py38h73a25cc_0    conda-forge
@@ -84,6 +84,7 @@ pytato                    2020.1.1                  dev_0    <develop>
 pytest                    6.2.2                    pypi_0    pypi
 pytest-pudb               0.7.0                    pypi_0    pypi
 python                    3.8.6           h624753d_5_cpython    conda-forge
+python-dateutil           2.8.1                    pypi_0    pypi
 python_abi                3.8                      1_cp38    conda-forge
 pytools                   2021.1             pyhd3deb0d_0    conda-forge
 pytz                      2021.1                   pypi_0    pypi

Not sure what's up with the pybind11 difference though.

@inducer (Contributor) commented Feb 3, 2021

Not sure what's up with the pybind11 difference though.

That shouldn't matter. pybind11 is only needed when bindings are built from source (i.e. it's not used at runtime by anything).

@majosm (Collaborator) commented Feb 3, 2021

I'm seeing leaky behavior similar to @MTCam's when I run on lassen. Even with the diffusion operator call commented out and the RHS set to 0, the memory usage still grows pretty quickly (black line below). Interestingly, if I comment out the timestepping loop, most of the growth disappears (blue line).

Figure_0

If I add more test cases (orders 4 and 5) it keeps growing:

Figure_1

Same if I instead increase the number of timesteps in each test (50 to 150):

Figure_2

@majosm (Collaborator) commented Feb 3, 2021

Also interesting -- I changed the test to have it explicitly construct a PyOpenCLArrayContext so I could try using a different allocator. Here's the default allocator (black) vs. memory pool allocator (blue):

Figure_4
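
For context, the switch looks roughly like this (a sketch using the pyopencl and meshmode names assumed in this thread, not an excerpt of the actual test change):

# Sketch: construct the array context with the default allocator vs. a
# pyopencl MemoryPool. Not the actual test modification.
import pyopencl as cl
import pyopencl.tools as cl_tools
from meshmode.array_context import PyOpenCLArrayContext

cl_ctx = cl.create_some_context()
queue = cl.CommandQueue(cl_ctx)

# Default: every DOF-array allocation/free goes straight to the CL runtime
# (here pocl's pthread device), i.e. to malloc/free.
actx_default = PyOpenCLArrayContext(queue)

# Pooled: freed buffers are retained and reused, so pocl sees far fewer
# allocate/free round trips.
pool = cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue))
actx_pooled = PyOpenCLArrayContext(queue, allocator=pool)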

@inducer (Contributor) commented Feb 3, 2021

memory pool allocator

Wow. I would not have expected that. I wonder why the default allocator is so sluggish/dysfunctional in returning memory to the system.

@matthiasdiener (Member) commented Feb 3, 2021

memory pool allocator

Wow. I would not have expected that. I wonder why the default allocator is so sluggish/dysfunctional in returning memory to the system.

Would it make sense to change the default allocator to the MemoryPool in array contexts? Other people have probably hit the same issue.

@inducer (Contributor) commented Feb 3, 2021

IMO no. If we do that, we won't know what the problem is; we'll just have papered over it (so it could come back any moment).

MTCam added a commit that referenced this issue Feb 4, 2021
inducer changed the title from "Pytest crashes in CI after diffusion tests" to "Pocl memory leaks (not just) during CI" on Feb 4, 2021
@majosm (Collaborator) commented Feb 9, 2021

Here's some valgrind output. It measures a single execution of test_diffusion_accuracy, modified to return 0 for the RHS instead of calling diffusion_operator. The spot where the memory usage jumps up appears to be the timestepping loop (because I can vary the peak memory usage by changing the number of steps).

https://gist.github.com/inducer/2956323608953cbf7b3246fe5a426d33

@inducer (Contributor) commented Feb 9, 2021

(@majosm: Stuck your valgrind output in a gist, to keep this issue manageable. Hope that's OK with you.)

@inducer (Contributor) commented Feb 9, 2021

I think I found a manageable reproducer without pytest that takes only a few seconds to run. I'm on f89f563, and I've applied this diff:

diff --git a/test/test_euler.py b/test/test_euler.py
index 4e86e45..8121a8b 100644
--- a/test/test_euler.py
+++ b/test/test_euler.py
@@ -882,3 +882,9 @@ def test_isentropic_vortex(actx_factory, order):
         eoc_rec.order_estimate() >= order - 0.5
         or eoc_rec.max_error() < 1e-11
     )
+
+
+if __name__ == "__main__":
+    import meshmode
+    for i in range(6):
+        test_inviscid_flux(meshmode._acf, 20, 3)

Using that (pocl 1.6, built from source), running

PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py 

gives me this plot out of memory-profiler's mprof plot:

grafik

whereas using the Intel OpenCL runtime, running

PYOPENCL_TEST=int mprof run -T 0.01 python test_euler.py 

gives me this much more sensible plot:
grafik

@inducer (Contributor) commented Feb 9, 2021

Patching pocl/pocl@a2d016c with this diff:

diff --git a/lib/CL/devices/basic/basic.c b/lib/CL/devices/basic/basic.c
index 5022970a..a912aced 100644
--- a/lib/CL/devices/basic/basic.c
+++ b/lib/CL/devices/basic/basic.c
@@ -212,7 +212,6 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
   void *b = NULL;
   cl_mem_flags flags = mem_obj->flags;
   unsigned i;
-  POCL_MSG_PRINT_MEMORY (" mem %p, dev %d\n", mem_obj, device->dev_id);
   /* check if some driver has already allocated memory for this mem_obj 
      in our global address space, and use that*/
   for (i = 0; i < mem_obj->context->num_devices; ++i)
@@ -231,6 +230,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
           return CL_SUCCESS;
         }
     }
+  POCL_MSG_PRINT_MEMORY (" mem_alloc %p, dev %d, size %d\n", mem_obj, device->dev_id, mem_obj->size);
 
   /* memory for this global memory is not yet allocated -> do it */
   if (flags & CL_MEM_USE_HOST_PTR)
@@ -241,7 +241,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
     }
   else
     {
-      POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
+      // POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
       b = pocl_aligned_malloc_global_mem (device, MAX_EXTENDED_ALIGNMENT,
                                           mem_obj->size);
       if (b==NULL)
@@ -273,6 +273,8 @@ pocl_basic_free (cl_device_id device, cl_mem memobj)
 {
   cl_mem_flags flags = memobj->flags;
 
+  POCL_MSG_PRINT_MEMORY (" mem_free %p, dev %d, size %d\n", memobj, device->dev_id, memobj->size);
+
   /* allocation owner executes freeing */
   if (flags & CL_MEM_USE_HOST_PTR ||
       memobj->shared_mem_allocation_owner != device)

and using this log reader on the resulting log gives this plot:
grafik

So it seems that pocl is at least trying to return memory to the operating system...?
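
(The linked log reader isn't reproduced here; a minimal sketch of what such a script might do with the patched output format above, i.e. track live bytes from the mem_alloc/mem_free lines, could look like this:)

# Hypothetical log reader for the POCL_DEBUG=memory output produced by the
# patch above; not the actual linked script.
import re
import sys

alloc_re = re.compile(r"mem_alloc (0x[0-9a-f]+), dev \d+, size (\d+)")
free_re = re.compile(r"mem_free (0x[0-9a-f]+), dev \d+, size (\d+)")

live = {}        # address -> size of currently live allocations
live_bytes = 0
history = []     # live bytes after each alloc/free event

with open(sys.argv[1]) as logfile:
    for line in logfile:
        if m := alloc_re.search(line):
            live[m.group(1)] = int(m.group(2))
            live_bytes += int(m.group(2))
        elif m := free_re.search(line):
            live_bytes -= live.pop(m.group(1), 0)
        else:
            continue
        history.append(live_bytes)

print("peak live bytes:", max(history))
print("final live bytes:", history[-1])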

@inducer (Contributor) commented Feb 10, 2021

A potentially interesting finding that corroborates the observation that the memory consumption is system-dependent: It seems that this has everything to do with the implementation of malloc being used, and potentially with the fragmentation being introduced by the allocator.

This (i.e. using the excellent jemalloc):

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 POCL_DEBUG=memory PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py > out 2>&1 

gives:

grafik

With my libc's allocator, by contrast, you can pretty much see the fragmentation happening. This is a plot of the returned addresses vs. time:

grafik

Observe how the allocator starts returning memory in kind of a wild order and then the allocation addresses shoot higher and higher. This may be an interaction between the allocator and our allocation pattern: pocl allocating big chunks and small, administrative things in about equal proportions and the Python runtime allocating lots of tiny things. The libc allocator (glibc 2.31-9 from Debian in my case) may simply not be handling this load well. To get a memory usage log as clean as Intel CL, I suppose we could try to use a separate allocator for CL buffers.

jemalloc is liberally-licensed open-source and, conveniently, already in conda-forge for all platforms that we care about. It seems that building pocl against jemalloc might avoid our memory growth. jemalloc (by default) overrides malloc, which means that it would need to be linked in by the main executable (python), i.e. that's not happening. We could build a feedstock of a jemalloc with a prefix (e.g. je_malloc) and use that in pocl, but that seems like overkill. I think simply LD_PRELOADing jemalloc for CI is probably OK. See #246.

Bonus data point: Using the malloc implementation from Intel's TBB:

 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtbbmalloc_proxy.so.2 POCL_DEBUG=memory PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py > out 2>&1 

gives
grafik
i.e. an improvement over libc, but not nearly as good as jemalloc.

@inducer (Contributor) commented Feb 10, 2021

Because I was curious and making the plot was cheap: This is the allocation address pattern from jemalloc:
grafik

@inducer (Contributor) commented Feb 10, 2021

This is the plot for python -m pytest test_euler.py with jemalloc, which would previously leak like a sieve:

grafik

@MTCam (Member, Author) commented Feb 10, 2021

On my Intel (Core i7) Mac (Catalina 10.15.4), after:

conda install jemalloc
DYLD_INSERT_LIBRARIES=omitted/lib/libjemalloc.2.dylib mprof run pytest test_diffusion.py
mprof plot

newmemusage

edit: I almost have to believe it wasn't using a different allocator here. This looks identical to the other tests I ran.

(#246)

@inducer (Contributor) commented Feb 10, 2021

edit: I almost have to believe it wasn't using a different allocator here. This looks identical to the other tests I ran.

I think you might be right.

Quoting from https://github.com/jemalloc/jemalloc/blob/041145c272711b55f91aa42128b108674a12fd91/INSTALL.md:

By default, the prefix is "", except on OS X, where it is "je_". On OS X, jemalloc overlays the default malloc zone, but makes no attempt to actually replace the "malloc", "calloc", etc. symbols.

AFAICT, the conda-forge feedstock does not override this. So it's not a surprise that this didn't do much on a Mac. We'll need a different solution then. Could you try with libtbbmalloc_proxy.dylib from https://github.com/conda-forge/tbb-feedstock?

@MTCam (Member, Author) commented Feb 10, 2021

Quoting from jemalloc/jemalloc@041145c/INSTALL.md:

By default, the prefix is "", except on OS X, where it is "je_". On OS X, jemalloc overlays the default malloc zone, but makes no attempt to actually replace the "malloc", "calloc", etc. symbols.

Heh, wut? 🤔

@MTCam (Member, Author) commented Feb 10, 2021

Could you try with libtbbmalloc_proxy.dylib from conda-forge/tbb-feedstock?

Definitely different!

Screen Shot 2021-02-10 at 1 54 45 PM

edit: But at least something got returned?

@inducer (Contributor) commented Feb 10, 2021

Yikes! That looks worse.

@majosm (Collaborator) commented Feb 15, 2021

I tried using the tracemalloc module to see what was being allocated during time stepping. Here are some results:

tracemalloc w/ zero RHS using DOFArray (code here)
tracemalloc w/ zero RHS using numpy (code here)

The DOFArray version definitely does more small memory allocations than the equivalent numpy code. The numpy version does not exhibit the memory growth issue.

A large number of allocations are actually coming from a deprecation warning in numpy that gets triggered by the pyopencl array constructor. Unfortunately, fixing this doesn't seem to help with the memory deallocation issue. Would it be possible/worthwhile to remove some of these other allocations? Not sure how much they would come into play in a lazy evaluation context.

@inducer (Contributor) commented Feb 16, 2021

Thanks! I've actually never used tracemalloc, so I found this very informative.

/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/site-packages/numpy/__init__.py:287: size=1286 KiB, count=25806, average=51 B
  File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/site-packages/numpy/__init__.py", line 287
    warnings.warn(msg, DeprecationWarning, stacklevel=2)

Kind of confused what these are. We're allocating 25806 numpy arrays? For what? Are those the object arrays? (but they're mighty big for that) Or is this the warning message? (Is the size= an aggregate measure?)

/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py:405: size=906 KiB, count=12877, average=72 B
  File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py", line 405
    self.message = message
/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py:412: size=425 KiB, count=6500, average=67 B
  File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py", line 412
    self._category_name = category.__name__ if category else None

What are these? Any idea?

Some more thoughts:

  • Does this see/instrument the memory allocations from pocl at all? (I'm not sure it does.) (If not: can we change that?)
  • How come numpy manages to do its entire time stepping with just four allocations?
  • I saw that tracemalloc has a handy "compare" mode. We could try that to see what (if any) memory sticks around between timesteps.
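
For the last point, the snapshot-compare workflow could look roughly like this (a sketch with a stand-in numpy workload, not mirgecom code):

# Sketch of tracemalloc snapshot comparison; the loop body is a stand-in
# for one timestep of the real test.
import tracemalloc
import numpy as np

tracemalloc.start(25)          # keep up to 25 frames per allocation

state = np.zeros(100_000)

snap_before = tracemalloc.take_snapshot()
for istep in range(50):
    state = state + 1.0        # stand-in for one RK4 step
snap_after = tracemalloc.take_snapshot()

# Allocations that appeared (or grew) between the two snapshots, i.e.
# memory that stuck around across the "timestepping".
for stat in snap_after.compare_to(snap_before, "lineno")[:10]:
    print(stat)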

@majosm (Collaborator) commented Feb 16, 2021

Kind of confused what these are. We're allocating 25806 numpy arrays? For what? Are those the object arrays? (but they're mighty big for that) Or is this the warning message? (Is the size= an aggregate measure?)

What's being allocated is something related to the warning (not sure if it's the message, or the deprecation warning object, or what). It looks like the size is the sum of all the allocations.

What are these? Any idea?

Also seems to be related to the warnings. All of these go away if I replace the two instances of np.object with object in pyopencl/array.py.

  • Does this see/instrument the memory allocations from pocl at all? (I'm not sure it does.) (If not: can we change that?)

I don't think it does. Sounds like it just instruments the memory allocation code on the Python side.

  • How come numpy manages to do its entire time stepping with just four allocations?

From what I understand tracemalloc only displays what's currently allocated at the time of the snapshot (and also only out of those objects that were allocated after tracemalloc.start()). Garbage collection complicates things a little. I'm going to try some more runs with garbage collection disabled so we can see everything that was allocated during stepping.

@majosm (Collaborator) commented Feb 16, 2021

Hmm, I get the same 4 allocations with numpy after disabling GC. And the reported sizes for the 3 large allocations seem odd, too -- they're listed as averaging 416 KiB, which is 1/3 of a DOF array.

Here are some results for a single step (code here):

Without GC collect afterwards
With GC collect afterwards

@MTCam (Member, Author) commented Feb 17, 2021

It seems that disabling warnings helps avoid the issue on my laptop:
pytest -W ignore test_diffusion.py

Screen Shot 2021-02-17 at 11 17 21 AM
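
(A more targeted, in-code alternative to -W ignore would be something along these lines; the message regex is an assumption based on the np.object deprecation discussed above:)

# Sketch: filter only the np.object DeprecationWarning triggered by the
# pyopencl array constructor, instead of silencing all warnings.
import warnings

warnings.filterwarnings(
    "ignore",
    category=DeprecationWarning,
    message=r".*np\.object.*",
)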

@inducer (Contributor) commented Mar 25, 2021

Had an OpenCL-related theory for something that could be contributing: inducer/pyopencl#450. Turns out that's not it.

@thomasgibson (Contributor) commented:

Here are some recent updates (using test_diffusion.py in mirgecom), run on my local machine with:
DYLD_INSERT_LIBRARIES=omitted/lib/libjemalloc.2.dylib mprof run pytest test_diffusion.py

Main:
main_mirgecom_difftest_jemalloc

And here is the same experiment using new world Grudge (inducer/grudge#74):
nwg_mirgecom_difftest_jemalloc

@thomasgibson (Contributor) commented:

Same thing but for the entire mirgecom test suite. Here's main:
main_mirgecom_alltests
And now with the revamped grudge:
nwg_mirgecom_alltests
