Pocl memory leaks (not just) during CI #212
Appears to be fixed by #211. |
Hmm... interesting. Are you both running the same thing and getting different results? |
I'm running it again; it's just taking a long time. I'll update as soon as the data is in. |
Seems like it. I'm running pytest with |
Which OpenCL runtime are you using? |
We're both running with pocl. Our environments also look pretty similar:

diff --git a/conda_list_matt.txt b/conda_list_mike.txt
index e5d90c2..3d98384 100644
--- a/conda_list_matt.txt
+++ b/conda_list_mike.txt
@@ -1,4 +1,4 @@
#
# Name Version Build Channel
alabaster 0.7.12 pypi_0 pypi
@@ -73,9 +73,9 @@ pocl 1.6 h4841041_1 conda-forge
psutil 5.8.0 pypi_0 pypi
pudb 2020.1 pypi_0 pypi
py 1.10.0 pypi_0 pypi
-pybind11 2.6.2 pypi_0 pypi
pycodestyle 2.6.0 pypi_0 pypi
pyflakes 2.2.0 pypi_0 pypi
+pygments 2.7.4 pypi_0 pypi
pymbolic 2020.2 dev_0 <develop>
pymetis 2020.1 py38hb45949e_2 conda-forge
pyopencl 2021.1 py38h73a25cc_0 conda-forge
@@ -84,6 +84,7 @@ pytato 2020.1.1 dev_0 <develop>
pytest 6.2.2 pypi_0 pypi
pytest-pudb 0.7.0 pypi_0 pypi
python 3.8.6 h624753d_5_cpython conda-forge
+python-dateutil 2.8.1 pypi_0 pypi
python_abi 3.8 1_cp38 conda-forge
pytools 2021.1 pyhd3deb0d_0 conda-forge
pytz 2021.1 pypi_0 pypi

Not sure what's up with the pybind11 difference. |
That shouldn't matter. pybind11 is only needed when bindings are built from source. (i.e. it's not used at runtime by anything) |
I'm seeing similar leaky behavior to @MTCam's when I run on lassen. Even with the diffusion operator call commented out and the RHS set to 0, the memory usage still grows pretty quickly (black line below). Interestingly, if I comment out the timestepping loop, most of the growth disappears (blue line). If I add more test cases (orders 4 and 5), it keeps growing; the same happens if I instead increase the number of timesteps in each test (50 to 150). |
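(A minimal sketch of how RSS-over-time curves like these can be collected, assuming psutil, which appears in the environment list above; the helper name, sampling interval, and output format are illustrative choices, not what was actually used for the plots:)

```python
# Hypothetical RSS logger (not the script used for the plots in this thread):
# samples the resident set size of the current process in a background thread.
import threading
import time

import psutil


def start_rss_logger(outfile="rss_log.txt", interval=0.5):
    proc = psutil.Process()
    t_start = time.time()

    def sample():
        with open(outfile, "w") as f:
            while True:
                rss_mib = proc.memory_info().rss / 2**20
                f.write(f"{time.time() - t_start:.1f} {rss_mib:.1f}\n")
                f.flush()
                time.sleep(interval)

    threading.Thread(target=sample, daemon=True).start()


# Call start_rss_logger() once, before entering the timestepping loop.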
Wow. I would not have expected that. I wonder why the default allocator is so sluggish/dysfunctional in returning memory to the system. |
Would it make sense to change the default allocator to the MemoryPool in array contexts? Other people have probably hit the same issue. |
IMO no. If we do that, we won't know what the problem is, we'll just have papered over it (so it could come back any moment). |
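(To illustrate the suggestion, here is a sketch assuming pyopencl's MemoryPool/ImmediateAllocator and meshmode's PyOpenCLArrayContext; it is shown only as an illustration, not a recommendation, given the objection above:)

```python
# Sketch: wiring a pyopencl MemoryPool into an array context. The pool
# recycles buffers instead of releasing them to the CL implementation on
# every free, which would hide (not fix) the underlying allocator behavior.
import pyopencl as cl
import pyopencl.tools as cl_tools

from meshmode.array_context import PyOpenCLArrayContext

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

allocator = cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue))
actx = PyOpenCLArrayContext(queue, allocator=allocator)
```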
Here's some valgrind output. It measures a single execution of https://gist.github.com/inducer/2956323608953cbf7b3246fe5a426d33 |
(@majosm: Stuck your valgrind output in a gist, to keep this issue manageable. Hope that's OK with you.) |
I think I found a manageable reproducer without pytest that takes only a few seconds to run. I'm on f89f563, and I've applied this diff:

diff --git a/test/test_euler.py b/test/test_euler.py
index 4e86e45..8121a8b 100644
--- a/test/test_euler.py
+++ b/test/test_euler.py
@@ -882,3 +882,9 @@ def test_isentropic_vortex(actx_factory, order):
eoc_rec.order_estimate() >= order - 0.5
or eoc_rec.max_error() < 1e-11
)
+
+
+if __name__ == "__main__":
+    import meshmode
+    for i in range(6):
+        test_inviscid_flux(meshmode._acf, 20, 3)

Using that (pocl 1.6, built from source), running it under memory-profiler's mprof gives me the first plot, whereas using the Intel OpenCL runtime gives the second. |
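(As an aside, memory-profiler can also be driven from Python rather than via the mprof command line; a sketch, assuming it is run from the test/ directory so that test_euler is importable:)

```python
# Sketch: sample memory usage of the reproducer via memory_profiler's Python
# API (the plots above were produced with the mprof command-line tool).
import meshmode
from memory_profiler import memory_usage

from test_euler import test_inviscid_flux


def run():
    for _ in range(6):
        test_inviscid_flux(meshmode._acf, 20, 3)


# One RSS sample (in MiB) every 0.1 s for the duration of run().
samples = memory_usage(run, interval=0.1)
print(f"peak RSS: {max(samples):.1f} MiB")
```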
Patching pocl/pocl@a2d016c with this diff:

diff --git a/lib/CL/devices/basic/basic.c b/lib/CL/devices/basic/basic.c
index 5022970a..a912aced 100644
--- a/lib/CL/devices/basic/basic.c
+++ b/lib/CL/devices/basic/basic.c
@@ -212,7 +212,6 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
void *b = NULL;
cl_mem_flags flags = mem_obj->flags;
unsigned i;
- POCL_MSG_PRINT_MEMORY (" mem %p, dev %d\n", mem_obj, device->dev_id);
/* check if some driver has already allocated memory for this mem_obj
in our global address space, and use that*/
for (i = 0; i < mem_obj->context->num_devices; ++i)
@@ -231,6 +230,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
return CL_SUCCESS;
}
}
+ POCL_MSG_PRINT_MEMORY (" mem_alloc %p, dev %d, size %d\n", mem_obj, device->dev_id, mem_obj->size);
/* memory for this global memory is not yet allocated -> do it */
if (flags & CL_MEM_USE_HOST_PTR)
@@ -241,7 +241,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
}
else
{
- POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
+ // POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
b = pocl_aligned_malloc_global_mem (device, MAX_EXTENDED_ALIGNMENT,
mem_obj->size);
if (b==NULL)
@@ -273,6 +273,8 @@ pocl_basic_free (cl_device_id device, cl_mem memobj)
{
cl_mem_flags flags = memobj->flags;
+ POCL_MSG_PRINT_MEMORY (" mem_free %p, dev %d, size %d\n", memobj, device->dev_id, memobj->size);
+
/* allocation owner executes freeing */
if (flags & CL_MEM_USE_HOST_PTR ||
memobj->shared_mem_allocation_owner != device)

Using this log reader on the resulting log gives this plot. So it seems that pocl is at least trying to return memory to the operating system...? |
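(A rough sketch of what a reader for the mem_alloc/mem_free log lines added by this patch could look like; the actual script used above may differ:)

```python
# Hypothetical reader for a pocl log produced with the patch above: tracks
# the total number of bytes pocl currently has outstanding, event by event.
import re

import matplotlib.pyplot as plt

alloc_re = re.compile(r"mem_alloc .*size (\d+)")
free_re = re.compile(r"mem_free .*size (\d+)")


def outstanding_bytes(logfile):
    total = 0
    series = []
    with open(logfile) as f:
        for line in f:
            m = alloc_re.search(line)
            if m:
                total += int(m.group(1))
                series.append(total)
                continue
            m = free_re.search(line)
            if m:
                total -= int(m.group(1))
                series.append(total)
    return series


series = outstanding_bytes("pocl.log")
plt.plot([b / 2**20 for b in series])
plt.xlabel("alloc/free event")
plt.ylabel("outstanding pocl allocations [MiB]")
plt.show()
```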
A potentially interesting finding that corroborates the observation that the memory consumption is system-dependent: it seems that this has everything to do with the implementation of malloc in use. Using the excellent jemalloc gives:

With my libc's allocator, by contrast, you can pretty much see the fragmentation happening. This is plotting the returned addresses vs. time: observe how the allocator starts returning memory in kind of a wild order, and then the allocation addresses shoot higher and higher. This may be an interaction between the allocator and our allocation pattern: pocl allocating big chunks and small administrative things in about equal proportion, and the Python runtime allocating lots of tiny things. The libc allocator (glibc 2.31-9 from Debian in my case) may simply not be handling this load well.

To get a memory usage log as clean as Intel CL's, I suppose we could try to use a separate allocator for CL buffers. jemalloc is liberally licensed, open source, and, conveniently, already in conda-forge for all platforms that we care about.

Bonus data point: using the malloc implementation from Intel's TBB gives: |
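(When swapping in an allocator like this, e.g. via LD_PRELOAD, it's easy to end up silently still on libc malloc; a quick Linux-only sanity check, as a sketch:)

```python
# Sketch (Linux only): check which malloc implementations are mapped into
# the current process, to verify that a preloaded jemalloc/TBB malloc
# actually took effect.
def loaded_allocators():
    hits = set()
    with open("/proc/self/maps") as f:
        for line in f:
            for name in ("jemalloc", "tbbmalloc", "tcmalloc"):
                if name in line:
                    hits.add(name)
    return hits or {"system libc malloc only"}


print(loaded_allocators())
```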
On my Intel (Core i7) Mac (Catalina 10.15.4), after:
edit: I almost have to believe it wasn't using a different allocator here. This looks identical to the other tests I ran. (#246) |
I think you might be right. Quoting from https://github.com/jemalloc/jemalloc/blob/041145c272711b55f91aa42128b108674a12fd91/INSTALL.md:
AFAICT, the conda-forge feedstock does not override this, so it's not a surprise that this didn't do much on a Mac. We'll need a different solution, then. Could you try with |
Heh, wut? 🤔 |
Definitely different! edit: But at least something got returned? |
Yikes! That looks worse. |
I tried using tracemalloc with the zero-RHS case using DOFArray (code here). The DOFArray version definitely does more small memory allocations than the equivalent numpy code. The numpy version does not exhibit the memory growth issue. A large number of allocations are actually coming from a deprecation warning in numpy that gets triggered by the pyopencl array constructor. Unfortunately, fixing this doesn't seem to help with the memory deallocation issue. Would it be possible/worthwhile to remove some of these other allocations? I'm not sure how much they would come into play in a lazy evaluation context. |
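(For reference, the basic tracemalloc recipe looks roughly like this; run_zero_rhs_steps is a placeholder stub standing in for the loop being profiled, and the linked code likely differs in detail:)

```python
# Minimal tracemalloc sketch: show which source lines own the most
# currently-allocated memory at the time of the snapshot.
import tracemalloc


def run_zero_rhs_steps():
    # Placeholder: stand-in for the zero-RHS timestepping loop being profiled.
    return [bytearray(1024) for _ in range(1000)]


tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

data = run_zero_rhs_steps()

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```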
Thanks! I've actually never used tracemalloc, so I found this very informative.
Kind of confused what these are. We're allocating 25806 numpy arrays? For what? Are those the object arrays? (but they're mighty big for that) Or is this the warning message? (Is the
What are these? Any idea? Some more thoughts:
|
What's being allocated is something related to the warning (not sure if it's the message, or the deprecation warning object, or what). It looks like the size is the sum of all the allocations.
Also seems to be related to the warnings. All of these go away if I replace the two instances of
I don't think it does. Sounds like it just instruments the memory allocation code on the Python side.
From what I understand, tracemalloc only displays what's currently allocated at the time of the snapshot (and also only those objects that were allocated after tracing was started). |
Hmm, I get the same 4 allocations with numpy after disabling GC. And the reported sizes for the 3 large allocations seem odd, too -- they're listed as averaging 416 KiB, which is 1/3 of a DOF array. Here are some results for a single step (code here): |
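(One way to isolate per-step growth from whatever is merely alive at snapshot time is to diff two snapshots with the cyclic GC disabled, roughly as below; take_one_step is a placeholder stub:)

```python
# Sketch: compare tracemalloc snapshots taken around a single step, with the
# cyclic garbage collector disabled so collection timing doesn't blur things.
import gc
import tracemalloc

leak = []


def take_one_step():
    # Placeholder for one RK step / RHS evaluation; appends to simulate growth.
    leak.append(bytearray(1 << 20))


gc.disable()
tracemalloc.start()

take_one_step()
before = tracemalloc.take_snapshot()
take_one_step()
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)

gc.enable()
```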
Here are some recent updates, and here is the same experiment using new-world grudge (inducer/grudge#74). |
This diffusion test that uses over-integration seems to require a great deal of memory in CI, causing a subsequent crash of pytest without error messages in #95. Here's a plot of the memory usage of pytest with and without the diffusion accuracy test. The blue data has the diffusion accuracy test disabled. When the diffusion accuracy test is disabled, CI in #95 runs to completion with all tests passing. When the diffusion test is enabled, pytest crashes inexplicably in the next suite/file.