
Surface Interpolation Function [Fixes Issue45] #52

Closed · wants to merge 8 commits
Conversation

ojeda-e
Member

@ojeda-e ojeda-e commented Jul 22, 2021

This is a starting point to solve #45

Changes in this PR:

  • Added function to calculate interpolation of z_surface (surface_interpolation). Includes docstrings.
  • Added tests of surface_interpolation.

As suggested by @orbeckst in this comment, the function included here uses NumPy. Thanks for encouraging NumPy!

Surfaces included in tests are dummy_arrays with the following characteristics:

  • size 3x3 with all values 150 and one np.nan.
  • size 4x3 with all values 150 and two np.nan.
  • size 4x4 with all values 150 and two np.nan.
  • size 3x3 with lots of np.nan. 💪🏽

I'll probably be asked to add more tests, but this is OK as a start. @lilyminium will probably give me lots of ideas. 😅

Thanks.
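For readers following along, here is a minimal, self-contained sketch of the kind of NaN interpolation discussed in this PR. The function name and the one-row-at-a-time design are illustrative assumptions, not the PR's exact code:

```python
import numpy as np

def surface_interpolation_1d(row):
    """Fill np.nan entries of a 1D array by linear interpolation (hypothetical helper)."""
    mask = ~np.isnan(row)            # True where values are defined
    index = np.arange(row.shape[0])  # positions to evaluate
    # np.interp fills the undefined positions from the defined ones
    return np.interp(index, index[mask], row[mask])

print(surface_interpolation_1d(np.array([150., np.nan, 150.])))  # [150. 150. 150.]
print(surface_interpolation_1d(np.array([100., np.nan, 200.])))  # [100. 150. 200.]
```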

@pep8speaks

pep8speaks commented Jul 22, 2021

Hello @ojeda-e! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 7:88: W291 trailing whitespace

Comment last updated at 2021-08-07 22:39:35 UTC

@codecov

codecov bot commented Jul 22, 2021

Codecov Report

Merging #52 (7d7f4ca) into main (8a52295) will not change coverage.
The diff coverage is 100.00%.

membrane_curvature/surface.py (outdated)
    # interpolate values in array
    interpolated_array = np.interp(index_array,
                                   np.flatnonzero(~mask_nans),
                                   array_surface[~mask_nans])
Member

Since mask_nans is used twice in inverted form, and never in its original form, perhaps it is worth it to invert at the original isnan location rather than inverting the original value twice?

Member Author

I completely missed that; good point. You mean having this instead, correct?

    mask_nans = ~np.isnan(array_surface)  # invert original

    index_array = np.arange(array_surface.shape[0])

    interpolated_array = np.interp(index_array,
                                   np.flatnonzero(mask_nans),   # not inverted
                                   array_surface[mask_nans])    # not inverted

])
def test_surface_interpolation(dummy_surface, expected_interpolated_surface):
    surface = surface_interpolation(dummy_surface)
    assert_almost_equal(surface, expected_interpolated_surface)
Member

For arrays of floats, the docstring for assert_almost_equal suggests using assert_allclose these days for better consistency.
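As a quick illustration of the suggestion (values invented): assert_allclose checks that |actual - desired| <= atol + rtol * |desired| elementwise, with rtol defaulting to 1e-7 and atol to 0:

```python
import numpy as np
from numpy.testing import assert_allclose

computed = np.array([1.0000001, 2.0000002])
expected = np.array([1.0, 2.0])

# Passes: each elementwise difference is within rtol * |expected|
assert_allclose(computed, expected, rtol=1e-6)
```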

Member Author

Thanks. I remember @orbeckst mentioned that MDAnalysis typically uses assert_almost_equal. Maybe we can have his opinion on what would be best here?

Member

@orbeckst orbeckst Aug 3, 2021

MDAnalysis uses assert_almost_equal in preference to assert_array_almost_equal. assert_allclose might be more recent than when we started using numpy tests, so MDAnalysis might just be outdated and should switch eventually. Follow @tylerjereddy's advice :-).

membrane_curvature/surface.py
membrane_curvature/surface.py (outdated)
@ojeda-e ojeda-e linked an issue Jul 25, 2021 that may be closed by this pull request
@ojeda-e
Member Author

ojeda-e commented Aug 5, 2021

The pushed changes include:

  • replaced assert_almost_equal with assert_allclose. For two assertions I set the tolerance to rtol=6, otherwise the tests fail. Not sure if you are OK with this change. (Thanks @tylerjereddy for the suggestion and @orbeckst for confirming.)
  • added interpolation to base.py and tests for grids of different sizes and with a different number of undefined values.
  • In base.py, kept the data for z_surface and interpolated_z_surface as suggested by @lilyminium.

I know base.py could be written better, but maybe we can leave that improvement for later and work on the essentials for now, if that's OK.
Hopefully I'm not missing anything relevant here. Thanks!

@ojeda-e
Member Author

ojeda-e commented Aug 7, 2021

@lilyminium sorry, it's my fault because I made a bit of a mess here.
To clarify:
def test_mean_curvature_small passes with assert_allclose. I don't know why I ended up adding rtol; it should go. (I added the comment here)

The problem is in def test_gaussian_curvature_all. The test passes with rtol=4 or atol=1e-7. The docs say assert_allclose "... compares the difference between actual and desired to atol + rtol * abs(desired)."
So probably better to have assert_allclose(k, k_test, atol=1e-7)?

@lilyminium
Member

So probably better have assert_allclose(k, k_test, atol=1e-7) ?

I think so. Even if you know yourself that the 400% difference is small because the real value is small, it communicates to future users that you expect the degree of difference to be around 0.0000001.
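A small sketch of why atol matters near zero (numbers invented for illustration):

```python
import numpy as np
from numpy.testing import assert_allclose

k = np.array([1e-8])       # actual value, very close to zero
k_test = np.array([5e-8])  # desired value, also very close to zero

# Relatively, these differ by 400%, so a pure rtol check would need an
# enormous tolerance. An absolute tolerance states the expected scale:
assert_allclose(k, k_test, atol=1e-7)  # |1e-8 - 5e-8| = 4e-8 <= 1e-7
```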

Member

@lilyminium lilyminium left a comment

The tests here are a good start! However, I think you definitely need to vary your values (maybe by doing tests on your added data files) for robust checking. I am already reasonably certain that self.results.average_interpolated_z_surface etc. are not calculated the way you want them to be, which wouldn't have been caught with the current test cases (although there aren't actually tests for those yet -- they need some too!).


def _conclude(self):
    self.results.average_z_surface = np.nanmean(self.results.z_surface, axis=0)
    self.results.average_mean = np.nanmean(self.results.mean, axis=0)
    self.results.average_gaussian = np.nanmean(self.results.gaussian, axis=0)
    if self.interpolation:
        self.results.average_interpolated_z_surface = np.nanmean(
            self.results.interpolated_z_surface[self._frame_index], axis=0)
Member

Suggested change
-            self.results.interpolated_z_surface[self._frame_index], axis=0)
+            self.results.interpolated_z_surface, axis=0)

And with the others too -- if you specify frame_index, you'll wind up only getting the mean of the last frame.
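To see why, here is a toy example (shapes and values invented): indexing with the last frame before averaging both discards every earlier frame and collapses the wrong axis:

```python
import numpy as np

# Hypothetical per-frame surfaces, shape (n_frames, n_x, n_y)
surfaces = np.stack([np.full((2, 2), 10.), np.full((2, 2), 30.)])

frame_index = 1  # what a _frame_index attribute holds after the last frame

wrong = np.nanmean(surfaces[frame_index], axis=0)  # mean over rows of frame 1 only
right = np.nanmean(surfaces, axis=0)               # mean over the frame axis

print(wrong)  # last frame only, and the (n_x, n_y) shape is lost
print(right)  # true per-bin average over frames: all 20s, shape (2, 2)
```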

Member

Do you need np.nanmean? Do you expect np.nan values in your interpolated surfaces? If not, using np.mean will reveal errors in case there ever are np.nan values.
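A minimal illustration of the point (toy arrays):

```python
import numpy as np

clean = np.array([[1., 2.], [3., 4.]])
leaky = np.array([[1., np.nan], [3., 4.]])  # an unexpected NaN slipped in

print(np.mean(clean, axis=0))     # [2. 3.]
print(np.mean(leaky, axis=0))     # second entry is nan: the error surfaces loudly
print(np.nanmean(leaky, axis=0))  # nanmean silently skips the NaN, hiding the bug
```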


def interpolation_by_array(array_surface):
    """
    Interpolates values contained in `array_surface` over axis
Member

Over which axis?

Comment on lines +155 to +165

    mask_nans = ~np.isnan(array_surface)

    # index of array_surface
    index_array = np.arange(array_surface.shape[0])

    # interpolate values in array
    interpolated_array = np.interp(index_array,
                                   np.flatnonzero(mask_nans),
                                   array_surface[mask_nans])

    return interpolated_array
Member

I'm not quite understanding this. mask_nans is 2D, with shape (n_x, n_y). index_array is 1D, with shape (n_x,). np.flatnonzero(mask_nans) and array_surface[mask_nans] will be 1D, of length between 0 and n_x * n_y. So I think that means you are only ever operating on the first row of array_surface?

I guess my question is then, why is array_surface 2D? Why not just pass in one row at a time? In addition, as it is a 2D surface, why use a 1D interpolation function? Why not a 2D one?


"""

interpolated_surface = np.apply_along_axis(interpolation_by_array, axis=0, arr=array_surface)
Member

Ok, I see that you are applying interpolation_by_array by row, then. In that case I suggest amending the documentation of interpolation_by_array to specify that the input is 1-dimensional with length n_x, instead of (n_x, n_y). Although I think doing 2D interpolation would be less arbitrary (choosing which axis is x and which is y is basically random, right?)
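A sketch of how the axis choice changes the result with np.apply_along_axis; interp_1d here is a hypothetical stand-in for interpolation_by_array:

```python
import numpy as np

def interp_1d(a):
    # hypothetical 1D linear NaN fill, mirroring the approach in this PR
    mask = ~np.isnan(a)
    idx = np.arange(a.size)
    return np.interp(idx, idx[mask], a[mask])

surface = np.array([[  0.,   100.],
                    [np.nan, 200.],
                    [100.,   300.]])

by_cols = np.apply_along_axis(interp_1d, 0, surface)  # fills down each column
by_rows = np.apply_along_axis(interp_1d, 1, surface)  # fills along each row

print(by_cols[1, 0])  # 50.0  -> midpoint of 0 and 100 in the column
print(by_rows[1, 0])  # 200.0 -> the only defined neighbour in the row is 200
```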

Comment on lines +215 to +237
(np.array(([150., 150., 150.],
[150., np.nan, 150.],
[150., 150., 150.])),
np.full((3, 3), 150.)),
# array 3x4 with all 150 and two nans
(np.array([[150., 150, 150., 150.],
[150., np.nan, np.nan, 150.],
[150., 150., 150., 150.]]),
np.full((3, 4), 150.)),
# array 4x4 with all 150 and two nans
(np.array([[150., 150, 150., 150.],
[150., np.nan, np.nan, 150.],
[150., 130., 140., 150.],
[150., 150., 150., 150.]]),
np.array([[150., 150, 150., 150.],
[150., 140., 145., 150.],
[150., 130., 140., 150.],
[150., 150., 150., 150.]])),
# array 3x3 with lots of nans
(np.array([[np.nan, np.nan, 150.],
[150, np.nan, 150.],
[np.nan, 150., np.nan]]),
np.full((3, 3), 150.))
Member

Could you please add some tests with more interesting values? e.g. in the one below, the 145 number could have come from interpolating either along the x-axis or the y-axis. In order to be diagnostic of potential current and future bugs, I'd be interested in seeing one where using 1D interpolation gives different results if it's run across the x-axis, vs. the y-axis.

(np.array([[150., 150,  150.,  150.],
               [150., np.nan, np.nan,  150.],
               [150., 130., 140.,  150.],
               [150., 150., 150.,  150.]]),
     np.array([[150., 150, 150.,  150.],
               [150., 140., 145.,  150.],
               [150., 130., 140.,  150.],
               [150., 150., 150.,  150.]])),
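One way to build such a diagnostic case (values invented): surround a NaN with different neighbours along each axis, so column-wise and row-wise 1D interpolation disagree. interp_1d is a hypothetical 1D NaN-fill helper:

```python
import numpy as np

def interp_1d(a):
    # hypothetical 1D linear NaN fill
    mask = ~np.isnan(a)
    idx = np.arange(a.size)
    return np.interp(idx, idx[mask], a[mask])

# Column neighbours of the NaN are 100/300; row neighbours are 400/400.
surface = np.array([[  0., 100.,     0.],
                    [400., np.nan, 400.],
                    [  0., 300.,     0.]])

down_cols = np.apply_along_axis(interp_1d, 0, surface)[1, 1]   # (100+300)/2 = 200.0
along_rows = np.apply_along_axis(interp_1d, 1, surface)[1, 1]  # (400+400)/2 = 400.0
```

A test expecting 200.0 here passes only if interpolation runs down columns, which makes the axis assumption explicit.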

Comment on lines +606 to +629
@pytest.mark.parametrize('dim_x, dim_y, x_bins, y_bins, dummy_array, expected_interp_surf', [
    # array 3x3 with all 150 and one nan
    (300, 300, 3, 3, np.array([[0., 0., 150.], [100., 0., 150.], [200., 0., 150.],
                               [0., 100., 150.], [100., 100., np.nan], [200., 100., 150.],
                               [0., 200., 150.], [100., 200., 150.], [200., 200., 150.]]),
     np.full((1, 3, 3), 150.)),
    # array 3x3 with all 150 and two nans
    (300, 300, 3, 3, np.array([[0., 0., 150.], [100., 0., 150.], [200., 0., 150.],
                               [0., 100., 150.], [100., 100., np.nan], [200., 100., 150.],
                               [0., 200., np.nan], [100., 200., 150.], [200., 200., 150.]]),
     np.full((1, 3, 3), 150.)),
    # array 3x4 with all 150 and three nans
    (300, 400, 3, 4, np.array([[0., 0., 150.], [100., 0., 150.], [200., 0., 150.],
                               [0., 100., 150.], [100., 100., np.nan], [200., 100., 150.],
                               [0., 200., 150.], [100., 200., np.nan], [200., 200., np.nan],
                               [0., 300., 150.], [100., 300., 150.], [200., 300., 150.]]),
     np.full((1, 3, 4), 150.)),
    # array 4x4 with all 120 and many nans
    (400, 400, 4, 4, np.array([[0., 0., np.nan], [100., 0., 120.], [200., 0., 120.], [300., 0., np.nan],
                               [0., 100., 120.], [100., 100., np.nan], [200., 100., 120.], [300., 100., 120.],
                               [0., 200., 120.], [100., 200., np.nan], [200., 200., np.nan], [300., 200., 120.],
                               [0., 300., np.nan], [100., 300., 120.], [200., 300., 120.], [300., 300., np.nan]]),
     np.full((1, 4, 4), 120.))
])
Member

Here as well, tests where the expected values aren't all one number are needed to really check assumptions such as:

  • is it interpolating along the axis I think it is
  • is it interpolating in a linear way (vs. some random other polynomial)
  • is it even interpolating, or just taking the most common number?
  • are we sure it's interpolating independently for every frame, or is it interpolating across different frames?

etc.

I am, incidentally, somewhat surprised that coordinates are getting wrapped with np.nan values in them. @richardjgowers should this be happening...?

@ojeda-e ojeda-e closed this Dec 29, 2023

Successfully merging this pull request may close these issues.

Undefined values in bins occupied by insertion in membrane.
5 participants