Extracting StandAlone kernel #253

Open
TApplencourt opened this issue Mar 17, 2020 · 14 comments
Comments

@TApplencourt
Contributor

Hi,

I'm opening this issue to discuss the possibility of extracting key miniQMC kernels into standalone files.

Having standalone kernels will help collaboration between QMCPACK and other ECP projects/vendors.
Those kernels will be easy to install, benchmark, and port to different programming models. This will greatly facilitate early exploration and validation of new hardware, software, and programming models.

Regards,
Thomas

@markdewing
Collaborator

A little bit of work here
https://github.com/markdewing/qmc_kernels

The only kernels present are vector add (not really qmc-specific, but the simplest kernel) and 3D spline.

Possible additional kernels

  • distance calculation (various boundary conditions?)
  • inverse update (and delayed update version)
  • computation converting raw 3D splines to SPO's ?
  • 1D spline for Jastrow ?

@prckent
Contributor

prckent commented Mar 17, 2020

The plan is to make an official maintained QMCPACK repository with splines and updates at first. The idea is that they are clean, zero baggage, well documented and accessible for performance analysis, total refactoring, accessible by non-experts etc. We have much of the code, but which versions should @TApplencourt use to start from? I think reference cpu, cuda, gpu offload etc. would all be of interest. e.g. @PDoakORNL made fresh CUDA implementations in a fork of miniqmc...

@TApplencourt
Contributor Author

I can start with @markdewing's spline kernel if you (aka the QMCPACK community) want.

If I understand correctly, this code handles {double, single} / {real, complex} data types and many more types of splines.

My recommendation is to start with the bare minimum functionality (one type only, for example) and trim down the rest. It will make porting and analysis easier.

@prckent
Contributor

prckent commented Mar 18, 2020

Please take a careful look at the one in this repo (here, https://github.com/QMCPACK/miniqmc ). I am not sure which branch is best though; someone else will need to chime in. miniqmc knows how to set up various problem sizes corresponding to NiO, i.e. it is realistic.

@prckent
Contributor

prckent commented Mar 18, 2020

I would start with only single precision real. This is the "legacy CUDA" default in mainline and the one used in benchmarks.

@TApplencourt
Contributor Author

@markdewing does your implementation differ from the miniqmc one? I would prefer to start from yours, as it looks simpler. But if they are different, I can trim down the miniqmc one too.

In any case, I will use miniqmc to generate realistic problem sizes.

@markdewing
Collaborator

I started from the miniqmc version.

For correctness checking, the driver prints a couple of values from the reference implementation and a couple of values from the non-reference version and the user has to compare them manually. This needs to be done better.
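One way the manual comparison described above could be automated is a tolerance-based check over the full output arrays. This is only a minimal sketch: the function name `outputs_match` and the default tolerances are hypothetical and are not taken from the qmc_kernels driver.

```python
def outputs_match(reference, candidate, rel_tol=1e-5, abs_tol=1e-8):
    """Return True when every candidate value agrees with the reference.

    Hypothetical helper: a value passes if
    |candidate - reference| <= abs_tol + rel_tol * |reference|.
    The tolerances are placeholders and would need tuning for
    single-precision kernels.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return all(
        abs(got - ref) <= abs_tol + rel_tol * abs(ref)
        for ref, got in zip(reference, candidate)
    )


# Made-up numbers standing in for reference and non-reference kernel output.
reference = [1.0, -2.5, 3.25]
candidate = [1.000001, -2.500002, 3.249999]
print(outputs_match(reference, candidate))
```

A driver could then exit with a non-zero status when the check fails, so the comparison no longer depends on the user reading the printed values.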

The nx,ny,nz and nspline parameters for a few NiO problem sizes are:

a32-e384 is 112x66x66 with 144 splines
a64-e768 is 112x66x66 with 240 splines
a128-e1536 is 112x66x66 with 408 splines
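To get a feel for these sizes, the coefficient table can be estimated as nx * ny * nz * nspline values. The sketch below assumes single-precision real coefficients (4 bytes each) and ignores any padding the spline library adds for boundary conditions or alignment, so the real footprint will be somewhat larger. The helper name `coefficient_bytes` is hypothetical.

```python
def coefficient_bytes(nx, ny, nz, nspline, bytes_per_value=4):
    """Estimate the spline coefficient storage in bytes.

    Assumption: one value per grid point per spline, no padding.
    bytes_per_value=4 corresponds to single-precision real.
    """
    return nx * ny * nz * nspline * bytes_per_value


# Grid and spline counts quoted above for the NiO problem sizes.
problems = [("a32-e384", 144), ("a64-e768", 240), ("a128-e1536", 408)]
for name, nspline in problems:
    size = coefficient_bytes(112, 66, 66, nspline)
    print(f"{name}: {size / 2**20:.0f} MiB")
```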

@PDoakORNL
Collaborator

It would be quite easy to take these
https://github.com/PDoakORNL/miniqmc/tree/one_code/src/Numerics/Spline2/test
And make a standalone repo with "my" spline kernel. Should I do that?

@prckent
Contributor

prckent commented Mar 19, 2020

It looks like Peter's code has CPU, CUDA and Kokkos already. Peter - are/were these all working? It might well be better for Thomas to start with these since they look like a more comprehensive starting point.

@prckent
Contributor

prckent commented Mar 19, 2020

@markdewing Those spline counts look very strange to me, but perhaps I misunderstand? a32-e384 = 32 atoms and 384 electrons, so 192 electrons per spin = 192 splines. The others should be multiples of this number.

Thomas: The grid size corresponds to the primitive cell, i.e. we assume we are doing tiling for the larger cells and running bulks, as we do for the ECP and CORAL benchmarks.

@PDoakORNL
Collaborator

Yes, but I should probably merge to the main repo again. The onecode in my current branch is the current state. The Kokkos had been dropped at that point, so I don’t believe it works anymore. I started to look at extracting just the batched/blocked spline eval yesterday; I think it could be made fairly compact, especially if some of the variants are deleted/templated.

@markdewing
Collaborator

@prckent I took the numbers from QMCPACK. My understanding is that the splines are complex, and depending on the k-point, some of the values are converted to two orbitals, and some are not (in assign_v). Is this correct?
Maybe this is not necessary for a kernel: using real splines, with the number of splines equal to the number of SPOs, is sufficient.

@prckent
Contributor

prckent commented Mar 19, 2020

Yes, that explains the difference.

e.g. For the a32-e384 performance test we can see this on the line "NumDistinctOrbitals 144 numOrbs = 192"
https://cdash.qmcpack.org/CDash/testDetails.php?test=7697041&build=108519
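The arithmetic behind that line can be sketched as follows, under the assumption (from the explanation above) that each complex spline yields either one or two orbitals. With d splines yielding two orbitals and s yielding one, d + s is the number of splines and 2d + s is the number of orbitals. The function name `split_splines` is hypothetical.

```python
def split_splines(num_splines, num_orbitals):
    """Return (doubled, single) spline counts.

    Assumes every spline produces either two orbitals or one, so
    doubled + single == num_splines and
    2 * doubled + single == num_orbitals.
    """
    doubled = num_orbitals - num_splines
    single = num_splines - doubled
    if doubled < 0 or single < 0:
        raise ValueError("counts do not fit the one-or-two orbital model")
    return doubled, single


# a32-e384: NumDistinctOrbitals 144, numOrbs = 192
print(split_splines(144, 192))  # (48, 96)
```

So, under this model, 48 of the 144 splines would contribute two orbitals each and 96 would contribute one, giving 192 orbitals in total.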

@TApplencourt
Contributor Author

TApplencourt commented Jun 5, 2020

It took longer than expected[*], but with Kevin, we made some progress on extracting the inner vgh-float kernel. It's really preliminary, but you can find it here: https://github.com/TApplencourt/nanoQMC.

May I ask the people in this thread for a review? I'm not sure if we initialize the input correctly.

Do you know of a sanity check I can run on the output to verify we don't do anything stupid (the norm should be 1, or something like that...)? Right now, the Hessian and gradient values look suspiciously large...
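One generic check for the gradient question above, not something confirmed in this thread, is to compare the gradient the kernel returns against a central finite difference of the values it returns. The sketch below uses a smooth analytic test function as a stand-in for the spline evaluator; `value`, `gradient`, `fd_gradient`, and `gradient_is_consistent` are all hypothetical names, and the step size and tolerance are chosen for double precision (a single-precision kernel would need a larger step and a looser tolerance).

```python
import math


def value(p):
    """Stand-in for the kernel's value output at point p = (x, y, z)."""
    x, y, z = p
    return math.sin(x) * math.cos(y) * math.exp(0.5 * z)


def gradient(p):
    """Stand-in for the kernel's gradient output at point p."""
    x, y, z = p
    e = math.exp(0.5 * z)
    return [
        math.cos(x) * math.cos(y) * e,
        -math.sin(x) * math.sin(y) * e,
        0.5 * math.sin(x) * math.cos(y) * e,
    ]


def fd_gradient(f, p, h=1e-5):
    """Central finite-difference gradient of scalar function f at p."""
    grad = []
    for axis in range(len(p)):
        plus, minus = list(p), list(p)
        plus[axis] += h
        minus[axis] -= h
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad


def gradient_is_consistent(f, grad_f, p, tol=1e-6):
    """True when grad_f(p) matches the finite-difference gradient of f."""
    numeric = fd_gradient(f, p)
    analytic = grad_f(p)
    return all(
        abs(a - n) <= tol * max(1.0, abs(a))
        for a, n in zip(analytic, numeric)
    )


print(gradient_is_consistent(value, gradient, [0.3, 0.7, 0.1]))
```

The same idea extends to the Hessian by differencing the gradient, and a Hessian for a smooth function should also be symmetric.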

The next step is to create more robust testing, then put the outer_loop back, and then port it to multiple programming languages.


[*] I would like to be able to say that it is because I work from home and have to take care of my young child. But as far as I know, I don't have a toddler...
