Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First try at automated data CLI #801

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

pfebrer
Copy link
Contributor

@pfebrer pfebrer commented Jun 26, 2024

As discussed with Nick, this is a minimal first attempt for the automatic generation of a CLI to postprocess data.

The aim is to organize things in a way that new methods to generate data can be simply registered to data classes and automatically added to the CLI, without needing to maintain CLI options all across sisl.

The automatic generation of a CLI requires of course some boilerplate code, but the amount of boilerplate code is almost constant when we add new functionality.

The approach taken:

  • Created data classes in sisl.data that should help gathering all the possible ways in which the same data can be generated. For now I only implemented DOSData.
  • Then functions to create that data should be registered. For now I have only registered the function that creates DOSData from eigSileSiesta. This functions should be type annotated to be automatically supported by the CLI.
  • The CLI is created independently in sisl.cli.sdata. For now it can be ran with python -m sisl.cli.sdata. That file is hyper documented, and there I explain all the parts that play a role in generating the CLI.

This standarization should not only help to create sisl's CLI but also facilitate the creation of other interfaces to sisl down the line (including the GUI).

To use it, for now, you need to install nodify's dev version, click and rich_click:

pip install [email protected]:pfebrer/nodify.git click rich_click 

The CLI can work in two ways:

Specify the type of data that you want and how to get it.

Main:
Screenshot from 2024-06-26 23-21-16

DOS data:
Screenshot from 2024-06-26 23-20-46

DOS data from the EIG file:
Screenshot from 2024-06-26 23-20-56

Pass a file, then only specify which data you want to get

Main:
Screenshot from 2024-06-26 23-24-25

DOS data:
Screenshot from 2024-06-26 23-25-38

Copy link

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 31.78295% with 88 lines in your changes missing coverage. Please review.

Project coverage is 86.93%. Comparing base (49d4398) to head (f4bd621).

Files Patch % Lines
src/sisl/cli/sdata.py 0.00% 68 Missing ⚠️
src/sisl/io/siesta/eig.py 22.22% 14 Missing ⚠️
src/sisl/data/data.py 75.00% 4 Missing ⚠️
src/sisl/data/_singledispatch.py 93.75% 1 Missing ⚠️
src/sisl/data/dos.py 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #801      +/-   ##
==========================================
- Coverage   87.07%   86.93%   -0.14%     
==========================================
  Files         400      405       +5     
  Lines       51657    51785     +128     
==========================================
+ Hits        44978    45018      +40     
- Misses       6679     6767      +88     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Owner

@zerothi zerothi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to play a bit more with this after vacation. Thanks!

# List that keeps track of the data that the CLI has generated.
# When chaining commands, the first command will generate data and
# the subsequent commands will be methods to apply on that data.
GENERATED = []
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is sub-optimal, imagine one parses a grid, and a geometry (from two sources) and then writes them out, then one has to figure out the order.

I think click contexts are better here, with a dict or something, then sub-commands can also change data to other data-types.
I guess we just need some way of distinguishing data, and also enable the possibility of adding more of the same data-type. E.g. comparing two DOS is not that unreasonable, and here it would be, perhaps better, to have a data-container that contains origin of data. So one can query where it came from.

import sys
from typing import Callable, Optional

import rich_click as click
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably more in favor of importing click, and if available importing rich-click.

@zerothi
Copy link
Owner

zerothi commented Aug 7, 2024

Some general comments here.

I don't particularly fancy the special data-type specification. I can see why it may be useful, but I kind of find it a bit duplicate because it builds on the current state of things and adds a layer between simple numpy arrays. And it is never know exactly when you might need what....

Your dos_from_eig is also duplication, I had imagined something more generic for something that reads an eigenvalue spectrum. For instance when you also do a dos_from_eig(vasp_eig) then the arguments needs to be duplicated. I guess the only way to do this generically would be to have some kind of dos_from_eig(eig, weights, Erange, dE, distribution, smearing) and then do dos_from_vasp(sile, **<extend interface with dos_from_eig>).

How, in this framework, would you envision how things comes out of generic stuff.

For instance, I could imagine reading a Hamiltonian, and then post-processing it for a DOS, but also PDOS and then plotting it?

What I do like about this is that it does not become depending on the sile it self... Perhaps I should play a bit more with this.

I will have a look at how this could be generalized a bit, towards simpler data-structures. E.g. the DOSdata seems like it will quickly explode in terms of how things gets implemented, we need specific classes for a huge number of things... Perhaps it wuold be smarter to be data-type agnostic in some form, but rely on some simpler metadata.

@zerothi
Copy link
Owner

zerothi commented Aug 7, 2024

@pfebrer would you mind rebasing this one? It should not interfere!

@pfebrer
Copy link
Contributor Author

pfebrer commented Oct 4, 2024

I have some time now to be able to work on this. Would you be available to review and merge?

I think it is really important so that the plotting CLI can be build, which would be really useful.

Your dos_from_eig is also duplication, I had imagined something more generic for something that reads an eigenvalue spectrum.

Duplication of what? I just took your current argument parser. Once there are multiple entry points that do something similar one can always generalize, of course.

E.g. the DOSdata seems like it will quickly explode in terms of how things gets implemented, we need specific classes for a huge number of things... Perhaps it wuold be smarter to be data-type agnostic in some form, but rely on some simpler metadata.

How would you centralize all the ways of computing a density of states?

DOSData can also be private, and be simply a tool for centralizing all the functionality.

How, in this framework, would you envision how things comes out of generic stuff.

For instance, I could imagine reading a Hamiltonian, and then post-processing it for a DOS, but also PDOS and then plotting it?

I think initially the CLI doesn't need to do very complex stuff, just a CLI that allows doing the same analysis (e.g. DOS) from different entry points is useful enough I think.

An advanced user can always write their own code to do custom stuff if needed. From my perspective, most users on a daily basis need simpler things than a generic interface to compute many properties at once. Having a more specific CLI for a particular popular analysis I think it is good for widespread adoption. As an example, you can just take the confusion of @ialcon with the MonkhorstPack object.

We can always add later another command to the CLI to do more generic stuff...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants