Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify environment detection #499

Open
vyasr opened this issue Apr 15, 2021 · 6 comments
Open

Simplify environment detection #499

vyasr opened this issue Apr 15, 2021 · 6 comments

Comments

@vyasr
Copy link
Contributor

vyasr commented Apr 15, 2021

@bdice @mikemhenry @csadorf following up on #498, I think we should consider removing scheduler-based environment detection entirely. @bdice correct me if I've misunderstood the problem, but #498 came up because we don't support Andes, so this check succeeds for the DefaultLSFEnvironment and we fall back to scheduler detection. When this logic was originally written, subclassing Environment was nontrivial, but now a minimal environment just needs to define the hostname_pattern to be detected.

If we're concerned that regexes will scare users off of defining new environments (and I'm not sure that we should be), we could even define a small convenience DEFAULT_PATTERN=".*". This would make defining a new environment as simple as:

class MySlurmEnvironment(flow.environments.DefaultSlurmEnvironment):
    hostname_pattern = flow.DEFAULT_PATTERN

Such environments would fall back to the default Slurm template but otherwise be functional. We would also change ComputeEnvironment.is_present to always return False if hostname_pattern is None. This change would remove any erroneous environment detection, simplify our code, and remove the need for unnecessary scheduler interaction. Is there any reason this wouldn't work?

@bdice
Copy link
Member

bdice commented Apr 15, 2021

I know of several users who rely on the default scheduler detection for systems without templates. I think it is a very helpful and important feature for ease of use. There have only been a couple exceptions where this hasn’t worked, on clusters with multiple (incorrectly configured) schedulers, and those are easy to resolve by defining a custom environment as described above (roughly the same amount of work that would be required for ALL systems without templates if the default scheduler environments were removed). I’m strongly in favor of the status quo here.

@mikemhenry
Copy link
Collaborator

Really I think the only thing we could do better is change the logic a bit so that if more than one scheduler executable is detected, we print a warning that the scheduler type is ambiguous and we are guessing $scheduler_name

I also think autodetection is important and short of some user survey it is hard to guess how many people have had things JustWork ™️ since they don't open up github issues.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 15, 2021

Interesting, could you share what systems they are using it on? If people are actually making use of this feature I'm fine preserving it, but we're almost guaranteed to get misclassifications on systems without a defined environment unless we start parsing the exception payloads rather than using the simple solution of just returning True when exceptions are raised. I would strongly advocate for that in the long term.

Of the machines that we support, almost all of them would fall into the "incorrectly configured" bucket that you mentioned. Great Lakes, Comet, Bridges (not sure about Bridges2), and Stampede2 all have a qsub executable despite being Slurm machines. On some of them the executable doesn't do anything other than simply print a warning that the machine uses Slurm and indicate that qsub shouldn't be used, but I think that's actually very common on Slurm systems and means that accepting error codes from qsub. Therefore, without a hostname pattern defined all of them would incorrectly identify what scheduler is present, aside from possibly getting lucky based on the implementation detail that we search for environments in the order that they are registered (i.e. the order those modules are imported). IIRC flux (the old UM machine) also had a nonfunctional squeue, which would also trigger this problem.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 15, 2021

Really I think the only thing we could do better is change the logic a bit so that if more than one scheduler executable is detected, we print a warning that the scheduler type is ambiguous and we are guessing $scheduler_name

Agreed, but we already print the environment which for the base scheduler cases is hopefully obvious (e.g. DefaultSlurmEnvironment). We could make it more explicit that we're falling back to scheduler detection on an unsupported environment, though.

I also think autodetection is important and short of some user survey it is hard to guess how many people have had things JustWork ™️ since they don't open up github issues.

Yeah that's absolutely true. I can only speak to anecdotal experience, which aren't very helpful. I've either helped people from other places set up custom environments, or helped people set up and run a scheduler locally. The latter case definitely benefits from scheduler-only detection.

@csadorf
Copy link
Contributor

csadorf commented Apr 15, 2021

One other alternative to address this issue would be to make it even easier to configure a custom environment with the correct scheduler and template. I am imagining some kind of assistant like this:

This is an unconfigured environment, however flow auto-detected that this system might use a SLURM scheduler.
Proposed configuration:
hostname: ".*\.domain"
scheduler: slurm_scheduler
Run `flow configure-environment auto` to apply this configuration.

This would probably much easier to achieve if we allowed the template and environment configuration via config file. Likely preferable over the current approach anyways.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 15, 2021

I like that idea, I think it would be nice. I should be able to handle the flow schema work soon (just pending wrapping up some of the administrative hurdles discussed offline) and we could consider changes to support this as part of schema version 2.

Some informal polling (thanks @bdice!) indicated that one other critical use case is the use of a default scheduler environment with a custom template. That case alone is IMO sufficient reason to support scheduler-based detection. Bradley also made the point (offline) that environments don't really matter until the point that users interact with them (e.g. via submission or checking status), which I agree with. My main concern with #498 is that in my experience it basically means that most machines running a SLURM scheduler will also detect the presence of a PBS scheduler because qsub will exist in some form.

I think the main action item then is to to make sure that all signac-flow code paths that interact with the environment will give a consistent error if they detect that they're using a command for a scheduler that isn't actually present. This seems like something that should be implemented at the level of the Scheduler.jobs and Scheduler.submit, perhaps by making these internal methods (to be implemented by subclasses) and exposing public versions that wrap them in a suitable try-except. That would allow us to standardize the error handling while farming the actual fetch logic out to subclasses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants