Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Untangling app, env, and the domains #13072

Open
AA-Turner opened this issue Oct 25, 2024 · 16 comments
Open

Untangling app, env, and the domains #13072

AA-Turner opened this issue Oct 25, 2024 · 16 comments
Labels
type:proposal a feature suggestion

Comments

@AA-Turner
Copy link
Member

Is your feature request related to a problem? Please describe.

The project has several central internal types, Sphinx, BuildEnvironment, Domain (& subclasses), and Project.

Of these, only Project is well-isolated from the rest of the classes. Creating a Sphinx object internally creates a BuildEnvironment object, which in turn creates an instance of all 10 builtin domains. All of this happens within Sphinx.__init__() and creates mutual dependencies between app, env, and each domain. This also happens to a lesser extent with the Builder object.

This design makes it hard to effectively seperate concerns. It makes designing a better parallel system harder, and means that testing/mocking parts of the application is much harder than it could be.

I would like to see if there is a realistic way we can uncouple these mutual dependencies, and provide a migration path for downstream extensions.

Describe the solution you'd like

My current idea is to break BuildEnvironment into three parts:

  1. A dataclass with information about the current document (all that is currently the env.tmp_data dict.
  2. A dataclass with information about the build at large
  3. A type that provides utility functions and contains the current-document and global dataclasses.

I believe this would allow us to pass around the data objects and effectivley construct BuildEnvironment accessor/wrappers on the fly where needed. For example, we can pass Domain.__init__ the domaindata dict as part of the global dataclass, instead of the entire env object as at present.

I am very open to other ideas though, as there may be approaches I haven't considered. Keeping backwards compatibility and providing a migration path for downstream users will be key, but I think should be doable with clever use of __getattr__ etc.

Describe alternatives you've considered

Do nothing. The easiest option, but leaves us in a less-than ideal situation.

cc: @sphinx-doc/developers

A

@AA-Turner AA-Turner added the type:proposal a feature suggestion label Oct 25, 2024
@AA-Turner AA-Turner pinned this issue Oct 25, 2024
@electric-coder
Copy link

electric-coder commented Oct 25, 2024

dataclass

A somewhat convoluted datatype that IMO isn't easy to work with (too many implicit rules). It might be easier to just go with a regular class, even if it's not minimal it should be easier to extend and understand.

and creates mutual dependencies

First step, IMO, start drawing UMLs of the status quo and begin thinking about gradual changes from there. (Since the application is large PlantUML might not cut it, a first step might be agreeing on an adequate modeling tool everyone can work with.)

@AA-Turner
Copy link
Member Author

'dataclass' in this instance is just an abstract class that holds data, it doesn't need to be dataclasses.dataclass.

Do we need to be as formal as UML diagrams? Sketching the core interactions (and abstracting the details) is fairly straightforward.

  1. Sphinx.__init__ sets self.env to the result of BuildEnvironment.__init__, passing itself as self
  2. BuildEnvironment.__init__ sets self.app, creating a mutual dependency, and sets self.domains which contains each of the 10 domains, which each call Domain.__init__, passing the environment object as self.
  3. Domain.__init__ sets self.env, creating another mutual dependency.
  4. All of this happens whilst initialising the Sphinx object, effectively making the entire application one big reference cycle.

This means that domains can access config values by doing self.env.app.config.spam or similar, which is very useful—what I want to do is untangle the creation of these objects.

A

P.S. apologies for any mistakes, I've written this on mobile.

@timhoffm
Copy link
Contributor

Since these objects are public to extensions, we need to be careful with the refactoring. But that's solvable, likely one has to introduce (pending?) deprecations on undesired attributes so that extensions have the change to migrate away before we change the API.

A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.


@electric-coder I agree that visualizations are often helpful. You are welcome to draw a diagram of the current relevant relationships between the above mentioned classes. We don't need full diagrams of the full application and for now a one-time draw is sufficient to illustrate the current state, therefore, you may use whichever tool you like.

@electric-coder
Copy link

electric-coder commented Oct 26, 2024

Do we need to be as formal as UML diagrams?

I'd say getting the class diagrams is always worth it and the extra investment pays off immediately. There are practical problems however: tools that reverse the code to produce an adequate UML are generally proprietary. An alternative is using Pyreverse but producing a good layout is an NP hard problem so if the code has many classes the diagram it's likely to yield might be nearly unreadable. At one point you may want to do some manual layout but not every tool (like PlantUML) supports that. Thus an additional problem is created of maintaining UMLs.

effectively making the entire application one big reference cycle.

Designing an application to be a nice DAG generally has me keeping UMLs. Maybe because I'm getting older I've given up on trying to solve dependencies relying on memory alone.

Do we need to be as formal as UML diagrams?

So, returning to the original question, for me it's not a matter of formality but of cognitive necessity.

@chrisjsewell
Copy link
Member

Do we need to be as formal as UML diagrams?

yep I'm definitely +1 for diagramming the proposed before/after

It doesn't have to be anything formal, just sketch it out in powerpoint or whatever
(I have literally done that before 😉 aiidateam/aiida-core#5330)

@chrisjsewell
Copy link
Member

A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions.

yeh this is definitely somethat that should be done first; at present, unless you are creating a domain, there really is no obvious API for storing data

@timhoffm
Copy link
Contributor

It doesn't have to be anything formal, just sketch it out in powerpoint or whatever

drawio is quite convenient for manual drawing.

@AA-Turner
Copy link
Member Author

An initial attempt:

Current

sphinx-uml-current

Proposed

sphinx-uml-proposed

UML (current)
@startuml

skin rose
hide class empty members

class Sphinx {
  config: Config
  env: BuildEnvironment
  project: Project
  __init__(*args, **kwds): None
}

class Project {
}
class Config {
  config_values: dict[str, _Opt]
  __init__(): None
}
class BuildEnvironment {
  app: Sphinx
  config: Config
  docname: str
  domaindata: dict[str, dict[str, Any]]
  domains: Collection[Domain]
  project: Project
  temp_data: dict[str, Any]
  __init__(app: Sphinx): None
  setup(app: Sphinx): None
}

class Domain {
  data: dict[str, Any]
  env: BuildEnvironment
  __init__(env: BuildEnvironment): None
}

Sphinx *-- Config: config
Sphinx *-- BuildEnvironment: env
Sphinx *-- Project: project
BuildEnvironment o-- Sphinx: app

BuildEnvironment *-- Domain: domains
Domain o-- BuildEnvironment: env

@enduml
UML (proposed)
@startuml

skin rose
hide class empty members

class Sphinx {
  config: Config
  env: BuildEnvironment
  project: Project
  __init__(*args, **kwds): None
}

class Project {
}
class Config {
  config_values: dict[str, _Opt]
  __init__(): None
}
class BuildEnvironment {
  app: Sphinx
  config: Config
  __init__(app: Sphinx): None
  setup(app: Sphinx): None
}

class DomainData {
  domaindata: dict[str, dict[str, Any]]
}
class EnvData {
  domains: Collection[Domain]
  project: Project
}
class CurrentDocumentData {
  docname: str
  temp_data: dict[str, Any]
}

class Domain {
  data: dict[str, Any]
  env: BuildEnvironment
  __init__(domain_data: DomainData, env_data: EnvData, current_doc: CurrentDocumentData): None
}

Sphinx *-- Config: config
Sphinx *-- BuildEnvironment: env
Sphinx *-- Project: project
BuildEnvironment o-- Sphinx: " _app"

BuildEnvironment *-- DomainData: domain_data
BuildEnvironment *-- EnvData: env_data
BuildEnvironment *-- CurrentDocumentData: current_doc

EnvData *-- Domain: domains

@enduml

A

@timhoffm
Copy link
Contributor

Thanks for drawing the diagrams. Some remarks from a first look:

  • In current, BuildEnvironment contains references to config and project. There are no relational arrows drawn for these. Applies equally for the kept BuildEnvironment.config in proposed.
  • What do you want to achieve with proposed? Could you comment on the reasoning? It's difficult to figure this out from the diagram. For example, the DomainData class seems to pull out essentially one variable, which does not make sense to me. Also, more generally, it feels a bit weird to have BuildEnvironment.domaindata at all, in particular when there is also Domain.data, which AFAICS references the same dicts. BuildEnvironment.domaindata has a comment # domain-specific inventories, here to be pickled. If pickling is the only reason, that exists, IMHO it would be better to handle these things and not carry them around in the datastructure persistently. One could either do this via __setstate__ / __getstate__ (if one wants to rely on plain pickling of just the env; or, one could handle domain data explicitly on the outside, because we control the code that does the pickling/unpickling

@electric-coder
Copy link

electric-coder commented Oct 27, 2024

Excellent effort @AA-Turner 👍 it's evident now the problem won't be easy to solve.

This signature is obviously problematic:

Domain.__init__(domain_data: DomainData, env_data: EnvData, current_doc: CurrentDocumentData): None

If each Domain instance holds a reference to the instances of the 3 new classes as an attribute that would count as an empty diamond. Ideally since Domain is a leaf in the instance order it should only be called, it shouldn't need to be calling the containing/composing class EnvData or the other classes that compose BuildEnvironment. So while the 3-split of BuildEnvironment is a step in the right direction the proposed solution doesn't yet seem to solve the circular dependency that's the hardest problem.

What seems to have happened is that "for convenience" Domain received BuildEnvironment and then called it internally to simplify the signatures and take advantage of the OO, but even if no env attribute were held by Domain and it only read a BuildEnvironment instance that would still be an association without a diamond - and still, a mutual functional dependency.

I also suppose you want to split the circular reference between the Sphinx and BuildEnvironment classes at a latter stage. It makes sense to start with the leaves before moving up to the root of the dependencies.

But lets say the maintainers managed to split, normalize, and decouple the whole thing... What kind of an interface would then allow backward compatibility (this remits me back to your initial announcement...) to be superimposed on the new implementation devoid of circular-references (an uncommon exercise for sure!)?

@picnixz
Copy link
Member

picnixz commented Oct 28, 2024

Here are my 2 cents (subject to modifications and maybe ideas)


For DomainData, I fail to understand why it would be passed to the Domain constructor. I think you meant data: dict[str, Any] as an initial domain data? and that DomainData is actually a mapping from domain names to their actual domain's data? And that DomainData would expose:

  • DomainData.store(domain_name: str, key: str, value: Any): add some data for a given domain, storing it under key.
  • DomainData.fetch(domain_name: str, key: str, default: Any): fetch domain-specific data
  • Some other methods perhaps.

Since the domains are created in the Environment's constructor, they also have a reference to the incomplete application (at the moment of the call, the application is still in __init__() and builders are not set up yet!). Now, instead of accessing it via the environment, they can access it if we pass around the application to their setup() method instead. So extensions doing self.env.app in Domain.setup() would simply take app from it. And if they need to keep a reference to the application, they can store it when doing Domain.setup() (which should could as the "constructor finalization" when the environment is ready).

Note that we also need to address how serialization of components would work and if we can reconstruct them in the correct order. If we are trying to de-couple dependencies but in the end, we have something even more complex (and hard to follow), I'm not sure we need the change.

However, what was perhaps not said is that all constructors are actually not really constructors because they assume that the inputs are incomplete. When domains are constructed, they are so in the Environment constructor, which itself is called from the application constructor when the latter is not yet finished (builders are still not setup entirely!). So, instead of relying of a 2-step construction with an additional setup() function that acts as a post-initialization, I think we should use "contexts" instead of incomplete objects and the constructors would pick whatever they need from those contexts. The creation of the context can be done separtely as well and on the fly.

In other words, we may try a builder pattern instead of eager constructors (AFAICT, this is what you want: decouple the creation of each components and not have it in the application's constructor, right?)


A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.

I have a private repo with an interface for adding state to extension. Essentially, it's just a huge dictionary where keys are "tokens" for extensions and you can manage your own storage as you want.

Additional data coming from extensions should be stored in the environment so to save them and pickle them.

@electric-coder
Copy link

electric-coder commented Oct 28, 2024

In other words, we may try a builder pattern instead of eager constructors

The right solution.

Essentially, it's just a huge dictionary where keys are "tokens"

I was drawing an UML based on the proposal and since support for Python 3.10 is being dropped the correct solution to get rid of the dependencies might be going for ParamSpec in the constructors to avoid keeping circular references to instances yet formalize the interface whilst having it type checked.

@AA-Turner
Copy link
Member Author

In current, BuildEnvironment contains references to config and project. There are no relational arrows drawn for these. Applies equally for the kept BuildEnvironment.config in proposed.

I wasn't really sure how many arrows you're meant to include. I left them out because the project attribute on env is extracted from app, rather than passed separately.

What do you want to achieve with proposed? Could you comment on the reasoning? It's difficult to figure this out from the diagram. For example, the DomainData class seems to pull out essentially one variable, which does not make sense to me.

Remove the mutual dependencies between types' constructors.

app should set things up, manage extension interfaces (add_*()) and run the builder.

env currently plays a triple(?) role:

  1. Holding global constants (eg Project, Config, etc). These shouldn't change by the time that they get to env
  2. Holding mutable global data generated during reading documents (eg all reference targets, all objects, domain data, custom data from an extension)
  3. Holding per-document data (eg the document name)
  4. A number of utility methods

By splitting these into three, we can be more specific in what we need/use at each call-site.

Also, more generally, it feels a bit weird to have BuildEnvironment.domaindata at all, in particular when there is also Domain.data, which AFAICS references the same dicts. BuildEnvironment.domaindata has a comment # domain-specific inventories, here to be pickled.

Each Domain.data dict is a value of env.domaindata, keyed to the domain name. I don't know why this design was originally chosen (simplicity?), but I think it's part of the whole data-store problem.

When doing parallel builds, data from each forked environment object gets merged back together, and it is the only(?) supported location for storing data during reading. Domains are nicer to store things in mainly as the .data attribute abstracts the involvement of env.

One end goal of this work is to restructure our parallel support, but I wanted to split the tasks up as we'll end up having a hopelessly large scope.

If pickling is the only reason, that exists, IMHO it would be better to handle these things and not carry them around in the datastructure persistently. One could either do this via __setstate__ / __getstate__ (if one wants to rely on plain pickling of just the env; or, one could handle domain data explicitly on the outside, because we control the code that does the pickling/unpickling

This would be a good idea to experiment with. We would need to be careful re env-merge-info etc and ensure that if we started storing data on a domain with no reference to the environment that we merged it back in after parallel reading.

If each Domain instance holds a reference to the instances of the 3 new classes as an attribute that would count as an empty diamond. Ideally since Domain is a leaf in the instance order it should only be called, it shouldn't need to be calling the containing/composing class EnvData or the other classes that compose BuildEnvironment. So while the 3-split of BuildEnvironment is a step in the right direction the proposed solution doesn't yet seem to solve the circular dependency that's the hardest problem.

You're right, we'd want to split whatever held config and other environment constants from the container of domains itself.

For DomainData, I fail to understand why it would be passed to the Domain constructor. I think you meant data: dict[str, Any] as an initial domain data? and that DomainData is actually a mapping from domain names to their actual domain's data? And that DomainData would expose:

  • DomainData.store(domain_name: str, key: str, value: Any): add some data for a given domain, storing it under key.
  • DomainData.fetch(domain_name: str, key: str, default: Any): fetch domain-specific data
  • Some other methods perhaps.

This seems better. I removed a lot of detail from the diagrams, and perhaps there was other domain-data attributes to go in that type? But yes each domain should just have a dict[str, Any] initial data.

Since the domains are created in the Environment's constructor, they also have a reference to the incomplete application (at the moment of the call, the application is still in __init__() and builders are not set up yet!). Now, instead of accessing it via the environment, they can access it if we pass around the application to their setup() method instead. So extensions doing self.env.app in Domain.setup() would simply take app from it. And if they need to keep a reference to the application, they can store it when doing Domain.setup() (which should could as the "constructor finalization" when the environment is ready).

This is interesting, though I'd prefer not to need to expose app to the domains. What APIs are missing from env that we could add? (Perhaps for later discussion)

two-stage setup() function

Note that this will be needed for as long as we use pickling, as unpickling an object calls __setstate__ and not __init__

A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.

I have a private repo with an interface for adding state to extension. Essentially, it's just a huge dictionary where keys are "tokens" for extensions and you can manage your own storage as you want.

Additional data coming from extensions should be stored in the environment so to save them and pickle them.

Having a better interface such as this would be good. Currently we recommend in documentation to set an attribute directly, which doesn't integrate well with static typing.

A

@timhoffm
Copy link
Contributor

What I'm quite unclear about is how extensions fit in here. My understanding is that they are implicitly spread across the whole architecture: They can add config values, they may store information in the BuildEnvironment, they may hook into app events, they add or modify domains. Are they well enough isolated to not pose a problem to planned refactorings?

@electric-coder
Copy link

electric-coder commented Oct 29, 2024

I wasn't really sure how many arrows you're meant to include.

It's optional. You can draw an arrow, or declare an attribute/argument with a type, or both - they're equivalent although with different expressiveness (depends on what conciseness/explicitness you want). I did draw the URL with all the arrows and it was quite a mess :P


So here's a hybrid sketch joining the post_init() and "Builder pattern" that @picnixz mentioned (also to not let @AA-Turner be the lone user that contributes a UML).

Sphinx_builder_pattern

PlantUML source code
@startuml

skin rose
hide class empty members

class Sphinx {
  config: Config
  env: BuildEnvironment
  project: Project
  __init__(*args, **kwargs)
}

class Project {
}
class Config {
  config_values: dict[str, _Opt]
  __init__(*args, **kwargs)
}
class BuildEnvironment {
  __init__(*args, **kwargs)
  post_init(???): Self
  setup(app: Sphinx): None
}

note top of BuildEnvironment: How to solve the\nsetup(app) dependency?
note left of BuildEnvironment: A chain of post_init() calls\nsharing ParamSpec?

class DomainData {
  domaindata: dict[str, dict[str, Any]]
  post_init(???): Self
}
class EnvData {
  domains: Collection[Domain]
  post_init(???): Self
}
class CurrentDocumentData {
  docname: str
  temp_data: dict[str, Any]
  post_init(???): Self
}
class _BuilderPattern1 {
  __init__(\n    config: Config, project: Project, env: BuildEnvironment)
  post_init(???): tuple[Config, Project, BuildEnvironment]
}

class Domain {
  data: dict[str, Any]
  __init__(*args, **kwargs)
  post_init(???): Self
}

Sphinx *-- Config: config
Sphinx *-- BuildEnvironment: env
Sphinx *-- Project: project

BuildEnvironment *-- DomainData: domain_data
BuildEnvironment *-- EnvData: env_data
BuildEnvironment *-- CurrentDocumentData: current_doc

EnvData "1" *-- "1...N" Domain: domains

_BuilderPattern1 --> Project
_BuilderPattern1 --> BuildEnvironment
_BuilderPattern1 --> Config
Sphinx +-- _BuilderPattern1

@enduml

Are they well enough isolated to not pose a problem to planned refactorings?

Dunno about the internals, but in the UML _BuilderPattern1 could also be a kind of top-level class and I suppose extensions also behave like top-level classes that have visibility over the entire API that's exposed.

@picnixz
Copy link
Member

picnixz commented Oct 29, 2024

This is interesting, though I'd prefer not to need to expose app to the domains. What APIs are missing from env that we could add? (Perhaps for later discussion)

Do you mean things you cannot access as env.thing unless you do env.app.thing? I don't really remember but there shouldn't have much. IIRC, at the time the domain is constructed, the things being available are:

  • config & project
  • pre-loaded builders (they are still in an incomplete state at the time of the call I think)
  • extensions after they've called setup() (so they could also be in an incomplete state if they rely on config-inited)
  • application output streams (maybe that?)
  • catalogs (maybe this?)
  • event manager (accessible at the level of env.emit I think? or do we need env.app.emit?)

I don't have the code in front of me so there are probably stuff I'm missing.

Having a better interface such as this would be good

I can try to extract it and make it public if you want. And make it more robust. Actually, temporary data can just be regarded as a namespace that you carry over. It's essentially the same as a dictionary and autodoc could also benefit of similar stuff. I have contextual data that I need to carry over autodoc so what I essentially do is just creating some kind of "smart namespace".

We had a similar discussion in #12198 concerning the typing of the Config object and how to make it properly extensible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:proposal a feature suggestion
Projects
None yet
Development

No branches or pull requests

5 participants