-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Untangling app
, env
, and the domains
#13072
Comments
A somewhat convoluted datatype that IMO isn't easy to work with (too many implicit rules). It might be easier to just go with a regular class, even if it's not minimal it should be easier to extend and understand.
First step, IMO, start drawing UMLs of the status quo and begin thinking about gradual changes from there. (Since the application is large PlantUML might not cut it, a first step might be agreeing on an adequate modeling tool everyone can work with.) |
'dataclass' in this instance is just an abstract class that holds data, it doesn't need to be Do we need to be as formal as UML diagrams? Sketching the core interactions (and abstracting the details) is fairly straightforward.
This means that domains can access config values by doing A P.S. apologies for any mistakes, I've written this on mobile. |
Since these objects are public to extensions, we need to be careful with the refactoring. But that's solvable, likely one has to introduce (pending?) deprecations on undesired attributes so that extensions have the change to migrate away before we change the API. A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling. @electric-coder I agree that visualizations are often helpful. You are welcome to draw a diagram of the current relevant relationships between the above mentioned classes. We don't need full diagrams of the full application and for now a one-time draw is sufficient to illustrate the current state, therefore, you may use whichever tool you like. |
I'd say getting the class diagrams is always worth it and the extra investment pays off immediately. There are practical problems however: tools that reverse the code to produce an adequate UML are generally proprietary. An alternative is using Pyreverse but producing a good layout is an NP hard problem so if the code has many classes the diagram it's likely to yield might be nearly unreadable. At one point you may want to do some manual layout but not every tool (like PlantUML) supports that. Thus an additional problem is created of maintaining UMLs.
Designing an application to be a nice DAG generally has me keeping UMLs. Maybe because I'm getting older I've given up on trying to solve dependencies relying on memory alone.
So, returning to the original question, for me it's not a matter of formality but of cognitive necessity. |
yep I'm definitely +1 for diagramming the proposed before/after It doesn't have to be anything formal, just sketch it out in powerpoint or whatever |
yeh this is definitely somethat that should be done first; at present, unless you are creating a domain, there really is no obvious API for storing data |
drawio is quite convenient for manual drawing. |
An initial attempt: Current Proposed UML (current)@startuml
skin rose
hide class empty members
class Sphinx {
config: Config
env: BuildEnvironment
project: Project
__init__(*args, **kwds): None
}
class Project {
}
class Config {
config_values: dict[str, _Opt]
__init__(): None
}
class BuildEnvironment {
app: Sphinx
config: Config
docname: str
domaindata: dict[str, dict[str, Any]]
domains: Collection[Domain]
project: Project
temp_data: dict[str, Any]
__init__(app: Sphinx): None
setup(app: Sphinx): None
}
class Domain {
data: dict[str, Any]
env: BuildEnvironment
__init__(env: BuildEnvironment): None
}
Sphinx *-- Config: config
Sphinx *-- BuildEnvironment: env
Sphinx *-- Project: project
BuildEnvironment o-- Sphinx: app
BuildEnvironment *-- Domain: domains
Domain o-- BuildEnvironment: env
@enduml UML (proposed)@startuml
skin rose
hide class empty members
class Sphinx {
config: Config
env: BuildEnvironment
project: Project
__init__(*args, **kwds): None
}
class Project {
}
class Config {
config_values: dict[str, _Opt]
__init__(): None
}
class BuildEnvironment {
app: Sphinx
config: Config
__init__(app: Sphinx): None
setup(app: Sphinx): None
}
class DomainData {
domaindata: dict[str, dict[str, Any]]
}
class EnvData {
domains: Collection[Domain]
project: Project
}
class CurrentDocumentData {
docname: str
temp_data: dict[str, Any]
}
class Domain {
data: dict[str, Any]
env: BuildEnvironment
__init__(domain_data: DomainData, env_data: EnvData, current_doc: CurrentDocumentData): None
}
Sphinx *-- Config: config
Sphinx *-- BuildEnvironment: env
Sphinx *-- Project: project
BuildEnvironment o-- Sphinx: " _app"
BuildEnvironment *-- DomainData: domain_data
BuildEnvironment *-- EnvData: env_data
BuildEnvironment *-- CurrentDocumentData: current_doc
EnvData *-- Domain: domains
@enduml A |
Thanks for drawing the diagrams. Some remarks from a first look:
|
Excellent effort @AA-Turner 👍 it's evident now the problem won't be easy to solve. This signature is obviously problematic: Domain.__init__(domain_data: DomainData, env_data: EnvData, current_doc: CurrentDocumentData): None If each What seems to have happened is that "for convenience" I also suppose you want to split the circular reference between the But lets say the maintainers managed to split, normalize, and decouple the whole thing... What kind of an interface would then allow backward compatibility (this remits me back to your initial announcement...) to be superimposed on the new implementation devoid of circular-references (an uncommon exercise for sure!)? |
Here are my 2 cents (subject to modifications and maybe ideas) For
Since the domains are created in the Environment's constructor, they also have a reference to the incomplete application (at the moment of the call, the application is still in Note that we also need to address how serialization of components would work and if we can reconstruct them in the correct order. If we are trying to de-couple dependencies but in the end, we have something even more complex (and hard to follow), I'm not sure we need the change. However, what was perhaps not said is that all constructors are actually not really constructors because they assume that the inputs are incomplete. When domains are constructed, they are so in the Environment constructor, which itself is called from the application constructor when the latter is not yet finished (builders are still not setup entirely!). So, instead of relying of a 2-step construction with an additional setup() function that acts as a post-initialization, I think we should use "contexts" instead of incomplete objects and the constructors would pick whatever they need from those contexts. The creation of the context can be done separtely as well and on the fly. In other words, we may try a builder pattern instead of eager constructors (AFAICT, this is what you want: decouple the creation of each components and not have it in the application's constructor, right?)
I have a private repo with an interface for adding state to extension. Essentially, it's just a huge dictionary where keys are "tokens" for extensions and you can manage your own storage as you want. Additional data coming from extensions should be stored in the environment so to save them and pickle them. |
The right solution.
I was drawing an UML based on the proposal and since support for Python 3.10 is being dropped the correct solution to get rid of the dependencies might be going for |
I wasn't really sure how many arrows you're meant to include. I left them out because the
Remove the mutual dependencies between types' constructors.
By splitting these into three, we can be more specific in what we need/use at each call-site.
Each When doing parallel builds, data from each forked environment object gets merged back together, and it is the only(?) supported location for storing data during reading. Domains are nicer to store things in mainly as the One end goal of this work is to restructure our parallel support, but I wanted to split the tasks up as we'll end up having a hopelessly large scope.
This would be a good idea to experiment with. We would need to be careful re
You're right, we'd want to split whatever held
This seems better. I removed a lot of detail from the diagrams, and perhaps there was other domain-data attributes to go in that type? But yes each domain should just have a
This is interesting, though I'd prefer not to need to expose
Note that this will be needed for as long as we use pickling, as unpickling an object calls
Having a better interface such as this would be good. Currently we recommend in documentation to set an attribute directly, which doesn't integrate well with static typing. A |
What I'm quite unclear about is how extensions fit in here. My understanding is that they are implicitly spread across the whole architecture: They can add config values, they may store information in the BuildEnvironment, they may hook into app events, they add or modify domains. Are they well enough isolated to not pose a problem to planned refactorings? |
It's optional. You can draw an arrow, or declare an attribute/argument with a type, or both - they're equivalent although with different expressiveness (depends on what conciseness/explicitness you want). I did draw the URL with all the arrows and it was quite a mess :P So here's a hybrid sketch joining the PlantUML source code@startuml
skin rose
hide class empty members
class Sphinx {
config: Config
env: BuildEnvironment
project: Project
__init__(*args, **kwargs)
}
class Project {
}
class Config {
config_values: dict[str, _Opt]
__init__(*args, **kwargs)
}
class BuildEnvironment {
__init__(*args, **kwargs)
post_init(???): Self
setup(app: Sphinx): None
}
note top of BuildEnvironment: How to solve the\nsetup(app) dependency?
note left of BuildEnvironment: A chain of post_init() calls\nsharing ParamSpec?
class DomainData {
domaindata: dict[str, dict[str, Any]]
post_init(???): Self
}
class EnvData {
domains: Collection[Domain]
post_init(???): Self
}
class CurrentDocumentData {
docname: str
temp_data: dict[str, Any]
post_init(???): Self
}
class _BuilderPattern1 {
__init__(\n config: Config, project: Project, env: BuildEnvironment)
post_init(???): tuple[Config, Project, BuildEnvironment]
}
class Domain {
data: dict[str, Any]
__init__(*args, **kwargs)
post_init(???): Self
}
Sphinx *-- Config: config
Sphinx *-- BuildEnvironment: env
Sphinx *-- Project: project
BuildEnvironment *-- DomainData: domain_data
BuildEnvironment *-- EnvData: env_data
BuildEnvironment *-- CurrentDocumentData: current_doc
EnvData "1" *-- "1...N" Domain: domains
_BuilderPattern1 --> Project
_BuilderPattern1 --> BuildEnvironment
_BuilderPattern1 --> Config
Sphinx +-- _BuilderPattern1
@enduml
Dunno about the internals, but in the UML |
Do you mean things you cannot access as
I don't have the code in front of me so there are probably stuff I'm missing.
I can try to extract it and make it public if you want. And make it more robust. Actually, temporary data can just be regarded as a namespace that you carry over. It's essentially the same as a dictionary and autodoc could also benefit of similar stuff. I have contextual data that I need to carry over autodoc so what I essentially do is just creating some kind of "smart namespace". We had a similar discussion in #12198 concerning the typing of the |
Is your feature request related to a problem? Please describe.
The project has several central internal types,
Sphinx
,BuildEnvironment
,Domain
(& subclasses), andProject
.Of these, only
Project
is well-isolated from the rest of the classes. Creating aSphinx
object internally creates aBuildEnvironment
object, which in turn creates an instance of all 10 builtin domains. All of this happens withinSphinx.__init__()
and creates mutual dependencies betweenapp
,env
, and each domain. This also happens to a lesser extent with theBuilder
object.This design makes it hard to effectively seperate concerns. It makes designing a better parallel system harder, and means that testing/mocking parts of the application is much harder than it could be.
I would like to see if there is a realistic way we can uncouple these mutual dependencies, and provide a migration path for downstream extensions.
Describe the solution you'd like
My current idea is to break
BuildEnvironment
into three parts:env.tmp_data
dict.I believe this would allow us to pass around the data objects and effectivley construct
BuildEnvironment
accessor/wrappers on the fly where needed. For example, we can passDomain.__init__
thedomaindata
dict as part of the global dataclass, instead of the entire env object as at present.I am very open to other ideas though, as there may be approaches I haven't considered. Keeping backwards compatibility and providing a migration path for downstream users will be key, but I think should be doable with clever use of
__getattr__
etc.Describe alternatives you've considered
Do nothing. The easiest option, but leaves us in a less-than ideal situation.
cc: @sphinx-doc/developers
A
The text was updated successfully, but these errors were encountered: