Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate usage of DataClasses, TypedDicts for e.g. requests in RequestQueue #64

Closed
jirimoravcik opened this issue Feb 13, 2023 · 10 comments
Assignees
Labels
debt Code quality improvement or decrease of technical debt. solutioning The issue is not being implemented but only analyzed and planned. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@jirimoravcik
Copy link
Member

We should check possible ways how to pass data to storages. The current dicts are not very user-friendly and provide no type-safety or hints. One interesting option is a dataclass, asdict https://docs.python.org/3/library/dataclasses.html#dataclasses.asdict can be used to convert data to a dict that can be sent using clients.

Another interesting possibility is a TypedDict, it can work nicely for return types that have fixed members.

@jirimoravcik jirimoravcik added enhancement New feature or request. low priority Low priority issues to be done eventually. solutioning The issue is not being implemented but only analyzed and planned. t-platform Issues with this label are in the ownership of the platform team. labels Feb 13, 2023
@jirimoravcik
Copy link
Member Author

Possibly check https://www.attrs.org/en/stable/index.html

@fnesveda
Copy link
Member

I was playing with this a bit now, because I'm not a fan of how the API client returns generic dictionaries without any types, and because accessing deeply nested values in responses is quite annoying, and I managed to make a custom dataclass which accepts extra argumens (so that if we ever add fields to the API it won't break the dataclass).

Not sure if this is the direction we would want to go with, but it could be a possibility for client v2 (perhaps along with API v3?).

from dataclasses import dataclass, fields
from typing import Dict, List, Optional, get_origin

def maybe_convert_to_generic_api_object(val):
    if isinstance(val, dict):
        return GenericApiObject(**val)
    if isinstance(val, list):
        return [maybe_convert_to_generic_api_object(item) for item in val]
    return val


@dataclass(init=False, repr=False)
class GenericApiObject:
    def __init_subclass__(cls) -> None:
        dataclass(cls, init=False, repr=False)

    def __init__(self, **kwargs: Dict):
        fields_dict = {field.name: field for field in fields(self)}

        for key, value in kwargs.items():
            if key in fields_dict:
                field = fields_dict[key]
                if isinstance(field.type, type) and issubclass(field.type, GenericApiObject):
                    value = field.type(**value)
                elif get_origin(field.type) is list:
                    list_type = field.type.__args__[0]
                    if isinstance(list_type, type) and issubclass(list_type, GenericApiObject):
                        value = [list_type(**item) for item in value]
                elif get_origin(field.type) is dict:
                    dict_type = field.type.__args__[1]
                    if isinstance(dict_type, type) and issubclass(dict_type, GenericApiObject):
                        value = { k: dict_type(**v) for k, v in value.items() }
            else:
                value = maybe_convert_to_generic_api_object(value)

            setattr(self, key, value)

    def __repr__(self):
        return f'{self.__class__.__name__}({", ".join(f"{k}={v!r}" for k, v in self.__dict__.items())})'

class InnerApiObject(GenericApiObject):
    inner_string: str

class OuterApiObject(GenericApiObject):
    number: int
    string: str
    generic_list: List
    generic_dict: Dict
    list_of_numbers: List[int]
    dict_of_numbers: Dict[str, int]
    inner_object: InnerApiObject
    list_of_inner_objects: List[InnerApiObject]
    dict_of_inner_objects: Dict[str, InnerApiObject]
    missing_arg_with_default: str = 'default'
    optional_arg: Optional[str] = None
    missing_arg: str

data = {
    'number': 1,
    'string': 'string',
    'list_of_numbers': [1, 2, 3],
    'dict_of_numbers': { 'a': 1, 'b': 2 },
    'generic_list': [1, 'a', { 'b': 2 }],
    'generic_dict': { 'a': 1, 'b': 'b', 'c': { 'd': 2 } },
    'inner_object': { 'inner_string': 'inner' },
    'list_of_inner_objects': [{ 'inner_string': 'inner' }, { 'inner_string': 'inner2' }],
    'dict_of_inner_objects': { 'a': { 'inner_string': 'inner' }, 'b': { 'inner_string': 'inner2' } },
    
    'extra_arg': 'extra',
    'extra_dict': { 'a': 1 },
    'extra_list': [1, 2, 3],
    'extra_list_of_dicts': [{ 'a': 1 }, { 'b': 2 }],
}

o = OuterApiObject(**data)

@fnesveda
Copy link
Member

fnesveda commented Jul 3, 2023

CC @vdusek @B4nan we were talking about the missing type hints on requests in RQ today

@fnesveda fnesveda added the debt Code quality improvement or decrease of technical debt. label Jul 3, 2023
@vdusek
Copy link
Contributor

vdusek commented Jul 28, 2023

FYI: there is also a Pydantic with its dataclasses - https://docs.pydantic.dev/latest/usage/dataclasses/. Maybe it could be included in the investigation as well.

@jirimoravcik
Copy link
Member Author

FYI: there is also a Pydantic with its dataclasses - https://docs.pydantic.dev/latest/usage/dataclasses/. Maybe it could be included in the investigation as well.

We used Pydantic in the past but removed it. See 8f3b9ac for details

@jirimoravcik
Copy link
Member Author

Also worth checking out https://pypi.org/project/beartype/ for type validation

@fnesveda fnesveda added the backend Issues related to the platform backend. label Oct 10, 2023
@fnesveda fnesveda added t-tooling Issues with this label are in the ownership of the tooling team. and removed t-platform Issues with this label are in the ownership of the platform team. low priority Low priority issues to be done eventually. backend Issues related to the platform backend. labels Jan 19, 2024
@fnesveda
Copy link
Member

@vdusek @B4nan I moved this on your team since you're responsible for Python tooling now 🙂

@janbuchar
Copy link
Contributor

We discussed this with @vdusek and @B4nan and decided to put pydantic back. msgspec is a close second, but we chose the more popular library. It is true that there will be a 20% increase in dependency size, but that doesn't seem to be a showstopper.

@jirimoravcik
Copy link
Member Author

We discussed this with @vdusek and @B4nan and decided to put pydantic back. msgspec is a close second, but we chose the more popular library. It is true that there will be a 20% increase in dependency size, but that doesn't seem to be a showstopper.

Is there some document with full analysis and benchmarks supporting this decision? Would love to read through it and see which alternatives were considered. Thanks 😉

@vdusek
Copy link
Contributor

vdusek commented Feb 29, 2024

Let me give you a TLDR of our yesterday's research.

Start by summarizing our requirements

  1. Define data types using built-in type hints (to use the same way of defining types for both data model & static analysis tools - type checker).
  2. Coercion/transformation - renaming attributes to match Python conventions (in our case typically from camelCase to snake_case).
  3. Support for runtime validation (mainly for handling user input, e.g. user creates request object, scrapy-apify requests conversion, ...):
    • a) Type checking
    • b) Custom conditions
  4. Generate models from JSON/YAML/OpenAPI/... (regarding Investigate sharing constants between JavaScript and Python #9).

The available options in Python

Evaluation

Unfortunately, the built-in solutions lack support for points 2 and 3. Attrs is not compatible with koxudaxi/datamodel-code-generator and subjectively has a less intuitive API. Consequently, our choice narrows down to Pydantic and Msgspec.

Msgspec seems a pretty cool project, although relatively new, 1.8k stars, and developed by a single individual. It does not support general runtime validation, only during the deserialization. However, we didn't consider it as a blocker, as we could integrate third-party tools like beartype if needed.

Pydantic on the other hand stands out as the de facto standard in the field, with 18k stars, it's used in FastAPI (69k stars). It is developed by many people (7 developers with over 100 commits). Pydantic V1 was pretty slow regarding creating objects and setting attributes because of its validaiton. In V2, which was released in mid-2023, they separated the validation process into a distinct package (pydantic-core), where the validation itself is implemented in Rust.

Taking these into account, Pydantic is a winner for us despite its size (8.5 MB). Msgspec as a viable alternative in the second place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
debt Code quality improvement or decrease of technical debt. solutioning The issue is not being implemented but only analyzed and planned. t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

No branches or pull requests

4 participants