Investigate usage of DataClasses, TypedDicts for e.g. requests in RequestQueue #64

jirimoravcik · 2023-02-13T12:19:50Z

We should explore better ways to pass data to storages. The current dicts are not very user-friendly and provide no type safety or hints. One interesting option is a dataclass: asdict (https://docs.python.org/3/library/dataclasses.html#dataclasses.asdict) can convert the data to a dict that can be sent using the clients.
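A minimal sketch of the dataclass option, assuming a hypothetical `RequestData` shape (the class and its field names are illustrative and not taken from the SDK):

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class RequestData:
    # Hypothetical request shape; the field names are illustrative only.
    url: str
    unique_key: Optional[str] = None
    method: str = 'GET'
    user_data: Dict[str, Any] = field(default_factory=dict)


request = RequestData(url='https://example.com', user_data={'label': 'DETAIL'})

# asdict() recursively converts the dataclass into a plain dict for the client.
payload = asdict(request)
print(payload)
```

The caller gets attribute access and editor hints, while the client still receives an ordinary dict.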

Another interesting possibility is a TypedDict, which can work nicely for return types that have fixed members.
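A short sketch of the TypedDict option for a return type with fixed members. `QueueOperationInfo` and `add_request` are hypothetical names used only for illustration:

```python
from typing import TypedDict


class QueueOperationInfo(TypedDict):
    # Hypothetical return shape; the keys are chosen only for illustration.
    request_id: str
    was_already_present: bool
    was_already_handled: bool


def add_request(url: str) -> QueueOperationInfo:
    # Stand-in for a storage call; at runtime the result is still a plain dict.
    return {
        'request_id': 'abc123',
        'was_already_present': False,
        'was_already_handled': False,
    }


info = add_request('https://example.com')
print(info['was_already_present'])
```

A type checker can flag misspelled keys here, but nothing is validated at runtime, since a TypedDict is an ordinary dict once the program runs.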

jirimoravcik · 2023-02-24T14:31:01Z

Possibly check https://www.attrs.org/en/stable/index.html

fnesveda · 2023-05-24T23:37:34Z

I was playing with this a bit now, because I'm not a fan of how the API client returns generic dictionaries without any types, and because accessing deeply nested values in responses is quite annoying. I managed to make a custom dataclass which accepts extra arguments (so that if we ever add fields to the API, it won't break the dataclass).

Not sure if this is the direction we would want to go in, but it could be a possibility for client v2 (perhaps along with API v3?).

from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional, get_args, get_origin


def maybe_convert_to_generic_api_object(val: Any) -> Any:
    # Recursively wrap dicts (and dicts nested in lists) that have no declared type.
    if isinstance(val, dict):
        return GenericApiObject(**val)
    if isinstance(val, list):
        return [maybe_convert_to_generic_api_object(item) for item in val]
    return val


def is_api_object_type(tp: Any) -> bool:
    return isinstance(tp, type) and issubclass(tp, GenericApiObject)


@dataclass(init=False, repr=False)
class GenericApiObject:
    def __init_subclass__(cls) -> None:
        # Turn every subclass into a dataclass without requiring the decorator.
        dataclass(cls, init=False, repr=False)

    def __init__(self, **kwargs: Any) -> None:
        fields_dict = {field.name: field for field in fields(self)}

        for key, value in kwargs.items():
            if key in fields_dict:
                # Declared field: convert nested values according to the annotation.
                field_type = fields_dict[key].type
                # get_args() returns () for bare List / Dict, so unparameterized
                # annotations are left untouched instead of failing on __args__.
                type_args = get_args(field_type)
                if is_api_object_type(field_type):
                    value = field_type(**value)
                elif get_origin(field_type) is list and type_args and is_api_object_type(type_args[0]):
                    value = [type_args[0](**item) for item in value]
                elif get_origin(field_type) is dict and len(type_args) == 2 and is_api_object_type(type_args[1]):
                    value = {k: type_args[1](**v) for k, v in value.items()}
            else:
                # Unknown field (e.g. added to the API later): keep it, wrapping dicts generically.
                value = maybe_convert_to_generic_api_object(value)

            setattr(self, key, value)

    def __repr__(self) -> str:
        return f'{self.__class__.__name__}({", ".join(f"{k}={v!r}" for k, v in self.__dict__.items())})'

class InnerApiObject(GenericApiObject):
    inner_string: str

class OuterApiObject(GenericApiObject):
    number: int
    string: str
    generic_list: List
    generic_dict: Dict
    list_of_numbers: List[int]
    dict_of_numbers: Dict[str, int]
    inner_object: InnerApiObject
    list_of_inner_objects: List[InnerApiObject]
    dict_of_inner_objects: Dict[str, InnerApiObject]
    missing_arg_with_default: str = 'default'
    optional_arg: Optional[str] = None
    # Not present in the data below; with init=False the field simply stays unset.
    missing_arg: str

data = {
    'number': 1,
    'string': 'string',
    'list_of_numbers': [1, 2, 3],
    'dict_of_numbers': { 'a': 1, 'b': 2 },
    'generic_list': [1, 'a', { 'b': 2 }],
    'generic_dict': { 'a': 1, 'b': 'b', 'c': { 'd': 2 } },
    'inner_object': { 'inner_string': 'inner' },
    'list_of_inner_objects': [{ 'inner_string': 'inner' }, { 'inner_string': 'inner2' }],
    'dict_of_inner_objects': { 'a': { 'inner_string': 'inner' }, 'b': { 'inner_string': 'inner2' } },
    
    # Keys not declared on OuterApiObject; they are still set as attributes.
    'extra_arg': 'extra',
    'extra_dict': { 'a': 1 },
    'extra_list': [1, 2, 3],
    'extra_list_of_dicts': [{ 'a': 1 }, { 'b': 2 }],
}

o = OuterApiObject(**data)

fnesveda · 2023-07-03T12:23:05Z

CC @vdusek @B4nan we were talking about the missing type hints on requests in RQ today

vdusek · 2023-07-28T09:48:59Z

FYI: there is also Pydantic with its dataclasses - https://docs.pydantic.dev/latest/usage/dataclasses/. Maybe it could be included in the investigation as well.

jirimoravcik · 2023-07-28T10:06:33Z

> FYI: there is also Pydantic with its dataclasses - https://docs.pydantic.dev/latest/usage/dataclasses/. Maybe it could be included in the investigation as well.

We used Pydantic in the past but removed it. See 8f3b9ac for details

jirimoravcik · 2023-07-28T10:07:52Z

Also worth checking out https://pypi.org/project/beartype/ for type validation

fnesveda · 2024-01-19T08:33:56Z

@vdusek @B4nan I moved this to your team since you're responsible for Python tooling now 🙂

janbuchar · 2024-02-28T15:48:05Z

We discussed this with @vdusek and @B4nan and decided to put pydantic back. msgspec is a close second, but we chose the more popular library. It is true that there will be a 20% increase in dependency size, but that doesn't seem to be a showstopper.

jirimoravcik · 2024-02-28T16:08:36Z

> We discussed this with @vdusek and @B4nan and decided to put pydantic back. msgspec is a close second, but we chose the more popular library. It is true that there will be a 20% increase in dependency size, but that doesn't seem to be a showstopper.

Is there some document with full analysis and benchmarks supporting this decision? Would love to read through it and see which alternatives were considered. Thanks 😉

vdusek · 2024-02-29T10:20:39Z

Let me give you a TL;DR of yesterday's research.

Start by summarizing our requirements

1. Define data types using built-in type hints (so that the data model and static analysis tools such as the type checker share the same type definitions).
2. Coercion/transformation: renaming attributes to match Python conventions (in our case typically from camelCase to snake_case).
3. Support for runtime validation (mainly for handling user input, e.g. a user creating a request object, scrapy-apify request conversion, ...):
   - a) Type checking
   - b) Custom conditions
4. Generate models from JSON/YAML/OpenAPI/... (regarding Investigate sharing constants between JavaScript and Python #9),
   - e.g. by using koxudaxi/datamodel-code-generator
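To make the renaming and runtime validation requirements concrete, here is a rough sketch of what they would take if written by hand with the standard library. `Request`, `camel_to_snake`, and `from_api` are hypothetical names, and the type check only handles simple annotations such as `str` and `int`:

```python
import re
from dataclasses import dataclass, fields
from typing import Any, Dict


def camel_to_snake(name: str) -> str:
    # Insert an underscore before every uppercase letter except the first character.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()


@dataclass
class Request:
    # Hypothetical model; the field names are illustrative only.
    url: str
    unique_key: str
    retry_count: int = 0

    def __post_init__(self) -> None:
        # a) Type checking, done by hand for the plain class annotations used here.
        for f in fields(self):
            if not isinstance(getattr(self, f.name), f.type):
                raise TypeError(f'{f.name} must be of type {f.type.__name__}')
        # b) Custom condition.
        if self.retry_count < 0:
            raise ValueError('retry_count must be non-negative')

    @classmethod
    def from_api(cls, data: Dict[str, Any]) -> 'Request':
        # Coercion: rename camelCase API keys to snake_case field names.
        return cls(**{camel_to_snake(key): value for key, value in data.items()})


request = Request.from_api({'url': 'https://example.com', 'uniqueKey': 'abc', 'retryCount': 2})
print(request)
```

This kind of boilerplate would have to be repeated and extended (nested models, generics, optional fields) for every model, which is the work a dedicated library is expected to take over.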

The available options in Python

Solutions from the standard library:
3rd-party solutions:
- Pydantic
  - Size: ~8.5 MB (pydantic 3 MB, pydantic-core 5.5 MB)
- Attrs
  - Size: ~0.5 MB
- Msgspec
  - Size: ~0.6 MB

Evaluation

Unfortunately, the built-in solutions lack support for points 2 and 3. Attrs is not compatible with koxudaxi/datamodel-code-generator and subjectively has a less intuitive API. Consequently, our choice narrows down to Pydantic and Msgspec.
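A small illustration of that gap, using a plain standard-library dataclass (the `Request` model here is hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical model; the field names are illustrative only.
    url: str
    retry_count: int = 0


# Annotations are not enforced at runtime: wrong types are accepted silently.
request = Request(url=123, retry_count='many')
print(request)

# camelCase keys from the API are not mapped to snake_case fields either.
rename_error = None
try:
    Request(**{'url': 'https://example.com', 'retryCount': 1})
except TypeError as exc:
    rename_error = exc
    print(f'TypeError: {exc}')
```

Both behaviours can be patched by hand, but then every model needs its own validation and renaming code.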

Msgspec seems like a pretty cool project, although it is relatively new, has 1.8k stars, and is developed by a single individual. It does not support general runtime validation; it validates only during deserialization. However, we didn't consider that a blocker, as we could integrate third-party tools like beartype if needed.

Pydantic, on the other hand, stands out as the de facto standard in the field: it has 18k stars and is used in FastAPI (69k stars). It is developed by many people (7 developers with over 100 commits). Pydantic V1 was pretty slow at creating objects and setting attributes because of its validation. In V2, which was released in mid-2023, the validation was separated into a distinct package (pydantic-core), where it is implemented in Rust.

Taking all this into account, Pydantic is the winner for us despite its size (8.5 MB), with Msgspec as a viable alternative in second place.
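For illustration, a minimal sketch of how the requirements could map onto Pydantic V2. The `Request` model, its fields, and its aliases are assumptions made for this example, not the models the SDK ended up with:

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class Request(BaseModel):
    # Hypothetical model; field names and aliases are illustrative only.
    # populate_by_name lets callers use either the snake_case name or the alias.
    model_config = ConfigDict(populate_by_name=True)

    url: str
    unique_key: Optional[str] = Field(default=None, alias='uniqueKey')
    retry_count: int = Field(default=0, alias='retryCount', ge=0)


# Points 1 and 2: plain type hints plus camelCase -> snake_case renaming.
request = Request.model_validate({'url': 'https://example.com', 'uniqueKey': 'abc', 'retryCount': 2})
print(request.unique_key)

# Serialize back to the camelCase shape expected by the API.
api_payload = request.model_dump(by_alias=True)
print(api_payload)

# Point 3: runtime validation, including a custom condition (ge=0).
validation_errors = []
try:
    Request(url='https://example.com', retry_count=-1)
except ValidationError as exc:
    validation_errors = exc.errors()
    print(f'{len(validation_errors)} validation error(s)')
```

Explicit aliases are used here for clarity; an alias generator could probably replace them if every field follows the same naming rule.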

jirimoravcik added the enhancement, low priority, solutioning, and t-platform labels Feb 13, 2023

fnesveda added the debt label Jul 3, 2023

fnesveda added the backend label Oct 10, 2023

jirimoravcik mentioned this issue Jan 3, 2024

Add Scrapy ApifyHttpProxyMiddleware for managing proxies #158

Merged

vdusek assigned vdusek and janbuchar Feb 29, 2024

vdusek added this to the 84th sprint - Tooling team milestone Feb 29, 2024

vdusek removed the enhancement label Feb 29, 2024

vdusek closed this as completed Mar 1, 2024

vdusek mentioned this issue Nov 11, 2024

Handle request list and proxy configuration inputs in a user-friendly way #310

Open
