-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Handle request list user input #326
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request is neither linked to an issue or epic nor labeled as adhoc!
Fix typing issues WIP
TODO: Finish test for it. WIP
Add test for checking all genrated request properties.
Add helper class for input keys Add top level Input class for handling actor inputs.
Add few more tests for regex
It had too many assumptions that users might not be interested in. Users should create such Input helper classes based on their specific inputs and their names.
Update TCH001, TCH002, TCH003 uses.
c237c66
to
6ff9e90
Compare
Sorry for the force-push. Due to my previous mainly Gerrit-based experience I was expecting this Rebase button in Github to do something more clean in the log... |
@@ -141,6 +141,9 @@ indent-style = "space" | |||
docstring-quotes = "double" | |||
inline-quotes = "single" | |||
|
|||
[tool.ruff.lint.flake8-type-checking] | |||
runtime-evaluated-base-classes = ["pydantic.BaseModel", "crawlee.configuration.Configuration"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though crawlee.configuration.Configuration inherits from pydantic_settings.BaseSettings and that one from pydantic.BaseModel, ruff has some issues in following inheritance hierarchy too far. So until that changes, some models will have to be explicitly mentioned even though they inherit from pydantic.BaseModel.
See closed issue where this is described as known limitation to certain degree:
astral-sh/ruff#8725
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good! Just a bunch of nits to iron out...
src/apify/storages/request_list.py
Outdated
url_input_adapter = TypeAdapter(list[Union[_RequestsFromUrlInput, _SimpleUrlInput]]) | ||
|
||
|
||
class RequestList(CrawleeRequestList): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a @docs_group
decorator here with an appropriate group. Also, can you mention that this class also adds the open
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet available in version used in sdk. Added it as comment for now.
src/apify/storages/request_list.py
Outdated
class _RequestDetails(BaseModel): | ||
method: HttpMethod = 'GET' | ||
payload: str = '' | ||
headers: dict[str, str] = Field(default_factory=dict) | ||
user_data: dict[str, str] = Field(default_factory=dict, alias='userData') | ||
|
||
|
||
class _RequestsFromUrlInput(_RequestDetails): | ||
requests_from_url: str = Field(alias='requestsFromUrl') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit for consistency - we usually use the headers: Annotated[dict[str, str], Field(default_factory=dict)] = {}
way of writing pydantic models - it lets you use the default value the way it was originally intended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks I didn't know it works like this in Pydantic. I was actually trigger by it so much, I had to test, that it is not mutable default input :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it. It shouldn't even be necessary to have both default_factory
and an actual default, but mypy
seems to disagree.
expected_user_data = UserData() | ||
if 'userData' in optional_input: | ||
for key, value in optional_input['userData'].items(): | ||
expected_user_data[key] = value | ||
assert request.user_data == expected_user_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the copy of optional_input['userData']
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not exactly copy. It is transformation to UserData and that object does not have any convenient dict based constructor. It would be nice to do just UserData(optional_input['userData'])
with mock.patch.object( | ||
http_client, 'send_request', return_value=_create_dummy_response(mocked_read_outputs) | ||
) as mocked_send_request: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this, you could either take advantage of respx
or just make a mock implementation of BaseHttpClient
(probably with unittest.mock
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to go with mocking, then please mock BaseHttpClient
instead of a concrete implementation. I have a feeling that using respx
could result in a more readable code - did you try it?
name is name of the returned RequestList | ||
request_list_sources_input can contain list dicts with either url or requestsFromUrl key | ||
http_client is client that will be used to send get request to url defined in requestsFromUrl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the same docblock style as we do in the rest of the code please.
http_client is client that will be used to send get request to url defined in requestsFromUrl | ||
|
||
Example request_list_sources_input: | ||
[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wrap examples in markdown code blocks - https://github.com/apify/crawlee-python/blob/master/src/crawlee/http_crawler/_http_crawler.py#L30-L52
if not http_client: | ||
http_client = HttpxHttpClient() | ||
|
||
ulr_inputs = url_input_adapter.validate_python(request_list_sources_input) # instance of list[Union[...]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tpyo, plus the comment was just meant for you, it's not that useful in the code.
with mock.patch.object( | ||
http_client, 'send_request', return_value=_create_dummy_response(mocked_read_outputs) | ||
) as mocked_send_request: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to go with mocking, then please mock BaseHttpClient
instead of a concrete implementation. I have a feeling that using respx
could result in a more readable code - did you try it?
Add helper function to handle request list inputs.
Closes: #310