Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1668981: Update is wrong when source is anti-join #2305

Open
tvdboom opened this issue Sep 17, 2024 · 6 comments
Open

SNOW-1668981: Update is wrong when source is anti-join #2305

tvdboom opened this issue Sep 17, 2024 · 6 comments
Assignees
Labels
bug Something isn't working status-triage_done Initial triage done, will be further handled by the driver team

Comments

@tvdboom
Copy link

tvdboom commented Sep 17, 2024

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.11.6 (tags/v3.11.6:8b6ee5b, Oct 2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]

  2. What operating system and processor architecture are you using?

    Windows-10-10.0.22631-SP0

  3. What are the component versions in the environment (pip freeze)?

    pandas==2.2.2
    snowflake-snowpark-python==1.22.1

  4. What did you do?

from snowflake.snowpark import Session
from snowflake.snowpark.functions import lit

mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(pd.DataFrame({"a": [1, 2]}))

df1 = mock_session.create_dataframe(pd.DataFrame({"A": [0, 1], "B": ['a', 'b']}))
df2 = mock_session.create_dataframe(pd.DataFrame({"A": [0, 1], "B": ['a', 'c']}))

anti = df1.join(df2, on=["A", "B"], how="anti")
anti.show()  # Has only 1 row

# This update should only update 1 row in df1
result = df1.update(
    assignments={"B": lit("f")},
    condition=(df1["A"] == anti["A"]) & (df1["B"] == anti["B"]),
    source=anti,
)

print(result)  # UpdateResult(rows_updated=2, multi_joined_rows_updated=0)
mock_session.table(df1.table_name).show()  # Two rows are updated
  1. What did you expect to see?

    The updated table should have only 1 updated row. Instead, both rows are updated.

@tvdboom tvdboom added bug Something isn't working needs triage Initial RCA is required labels Sep 17, 2024
@github-actions github-actions bot changed the title [Local testing] Update is wrong when source is anti-join SNOW-1668981: [Local testing] Update is wrong when source is anti-join Sep 17, 2024
@sfc-gh-jrose sfc-gh-jrose self-assigned this Sep 17, 2024
@sfc-gh-jrose sfc-gh-jrose removed the needs triage Initial RCA is required label Sep 17, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage_done Initial triage done, will be further handled by the driver team label Sep 24, 2024
@sfc-gh-jrose sfc-gh-jrose added the local testing Local Testing issues/PRs label Oct 1, 2024
@sfc-gh-jrose sfc-gh-jrose changed the title SNOW-1668981: [Local testing] Update is wrong when source is anti-join SNOW-1668981: Update is wrong when source is anti-join Oct 8, 2024
@sfc-gh-jrose sfc-gh-jrose removed the local testing Local Testing issues/PRs label Oct 8, 2024
@sfc-gh-jrose
Copy link
Contributor

I have investigated this issue and it appears to also occur outside of local testing mode. Before fixing this for local testing we will also need to fix query generation for live mode.

@tvdboom
Copy link
Author

tvdboom commented Oct 9, 2024

Is this something that will be picked up for next release? My production processes depend on a query like this

@sfc-gh-aling
Copy link
Contributor

hi @tvdboom, this is an issue in our join alias mapping, and the root cause is that df1["A] and anti["A"] they mistakenly point to the same column alias in the update call.

a quick workaround for this kind of join issue is call cache_result on anti to reset the internal alias mapping:

...
anti = df1.join(df2, on=["A", "B"], how="anti")
anti = anti.cache_result()  # this will reset the alias map 
...

completely solving the ambiguous alias mapping issue might take some time, I will dive deeper into that part

@tvdboom
Copy link
Author

tvdboom commented Oct 11, 2024

I understand. Thanks for the tip

@tvdboom
Copy link
Author

tvdboom commented Nov 11, 2024

@sfc-gh-aling Could you give me a headsup when this has been solved? thanks

@sfc-gh-aling
Copy link
Contributor

hi @tvdboom , I'm actively working on it, trying to find out the optimal solution for this problem. thanks for your patience in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working status-triage_done Initial triage done, will be further handled by the driver team
Projects
None yet
Development

No branches or pull requests

4 participants