SNOW-1561999: Inconsistent Results with random_split on Large DataFrames with Fixed Seed #1991
Labels: bug (Something isn't working), status-triage_done (Initial triage done, will be further handled by the driver team)
1. Python version
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:27:10) [MSC v.1938 64 bit (AMD64)]
2. Operating system
Windows-10-10.0.22631-SP0
3. What are the component versions in the environment?
Name Version Build Channel
asn1crypto 1.5.1 pyhd8ed1ab_0 conda-forge
brotli-python 1.1.0 py311h12c1d0e_1 conda-forge
bzip2 1.0.8 h2466b09_7 conda-forge
ca-certificates 2024.7.4 h56e8100_0 conda-forge
certifi 2024.7.4 pyhd8ed1ab_0 conda-forge
cffi 1.16.0 py311ha68e1ae_0 conda-forge
charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
cryptography 42.0.8 py311hfd75b31_0 conda-forge
filelock 3.15.4 pyhd8ed1ab_0 conda-forge
h2 4.1.0 pyhd8ed1ab_0 conda-forge
hpack 4.0.0 pyh9f0ad1d_0 conda-forge
hyperframe 6.0.1 pyhd8ed1ab_0 conda-forge
idna 3.7 pyhd8ed1ab_0 conda-forge
intel-openmp 2024.2.0 h57928b3_980 conda-forge
libblas 3.9.0 23_win64_mkl conda-forge
libcblas 3.9.0 23_win64_mkl conda-forge
libexpat 2.6.2 h63175ca_0 conda-forge
libffi 3.4.2 h8ffe710_5 conda-forge
libhwloc 2.11.1 default_h8125262_1000 conda-forge
libiconv 1.17 hcfcfb64_2 conda-forge
liblapack 3.9.0 23_win64_mkl conda-forge
libsqlite 3.46.0 h2466b09_0 conda-forge
libxml2 2.12.7 h0f24e4e_4 conda-forge
libzlib 1.3.1 h2466b09_1 conda-forge
mkl 2024.1.0 h66d3029_694 conda-forge
numpy 2.0.1 py311h35ffc71_0 conda-forge
openssl 3.3.1 h2466b09_2 conda-forge
packaging 24.1 pyhd8ed1ab_0 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
platformdirs 4.2.2 pyhd8ed1ab_0 conda-forge
pthreads-win32 2.9.1 hfa6e2cd_3 conda-forge
pycparser 2.22 pyhd8ed1ab_0 conda-forge
pyjwt 2.8.0 pyhd8ed1ab_1 conda-forge
pyopenssl 24.2.1 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 pyh0701188_6 conda-forge
python 3.11.9 h631f459_0_cpython conda-forge
python_abi 3.11 4_cp311 conda-forge
pytz 2024.1 pyhd8ed1ab_0 conda-forge
pyyaml 6.0.1 py311ha68e1ae_1 conda-forge
requests 2.32.3 pyhd8ed1ab_0 conda-forge
setuptools 71.0.4 pyhd8ed1ab_0 conda-forge
snowflake-connector-python 3.11.0 py311hcf9f919_0 conda-forge
snowflake-snowpark-python 1.20.0 py311h1ea47a8_0 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
tbb 2021.12.0 hc790b64_3 conda-forge
tk 8.6.13 h5226925_1 conda-forge
tomlkit 0.13.0 pyha770c72_0 conda-forge
typing-extensions 4.12.2 hd8ed1ab_0 conda-forge
typing_extensions 4.12.2 pyha770c72_0 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
ucrt 10.0.22621.0 h57928b3_0 conda-forge
urllib3 2.2.2 pyhd8ed1ab_1 conda-forge
vc 14.3 h8a93ad2_20 conda-forge
vc14_runtime 14.40.33810 ha82c5b3_20 conda-forge
vs2015_runtime 14.40.33810 h3bf8584_20 conda-forge
wheel 0.43.0 pyhd8ed1ab_1 conda-forge
win_inet_pton 1.1.0 pyhd8ed1ab_6 conda-forge
xz 5.2.6 h8d14728_0 conda-forge
yaml 0.2.5 h8ffe710_2 conda-forge
zstandard 0.23.0 py311h53056dc_0 conda-forge
zstd 1.5.6 h0ea2cb4_0 conda-forge
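# `utils.get_session` is the reporter's own helper; it presumably returns a Snowpark Session.
from utils import get_session

session = get_session.session()

# Single-column DataFrame "a" holding the integers 1 .. 10**7 - 1.
df_range = session.range(1, 10**7).to_df("a")

# Split the same DataFrame twice with identical weights and seed.
train_1, test_1 = df_range.random_split([0.5, 0.5], seed=42)
train_2, test_2 = df_range.random_split([0.5, 0.5], seed=42)

# Inner-join every partition of the first split with every partition of the second.
train_train = train_1.join(train_2, on="a", how="inner")
train_test = train_1.join(test_2, on="a", how="inner")
test_train = test_1.join(train_2, on="a", how="inner")
test_test = test_1.join(test_2, on="a", how="inner")

print(train_train.count(), train_test.count(), test_train.count(), test_test.count())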
output:
2819263 2181740 2181520 2817476
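As a cross-check of these numbers, the short sketch below adds up the four counts and measures how many rows changed partition between the two calls. It uses only the reported output; the variable names are illustrative, and it assumes that each call to random_split yields a clean partition of the input (every row in exactly one part), which the matching total appears to confirm.

```python
# Counts copied from the reported output.
train_train, train_test, test_train, test_test = 2819263, 2181740, 2181520, 2817476

# If each split is a clean partition, every row appears in exactly one
# of the four joins, so the counts should sum to the input row count.
total = train_train + train_test + test_train + test_test
expected_rows = len(range(1, 10**7))  # session.range(1, 10**7) excludes the end value

# Rows assigned to different partitions by the two calls.
mismatched = train_test + test_train
mismatch_fraction = mismatched / total

print(total == expected_rows)      # True
print(mismatched)                  # 4363260
print(f"{mismatch_fraction:.1%}")  # 43.6%
```

With identical splits, mismatched would be 0; here roughly 44% of the rows land in a different partition on the second call.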
I expected train_1 and train_2 to be identical, because both were created by splitting the same DataFrame with the same weights and a fixed seed. Consequently, I expected the inner joins train_test and test_train to be empty, since test_2 should share no rows with train_1, and test_1 none with train_2. Instead, each of these joins contains roughly 2.18 million rows, which shows that random_split does not produce consistent splits for a fixed seed on a DataFrame of this size (just under 10**7 rows).
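For comparison, here is a minimal pure-Python sketch of a split that is reproducible by construction: each row's partition is derived only from the seed and the row's key, so it cannot depend on evaluation order. This is a swapped-in technique for illustration, not how random_split is implemented in Snowpark; assign_partition and hash_split are hypothetical helper names, and a smaller range is used to keep the example fast.

```python
import hashlib


def assign_partition(key: int, weights: list[float], seed: int) -> int:
    """Return a partition index that depends only on (seed, key)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    # Map the first 8 bytes of the digest to a value between 0 and 1.
    u = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight / total
        if u < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point rounding


def hash_split(keys, weights, seed):
    """Split keys into len(weights) lists using assign_partition."""
    parts = [[] for _ in weights]
    for key in keys:
        parts[assign_partition(key, weights, seed)].append(key)
    return parts


keys = range(1, 10**5)
train_1, test_1 = hash_split(keys, [0.5, 0.5], seed=42)
train_2, test_2 = hash_split(keys, [0.5, 0.5], seed=42)

print(train_1 == train_2, test_1 == test_2)  # True True
print(len(set(train_1) & set(test_2)))       # 0
```

Because the assignment is a pure function of the seed and the key, repeating the split always gives the same partitions, which is the behaviour expected from a fixed seed in the report above.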