Querying a parquet file is 4 times slower than clickhouse-local #115
Comments
A parquet file 1/10 the size has the same problem.
|
Thanks for the report @l1t1, we appreciate it! This definitely helps refine our scope. The next chdb versions should improve any gaps in performance, although clickhouse-local and chdb are built slightly differently (jemalloc, for one), so details about the execution context are extremely important. I cannot reproduce this performance gap on my system, for instance.
|
It is not only when querying a file; querying numbers_mt() shows the same issue. The max_threads values are the same.
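For context, a rough sketch of how the setting and a numbers_mt() timing can be compared from chdb's Python API (the specific aggregation and the "CSV" format argument are illustrative assumptions, not the exact query from this report):

```python
import time
import chdb

# Check the setting chdb actually uses, then time a numbers_mt() scan.
print(chdb.query("SELECT getSetting('max_threads')", "CSV"))

start = time.time()
chdb.query("SELECT avg(number) FROM numbers_mt(1, 1000000000)", "CSV")
print(f"chdb numbers_mt elapsed: {time.time() - start:.1f}s")
```

The same two statements can be run with `clickhouse local -q "..."` to confirm that max_threads matches on both sides of the comparison.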
|
I ran them on Termux + proot-distro Ubuntu 22.04.
|
Another phone didn't show the jemalloc message but has the same issue too.
The CPU info:
|
Maybe it only occurs on a slow CPU; on the faster phone, when increasing the data size, chdb runs as fast as clickhouse-local.
|
Thanks for the details @l1t1, this is interesting and most likely means the performance issue is related to jemalloc use in chdb, or even to build options for certain processors or instruction sets. Could we compare the two CPUs' features by any chance?
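Something like the sketch below could dump the model and feature flags on each phone for comparison (just an illustration, assuming /proc/cpuinfo is readable in the Termux/proot environment; ARM kernels report a `Features` line, x86 a `flags` line):

```python
# Print CPU model and feature flags so the two phones can be compared.
# Assumes a Linux /proc/cpuinfo is available.
with open("/proc/cpuinfo") as f:
    seen = set()
    for line in f:
        key = line.split(":", 1)[0].strip().lower()
        if key in ("model name", "hardware", "features", "flags") and key not in seen:
            seen.add(key)
            print(line.strip())
```
|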
The slow CPU mentioned is:
When increasing the data size (not the parquet file size), the gap is smaller.
|
(you don't have to strictly follow this form)
Describe the situation
SELECT avg(i) FROM file('/data/t.parquet') group by round(log10(i));
chdb takes 400 s, clickhouse-local takes 100 s.
How to reproduce
CREATE TABLE statements for all tables involved:
select number::int i FROM numbers_mt(1,1000000000) t into outfile '/data/t.parquet';
SELECT avg(i) FROM file('/data/t.parquet') group by round(log10(i));
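For reference, a minimal sketch of how the chdb side of this comparison could be timed from Python (the query and file path are from above; the "CSV" output format argument is an assumption):

```python
import time
import chdb

# Same aggregation as above, run through chdb's Python API and timed.
# /data/t.parquet is the file produced by the INTO OUTFILE statement.
sql = """
SELECT avg(i)
FROM file('/data/t.parquet')
GROUP BY round(log10(i))
"""

start = time.time()
result = chdb.query(sql, "CSV")  # output format assumed
print(result)
print(f"chdb elapsed: {time.time() - start:.1f}s")
```

The clickhouse-local side can be timed the same way from the shell, e.g. `time clickhouse local -q "SELECT avg(i) FROM file('/data/t.parquet') GROUP BY round(log10(i))"`, so both numbers come from the same machine and file.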
Expected performance
What are your performance expectations, and why do you think they are realistic? Has it been working faster in older ClickHouse releases? Is it working faster in some other specific system?
I hope chdb runs as fast as clickhouse local.
Additional context
Add any other context about the problem here.
BTW, for the generating statement
select number::int i FROM numbers_mt(1,1000000000) t into outfile '/data/t.parquet';
chdb runs as fast as clickhouse-local.