Speed comparison with LLD of Clang program #1341

marxin · 2024-09-12T08:45:52Z

The README mentions the following benchmark: Clang 19 (1.56 GiB) 42.07s 33.13s 5.20s 1.35s, but I cannot reproduce it on my AMD machine. First, am I right that the binary size (1.56 GiB) is measured with debug info? If so, did you use -DCMAKE_BUILD_TYPE=RelWithDebInfo or something else? Did you use any --compress-debug-sections= option?

My numbers for AMD Ryzen 9 7900X 12-Core Processor are:

❯ bloaty ../../../../bin/clang-20
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  74.3%  3.31Gi   0.0%       0    .debug_info
   9.4%   427Mi   0.0%       0    .debug_loclists
   5.1%   231Mi   0.0%       0    .debug_str
   4.8%   220Mi   0.0%       0    .debug_line
   2.1%  95.8Mi  59.8%  95.8Mi    .text
   1.7%  76.4Mi   0.0%       0    .debug_rnglists
   0.9%  41.3Mi  25.8%  41.3Mi    .rodata
   0.5%  23.6Mi   0.0%       0    .debug_abbrev
   0.5%  23.5Mi   0.0%       0    .strtab
   0.2%  9.72Mi   6.1%  9.72Mi    .eh_frame
   0.1%  5.44Mi   0.0%       0    .symtab
   0.1%  4.49Mi   2.8%  4.49Mi    .dynstr
   0.1%  4.22Mi   2.6%  4.22Mi    .data.rel.ro
   0.1%  3.64Mi   0.0%       0    .debug_aranges
   0.0%  1.30Mi   0.8%  1.30Mi    .dynsym
   0.0%  1.24Mi   0.8%  1.24Mi    .eh_frame_hdr
   0.0%       0   0.4%   715Ki    .bss
   0.0%   505Ki   0.3%   505Ki    [24 Others]
   0.0%   444Ki   0.3%   444Ki    .hash
   0.0%   404Ki   0.2%   404Ki    .gnu.hash
   0.0%   365Ki   0.0%       0    .debug_line_str
 100.0%  4.45Gi 100.0%   160Mi    TOTAL
❯ hyperfine ... -fuse-ld=mold
  Time (mean ± σ):      2.802 s ±  0.110 s    [User: 0.009 s, System: 0.003 s]
  Range (min … max):    2.658 s …  2.999 s    10 runs
❯ hyperfine ... -fuse-ld=lld
  Time (mean ± σ):      4.160 s ±  0.225 s    [User: 40.475 s, System: 13.092 s]
  Range (min … max):    3.604 s …  4.428 s    10 runs
❯ ld.lld --version
LLD 19.1.0 (compatible with GNU linkers)
❯ mold --version
mold 2.33.0 (compatible with GNU ld)

Both LLD and mold come from openSUSE packages (built with LTO). In my measurement LLD is 1.48x slower than mold, while your numbers claim 3.85x. Could you please remeasure?
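For reference, the two ratios above can be recomputed from the quoted timings. This is a minimal sketch; it assumes the last two columns of the README row (5.20s and 1.35s) are LLD and mold respectively, which is inferred from the 3.85x figure rather than stated in the thread.

```python
# Wall-clock times in seconds, copied from the thread.
# Assumption: the README columns 5.20s / 1.35s are LLD / mold.
readme = {"lld": 5.20, "mold": 1.35}
measured = {"lld": 4.160, "mold": 2.802}  # hyperfine means on the Ryzen 9 7900X


def slowdown(times):
    """Return how many times slower LLD is than mold."""
    return times["lld"] / times["mold"]


readme_ratio = slowdown(readme)
measured_ratio = slowdown(measured)
print(f"README:   LLD is {readme_ratio:.2f}x slower than mold")    # 3.85x
print(f"measured: LLD is {measured_ratio:.2f}x slower than mold")  # 1.48x
```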

marxin · 2024-09-12T09:22:30Z

I forgot to mention that the object files use ZSTD compression for the debug info sections.

marxin · 2024-09-12T09:23:18Z

And my perf report looks as follows:

    25.61%  ld.mold   mold                  [.] mold::elf::MergeableSection<mold::elf::X86_64>::get_fragment(long)
    10.70%  ld.mold   mold                  [.] mold::elf::InputSection<mold::elf::X86_64>::apply_reloc_nonalloc(mold::elf::Context<mold::elf::X86_64>&, unsigned char*)
     9.97%  ld.mold   libc.so.6             [.] __memcmp_evex_movbe
     8.35%  ld.mold   mold                  [.] mold::elf::MergedSection<mold::elf::X86_64>::insert(mold::elf::Context<mold::elf::X86_64>&, std::basic_string_view<char, std::char_traits<char> >, unsigned long, long) [clone .isra.0]
     5.43%  ld.mold   mold                  [.] mold::elf::MergeableSection<mold::elf::X86_64>::split_contents(mold::elf::Context<mold::elf::X86_64>&)
     4.89%  ld.mold   mold                  [.] blake3_hash_many_avx512
     2.65%  ld.mold   mold                  [.] mold::elf::find_null(std::basic_string_view<char, std::char_traits<char> >, long, long) [clone .lto_priv.17] [clone .lto_priv.0]
     2.20%  ld.mold   mold                  [.] mold::elf::InputSection<mold::elf::X86_64>::get_fragment(mold::elf::Context<mold::elf::X86_64>&, mold::elf::ElfRel<mold::elf::X86_64> const&) [clone .isra.0]
     2.05%  ld.mold   libzstd.so.1.5.6      [.] 0x000000000006c393
     1.95%  ld.mold   mold                  [.] mold::elf::InputSection<mold::elf::X86_64>::record_undef_error(mold::elf::Context<mold::elf::X86_64>&, mold::elf::ElfRel<mold::elf::X86_64> const&)
     1.04%  ld.mold   mold                  [.] mold::elf::InputSection<mold::elf::X86_64>::get_tombstone(mold::elf::Symbol<mold::elf::X86_64>&, mold::elf::SectionFragment<mold::elf::X86_64>*)
     0.90%  ld.mold   mold                  [.] mold::elf::MergeableSection<mold::elf::X86_64>::get_contents(long)
     0.71%  ld.mold   libc.so.6             [.] __memchr_evex
     0.71%  ld.mold   libc.so.6             [.] __memmove_avx512_unaligned_erms
     0.67%  ld.mold   mold                  [.] mold::Integer<long, (std::endian)1234, 8>::operator long() const [clone .isra.0]
     0.55%  ld.mold   mold                  [.] mold::elf::Symbol<mold::elf::X86_64>::get_addr(mold::elf::Context<mold::elf::X86_64>&, long) const
     0.51%  ld.mold   mold                  [.] mold::elf::MergedSection<mold::elf::X86_64>::compute_section_size(mold::elf::Context<mold::elf::X86_64>&)::{lambda(long)#1}::operator()(long) const

rui314 · 2024-09-13T23:23:59Z

I ran a quick benchmark again and the numbers seem consistent.

ruiu@odyssey:~/llvm-project/b$ taskset -c 0-31 hyperfine 'mold @rsp' 'ld.lld @rsp'
Benchmark 1: mold @rsp
  Time (mean ± σ):      1.636 s ±  0.018 s    [User: 0.005 s, System: 0.003 s]
  Range (min … max):    1.609 s …  1.663 s    10 runs

Benchmark 2: ld.lld @rsp
  Time (mean ± σ):      5.985 s ±  0.025 s    [User: 27.125 s, System: 16.205 s]
  Range (min … max):    5.946 s …  6.018 s    10 runs

Summary
  mold @rsp ran
    3.66 ± 0.04 times faster than ld.lld @rsp
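The "3.66 ± 0.04" summary line can be reproduced from the two means and standard deviations reported above. The sketch below assumes standard first-order error propagation for a quotient of independent measurements; that this is how the summary was derived is an assumption, although the result matches.

```python
import math

# Means and standard deviations in seconds, from the hyperfine output above.
mold_mean, mold_sd = 1.636, 0.018
lld_mean, lld_sd = 5.985, 0.025

ratio = lld_mean / mold_mean
# For a quotient of independent quantities, relative errors add in quadrature.
ratio_sd = ratio * math.sqrt((lld_sd / lld_mean) ** 2 + (mold_sd / mold_mean) ** 2)

print(f"{ratio:.2f} ± {ratio_sd:.2f}")  # prints "3.66 ± 0.04"
```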

marxin · 2024-09-14T09:26:31Z

Interesting! Can you please provide the bloaty output for the linked binary, and a perf report for the mold linker? ;)

rui314 · 2024-09-15T08:58:42Z

$ ~/bloaty/build/bloaty bin/clang-19
    FILE SIZE        VM SIZE
 --------------  --------------
  70.2%  2.85Gi   0.0%       0    .debug_info
   6.0%   249Mi   0.0%       0    .debug_str
   5.8%   240Mi   0.0%       0    .strtab
   4.7%   196Mi   0.0%       0    .debug_line
   4.6%   190Mi  49.5%   190Mi    .text
   2.7%   111Mi  29.1%   111Mi    .rodata
   1.4%  58.8Mi   0.0%       0    .symtab
   1.3%  55.6Mi   0.0%       0    .debug_aranges
   0.9%  38.6Mi  10.0%  38.6Mi    .eh_frame
   0.8%  35.3Mi   0.0%       0    .debug_rnglists
   0.6%  23.8Mi   0.0%       0    .debug_abbrev
   0.3%  12.1Mi   3.2%  12.1Mi    .rela.dyn
   0.2%  9.44Mi   2.5%  9.44Mi    .eh_frame_hdr
   0.2%  8.31Mi   2.2%  8.31Mi    .data.rel.ro
   0.2%  7.75Mi   2.0%  7.75Mi    .dynstr
   0.0%       0   0.7%  2.87Mi    .bss
   0.0%  1.92Mi   0.5%  1.92Mi    .dynsym
   0.0%   495Ki   0.1%   495Ki    .gnu.hash
   0.0%   429Ki   0.1%   429Ki    .data
   0.0%   315Ki   0.0%       0    .debug_line_str
   0.0%   252Ki   0.1%   252Ki    [27 Others]
 100.0%  4.06Gi 100.0%   383Mi    TOTAL

And here is my perf report.

+   29.04%     0.19%  mold     [kernel.kallsyms]     [k] asm_exc_page_fault
+   28.50%     0.15%  mold     [kernel.kallsyms]     [k] exc_page_fault
+   28.03%     0.15%  mold     [kernel.kallsyms]     [k] do_user_addr_fault
+   27.46%     0.18%  mold     [kernel.kallsyms]     [k] handle_mm_fault
+   26.93%     0.29%  mold     [kernel.kallsyms]     [k] __handle_mm_fault
+   26.56%     0.07%  mold     [kernel.kallsyms]     [k] handle_pte_fault
+   21.83%     0.10%  mold     [kernel.kallsyms]     [k] do_fault
+   21.72%     0.88%  mold     libc.so.6             [.] __memmove_avx512_unaligned_erms
+   17.52%     0.02%  mold     [kernel.kallsyms]     [k] do_page_mkwrite
+   17.48%     0.43%  mold     [kernel.kallsyms]     [k] ext4_page_mkwrite
+   16.63%     0.05%  mold     [kernel.kallsyms]     [k] block_page_mkwrite
+   16.08%     7.61%  mold     mold                  [.] mold::MergeableSection<mold::X86_64>::get_fragment(long)
+   15.95%     0.04%  mold     [kernel.kallsyms]     [k] mark_buffer_dirty
+   15.91%    15.73%  mold     [kernel.kallsyms]     [k] native_queued_spin_lock_slowpath
+   15.81%     0.06%  mold     [kernel.kallsyms]     [k] __block_commit_write
+   15.70%     0.04%  mold     [kernel.kallsyms]     [k] __folio_mark_dirty
+   15.54%     0.00%  mold     [unknown]             [k] 0000000000000000
+   15.03%     0.03%  mold     [kernel.kallsyms]     [k] __raw_spin_lock_irqsave
+   15.03%     0.01%  mold     [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
+   13.13%     7.72%  mold     mold                  [.] mold::MergedSection<mold::X86_64>::insert(mold::Context<mold::X86_64>&, std::basic_string_view<char, std::char_traits<char> >, unsigned long, long)
+   11.07%     7.42%  mold     mold                  [.] mold::InputSection<mold::X86_64>::apply_reloc_nonalloc(mold::Context<mold::X86_64>&, unsigned char*)
+    9.87%     0.19%  mold     libc.so.6             [.] __sched_yield
+    9.53%     0.20%  mold     [kernel.kallsyms]     [k] entry_SYSCALL_64_after_hwframe
+    9.40%     0.32%  mold     [kernel.kallsyms]     [k] do_syscall_64
+    8.65%     0.16%  mold     [kernel.kallsyms]     [k] x64_sys_call
+    8.29%     1.23%  mold     libc.so.6             [.] __memcmp_evex_movbe
+    7.96%     0.10%  mold     [kernel.kallsyms]     [k] __x64_sys_sched_yield
+    7.54%     0.92%  mold     [kernel.kallsyms]     [k] do_sched_yield
+    6.16%     0.14%  mold     [kernel.kallsyms]     [k] schedule
+    5.85%     0.71%  mold     [kernel.kallsyms]     [k] __schedule
+    5.62%     4.66%  mold     mold                  [.] mold::MergeableSection<mold::X86_64>::resolve_contents(mold::Context<mold::X86_64>&)
+    4.54%     0.16%  mold     [kernel.kallsyms]     [k] pick_next_task
+    4.41%     4.35%  mold     mold                  [.] blake3_hash_many_avx512
+    4.41%     1.19%  mold     [kernel.kallsyms]     [k] pick_next_task_fair
+    4.36%     0.00%  mold     [unknown]             [k] 0x0000000000004000
+    4.13%     2.51%  mold     libc.so.6             [.] __memchr_evex
+    4.01%     2.08%  mold     mold                  [.] mold::MergeableSection<mold::X86_64>::split_contents(mold::Context<mold::X86_64>&)
+    3.38%     2.60%  mold     mold                  [.] mold::InputSection<mold::X86_64>::record_undef_error(mold::Context<mold::X86_64>&, mold::ElfRel<mold::X86_64> const&)
+    2.91%     1.37%  mold     [kernel.kallsyms]     [k] update_curr
+    2.48%     0.15%  mold     [kernel.kallsyms]     [k] do_anonymous_page
+    2.28%     0.00%  mold     [unknown]             [.] 0x49544100ee86e305
+    2.28%     0.00%  mold     mold                  [.] mold::ObjectFile<mold::X86_64>::~ObjectFile()
+    2.17%     0.01%  mold     [kernel.kallsyms]     [k] do_read_fault
+    2.12%     0.20%  mold     [kernel.kallsyms]     [k] filemap_map_pages
+    1.98%     0.58%  mold     [kernel.kallsyms]     [k] set_pte_range
+    1.97%     1.00%  mold     mold                  [.] mold::ObjectFile<mold::X86_64>::initialize_sections(mold::Context<mold::X86_64>&)
+    1.83%     1.07%  mold     mold                  [.] mold::Symbol<mold::X86_64>* mold::get_symbol<mold::X86_64>(mold::Context<mold::X86_64>&, std::basic_string_view<char, std::char_traits<char> >, std::basic_s
+    1.79%     0.00%  mold     [kernel.kallsyms]     [k] do_wp_page
+    1.78%     0.01%  mold     [kernel.kallsyms]     [k] wp_page_copy
+    1.64%     1.36%  mold     mold                  [.] mold::ObjectFile<mold::X86_64>::resolve_symbols(mold::Context<mold::X86_64>&)
+    1.52%     0.87%  mold     [kernel.kallsyms]     [k] srso_alias_safe_ret
+    1.47%     0.00%  mold     [kernel.kallsyms]     [k] flush_tlb_mm_range
+    1.45%     0.00%  mold     [kernel.kallsyms]     [k] on_each_cpu_cond_mask
+    1.45%     0.00%  mold     [kernel.kallsyms]     [k] native_flush_tlb_multi
+    1.44%     0.00%  mold     [kernel.kallsyms]     [k] ptep_clear_flush
+    1.41%     0.30%  mold     [kernel.kallsyms]     [k] _raw_spin_lock
+    1.41%     0.77%  mold     libc.so.6             [.] __strlen_evex
     1.33%     1.27%  mold     mold                  [.] XXH_INLINE_XXH3_64bits
+    1.23%     0.60%  mold     mold                  [.] tbb::detail::d2::concurrent_hash_map<std::basic_string_view<char, std::char_traits<char> >, mold::Symbol<mold::X86_64>, HashCmp, tbb::detail::d1::tbb_alloca
+    1.18%     0.19%  mold     mold                  [.] tbb::detail::d2::concurrent_hash_map<std::basic_string_view<char, std::char_traits<char> >, mold::ComdatGroup, HashCmp, tbb::detail::d1::tbb_allocator<std::
+    1.16%     0.02%  mold     [kernel.kallsyms]     [k] vma_alloc_folio
+    1.16%     0.02%  mold     [kernel.kallsyms]     [k] alloc_anon_folio
+    1.09%     0.86%  mold     mold                  [.] mold::ObjectFile<mold::X86_64>::reattach_section_pieces(mold::Context<mold::X86_64>&)
+    1.06%     0.02%  mold     [kernel.kallsyms]     [k] alloc_pages_mpol
+    1.01%     0.92%  mold     [kernel.kallsyms]     [k] __mod_node_page_state
+    1.01%     0.00%  mold     [unknown]             [.] 0xcccccccccccccccc
+    0.92%     0.09%  mold     [kernel.kallsyms]     [k] asm_sysvec_call_function
+    0.91%     0.82%  mold     mold                  [.] mold::ObjectFile<mold::X86_64>::compute_symtab_size(mold::Context<mold::X86_64>&)
+    0.91%     0.06%  mold     [kernel.kallsyms]     [k] __alloc_pages
+    0.90%     0.55%  mold     [kernel.kallsyms]     [k] smp_call_function_many_cond
+    0.89%     0.02%  mold     [kernel.kallsyms]     [k] __do_fault
     0.88%     0.76%  mold     [kernel.kallsyms]     [k] srso_alias_return_thunk
+    0.87%     0.80%  mold     mold                  [.] tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data
+    0.85%     0.08%  mold     [kernel.kallsyms]     [k] folio_add_file_rmap_ptes
     0.85%     0.42%  mold     [kernel.kallsyms]     [k] ext4_dirty_folio
+    0.83%     0.23%  mold     [kernel.kallsyms]     [k] __lruvec_stat_mod_folio
+    0.83%     0.04%  mold     [kernel.kallsyms]     [k] filemap_fault

Note that I built it with Debug, not RelWithDebInfo.

marxin · 2024-09-15T10:56:09Z

Thanks. If I filter out only the functions related to mold, the profile is very comparable to what I see. Interestingly, your CPU (which is faster) behaves quite differently: LLD is much slower than on my machine, and mold is faster.

Anyway, please update README.md, where Clang has the following binary size: Clang 19 (1.56 GiB). I think it should be 4.06 GiB, right?
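To see how the 4.06 GiB total breaks down, the `.debug_*` rows of the clang-19 bloaty output can be summed. This is a rough sketch: the `to_mib` helper is hypothetical, and the inputs are bloaty's already-rounded figures, so the result is approximate.

```python
# Convert bloaty-style sizes ("2.85Gi", "249Mi", "315Ki") to MiB.
# to_mib is an illustrative helper, not part of bloaty.
UNITS = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}


def to_mib(size):
    return float(size[:-2]) * UNITS[size[-2:]]


# .debug_* rows copied from the clang-19 bloaty output in this thread.
debug_sections = {
    ".debug_info": "2.85Gi",
    ".debug_str": "249Mi",
    ".debug_line": "196Mi",
    ".debug_aranges": "55.6Mi",
    ".debug_rnglists": "35.3Mi",
    ".debug_abbrev": "23.8Mi",
    ".debug_line_str": "315Ki",
}
total_mib = to_mib("4.06Gi")  # TOTAL row

debug_mib = sum(to_mib(s) for s in debug_sections.values())
debug_share = debug_mib / total_mib
print(f"debug info: {debug_mib / 1024:.2f} GiB ({debug_share:.1%} of the file)")
```

With these inputs the debug sections come to roughly 3.4 GiB, a little under 84% of the file, which agrees with the percentage column in the bloaty table.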

DataM0del · 2024-09-30T12:31:33Z

> bloaty ../../../../bin/clang-20

That's clang 20, not clang 19.

marxin · 2024-09-30T16:05:29Z

> That's clang 20, not clang 19.

Yeah, but these two are very similar in size.

DataM0del · 2024-09-30T16:33:16Z

> > That's clang 20, not clang 19.
>
> Yeah, but these two are very similar in size.

Yes, but the size doesn't really matter. It's like saying that linker performance is the same whether I statically link 5,000 libraries and get a 5 MB executable, or dynamically link those same libraries and get a 3 MB executable.
Statically linking something requires adding it to the binary, so the linker should be slower with static libraries.
Dynamic linking, however, shouldn't need to link a shared object / DLL at compile time; the linker should only need to link the program and the static libraries it depends on.
Also, things changed between clang 19 and clang 20: for example, clang 20 might statically link an additional library, replace a statically linked library with a dynamically linked one, or require more static libraries.
