You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Without going full int128 there are 2 huge optimisations opportunities on 64-bit processors.
All 64-bit platforms implement the division of a 128-bit number by a 64-bit one and also implement the extended precision multiplication of 2 64-bit number to a 128-bit one. The reason why is that 64x64 multiplication and 64/64 division requires extended precision and then discarding part of the result.
There is currently no way in pure C to force those instructions besides using int128 and even then the int128 code would probably less efficient by introducing checks/branches even though we know that only the low 64-bit part is used.
The benefits would be tremendous, replacing ~50 instructions by a single one in both multiplication and division case.
# Fix the reminder, we're at most 2 iterations off
if r < m:
dec q
r += d
if r >= d and r < m:
dec q
r += d
r -= m
(q, r)
let
d_hi = d shr halfSize
d_lo = d and halfMask
n_lohi = nlo shr halfSize
n_lolo = nlo and halfMask
# First half of the quotient
let (q1, r1) =halfQR(n_hi, n_lohi, d, d_hi, d_lo)
# Second half
let (q2, r2) =halfQR(r1, n_lolo, d, d_hi, d_lo)
q = (q1 shl halfSize) or q2
r = r2
x86_64 assembly implementation. The same divq opcodes exists as divl on all 32-bit platform for 64div32 and the equivalent exists on ARM, MIPS etc.
funcasm_x86_64_div2n1n(q, r: varuint64, n_hi, n_lo, d: uint64) {.inline.}=## Division uint128 by uint64# DIV r/m64# Divide RDX:RAX (n_hi:n_lo) by r/m64## Inputs# - numerator high word in RDX,# - numerator low word in RAX,# - divisor as r/m parameter (register or memory at the compiler discretion)# Result# - Quotient in RAX# - Remainder in RDXasm""" divq %[divisor] // We name the register/memory divisor : "=a" (`*q`), "=d" (`*r`) // Don't forget to dereference the var hidden pointer : "d" (`n_hi`), "a" (`n_lo`), [divisor] "rm" (`d`) : // no register clobbered besides explicitly used RAX and RDX"""
templateextPrecMulImpl(result: varUintImpl[uint64], op: untyped, u, v: uint64) =
const
p =64div2
base: uint64=1shl p
var
x0, x1, x2, x3: uint64
let
ul =lo(u)
uh =hi(u)
vl =lo(v)
vh =hi(v)
x0 = ul * vl
x1 = ul * vh
x2 = uh * vl
x3 = uh * vh
x1 +=hi(x0) # This can't carry
x1 += x2 # but this can
if x1 < x2: # if carry, add it to x3
x3 += base
op(result.hi, x3 +hi(x1))
op(result.lo, (x1 shl p) orlo(x0))
x86_64 assembly implementation. The same mulq opcodes exists as mull on all 32-bit platform for 32x32to64 and the equivalent exists on ARM, MIPS etc.
funcasm_x86_64_extMul(hi, lo: varuint64, a, b: uint64) {.inline.}=## Extended precision multiplication uint64 * uint64 --> uint128# MUL r/m64# Multiply RAX by r/m64## Inputs:# - RAX# - r/m# Outputs:# - High word in RDX# - Low word in RAXasm""" mulq %[operand] : "=d" (`*hi`), "=a" (`*lo`) // Don't forget to dereference the var hidden pointer : "a" (`a`), [operand] "rm" (`b`) : // no clobbered registers"""
Tests
whenisMainModule:
block: # Multiplicationvar hi, lo: uint64asm_x86_64_extMul(hi, lo, 1shl32, 1shl33) # 2^65doAssert hi ==2doAssert lo ==0block: # Divisionvar q, r: uint64# (1 shl 64) div 3let n_hi =1'u64let n_lo =0'u64let d =3'u64asm_x86_64_div2n1n(q, r, n_hi, n_lo, d)
doAssert q ==6148914691236517205'u64doAssert r ==1
The text was updated successfully, but these errors were encountered:
Without going full int128 there are 2 huge optimisations opportunities on 64-bit processors.
All 64-bit platforms implement the division of a 128-bit number by a 64-bit one and also implement the extended precision multiplication of 2 64-bit number to a 128-bit one. The reason why is that 64x64 multiplication and 64/64 division requires extended precision and then discarding part of the result.
There is currently no way in pure C to force those instructions besides using int128 and even then the int128 code would probably less efficient by introducing checks/branches even though we know that only the low 64-bit part is used.
The benefits would be tremendous, replacing ~50 instructions by a single one in both multiplication and division case.
Division
nim-stint/stint/private/uint_div.nim
Lines 131 to 169 in edb1ade
x86_64 assembly implementation. The same divq opcodes exists as divl on all 32-bit platform for 64div32 and the equivalent exists on ARM, MIPS etc.
Multiplication
nim-stint/stint/private/uint_mul.nim
Lines 48 to 73 in edb1ade
x86_64 assembly implementation. The same mulq opcodes exists as mull on all 32-bit platform for 32x32to64 and the equivalent exists on ARM, MIPS etc.
Tests
The text was updated successfully, but these errors were encountered: