High-Performance Scala Notes
Some overall notes on JVM performance optimization techniques:
- JVM Optimization Techniques
- Hybrid JVM Perf Profiling - using AGCT and perf on Linux
- Profiling JVM Applications - a great guide to FlameGraphs and other tools
Throwing exceptions is way too common, especially in Java and Java-ish Scala code. The problem is that unless it is truly exceptional, exceptions are VERY expensive, in terms of both heap and runtime cost.
For example, our BinaryVector
classes originally could throw when trying to add an element beyond the space available for a vector. However, this is not really exceptional: running out of space is a fairly common case.
Redesigning APIs to return proper values is not only more performant, but more functional and easier to test as well.
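For instance, here is a hedged sketch of what such an API might look like (the AddResponse type and the class/method names below are illustrative, not the actual FiloDB API):

```scala
// Return a value describing the outcome instead of throwing
sealed trait AddResponse
case object Ack extends AddResponse
case object VectorTooSmall extends AddResponse

class GrowableVector(maxElements: Int) {
  private var numElements = 0

  // Callers pattern match on the result rather than catching an exception
  def addInt(value: Int): AddResponse =
    if (numElements >= maxElements) VectorTooSmall
    else {
      // ... write the value into the underlying storage ...
      numElements += 1
      Ack
    }
}
```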
Of course, as Scala programmers you learn that vars are evil, except that sometimes for performance reasons you really need them. However, beware that the default implementation of a class-body var adds extra work on both reads and writes to track whether the var has been initialized; reading an uninitialized var makes the bytecode throw an UninitializedFieldError. For vars that you are counting on to change quickly, this means accesses might not get inlined and might be very slow. To skip this logic, declare your vars as private var constructor parameters instead. This guarantees they are initialized, causes scalac to skip the initialization check (Scala 2.11), and gives you a 2x to 5x performance boost!
NOTE: the above check comes from the -Xcheckinit compile-time flag. Turn that off for faster code!
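A minimal sketch of the difference (class and field names are made up):

```scala
// Body var: with -Xcheckinit, reads and writes also consult an initialization bitmap
class CounterSlow {
  var count: Long = 0L
  def inc(): Unit = { count += 1 }
}

// Constructor var: guaranteed to be initialized, so no check is generated
class CounterFast(private var count: Long) {
  def inc(): Unit = { count += 1 }
  def value: Long = count
}
```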
Another idea worth considering: use unsafe.getAndAddInt() when adding. It is much cheaper (a single intrinsified instruction) than the default bytecode.
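A hedged sketch of what that might look like with JDK 8's sun.misc.Unsafe (the Counter class and field here are made up for illustration):

```scala
import sun.misc.Unsafe

object UnsafeAdd {
  // Standard reflective grab of the Unsafe singleton (JDK 8)
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  class Counter { @volatile var count: Int = 0 }

  private val countOffset: Long =
    unsafe.objectFieldOffset(classOf[Counter].getDeclaredField("count"))

  // Atomically adds delta and returns the previous value; the JIT turns this into a single lock xadd on x86
  def add(c: Counter, delta: Int): Int = unsafe.getAndAddInt(c, countOffset, delta)
}
```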
The most efficient interface is to directly work with each primitive type of course, which guarantees minimal bytecode. To make the code a little more readable, we can type alias the primitive into something more readable:
scala> type BinPtr = Long
defined type alias BinPtr
scala> class Bar
defined class Bar
scala> class Bar {
| def getWireFormat(p: BinPtr): BinPtr = p + 4
| }
defined class Bar
scala> :javap Bar
...
{
public long getWireFormat(long);
descriptor: (J)J
flags: ACC_PUBLIC
Code:
stack=4, locals=3, args_size=2
0: lload_1
1: ldc2_w #8 // long 4l
4: ladd
5: lreturn
Long argument, Long return result, minimal bytecode.
The problem though is that a type alias does not make our new type unique. For example in the code above, any Long can be passed in as a BinPtr. How can we have a little more protection?
scala> trait Pointer
defined trait Pointer
scala> trait WordLengthRegion
defined trait WordLengthRegion
scala> type BinaryVectPtr = Long with Pointer with WordLengthRegion
defined type alias BinaryVectPtr
scala> class Foo {
| def getWireFormat(p: BinaryVectPtr): BinaryVectPtr = (p + 4).asInstanceOf[BinaryVectPtr]
| }
defined class Foo
scala> :javap Foo
...
{
public long getWireFormat(long);
descriptor: (J)J
flags: ACC_PUBLIC
Code:
stack=4, locals=3, args_size=2
0: lload_1
1: ldc2_w #8 // long 4l
4: ladd
5: invokestatic #15 // Method scala/runtime/BoxesRunTime.boxToLong:(J)Ljava/lang/Long;
8: invokestatic #19 // Method scala/runtime/BoxesRunTime.unboxToLong:(Ljava/lang/Object;)J
11: lreturn
This is nearly as good: still a primitive argument and return value, but there are two extra mysterious back-to-back invokestatics which box and then immediately unbox the value. These seem unnecessary and may be optimized out in the final bytecode/assembly after JIT.
You can test it yourself, but the above definition prevents regular Longs from being passed in as a BinaryVectPtr. The disadvantage is that we need casts in the code to make it work. Those can be hidden behind some apply magic I guess. Also, the above definition can be put into arrays and result in a
Another possible solution is to use scala-newtype. Its @newsubtype macro can wrap primitives in a new "type" with no runtime cost. For this raw code:
import io.estatico.newtype.macros.newsubtype

object Test {
  @newsubtype case class VectorPointer(addr: Long)

  class Foo {
    def add4(p1: VectorPointer): VectorPointer = VectorPointer(p1.addr + 4)
  }
}
This produces this bytecode:
public long add4(long);
descriptor: (J)J
flags: ACC_PUBLIC
Code:
stack=5, locals=3, args_size=2
0: getstatic #13 // Field filodb/akkabootstrapper/Test$VectorPointer$.MODULE$:Lfilodb/akkabootstrapper/Test$VectorPointer$;
3: getstatic #18 // Field filodb/akkabootstrapper/Test$VectorPointer$Ops$newtype$.MODULE$:Lfilodb/akkabootstrapper/Test$VectorPointer$Ops$newtype$;
6: getstatic #13 // Field filodb/akkabootstrapper/Test$VectorPointer$.MODULE$:Lfilodb/akkabootstrapper/Test$VectorPointer$;
9: lload_1
10: invokevirtual #21 // Method filodb/akkabootstrapper/Test$VectorPointer$.Ops$newtype:(J)J
13: invokevirtual #24 // Method filodb/akkabootstrapper/Test$VectorPointer$Ops$newtype$.addr$extension:(J)J
16: ldc2_w #25 // long 4l
19: ladd
20: invokevirtual #29 // Method filodb/akkabootstrapper/Test$VectorPointer$.apply:(J)J
23: lreturn
The bytecode looks ugly, but you still have primitive arguments and return values. I think most of the extra calls above can be JITted away, but that remains to be verified.
Adding some testing code and swapping in the adder for each type, we get the following results:
import io.estatico.newtype.macros.newsubtype

object Test {
  @newsubtype case class VectorPointer(addr: Long)

  class VectorPointerAdder {
    def add4(p1: VectorPointer): VectorPointer = VectorPointer(p1.addr + 4)
  }

  trait Pointer
  trait BinVector
  type VectorPtrWith = Long with BinVector with Pointer
  object VectorPtrWith {
    def apply(l: Long): VectorPtrWith = l.asInstanceOf[VectorPtrWith]
  }

  class VectorPtrWithAdder {
    def add4(p1: VectorPtrWith): VectorPtrWith = VectorPtrWith(p1 + 4)
  }

  type VectorPtrLong = Long
  class VectorPtrLongAdder {
    def add4(p1: VectorPtrLong): VectorPtrLong = p1 + 4
  }

  val subTypeAdder = new VectorPointerAdder
  val withAdder = new VectorPtrWithAdder
  val regAdder = new VectorPtrLongAdder

  def time(): Unit = {
    var longVal = 1000L
    val endLong = longVal + 4 * 1000000   // one million add4 calls
    val startTime = System.nanoTime
    while (longVal < endLong) {
      // Swap in withAdder or subTypeAdder (wrapping longVal appropriately) to time the other types
      longVal = regAdder.add4(longVal)
    }
    val endTime = System.nanoTime
    //scalastyle:off
    println(s"One million iterations took ${endTime - startTime} ns or ${(endTime - startTime) / 1000000.0} ns per")
  }
}
(All measurements taken after a few warmup cycles to allow JIT to work)
- VectorPtrLongAdder - One million iterations took 501270 ns or 0.50127 ns per
- VectorPointerAdder - 9 to 97 ns per call, averaging around 50 :(
- VectorPtrWithAdder - 2.4 to 70ns per call. Pretty big variance.
- VectorValAdder (value class) - One million iterations took 487316 ns or 0.487316 ns per
Unfortunately, the only alternative that can consistently come close to matching the performance of a raw Long (or a thinly veiled typedef'd one) is a Scala value class, warts and all. The problem is that value classes box rather easily.
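For reference, a sketch of what the value-class variant (the VectorValAdder measured above) might look like; the exact code is not in these notes, so treat the names as illustrative:

```scala
// Value class wrapping a Long: usually no allocation, but it boxes easily
// (e.g. in generic code, collections, or pattern matches)
final class VectorVal(val addr: Long) extends AnyVal {
  def +(offset: Long): VectorVal = new VectorVal(addr + offset)
}

class VectorValAdder {
  def add4(p1: VectorVal): VectorVal = p1 + 4
}
```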
First, some tips and resources:
- Enabling -XX:+PrintAssembly - includes instructions for OSX
- JMH Resources
- Using JMH and Perfasm to debug Scala
- Perfasm explained
- Black Magic of Java Method Dispatch - a detailed guide about inlining
After enabling PrintAssembly above, add these options to JMH: -jvmArgsAppend -XX:+UnlockDiagnosticVMOptions -jvmArgsAppend -XX:+PrintInlining
You will get a huge amount of output. Search for what you need and you will find some annotations, showing what calls are inlined and intrinsic too:
[info] @ 40 filodb.core.query.SumDoublesAggregate::aggregateNoNAs (7 bytes) inline (hot)
[info] @ 3 filodb.core.query.SumDoublesAggregate::aggregateNoNAs$mcD$sp (54 bytes) inline (hot)
[info] @ 26 org.velvia.filo.vectors.DoubleIntWrapper::apply$mcD$sp (12 bytes) inline (hot)
[info] @ 5 org.velvia.filo.vectors.IntBinaryVector$$anon$7::apply$mcI$sp (20 bytes) inline (hot)
[info] \-> TypeProfile (9488/9488 counts) = org/velvia/filo/vectors/IntBinaryVector$$anon$7
[info] @ 4 org.velvia.filo.vectors.IntBinaryVector::base (5 bytes) accessor
[info] @ 8 org.velvia.filo.vectors.IntBinaryVector::bufOffset (5 bytes) inline (hot)
[info] @ 16 org.velvia.filo.UnsafeUtils$::getShort (10 bytes) inline (hot)
[info] @ 1 org.velvia.filo.UnsafeUtils$::unsafe (5 bytes) accessor
[info] @ 6 sun.misc.Unsafe::getShort (0 bytes) (intrinsic)
Some notes on the above:
- We can see most methods are inlined along with how big they are. Thus smaller nested method calls pay off (vs big methods)
- Unsafe accesses are intrinsic
- We can see specialization worked: the specialized methods (signatures like apply$mcI$sp) are called throughout, and there are no calls to box or unbox
Here's some more inlining output:
[info] @ 83 filodb.core.query.TimeGroupingAggregate$mcD$sp::aggNoNAs (8 bytes) inline (hot)
[info] @ 4 filodb.core.query.TimeGroupingAggregate$mcD$sp::aggNoNAs$mcD$sp (86 bytes) inline (hot)
[info] @ 23 org.velvia.filo.vectors.DeltaDeltaVector::apply$mcJ$sp (25 bytes) inline (hot)
[info] @ 1 org.velvia.filo.vectors.DeltaDeltaVector::org$velvia$filo$vectors$DeltaDeltaVector$$initValue (5 bytes) inlining too deep
[info] @ 5 org.velvia.filo.vectors.DeltaDeltaVector::org$velvia$filo$vectors$DeltaDeltaVector$$slope (5 bytes) accessor
[info] @ 13 org.velvia.filo.vectors.DeltaDeltaVector::org$velvia$filo$vectors$DeltaDeltaVector$$inner (5 bytes) accessor
[info] @ 17 org.velvia.filo.vectors.IntBinaryVector$$anon$13::apply$mcI$sp (28 bytes) inlining too deep
[info] \-> TypeProfile (8781/8781 counts) = org/velvia/filo/vectors/IntBinaryVector$$anon$13
[info] @ 54 org.velvia.filo.vectors.DoubleIntWrapper::apply$mcD$sp (12 bytes) inline (hot)
[info] @ 5 org.velvia.filo.vectors.IntBinaryVector$$anon$7::apply$mcI$sp (20 bytes) inlining too deep
[info] \-> TypeProfile (9150/9150 counts) = org/velvia/filo/vectors/IntBinaryVector$$anon$7
[info] @ 67 filodb.core.query.TimeGroupingAggregate::filodb$core$query$TimeGroupingAggregate$$bucketWidth (5 bytes) inline (hot)
[info] @ 72 filodb.core.query.TimeGroupingAvgDoubleAgg::aggregateOne (7 bytes) inline (hot)
[info] @ 3 filodb.core.query.TimeGroupingAvgDoubleAgg::aggregateOne$mcD$sp (29 bytes) inlining too deep
Here you see lots of "inlining too deep" messages. What's up with that? It turns out that by default the JVM will only inline 9 levels deep. For certain code this is probably nowhere near enough. In this case, adding a simple -XX:MaxInlineLevel=20
boosted the speed by around 30%.
MaxInlineSize and FreqInlineSize might be good to tune too. First check the PrintInlining output.
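If you are benchmarking with JMH, one way to apply these flags is via the @Fork annotation (the benchmark class and body here are just placeholders):

```scala
import org.openjdk.jmh.annotations.{Benchmark, Fork}

// Append the diagnostic and inlining flags to the forked benchmark JVM
@Fork(value = 1, jvmArgsAppend = Array(
  "-XX:MaxInlineLevel=20",
  "-XX:+UnlockDiagnosticVMOptions",
  "-XX:+PrintInlining"))
class InliningBench {
  @Benchmark
  def add4(): Long = System.nanoTime + 4   // placeholder benchmark body
}
```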
"What resource are you constrained by?" This is a great question when designing and architecting systems. You may find, especially for a JVM-based in-memory processing system, that memory, specifically heap memory, is the most precious resource.
- GC is in general much more efficient at clearing lots of temporary objects than lots of permanent objects
- Offheap is much better for long-lived data whose lifecycle you want to control anyways
- Beware of object graphs. They can very easily bloat heap usage, especially when you have millions of copies of something
- Offheap (fully native) and direct buffer memory are not moved around, so you can obtain C-like pointers and use them safely in code
- Onheap objects are moved around by the JVM GC. It is in theory possible to obtain a native pointer to heap memory, but not worth the cost (see below). It is not possible to pin onheap objects.
- Possible to use Scala value classes with offheap data to minimize onheap object allocation and usage. Value classes have their own pain points though.
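To make the last bullet concrete, here is a hedged sketch of a value class over an off-heap address (the record layout and helper names are assumptions, not FiloDB's actual format):

```scala
import sun.misc.Unsafe

object OffheapExample {
  // Reflectively obtain the Unsafe singleton (JDK 8)
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  // Value class wrapping a raw off-heap address: no per-record on-heap object
  final class RecordPtr(val addr: Long) extends AnyVal {
    def numInts: Int          = unsafe.getInt(addr)               // assume a 4-byte length prefix
    def intField(i: Int): Int = unsafe.getInt(addr + 4 + i * 4)
  }

  def allocRecord(numInts: Int): RecordPtr = {
    val addr = unsafe.allocateMemory(4 + numInts * 4)
    unsafe.putInt(addr, numInts)
    new RecordPtr(addr)
  }

  def free(p: RecordPtr): Unit = unsafe.freeMemory(p.addr)
}
```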
In short, you can obtain a pointer to an on-heap object, but you shouldn't, because on-heap objects get moved around. To do so:
scala> Array("foo", "bah", "blaz")
res0: Array[String] = Array(foo, bah, blaz)
scala> filodb.memory.format.UnsafeUtils.getLong(res0, 16)
res1: Long = -3370045110338506547
From here it gets tricky.
- Due to the default JDK8 +UseCompressedOops, the JVM actually fits TWO object pointers into that 64-bit space. However, this is only true if the JVM can fit everything into a 32GB heap.
- In general the formula for decompressing is:
<wide_oop> = <narrow-oop-base> + (<narrow-oop> << 3)
- This article on CompressedOops goes into much more fantastic detail
- The narrow-oop-base can be configured (see -XX:HeapBaseMinAddress=8g)
- A StackOverflow review of all the subtleties
- Also see this StackOverflow question on using the sa-jdi.jar Hotspot Serviceability Agent to programmatically determine the narrow-oop-base and other info
- 64 vs 32-bit pointers can very easily be determined from unsafe.arrayIndexScale(classOf[Array[String]]) - this is 4 if pointer compression is enabled (see the sketch after this list)
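A small sketch tying the decompression formula above to the arrayIndexScale check from the last bullet (the narrow oop value is made up, and the 3-bit shift assumes the default 8-byte object alignment):

```scala
import sun.misc.Unsafe

object CompressedOopsCheck {
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  // 4 => compressed 32-bit oops in use; 8 => full 64-bit pointers
  val refScale: Int = unsafe.arrayIndexScale(classOf[Array[String]])

  // The decompression formula from above, with illustrative values only
  val narrowOopBase = 0L            // zero-based compressed oops are typical for smaller heaps
  val narrowOop     = 0x12345678L   // a made-up 32-bit compressed reference
  val wideOop: Long = narrowOopBase + (narrowOop << 3)
}
```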
You also have to deal with the address moving around. I found that if you write the object into an object array, the JVM will update the array slot when the physical address moves. I suppose if you check it often enough you can always get the newest location.
- In critical sections and tight inner loops, don't use tuples. They have an allocation cost and boxing cost too.
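A hedged illustration (names are made up): the tuple-returning version allocates a Tuple2 on every call (and can box, depending on the element types), while the alternative fills a caller-owned mutable holder:

```scala
object TupleFreeLoops {
  // Allocates a Tuple2 on every call
  def minMaxTupled(xs: Array[Int]): (Int, Int) = {
    var min = Int.MaxValue
    var max = Int.MinValue
    var i = 0
    while (i < xs.length) {
      if (xs(i) < min) min = xs(i)
      if (xs(i) > max) max = xs(i)
      i += 1
    }
    (min, max)
  }

  // Allocation-free alternative: write the results into a reusable holder
  final class MinMax(var min: Int = Int.MaxValue, var max: Int = Int.MinValue)

  def minMaxInto(xs: Array[Int], out: MinMax): Unit = {
    out.min = Int.MaxValue
    out.max = Int.MinValue
    var i = 0
    while (i < xs.length) {
      if (xs(i) < out.min) out.min = xs(i)
      if (xs(i) > out.max) out.max = xs(i)
      i += 1
    }
  }
}
```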