Garnet crash on load data #766

Open

rnz opened this issue Nov 1, 2024 · 5 comments

rnz commented Nov 1, 2024

Describe the bug

    _________
   /_||___||_\      Garnet 1.0.35 64 bit; standalone mode
   '. \   / .'      Port: 6379
     '.\ /.'        https://aka.ms/GetGarnet
       '.'

03::27::19 info: GarnetServer[0] Garnet 1.0.35 64 bit; standalone mode; Port: 6379
03::27::19 info: ArgParser[0] Configuration import from embedded resource succeeded. Path: defaults.conf.
03::27::19 info: ArgParser[0] Configuration import from local machine succeeded. Path: /etc/garnet/garnet.conf.
03::27::19 info: Options[0] [Store] Using page size of 32m
03::27::19 info: Options[0] [Store] Using log memory size of 128g, with 896 empty pages, for effective size of -0.109252929688p
03::27::19 info: Options[0] [Store] There are 4k log pages in memory
03::27::19 info: Options[0] [Store] Using disk segment size of 1g
03::27::19 info: Options[0] [Store] Using hash index size of 128m (2m cache lines)
03::27::19 info: Options[0] [Store] Hash index size is optimized for up to ~8m distinct keys
03::27::19 info: Options[0] [Store] Using log mutable percentage of 90%
03::27::19 info: Options[0] [Store] Not using Revivification
03::27::19 trce: TsavoriteKV [main][0] KV Initialize size:2097152, sizeBytes:134217728 sectorSize:512 alignedSizeBytes:134218240
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Using page size of 1m
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Each page can hold ~43690 key-value pairs of objects
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Using log memory size of 32m
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] This can hold ~1398101 key-value pairs of objects in memory total
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] There are 32 log pages in memory
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Using disk segment size of 32m
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Using hash index size of 16m (256k cache lines)
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Hash index size is optimized for up to ~1m distinct keys
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Using log mutable percentage of 90%
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Total memory size including heap objects is unlimited
03::27::19 info: TsavoriteKV  [obj][0] [Object Store] Not using Revivification
03::27::19 trce: GarnetServer[0] TLS is disabled
03::27::19 info: GarnetServer[0] Total configured memory limit: 107558731776
03::27::19 info: TsavoriteKV [main][0] ********* Primary Recovery Information ********
03::27::19 info: StoreWrapper[0] Error during recovery of store; storeVersion = -1; objectStoreVersion = -1 Tsavorite.core.TsavoriteException: Unable to find valid HybridLog token    at Tsavorite.core.TsavoriteKV`4.FindRecoveryInfo(Int64 requestedVersion, HybridLogCheckpointInfo& recoveredHlcInfo, IndexCheckpointInfo& recoveredICInfo) in /src/libs/storage/Tsavorite/cs/src/core/Index/Recovery/Recovery.cs:line 327    at Tsavorite.core.TsavoriteKV`4.Recover(Int32 numPagesToPreload, Boolean undoNextVersion, Int64 recoverTo) in /src/libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs:line 349    at Garnet.server.StoreWrapper.RecoverCheckpoint(Boolean recoverMainStoreFromToken, Boolean recoverObjectStoreFromToken, Guid storeIndexToken, Guid storeHlogToken, Guid objectStoreIndexToken, Guid objectStoreHlogToken) in /src/libs/server/StoreWrapper.cs:line 246
* Ready to accept connections
03::27::19 info: StoreWrapper[0] NOTE: Take a checkpoint (SAVE/BGSAVE) in order to actually delete the older data segments (files) from disk
03::27::23 dbug: GarnetServer[0] Accepted TCP connection from 10.200.3.201:50768
03::27::23 dbug: Session[0] [10.200.3.201:50768] [005C39D4] Starting RespServerSession Id=1

At this point the dump is loaded with redis-cli --pipe < db0.resp:

03::28::02 dbug: GarnetServer[0] Accepted TCP connection from 172.16.16.1:40488
03::28::02 dbug: Session[0] [172.16.16.1:40488] [005818D1] Starting RespServerSession Id=2
Unhandled exception. System.IO.EndOfStreamException: Unable to read beyond the end of the stream.
   at Garnet.server.GarnetObjectSerializer.DeserializeInternal(BinaryReader binaryReader) in /src/libs/server/Objects/Types/GarnetObjectSerializer.cs:line 46
   at Garnet.server.GarnetObjectSerializer.Deserialize(IGarnetObject& obj) in /src/libs/server/Objects/Types/GarnetObjectSerializer.cs:line 29
   at Tsavorite.core.GenericAllocatorImpl`3.RetrievedFullRecord(Byte* record, AsyncIOContext`2& ctx) in /src/libs/storage/Tsavorite/cs/src/core/Allocator/GenericAllocatorImpl.cs:line 909
   at Tsavorite.core.AllocatorBase`4.AsyncGetFromDiskCallback(UInt32 errorCode, UInt32 numBytes, Object context) in /src/libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs:line 1719
   at Tsavorite.core.NativeStorageDevice._callback(IntPtr context, Int32 errorCode, UInt64 numBytes) in /src/libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs:line 112

Steps to reproduce the bug

  1. Dump a large dataset from redis/keydb to a RESP file (https://github.com/yannh/redis-dump-go):
# podman run ghcr.io/yannh/redis-dump-go:latest -insecure -host source-host -db 0 -n 1000 > db0.resp
Database 0: 61267252 element dumped
  2. Load the data from the RESP file: redis-cli --pipe < db0.resp (a full sketch follows after this list)
  3. Wait for the load to finish
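
A minimal end-to-end sketch of the repro as a shell session, assuming the source Redis/KeyDB is reachable at source-host and Garnet listens on the default 127.0.0.1:6379 (host name and file name are placeholders taken from the steps above):

# 1. Dump database 0 from the source instance into a RESP file (redis-dump-go)
podman run ghcr.io/yannh/redis-dump-go:latest -insecure -host source-host -db 0 -n 1000 > db0.resp

# 2. Replay the RESP file into Garnet over the RESP protocol
redis-cli -h 127.0.0.1 -p 6379 --pipe < db0.resp

# 3. Wait; the unhandled EndOfStreamException appears in the Garnet log during the load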

Expected behavior

All data from the RESP file is loaded into Garnet by redis-cli, without a crash.

Screenshots

No response

Release version

v1.0.35

IDE

No response

OS version

debian 11

Additional context

# cat /opt/garnet/etc/garnet.conf 
{
	"Port" : 6379,
	"Address" : null, 
	"MemorySize" : "100g",
	"PageSize" : "32m",
	"SegmentSize" : "1g",
	"IndexSize" : "128m",
	"IndexMaxSize": "",
	"MutablePercent" : 90,
	"ObjectStoreHeapMemorySize" : "",
	"ObjectStoreLogMemorySize" : "32m",
	"ObjectStorePageSize" : "1m",
	"ObjectStoreSegmentSize" : "32m",
	"ObjectStoreIndexSize" : "16m",
	"ObjectStoreIndexMaxSize": "",
	"ObjectStoreMutablePercent" : 90,
	"EnableStorageTier" : true,
	"CopyReadsToTail" : false,
	"ObjectStoreCopyReadsToTail" : false,
	"LogDir" : "/data",
	"CheckpointDir" : null,
	"Recover" : true,
	"DisablePubSub" : false,
	"EnableIncrementalSnapshots" : false,
	"PubSubPageSize" : "4k",
	"DisableObjects" : false,
	"EnableCluster" : false,
	"CleanClusterConfig" : false,
	"AuthenticationMode" : "NoAuth",
	"Password" : null,
	"ClusterUsername" : null,
	"ClusterPassword" : null,
	"AclFile" : null,
	"AadAuthority" : "https://login.host.com",
	"AadAudiences" : null,
	"AadIssuers" : null,
	"AuthorizedAadApplicationIds" : null,
	"AadValidateUsername": false,
	"EnableAOF" : false,
	"AofMemorySize" : "64m",
	"AofPageSize" : "4m",
	"AofReplicationRefreshFrequencyMs": 10,
	"SubscriberRefreshFrequencyMs": 0,
	"CommitFrequencyMs" : 0,
	"WaitForCommit" : false,
	"AofSizeLimit" : "",
	"CompactionFrequencySecs" : 300,
	"CompactionType" : "Lookup",
	"CompactionForceDelete": false,
	"CompactionMaxSegments" : 32,
	"ObjectStoreCompactionMaxSegments" : 32,
	"EnableLua" : false,
	"LuaTransactionMode" : false,
	"GossipSamplePercent" : 100,
	"GossipDelay" : 5,
	"ClusterTimeout" : 60,
	"ClusterTlsClientTargetHost" : "GarnetTest",
	"EnableTLS" : false,
	"CertFileName" : null,
	"CertPassword" : null,
	"CertSubjectName" : null,
	"CertificateRefreshFrequency" : 0,
	"ClientCertificateRequired" : true,
	"ServerCertificateRequired" : true,
	"CertificateRevocationCheckMode" : "NoCheck",
	"IssuerCertificatePath" : "",
	"LatencyMonitor" : true,
	"MetricsSamplingFrequency" : 1,
	"QuietMode" : false,
	"LogLevel" : "Trace",
	"LoggingFrequency" : "10",
	"DisableConsoleLogger" : false,
	"FileLogger" : null,
	"ThreadPoolMinThreads" : 0,
	"ThreadPoolMaxThreads" : 0,
	"UseAzureStorage" : false,
	"AzureStorageConnectionString" : null,
	"CheckpointThrottleFlushDelayMs" : 0,
	"EnableFastCommit" : true,
	"FastCommitThrottleFreq" : 1000,
	"NetworkSendThrottleMax" : 8,
	"EnableScatterGatherGet" : false,
	"ReplicaSyncDelayMs" : 5,
	"MainMemoryReplication" : false,
	"OnDemandCheckpoint" : false,
	"UseAofNullDevice" : false,
	"UseNativeDeviceLinux" : true,
	"RevivBinRecordSizes" : null,
	"RevivBinRecordCounts" : null,
	"RevivifiableFraction" : 1.0,
	"EnableRevivification" : false,
	"RevivNumberOfBinsToSearch" : 0,
	"RevivBinBestFitScanLimit" : 0,
	"RevivInChainOnly" : false,
	"RevivObjBinRecordCount" : 256,
	"ObjectScanCountLimit" : 1000,
	"ExtensionBinPaths": null,
	"ExtensionAllowUnsignedAssemblies": false,
	"IndexResizeFrequencySecs": 60,
	"IndexResizeThreshold": 50,
	"LoadModuleCS": null
}
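
For what it's worth, a back-of-the-envelope check (my reading of the startup log above, not an authoritative formula) of how these settings map to the logged numbers: the main-store log size appears to be rounded up to a power of two (128g) and then trimmed back to the 100g MemorySize budget via empty pages, and the reported memory limit is the sum of the resident budgets:

# Main store: 128g log / 32m pages = 4096 pages; 896 kept empty => 3200 resident pages
echo $(( 128*1024/32 - 896 ))        # 3200 pages
echo $(( 3200 * 32 ))                # 102400 MiB = 100 GiB, matching "MemorySize": "100g"
# "Total configured memory limit: 107558731776" =
#   100g main log + 128m main index + 32m object log + 16m object index
echo $(( 100*1024**3 + 128*1024**2 + 32*1024**2 + 16*1024**2 ))   # 107558731776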

FYI: Dragonfly loads the same RESP file successfully (in the same environment).

@TalZaccai TalZaccai self-assigned this Nov 12, 2024
rnz commented Nov 13, 2024

Same crash on Garnet 1.0.37 and 1.0.39:

$ du -hs --apparent-size rdb05-db0-again-31073609.resp 
13G	rdb05-db0-again-31073609.resp (31073609 keys)
     "CreateCommand": [
          "podman",          "run",
          "-dt",
          "--pod",          "new:garnet",
          "--name",          "gr01",
          "-p",          "6379:6379",
          "-v",           "/var/lib/garnet:/data",
          "-v",          "/opt/garnet/etc:/etc/garnet",
          "ghcr.io/microsoft/garnet",
          "--config-import-path",          "/etc/garnet/garnet.conf"
     ],
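
For readability, the same create command as a single shell invocation (a direct transcription of the array above):

$ podman run -dt --pod new:garnet --name gr01 -p 6379:6379 \
    -v /var/lib/garnet:/data -v /opt/garnet/etc:/etc/garnet \
    ghcr.io/microsoft/garnet --config-import-path /etc/garnet/garnet.conf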

Same error on EPYC 9634 (Zen4), Debian 11 (ldd (Debian GLIBC 2.31-13+deb11u8) 2.31) kernel 5.15.143-1-pve, zfs:
Same error on EPYC 9634 (Zen4), Debian 12 (ldd (Debian GLIBC 2.36-9+deb12u9) 2.36) kernel 6.8.12-3-pve, zfs:

root@n01-test-podman:~/dl# podman logs -f gr01
    _________
   /_||___||_\      Garnet 1.0.39 64 bit; standalone mode
   '. \   / .'      Port: 6379
     '.\ /.'        https://aka.ms/GetGarnet
       '.'

06::54::54 info: GarnetServer[0] Garnet 1.0.39 64 bit; standalone mode; Port: 6379
06::54::54 info: ArgParser[0] Configuration import from embedded resource succeeded. Path: defaults.conf.
06::54::54 info: ArgParser[0] Configuration import from local machine succeeded. Path: /etc/garnet/garnet.conf.
06::54::54 info: Options[0] [Store] Using page size of 512m
06::54::54 info: Options[0] [Store] Using log memory size of 128g, with 56 empty pages, for effective size of -6.875t
06::54::54 info: Options[0] [Store] There are 256 log pages in memory
06::54::54 info: Options[0] [Store] Using disk segment size of 1g
06::54::54 info: Options[0] [Store] Using hash index size of 128m (2m cache lines)
06::54::54 info: Options[0] [Store] Hash index size is optimized for up to ~8m distinct keys
06::54::54 info: Options[0] [Store] Using log mutable percentage of 90%
06::54::54 info: Options[0] [Store] Not using Revivification
06::54::54 trce: TsavoriteKV [main][0] KV Initialize size:2097152, sizeBytes:134217728 sectorSize:512 alignedSizeBytes:134218240
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Using page size of 8m
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Each page can hold ~349525 key-value pairs of objects
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Using log memory size of 64m
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] This can hold ~2796202 key-value pairs of objects in memory total
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] There are 8 log pages in memory
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Using disk segment size of 128m
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Using hash index size of 64m (1m cache lines)
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Hash index size is optimized for up to ~4m distinct keys
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Using log mutable percentage of 90%
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Total memory size including heap objects is unlimited
06::54::54 info: TsavoriteKV  [obj][0] [Object Store] Not using Revivification
06::54::54 trce: GarnetServer[0] TLS is disabled
06::54::54 info: GarnetServer[0] Total configured memory limit: 107642617856
06::54::54 info: TsavoriteKV [main][0] ********* Primary Recovery Information ********
06::54::54 info: StoreWrapper[0] Error during recovery of store; storeVersion = -1; objectStoreVersion = -1 Tsavorite.core.TsavoriteException: Unable to find valid HybridLog token    at Tsavorite.core.TsavoriteKV`4.FindRecoveryInfo(Int64 requestedVersion, HybridLogCheckpointInfo& recoveredHlcInfo, IndexCheckpointInfo& recoveredICInfo) in /src/libs/storage/Tsavorite/cs/src/core/Index/Recovery/Recovery.cs:line 327    at Tsavorite.core.TsavoriteKV`4.Recover(Int32 numPagesToPreload, Boolean undoNextVersion, Int64 recoverTo) in /src/libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs:line 349    at Garnet.server.StoreWrapper.RecoverCheckpoint(Boolean replicaRecover, Boolean recoverMainStoreFromToken, Boolean recoverObjectStoreFromToken, CheckpointMetadata metadata) in /src/libs/server/StoreWrapper.cs:line 262
* Ready to accept connections
06::54::54 info: StoreWrapper[0] NOTE: Take a checkpoint (SAVE/BGSAVE) in order to actually delete the older data segments (files) from disk
06::55::18 dbug: GarnetServer[0] Accepted TCP connection from 172.16.16.1:38760
06::55::18 dbug: Session[0] [172.16.16.1:38760] [009FE5F7] Starting RespServerSession Id=1

Unhandled exception. System.IO.EndOfStreamException: Unable to read beyond the end of the stream.
   at Garnet.server.GarnetObjectSerializer.DeserializeInternal(BinaryReader binaryReader) in /src/libs/server/Objects/Types/GarnetObjectSerializer.cs:line 46
   at Garnet.server.GarnetObjectSerializer.Deserialize(IGarnetObject& obj) in /src/libs/server/Objects/Types/GarnetObjectSerializer.cs:line 29
   at Tsavorite.core.GenericAllocatorImpl`3.RetrievedFullRecord(Byte* record, AsyncIOContext`2& ctx) in /src/libs/storage/Tsavorite/cs/src/core/Allocator/GenericAllocatorImpl.cs:line 909
   at Tsavorite.core.AllocatorBase`4.AsyncGetFromDiskCallback(UInt32 errorCode, UInt32 numBytes, Object context) in /src/libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs:line 1719
   at Tsavorite.core.NativeStorageDevice._callback(IntPtr context, Int32 errorCode, UInt64 numBytes) in /src/libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs:line 112

But there is no error on Ryzen 9 7900X (Zen4), Arch Linux (ldd (GNU libc) 2.40), kernel 6.11.5-arch1-1, ext4 on LUKS, with the same run under podman (conmon) but rootless:

$ time redis-cli --pipe < rdb05-db0-again-31073609.resp 
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 31169367

real	0m44.228s
user	0m3.314s
sys	0m8.543s

But time redis-cli --scan | wc -l on the Ryzen 9 7900X is very slow, and Garnet (1.0.39) shows high CPU utilization:

$ time redis-cli --scan | wc -l
31073609

real	677m44.041s
user	0m30.711s
sys	0m35.334s
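
For reference, redis-cli --scan iterates with SCAN and a small default COUNT of 10, so ~31M keys mean millions of small round trips; if the redis-cli build in use supports the --count option for --scan (an assumption about the client version, not verified here), a larger batch is an easy comparison to run:

$ time redis-cli --scan --count 1000 | wc -l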


rnz commented Nov 14, 2024

677 minutes to perform the scan... This is uber slow!

badrishc (Contributor) commented:

> Dump a large dataset from redis/keydb to a RESP file (https://github.com/yannh/redis-dump-go):

Can you share exactly what data was loaded into Redis before dumping to the RESP file? The repro might depend on the number of keys, the size of values, etc. that are being dumped.


badrishc commented Nov 14, 2024

Also, what was the result of info store when the crash occurred?

Was a SAVE/BGSAVE invoked, either by a client or by the dump-loading tool?

Try re-running with UseNativeDeviceLinux set to false and see if loading still fails.
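
A minimal sketch of that change, assuming the garnet.conf and podman setup posted earlier in this thread:

# In /opt/garnet/etc/garnet.conf (mounted into the container as /etc/garnet), set:
#     "UseNativeDeviceLinux" : false,
# then restart the container so the option is picked up:
$ podman restart gr01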

The error indicates that a disk read was performed, and it is not clear why loading a dump would trigger a read. Any idea what operations were actually being executed? Why would a read have been performed?


badrishc commented Nov 14, 2024

> 677 minutes to perform the scan... This is uber slow!

Try loading the same data directly using Resp.benchmark and see if SCAN is still slow. It is hard to say what the cause of this slowdown might be with the information given; for example, the hash table may be too small, leading to lots of collisions.

> Hash index size is optimized for up to ~4m distinct keys

Make sure you did not try to store more than this number of distinct objects.
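
A rough sizing sketch based only on the ratio printed in the startup log above (64m of object-store index ≈ 1m cache lines ≈ 4m distinct keys, i.e. about 16 bytes of index per key), assuming the ~31M keys from this dump land in the store whose index is being sized:

# ~31M distinct keys at ~16 bytes of hash index per key (ratio taken from the log above)
echo $(( 31073609 * 16 ))    # ~497 MB => e.g. "ObjectStoreIndexSize": "512m" (or IndexSize for the main store)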
