Tracking Issue: improve compaction #987

Rachelint · 2023-06-12T02:55:23Z

Describe This Problem

The design of compaction in CeresDB is still rough, and we should put more effort into it.
There are several areas where improvements can be made:

Compaction strategy. Currently we only implement TWCS (time-window compaction strategy); level 1 is defined, but nothing special is done for it.
How to do compaction more efficiently. The speed of compaction may be as important as the strategy.
Metrics and tests. We should have ways to check the correctness and effectiveness (especially the improvement in queries) of our compaction strategy.

Proposal

1. Compaction strategy

Introduce a score mechanism to integrate multiple rules.
Consider the sequence (WAL) when picking files to compact, to ensure correctness.
Eliminate time range overlap of ssts in level 1.
Take the priority of each table into consideration.
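The goal of eliminating time range overlap in level 1 can be checked mechanically. The following is a minimal sketch, not CeresDB code: `TimeRange` and `find_overlaps` are hypothetical names, and the half-open `[start, end)` convention is an assumption made for the example.

```rust
/// Hypothetical time range of an sst, half-open: [start, end).
/// The half-open convention is an assumption made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct TimeRange {
    start: i64,
    end: i64,
}

/// Returns pairs of ranges that overlap, after sorting by start time.
/// Each range is compared only with the range that reaches furthest
/// among those before it, so at most one pair is reported per range.
/// An empty result means the ranges are pairwise disjoint.
fn find_overlaps(ranges: &[TimeRange]) -> Vec<(TimeRange, TimeRange)> {
    let mut sorted = ranges.to_vec();
    sorted.sort_by_key(|r| (r.start, r.end));

    let mut overlaps = Vec::new();
    let mut widest = match sorted.first() {
        Some(r) => *r,
        None => return overlaps,
    };

    for r in sorted.iter().skip(1) {
        // With half-open ranges, touching ranges ([0,10) and [10,20))
        // do not overlap.
        if r.start < widest.end {
            overlaps.push((widest, *r));
        }
        if r.end > widest.end {
            widest = *r;
        }
    }
    overlaps
}
```

A check like this could run in tests or in an emulator after each simulated compaction to assert that level 1 stays overlap-free.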

2. Performance of compaction

Keep more data in memtable and flush larger L0 ssts. See: Use dictionary to store string in memtable #1029
Optimize sst iterator and filter build to consume less CPU. See: refactor: optimize sst iterator and filter build to consume less CPU #975

3. Metrics and tests

An emulator for the compaction strategy, inspired by IOx.
Add metrics (like read amplification, write amplification, space amplification) to check the effectiveness of the strategy.
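As a rough illustration of the three amplification metrics, here is a minimal sketch. The struct and its counter names are hypothetical, not existing CeresDB metrics, and the definitions used (especially read amplification as ssts touched per query) are one common convention assumed for the example.

```rust
/// Hypothetical counters collected over some observation window.
/// None of these names come from the CeresDB codebase.
#[derive(Debug, Default, Clone, Copy)]
struct CompactionCounters {
    /// Bytes ingested by users.
    user_bytes_written: u64,
    /// Bytes written to storage by flush and compaction.
    storage_bytes_written: u64,
    /// Logical size of the live (latest-version) data.
    live_bytes: u64,
    /// Total size of all ssts currently on disk.
    disk_bytes: u64,
    /// Number of ssts touched by queries.
    ssts_read: u64,
    /// Number of queries served.
    queries: u64,
}

/// Divides two counters, returning None when the denominator is zero.
fn ratio(numerator: u64, denominator: u64) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(numerator as f64 / denominator as f64)
    }
}

impl CompactionCounters {
    /// Bytes written to storage per byte written by users.
    fn write_amplification(&self) -> Option<f64> {
        ratio(self.storage_bytes_written, self.user_bytes_written)
    }

    /// Bytes on disk per byte of live data.
    fn space_amplification(&self) -> Option<f64> {
        ratio(self.disk_bytes, self.live_bytes)
    }

    /// Average number of ssts touched per query (assumed definition).
    fn read_amplification(&self) -> Option<f64> {
        ratio(self.ssts_read, self.queries)
    }
}
```

Comparing these ratios before and after a strategy change gives a first-order view of its effect; lower is better for all three, though they usually trade off against each other.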

Additional Context

No response


jiacai2050 · 2023-06-19T04:13:27Z

> Add metrics (like read amplification, write amplification, space amplification) to check the effectiveness of the strategy.

The current codebase already has basic metrics for compaction:

Input sst size/row num
Output sst size/row num

https://github.com/CeresDB/ceresdb/blob/f873980175e46eb436fb316cabaa6911985794ef/analytic_engine/src/table/metrics.rs#L62

## Rationale

Part of #987.

The current implementation picks files for compaction by file size and does not consider max_seq. This may cause data corruption in corner cases, e.g.:

- sst1, max_seq:10, PK1=10
- sst2, max_seq:11, PK1=9
- sst3, max_seq:12, no PK1

If compaction picks sst1 and sst3 and outputs sst4, its max_seq will be 12. Now PK1 exists in two files:

- sst2, max_seq:11, PK1=9
- sst4, max_seq:12, PK1=10

Because sst4 has the larger max_seq, its value takes precedence, so PK1's value is now 10, which is wrong (9 is the right value).

## Detailed Changes

When doing compaction, first sort ssts by max_seq in descending order, then only pick adjacent ssts. This fixes the issue above. At the same time, the picked ssts are ensured to meet the other requirements such as `min_threshold`, `max_threshold`, `max_input_size`.

## Test Plan

Unit tests and manual testing.
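The picking rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the PR: `SstMeta` and `pick_adjacent_by_max_seq` are hypothetical names, and the greedy "first window that fits" policy is an assumption. Only the parameter names `min_threshold`, `max_threshold` and `max_input_size` come from the description.

```rust
/// Hypothetical, simplified sst metadata for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct SstMeta {
    id: u64,
    max_seq: u64,
    size: u64,
}

/// Sorts ssts by max_seq descending and returns the first run of
/// adjacent ssts that satisfies all three limits, if any.
/// Because only contiguous runs are considered, an sst whose max_seq
/// lies between two picked ssts can never be skipped.
fn pick_adjacent_by_max_seq(
    mut ssts: Vec<SstMeta>,
    min_threshold: usize,
    max_threshold: usize,
    max_input_size: u64,
) -> Option<Vec<SstMeta>> {
    ssts.sort_by(|a, b| b.max_seq.cmp(&a.max_seq));

    for start in 0..ssts.len() {
        let mut total_size = 0u64;
        let mut end = start;
        // Grow the window while the count and size limits still hold.
        while end < ssts.len() && end - start < max_threshold {
            let next_size = total_size + ssts[end].size;
            if next_size > max_input_size {
                break;
            }
            total_size = next_size;
            end += 1;
        }
        if end > start && end - start >= min_threshold {
            return Some(ssts[start..end].to_vec());
        }
    }
    None
}
```

With the three ssts from the example (max_seq 10, 11, 12), this sketch can pick sst3 and sst2, or sst2 and sst1, but never sst1 and sst3 on their own, which is the combination that caused the wrong value for PK1.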

Rachelint added the feature (New feature or request), A-analytic-engine (Area: Analytic Engine), and tracking issue (Issue tracks progress for something) labels Jun 12, 2023

Rachelint changed the title ~~Improve compaction~~ Tracking Issue: Improve compaction Jun 12, 2023

Rachelint changed the title ~~Tracking Issue: Improve compaction~~ Tracking Issue: improve compaction Jun 12, 2023

Rachelint pinned this issue Jun 12, 2023

jiacai2050 mentioned this issue Jun 29, 2023

fix: compaction support pick by max_seq #1041

Merged

Rachelint unpinned this issue Jul 28, 2023
