Tracking Issue: improve compaction #987

Open
3 of 8 tasks
Rachelint opened this issue Jun 12, 2023 · 1 comment
Labels
A-analytic-engine Area: Analytic Engine feature New feature or request tracking issue Issue tracks progress for something

Comments

@Rachelint
Contributor

Rachelint commented Jun 12, 2023

Describe This Problem

The design of compaction in CeresDB is still rough, and we should put more effort into it.
Improvements can be made in the following areas:

  • Compaction strategy. Currently we only implement TWCS (time window compaction strategy); we define a level 1 but do nothing special for it.
  • How to do compaction more efficiently. Compaction speed may be as important as the strategy.
  • Metrics and tests. We should have ways to check the correctness and effectiveness (especially the improvement in query performance) of our compaction strategy.

Proposal

1. Compaction strategy

  • Introduce a scoring mechanism to integrate multiple rules.
  • Consider the sequence number (WAL) when picking files to compact, to ensure correctness.
  • Eliminate time range overlap between ssts in level 1.
  • Take each table's priority into consideration.
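
As a rough illustration of the "no time range overlap in level 1" goal, the sketch below checks a set of sst time ranges for overlap. It is not CeresDB code: `TimeRange` and `find_overlaps` are hypothetical names, and ranges are assumed to be non-empty and half-open, `[start, end)`.

```rust
/// Hypothetical time range of one sst, half-open: [start, end).
#[derive(Debug, Clone, Copy, PartialEq)]
struct TimeRange {
    start: i64,
    end: i64,
}

/// Returns pairs (earlier, later) where `later` starts before the
/// furthest-reaching earlier range has ended. This does not list every
/// overlapping pair, but the result is empty exactly when no two
/// ranges overlap.
fn find_overlaps(ranges: &[TimeRange]) -> Vec<(TimeRange, TimeRange)> {
    let mut sorted = ranges.to_vec();
    sorted.sort_by_key(|r| (r.start, r.end));

    let mut overlaps = Vec::new();
    // The range with the largest `end` seen so far.
    let mut widest: Option<TimeRange> = None;
    for r in sorted {
        if let Some(w) = widest {
            if r.start < w.end {
                overlaps.push((w, r));
            }
        }
        if widest.map_or(true, |w| r.end > w.end) {
            widest = Some(r);
        }
    }
    overlaps
}
```

A check like this could serve as an invariant in tests or in an emulator: after a level 1 compaction, `find_overlaps` over the level's ranges should return an empty list.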

2. Performance of compaction

3. Metrics and tests

  • Build an emulator for the compaction strategy, inspired by IOx.
  • Add metrics (like read amplification, write amplification, space amplification) to check the effectiveness of the strategy.
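
The three amplification metrics could be computed from a few counters. The sketch below uses the commonly cited definitions (bytes written to storage per byte written by users, bytes on disk per byte of live data, ssts read per query); the `AmplificationInputs` struct and its field names are assumptions made for illustration, not existing CeresDB metrics.

```rust
/// Hypothetical counters collected over some observation window.
struct AmplificationInputs {
    user_bytes: u64,      // bytes written by clients
    flushed_bytes: u64,   // bytes written when flushing memtables to ssts
    compacted_bytes: u64, // bytes written as compaction output
    disk_bytes: u64,      // total size of sst files on disk
    live_bytes: u64,      // size of the data if fully deduplicated
    ssts_read: u64,       // number of ssts touched by queries
    queries: u64,         // number of queries served
}

/// Ratio helper that avoids dividing by zero.
fn ratio(numerator: u64, denominator: u64) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(numerator as f64 / denominator as f64)
    }
}

impl AmplificationInputs {
    /// Bytes written to storage per byte written by users.
    fn write_amplification(&self) -> Option<f64> {
        ratio(self.flushed_bytes + self.compacted_bytes, self.user_bytes)
    }

    /// Bytes stored on disk per byte of live data.
    fn space_amplification(&self) -> Option<f64> {
        ratio(self.disk_bytes, self.live_bytes)
    }

    /// Average number of ssts read per query.
    fn read_amplification(&self) -> Option<f64> {
        ratio(self.ssts_read, self.queries)
    }
}
```

Comparing these ratios before and after a strategy change gives a rough view of the trade-off: a strategy that lowers read amplification usually pays for it with higher write amplification.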

Additional Context

No response

@Rachelint Rachelint added feature New feature or request A-analytic-engine Area: Analytic Engine tracking issue Issue tracks progress for something labels Jun 12, 2023
@Rachelint Rachelint changed the title Improve compaction Tracking Issue: Improve compaction Jun 12, 2023
@Rachelint Rachelint changed the title Tracking Issue: Improve compaction Tracking Issue: improve compaction Jun 12, 2023
@Rachelint Rachelint pinned this issue Jun 12, 2023
@jiacai2050
Contributor

jiacai2050 commented Jun 19, 2023

Add metrics (like read amplification, write amplification, space amplification) to check the effectiveness of the strategy.

The current codebase already has basic metrics for compaction:

  1. Input sst size/row num
  2. Output sst size/row num

https://github.com/CeresDB/ceresdb/blob/f873980175e46eb436fb316cabaa6911985794ef/analytic_engine/src/table/metrics.rs#L62

jiacai2050 added a commit that referenced this issue Jul 5, 2023
## Rationale
Part of #987.

The current implementation picks files to compact by file size and does
not consider max_seq. This may cause data corruption
in corner cases, e.g.:
- sst1, max_seq:10, PK1=10
- sst2, max_seq:11, PK1=9
- sst3, max_seq:12, no PK1

If compaction picks sst1 and sst3 and outputs sst4, sst4's max_seq will be 12,
and PK1 now exists in two files:
- sst2, max_seq:11, PK1=9
- sst4, max_seq:12, PK1=10

That is to say, PK1's value is now read as 10, which is wrong (9 is the right value).
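
The corner case can be reproduced with a small model. This is a sketch, not CeresDB code: `Sst`, `sst`, `compact`, and `read_latest` are hypothetical names, and it assumes that a read resolves duplicate keys by taking the value from the file with the largest max_seq.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified sst: a max sequence number plus key -> value rows.
#[derive(Debug, Clone)]
struct Sst {
    max_seq: u64,
    rows: BTreeMap<&'static str, i64>,
}

fn sst(max_seq: u64, rows: &[(&'static str, i64)]) -> Sst {
    Sst { max_seq, rows: rows.iter().cloned().collect() }
}

/// Merge the inputs into one output. For duplicate keys the input with the
/// larger max_seq wins; the output carries the largest input max_seq.
fn compact(inputs: &[Sst]) -> Sst {
    let mut sorted: Vec<&Sst> = inputs.iter().collect();
    sorted.sort_by_key(|s| s.max_seq); // ascending, so newer rows overwrite older ones
    let mut rows = BTreeMap::new();
    for s in &sorted {
        for (k, v) in &s.rows {
            rows.insert(*k, *v);
        }
    }
    let max_seq = sorted.iter().map(|s| s.max_seq).max().unwrap_or(0);
    Sst { max_seq, rows }
}

/// Assumed read path: take the value from the file with the largest
/// max_seq among the files that contain the key.
fn read_latest(files: &[Sst], key: &str) -> Option<i64> {
    files
        .iter()
        .filter(|s| s.rows.contains_key(key))
        .max_by_key(|s| s.max_seq)
        .and_then(|s| s.rows.get(key).copied())
}
```

With `sst(10, &[("PK1", 10)])`, `sst(11, &[("PK1", 9)])`, and `sst(12, &[])`, reading PK1 yields 9 before compaction. After compacting sst1 with sst3, the output inherits max_seq 12 and the stale value 10 shadows sst2. Compacting the adjacent pair sst2 and sst3 instead keeps the result at 9.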

## Detailed Changes
When doing compaction, first sort the ssts by max_seq in descending order,
then pick only adjacent ssts; this fixes the original issue.
The picked ssts are still required to meet the other requirements, such
as `min_threshold`, `max_threshold`, and `max_input_size`.
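
A minimal sketch of this picking rule follows. It is illustrative only and is not the code from this commit: `FileMeta`, `PickerOptions`, and `pick_adjacent` are hypothetical names, and the semantics of the three options (minimum and maximum number of input files, maximum total input size in bytes) are assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    id: u64,
    max_seq: u64,
    size: u64,
}

/// Assumed meaning: min/max number of files per compaction and
/// max total size of the inputs in bytes.
struct PickerOptions {
    min_threshold: usize,
    max_threshold: usize,
    max_input_size: u64,
}

/// Sort by max_seq descending, then return the first run of adjacent
/// files that satisfies all three limits, or None if there is no such run.
fn pick_adjacent(mut files: Vec<FileMeta>, opts: &PickerOptions) -> Option<Vec<FileMeta>> {
    files.sort_by(|a, b| b.max_seq.cmp(&a.max_seq));
    for start in 0..files.len() {
        let mut total = 0u64;
        let mut end = start;
        // Grow the window while both the count and the size limit allow it.
        while end < files.len()
            && end - start < opts.max_threshold
            && total + files[end].size <= opts.max_input_size
        {
            total += files[end].size;
            end += 1;
        }
        if end - start >= opts.min_threshold {
            return Some(files[start..end].to_vec());
        }
    }
    None
}
```

Because the window never skips a file, a large sst sitting between two small ones (in max_seq order) prevents the two small ones from being compacted together, which is exactly the combination that caused the wrong value in the rationale.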

## Test Plan
UT and manually.
@Rachelint Rachelint unpinned this issue Jul 28, 2023

2 participants