
HIVE-28600: Iceberg: Check that table/partition requires compaction before compacting #5529

Open · wants to merge 2 commits into base: master

Conversation

@difin difin commented Nov 5, 2024 (Contributor)

What changes were proposed in this pull request?

Added a check of whether an Iceberg table/partition requires compaction before starting a manual compaction request.

Why are the changes needed?

Performance improvement to skip tables/partitions which do not require compaction.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Updated existing q-tests, added new unit test.

@@ -458,6 +458,52 @@ public static List<DeleteFile> getDeleteFiles(Table table, String partitionPath)
t -> ((PositionDeletesScanTask) t).file()));
}

public static float getFileSizeRatio(Table table, String partitionPath, long fileSizeInBytesThreshold) {
Contributor:

The name of the method does not show what ratio it computes; it does nothing with the actual file size, rather it checks the number of files.
Something like getUncompactedRatio would be better.

Contributor Author @difin, Nov 14, 2024:

> it does nothing with the actual file size, rather it checks the number of files.

The method does use the actual file sizes: it checks, for each file, whether its actual size is above or below the threshold and returns (number of files whose size is below the threshold) / (total number of files).
I think getUncompactedRatio is fine; I will rename it accordingly.

CloseableIterable.filter(fileScanTasks, t -> {
  DataFile file = t.asFileScanTask().file();
  return (!table.spec().isPartitioned() ||
      partitionPath == null && file.specId() != table.spec().specId() ||
Contributor:

Please use better separation between the operands to make this readable, understandable, and reliable; as written, operator precedence can cause issues.

Like: ((a==b && c!=2) && (a!=b && c=2) && (...))

Contributor Author @difin, Nov 14, 2024:

Previously I used braces in a very similar case, and @deniskuzZ asked to remove them: #5389 (comment)

The precedence won't cause issues because in Java the && operator has higher precedence than ||, so the expression is evaluated deterministically.

CloseableIterable<ScanTask> filteredDeletesScanTasks =
    CloseableIterable.filter(deletesScanTasks, t -> {
      DeleteFile file = ((PositionDeletesScanTask) t).file();
      return !table.spec().isPartitioned() ||
Contributor:

Please use better separation between the operands to make this readable, understandable, and reliable; as written, operator precedence can cause issues.

Like: ((a==b && c!=2) && (a!=b && c=2) && (...))

Contributor Author @difin, Nov 14, 2024:

Similar to the previous comment: previously I used braces in a very similar case, and @deniskuzZ asked to remove them: #5389 (comment)

The precedence won't cause issues because in Java the && operator has higher precedence than ||, so the expression is evaluated deterministically.

Contributor:

Highly disagree. Please check this:

boolean x = false;
boolean y = true;
boolean z = true;
if (x && y || z) {
  System.out.println("true");
} else {
  System.out.println("false");
}

if (x && (y || z)) {
  System.out.println("true");
} else {
  System.out.println("false");
}

The result for the first is true; for the second it is false.

Contributor Author @difin, Nov 14, 2024:

There is no contradiction; your test confirms that && has higher precedence than ||, as it gets evaluated first.

First case:

x && y || z
false && true || true
false || true
true

Second case:

x && (y || z)
false && (true || true)
false && (true)
false

@@ -458,6 +458,52 @@ public static List<DeleteFile> getDeleteFiles(Table table, String partitionPath)
t -> ((PositionDeletesScanTask) t).file()));
}

public static float getFileSizeRatio(Table table, String partitionPath, long fileSizeInBytesThreshold) {
long uncompactedFilesCount = getDataFileCount(table, partitionPath, fileSizeInBytesThreshold, true);
long compactedFilesCount = getDataFileCount(table, partitionPath, fileSizeInBytesThreshold, false);
Contributor:

Why is fileSizeInBytesThreshold the criterion for deciding whether a file is uncompacted or compacted?
Isn't there a better approach for this check?

Contributor Author @difin, Nov 14, 2024:

A file is classified as compacted or uncompacted by comparing its actual size with the threshold defined in the configuration, depending on the compaction type:

HIVE_ICEBERG_MAJOR_COMPACTION_FILE_SIZE_THRESHOLD("hive.iceberg.major.compaction.file.size.threshold", "96mb",
    new SizeValidator(), "Iceberg data file size in megabytes below which a file needs to be compacted."),
HIVE_ICEBERG_MINOR_COMPACTION_FILE_SIZE_THRESHOLD("hive.iceberg.minor.compaction.file.size.threshold", "16mb",
    new SizeValidator(), "Iceberg data file size in megabytes below which a file needs to be compacted."),

fileSizeInBytesThreshold gets the value from these configs depending on compaction type.
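To make the threshold-based classification concrete, here is a minimal, self-contained sketch (not the actual Hive code; the standalone method, its parameters, and the class name are hypothetical simplifications of the getUncompactedRatio logic described above):

```java
import java.util.List;

public class UncompactedRatioSketch {
    // Simplified stand-in for the ratio described above:
    // (number of files whose size is below the threshold) / (total number of files).
    static float getUncompactedRatio(List<Long> fileSizesInBytes, long thresholdBytes) {
        if (fileSizesInBytes.isEmpty()) {
            return 0f;
        }
        long uncompacted = fileSizesInBytes.stream().filter(s -> s < thresholdBytes).count();
        return (float) uncompacted / fileSizesInBytes.size();
    }

    public static void main(String[] args) {
        long threshold = 96L * 1024 * 1024; // 96mb, the major compaction default
        // Two files below the threshold, two above it.
        List<Long> sizes = List.of(10L << 20, 50L << 20, 200L << 20, 300L << 20);
        System.out.println(getUncompactedRatio(sizes, threshold)); // prints 0.5
    }
}
```

A caller could then compare this ratio against a minimum before issuing the compaction request, which is the skip optimization this PR describes.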

('fn11','ln11', 3, 20, 101),
('fn12','ln12', 3, 20, 101);

select * from ice_orc where company_id = 100;
select * from ice_orc where dept_id = 2;
Contributor:

Why did you delete those selects?

Contributor Author @difin, Nov 14, 2024:

Because, with the other changes made to this q file in this PR, the order of the result records from these selects is not fixed and changes randomly between executions of the test.

Contributor:

How about adding -- SORT_QUERY_RESULTS?

if (table.getStorageHandler() != null &&
    !table.getStorageHandler().canCompact(context.getConf(), table, compactionRequest.getPartitionname(),
        compactionRequest.getType())) {
  LOG.info("Table=%s%s doesn't meet requirements for compaction", table.getTableName(),
Contributor:

Please add more description of which requirement wasn't met, e.g.: what was expected, and what was the actual value?

Contributor Author:

I removed logging from this place and instead added it to HiveIcebergStorageHandler#canCompact.
This way it will log the exact reason why a table/partition doesn't meet compaction requirements.
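A hedged sketch of what such reason-logging inside a canCompact-style check might look like (this is not the actual HiveIcebergStorageHandler code; the ratio threshold, the method signature, and the message text are assumptions for illustration only):

```java
public class CanCompactSketch {
    // Hypothetical check: skip compaction when the share of small
    // ("uncompacted") files is below an assumed minimum ratio, and
    // log the exact reason so the decision is visible to operators.
    static boolean canCompact(String tableName, float uncompactedRatio, float minRatio) {
        if (uncompactedRatio < minRatio) {
            System.out.printf(
                "Table=%s doesn't meet requirements for compaction: "
                    + "uncompacted file ratio %.2f is below threshold %.2f%n",
                tableName, uncompactedRatio, minRatio);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canCompact("ice_orc", 0.10f, 0.50f)); // logs the reason, then false
        System.out.println(canCompact("ice_orc", 0.75f, 0.50f)); // true
    }
}
```

Logging at the point where the decision is made, rather than at the caller, keeps the "expected vs. actual" detail next to the values that produced it, which addresses the reviewer's request above.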

Contributor @zratkai left a comment:

Please check my comments!


sonarcloud bot commented Nov 15, 2024
