HIVE-28600: Iceberg: Check that table/partition requires compaction before compacting #5529
base: master
Conversation
@@ -458,6 +458,52 @@ public static List<DeleteFile> getDeleteFiles(Table table, String partitionPath)
      t -> ((PositionDeletesScanTask) t).file()));
  }

  public static float getFileSizeRatio(Table table, String partitionPath, long fileSizeInBytesThreshold) {
The name of the method does not reflect what ratio it computes: it does nothing with the actual file size, it only checks the number of files.
Something like getUncompactedRatio would be better.
Re: "it does nothing with the file actual size, rather it check the number of files."
The method does use the actual file sizes: for each file it checks whether its actual size is above or below the threshold and returns (number of files whose size is below the threshold) / (total number of files).
I think getUncompactedRatio is fine, I will rename it to that.
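For reference, a minimal sketch of the computation described above, assuming the getDataFileCount helper introduced in this PR (counting data files below or above the size threshold). This is an illustration of the described ratio, not the exact patch:

  // Sketch only: ratio of "uncompacted" data files (size below threshold) to all data files.
  // Assumes getDataFileCount(table, partitionPath, threshold, belowThreshold) from this PR.
  public static float getUncompactedRatio(Table table, String partitionPath, long fileSizeInBytesThreshold) {
    long uncompactedFilesCount = getDataFileCount(table, partitionPath, fileSizeInBytesThreshold, true);
    long compactedFilesCount = getDataFileCount(table, partitionPath, fileSizeInBytesThreshold, false);
    long totalFilesCount = uncompactedFilesCount + compactedFilesCount;
    return totalFilesCount == 0 ? 0f : (float) uncompactedFilesCount / totalFilesCount;
  }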
CloseableIterable.filter(fileScanTasks, t -> {
  DataFile file = t.asFileScanTask().file();
  return (!table.spec().isPartitioned() ||
      partitionPath == null && file.specId() != table.spec().specId() ||
Please use better separation between the operands to make this readable/understandable and reliable, since the precedence can currently cause issues.
Like: ((a == b && c != 2) && (a != b && c == 2) && (...))
Previously I used braces in a very similar case, and @deniskuzZ asked to remove them: #5389 (comment)
The precedence won't cause issues because in Java the && operator has higher precedence than ||, so the expression is evaluated deterministically.
CloseableIterable<ScanTask> filteredDeletesScanTasks =
    CloseableIterable.filter(deletesScanTasks, t -> {
      DeleteFile file = ((PositionDeletesScanTask) t).file();
      return !table.spec().isPartitioned() ||
Please use better separation between the operands to make this readable/understandable and reliable, since the precedence can currently cause issues.
Like: ((a == b && c != 2) && (a != b && c == 2) && (...))
Similar to the previous comment: previously I used braces in a very similar case, and @deniskuzZ asked to remove them: #5389 (comment)
The precedence won't cause issues because in Java the && operator has higher precedence than ||, so the expression is evaluated deterministically.
Highly disagree. Please check this:
boolean x = false;
boolean y = true;
boolean z = true;
if (x && y || z) {
  System.out.println("true");
} else {
  System.out.println("false");
}
if (x && (y || z)) {
  System.out.println("true");
} else {
  System.out.println("false");
}
The result for the first is true, for the second it is false.
There seems to be no contradiction; your test confirms that && has higher precedence than ||, since it gets evaluated first.
First case:
x && y || z
false && true || true
false || true
true
Second case:
x && (y || z)
false && (true || true)
false && (true)
false
@@ -458,6 +458,52 @@ public static List<DeleteFile> getDeleteFiles(Table table, String partitionPath)
      t -> ((PositionDeletesScanTask) t).file()));
  }

  public static float getFileSizeRatio(Table table, String partitionPath, long fileSizeInBytesThreshold) {
    long uncompactedFilesCount = getDataFileCount(table, partitionPath, fileSizeInBytesThreshold, true);
    long compactedFilesCount = getDataFileCount(table, partitionPath, fileSizeInBytesThreshold, false);
Why is fileSizeInBytesThreshold the criterion for deciding whether a file is uncompacted or compacted?
Isn't there a better approach for this check?
A file is considered compacted or uncompacted by comparing its actual size with the threshold defined in the configuration, depending on the compaction type:

HIVE_ICEBERG_MAJOR_COMPACTION_FILE_SIZE_THRESHOLD("hive.iceberg.major.compaction.file.size.threshold", "96mb",
    new SizeValidator(), "Iceberg data file size in megabytes below which a file needs to be compacted."),
HIVE_ICEBERG_MINOR_COMPACTION_FILE_SIZE_THRESHOLD("hive.iceberg.minor.compaction.file.size.threshold", "16mb",
    new SizeValidator(), "Iceberg data file size in megabytes below which a file needs to be compacted."),

fileSizeInBytesThreshold gets its value from these configs depending on the compaction type.
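For illustration, a hedged sketch of how the threshold could be resolved from these configs per compaction type. The boolean isMajorCompaction and the call site are assumptions for illustration, not the PR's code; HiveConf.getSizeVar resolves size strings such as "96mb" to bytes:

  // Illustrative sketch only; isMajorCompaction is an assumed flag, not taken from the PR.
  long fileSizeInBytesThreshold = isMajorCompaction
      ? conf.getSizeVar(HiveConf.ConfVars.HIVE_ICEBERG_MAJOR_COMPACTION_FILE_SIZE_THRESHOLD)
      : conf.getSizeVar(HiveConf.ConfVars.HIVE_ICEBERG_MINOR_COMPACTION_FILE_SIZE_THRESHOLD);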
('fn11','ln11', 3, 20, 101),
('fn12','ln12', 3, 20, 101);

select * from ice_orc where company_id = 100;
select * from ice_orc where dept_id = 2;
Why did you delete those selects?
Because with the other changes made to this q file in this PR, the order of the result records from these selects is not fixed and changes randomly between test executions.
How about adding -- SORT_QUERY_RESULTS?
This q-test checks whether records get sorted after executing compaction with a sort clause.
I removed -- SORT_QUERY_RESULTS intentionally, otherwise it would not be possible to test whether records are sorted after compaction.
if (table.getStorageHandler() != null &&
    !table.getStorageHandler().canCompact(context.getConf(), table, compactionRequest.getPartitionname(),
        compactionRequest.getType())) {
  LOG.info("Table=%s%s doesn't meet requirements for compaction", table.getTableName(),
Please add more description about which requirement was not met, e.g. what was expected and what the actual value is.
I removed the logging from this place and instead added it to HiveIcebergStorageHandler#canCompact.
This way it will log the exact reason why a table/partition doesn't meet compaction requirements.
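For illustration only, a minimal sketch of what such a log line inside canCompact could look like; the variable names uncompactedRatio and minUncompactedRatioThreshold are assumptions, not the actual patch:

  // Sketch: log the concrete reason why the table/partition is skipped.
  // uncompactedRatio and minUncompactedRatioThreshold are illustrative names, not the PR's code.
  if (uncompactedRatio < minUncompactedRatioThreshold) {
    LOG.info("Skipping compaction of {}{}: uncompacted file ratio {} is below the required {}",
        table.getTableName(), partitionPath == null ? "" : "/" + partitionPath,
        uncompactedRatio, minUncompactedRatioThreshold);
    return false;
  }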
Please check my comments!
What changes were proposed in this pull request?
Added a check of whether an Iceberg table/partition requires compaction before starting a manual compaction request.
Why are the changes needed?
Performance improvement to skip tables/partitions which do not require compaction.
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Updated existing q-tests, added new unit test.