-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-26582: Cartesian join fails if the query has an empty table when… #5524
base: master
Are you sure you want to change the base?
Conversation
… cartesian product edge is used
/** | ||
* Extends {@link RelMdMaxRowCount} to get max row count for {@link HiveTableScan} | ||
*/ | ||
public class HiveRelMdMaxRowCount extends RelMdMaxRowCount { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need really need to extend the RelMdMaxRowCount
handler since there are no real inheritance relationships between them; we are adding a genuinely new handler for HiveTableScan
we are not overriding or changing the behavior of the existing ones.
Moreover, the RelMdMaxRowCount
handler is already registered via the DefaultRelMetadataProvider
so its gonna be used anyways for anything that is not a HiveTableScan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, and I will remove inheritance here. Or actually, I will probably just remove this class as suggested below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I understood that it can't be removed completely right now. But I have moved the logic to RelOptHiveTable
RelOptHiveTable table = (RelOptHiveTable) hiveTableScan.getTable(); | ||
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering( | ||
table.getHiveTableMD(), table.getHiveTableMD().getParameters())) { | ||
return null; | ||
} | ||
// if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty | ||
// super returns infinity. | ||
return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) && | ||
StatsUtils.isTableEmpty(table) ? | ||
Double.valueOf(0.0) : | ||
super.getMaxRowCount(hiveTableScan, mq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could possibly make this logic part of the RelOptHiveTable
. Doing this may allow us to drop completely this MetadataHandler class once we upgrade to a more recent Calcite version with https://issues.apache.org/jira/browse/CALCITE-4223
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this can be moved to RelOptHiveTable.
RelOptHiveTable table = (RelOptHiveTable) hiveTableScan.getTable(); | ||
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering( | ||
table.getHiveTableMD(), table.getHiveTableMD().getParameters())) { | ||
return null; | ||
} | ||
// if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty | ||
// super returns infinity. | ||
return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) && | ||
StatsUtils.isTableEmpty(table) ? | ||
Double.valueOf(0.0) : | ||
super.getMaxRowCount(hiveTableScan, mq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably the call to super can be replaced with a big constant or null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will return Double.POSITIVE_INFINITY
when table is not empty.
@@ -61,6 +61,7 @@ | |||
import org.apache.hadoop.hive.ql.metadata.Partition; | |||
import org.apache.hadoop.hive.ql.metadata.PartitionIterable; | |||
import org.apache.hadoop.hive.ql.metadata.Table; | |||
import org.apache.hadoop.hive.ql.optimizer.calcite.RelOptHiveTable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid coupling this class utility class with RelOptHiveTable
maybe we could put this method inside RelOptHiveTable
.
// if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty | ||
// super returns infinity. | ||
return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) && | ||
StatsUtils.isTableEmpty(table) ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the stats are accurate why do we need this check? Can't we get directly the number of rows and return it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a new method mostly because for partitioned tables we need to aggregate BasicStats
from all required partitions.
@@ -0,0 +1,24 @@ | |||
set hive.tez.cartesian-product.enabled=true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the property necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be removed 👍🏼
explain cbo | ||
with | ||
first as ( | ||
select a1 from c where a1 = 3 | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CTEs make the query more complex and add another dimension to the problem. If possible let's just skip it.
Can't we repro the problem with just two tables that one of them is empty? Do we need union all + 3 tables?
Since the query was failing at runtime consider adding also a full execution of the query with the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will simplify the test and also add another query without explain.
In test mode, for explain plans, empty tables won't be pruned, so that checking explain plans is easier with empty tables. However, this property can be set to true when pruning is desired. When the query is not an explain or if Hive is not in test mode empty tables will be pruned.
Quality Gate passedIssues Measures |
… cartesian product edge is used
https://issues.apache.org/jira/browse/HIVE-26582
What changes were proposed in this pull request?
Add a new rule to prune empty tables from plans. Currently the config
HiveZeroMaxRowsRuleConfig
is a copy of Calcite'sZeroMaxRowsRuleConfig
- CALCITE-5314, but it can be removed after upgrading to at least Calcite 1.33Why are the changes needed?
It can lead to issues as described in HIVE-26582
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?