Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-26582: Cartesian join fails if the query has an empty table when… #5524

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

soumyakanti3578
Copy link
Contributor

… cartesian product edge is used

https://issues.apache.org/jira/browse/HIVE-26582

What changes were proposed in this pull request?

Add a new rule to prune empty tables from plans. Currently the config HiveZeroMaxRowsRuleConfig is a copy of Calcite's ZeroMaxRowsRuleConfig - CALCITE-5314, but it can be removed after upgrading to at least Calcite 1.33

Why are the changes needed?

It can lead to issues as described in HIVE-26582

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

mvn test -pl itests/qtest  -Pitests -Dtest=TestMiniLlapLocalCliDriver -Dtest.output.overwrite=true -Dqfile=prune_empty_table.q

/**
* Extends {@link RelMdMaxRowCount} to get max row count for {@link HiveTableScan}
*/
public class HiveRelMdMaxRowCount extends RelMdMaxRowCount {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need really need to extend the RelMdMaxRowCount handler since there are no real inheritance relationships between them; we are adding a genuinely new handler for HiveTableScan we are not overriding or changing the behavior of the existing ones.

Moreover, the RelMdMaxRowCount handler is already registered via the DefaultRelMetadataProvider so its gonna be used anyways for anything that is not a HiveTableScan.

Copy link
Contributor Author

@soumyakanti3578 soumyakanti3578 Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and I will remove inheritance here. Or actually, I will probably just remove this class as suggested below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I understood that it can't be removed completely right now. But I have moved the logic to RelOptHiveTable

Comment on lines 42 to 52
RelOptHiveTable table = (RelOptHiveTable) hiveTableScan.getTable();
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(
table.getHiveTableMD(), table.getHiveTableMD().getParameters())) {
return null;
}
// if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty
// super returns infinity.
return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) &&
StatsUtils.isTableEmpty(table) ?
Double.valueOf(0.0) :
super.getMaxRowCount(hiveTableScan, mq);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could possibly make this logic part of the RelOptHiveTable. Doing this may allow us to drop completely this MetadataHandler class once we upgrade to a more recent Calcite version with https://issues.apache.org/jira/browse/CALCITE-4223

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this can be moved to RelOptHiveTable.

Comment on lines 42 to 52
RelOptHiveTable table = (RelOptHiveTable) hiveTableScan.getTable();
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(
table.getHiveTableMD(), table.getHiveTableMD().getParameters())) {
return null;
}
// if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty
// super returns infinity.
return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) &&
StatsUtils.isTableEmpty(table) ?
Double.valueOf(0.0) :
super.getMaxRowCount(hiveTableScan, mq);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the call to super can be replaced with a big constant or null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will return Double.POSITIVE_INFINITY when table is not empty.

@@ -61,6 +61,7 @@
import org.apache.hadoop.hive.ql.metadata.Partition;
import org.apache.hadoop.hive.ql.metadata.PartitionIterable;
import org.apache.hadoop.hive.ql.metadata.Table;
import org.apache.hadoop.hive.ql.optimizer.calcite.RelOptHiveTable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid coupling this class utility class with RelOptHiveTable maybe we could put this method inside RelOptHiveTable.

// if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty
// super returns infinity.
return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) &&
StatsUtils.isTableEmpty(table) ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the stats are accurate why do we need this check? Can't we get directly the number of rows and return it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a new method mostly because for partitioned tables we need to aggregate BasicStats from all required partitions.

@@ -0,0 +1,24 @@
set hive.tez.cartesian-product.enabled=true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the property necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be removed 👍🏼

Comment on lines 10 to 14
explain cbo
with
first as (
select a1 from c where a1 = 3
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CTEs make the query more complex and add another dimension to the problem. If possible let's just skip it.

Can't we repro the problem with just two tables that one of them is empty? Do we need union all + 3 tables?

Since the query was failing at runtime consider adding also a full execution of the query with the results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will simplify the test and also add another query without explain.

In test mode, for explain plans, empty tables won't be pruned, so that
checking explain plans is easier with empty tables. However, this
property can be set to true when pruning is desired.

When the query is not an explain or if Hive is not in test mode empty
tables will be pruned.
Copy link

sonarcloud bot commented Oct 31, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants