HIVE-26582: Cartesian join fails if the query has an empty table when… #5524

soumyakanti3578 · 2024-10-29T22:00:10Z

… cartesian product edge is used

https://issues.apache.org/jira/browse/HIVE-26582

What changes were proposed in this pull request?

Add a new rule to prune empty tables from plans. Currently the config HiveZeroMaxRowsRuleConfig is a copy of Calcite's ZeroMaxRowsRuleConfig (CALCITE-5314), but it can be removed after upgrading to at least Calcite 1.33.

Why are the changes needed?

A cartesian join can fail when the query has an empty table and a cartesian product edge is used, as described in HIVE-26582.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

mvn test -pl itests/qtest -Pitests -Dtest=TestMiniLlapLocalCliDriver -Dtest.output.overwrite=true -Dqfile=prune_empty_table.q


zabetak · 2024-10-30T13:56:25Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdMaxRowCount.java

+/**
+ * Extends {@link RelMdMaxRowCount} to get max row count for {@link HiveTableScan}
+ */
+public class HiveRelMdMaxRowCount extends RelMdMaxRowCount {


We don't really need to extend the RelMdMaxRowCount handler since there is no real inheritance relationship between them; we are adding a genuinely new handler for HiveTableScan, not overriding or changing the behavior of the existing ones.

Moreover, the RelMdMaxRowCount handler is already registered via the DefaultRelMetadataProvider, so it's going to be used anyway for anything that is not a HiveTableScan.

I agree, and I will remove inheritance here. Or actually, I will probably just remove this class as suggested below.

Actually, I realized that it can't be removed completely right now, but I have moved the logic to RelOptHiveTable.

zabetak · 2024-10-30T14:01:54Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdMaxRowCount.java

+    RelOptHiveTable table = (RelOptHiveTable) hiveTableScan.getTable();
+    if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(
+        table.getHiveTableMD(), table.getHiveTableMD().getParameters())) {
+      return null;
+    }
+    // if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty
+    // super returns infinity. 
+    return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) &&
+        StatsUtils.isTableEmpty(table) ?
+        Double.valueOf(0.0) :
+        super.getMaxRowCount(hiveTableScan, mq);


We could possibly make this logic part of the RelOptHiveTable. Doing this may allow us to drop this MetadataHandler class completely once we upgrade to a more recent Calcite version with https://issues.apache.org/jira/browse/CALCITE-4223.

Yes, this can be moved to RelOptHiveTable.

zabetak · 2024-10-30T14:02:37Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdMaxRowCount.java

+    RelOptHiveTable table = (RelOptHiveTable) hiveTableScan.getTable();
+    if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(
+        table.getHiveTableMD(), table.getHiveTableMD().getParameters())) {
+      return null;
+    }
+    // if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty
+    // super returns infinity. 
+    return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) &&
+        StatsUtils.isTableEmpty(table) ?
+        Double.valueOf(0.0) :
+        super.getMaxRowCount(hiveTableScan, mq);


Probably the call to super can be replaced with a big constant or null.

I will return Double.POSITIVE_INFINITY when the table is not empty.
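To make the agreed behavior concrete, here is a minimal, self-contained sketch of the three outcomes discussed in this thread. It is not the code in the PR: the class `MaxRowCountSketch` and its boolean parameters are hypothetical stand-ins for the checks done through `StatsUtils` and the dummy-table comparison in the quoted hunk.

```java
/** Hypothetical illustration only; none of these names exist in Hive or Calcite. */
final class MaxRowCountSketch {

    /**
     * Mirrors the decision in the quoted hunk, with the call to super replaced
     * by a constant as proposed in the reply above.
     *
     * @param statsUpToDate whether basic stats can be trusted for query answering
     * @param dummyTable    whether the scan targets the analyzer's dummy table
     * @param empty         whether the (trusted) stats report zero rows
     * @return null if nothing can be claimed, 0.0 for a known-empty real table,
     *         positive infinity otherwise
     */
    static Double maxRowCount(boolean statsUpToDate, boolean dummyTable, boolean empty) {
        if (!statsUpToDate) {
            return null; // stats are stale: no upper bound can be claimed
        }
        if (!dummyTable && empty) {
            return 0.0; // real table with accurate stats and no rows
        }
        return Double.POSITIVE_INFINITY; // no useful upper bound
    }

    public static void main(String[] args) {
        System.out.println(maxRowCount(false, false, true));
        System.out.println(maxRowCount(true, false, true));
        System.out.println(maxRowCount(true, true, true));
        System.out.println(maxRowCount(true, false, false));
    }
}
```

Returning a constant instead of calling super keeps the method independent of the parent handler, which is what makes dropping the inheritance possible.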

zabetak · 2024-10-30T14:05:13Z

ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

@@ -61,6 +61,7 @@
 import org.apache.hadoop.hive.ql.metadata.Partition;
 import org.apache.hadoop.hive.ql.metadata.PartitionIterable;
 import org.apache.hadoop.hive.ql.metadata.Table;
+import org.apache.hadoop.hive.ql.optimizer.calcite.RelOptHiveTable;


To avoid coupling this utility class with RelOptHiveTable, maybe we could put this method inside RelOptHiveTable.

zabetak · 2024-10-30T14:06:05Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdMaxRowCount.java

+    // if basic stats are up-to-date and the table is not dummy, return 0.0D if table is empty
+    // super returns infinity. 
+    return !table.getName().equals(SemanticAnalyzer.DUMMY_DATABASE + "." + SemanticAnalyzer.DUMMY_TABLE) &&
+        StatsUtils.isTableEmpty(table) ?


If the stats are accurate, why do we need this check? Can't we directly get the number of rows and return it?

We need a new method mostly because for partitioned tables we need to aggregate BasicStats from all required partitions.
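As a rough illustration of why a single row-count lookup is not enough for partitioned tables, the sketch below sums per-partition row counts before deciding emptiness. Everything here is an assumption for illustration: `PartitionStatsSketch`, the `UNKNOWN` sentinel, and the choice to treat a scan with no required partitions as empty are not taken from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; this is not Hive's BasicStats API. */
final class PartitionStatsSketch {

    /** Assumed convention: a negative row count means the partition's stats are unknown. */
    static final long UNKNOWN = -1L;

    /** Sums the row counts of the required partitions, or returns UNKNOWN if any is unknown. */
    static long totalRows(List<Long> partitionRowCounts) {
        long total = 0L;
        for (long rows : partitionRowCounts) {
            if (rows < 0) {
                return UNKNOWN; // one unreliable partition makes the total unreliable
            }
            total += rows;
        }
        return total;
    }

    /** The table counts as empty only if every required partition is known to have zero rows. */
    static boolean isTableEmpty(List<Long> partitionRowCounts) {
        return totalRows(partitionRowCounts) == 0L;
    }

    public static void main(String[] args) {
        System.out.println(isTableEmpty(Arrays.asList(0L, 0L)));
        System.out.println(isTableEmpty(Arrays.asList(0L, 5L)));
        System.out.println(isTableEmpty(Arrays.asList(0L, UNKNOWN)));
        System.out.println(isTableEmpty(Collections.<Long>emptyList()));
    }
}
```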

zabetak · 2024-10-30T14:06:42Z

ql/src/test/queries/clientpositive/prune_empty_table.q

@@ -0,0 +1,24 @@
+set hive.tez.cartesian-product.enabled=true;


Is the property necessary?

It can be removed 👍🏼

zabetak · 2024-10-30T14:11:22Z

ql/src/test/queries/clientpositive/prune_empty_table.q

+explain cbo
+with
+first as (
+select a1 from c where a1 = 3
+),


CTEs make the query more complex and add another dimension to the problem. If possible let's just skip it.

Can't we repro the problem with just two tables, one of which is empty? Do we need union all + 3 tables?

Since the query was failing at runtime consider adding also a full execution of the query with the results.

I will simplify the test and also add another query without explain.

In test mode, for explain plans, empty tables won't be pruned, so that checking explain plans is easier with empty tables. However, this property can be set to true when pruning is desired. When the query is not an explain, or if Hive is not in test mode, empty tables will be pruned.
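The rule described in the comment above can be summarized as a small truth table. The sketch below is one reading of that comment, not the PR's code; `PruneDecisionSketch` and the `pruneRequested` flag (standing in for the unnamed property) are hypothetical.

```java
/** Hypothetical illustration only; the real decision lives in the PR's Hive code. */
final class PruneDecisionSketch {

    /**
     * @param inTestMode     whether Hive runs in test mode
     * @param isExplain      whether the query is an explain
     * @param pruneRequested assumed stand-in for the property that forces pruning
     * @return true if empty tables should be pruned from the plan
     */
    static boolean shouldPrune(boolean inTestMode, boolean isExplain, boolean pruneRequested) {
        // Pruning is skipped only for explain plans in test mode, unless explicitly requested.
        return !(inTestMode && isExplain) || pruneRequested;
    }

    public static void main(String[] args) {
        System.out.println(shouldPrune(true, true, false));
        System.out.println(shouldPrune(true, true, true));
        System.out.println(shouldPrune(true, false, false));
        System.out.println(shouldPrune(false, true, false));
    }
}
```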

sonarcloud · 2024-10-31T07:35:44Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

HIVE-26582: Cartesian join fails if the query has an empty table when…

1b70386

… cartesian product edge is used

asf-ci-hive added tests pending tests unstable and removed tests pending labels Oct 29, 2024

zabetak reviewed Oct 30, 2024


soumyakanti3578 added 2 commits October 30, 2024 22:33

Move most of the logic into RelOptHiveTable

6101547

asf-ci-hive added tests pending and removed tests unstable labels Oct 31, 2024

asf-ci-hive added tests unstable and removed tests pending labels Oct 31, 2024
