Generic stored procedure parsing (#1047)

Here we create an ANTLR grammar library mechanism for including grammars, homogenize certain high-level rule names, and implement a shared grammar for stored procedures.

The shared grammar caters for TSQL and Snowflake stored procedures, but is intended to be expanded to all future SQL dialects: standard SQL stored procedures are all very similar, and dialect-specific extensions such as `LANGUAGE xyz` in Snowflake do not create unresolvable ambiguities.

The lexers for Snowflake and TSQL have been redesigned to share many common tokens, leaving only a few in the dialect-specific lexers, where they would take too much time to merge at the moment. The intention is to eventually have a common lexer for all dialects, once the over-specified tokens (such as option names) are replaced with `id` or `genericOption`.

This PR is only partially finished: the complete stored procedure grammar is not yet defined, and the rules for TSQL and Snowflake are yet to be combined. This is left for a future PR.

NB: The ANTLR grammar linter will have to be upgraded to cater for library grammars and to process includes, as it is currently only able to process a single grammar. For now, the linting build step is changed to continue on error.
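
As a minimal sketch of how the library mechanism might be used, assuming it relies on ANTLR's standard `import` statement together with the `libDirectory` setting added to `core/pom.xml` in this commit: the file name `commonlex.g4` comes from this commit, while `HypotheticalDialectLexer` and every token rule shown are illustrative names, not the actual grammar contents.

```antlr
// ---- File 1: lib/commonlex.g4 (library grammar; token rules are illustrative) ----
lexer grammar commonlex;

CREATE    : 'CREATE';
PROCEDURE : 'PROCEDURE';
ID        : [A-Za-z_] [A-Za-z0-9_]*;
WS        : [ \t\r\n]+ -> skip;

// ---- File 2: HypotheticalDialectLexer.g4 (hypothetical dialect-specific lexer) ----
lexer grammar HypotheticalDialectLexer;

// Resolved through the ANTLR lib directory, so no path is given here.
import commonlex;

// Only the tokens that are not shared remain in the dialect lexer.
LANGUAGE : 'LANGUAGE';
```

ANTLR places imported lexer rules after the rules of the importing grammar, so in this sketch `LANGUAGE` takes precedence over the imported `ID` rule. The library file itself is never compiled on its own, which is consistent with the `**/lib/*.g4` exclusion in the Maven configuration.
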
databrickslabs · Nov 8, 2024 · c6baa47
1 parent 4486e58
Showing 35 changed files with 3,442 additions and 3,551 deletions.
diff --git a/.github/workflows/push.yml b/.github/workflows/push.yml
@@ -278,3 +278,4 @@ jobs:
 
       - name: Run Lint Test with Maven
         run: mvn compile -DskipTests --update-snapshots -B exec:java -pl linter --file pom.xml -Dexec.args="-i core/src/main/antlr4 -o .venv/linter/grammar -c true"
+        continue-on-error: true
diff --git a/.gitignore b/.gitignore
@@ -17,4 +17,5 @@ spark-warehouse/
 remorph_transpile/
 /linter/gen/
 /linter/src/main/antlr4/library/gen/
-.databricks-login.json
+.databricks-login.json
+/core/src/main/antlr4/com/databricks/labs/remorph/parsers/*/gen/
diff --git a/core/pom.xml b/core/pom.xml
@@ -219,6 +219,15 @@
           <listener>false</listener>
           <sourceDirectory>src/main/antlr4</sourceDirectory>
           <treatWarningsAsErrors>true</treatWarningsAsErrors>
+          <libDirectory>${project.basedir}/src/main/antlr4/com/databricks/labs/remorph/parsers/lib</libDirectory>
+          <outputDirectory>${project.build.directory}/generated-sources/antlr4</outputDirectory>
+          <includes>
+            <include>**/*.g4</include> <!-- Include all .g4 files -->
+          </includes>
+          <excludes>
+            <exclude>**/lib/*.g4</exclude> <!-- But exclude the library grammars-->
+            <exclude>**/basesnowflake.g4</exclude>
+          </excludes>
         </configuration>
       </plugin>
       <plugin>

diff --git a/core/src/main/antlr4/com/databricks/labs/remorph/parsers/lib/README.md b/core/src/main/antlr4/com/databricks/labs/remorph/parsers/lib/README.md
@@ -0,0 +1,14 @@
+# ANTLR Grammar Library
+
+This directory contains ANTLR grammar files that are common to more than one SQL dialect. Such as the grammar that covers stored procedures, which all
+dialects of SQL support in some form, and for which we have a universal grammar.
+
+ANTLR processes included grammars as pure text, in the same way that say the C pre-processor processes `#include` directives. 
+This means that you must be careful to ensure that:
+ - if you define new tokens in an included grammar, that they do not clash with tokens in the including grammar.
+ - if you define new rules in an included grammar, that they do not clash with rules in the including grammar.
+   In particular, you must avoid creating ambiguities in rule/token prediction, where ANTLR will try to create
+   a parser anyway, but generate code that performs extremely long token lookahead, and is therefore very slow.
+
+In other words, you cannot just arbitrarily throw together some common Lexer and Parser rules and expect them
+to just work.
diff --git a/core/src/main/antlr4/com/databricks/labs/remorph/parsers/lib/commonlex.g4 b/core/src/main/antlr4/com/databricks/labs/remorph/parsers/lib/commonlex.g4