Generic stored procedure parsing (#1047)
Here we create an ANTLR grammar library mechanism for including
grammars, homogenize certain high-level rule names, and implement a
shared grammar for stored procedures.

The shared grammar caters for TSQL and Snowflake stored procedures, but
is intended to be expanded for all future SQL dialects, as the standard
SQL stored procedures are all very similar and dialect-specific
extensions such as LANGUAGE xyz in Snowflake do not create unresolvable
ambiguities.

The lexers for Snowflake and TSQL have been redesigned to share many
common tokens, leaving only a few in the dialect-specific lexers, where
merging them would take too much time at the moment. The intention is to
eventually have a common lexer for all dialects, once the over-specified
tokens (such as option names) are replaced with `id` or `genericOption`.

This PR is only partially finished: the complete stored procedure
grammar is not yet defined, and the rules for TSQL and Snowflake are
yet to be combined. This is left for a future PR.

NB: The ANTLR grammar linter will have to be upgraded to cater for
library grammars and to process includes as it is currently only able to
process a single grammar. For now, the linting build step is changed to
continue on error.
jimidle authored Nov 8, 2024
1 parent 4486e58 commit c6baa47
Showing 35 changed files with 3,442 additions and 3,551 deletions.
1 change: 1 addition & 0 deletions .github/workflows/push.yml
@@ -278,3 +278,4 @@ jobs:

- name: Run Lint Test with Maven
  run: mvn compile -DskipTests --update-snapshots -B exec:java -pl linter --file pom.xml -Dexec.args="-i core/src/main/antlr4 -o .venv/linter/grammar -c true"
  continue-on-error: true
3 changes: 2 additions & 1 deletion .gitignore
@@ -17,4 +17,5 @@ spark-warehouse/
remorph_transpile/
/linter/gen/
/linter/src/main/antlr4/library/gen/
.databricks-login.json
.databricks-login.json
/core/src/main/antlr4/com/databricks/labs/remorph/parsers/*/gen/
9 changes: 9 additions & 0 deletions core/pom.xml
@@ -219,6 +219,15 @@
<listener>false</listener>
<sourceDirectory>src/main/antlr4</sourceDirectory>
<treatWarningsAsErrors>true</treatWarningsAsErrors>
<libDirectory>${project.basedir}/src/main/antlr4/com/databricks/labs/remorph/parsers/lib</libDirectory>
<outputDirectory>${project.build.directory}/generated-sources/antlr4</outputDirectory>
<includes>
<include>**/*.g4</include> <!-- Include all .g4 files -->
</includes>
<excludes>
<exclude>**/lib/*.g4</exclude> <!-- But exclude the library grammars-->
<exclude>**/basesnowflake.g4</exclude>
</excludes>
</configuration>
</plugin>
<plugin>
@@ -0,0 +1,14 @@
# ANTLR Grammar Library

This directory contains ANTLR grammar files that are common to more than one SQL dialect, such as the grammar that covers stored procedures, which all
dialects of SQL support in some form, and for which we have a universal grammar.

ANTLR processes included grammars as pure text, in the same way that, say, the C pre-processor processes `#include` directives.
This means that you must be careful to ensure that:
- if you define new tokens in an included grammar, they do not clash with tokens in the including grammar.
- if you define new rules in an included grammar, they do not clash with rules in the including grammar.

In particular, you must avoid creating ambiguities in rule/token prediction. ANTLR will still try to create
a parser in that case, but the generated code performs extremely long token lookahead and is therefore very slow.

In other words, you cannot just arbitrarily throw together some common Lexer and Parser rules and expect them
to just work.
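
As a minimal sketch of what an including grammar might look like: `ExampleLexer` and its token are hypothetical names used only for illustration, not files or rules from this commit. The only names taken from the commit are `commonlex` (the library lexer in this directory) and the `libDirectory` setting in `core/pom.xml`.

```antlr
// ExampleLexer.g4: a hypothetical dialect lexer, not part of this commit.
lexer grammar ExampleLexer;

// Pull in the shared lexer rules from this library directory. The Maven
// plugin locates commonlex.g4 through <libDirectory> in core/pom.xml.
import commonlex;

// A dialect-specific token. The name is invented for this example and is
// assumed not to collide with any token already defined in commonlex.
EXAMPLE_DIALECT_KEYWORD: 'EXAMPLE_DIALECT_KEYWORD';
```

Before adding a token or rule like this, it is worth checking the library grammar for an existing definition with the same name or the same matching text.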
1,469 changes: 1,469 additions & 0 deletions core/src/main/antlr4/com/databricks/labs/remorph/parsers/lib/commonlex.g4

Large diffs are not rendered by default.
