Commit
Generic stored procedure parsing (#1047)
Here we create an ANTLR grammar library mechanism for including grammars, homogenize certain high-level rule names, and implement a shared grammar for stored procedures. The shared grammar caters for TSQL and Snowflake stored procedures, but is intended to be expanded to all future SQL dialects: standard SQL stored procedures are all very similar, and dialect-specific extensions such as LANGUAGE xyz in Snowflake do not create unresolvable ambiguities.

The lexers for Snowflake and TSQL have been redesigned to share many common tokens, leaving only a few in the dialect-specific lexers, where merging them would take too much time at the moment. The intention is to eventually have a common lexer for all dialects, once the over-specified tokens (such as option names) are replaced with `id` or `genericOption`.

This PR is only partially finished: the complete stored procedure grammar is not yet defined, and the rules for TSQL and Snowflake are yet to be combined. This is left for a future PR.

NB: The ANTLR grammar linter will have to be upgraded to cater for library grammars and to process includes, as it can currently process only a single grammar. For now, the linting build step is changed to continue on error.
Showing 35 changed files with 3,442 additions and 3,551 deletions.
14 changes: 14 additions & 0 deletions
core/src/main/antlr4/com/databricks/labs/remorph/parsers/lib/README.md
@@ -0,0 +1,14 @@
# ANTLR Grammar Library

This directory contains ANTLR grammar files that are common to more than one SQL dialect, such as the grammar that covers stored procedures, which all
dialects of SQL support in some form, and for which we have a universal grammar.

ANTLR processes included grammars as pure text, in the same way that, say, the C pre-processor processes `#include` directives.
This means that you must be careful to ensure that:
- if you define new tokens in an included grammar, they do not clash with tokens in the including grammar.
- if you define new rules in an included grammar, they do not clash with rules in the including grammar.
In particular, you must avoid creating ambiguities in rule/token prediction: ANTLR will still try to create
a parser, but the generated code performs extremely long token lookahead and is therefore very slow.

In other words, you cannot just arbitrarily throw together some common Lexer and Parser rules and expect them
to just work.
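
The two "do not clash" checks in the README can be roughly automated before ANTLR is run. The following is a minimal sketch in Python, not part of this commit: it assumes rule definitions start in column 0 (optionally prefixed with `fragment`), strips comments naively, and uses two made-up grammar fragments. The grammar name `DialectLexer` and the helper names `rule_names` and `find_clashes` are hypothetical.

```python
from __future__ import annotations

import re

# Naive pattern (an assumption, not ANTLR's own parser): a rule name at the
# start of a line, optionally preceded by `fragment`, followed by a colon
# that may sit on the next line.
RULE_DEF = re.compile(r"^(?:fragment\s+)?([A-Za-z_]\w*)\s*:", re.MULTILINE)

# Naive comment stripping; string literals that contain comment markers
# are not handled specially.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def rule_names(grammar_text: str) -> set[str]:
    """Return the names of all lexer and parser rules defined in the text."""
    return set(RULE_DEF.findall(COMMENT.sub("", grammar_text)))


def find_clashes(library_text: str, dialect_text: str) -> dict[str, list[str]]:
    """Report names defined in both grammars, split by ANTLR naming convention."""
    shared = rule_names(library_text) & rule_names(dialect_text)
    return {
        # Lexer rule (token) names start with an upper-case letter.
        "tokens": sorted(n for n in shared if n[0].isupper()),
        # Parser rule names start with a lower-case letter.
        "rules": sorted(n for n in shared if not n[0].isupper()),
    }


# Hypothetical library grammar fragment.
LIBRARY = """
lexer grammar commonlex;
CREATE: 'CREATE';
PROCEDURE: 'PROCEDURE';
ID: [A-Za-z_]+;
"""

# Hypothetical including grammar that redefines ID.
DIALECT = """
lexer grammar DialectLexer;
import commonlex;
// ID is redefined here, so it clashes with the library token.
ID: [A-Za-z_][A-Za-z0-9_]*;
LANGUAGE: 'LANGUAGE';
"""

clashes = find_clashes(LIBRARY, DIALECT)
print(clashes)
```

A check like this only finds duplicate names. It cannot detect the prediction ambiguities described above, which still need to be caught by inspecting ANTLR's output and the generated parser's performance.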
1,469 changes: 1,469 additions & 0 deletions
core/src/main/antlr4/com/databricks/labs/remorph/parsers/lib/commonlex.g4
Large diffs are not rendered by default.