[chore] fixed query coverage report #1160
Merged
Conversation
nfx added a commit that referenced this pull request on Nov 7, 2024
* Added IR for stored procedures ([#1161](#1161)). In this release, we have made significant enhancements to the project by adding support for stored procedures. We have introduced a new `CreateVariable` case class to manage variable creation within the intermediate representation (IR), and removed the `SetVariable` case class as it is now redundant. A new `CaseStatement` class has been added to represent SQL case statements with value match, and a `CompoundStatement` class has been implemented to enable encapsulation of a sequence of logical plans within a single compound statement. The `DeclareCondition`, `DeclareContinueHandler`, and `DeclareExitHandler` case classes have been introduced to handle conditional logic and exit handlers in stored procedures. New classes `DeclareVariable`, `ElseIf`, `ForStatement`, `If`, `Iterate`, `Leave`, `Loop`, `RepeatUntil`, `Return`, `SetVariable`, and `Signal` have been added to the project to provide more comprehensive support for procedural language features and control flow management in stored procedures. We have also included SnowflakeCommandBuilder support for stored procedures and updated the `visitExecuteTask` method to handle stored procedure calls using the `SetVariable` method.

* Added Variant Support ([#998](#998)). In this commit, support for the Variant datatype has been added to the create table functionality, enhancing the system's compatibility with Snowflake's datatypes. A new VariantType has been introduced, which allows for more comprehensive handling of data during create table operations. Additionally, a `remarks VARIANT` line is added in the CREATE TABLE statement and the corresponding spec test has been updated. The Variant datatype is a flexible datatype that can store different types of data, such as arrays, objects, and strings, offering increased functionality for users working with variant data. Furthermore, this change enables the use of the Variant datatype in Snowflake tables and improves the data modeling capabilities of the system.

* Added `PySpark` generator ([#1026](#1026)). The engineering team has developed a new `PySpark` generator for the `com.databricks.labs.remorph.generators` package. This addition introduces a new parameter, `logical`, of type `Generator[ir.LogicalPlan, String]`, in the `SQLGenerator` for SQL queries. A new abstract class `BasePythonGenerator` has been added, which extends the `Generator` class and generates Python code. An `ExpressionGenerator` class has also been added, which extends `BasePythonGenerator` and is responsible for generating Python code for `ir.Expression` objects. A new `LogicalPlanGenerator` class has been added, which extends `BasePythonGenerator` and is responsible for generating Python code for a given `ir.LogicalPlan`. A new `StatementGenerator` class has been implemented, which converts `Statement` objects into Python code. A new Python-generating class, `PythonGenerator`, has been added, which includes the implementation of an abstract syntax tree (AST) for Python in Scala. This AST includes classes for various Python language constructs. Additionally, new implicit classes for `PythonInterpolator`, `PythonOps`, and `PythonSeqOps` have been added to allow for the creation of PySpark code using the Remorph framework. The `AndOrToBitwise` rule has been implemented to convert `And` and `Or` expressions to their bitwise equivalents. The `DotToFCol` rule has been implemented to transform code that references columns using dot notation in a DataFrame to use the `col` function with a string literal of the column name instead. A new `PySparkStatements` object and `PySparkExpressions` class have been added, which provide functionality for transforming expressions in a data processing pipeline to PySpark equivalents. The `SnowflakeToPySparkTranspiler` class has been added to transpile Snowflake queries to PySpark code. A new `PySpark` generator has been added to the `Transpiler` class, which is implemented as an instance of the `SqlGenerator` class. This change enhances the `Transpiler` class with a new `PySpark` generator and improves serialization efficiency.

* Added `debug-bundle` command for folder-to-folder translation ([#1045](#1045)). In this release, we have introduced a `debug-bundle` command to the remorph project's CLI, specifically added to the `proxy_command` function, which already includes the `debug-script`, `debug-me`, and `debug-coverage` commands. This new command enhances the tool's debugging capabilities, allowing developers to generate a bundle of translated queries for folder-to-folder translation tasks. The `debug-bundle` command accepts three flags: `dialect`, `src`, and `dst`, specifying the SQL dialect, source directory, and destination directory, respectively. Furthermore, the update includes refactoring the `FileSetGenerator` class in the `orchestration` package of the `com.databricks.labs.remorph.generators` package, adding a `debug-bundle` command to the `Main` object, and updating the `FileQueryHistoryProvider` method in the `ApplicationContext` trait. These improvements focus on providing a convenient way to convert folder-based SQL scripts to other formats like SQL and PySpark, enhancing the translation capabilities of the project.

* Added `ruff` Python formatter proxy ([#1038](#1038)). In this release, we have added support for the `ruff` Python formatter in our project's continuous integration and development workflow. We have also introduced a new `FORMAT` stage in the `WorkflowStage` object in the `Result` Scala object to include formatting as a separate step in the workflow. A new `RuffFormatter` class has been added to format Python code using the `ruff` tool, and a `StandardInputPythonSubprocess` class has been included to run a Python subprocess and capture its output and errors. Additionally, we have added a proxy for the `ruff` formatter to the SnowflakeToPySparkTranspilerTest for Scala to improve the readability of the transpiled Python code generated by the SnowflakeToPySparkTranspiler. Lastly, we have introduced a new `ruff` formatter proxy in the test code for the transpiler library to enforce format and style conventions in Python code. These changes aim to improve the development and testing experience for the project and ensure that the code follows the desired formatting and style standards.

* Added baseline for translating workflows ([#1042](#1042)). In this release, several new features have been added to the open-source library to improve the translation of workflows. A new dependency for the Jackson YAML data format library, version 2.14.0, has been added to the pom.xml file to enable processing YAML files and converting them to Java objects. A new `FileSet` class has been introduced, which provides an in-memory data structure to manage a set of files, allowing users to add, retrieve, and remove files by name and persist the contents of the files to the file system. A new `FileSetGenerator` class has been added that generates a `FileSet` object from a `JobNode` object, enabling the translation of workflows by generating all necessary files for a workspace. A new `DefineJob` class has been developed to define a new rule for processing `JobNode` objects in the Remorph system, converting instances of `SuccessPy` and `SuccessSQL` into `PythonNotebookTask` and `SqlNotebookTask` objects, respectively. Additionally, various new classes, such as `GenerateBundleFile`, `QueryHistoryToQueryNodes`, `ReformatCode`, `TryGeneratePythonNotebook`, `TryGenerateSQL`, `TrySummarizeFailures`, `InformationFile`, `SuccessPy`, `SuccessSQL`, `FailedQuery`, `Migration`, `PartialQuery`, `QueryPlan`, `RawMigration`, `Comment`, and `PlanComment`, have been introduced to provide a more comprehensive and nuanced job orchestration framework. The `Library` case class has been updated to better separate concerns between library configuration and code assets. These changes address issue [#1042](#1042) and provide a more robust and flexible workflow translation solution.

* Added correct generation of `databricks.yml` for `QueryHistory` ([#1044](#1044)). The FileSet class in the FileSet.scala file has been updated to include a new method that correctly generates the `databricks.yml` file for the `QueryHistory` feature. This file is used for orchestrating cross-compiled queries, creating three files in total - two SQL notebooks with translated and formatted queries and a `databricks.yml` file to define an asset bundle for the queries. The new method in the FileSet class writes the content to the file using the `Files.write` method from the `java.nio.file` package instead of the previously used `PrintWriter`. The FileSetGenerator class has been updated to include the new `databricks.yml` file generation, and new rules and methods have been added to improve the accuracy and consistency of schema definitions in the generated orchestration files. Additionally, the DefineJob and DefineSchemas classes have been introduced to simplify the orchestration generation process.

* Added documentation around Transformation ([#1043](#1043)). In this release, the Transformation class in our open-source library has been enhanced with detailed documentation, type parameters, and new methods. The class represents a stateful computation that produces an output of type Out while managing a state of type State. The new methods include map and flatMap for modifying the output and chaining transformations, as well as run and runAndDiscardState for executing the computation with a given initial state and producing a Result containing the output and state. Additionally, we have introduced a new trait called TransformationConstructors that provides constructors for successful transformations, error transformations, lifted results, state retrieval, replacement, and updates. The CodeGenerator trait in our code generation library has also been updated with several new methods for more control and flexibility in the code generation process. These include commas and spaces for formatting output, updateGenCtx for updating the GeneratorContext, nest and unnest for indentation, withIndentedBlock for producing nested blocks of code, and withGenCtx for creating transformations that use the current GeneratorContext. A simplified sketch of the `Transformation` abstraction appears after this list.

* Added tests for Snow ARRAY_REMOVE function ([#979](#979)). In this release, we have added tests for the Snowflake ARRAY_REMOVE function in the SnowflakeToDatabricksTranspilerTest. The tests, currently ignored, demonstrate the usage of the ARRAY_REMOVE function with different data types, such as integers and doubles. A TODO comment is included for a test case involving VARCHAR casting, to be enabled once the necessary casting functionality is implemented. This update enhances the library's capabilities and ensures that the ARRAY_REMOVE function can handle a variety of data types. Software engineers can refer to these tests to understand the usage of the ARRAY_REMOVE function in the transpiler and the planned casting functionality.

* Avoid non local return ([#1052](#1052)). In this release, the `render` method of the `generators` package object in the `com.databricks.labs.remorph` package has been updated to avoid using non-local returns and follow recommended coding practices. Instead of returning early from the method, it now uses `Option` to track failures and a `try-catch` block to handle exceptions. In case of an exception during string concatenation, the method sets the `failureOpt` variable to `Some(lift(KoResult(WorkflowStage.GENERATE, UncaughtException(e))))`. Additionally, the test file "CodeInterpolatorSpec.scala" has been modified to fix an issue with exception handling. In the updated code, new variables for each argument are introduced, and the problematic code is placed within an interpolated string, allowing for proper exception handling. This release enhances the robustness and reliability of the code interpolator and ensures that the method follows recommended coding practices.

* Collect errors in `Phase` ([#1046](#1046)). The open-source library Remorph has received significant updates, focusing on enhancing error collection and simplifying the Transformation class. The changes include a new method `recordError` in the abstract Phase trait and its concrete implementations for collecting errors during each phase. The Transformation class has been simplified by removing the unused Phase parameter, while the Generator, CodeGenerator, and FileSetGenerator have been specialized to use Transformation without the Phase parameter. The TryGeneratePythonNotebook, TryGenerateSQL, CodeInterpolator, and TBASeqOps classes have been updated for a more concise and focused state. The imports have been streamlined, and the PySparkGenerator, SQLGenerator, and PlanParser have been modified to remove the unused Phase type parameter. A new test file, TransformationTest, has been added to check the error collection functionality in the Transformation class. Overall, these enhancements improve the reliability, readability, and maintainability of the Remorph library.

* Correctly generate `F.fn_name` for builtin PySpark functions ([#1037](#1037)). This commit introduces changes to the generation of `F.fn_name` for builtin PySpark functions in the PySparkExpressions.scala file, specifically for PySpark's builtin functions (`fn`). It includes a new case to handle these functions by converting them to their lowercase equivalent in Python using `Locale.getDefault`. Additionally, changes have been made to handle window specifications more accurately, such as using `ImportClassSideEffect` with `windowSpec` and generating and applying a window function (`fn`) over it. The `LAST_VALUE` function has been modified to `LAST` in the SnowflakeToDatabricksTranspilerTest.scala file, and new methods such as `First`, `Last`, `Lag`, `Lead`, and `NthValue` have been added to the SnowflakeCallMapper class. These changes improve the accuracy, flexibility, and compatibility of the generated PySpark code when working with built-in functions and window specifications, making the codebase more readable, maintainable, and efficient.

* Create Command Extended ([#1033](#1033)). In this release, the open-source library has been updated with several new features related to table management and SQL code generation. A new method `replaceTable` has been added to the `LogicalPlanGenerator` class, which generates SQL code for a `ReplaceTableCommand` IR node and replaces an existing table with the same name if it already exists. Additionally, support has been added for generating SQL code for an `IdentityConstraint` IR node, which specifies whether a column is an auto-incrementing identity column. The `CREATE TABLE` statement has been updated to include the `AUTOINCREMENT` and `REPLACE` constraints, and a new `IdentityConstraint` case class has been introduced to extend the capabilities of the `UnnamedConstraint` class. The `TSqlDDLBuilder` class has also been updated to handle the `IDENTITY` keyword more effectively. A new command implementation with `AUTOINCREMENT` and `REPLACE` constraints has been added, and a new SQL script has been included in the functional tests for testing CREATE DDL statements with identity columns. Finally, the SQL transpiler has been updated to support the `CREATE OR REPLACE PROCEDURE` syntax, providing more flexibility and convenience for users working with stored procedures in their SQL code. These updates aim to improve the functionality and ease of use of the open-source library for software engineers working with SQL code and table management.

* Don't draft automated releases ([#995](#995)). In this release, we have made a modification to the release.yml file in the .github/workflows directory by removing the "draft: true" line. This change removes the creation of draft releases in the automated release process, simplifying it and making it more straightforward for users to access new versions of the software. The job section of the release.yml file now only includes the `release` job, with "release-signing-artifacts: true" still enabled, ensuring that the artifacts are signed. This improvement enhances the overall release process, making it more efficient and user-friendly.

* Enhance the Snow ARRAY_SORT function support ([#994](#994)). With this release, the Snowflake ARRAY_SORT function now supports Boolean literals as parameters, improving its functionality. The changes include validating Boolean parameters in SnowflakeCallMapper.scala, throwing a TranspileException for unsupported arguments, and simplifying the IR using the DBSQL SORT_ARRAY function. Additionally, new test cases have been added to SnowflakeCallMapperSpec for the ARRAY_SORT and ARRAY_SLICE functions. The SnowflakeToDatabricksTranspilerTest class has also been updated with new test cases that cover the enhanced ARRAY_SORT function's various overloads and combinations of Boolean literals, NULLs, and a custom sorting function. This ensures that invalid usage is caught during transpilation, providing better error handling and supporting more use cases.

* Ensure that unparsable text is not lost in the generated output ([#1012](#1012)). In this release, we have implemented an enhancement to the error handling strategy in the ANTLR-generated parsers for SQL. This change records where parsing failed and gathers un-parsable input, preserving it as custom error nodes in the ParseTree at strategic points. The new custom error strategy allows visitors for higher-level rules such as `sqlCommand` in Snowflake and `sqlClauses` in TSQL to check for an error node in the children and generate an Ir node representing the un-parsed text. Additionally, new methods have been introduced to handle error recovery, find the highest context in the tree for the particular parser, and recursively find the context. A separate improvement is planned to ensure the PlanParser no longer stops when syntax errors are discovered, allowing safe traversal of the ParseTree. This feature is targeted towards software engineers working with SQL parsing and aims to improve error handling and recovery.

* Fetch table definitions for TSQL ([#986](#986)). This pull request introduces a new `TableDefinition` case class that encapsulates metadata properties for tables in TSQL, such as catalog name, schema name, table name, location, table format, view definition, columns, table size, and comments. A `TSqlTableDefinitions` class has been added with methods to retrieve table definitions, all schemas, and all catalogs from TSQL. The `SnowflakeTypeBuilder` is updated to parse data types from TSQL. The `SnowflakeTableDefinitions` class has been refactored to use the new `TableDefinition` case class and retrieve table definitions more efficiently. The changes also include adding two new test cases to verify the correct retrieval of table definitions and catalogs for TSQL. An illustrative sketch of the `TableDefinition` shape appears after this list.

* Fixed handling of projected expressions in `TreeNode` ([#1159](#1159)). In this release, we have addressed the handling of projected expressions in the `TreeNode` class, resolving issues [#1072](#1072) and [#1159](#1159). The `expressions` method in the `Plan` abstract class has been modified to include the `final` keyword, restricting overriding in subclasses. This method now returns all expressions present in a query from the current plan operator and its descendants. Additionally, we have introduced a new private method, `seqToExpressions`, used for recursively finding all expressions from a given sequence. The `Project` class, representing a relational algebra operation that projects a subset of columns in a table, now utilizes a new `columns` parameter instead of `expressions`. Similar changes have been applied to other classes extending `UnaryNode`, such as `Join`, `Deduplicate`, and `Hint`. The `values` parameter of the `Values` class has also been altered to accurately represent input values. A new test class, `JoinTest`, has been introduced to verify the correct propagation of expressions in join nodes, ensuring intended data transformations.

* Handling any_keys_match from presto ([#1048](#1048)). In this commit, we have added support for the `any_keys_match` Presto function in Databricks by implementing it using existing Databricks functions. The `any_keys_match` function checks if any keys in a map match a given condition. Specifically, we have introduced two new classes, `MapKeys` and `ArrayExists`, which allow us to extract keys from the input map and check if any of the keys satisfy the given condition using the `exists` function. This is accomplished by renaming `exists` to `array_exists` to better reflect its purpose. Additionally, we have provided a Databricks SQL query that mimics the behavior of the `any_keys_match` function in Presto and added tests to ensure that it works as expected. These changes enable users to perform equivalent operations with a consistent syntax in Databricks and Presto. A small IR sketch of this rewrite appears after this list.

* Improve IR for job nodes ([#1041](#1041)). The open-source library has undergone improvements to the Intermediate Representation (IR) for job nodes. This release introduces several significant changes, including:
  - Refactoring of the `JobNode` class to extend the `TreeNode` class, and the addition of a new abstract class `LeafJobNode` that overrides the `children` method to always return an empty `Seq`.
  - Enhancements to the `ClusterSpec` case class, which now includes a `toSDK` method that properly initializes and sets the values of the fields in the SDK `ClusterSpec` object.
  - Improvements to the `NewClusterSpec` class, which updates the types of several optional fields and introduces changes to the `toSDK` method for better conversion to the SDK format.
  - Removal of the `Job` class, which previously represented a job in the IR of workflows.
  - Changes to the `JobCluster` case class, which updates the `newCluster` attribute from `ClusterSpec` to `NewClusterSpec`.
  - Update to the `JobEmailNotifications` class, which now extends `LeafJobNode` and includes new methods and overrides existing ones from `LeafJobNode`.
  - Improvements to the `JobNotificationSettings` class, which replaces the original `toSDK` method with a new implementation for a more accurate SDK representation of job notification settings.
  - Refactoring of the `JobParameterDefinition` class, which updates the `toSDK` method for more efficient conversion to the SDK format.

  These changes simplify the representation of job nodes, align the codebase more closely with the underlying SDK, and improve overall code maintainability and compatibility with other Remorph components.

* Query History From Folder ([#991](#991)). The Estimator class in the Remorph project has been updated to enhance the query history interface by adding metadata read from a folder, improving its ability to handle queries from different users and increasing the accuracy of estimation reports. The Anonymizer class has also been updated to handle cases where the user field is missing, ensuring the anonymization process can proceed smoothly and preventing potential errors. A new FileQueryHistory class has been added to provide query history functionality by reading metadata from a specified folder. The SnowflakeQueryHistory class has been updated to directly implement the history() method and include new fields in the ExecutedQuery objects, such as 'id', 'source', 'timestamp', 'duration', 'user', and 'filename'. A new ExecutedQuery case class has been introduced, which now includes optional `user` and `filename` fields, and a new QueryHistoryProvider trait has been added with a method history() that returns a QueryHistory object containing a sequence of ExecutedQuery objects, enhancing the query history interface's flexibility and power. Test suites and test data for the Anonymizer and TableGraph classes have been updated to accommodate these changes, allowing for more comprehensive testing of query history functionality. A FileQueryHistorySpec test file has been added to test the FileQueryHistory class's ability to correctly extract queries from SQL files, ensuring the class works as expected.

* Rework serialization using circe+jackson ([#1163](#1163)). In pull request [#1163](#1163), the serialization mechanism in the project has been refactored to use the Circe and Jackson libraries, replacing the existing ujson library. This change includes the addition of the Circe, Circe-generic-extras, and Circe-jackson libraries, which are licensed under the Apache 2.0 license. The project now includes the copyright notices and license information for all open-source projects that have contributed code to it, ensuring compliance with open-source licenses. The `CoverageTest` class has been updated to incorporate error encoding using the Circe and Jackson libraries, and the `EstimationReport` case classes no longer have implicit `ReadWriter` instances defined using macroRW. Instead, Circe and Jackson encode and decode instances are likely defined elsewhere in the codebase. The `BaseQueryRunner` abstract class has been updated to handle both parsing and transpilation errors in a more uniform way, using a `failures` field instead of `transpilation_error` or `parsing_error`. Additionally, a new file, `encoders.scala`, has been introduced, which defines encoders for serializing objects to JSON using the Circe and Jackson libraries. These changes aim to improve serialization and deserialization performance and capabilities, simplify the codebase, and ensure consistent and readable output.

* Some window functions does not support window frame conditions ([#999](#999)). The Snowflake expression builder has been updated to correct the default window frame specifications for certain window functions and modify the behavior of the ORDER BY clause in these functions. This change ensures that the expression builder generates the correct SQL syntax for functions that do not support window frames, such as "LAG", "DENSE_RANK", "LEAD", "PERCENT_RANK", "RANK", and "ROW_NUMBER", improving the compatibility and reliability of the generated queries. Additionally, a unit test for the `SnowflakeExpressionBuilder` has been updated to account for changes in the way window functions are handled, enhancing the accuracy of the builder in generating valid SQL for window functions in Snowflake.

* Split workflow definitions into sensible packages ([#1039](#1039)). The AutoScale class has been refactored and moved to a new package, `com.databricks.labs.remorph.intermediate.workflows.clusters`, extending `JobNode` from `com.databricks.labs.remorph.intermediate.workflows`. It now includes a case class for auto-scaling that takes optional integer arguments `maxWorkers` and `minWorkers`, and a single method `apply` that creates and configures a cluster using the Databricks SDK's `ComputeService`. The AwsAttributes and AzureAttributes classes have also been moved to the `com.databricks.labs.remorph.intermediate.workflows.clusters` package and now extend `JobNode`. These classes manage AWS and Azure-related attributes for compute resources in a workflow. The ClientsTypes case class has been moved to a new clusters sub-package within the workflows package and now extends `JobNode`, and the ClusterLogConf class has been moved to a new clusters package. The JobDeployment class has been refactored and moved to `com.databricks.labs.remorph.intermediate.workflows.jobs`, and the JobEmailNotifications, JobsHealthRule, and WorkspaceStorageInfo classes have been moved to new packages and now import `JobNode`. These changes improve the organization and maintainability of the codebase, making it easier to understand and navigate.

* TO_NUMBER/TO_DECIMAL/TO_NUMERIC without precision and scale ([#1053](#1053)). This pull request introduces improvements to the transpilation process for handling cases where precision and scale are not specified for the TO_NUMBER, TO_DECIMAL, or TO_NUMERIC Snowflake functions. The updated transpiler now automatically applies default values when these parameters are omitted, with precision set to the maximum allowed value of 38 and scale set to 0. A new method has been added to manage these cases, and four new test cases have been included to verify the transpilation of TO_NUMBER and TO_DECIMAL functions without specified precision and scale, and with various input formats. This change ensures consistent behavior across different SQL dialects for cases where precision and scale are not explicitly defined in the conversion functions. A small sketch of the default handling appears after this list.

* Table comments captured as part of Snowflake Table Definition ([#989](#989)). In this release, we have added support for capturing table comments as part of Snowflake Table Definitions in the remorph library. This includes modifying the TableDefinition case class to include an optional comment field, and updating the SQL query in the SnowflakeTableDefinitions class to retrieve table comments. A new integration test for Snowflake table definitions has also been introduced to ensure the proper functioning of the new feature. This test creates a connection to the Snowflake database, retrieves a list of table definitions, and checks for the presence of table comments. These changes are part of our ongoing efforts to improve metadata capture for Snowflake tables. (Note: the commit message references issue [#945](#945) on GitHub, which this pull request is intended to close.)

* Use Transformation to get rid of the ctx parameter in generators ([#1040](#1040)). The `Generating` class has undergone significant changes, removing the `ctx` parameter and introducing two new phases, `Parsing` and `BuildingAst`, in the sealed trait `Phase`. The `Parsing` phase extends `Phase` with a previous phase of `Init` and contains the source code and filename. The `BuildingAst` phase extends `Phase` with a previous phase of `Parsing` and contains the parsed tree and the previous phase. The `Optimizing` phase now contains the unoptimized plan and the previous phase. The `Generating` phase now contains the optimized plan, the current node, the total statements, the transpiled statements, the `GeneratorContext`, and the previous phase. Additionally, the `TransformationConstructors` trait has been updated to allow for the creation of Transformation instances specific to a certain phase of a workflow. The `runQuery` method in the `BaseQueryRunner` abstract class has been updated to use a new `transpile` method provided by the `Transpiler` trait, and the `Estimator` class in the `Estimation` module has undergone changes to remove the `ctx` parameter in generators. Overall, these changes simplify the implementation, improve code maintainability, and enable better separation of concerns in the codebase.

* With Recursive ([#1000](#1000)). In this release, we have introduced several enhancements for `With Recursive` statements in SQL parsing and processing for the Snowflake database. A new IR (Intermediate Representation) for With Recursive CTE (Common Table Expression) has been implemented in the SnowflakeAstBuilder.scala file. A new case class, WithRecursiveCTE, has been added to the SnowflakeRelationBuilder class in the databricks/labs/remorph project, which extends RelationCommon and includes two members: ctes and query. The buildColumns method in the SnowflakeRelationBuilder class has been updated to handle cases where columnList is null and to extract column names differently. Additionally, a new test has been added in SnowflakeAstBuilderSpec.scala that verifies the correct handling of a recursive CTE query. These enhancements improve the support for recursive queries in the Snowflake database, enabling more powerful and flexible querying capabilities for developers and data analysts working with complex data structures.

* [chore] fixed query coverage report ([#1160](#1160)). In this release, we have addressed issue [#1160](#1160) related to the query coverage report. We have implemented changes to the `QueryRunner` abstract class in the `com.databricks.labs.remorph.coverage` package. The `ReportEntryReport` constructor now accepts a new parameter, `parsed`, which is set to 1 if there is no transpilation error and 0 otherwise. Previously, `parsed` was always set to 1, regardless of the presence of a transpilation error. We also updated the `extractQueriesFromFile` and `extractQueriesFromFolder` methods in the `FileQueryHistory` class to return a single `ExecutedQuery` instance, allowing for better query coverage reporting. Additionally, we modified the behavior of the `history()` method of the `fileQueryHistory` object in the `FileQueryHistorySpec` test case. The method now returns a query history object with a single query whose `source` includes the text "SELECT * FROM table1;" and "SELECT * FROM table2;", effectively merging the previous two queries into one. These changes ensure that the report accurately reflects whether the query was successfully transpiled, parsed, and stored in the query history. It is crucial to thoroughly test any parts of the code that rely on the `history()` method to return separate queries, as the behavior of the method has changed. A minimal sketch of the reporting change appears after this list.
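As a companion to the `Added documentation around Transformation` note above, here is a minimal, hedged sketch of a state-threading abstraction with `map`, `flatMap`, `run`, and `runAndDiscardState`. The `Result`/`OkResult`/`KoResult` shapes and the `runWith` field are simplified stand-ins, not the exact Remorph definitions.

```scala
// Simplified sketch only: the real Remorph Result and Transformation types carry
// richer error and phase information than shown here.
sealed trait Result[+A]
final case class OkResult[A](value: A) extends Result[A]
final case class KoResult(error: String) extends Result[Nothing]

final class Transformation[State, Out](val runWith: State => Result[(State, Out)]) {

  // map: change the output while keeping the state threading intact.
  def map[B](f: Out => B): Transformation[State, B] =
    new Transformation[State, B](s =>
      runWith(s) match {
        case OkResult((s2, out)) => OkResult((s2, f(out)))
        case ko: KoResult        => ko
      })

  // flatMap: feed the output into the next state-aware computation.
  def flatMap[B](f: Out => Transformation[State, B]): Transformation[State, B] =
    new Transformation[State, B](s =>
      runWith(s) match {
        case OkResult((s2, out)) => f(out).runWith(s2)
        case ko: KoResult        => ko
      })

  // run: execute with an initial state, returning the final state and the output.
  def run(initial: State): Result[(State, Out)] = runWith(initial)

  // runAndDiscardState: execute and keep only the output.
  def runAndDiscardState(initial: State): Result[Out] =
    runWith(initial) match {
      case OkResult((_, out)) => OkResult(out)
      case ko: KoResult       => ko
    }
}
```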
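For the `Fetch table definitions for TSQL` and `Table comments captured as part of Snowflake Table Definition` items, this is an illustrative sketch of a `TableDefinition` case class covering the listed properties; field names and types are assumptions, not the actual Remorph definitions.

```scala
// Illustrative only: field names and types are guesses based on the properties
// listed in the release notes, not the actual Remorph case class.
final case class ColumnDefinition(name: String, dataType: String, nullable: Boolean = true)

final case class TableDefinition(
    catalogName: String,
    schemaName: String,
    tableName: String,
    location: Option[String] = None,       // e.g. external storage path
    tableFormat: Option[String] = None,    // e.g. "DELTA", "PARQUET"
    viewDefinition: Option[String] = None, // populated only for views
    columns: Seq[ColumnDefinition] = Seq.empty,
    sizeGb: Option[Double] = None,         // approximate table size
    comment: Option[String] = None         // optional table comment (see #989)
)
```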
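For the `Handling any_keys_match from presto` item, a small, assumed IR sketch of the rewrite: Presto's `any_keys_match(map, pred)` is expressed as an existence check over the map's keys. The case classes are simplified stand-ins for the Remorph IR expression nodes.

```scala
// Simplified stand-ins for IR expression nodes; the real Remorph IR differs.
sealed trait Expression
final case class MapKeys(map: Expression) extends Expression
final case class ArrayExists(array: Expression, predicate: Expression) extends Expression
final case class AnyKeysMatch(map: Expression, predicate: Expression) extends Expression

object AnyKeysMatchRewrite {
  // any_keys_match(map, pred)  ==>  exists(map_keys(map), pred)
  def rewrite(e: Expression): Expression = e match {
    case AnyKeysMatch(m, pred) => ArrayExists(MapKeys(m), pred)
    case other                 => other
  }
}
```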
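For the `TO_NUMBER/TO_DECIMAL/TO_NUMERIC without precision and scale` item, a small sketch of the default-filling rule described above (precision 38, scale 0 when omitted); `DecimalType` and `NumericDefaults` are illustrative stand-ins, not the actual Remorph types.

```scala
// Illustrative stand-in for the IR decimal type.
final case class DecimalType(precision: Int, scale: Int)

object NumericDefaults {
  val MaxPrecision = 38 // maximum allowed precision, used when none is specified
  val DefaultScale = 0  // default scale when none is specified

  def resolve(precision: Option[Int], scale: Option[Int]): DecimalType =
    DecimalType(precision.getOrElse(MaxPrecision), scale.getOrElse(DefaultScale))
}

// NumericDefaults.resolve(None, None)        == DecimalType(38, 0)
// NumericDefaults.resolve(Some(10), Some(2)) == DecimalType(10, 2)
```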
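Finally, for the `[chore] fixed query coverage report` item (the subject of this pull request), a minimal sketch of the reporting change: `parsed` now reflects whether transpilation produced an error instead of always being 1. Fields other than `parsed` are illustrative placeholders, not the exact `ReportEntryReport` definition.

```scala
// Minimal sketch: only the `parsed` semantics come from the release note; the
// other fields are illustrative placeholders.
final case class ReportEntryReport(
    parsed: Int = 0,                   // 1 if transpilation produced no error, else 0
    failures: Seq[String] = Seq.empty) // collected error messages, if any

object CoverageReportSketch {
  // Before the fix, `parsed` was hard-coded to 1 regardless of errors.
  def reportFor(transpilationError: Option[String]): ReportEntryReport =
    transpilationError match {
      case None      => ReportEntryReport(parsed = 1)
      case Some(err) => ReportEntryReport(parsed = 0, failures = Seq(err))
    }
}
```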
nfx
added a commit
that referenced
this pull request
Nov 7, 2024
* Added IR for stored procedures ([#1161](#1161)). In this release, we have made significant enhancements to the project by adding support for stored procedures. We have introduced a new `CreateVariable` case class to manage variable creation within the intermediate representation (IR), and removed the `SetVariable` case class as it is now redundant. A new `CaseStatement` class has been added to represent SQL case statements with value match, and a `CompoundStatement` class has been implemented to enable encapsulation of a sequence of logical plans within a single compound statement. The `DeclareCondition`, `DeclareContinueHandler`, and `DeclareExitHandler` case classes have been introduced to handle conditional logic and exit handlers in stored procedures. New classes `DeclareVariable`, `ElseIf`, `ForStatement`, `If`, `Iterate`, `Leave`, `Loop`, `RepeatUntil`, `Return`, `SetVariable`, and `Signal` have been added to the project to provide more comprehensive support for procedural language features and control flow management in stored procedures. We have also included SnowflakeCommandBuilder support for stored procedures and updated the `visitExecuteTask` method to handle stored procedure calls using the `SetVariable` method. * Added Variant Support ([#998](#998)). In this commit, support for the Variant datatype has been added to the create table functionality, enhancing the system's compatibility with Snowflake's datatypes. A new VariantType has been introduced, which allows for more comprehensive handling of data during create table operations. Additionally, a `remarks VARIANT` line is added in the CREATE TABLE statement and the corresponding spec test has been updated. The Variant datatype is a flexible datatype that can store different types of data, such as arrays, objects, and strings, offering increased functionality for users working with variant data. Furthermore, this change will enable the use of the Variant datatype in Snowflake tables and improves the data modeling capabilities of the system. * Added `PySpark` generator ([#1026](#1026)). The engineering team has developed a new `PySpark` generator for the `com.databricks.labs.remorph.generators` package. This addition introduces a new parameter, `logical`, of type `Generator[ir.LogicalPlan, String]`, in the `SQLGenerator` for SQL queries. A new abstract class `BasePythonGenerator` has been added, which extends the `Generator` class and generates Python code. A `ExpressionGenerator` class has also been added, which extends `BasePythonGenerator` and is responsible for generating Python code for `ir.Expression` objects. A new `LogicalPlanGenerator` class has been added, which extends `BasePythonGenerator` and is responsible for generating Python code for a given `ir.LogicalPlan`. A new `StatementGenerator` class has been implemented, which converts `Statement` objects into Python code. A new Python-generating class, `PythonGenerator`, has been added, which includes the implementation of an abstract syntax tree (AST) for Python in Scala. This AST includes classes for various Python language constructs. Additionally, new implicit classes for `PythonInterpolator`, `PythonOps`, and `PythonSeqOps` have been added to allow for the creation of PySpark code using the Remorph framework. The `AndOrToBitwise` rule has been implemented to convert `And` and `Or` expressions to their bitwise equivalents. 
The `DotToFCol` rule has been implemented to transform code that references columns using dot notation in a DataFrame to use the `col` function with a string literal of the column name instead. A new `PySparkStatements` object and `PySparkExpressions` class have been added, which provide functionality for transforming expressions in a data processing pipeline to PySpark equivalents. The `SnowflakeToPySparkTranspiler` class has been added to transpile Snowflake queries to PySpark code. A new `PySpark` generator has been added to the `Transpiler` class, which is implemented as an instance of the `SqlGenerator` class. This change enhances the `Transpiler` class with a new `PySpark` generator and improves serialization efficiency. * Added `debug-bundle` command for folder-to-folder translation ([#1045](#1045)). In this release, we have introduced a `debug-bundle` command to the remorph project's CLI, specifically added to the `proxy_command` function, which already includes `debug-script`, `debug-me`, and `debug-coverage` commands. This new command enhances the tool's debugging capabilities, allowing developers to generate a bundle of translated queries for folder-to-folder translation tasks. The `debug-bundle` command accepts three flags: `dialect`, `src`, and `dst`, specifying the SQL dialect, source directory, and destination directory, respectively. Furthermore, the update includes refactoring the `FileSetGenerator` class in the `orchestration` package of the `com.databricks.labs.remorph.generators` package, adding a `debug-bundle` command to the `Main` object, and updating the `FileQueryHistoryProvider` method in the `ApplicationContext` trait. These improvements focus on providing a convenient way to convert folder-based SQL scripts to other formats like SQL and PySpark, enhancing the translation capabilities of the project. * Added `ruff` Python formatter proxy ([#1038](#1038)). In this release, we have added support for the `ruff` Python formatter in our project's continuous integration and development workflow. We have also introduced a new `FORMAT` stage in the `WorkflowStage` object in the `Result` Scala object to include formatting as a separate step in the workflow. A new `RuffFormatter` class has been added to format Python code using the `ruff` tool, and a `StandardInputPythonSubprocess` class has been included to run a Python subprocess and capture its output and errors. Additionally, we have added a proxy for the `ruff` formatter to the SnowflakeToPySparkTranspilerTest for Scala to improve the readability of the transpiled Python code generated by the SnowflakeToPySparkTranspiler. Lastly, we have introduced a new `ruff` formatter proxy in the test code for the transpiler library to enforce format and style conventions in Python code. These changes aim to improve the development and testing experience for the project and ensure that the code follows the desired formatting and style standards. * Added baseline for translating workflows ([#1042](#1042)). In this release, several new features have been added to the open-source library to improve the translation of workflows. A new dependency for the Jackson YAML data format library, version 2.14.0, has been added to the pom.xml file to enable processing YAML files and converting them to Java objects. A new `FileSet` class has been introduced, which provides an in-memory data structure to manage a set of files, allowing users to add, retrieve, and remove files by name and persist the contents of the files to the file system. 
A new `FileSetGenerator` class has been added that generates a `FileSet` object from a `JobNode` object, enabling the translation of workflows by generating all necessary files for a workspace. A new `DefineJob` class has been developed to define a new rule for processing `JobNode` objects in the Remorph system, converting instances of `SuccessPy` and `SuccessSQL` into `PythonNotebookTask` and `SqlNotebookTask` objects, respectively. Additionally, various new classes, such as `GenerateBundleFile`, `QueryHistoryToQueryNodes`, `ReformatCode`, `TryGeneratePythonNotebook`, `TryGenerateSQL`, `TrySummarizeFailures`, `InformationFile`, `SuccessPy`, `SuccessSQL`, `FailedQuery`, `Migration`, `PartialQuery`, `QueryPlan`, `RawMigration`, `Comment`, and `PlanComment`, have been introduced to provide a more comprehensive and nuanced job orchestration framework. The `Library` case class has been updated to better separate concerns between library configuration and code assets. These changes address issue [#1042](#1042) and provide a more robust and flexible workflow translation solution. * Added correct generation of `databricks.yml` for `QueryHistory` ([#1044](#1044)). The FileSet class in the FileSet.scala file has been updated to include a new method that correctly generates the `databricks.yml` file for the `QueryHistory` feature. This file is used for orchestrating cross-compiled queries, creating three files in total - two SQL notebooks with translated and formatted queries and a `databricks.yml` file to define an asset bundle for the queries. The new method in the FileSet class writes the content to the file using the `Files.write` method from the `java.nio.file` package instead of the previously used `PrintWriter`. The FileSetGenerator class has been updated to include the new `databricks.yml` file generation, and new rules and methods have been added to improve the accuracy and consistency of schema definitions in the generated orchestration files. Additionally, the DefineJob and DefineSchemas classes have been introduced to simplify the orchestration generation process. * Added documentation around Transformation ([#1043](#1043)). In this release, the Transformation class in our open-source library has been enhanced with detailed documentation, type parameters, and new methods. The class represents a stateful computation that produces an output of type Out while managing a state of type State. The new methods include map and flatMap for modifying the output and chaining transformations, as well as run and runAndDiscardState for executing the computation with a given initial state and producing a Result containing the output and state. Additionally, we have introduced a new trait called TransformationConstructors that provides constructors for successful transformations, error transformations, lifted results, state retrieval, replacement, and updates. The CodeGenerator trait in our code generation library has also been updated with several new methods for more control and flexibility in the code generation process. These include commas and spaces for formatting output, updateGenCtx for updating the GeneratorContext, nest and unnest for indentation, withIndentedBlock for producing nested blocks of code, and withGenCtx for creating transformations that use the current GeneratorContext. * Added tests for Snow ARRAY_REMOVE function ([#979](#979)). In this release, we have added tests for the Snowflake ARRAY_REMOVE function in the SnowflakeToDatabricksTranspilerTest. 
The tests, currently ignored, demonstrate the usage of the ARRAY_REMOVE function with different data types, such as integers and doubles. A TODO comment is included for a test case involving VARCHAR casting, to be enabled once the necessary casting functionality is implemented. This update enhances the library's capabilities and ensures that the ARRAY_REMOVE function can handle a variety of data types. Software engineers can refer to these tests to understand the usage of the ARRAY_REMOVE function in the transpiler and the planned casting functionality. * Avoid non local return ([#1052](#1052)). In this release, the `render` method of the `generators` package object in the `com.databricks.labs.remorph` package has been updated to avoid using non-local returns and follow recommended coding practices. Instead of returning early from the method, it now uses `Option` to track failures and a `try-catch` block to handle exceptions. In cases of exception during string concatenation, the method sets the `failureOpt` variable to `Some(lift(KoResult(WorkflowStage.GENERATE, UncaughtException(e))))`. Additionally, the test file "CodeInterpolatorSpec.scala" has been modified to fix an issue with exception handling. In the updated code, new variables for each argument are introduced, and the problematic code is placed within an interpolated string, allowing for proper exception handling. This release enhances the robustness and reliability of the code interpolator and ensures that the method follows recommended coding practices. * Collect errors in `Phase` ([#1046](#1046)). The open-source library Remorph has received significant updates, focusing on enhancing error collection and simplifying the Transformation class. The changes include a new method `recordError` in the abstract Phase trait and its concrete implementations for collecting errors during each phase. The Transformation class has been simplified by removing the unused Phase parameter, while the Generator, CodeGenerator, and FileSetGenerator have been specialized to use Transformation without the Phase parameter. The TryGeneratePythonNotebook, TryGenerateSQL, CodeInterpolator, and TBASeqOps classes have been updated for a more concise and focused state. The imports have been streamlined, and the PySparkGenerator, SQLGenerator, and PlanParser have been modified to remove the unused Phase type parameter. A new test file, TransformationTest, has been added to check the error collection functionality in the Transformation class. Overall, these enhancements improve the reliability, readability, and maintainability of the Remorph library. * Correctly generate `F.fn_name` for builtin PySpark functions ([#1037](#1037)). This commit introduces changes to the generation of `F.fn_name` for builtin PySpark functions in the PySparkExpressions.scala file, specifically for PySpark's builtin functions (`fn`). It includes a new case to handle these functions by converting them to their lowercase equivalent in Python using `Locale.getDefault`. Additionally, changes have been made to handle window specifications more accurately, such as using `ImportClassSideEffect` with `windowSpec` and generating and applying a window function (`fn`) over it. The `LAST_VALUE` function has been modified to `LAST` in the SnowflakeToDatabricksTranspilerTest.scala file, and new methods such as `First`, `Last`, `Lag`, `Lead`, and `NthValue` have been added to the SnowflakeCallMapper class. 
These changes improve the accuracy, flexibility, and compatibility of PySpark when working with built-in functions and window specifications, making the codebase more readable, maintainable, and efficient. * Create Command Extended ([#1033](#1033)). In this release, the open-source library has been updated with several new features related to table management and SQL code generation. A new method `replaceTable` has been added to the `LogicalPlanGenerator` class, which generates SQL code for a `ReplaceTableCommand` IR node and replaces an existing table with the same name if it already exists. Additionally, support has been added for generating SQL code for an `IdentityConstraint` IR node, which specifies whether a column is an auto-incrementing identity column. The `CREATE TABLE` statement has been updated to include the `AUTOINCREMENT` and `REPLACE` constraints, and a new `IdentityConstraint` case class has been introduced to extend the capabilities of the `UnnamedConstraint` class. The `TSqlDDLBuilder` class has also been updated to handle the `IDENTITY` keyword more effectively. A new command implementation with `AUTOINCREMENT` and `REPLACE` constraints has been added, and a new SQL script has been included in the functional tests for testing CREATE DDL statements with identity columns. Finally, the SQL transpiler has been updated to support the `CREATE OR REPLACE PROCEDURE` syntax, providing more flexibility and convenience for users working with stored procedures in their SQL code. These updates aim to improve the functionality and ease of use of the open-source library for software engineers working with SQL code and table management. * Don't draft automated releases ([#995](#995)). In this release, we have made a modification to the release.yml file in the .github/workflows directory by removing the "draft: true" line. This change removes the creation of draft releases in the automated release process, simplifying it and making it more straightforward for users to access new versions of the software. The job section of the release.yml file now only includes the `release` job, with the "release-signing-artifacts: true" still enabled, ensuring that the artifacts are signed. This improvement enhances the overall release process, making it more efficient and user-friendly. * Enhance the Snow ARRAY_SORT function support ([#994](#994)). With this release, the Snowflake ARRAY_SORT function now supports Boolean literals as parameters, improving its functionality. The changes include validating Boolean parameters in SnowflakeCallMapper.scala, throwing a TranspileException for unsupported arguments, and simplifying the IR using the DBSQL SORT_ARRAY function. Additionally, new test cases have been added to SnowflakeCallMapperSpec for the ARRAY_SORT and ARRAY_SLICE functions. The SnowflakeToDatabricksTranspilerTest class has also been updated with new test cases that cover the enhanced ARRAY_SORT function's various overloads and combinations of Boolean literals, NULLs, and a custom sorting function. This ensures that invalid usage is caught during transpilation, providing better error handling and supporting more use cases. * Ensure that unparsable text is not lost in the generated output ([#1012](#1012)). In this release, we have implemented an enhancement to the error handling strategy in the ANTLR-generated parsers for SQL. This change records where parsing failed and gathers un-parsable input, preserving them as custom error nodes in the ParseTree at strategic points. 
The new custom error strategy allows visitors for higher level rules such as `sqlCommand` in Snowflake and `sqlClauses` in TSQL to check for an error node in the children and generate an Ir node representing the un-parsed text. Additionally, new methods have been introduced to handle error recovery, find the highest context in the tree for the particular parser, and recursively find the context. A separate improvement is planned to ensure the PLanParser no longer stops when syntax errors are discovered, allowing safe traversal of the ParseTree. This feature is targeted towards software engineers working with SQL parsing and aims to improve error handling and recovery. * Fetch table definitions for TSQL ([#986](#986)). This pull request introduces a new `TableDefinition` case class that encapsulates metadata properties for tables in TSQL, such as catalog name, schema name, table name, location, table format, view definition, columns, table size, and comments. A `TSqlTableDefinitions` class has been added with methods to retrieve table definitions, all schemas, and all catalogs from TSQL. The `SnowflakeTypeBuilder` is updated to parse data types from TSQL. The `SnowflakeTableDefinitions` class has been refactored to use the new `TableDefinition` case class and retrieve table definitions more efficiently. The changes also include adding two new test cases to verify the correct retrieval of table definitions and catalogs for TSQL. * Fixed handling of projected expressions in `TreeNode` ([#1159](#1159)). In this release, we have addressed the handling of projected expressions in the `TreeNode` class, resolving issues [#1072](#1072) and [#1159](#1159). The `expressions` method in the `Plan` abstract class has been modified to include the `final` keyword, restricting overriding in subclasses. This method now returns all expressions present in a query from the current plan operator and its descendants. Additionally, we have introduced a new private method, `seqToExpressions`, used for recursively finding all expressions from a given sequence. The `Project` class, representing a relational algebra operation that projects a subset of columns in a table, now utilizes a new `columns` parameter instead of `expressions`. Similar changes have been applied to other classes extending `UnaryNode`, such as `Join`, `Deduplicate`, and `Hint`. The `values` parameter of the `Values` class has also been altered to accurately represent input values. A new test class, `JoinTest`, has been introduced to verify the correct propagation of expressions in join nodes, ensuring intended data transformations. * Handling any_keys_match from presto ([#1048](#1048)). In this commit, we have added support for the `any_keys_match` Presto function in Databricks by implementing it using existing Databricks functions. The `any_keys_match` function checks if any keys in a map match a given condition. Specifically, we have introduced two new classes, `MapKeys` and `ArrayExists`, which allow us to extract keys from the input map and check if any of the keys satisfy the given condition using the `exists` function. This is accomplished by renaming `exists` to `array_exists` to better reflect its purpose. Additionally, we have provided a Databricks SQL query that mimics the behavior of the `any_keys_match` function in Presto and added tests to ensure that it works as expected. These changes enable users to perform equivalent operations with a consistent syntax in Databricks and Presto. * Improve IR for job nodes ([#1041](#1041)). 
The open-source library has undergone improvements to the Intermediate Representation (IR) for job nodes, as indicated by the commit message "Improve IR for job nodes." This release introduces several significant changes, including: Refactoring of the `JobNode` class to extend the `TreeNode` class and the addition of a new abstract class `LeafJobNode` that overrides the `children` method to always return an empty `Seq`. Enhancements to the `ClusterSpec` case class, which now includes a `toSDK` method that properly initializes and sets the values of the fields in the SDK `ClusterSpec` object. Improvements to the `NewClusterSpec` class, which updates the types of several optional fields and introduces changes to the `toSDK` method for better conversion to the SDK format. Removal of the `Job` class, which previously represented a job in the IR of workflows. Changes to the `JobCluster` case class, which updates the `newCluster` attribute from `ClusterSpec` to `NewClusterSpec`. Update to the `JobEmailNotifications` class, which now extends `LeafJobNode` and includes new methods and overwrites existing ones from `LeafJobNode`. Improvements to the `JobNotificationSettings` class, which replaces the original `toSDK` method with a new implementation for more accurate SDK representation of job notification settings. Refactoring of the `JobParameterDefinition` class, which updates the `toSDK` method for more efficient conversion to the SDK format. These changes simplify the representation of job nodes, align the codebase more closely with the underlying SDK, and improve overall code maintainability and compatibility with other Remorph components. * Query History From Folder ([#991](#991)). The Estimator class in the Remorph project has been updated to enhance the query history interface by adding metadata from reading from a folder, improving its ability to handle queries from different users and increasing the accuracy of estimation reports. The Anonymizer class has also been updated to handle cases where the user field is missing, ensuring the anonymization process can proceed smoothly and preventing potential errors. A new FileQueryHistory class has been added to provide query history functionality by reading metadata from a specified folder. The SnowflakeQueryHistory class has been updated to directly implement the history() method and include new fields in the ExecutedQuery objects, such as 'id', 'source', 'timestamp', 'duration', 'user', and 'filename'. A new ExecutedQuery case class has been introduced, which now includes optional `user` and `filename` fields, and a new QueryHistoryProvider trait has been added with a method history() that returns a QueryHistory object containing a sequence of ExecutedQuery objects, enhancing the query history interface's flexibility and power. Test suites and test data for the Anonymizer and TableGraph classes have been updated to accommodate these changes, allowing for more comprehensive testing of query history functionality. A FileQueryHistorySpec test file has been added to test the FileQueryHistory class's ability to correctly extract queries from SQL files, ensuring the class works as expected. * Rework serialization using circe+jackson ([#1163](#1163)). In pull request [#1163](#1163), the serialization mechanism in the project has been refactored to use the Circe and Jackson libraries, replacing the existing ujson library. 
* Rework serialization using circe+jackson ([#1163](#1163)). In pull request [#1163](#1163), the serialization mechanism in the project has been refactored to use the Circe and Jackson libraries, replacing the existing ujson library. This change adds the circe, circe-generic-extras, and circe-jackson libraries, which are licensed under the Apache 2.0 license, and the project now includes the copyright notices and license information for all open-source projects that have contributed code to it, ensuring compliance with open-source licenses. The `CoverageTest` class has been updated to encode errors using Circe and Jackson, and the `EstimationReport` case classes no longer define implicit `ReadWriter` instances using `macroRW`; Circe and Jackson encode and decode instances are now defined elsewhere in the codebase. The `BaseQueryRunner` abstract class has been updated to handle both parsing and transpilation errors in a more uniform way, using a `failures` field instead of `transpilation_error` or `parsing_error`. Additionally, a new file, `encoders.scala`, has been introduced, which defines encoders for serializing objects to JSON using the Circe and Jackson libraries. These changes aim to improve serialization and deserialization performance and capabilities, simplify the codebase, and ensure consistent and readable output.
* Some window functions do not support window frame conditions ([#999](#999)). The Snowflake expression builder has been updated to correct the default window frame specifications for certain window functions and to modify the behavior of the ORDER BY clause in these functions. This change ensures that the expression builder generates the correct SQL syntax for functions that do not accept a window frame, such as `LAG`, `DENSE_RANK`, `LEAD`, `PERCENT_RANK`, `RANK`, and `ROW_NUMBER`, improving the compatibility and reliability of the generated queries. Additionally, a unit test for the `SnowflakeExpressionBuilder` has been updated to account for changes in the way window functions are handled, enhancing the accuracy of the builder in generating valid SQL for window functions in Snowflake.
* Split workflow definitions into sensible packages ([#1039](#1039)). The AutoScale class has been refactored and moved to a new package, `com.databricks.labs.remorph.intermediate.workflows.clusters`, extending `JobNode` from `com.databricks.labs.remorph.intermediate.workflows`. It is now a case class for auto-scaling that takes optional integer arguments `maxWorkers` and `minWorkers` and provides a single `apply` method that creates and configures a cluster using the Databricks SDK's `ComputeService`. The AwsAttributes and AzureAttributes classes have also been moved to the `com.databricks.labs.remorph.intermediate.workflows.clusters` package and now extend `JobNode`; these classes manage AWS- and Azure-related attributes for compute resources in a workflow. The ClientsTypes case class has been moved to the new clusters sub-package within the workflows package and now extends `JobNode`, and the ClusterLogConf class has been moved to the same clusters package. The JobDeployment class has been refactored and moved to `com.databricks.labs.remorph.intermediate.workflows.jobs`, and the JobEmailNotifications, JobsHealthRule, and WorkspaceStorageInfo classes have been moved to new packages and now import `JobNode`. These changes improve the organization and maintainability of the codebase, making it easier to understand and navigate.
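To give a flavour of the Circe-based approach, here is a small, self-contained sketch; the `ReportEntry` case class is hypothetical, and the real `encoders.scala` may derive its instances differently (circe-core and circe-generic are assumed to be on the classpath):

```scala
import io.circe.Encoder
import io.circe.generic.semiauto.deriveEncoder
import io.circe.syntax._

// Hypothetical report entry; the real EstimationReport case classes carry more fields.
final case class ReportEntry(query: String, failures: List[String])

object ReportCodecs {
  // Semi-automatic derivation replaces the uPickle-style `macroRW` instances.
  implicit val reportEntryEncoder: Encoder[ReportEntry] = deriveEncoder[ReportEntry]
}

object ReportCodecsDemo extends App {
  import ReportCodecs._
  // Prints {"query":"SELECT 1","failures":[]}
  println(ReportEntry("SELECT 1", Nil).asJson.noSpaces)
}
```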
* TO_NUMBER/TO_DECIMAL/TO_NUMERIC without precision and scale ([#1053](#1053)). This pull request improves the transpilation of the Snowflake `TO_NUMBER`, `TO_DECIMAL`, and `TO_NUMERIC` functions when precision and scale are not specified. The updated transpiler now applies default values when these parameters are omitted, with precision set to the maximum allowed value of 38 and scale set to 0. A new method has been added to handle these cases, and four new test cases verify the transpilation of `TO_NUMBER` and `TO_DECIMAL` without specified precision and scale and with various input formats. This change ensures consistent behavior across SQL dialects when precision and scale are not explicitly defined in the conversion functions.
* Table comments captured as part of Snowflake Table Definition ([#989](#989)). In this release, we have added support for capturing table comments as part of Snowflake table definitions in the remorph library. This includes modifying the `TableDefinition` case class to include an optional comment field and updating the SQL query in the `SnowflakeTableDefinitions` class to retrieve table comments. A new integration test for Snowflake table definitions has also been introduced to ensure the new feature works correctly: it creates a connection to the Snowflake database, retrieves a list of table definitions, and checks for the presence of table comments. These changes are part of our ongoing efforts to improve metadata capture for Snowflake tables. (Note: the commit message references issue [#945](#945) on GitHub, which this pull request is intended to close.)
* Use Transformation to get rid of the ctx parameter in generators ([#1040](#1040)). The `Generating` class has undergone significant changes, removing the `ctx` parameter and introducing two new phases, `Parsing` and `BuildingAst`, in the sealed trait `Phase`. The `Parsing` phase extends `Phase` with a previous phase of `Init` and contains the source code and filename. The `BuildingAst` phase extends `Phase` with a previous phase of `Parsing` and contains the parsed tree and the previous phase. The `Optimizing` phase now contains the unoptimized plan and the previous phase, and the `Generating` phase now contains the optimized plan, the current node, the total statements, the transpiled statements, the `GeneratorContext`, and the previous phase. Additionally, the `TransformationConstructors` trait has been updated to allow creating `Transformation` instances specific to a certain phase of the workflow. The `runQuery` method in the `BaseQueryRunner` abstract class now uses a new `transpile` method provided by the `Transpiler` trait, and the `Estimator` class in the Estimation module no longer passes a `ctx` parameter to generators. Overall, these changes simplify the implementation, improve code maintainability, and enable better separation of concerns in the codebase.
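A rough sketch of the phase chain described above, with heavily simplified payloads; the real phases carry parse trees, plans, and the `GeneratorContext` rather than the placeholder `String` fields used here:

```scala
// Each phase remembers the phase it came from, so later stages can report
// context without threading a ctx parameter through every generator call.
sealed trait Phase

case object Init extends Phase

final case class Parsing(source: String, filename: String) extends Phase {
  val previousPhase: Phase = Init
}

final case class BuildingAst(tree: String, previousPhase: Parsing) extends Phase

final case class Optimizing(unoptimizedPlan: String, previousPhase: BuildingAst) extends Phase

final case class Generating(optimizedPlan: String, previousPhase: Optimizing) extends Phase
```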
* With Recursive ([#1000](#1000)). In this release, we have introduced several enhancements for `WITH RECURSIVE` statements in SQL parsing and processing for the Snowflake database. A new IR (Intermediate Representation) node for recursive CTEs (Common Table Expressions) has been implemented in SnowflakeAstBuilder.scala, and a new case class, `WithRecursiveCTE`, has been added to the `SnowflakeRelationBuilder` class in the databricks/labs/remorph project; it extends `RelationCommon` and includes two members, `ctes` and `query`. The `buildColumns` method in the `SnowflakeRelationBuilder` class has been updated to handle cases where `columnList` is null and to extract column names differently. Additionally, a new test has been added in SnowflakeAstBuilderSpec.scala that verifies the correct handling of a recursive CTE query. These enhancements improve support for recursive queries in the Snowflake database, enabling more powerful and flexible querying capabilities for developers and data analysts working with complex data structures.
* [chore] fixed query coverage report ([#1160](#1160)). In this release, we have addressed issue [#1160](#1160), related to the query coverage report, with changes to the `QueryRunner` abstract class in the `com.databricks.labs.remorph.coverage` package. The `ReportEntryReport` constructor now accepts a new parameter, `parsed`, which is set to 1 if there is no transpilation error and 0 otherwise; previously, `parsed` was always set to 1, regardless of the presence of a transpilation error. We also updated the `extractQueriesFromFile` and `extractQueriesFromFolder` methods in the `FileQueryHistory` class to return a single `ExecutedQuery` instance, allowing for better query coverage reporting. Additionally, the behavior of the `history()` method of the `fileQueryHistory` object in the `FileQueryHistorySpec` test case has changed: it now returns a query history with a single query whose `source` includes both "SELECT * FROM table1;" and "SELECT * FROM table2;", effectively merging the previous two queries into one. These changes ensure that the report accurately reflects whether each query was successfully transpiled, parsed, and stored in the query history. Any code that relies on `history()` returning separate queries should be retested, as the method's behavior has changed.
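Purely as an illustration of the `parsed` bookkeeping in the last item, a trimmed-down sketch follows; the real `QueryRunner` and `ReportEntryReport` have more fields and different signatures:

```scala
// Hypothetical, trimmed-down report entry: `parsed` is 1 only when the query
// made it through transpilation without an error, instead of always being 1.
final case class ReportEntryReport(parsed: Int, failures: List[String])

object CoverageReportSketch {
  def reportFor(transpilationError: Option[String]): ReportEntryReport =
    ReportEntryReport(
      parsed = if (transpilationError.isEmpty) 1 else 0, // previously hard-coded to 1
      failures = transpilationError.toList
    )
}
```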