Bug/event enhanced unique key (#19)
* initial commit

* update incremental

* update yml

* add tests

* revisions

* update date_spine

* update daily_history

* update changelog

* update & regen docs

* add databricks warehouse support

* Apply suggestions from code review

Co-authored-by: Joe Markiewicz <[email protected]>

* update changelog

---------

Co-authored-by: Joe Markiewicz <[email protected]>
1 parent 30d8ec2 commit 869076b
Showing 25 changed files with 384 additions and 229 deletions.
5 changes: 4 additions & 1 deletion .buildkite/hooks/pre-command
```diff
@@ -21,4 +21,7 @@
 export CI_SNOWFLAKE_DBT_WAREHOUSE=$(gcloud secrets versions access latest --secret="CI_SNOWFLAKE_DBT_WAREHOUSE" --project="dbt-package-testing-363917")
 export CI_DATABRICKS_DBT_HOST=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_HOST" --project="dbt-package-testing-363917")
 export CI_DATABRICKS_DBT_HTTP_PATH=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_HTTP_PATH" --project="dbt-package-testing-363917")
-export CI_DATABRICKS_DBT_TOKEN=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_TOKEN" --project="dbt-package-testing-363917")
+export CI_DATABRICKS_DBT_TOKEN=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_TOKEN" --project="dbt-package-testing-363917")
+export CI_DATABRICKS_DBT_CATALOG=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_CATALOG" --project="dbt-package-testing-363917")
+export CI_DATABRICKS_SQL_DBT_HTTP_PATH=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_SQL_DBT_HTTP_PATH" --project="dbt-package-testing-363917")
+export CI_DATABRICKS_SQL_DBT_TOKEN=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_SQL_DBT_TOKEN" --project="dbt-package-testing-363917")
```
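Secrets exported this way are typically consumed by the integration tests' connection profile via dbt's `env_var()` function. A hypothetical sketch of the matching `databricks-sql` target in `profiles.yml` — the target name, schema, and thread count here are assumptions, not taken from this repository:

```yml
databricks-sql:
  type: databricks
  catalog: "{{ env_var('CI_DATABRICKS_DBT_CATALOG') }}"
  host: "{{ env_var('CI_DATABRICKS_DBT_HOST') }}"
  http_path: "{{ env_var('CI_DATABRICKS_SQL_DBT_HTTP_PATH') }}"
  token: "{{ env_var('CI_DATABRICKS_SQL_DBT_TOKEN') }}"
  schema: amplitude_sqlw_tests
  threads: 8
```

Keeping the SQL Warehouse credentials as separate `CI_DATABRICKS_SQL_*` secrets lets the two Databricks targets share a host while pointing at different compute.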
20 changes: 18 additions & 2 deletions .buildkite/pipeline.yml
```diff
@@ -58,7 +58,7 @@ steps:
     commands: |
       bash .buildkite/scripts/run_models.sh redshift
-  - label: ":bricks: Run Tests - Databricks"
+  - label: ":databricks: Run Tests - Databricks"
     key: "run_dbt_databricks"
     plugins:
       - docker#v3.13.0:
@@ -69,5 +69,21 @@ steps:
           - "CI_DATABRICKS_DBT_HOST"
           - "CI_DATABRICKS_DBT_HTTP_PATH"
           - "CI_DATABRICKS_DBT_TOKEN"
+          - "CI_DATABRICKS_DBT_CATALOG"
     commands: |
-      bash .buildkite/scripts/run_models.sh databricks
+      bash .buildkite/scripts/run_models.sh databricks
+  - label: ":databricks: :database: Run Tests - Databricks SQL Warehouse"
+    key: "run_dbt_databricks_sql"
+    plugins:
+      - docker#v3.13.0:
+          image: "python:3.8"
+          shell: [ "/bin/bash", "-e", "-c" ]
+          environment:
+            - "BASH_ENV=/tmp/.bashrc"
+            - "CI_DATABRICKS_DBT_HOST"
+            - "CI_DATABRICKS_SQL_DBT_HTTP_PATH"
+            - "CI_DATABRICKS_SQL_DBT_TOKEN"
+            - "CI_DATABRICKS_DBT_CATALOG"
+    commands: |
+      bash .buildkite/scripts/run_models.sh databricks-sql
```
10 changes: 10 additions & 0 deletions .buildkite/scripts/run_models.sh
```diff
@@ -16,9 +16,19 @@ db=$1
 echo `pwd`
 cd integration_tests
 dbt deps
+if [ "$db" = "databricks-sql" ]; then
+    dbt seed --vars '{amplitude_schema: amplitude_sqlw_tests}' --target "$db" --full-refresh
+    dbt compile --vars '{amplitude_schema: amplitude_sqlw_tests}' --target "$db"
+    dbt run --vars '{amplitude_schema: amplitude_sqlw_tests}' --target "$db" --full-refresh
+    dbt test --vars '{amplitude_schema: amplitude_sqlw_tests}' --target "$db"
+    dbt run --vars '{amplitude_schema: amplitude_sqlw_tests}' --target "$db"
+    dbt test --vars '{amplitude_schema: amplitude_sqlw_tests}' --target "$db"
+else
 dbt seed --target "$db" --full-refresh
 dbt compile --target "$db"
 dbt run --target "$db" --full-refresh
 dbt test --target "$db"
 dbt run --target "$db"
 dbt test --target "$db"
+fi
 dbt run-operation fivetran_utils.drop_schemas_automation --target "$db"
```
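The script dispatches on a single positional target argument: SQL Warehouse runs get a dedicated test schema via `--vars`, everything else uses the default. That dispatch can be sketched as a tiny helper (hypothetical — the actual script inlines the logic):

```shell
#!/usr/bin/env bash
# Return the --vars override for a given dbt target.
# SQL Warehouse runs write into a separate schema so they do not
# collide with the all-purpose-cluster test run against the same host.
schema_vars() {
    local db="$1"
    if [ "$db" = "databricks-sql" ]; then
        echo "{amplitude_schema: amplitude_sqlw_tests}"
    else
        echo "{}"
    fi
}

# Usage: dbt seed --vars "$(schema_vars "$db")" --target "$db" --full-refresh
```

Note the run/test pair is executed twice per target — once after a `--full-refresh` and once incrementally — which is what exercises the new incremental logic end to end.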
28 changes: 28 additions & 0 deletions CHANGELOG.md
# dbt_amplitude v0.5.0
[PR #19](https://github.com/fivetran/dbt_amplitude/pull/19) includes the following updates:

## Breaking Changes
Users should perform a `--full-refresh` when upgrading to ensure all changes are applied correctly. This includes updates to unique key generation, materialization, and incremental strategies, which may affect existing records.

- Revised `unique_key` generation for `amplitude__event_enhanced` using `unique_event_id` and `unique_event_type_id` to prevent duplicate records.
- The unique key was previously generated from `unique_event_id` and `event_day`, which caused duplicate keys for some users and prevented incremental runs.
- Made the `int_amplitude__date_spine` materialization ephemeral to reduce the number of tables and simplify incremental model dependencies.
- Updated incremental loading strategies:
  - **BigQuery** and **Databricks All-Purpose Clusters**: `insert_overwrite`, for compute efficiency.
    - For **Databricks SQL Warehouses**, models are materialized as tables rather than incrementally, since the `insert_overwrite` strategy is not supported on that runtime.
  - **Snowflake**, **Redshift**, and **Postgres**: `delete+insert`.
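In dbt, this kind of per-warehouse dispatch usually lives in the model's config block. A rough, illustrative sketch only — not the package's exact code, and the `amplitude.is_incremental_compatible()` call and partition column are assumptions based on the notes above:

```sql
{{
    config(
        materialized='incremental' if amplitude.is_incremental_compatible() else 'table',
        unique_key=['unique_event_id', 'unique_event_type_id'],
        incremental_strategy='insert_overwrite'
            if target.type in ('bigquery', 'databricks', 'spark')
            else 'delete+insert',
        partition_by={'field': 'event_day', 'data_type': 'date'}
            if target.type == 'bigquery' else ['event_day']
    )
}}
```

The composite `unique_key` is what resolves the duplicate-record issue: two rows sharing an `event_day` no longer collide once `unique_event_type_id` is part of the key.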

## Features
- Added a default 3-day lookback period for incremental models to handle late-arriving records. Customize the lookback duration by setting the `lookback_window` variable in `dbt_project.yml`. For more information, refer to the [Lookback Window section of the README](https://github.com/fivetran/dbt_amplitude/blob/main/README.md#lookback-window).
- Added the `amplitude_lookback` macro to simplify lookback calculations across models.
- Changed the data type of `session_started_at` and `session_ended_at` in the `amplitude__sessions` model from `timestamp` to `date` to support incremental calculations.
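The lookback behavior described above is the standard incremental guard: on incremental runs, reload everything within `lookback_window` days of the latest loaded record. A generic sketch of the pattern (column name illustrative — this is not the package's exact filter):

```sql
{% if is_incremental() %}
where event_day >= (
    select {{ dbt.dateadd('day', -var('lookback_window', 3), 'max(event_day)') }}
    from {{ this }}
)
{% endif %}
```

Paired with a deterministic `unique_key`, re-scanning the trailing window upserts late-arriving rows instead of duplicating them.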

## Documentation updates
- Updated outdated or missing field definitions in dbt documentation.

## Under the hood
- Adjusted the `event_time` field in the `event_data` seed file to ensure records are not automatically excluded during test runs.
- Added consistency tests for end models.
- Added a new macro `is_incremental_compatible()` to identify if the Databricks SQL Warehouse runtime is being used. This macro returns `false` if the runtime is SQL Warehouse, and `true` for any other Databricks runtime or supported destination.
- Added testing for Databricks SQL Warehouses.
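The `is_incremental_compatible()` macro can only be sketched here by assumption. One plausible implementation keys off how the Databricks connection identifies itself — the detection logic in the actual package may differ:

```sql
{% macro is_incremental_compatible() %}
    {% if target.type == 'databricks' %}
        {# Assumption: SQL Warehouse connections are identifiable from the
           target's http_path, which routes through /warehouses/ endpoints. #}
        {{ return('warehouses' not in target.http_path) }}
    {% else %}
        {{ return(true) }}
    {% endif %}
{% endmacro %}
```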

# dbt_amplitude v0.4.0

## Breaking Changes
25 changes: 23 additions & 2 deletions README.md
@@ -49,13 +49,23 @@
dispatch:
search_order: ['spark_utils', 'dbt_utils']
```
### Database Incremental Strategies
This package's incremental models are configured to leverage the different incremental strategies available in each supported warehouse.

For **BigQuery** and **Databricks All-Purpose Cluster runtime** destinations, we have chosen `insert_overwrite` as the default strategy, which benefits from the partitioning capability.
> For **Databricks SQL Warehouse** destinations, models are materialized as tables without support for incremental runs.

For **Snowflake**, **Redshift**, and **Postgres** databases, we have chosen `delete+insert` as the default strategy.
> Regardless of strategy, we recommend that users periodically run a `--full-refresh` to ensure a high level of data quality.
### Step 2: Install the package
Include the following Amplitude package version in your `packages.yml` file:
> TIP: Check [dbt Hub](https://hub.getdbt.com/) for the latest installation instructions, or [read the dbt docs](https://docs.getdbt.com/docs/package-management) for more information on installing packages.
```yaml
packages:
- package: fivetran/amplitude
    version: [">=0.5.0", "<0.6.0"] # we recommend using ranges to capture non-breaking changes automatically
```

Do NOT include the `amplitude_source` package in this file. The transformation package itself has a dependency on it and will install the source package as well.
@@ -86,7 +96,18 @@
If you adjust the date range variables, we recommend running `dbt run --full-refresh` to ensure no data quality issues within the adjusted date range.
### (Optional) Step 5: Additional configurations
<details open><summary>Expand/collapse configurations</summary>

#### Lookback Window
Records from the source can sometimes arrive late. Since several of the models in this package are incremental, by default we look back 3 days when loading new records to ensure late arrivals are captured, avoiding the need for frequent full refreshes. While this reduces how often a full refresh is needed, we still recommend running `dbt run --full-refresh` periodically to maintain the data quality of the models.

To change the default lookback window, add the following variable to your `dbt_project.yml` file:

```yml
vars:
amplitude:
lookback_window: number_of_days # default is 3
```

#### Change source table references
If an individual source table has a different name than the package expects, add the table name as it appears in your destination to the respective variable:
2 changes: 1 addition & 1 deletion dbt_project.yml
```diff
@@ -1,6 +1,6 @@
 config-version: 2
 name: 'amplitude'
-version: '0.4.0'
+version: '0.5.0'
 require-dbt-version: [">=1.3.0", "<2.0.0"]
 models:
   amplitude:
```
2 changes: 1 addition & 1 deletion docs/catalog.json

Large diffs are not rendered by default.
