- Objective
- Note
- Pre-requisite
- Set-up
- Make New Jobs
- Deployment
- Run Test & Coverage
- Documentation
- Reference
- Common Errors
## Objective

- To run the same ETL code in multiple cloud services based on your preference, thus saving development time.
- To develop ETL scripts that run in different environments and clouds.
## Note

- This repository currently supports Azure Databricks and AWS Glue.
- Azure Databricks can't be configured locally; we can only connect a local IDE to a running cluster in Databricks. It works by pushing code to a GitHub repository and then adding a workflow in Databricks with the URL of the repo & file.
- For AWS Glue we will set up a local environment using the Glue Docker image or a shell script, then deploy it to AWS Glue using GitHub Actions (a Docker sketch follows this list).
- The `tasks.txt` file contains the details of the transformations done in the main file.
## Pre-requisite

- Python 3.7 with pip
- AWS CLI configured locally
- Java 8, with `JAVA_HOME` exported:

  ```bash
  # Make sure to export JAVA_HOME like this:
  export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home
  ```
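A quick sanity check for the prerequisites (version output will vary):

```bash
python3 --version              # expect Python 3.7.x
aws sts get-caller-identity    # succeeds only if the AWS CLI is configured
java -version                  # expect a 1.8.x JDK
echo $JAVA_HOME                # should point at your JDK 8 install
```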
## Set-up

- Clone this repo (for Windows, use WSL).

- For setting up the required libraries and packages locally, run:

  ```bash
  # If your default shell is zsh, use
  make setup-glue-local SOURCE_FILE_PATH=~/.zshrc
  # If your default shell is bash, use
  make setup-glue-local SOURCE_FILE_PATH=~/.bashrc
  ```
- Source your shell profile:

  ```bash
  # For zsh
  source ~/.zshrc
  # For bash
  source ~/.bashrc
  ```
- Install dependencies:

  ```bash
  make install
  ```
- Enter your S3 & ADLS paths in the `app/.custom_env` file. This file will be used by Databricks.

- Similarly, we'll make a `.env` file in the root folder for local Glue. To create the required file, run:

  ```bash
  make glue-demo-env
  ```

  This command copies your paths from `app/.custom_env` to the `.env` file.
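  For illustration, `app/.custom_env` (and, after the copy, `.env`) might contain lines like these; the variable names here are placeholders for whatever your jobs actually read:

  ```bash
  # app/.custom_env -- non-sensitive paths only (hypothetical variable names)
  S3_SOURCE_PATH=s3://your-bucket/raw/
  ADLS_SOURCE_PATH=abfss://container@youraccount.dfs.core.windows.net/raw/
  ```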
- (Optional) If you want to extract data from Kaggle, enter `KAGGLE_KEY` & `KAGGLE_USERNAME` in the `.env` file only. Note: don't enter any sensitive keys in the `app/.custom_env` file.
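  For example, with placeholder values:

  ```bash
  # .env -- stays local; sensitive keys do not belong in app/.custom_env
  KAGGLE_USERNAME=your-kaggle-username
  KAGGLE_KEY=your-kaggle-api-key
  ```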
- Finally, check that everything is working correctly by running:

  ```bash
  gluesparksubmit jobs/demo.py
  ```

  Ensure "Execution Complete" is printed.
## Make New Jobs

Write your jobs in the `jobs` folder. Refer to the `demo.py` file; one worked example is the `jobs/main.py` file. A hypothetical skeleton for a new job is sketched below.
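The skeleton below shows one plausible shape for a new job; the environment-variable name is a placeholder, and how the variables reach the job (exported shell variables locally, job parameters in the cloud) depends on your setup:

```python
# jobs/my_job.py -- hypothetical skeleton for a new job
import os

from awsglue.context import GlueContext
from pyspark.context import SparkContext


def main():
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Read the input location from the environment (placeholder name);
    # locally this comes from .env, in the cloud from job configuration.
    source_path = os.environ["S3_SOURCE_PATH"]

    df = spark.read.json(source_path)
    df.printSchema()

    print("Execution Complete")


if __name__ == "__main__":
    main()
```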
## Deployment

- Set up a GitHub Action for AWS Glue. Make sure to pass the following secrets in your repository:

  - `AWS_ACCESS_KEY_ID`
  - `AWS_SECRET_ACCESS_KEY`
  - `S3_BUCKET_NAME`
  - `S3_SCRIPTS_PATH`
  - `AWS_REGION`
  - `AWS_GLUE_ROLE`

  Also pass all the remaining key-value pairs that you entered in the `.env` file; make sure to pass them via the `automation/deploy_glue_jobs.sh` file.
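  A sketch of such a workflow is below; the file name, trigger, and checkout action version are assumptions, and the secrets map to the list above:

  ```yaml
  # .github/workflows/deploy-glue.yml -- hypothetical deployment workflow
  name: Deploy Glue Jobs
  on:
    push:
      branches: [main]
  jobs:
    deploy:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v3
        - name: Upload scripts and create/update Glue jobs
          env:
            AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
            AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
            AWS_REGION: ${{ secrets.AWS_REGION }}
            S3_BUCKET_NAME: ${{ secrets.S3_BUCKET_NAME }}
            S3_SCRIPTS_PATH: ${{ secrets.S3_SCRIPTS_PATH }}
            AWS_GLUE_ROLE: ${{ secrets.AWS_GLUE_ROLE }}
          run: bash automation/deploy_glue_jobs.sh
  ```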
- For Azure Databricks, make a workflow with the link to your repo & main file. Pass the following parameters with their correct values:

  - `kaggle_username`
  - `kaggle_token`
  - `storage_account_name`
  - `datalake_access_key`
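Assuming the main file runs as a Databricks notebook task (where `dbutils` and `spark` are provided by the runtime), the parameters can be read and used roughly like this sketch:

```python
# Sketch: read workflow parameters inside Databricks (notebook task assumed;
# dbutils and spark are globals provided by the Databricks runtime).
kaggle_username = dbutils.widgets.get("kaggle_username")
kaggle_token = dbutils.widgets.get("kaggle_token")
storage_account_name = dbutils.widgets.get("storage_account_name")
datalake_access_key = dbutils.widgets.get("datalake_access_key")

# Authenticate Spark against ADLS Gen2 with the account access key.
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    datalake_access_key,
)
```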
## Run Test & Coverage

To run tests & generate a coverage report, run the following commands in the root folder of the project:

```bash
make test

# To see the coverage report
make coverage-report
```
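For reference, a transformation test in this style might look like the following sketch; the module and function names are hypothetical:

```python
# tests/test_transformations.py -- hypothetical example test
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared across the test session.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_filter_keeps_only_matching_rows(spark):
    df = spark.createDataFrame([(1, "keep"), (2, "drop")], ["id", "label"])
    result = df.filter(df.label == "keep").collect()
    assert len(result) == 1
    assert result[0]["id"] == 1
```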