Feat: update documentation #12

Merged 11 commits on Feb 13, 2024

Changes from all commits
26 changes: 26 additions & 0 deletions Makefile
@@ -0,0 +1,26 @@
# all targets are tasks, not files
.PHONY: setup-glue-local glue-demo-env install type-check lint test coverage-report

setup-glue-local:
chmod +x automation/glue_setup.sh
. automation/glue_setup.sh $(SOURCE_FILE_PATH)

glue-demo-env:
cp app/.custom_env .env

install:
pip3 install -r requirements.txt

type-check:
mypy ./ --ignore-missing-imports

lint:
pylint app tests jobs setup.py

# note: each Make recipe line runs in its own shell, so the mock Kaggle
# credentials are set inline rather than via separate export lines
test:
	KAGGLE_KEY=MOCKKEY KAGGLE_USERNAME=MOCKUSERNAME coverage run --source=app -m unittest discover -s tests

coverage-report:
coverage report
coverage html
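
Taken together, these targets give a local workflow roughly like the following (a sketch assuming a zsh shell; substitute `~/.bashrc` for bash):

```bash
make setup-glue-local SOURCE_FILE_PATH=~/.zshrc  # one-time local Glue setup
source ~/.zshrc                                  # pick up the exported variables
make glue-demo-env                               # create the root .env from app/.custom_env
make install                                     # install Python dependencies
make lint type-check                             # static analysis
make test coverage-report                        # run tests, then view coverage
```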
112 changes: 77 additions & 35 deletions README.md
@@ -1,73 +1,115 @@
# Multi-cloud ETL Pipeline

## Objective

- To run the same ETL code in multiple cloud services based on your preference, thus saving time.
- To develop ETL scripts for different environments and clouds.

## Note

- This repository currently supports Azure Databricks + AWS Glue.
- Azure Databricks can't be configured locally; you can only connect your local IDE to a running cluster in Databricks. It works by pushing code to a GitHub repository and then adding a workflow in Databricks with the URL of the repo & file.
- For AWS Glue, we set up a local environment using the Glue Docker image, then deploy it to AWS Glue using GitHub Actions.
- The "tasks.txt" file contains the details of the transformations done in the main file.

## Pre-requisites

1. [Python 3.7 with pip](https://www.python.org/downloads/)
2. [AWS CLI configured locally](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html)
3. [Java 8](https://www.oracle.com/in/java/technologies/downloads/#java8-mac)

```bash
# Make sure to export JAVA_HOME, for example:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home
```
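
Before continuing, it may help to confirm the toolchain is in place; a quick sanity check could look like:

```bash
python3 --version            # expect a Python 3.7.x release
aws sts get-caller-identity  # verifies the AWS CLI credentials are configured
java -version                # expect a 1.8.x (Java 8) runtime
echo $JAVA_HOME              # should point at the JDK 8 home exported above
```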


## Quick Start

1. Clone this repo _(for Windows use WSL)_.

2. To set up the required libraries and packages locally, run:

```bash
# If your default shell is zsh, use
make setup-glue-local SOURCE_FILE_PATH=~/.zshrc

# If your default shell is bash, use
make setup-glue-local SOURCE_FILE_PATH=~/.bashrc
```

3. Source your shell profile:

```bash
# For zsh
source ~/.zshrc

# For bash
source ~/.bashrc
```

4. Install dependencies:

```bash
make install
```

## Change Your Paths

1. Enter your S3, ADLS & Kaggle (optional) paths in the ```app/.custom_env``` file. This file will be used by Databricks.

2. Similarly, we'll make a ```.env``` file in the root folder. This file will be used by the local Glue job. To create it, run:

```bash
make glue-demo-env
```

This command copies the paths from ```app/.custom_env``` into the ```.env``` file.

3. _(Optional)_ If you want to extract data from Kaggle, enter KAGGLE_KEY & KAGGLE_USERNAME in the ```.env``` file only. Note: don't put any sensitive keys in the ```app/.custom_env``` file.

## Setup Check

Finally, check that everything is working correctly by running:

```bash
gluesparksubmit jobs/demo.py
```

Ensure "Execution Complete" is printed.

## Make New Jobs

Write your jobs in the ```jobs``` folder. Refer to the ```demo.py``` file; one full example is the ```jobs/main.py``` file.
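
As a rough starting point, a minimal sketch of a new job is below. Only the ```load_dotenv``` pattern and the path variables come from this PR; the SparkSession wiring and the CSV/Parquet choices are illustrative assumptions, not the repo's exact helpers.

```python
# jobs/my_job.py — minimal sketch of a new job (hypothetical file)
import os

from dotenv import load_dotenv
from pyspark.sql import SparkSession

load_dotenv("../app/.custom_env")  # Loading env for Databricks
load_dotenv()                      # Loading env for Glue

spark = SparkSession.builder.appName("my-job").getOrCreate()

# GLUE_READ_PATH / GLUE_WRITE_PATH are defined in the root .env file
read_path = os.getenv("GLUE_READ_PATH", "")
write_path = os.getenv("GLUE_WRITE_PATH", "")

df = spark.read.csv(read_path, header=True)     # extract raw data
df.write.mode("overwrite").parquet(write_path)  # write transformed output

print("Execution Complete")
```

Run it locally with ```gluesparksubmit jobs/my_job.py```, just like the setup check above.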

## Deployment

1. Set up a GitHub Action for AWS Glue. Make sure to pass the following secrets in your repository:

```
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
S3_BUCKET_NAME
S3_SCRIPTS_PATH
AWS_REGION
AWS_GLUE_ROLE
```

For the rest of the key-value pairs you entered in the ```.env``` file, make sure to pass them using the ```automation/deploy_glue_jobs.sh``` file (a sketch follows this list).

2. For Azure Databricks, make a workflow with the link to your repo & main file. Pass the following parameters with their correct values:

```
kaggle_username
kaggle_token
storage_account_name
datalake_access_key
```
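
As a sketch (the ```automation/deploy_glue_jobs.sh``` script itself isn't shown in this PR, so the exact interface is an assumption), forwarding the remaining ```.env``` values could look like:

```bash
# hypothetical invocation; adjust to whatever deploy_glue_jobs.sh actually expects
GLUE_READ_PATH="s3://your-bucket/rawdata/" \
GLUE_WRITE_PATH="s3://your-bucket/transformed/" \
bash automation/deploy_glue_jobs.sh
```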

## Run Tests & Coverage Report

To run the tests & coverage report, run the following commands in the root folder of the project:

```bash
make test

# To see the coverage report
make coverage-report
```

## References

[Glue Programming libraries](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html)

Note that the AWS Glue libraries are not available for download, so use the AWS Glue 4 Docker container.
5 changes: 4 additions & 1 deletion app/.custom_env
@@ -1,4 +1,5 @@
# this is the env file for paths, read only on Databricks
# for local Glue, make a similar one in the root named ".env"

GLUE_READ_PATH="s3://glue-bucket-vighnesh/rawdata/"
GLUE_WRITE_PATH="s3://glue-bucket-vighnesh/transformed/"
@@ -7,3 +8,5 @@
DATABRICKS_READ_PATH="/mnt/rawdata/"
DATABRICKS_WRITE_PATH="/mnt/transformed/"

KAGGLE_PATH="mastmustu/insurance-claims-fraud-data"

# Give KAGGLE_KEY & KAGGLE_USERNAME Below
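
For reference, the matching root ```.env``` for a local Glue run might look like this sketch (the values are placeholders; keep real credentials out of version control):

```bash
GLUE_READ_PATH="s3://glue-bucket-vighnesh/rawdata/"
GLUE_WRITE_PATH="s3://glue-bucket-vighnesh/transformed/"

DATABRICKS_READ_PATH="/mnt/rawdata/"
DATABRICKS_WRITE_PATH="/mnt/transformed/"

KAGGLE_PATH="mastmustu/insurance-claims-fraud-data"

KAGGLE_KEY="<your-kaggle-api-key>"        # placeholder
KAGGLE_USERNAME="<your-kaggle-username>"  # placeholder
```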
51 changes: 51 additions & 0 deletions automation/glue_setup.sh
@@ -0,0 +1,51 @@
# Parameter 1 --> Shell profile path
SOURCE_FILE=$1
echo $SOURCE_FILE

echo -e "FIRST RUN TIME ESTIMATION: 30-45 MINS\nPlease do NOT exit"

export PROJECT_ROOT=$(pwd)

# Doing all the work in separate folder "glue-libs"
cd ~
mkdir -p glue-libs  # -p: don't fail if the folder already exists on a re-run
cd glue-libs

# Clone AWS Glue Python Lib
git clone https://github.com/awslabs/aws-glue-libs.git
export AWS_GLUE_HOME=$(pwd)/aws-glue-libs

# Install Apache Maven
curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz -o apache-maven-3.6.0-bin.tar.gz
tar -xvf apache-maven-3.6.0-bin.tar.gz
ln -s apache-maven-3.6.0 maven
export MAVEN_HOME=$(pwd)/maven

# Install Apache Spark
curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz -o spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
tar -xvf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
ln -s spark-3.1.1-amzn-0-bin-3.2.1-amzn-3 spark
export SPARK_HOME=$(pwd)/spark

# Export Path
export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin
export PYTHONPATH=$PROJECT_ROOT

# Download Glue ETL .jar files
cd $AWS_GLUE_HOME
chmod +x bin/glue-setup.sh
./bin/glue-setup.sh
mvn install dependency:copy-dependencies
cp $AWS_GLUE_HOME/jarsv1/AWSGlue*.jar $SPARK_HOME/jars/
cp $AWS_GLUE_HOME/jarsv1/aws*.jar $SPARK_HOME/jars/

echo "export AWS_GLUE_HOME=$AWS_GLUE_HOME
export MAVEN_HOME=$MAVEN_HOME
export SPARK_HOME=$SPARK_HOME
export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin
export PYTHONPATH=$PROJECT_ROOT" >> $SOURCE_FILE


cd $PROJECT_ROOT

echo -e "\nGLUE LOCAL SETUP COMPLETE"
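
Note that this script is meant to be sourced (which is what the Makefile's ```setup-glue-local``` target does) so its exports land in the current shell; running it as a plain subprocess would lose them. For example:

```bash
# run from the project root; ~/.zshrc is an example profile path
. automation/glue_setup.sh ~/.zshrc
```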
3 changes: 2 additions & 1 deletion jobs/demo.py
@@ -2,7 +2,8 @@
from dotenv import load_dotenv
import app.environment as env

load_dotenv("../app/.custom-env")
load_dotenv("../app/.custom_env") # Loading env for databricks
load_dotenv() # Loading env for glue

# COMMAND ----------

3 changes: 2 additions & 1 deletion jobs/main.py
@@ -9,7 +9,8 @@
import app.environment as env
import app.spark_wrapper as sw

load_dotenv("../app/.custom_env")
load_dotenv("../app/.custom_env") # Loading env for databricks
load_dotenv() # Loading env for glue

# COMMAND ----------

10 changes: 5 additions & 5 deletions requirements.txt
@@ -1,6 +1,6 @@
mypy
pylint
coverage
python-dotenv
kaggle~=1.5.16
pre-commit