PySpark Docker container based on OpenJDK and Miniconda 3. PySpark 3+ images use OpenJDK 11; PySpark 2 images use OpenJDK 8.
By default, `spark-submit --help` is run:

```bash
docker run godatadriven/pyspark
```
To run your own job, make the job accessible through a volume and pass the necessary arguments:

```bash
docker run -v /local_folder:/job godatadriven/pyspark [options] /job/<python file> [app arguments]
```
The `samples` folder contains some PySpark jobs that show how to obtain a Spark session and crunch some data. In the commands below the current directory is mapped as `/job`, so run them from the root directory of this project.
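For reference, a job of this shape can be quite small. The sketch below is a hedged illustration, not the actual contents of `samples/word_counter.py`: it obtains a Spark session, reads its own source file, and counts the words in it.

```python
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    # Obtain a Spark session; spark-submit supplies master and app name if set.
    spark = SparkSession.builder.appName("word_counter").getOrCreate()

    # Read this script itself as a text file and split each line into words.
    words = (
        spark.read.text(sys.argv[0])
        .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
        .where(F.col("word") != "")
    )

    # Count occurrences and print the most frequent words.
    words.groupBy("word").count().orderBy(F.desc("count")).show(10)

    spark.stop()
```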
```bash
# Self word counter:
docker run -v $(pwd):/job godatadriven/pyspark /job/samples/word_counter.py

# Self word counter with Spark extra options:
docker run -v $(pwd):/job godatadriven/pyspark \
    --name "I count myself" \
    --master "local[1]" \
    --conf "spark.ui.showConsoleProgress=True" \
    --conf "spark.ui.enabled=False" \
    /job/samples/word_counter.py "jobSampleArgument1"
```
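Everything after the script path is forwarded to the job itself, so `"jobSampleArgument1"` ends up in the job's `sys.argv`. A minimal, hypothetical sketch of how a job might pick it up (again, not the real sample):

```python
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # spark-submit passes app arguments through verbatim, so the argument
    # from the command above arrives as sys.argv[1].
    argument = sys.argv[1] if len(sys.argv) > 1 else None
    spark = SparkSession.builder.getOrCreate()
    print(f"Running with app argument: {argument}")
    spark.stop()
```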