Bundle uploads can currently take a significant amount of time (on the order of tens of seconds) for larger repos. Much of that time is spent checking whether each file being uploaded to the bundle matches a glob pattern in a chefignore file.
The time complexity of the chefignore check ends up being roughly O(2n*m), where n is the number of files in the repo and m is the number of glob patterns in the chefignore file. This is only approximate, since chefignore can be overridden in subdirectories, which makes m variable.
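For intuition, the core of the check amounts to testing every file against every glob pattern, roughly what the chef gem's `Chef::Cookbook::Chefignore#ignored?` does internally (a simplified sketch; `files` and `patterns` stand in for the real inputs):

```ruby
# Simplified sketch of the per-file check: each of the n files is matched
# against each of the m glob patterns, hence the roughly O(n*m) behavior.
ignored = files.select do |file|
  patterns.any? { |pattern| File.fnmatch?(pattern, file) }
end
```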
The previous implementation serially performed a DFS of the repo, checked the chefignore file, and wrote the files to the tgz. This implementation still performs the DFS serially, but it parallelizes the chefignore check and the initial archive generation: it generates intermediate tars in tempfiles, then combines them into a single tgz. The user can specify the number of worker processes and the compression level via the "bundle_generation_processes" and "bundle_compression_level" config keys, respectively.
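A minimal sketch of that pipeline, assuming the Parallel gem for process-based fan-out and the chef gem's `Chef::Cookbook::Chefignore`; the chunking, paths, and parameter values below are hypothetical stand-ins for the real config keys, not the actual implementation:

```ruby
require "parallel"                 # assumed choice for process-based parallelism
require "tempfile"
require "zlib"
require "rubygems/package"         # provides Gem::Package::TarWriter
require "chef/cookbook/chefignore"

# Hypothetical parameters standing in for the real config keys.
processes  = 4   # "bundle_generation_processes"
gzip_level = 6   # "bundle_compression_level"

repo_root = File.expand_path("my_repo")   # hypothetical repo path
files     = Dir.glob(File.join(repo_root, "**", "*")).select { |f| File.file?(f) }
slice     = [(files.size.to_f / processes).ceil, 1].max
chunks    = files.each_slice(slice).to_a

# Each worker filters its chunk against chefignore and writes an intermediate,
# uncompressed tar to a tempfile; workers run in separate processes.
tar_paths = Parallel.map(chunks, in_processes: processes) do |chunk|
  ignore = Chef::Cookbook::Chefignore.new(repo_root)
  tmp = Tempfile.create(["bundle-part", ".tar"])
  Gem::Package::TarWriter.new(tmp) do |tar|
    chunk.each do |path|
      rel = path.delete_prefix(repo_root + "/")
      next if ignore.ignored?(rel)

      data = File.binread(path)
      tar.add_file_simple(rel, 0o644, data.bytesize) { |io| io.write(data) }
    end
  end
  tmp.close
  tmp.path
end

# Combine the intermediate tars into a single gzipped archive. Note: naively
# concatenating tars also concatenates their end-of-archive blocks; a real
# implementation would strip the trailing zero blocks from all but the last tar.
Zlib::GzipWriter.open("bundle.tgz", gzip_level) do |gz|
  tar_paths.each { |p| gz.write(File.binread(p)) }
end
```

Writing uncompressed intermediate tars sidesteps contention on a single gzip stream; the compression cost is paid once at combine time, at the level chosen via "bundle_compression_level".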
One other notable change is that rather than using just the chefignore file in the base repo, this implementation looks up the applicable chefignore file per file, which allows it to correctly pick up chefignore files in subdirectories. This has a significant performance penalty on its own, but the regression is offset by the parallelization changes.
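One way the per-file lookup could be memoized so that each chefignore is parsed at most once; the resolver class and caching strategy here are assumptions for illustration, not the actual implementation:

```ruby
require "chef/cookbook/chefignore"

# Hypothetical resolver: for each file, walk up from the file's directory to
# the repo root looking for the nearest chefignore, memoizing both the owning
# directory and the parsed Chefignore per directory.
class ChefignoreResolver
  def initialize(repo_root)
    @repo_root = repo_root
    @owners  = {}  # directory => nearest dir containing a chefignore, or nil
    @ignores = {}  # owning dir => parsed Chef::Cookbook::Chefignore
  end

  def ignored?(file_path)
    owner = owning_dir(File.dirname(file_path))
    return false unless owner

    ignore = (@ignores[owner] ||= Chef::Cookbook::Chefignore.new(owner))
    ignore.ignored?(file_path.delete_prefix(owner + "/"))
  end

  private

  def owning_dir(dir)
    return @owners[dir] if @owners.key?(dir)

    @owners[dir] =
      if File.exist?(File.join(dir, "chefignore"))
        dir
      elsif dir == @repo_root
        nil
      else
        owning_dir(File.dirname(dir))
      end
  end
end
```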
For testing, I started with a repo consisting of the cookbooks in https://github.com/facebook/chef-cookbooks. To increase the size of the repo, I generated fake ~10 KB files with:

```bash
for i in {1..n}; do base64 /dev/urandom | head -c 10000 > file$i.txt; done
```