Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version 1.0.0 release PR - "Ardent adenine" #10

Merged
merged 77 commits into from
Aug 16, 2023
Merged

Version 1.0.0 release PR - "Ardent adenine" #10

merged 77 commits into from
Aug 16, 2023

Conversation

MatthiasZepper
Copy link
Member

@MatthiasZepper MatthiasZepper commented Jul 18, 2023

This pull request represents an essentially full rewrite of umi-transfer by Johannes Alneberg (@alneberg) and me (@MatthiasZepper).

New and Improved Features:

  • Code organization: The code base has been split into separate files, with each file representing a subcommand and its associated CLI configuration. This improves clarity and allows for easy integration of additional subcommands and functionalities in the future.
  • Enhanced CLI options: The CLI arguments have been revamped for improved usability. Previously, specifying the output directly was not possible, hindering the creation of a nf-core Nextflow module. Specifying an output is still optional, but now the output file names are derived from the input file names rather than from a constant base provided as CLI argument. Furthermore, the delimiter used to join the UMIs can be customized now. The --edit_nr flag has been renamed to --correct_numbers and applies to both files for better consistency.
  • Improved output file handling: The output file name will automatically include a .gz suffix if the -z/--compress flag is enabled. Conversely, an eventual suffix will be removed if no compression was requested. Additionally, the tool verifies that the output file does not exist yet and prompts for overwrite confirmation (unless -f/--force is specified).
  • Enhanced error handling: Functions have been rewritten to utilize Results and Options, enabling proper error handling. Before, many functions simply panicked and the program crashed, for example if a non-existing input file was specified.
  • UMI ID validation: The tool now compares the ID of the UMI to that of the read, ensuring that the tool terminates upon encountering a mismatch. This prevents incorrect UMIs from being added to the read IDs due to differently sorted files.
  • Automated tests: Several unit tests and extensive integration tests have been implemented to enhance the reliability of the tool.
  • Continuous integration pipelines: The CI pipelines have been refactored, and a new release pipeline builds the tool for seven common architectures.

Discontinued Previous Features:

  • Support for inline UMIs: The previous inline functionality for transferring fixed-length UMIs was limited and did not support offsets or regular expressions. Since there are existing tools like umitools that already serve this purpose, we decided to prioritize the development of novel functionality. However, the new subcommand structure in the code paves the way for future support of inline UMIs.
  • Progress bar: The progress bar provided a helpful visual aid, but it required counting one of the files to determine the total number of reads, resulting in the need to read the file twice. Considering performance reasons, we made the decision to remove this feature, especially since most runs are expected to be non-interactive in workflow systems like Nextflow.
  • Multi-threading: In the previous version (0.1) of umi-transfer, it was possible to run the tool on two cores when processing paired FastQ files, with each file assigned to a separate thread. However, the tool's performance was primarily limited by output compression, and multi-threading caused significant overhead. A future version of umi-transfer will be designed to run fully asynchronous and efficiently scale over multiple threads. In the meantime, we recommend utilizing FIFOs and external compression with tools like pigz.
  • Support for singletons: To simplify the code structure, we made the second FastQ file mandatory. For running on singletons, you can provide the same input twice and redirect one of the output files to /dev/null using a FIFO.

alneberg and others added 30 commits April 13, 2023 11:05
Make Johannes' simplification the base for the new dev.
Some amendments to Johannes simplification
…mented an output overwrite check and prompt.
…odify the input file names if no output file names were given.
… for the ReadFile enum by introducing a <Box> around the compressed input. Also used a BufReader for the plain text file.
@github-advanced-security
Copy link

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

src/umi_errors.rs Fixed Show fixed Hide fixed
@MatthiasZepper MatthiasZepper changed the title Version 0.2 release Version 0.2.0 release PR - "Ardent adenine" Jul 18, 2023
@MatthiasZepper MatthiasZepper changed the title Version 0.2.0 release PR - "Ardent adenine" Version 1.0.0 release PR - "Ardent adenine" Jul 27, 2023
src/file_io.rs Fixed Show fixed Hide fixed
@MatthiasZepper
Copy link
Member Author

During code review, I'd specifically appreciate feedback on the arguments --in and --in2. Should it rather be --in1 instead of --in for consistency?

--in and --in2 were adapted from BBTools, still one of my most favourite tool suites. For a tool that supports singletons and may only take one input, --in seems preferable, but to simplify the code structure, we made the second FastQ file mandatory in umi-transfer. Accordingly, --in1 and --in2 seem the more appropriate choice.

On the other hand, there are vague plans to rewrite it in an entirely asynchronous manner for better performance and with an optional --in2. In that sense, --in appears more future-proof.

README.md Outdated Show resolved Hide resolved
Copy link
Member

@alneberg alneberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Especially impressed you got around to write so many tests!

@MatthiasZepper
Copy link
Member Author

Version 1.0.0 - "Ardent adenine" ready to be released!

@MatthiasZepper MatthiasZepper merged commit 1835689 into main Aug 16, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants