Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for mzMLb #113

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

mobiusklein
Copy link

This PR adds a new optional dependency on mzdata to read the mzMLb file format.

mzMLb support was recently added to ProteoWizard, and it provides fast random access to compressed data, and comparable sequential access speed to mzML.

Building with the mzMLb feature statically links the HDF5 library with the binary, so there is no impact on the end-user beyond the increased binary size, but it does require that the HDF5 library is available at build time. The tests were done with libhdf5 10.3 which can be gotten via the system package manager on Linux systems and using vcpkg on Windows.

@lazear
Copy link
Owner

lazear commented Jan 13, 2024

I appreciate the PR! Unfortunately relying on an external library to be installed on the system is a non-starter. Not requiring any external dependencies (e.g. having a fully portable binary) is one of the primary goals of Sage.

@lazear
Copy link
Owner

lazear commented Jan 13, 2024

I somehow missed the part about static linking - in that case, it could be possible.

How big are the binaries? Is there actually any benefit to adding mzMLb support? Is anyone actually using the format? Do other tools support it?

If enough people actually are using mzMLb, I will merge it in... but I am loathe to support yet another XML based format and complicate the build process unless it will actually get used by more than 1 person.

@mobiusklein
Copy link
Author

libhdf5 itself is around 10MB, but after enabling feature flag, the final artefact when compiled increased in size by only 0.5MB for sage when built in release mode.

At the moment, we don't know how many people are using mzMLb since it's a bit of a chicken vs. egg problem. If tools don't support it, then nobody would bother using it, and if nobody uses it tools won't bother supporting it. There's a forthcoming Comet release with mzMLb support in MSToolkit, ProteoWizard supports it so presumably Skyline supports it. If Sage supports it, then there will be more motivation for users who aren't just deleting their mzML files after running a search against them to convert to mzMLb instead. Any new format has to overcome the inertia of existing formats to get people to use it.

If you still don't want to add the optional dependency to your codebase, that's fine too. Getting the build time dependencies working on CI can be aggravating, especially on Windows because vcpkg seemed to have been designed to be all plumbing and no porcelain. And then mzdata is still not well exercised beyond files generated by msconvert and my own toolchains.

There's a poll going around discussing a community-driven design of a new format at https://qualtricsxm5l7r46fqr.qualtrics.com/jfe/form/SV_esUoW2IDzZalb6u and any feedback about what features it needs to have would be appreciated. If you're still working on mz_parquet, that experience could be useful for overcoming some of the concerns people might have using Parquet as a container. I know I didn't like it because I A) didn't understand how record shredding and compression would interact, B) didn't like the idea of columns being mostly null and C) hadn't learned that the Arrow C++ library made appending and scanning easier in many languages.

@lazear
Copy link
Owner

lazear commented Jan 16, 2024

Even 10 MB isn't bad - just want to make sure it's not ballooning to hundreds of MB. Let me test out the fork with mzMLb support and see how it works.

I think I am most likely the person that compiles sage the most - and I regularly swap between windows, macOS and linux just to make my life more difficult. I will personally probably never use mzMLb and don't particularly feel like having to install vcpkg et al just to compile sage. But this shouldn't be a gating factor for the project if other people will use it - perhaps we could gate it behind a compile-time feature flag that is only turned on during GH Actions or if users want to manually. I think this is what you've done, right?

@mobiusklein
Copy link
Author

That's correct. The mzmlb feature flag on sage-cli turns on the mzdata flag in sage-cloudpath which pulls in mzdata and its mzmlb feature enabled which in turn pulls in hdf5-rs which requires libhdf5 to build. If you don't turn on those features, you don't need mzdata et al or libhdf5.

@mobiusklein
Copy link
Author

There's a detail I missed about the linking in hdf5-rs. I'll be updating this PR soon.

@mobiusklein
Copy link
Author

I've updated the PR to build the hdf5 crate with the required feature to make it actually get static linkage. It will compile the HDF5 library from source during build time with CMake, so the HDF5 shared library doesn't need to be present. The resulting binary is ~3.5MB heavier than the original without the feature enabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants