Add DINOv2 model #334
Conversation
@joelpaulkoch it's the interpolation that causes the difference; I need to investigate more to figure out specifically why.
FTR, the primary part of the difference in the interpolation was a bug (elixir-nx/axon#554). There are still at least two other differences in the implementation (anti-aliasing and scaling); I will keep looking into this.
Regarding interpolation, I added anti-aliasing to Axon (elixir-nx/axon#555), which makes the numbers closer to PyTorch. They are still far from an exact match, because the anti-aliasing behaves slightly differently and there is a small extra scaling in the interpolation call. That said, I think we are fine at this point. The model is trained for a specific size and the interpolation is only used at inference time with a different input size, so my understanding is that as long as the interpolation does its job we should expect reasonable output. In hf/transformers the ViT interpolation is implemented in both PyTorch and TensorFlow, and the two are definitely not identical (our interpolation should basically match TensorFlow, though).
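For reference, a minimal sketch of exercising this behaviour, assuming `Axon.resize/3` exposes an `:antialias` option (as added in elixir-nx/axon#555); the exact signature and option names may differ between Axon versions:

```elixir
# Minimal sketch: bilinear resize with and without anti-aliasing.
input = Axon.input("image", shape: {nil, 16, 16, 3})
image = Nx.iota({1, 16, 16, 3}, type: :f32)

run = fn antialias? ->
  input
  |> Axon.resize({8, 8}, method: :bilinear, antialias: antialias?)
  |> Axon.predict(%{}, %{"image" => image})
end

# Anti-aliasing mostly matters when downsampling, hence the 16x16 -> 8x8 resize.
Nx.subtract(run.(true), run.(false)) |> Nx.abs() |> Nx.reduce_max() |> IO.inspect()
```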
I changed the input height/width to unspecified (`nil`).
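As an illustration (not the PR's actual code), an input with unspecified spatial dimensions looks like the following, assuming Bumblebee's channels-last `"pixel_values"` convention:

```elixir
# Spatial dimensions left as nil, so the resolution is only fixed at inference time.
input = Axon.input("pixel_values", shape: {nil, nil, nil, 3})
```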
This is great, thanks!
I extended
I actually changed it so that we only use the indices; the stage names are not really much more meaningful than the indices.

I also removed the option to reshape feature maps, in favour of always doing it. Reshaping to the flat shape is easy to do if the caller wants that (and it would be a wrapping model rather than the end user doing it anyway).

This PR is ready to ship, I will wait for elixir-nx/axon#555 and then merge :)

Thanks @joelpaulkoch :)
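For illustration only, selecting backbone feature maps by index could look roughly like the snippet below; the option name `:backbone_output_indices` is a hypothetical stand-in, not the option actually introduced in this PR:

```elixir
# Hypothetical sketch; :backbone_output_indices is an assumed option name.
{:ok, spec} = Bumblebee.load_spec({:hf, "facebook/dinov2-base"}, architecture: :backbone)
spec = Bumblebee.configure(spec, backbone_output_indices: [3, 7, 11])
{:ok, model_info} = Bumblebee.load_model({:hf, "facebook/dinov2-base"}, spec: spec)
```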
Yeah, sounds reasonable. Thank you very much for getting this into proper shape.
This is the current state of my work on DINOv2.

`facebook/dinov2-base` uses `BitImageProcessor`, so I've copied `VitFeaturizer` -> `BitFeaturizer` and made the following changes:

- `rescale_factor` (and remove `NxImage.to_continuous` in `process_batch`); a rough sketch of the rescaling follows below
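As an illustration of that change (not the PR's code; the 1/255 factor is `BitImageProcessor`'s usual default), rescaling amounts to a plain multiplication:

```elixir
# Illustrative only: rescale u8 pixel values via a configurable factor,
# instead of the fixed NxImage.to_continuous/1 conversion.
rescale_factor = 1 / 255

pixel_values =
  Nx.iota({1, 224, 224, 3}, type: :u8)
  |> Nx.as_type(:f32)
  |> Nx.multiply(rescale_factor)
```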
For the model itself I've copied `Vit` to `DinoV2` and basically changed three blocks:

**interpolation of positional encodings**
The pre-trained positional encodings must be interpolated to apply them to other image resolutions (`interpolate_position_encoding`).

For the interpolation we need the actual input size. I've hard-coded the input size to 224 and retrieved it using `Axon.get_inputs`. Is there a better way to do it?

The current implementation of the interpolation is not exactly the same as in the transformers library, so this probably introduces a difference in the calculation. Does `Axon.resize` return exactly the same result for the same input as `torch.nn.functional.interpolate`?
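For context, the general idea of interpolating patch position embeddings can be sketched as below. This is not the PR's `interpolate_position_encoding` implementation; `src_grid`, `dst_grid`, and `hidden_size` are illustrative names, and the exact `Axon.resize` options may differ:

```elixir
# Illustrative sketch: reshape the pre-trained patch position embeddings to a
# 2D grid, resize the grid bicubically, and flatten back, keeping the class
# token embedding untouched.
interpolate = fn position_embeddings, src_grid, dst_grid, hidden_size ->
  class_embedding =
    Axon.nx(position_embeddings, &Nx.slice_along_axis(&1, 0, 1, axis: 1))

  patch_embeddings =
    position_embeddings
    |> Axon.nx(&Nx.slice_along_axis(&1, 1, src_grid * src_grid, axis: 1))
    |> Axon.nx(&Nx.reshape(&1, {1, src_grid, src_grid, hidden_size}))
    |> Axon.resize({dst_grid, dst_grid}, method: :bicubic)
    |> Axon.nx(&Nx.reshape(&1, {1, dst_grid * dst_grid, hidden_size}))

  Axon.concatenate([class_embedding, patch_embeddings], axis: 1)
end
```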
**Encoding blocks**
In DINOv2 the ffn is either mlp or swiglu, depending on the configuration. I could pass the corresponding function to the blocks (a rough sketch of the swiglu variant follows below).

I still copied `blocks`, `block`, `block_impl` from `Bumblebee.Layers.Transformer`, because I needed two additional scaling layers in `block_impl`. This brings quite a lot of duplicated code into the DINOv2 implementation for a small change, so I'm wondering whether it would make sense to introduce a `block_impl` parameter to `Bumblebee.Layers.Transformer.blocks`.
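A minimal sketch of what such a swiglu-style ffn function could look like in Axon; this is one common SwiGLU formulation with made-up layer names, and the actual DINOv2 swiglu may be parameterized differently:

```elixir
defmodule DinoV2FfnSketch do
  # Illustrative SwiGLU-style feed-forward block: silu(gate) * up, then project
  # back to the hidden size.
  def swiglu_ffn(hidden_state, hidden_size, intermediate_size, opts \\ []) do
    name = opts[:name] || "ffn"

    gate = Axon.dense(hidden_state, intermediate_size, name: "#{name}.gate")
    up = Axon.dense(hidden_state, intermediate_size, name: "#{name}.up")

    Axon.activation(gate, :silu)
    |> Axon.multiply(up)
    |> Axon.dense(hidden_size, name: "#{name}.output")
  end
end
```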
**map encoder output depending on architecture**
For `:base`, in comparison to `Vit` the pooled output is simply the first token on axis 1.

For `:backbone`, you can pass a list of `stage_names` and specify which ones you want as `output_features` in the configuration. The corresponding hidden states are then included in the output. In `transformers` there is another option, `out_indices`, to specify indices instead, which I did not implement.

For the `:for_image_classification` architecture, there is a head on top of the class token and the mean of the patch embeddings (sketched roughly below).
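A rough sketch of that head, with illustrative names (`hidden_state`, `num_patches`, `num_labels` are assumptions, not the PR's code):

```elixir
# Illustrative sketch: classification head over [class token ++ mean of patch tokens].
# hidden_state: Axon node of shape {batch, 1 + num_patches, hidden_size}
build_head = fn hidden_state, num_patches, num_labels ->
  class_token =
    Axon.nx(hidden_state, fn x ->
      x |> Nx.slice_along_axis(0, 1, axis: 1) |> Nx.squeeze(axes: [1])
    end)

  patch_mean =
    Axon.nx(hidden_state, fn x ->
      x |> Nx.slice_along_axis(1, num_patches, axis: 1) |> Nx.mean(axes: [1])
    end)

  Axon.concatenate([class_token, patch_mean], axis: 1)
  |> Axon.dense(num_labels, name: "image_classification_head.output")
end
```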
I've tried to follow the naming conventions, but I could have missed them in some places.

Parts of the documentation are still just copy/paste from `transformers`.

At the moment the tests are configured to run the model from `facebook/dinov2-base` and won't pass.
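For anyone trying this branch out, a usage sketch following the usual Bumblebee entry points (assuming DINOv2 loading works end to end, which, as noted above, the tests don't yet confirm):

```elixir
# Usage sketch with the standard Bumblebee API.
{:ok, model_info} = Bumblebee.load_model({:hf, "facebook/dinov2-base"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "facebook/dinov2-base"})

# Any {height, width, 3} u8 image tensor works as input to the featurizer.
image = Nx.broadcast(Nx.tensor(128, type: :u8), {224, 224, 3})

inputs = Bumblebee.apply_featurizer(featurizer, image)
outputs = Axon.predict(model_info.model, model_info.params, inputs)
```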