
Add DINOv2 model #334

Merged
jonatanklosko merged 35 commits into elixir-nx:main on Feb 21, 2024
Conversation

joelpaulkoch
Contributor

This is the current state of my work on DINOv2.

facebook/dinov2-base uses BitImageProcessor, so I've copied VitFeaturizer -> BitFeaturizer and made the following changes (a rough sketch follows the list):

  1. add configuration options
  2. support "shortest_edge" resizing (I noticed this changed with #329, Change image size to maps in image featurizers)
  3. center crop the image
  4. rescale the image with rescale_factor (and remove NxImage.to_continuous in process_batch)
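
A rough, hypothetical sketch of steps 2-4 in plain Nx (the real code lives in the featurizer; the module and function names here are made up for illustration):

```elixir
defmodule ShortestEdgeSketch do
  # Target {height, width} so that the shorter side equals `shortest_edge`,
  # preserving the aspect ratio. The actual resizing happens elsewhere.
  def resize_target({height, width}, shortest_edge) do
    if height <= width do
      {shortest_edge, round(width * shortest_edge / height)}
    else
      {round(height * shortest_edge / width), shortest_edge}
    end
  end

  # Center-crop an HWC image tensor to {size, size}, then rescale with
  # `rescale_factor` instead of calling NxImage.to_continuous.
  def crop_and_rescale(image, size, rescale_factor) do
    {height, width, channels} = Nx.shape(image)
    top = div(height - size, 2)
    left = div(width - size, 2)

    image
    |> Nx.slice([top, left, 0], [size, size, channels])
    |> Nx.multiply(rescale_factor)
  end
end

ShortestEdgeSketch.resize_target({480, 640}, 256)
#=> {256, 341}
```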

For the model itself I've copied Vit to DinoV2 and essentially changed three things:

  1. interpolation of positional encodings
    The pre-trained positional encodings must be interpolated to apply them to other image resolutions (interpolate_position_encoding).
    For the interpolation we need the actual input size. I've hard-coded the input size to 224 and retrieved it using Axon.get_inputs. Is there a better way to do this?
    The current implementation of the interpolation is not exactly the same as in the transformers library, so this probably introduces a difference in the calculation.
    Does Axon.resize return exactly the same for the same input as torch.nn.functional.interpolate?

  2. Encoding blocks
    In DINOv2 the ffn is either mlp or swiglu, depending on the configuration. I could pass the corresponding function to the blocks.
    I still copied blocks, block, block_impl from Bumblebee.Layers.Transformer because I needed two additional scaling layers in block_impl. This brings quite a bit of duplicated code into the DINOv2 implementation for a small change, so I'm wondering whether it would make sense to introduce a block_impl parameter to Bumblebee.Layers.Transformer.blocks.

  3. map encoder output depending on architecture
    For :base, in comparison to Vit, the pooled output is simply the first token on axis 1.
    For :backbone, you can pass a list of stage_names and specify which ones you want as output_features in the configuration; the corresponding hidden states are then included in the output. In transformers there is another option to specify out_indices instead, which I did not implement.
    For the :for_image_classification architecture, there is a head on top of the class token and the mean of the patch embeddings (see the sketch below this list).
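
A hypothetical sketch of that classification head, assuming a 768 hidden size, 257 tokens (class token plus a 16x16 patch grid) and 1000 labels; the layer name is made up:

```elixir
hidden_state = Axon.input("hidden_state", shape: {nil, 257, 768})

# Pooled output: the class token, i.e. the first token on axis 1.
class_token =
  Axon.nx(hidden_state, fn t ->
    t |> Nx.slice_along_axis(0, 1, axis: 1) |> Nx.squeeze(axes: [1])
  end)

# Mean of the remaining patch embeddings.
patch_mean =
  Axon.nx(hidden_state, fn t ->
    t
    |> Nx.slice_along_axis(1, Nx.axis_size(t, 1) - 1, axis: 1)
    |> Nx.mean(axes: [1])
  end)

# Classification head on top of [class token, mean of patches].
logits =
  Axon.concatenate(class_token, patch_mean)
  |> Axon.dense(1000, name: "image_classification_head.output")
```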

I've tried to follow the naming conventions, but I may have missed them in some places.
Parts of the documentation are still just copy/paste from transformers.
At the moment the tests are configured to run the model from facebook/dinov2-base and won't pass.

@jonatanklosko
Member

@joelpaulkoch it's the interpolation that causes the difference; I need to investigate more to figure out specifically why.

@jonatanklosko
Member

FTR the primary part of the difference in the interpolation was a bug (elixir-nx/axon#554). There are still at least two other differences in the implementation (anti-aliasing and scaling); I will keep looking into this.

@jonatanklosko
Member

Regarding interpolation I added anti-aliasing to Axon (elixir-nx/axon#555), which makes the numbers closer to PyTorch. They are still far from an exact match, because the anti-aliasing behaves slightly differently and there is a small extra scaling in the interpolation call. That said, I think we are fine at this point. The model is trained for a specific size and the interpolation is then used at inference when using a different input size, so my understanding is that as long as the interpolation does its job we should expect reasonable output. In hf/transformers ViT interpolation is implemented in PyTorch and Tensorflow, and they are definitely not identical (our interpolation should basically match Tensorflow though).


> For the interpolation we need the actual input size. I've hard-coded the input size to 224 and retrieved it using Axon.get_inputs. Is there a better way to do this?

I changed the input height/width to unspecified ({nil, nil, nil, spec.num_channels}). When building the model we don't know the size (Axon.get_inputs(pixel_values)["pixel_values"]), but in this case we can make all of the interpolation an Axon layer, so that we can read the input tensor shape when the model is compiled with a concrete input size :)
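
A minimal sketch of that approach, with made-up variable names and an assumed patch size of 14; the Axon.nx body here only reads the shape, standing in for the actual interpolation:

```elixir
pixel_values = Axon.input("pixel_values", shape: {nil, nil, nil, 3})

features =
  Axon.nx(pixel_values, fn pixel_values ->
    # By the time this runs, the model has been compiled with a concrete
    # input, so the shape is known, e.g. {1, 518, 518, 3}.
    {_batch, height, width, _channels} = Nx.shape(pixel_values)
    patch_size = 14
    _grid = {div(height, patch_size), div(width, patch_size)}
    # ...interpolate the position embeddings to `_grid` here...
    pixel_values
  end)

# Compiling with a concrete input fixes the nil dimensions:
{init_fn, _predict_fn} = Axon.build(features)
_params = init_fn.(%{"pixel_values" => Nx.template({1, 518, 518, 3}, :f32)}, %{})
```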

@joelpaulkoch
Contributor Author

This is great, thanks!

@jonatanklosko
Member

jonatanklosko commented Feb 21, 2024

> so I'm wondering whether it would make sense to introduce a block_impl parameter to Bumblebee.Layers.Transformer.blocks

I extended :block_type to allow for a function. I was also considering supporting "hooks" between each block layer, but that's more indirect and less flexible, so I went with the custom function. The function receives the various layers as anonymous functions and can apply them in any order and with any custom layers on top. This feels a bit coupled, but fwiw it's all internal anyway.
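
Purely as an illustration of that pattern (the exact arguments the :block_type function receives are defined in Bumblebee.Layers.Transformer and are not reproduced here, and the `steps` field names below are hypothetical), a custom block function might look roughly like this, with identity Axon.nx layers standing in for DINOv2's layer-scale layers:

```elixir
custom_block = fn hidden_state, steps ->
  hidden_state
  |> steps.self_attention.()
  # stand-in for DINOv2's first layer scale
  |> Axon.nx(&Nx.multiply(&1, 1.0))
  |> steps.ffn.()
  # stand-in for DINOv2's second layer scale
  |> Axon.nx(&Nx.multiply(&1, 1.0))
end
```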

> In transformers there is another option to specify out_indices instead, which I did not implement.

I actually changed it so that we only use the indices. The stage names are not really much more meaningful than indices, since they are always "stem", "stage1", "stage2", ..., "stageN". By using the indices we don't need to worry about generating/loading the stage names, so I think it's a win.

I also removed the option to reshape feature maps, in favour of always doing it. Reshaping to the flat shape is easy to do if the caller wants that (which would be a wrapping model, rather than the end user anyway).


This PR is ready to ship; I will wait for elixir-nx/axon#555 and then merge :) Thanks @joelpaulkoch :)

@joelpaulkoch
Contributor Author

Yeah, sounds reasonable. Thank you very much for getting this into proper shape.

jonatanklosko merged commit f739e0a into elixir-nx:main on Feb 21, 2024
2 checks passed