Add DINOv2 model #334
Conversation
@joelpaulkoch it's the interpolation that causes the difference; I need to investigate more to figure out specifically why.
FTR, the primary part of the difference in the interpolation was a bug (elixir-nx/axon#554). There are still at least two other differences in the implementation (anti-aliasing and scaling); I will keep looking into this.
Regarding interpolation, I added anti-aliasing to Axon (elixir-nx/axon#555), which makes the numbers closer to PyTorch. They are still far from an exact match, because the anti-aliasing behaves slightly differently and there is a small extra scaling in the interpolation call. That said, I think we are fine at this point. The model is trained for a specific size and the interpolation is only used at inference time with a different input size, so my understanding is that as long as the interpolation does its job we should expect reasonable output. In hf/transformers the ViT interpolation is implemented in both PyTorch and TensorFlow, and the two are definitely not identical (our interpolation should basically match TensorFlow, though).
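For reference, a minimal sketch of exercising this behaviour, assuming `Axon.resize/3` exposes an `:antialias` option (as added in elixir-nx/axon#555); the exact signature and option names may differ between Axon versions:

```elixir
# Minimal sketch: bilinear resize with and without anti-aliasing.
input = Axon.input("image", shape: {nil, 16, 16, 3})
image = Nx.iota({1, 16, 16, 3}, type: :f32)

run = fn antialias? ->
  input
  |> Axon.resize({8, 8}, method: :bilinear, antialias: antialias?)
  |> Axon.predict(%{}, %{"image" => image})
end

# Anti-aliasing mostly matters when downsampling, hence the 16x16 -> 8x8 resize.
Nx.subtract(run.(true), run.(false)) |> Nx.abs() |> Nx.reduce_max() |> IO.inspect()
```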
I changed the input height/width to unspecified (`nil`).
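As an illustration (not the PR's actual code), an input with unspecified spatial dimensions looks like the following, assuming Bumblebee's channels-last `"pixel_values"` convention:

```elixir
# Spatial dimensions left as nil, so the resolution is only fixed at inference time.
input = Axon.input("pixel_values", shape: {nil, nil, nil, 3})
```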
This is great, thanks!
I extended
I actually changed it so that we only use the indices; the stage names are not really much more meaningful than the indices.

I also removed the option to reshape feature maps, in favour of always doing it. Reshaping to the flat shape is easy to do if the caller wants that (and it would be a wrapping model rather than the end user doing it anyway).

This PR is ready to ship, I will wait for elixir-nx/axon#555 and then merge :)

Thanks @joelpaulkoch :)
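For illustration only, selecting backbone feature maps by index could look roughly like the snippet below; the option name `:backbone_output_indices` is a hypothetical stand-in, not the option actually introduced in this PR:

```elixir
# Hypothetical sketch; :backbone_output_indices is an assumed option name.
{:ok, spec} = Bumblebee.load_spec({:hf, "facebook/dinov2-base"}, architecture: :backbone)
spec = Bumblebee.configure(spec, backbone_output_indices: [3, 7, 11])
{:ok, model_info} = Bumblebee.load_model({:hf, "facebook/dinov2-base"}, spec: spec)
```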
Yeah, sounds reasonable. Thank you very much for getting this into proper shape.
This is the current state of my work on DINOv2.

`facebook/dinov2-base` uses `BitImageProcessor`, so I've copied `VitFeaturizer` -> `BitFeaturizer` and made the following changes:

- `rescale_factor` (and remove `NxImage.to_continuous` in `process_batch`); a rough sketch of the rescaling follows below
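As an illustration of that change (not the PR's code; the 1/255 factor is `BitImageProcessor`'s usual default), rescaling amounts to a plain multiplication:

```elixir
# Illustrative only: rescale u8 pixel values via a configurable factor,
# instead of the fixed NxImage.to_continuous/1 conversion.
rescale_factor = 1 / 255

pixel_values =
  Nx.iota({1, 224, 224, 3}, type: :u8)
  |> Nx.as_type(:f32)
  |> Nx.multiply(rescale_factor)
```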
For the model itself I've copied `Vit` to `DinoV2` and basically changed three blocks:

**interpolation of positional encodings**
The pre-trained positional encodings must be interpolated to apply them to other image resolutions (`interpolate_position_encoding`).

For the interpolation we need the actual input size. I've hard-coded the input size to 224 and retrieved it using `Axon.get_inputs`. Is there a better way to do it?

The current implementation of the interpolation is not exactly the same as in the transformers library, so this probably introduces a difference in the calculation. Does `Axon.resize` return exactly the same result for the same input as `torch.nn.functional.interpolate`?
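For context, the general idea of interpolating patch position embeddings can be sketched as below. This is not the PR's `interpolate_position_encoding` implementation; `src_grid`, `dst_grid`, and `hidden_size` are illustrative names, and the exact `Axon.resize` options may differ:

```elixir
# Illustrative sketch: reshape the pre-trained patch position embeddings to a
# 2D grid, resize the grid bicubically, and flatten back, keeping the class
# token embedding untouched.
interpolate = fn position_embeddings, src_grid, dst_grid, hidden_size ->
  class_embedding =
    Axon.nx(position_embeddings, &Nx.slice_along_axis(&1, 0, 1, axis: 1))

  patch_embeddings =
    position_embeddings
    |> Axon.nx(&Nx.slice_along_axis(&1, 1, src_grid * src_grid, axis: 1))
    |> Axon.nx(&Nx.reshape(&1, {1, src_grid, src_grid, hidden_size}))
    |> Axon.resize({dst_grid, dst_grid}, method: :bicubic)
    |> Axon.nx(&Nx.reshape(&1, {1, dst_grid * dst_grid, hidden_size}))

  Axon.concatenate([class_embedding, patch_embeddings], axis: 1)
end
```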
**Encoding blocks**
In DINOv2 the ffn is either mlp or swiglu, depending on the configuration. I could pass the corresponding function to the blocks (a rough sketch of the swiglu variant follows below).

I still copied `blocks`, `block`, `block_impl` from `Bumblebee.Layers.Transformer`, because I needed two additional scaling layers in `block_impl`. This brings quite a lot of duplicated code into the DINOv2 implementation for a small change, so I'm wondering whether it would make sense to introduce a `block_impl` parameter to `Bumblebee.Layers.Transformer.blocks`.
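A minimal sketch of what such a swiglu-style ffn function could look like in Axon; this is one common SwiGLU formulation with made-up layer names, and the actual DINOv2 swiglu may be parameterized differently:

```elixir
defmodule DinoV2FfnSketch do
  # Illustrative SwiGLU-style feed-forward block: silu(gate) * up, then project
  # back to the hidden size.
  def swiglu_ffn(hidden_state, hidden_size, intermediate_size, opts \\ []) do
    name = opts[:name] || "ffn"

    gate = Axon.dense(hidden_state, intermediate_size, name: "#{name}.gate")
    up = Axon.dense(hidden_state, intermediate_size, name: "#{name}.up")

    Axon.activation(gate, :silu)
    |> Axon.multiply(up)
    |> Axon.dense(hidden_size, name: "#{name}.output")
  end
end
```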
**map encoder output depending on architecture**
For `:base`, in comparison to `Vit` the pooled output is simply the first token on axis 1.

For `:backbone`, you can pass a list of `stage_names` and specify which ones you want as `output_features` in the configuration. The corresponding hidden states are then included in the output. In `transformers` there is another option, `out_indices`, to specify indices instead, which I did not implement.

For the `:for_image_classification` architecture, there is a head on top of the class token and the mean of the patch embeddings (sketched roughly below).
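A rough sketch of that head, with illustrative names (`hidden_state`, `num_patches`, `num_labels` are assumptions, not the PR's code):

```elixir
# Illustrative sketch: classification head over [class token ++ mean of patch tokens].
# hidden_state: Axon node of shape {batch, 1 + num_patches, hidden_size}
build_head = fn hidden_state, num_patches, num_labels ->
  class_token =
    Axon.nx(hidden_state, fn x ->
      x |> Nx.slice_along_axis(0, 1, axis: 1) |> Nx.squeeze(axes: [1])
    end)

  patch_mean =
    Axon.nx(hidden_state, fn x ->
      x |> Nx.slice_along_axis(1, num_patches, axis: 1) |> Nx.mean(axes: [1])
    end)

  Axon.concatenate([class_token, patch_mean], axis: 1)
  |> Axon.dense(num_labels, name: "image_classification_head.output")
end
```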
I've tried to follow the naming conventions, but I could have missed them in some places.

Parts of the documentation are still just copy/paste from `transformers`.

At the moment the tests are configured to run the model from `facebook/dinov2-base` and won't pass.
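For anyone trying this branch out, a usage sketch following the usual Bumblebee entry points (assuming DINOv2 loading works end to end, which, as noted above, the tests don't yet confirm):

```elixir
# Usage sketch with the standard Bumblebee API.
{:ok, model_info} = Bumblebee.load_model({:hf, "facebook/dinov2-base"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "facebook/dinov2-base"})

# Any {height, width, 3} u8 image tensor works as input to the featurizer.
image = Nx.broadcast(Nx.tensor(128, type: :u8), {224, 224, 3})

inputs = Bumblebee.apply_featurizer(featurizer, image)
outputs = Axon.predict(model_info.model, model_info.params, inputs)
```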