Main Remark
Currently in the TabNet architecture, part of the output of the Feature Transformer is used for the predictions (n_d) and the rest (n_a) is used as input for the next Attentive Transformer.
But I see a flaw in this design: the Feature Transformer (let's call it FT_i) sees masked input from the previous Attentive Transformer (AT_{i-1}), so the input features of FT_i don't contain all the initial information. How can this help select other useful features for the next step?
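For reference, here is a simplified sketch of the current per-step dataflow. It is purely illustrative (the names `feat_transformers`, `att_transformers`, `x_bn` and the signatures are assumptions, not the actual pytorch_tabnet code), but it shows where the n_d/n_a split happens and why FT_i only ever sees masked input:

```python
import torch

def tabnet_forward(x_bn, feat_transformers, att_transformers, n_d, n_steps, gamma=1.3):
    # x_bn: batch-normalized input of shape (batch, n_features)
    prior = torch.ones_like(x_bn)
    a = feat_transformers[0](x_bn)[:, n_d:]       # FT_0 sees the full input
    outputs = []
    for i in range(1, n_steps + 1):
        mask = att_transformers[i](a, prior)      # AT_i only sees the n_a slice from FT_{i-1}
        prior = prior * (gamma - mask)            # relaxation prior from the paper
        out = feat_transformers[i](mask * x_bn)   # FT_i sees masked input only
        outputs.append(torch.relu(out[:, :n_d]))  # n_d slice -> prediction branch
        a = out[:, n_d:]                          # n_a slice -> next AT, computed from a masked view
    return sum(outputs)
```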
Proposed Solution
I think the Attentive Transformer should take the raw features as input to select the next step's features; using the previous mask as a prior to avoid always selecting the same features at each step would still work.
So an easy way to try this idea would be to use the feature transformer only for predictions. The attentive transformer could be preceded by its own feature transformer if necessary, but the inputs of the attentive block would be the initial data + the prior from the previous masks.
This could potentially improve the attentive transformer part.
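Concretely, the change could look something like the sketch below (again illustrative only, using the same hypothetical callables as the sketch above): the attentive transformer reads the raw batch-normalized features plus the accumulated prior, and the feature transformer only feeds the prediction branch.

```python
import torch

def tabnet_forward_proposed(x_bn, feat_transformers, att_transformers, n_d, n_steps, gamma=1.3):
    prior = torch.ones_like(x_bn)
    outputs = []
    for i in range(n_steps):
        mask = att_transformers[i](x_bn, prior)   # full input + prior of the previous masks
        prior = prior * (gamma - mask)            # previous masks still constrain later steps
        out = feat_transformers[i](mask * x_bn)   # FT_i is now only used for predictions
        outputs.append(torch.relu(out[:, :n_d]))
    return sum(outputs)
```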
If you find this interesting, don't hesitate to share your ideas in the comment section or open a PR to propose a solution!
@Optimox Hello, could you please clarify the idea a bit more? Do you mean the input of the attentive transformer will be the initial data + the previous mask, which will replace the priors of the previous step? Thanks
The attentive transformer from step 1 takes a vector of size n_a as input, which has been computed by the initial feature transformer (number 0). Up to this point I'm totally fine with the idea of masking certain features from this.
The attentive transformer 2, however, gets as input the n_a output of feature transformer 1, but feature transformer 1 has never seen the full data because its input was masked by attentive transformer 1. And here I think there might be something wrong: how can you choose which features to use if you have only seen part of them?
Obviously this would be a real problem if the mask did not change at the instance level; here the mask can adapt to each instance. However, I feel it would be interesting to try to create the mask from the original data and not from the previous attentive transformer.
This would somehow lower the 'sequential' attention of TabNet but I think that keeping the previous mask as a prior for the update of the next mask could mitigate this.
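For reference, the attentive transformer itself is quite small, roughly sparsemax(prior * BN(FC(input))) as described in the paper, so swapping its input from the n_a vector to the raw features would only change what the FC layer sees while keeping the prior machinery intact. A minimal illustrative sketch (not the actual pytorch_tabnet module; softmax stands in for sparsemax to keep it dependency-free):

```python
import torch
from torch import nn

class AttentiveTransformerSketch(nn.Module):
    """Illustrative only: mask = normalize(prior * BN(FC(features)))."""
    def __init__(self, input_dim, n_features):
        super().__init__()
        self.fc = nn.Linear(input_dim, n_features, bias=False)
        self.bn = nn.BatchNorm1d(n_features)

    def forward(self, features, prior):
        # `features` is the n_a vector today, or the raw input under the proposal
        x = self.bn(self.fc(features))
        # the paper uses sparsemax here; softmax keeps this sketch self-contained
        return torch.softmax(prior * x, dim=-1)
```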
Actually I think this would be quite easy to implement and try, but I'm not sure which dataset I should benchmark on to see whether there is a real improvement.
Hope this is clearer; let me know otherwise. If you run some experiments, I would be interested to know about the results.