Main Remark
Currently in the TabNet architecture, part of the output of the Feature Transformer is used for the predictions (n_d) and the rest (n_a) is used as input for the next Attentive Transformer.
But I see a flaw in this design: the Feature Transformer (let's call it FT_i) sees masked input from the previous Attentive Transformer (AT_{i-1}), so the input features of FT_i don't contain all the initial information. How can this help select other useful features for the next step?
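For reference, here is a simplified sketch of the current per-step dataflow. It is purely illustrative (the names `feat_transformers`, `att_transformers`, `x_bn` and the signatures are assumptions, not the actual pytorch_tabnet code), but it shows where the n_d/n_a split happens and why FT_i only ever sees masked input:

```python
import torch

def tabnet_forward(x_bn, feat_transformers, att_transformers, n_d, n_steps, gamma=1.3):
    # x_bn: batch-normalized input of shape (batch, n_features)
    prior = torch.ones_like(x_bn)
    a = feat_transformers[0](x_bn)[:, n_d:]       # FT_0 sees the full input
    outputs = []
    for i in range(1, n_steps + 1):
        mask = att_transformers[i](a, prior)      # AT_i only sees the n_a slice from FT_{i-1}
        prior = prior * (gamma - mask)            # relaxation prior from the paper
        out = feat_transformers[i](mask * x_bn)   # FT_i sees masked input only
        outputs.append(torch.relu(out[:, :n_d]))  # n_d slice -> prediction branch
        a = out[:, n_d:]                          # n_a slice -> next AT, computed from a masked view
    return sum(outputs)
```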
Proposed Solution
I think the Attentive Transformer should take the raw features as input to select the next step's features; using the previous mask as a prior to avoid always selecting the same features at each step would still work.
So an easy way to try this idea would be to use the feature transformer only for predictions. The attentive transformer could be preceded by its own feature transformer if necessary, but the inputs of the attentive block would be the initial data + the prior from the previous masks.
This could potentially improve the attentive transformer part.
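Concretely, the change could look something like the sketch below (again illustrative only, using the same hypothetical callables as the sketch above): the attentive transformer reads the raw batch-normalized features plus the accumulated prior, and the feature transformer only feeds the prediction branch.

```python
import torch

def tabnet_forward_proposed(x_bn, feat_transformers, att_transformers, n_d, n_steps, gamma=1.3):
    prior = torch.ones_like(x_bn)
    outputs = []
    for i in range(n_steps):
        mask = att_transformers[i](x_bn, prior)   # full input + prior of the previous masks
        prior = prior * (gamma - mask)            # previous masks still constrain later steps
        out = feat_transformers[i](mask * x_bn)   # FT_i is now only used for predictions
        outputs.append(torch.relu(out[:, :n_d]))
    return sum(outputs)
```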
If you find this interesting, don't hesitate to share your ideas in the comment section or open a PR to propose a solution!
@Optimox Hello, could you please clarify the idea a bit more? Do you mean the input of the attentive transformer will be the initial data + the previous mask, which will replace the priors of the previous step? Thanks
The attentive transformer from step 1 takes a vector of size n_a as input, which has been computed by the initial feature transformer (number 0). Up to this point I'm totally fine with the idea of masking certain features from this.
The attentive transformer 2, however, gets as input the n_a output of feature transformer 1, but feature transformer 1 has never seen the full data because its input was masked by attentive transformer 1. And here I think there might be something wrong: how can you choose which features to use if you have only seen part of them?
Obviously this would be a real problem if the mask did not change at the instance level; here the mask can adapt to each instance. However, I feel it would be interesting to try to create the mask from the original data and not from the previous attentive transformer.
This would somehow lower the 'sequential' attention of TabNet but I think that keeping the previous mask as a prior for the update of the next mask could mitigate this.
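For reference, the attentive transformer itself is quite small, roughly sparsemax(prior * BN(FC(input))) as described in the paper, so swapping its input from the n_a vector to the raw features would only change what the FC layer sees while keeping the prior machinery intact. A minimal illustrative sketch (not the actual pytorch_tabnet module; softmax stands in for sparsemax to keep it dependency-free):

```python
import torch
from torch import nn

class AttentiveTransformerSketch(nn.Module):
    """Illustrative only: mask = normalize(prior * BN(FC(features)))."""
    def __init__(self, input_dim, n_features):
        super().__init__()
        self.fc = nn.Linear(input_dim, n_features, bias=False)
        self.bn = nn.BatchNorm1d(n_features)

    def forward(self, features, prior):
        # `features` is the n_a vector today, or the raw input under the proposal
        x = self.bn(self.fc(features))
        # the paper uses sparsemax here; softmax keeps this sketch self-contained
        return torch.softmax(prior * x, dim=-1)
```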
Actually I think this would be quite easy to implement and try, but I'm not sure which dataset I should benchmark on to see whether there is a real improvement.
Hope this is clearer; let me know otherwise. If you run some experiments, I would be interested to know about the results.