This is equivalent to the original VAE objective in the unimodal case, such that the mixture posterior collapses to a single encoder's posterior and the bound reduces to the standard ELBO:
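$$
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big] - D_{\mathrm{KL}}\big(q_{\phi}(z \mid x) \,\|\, p(z)\big)
$$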
Various values for `decoder_scale` (0.05, 0.2, 0.35, 0.5, 0.75, 0.9) were tested to find the most effective setting. Setting `decoder_scale` to 0.75 gave a good balance between generation quality and reconstruction quality. Some of the generated images (from the SVHN modality) and reconstructed images (from the MNIST modality) for 0.2, 0.75, and 0.9 are shown below:
*(Generated SVHN samples and MNIST reconstructions for `decoder_scale` = 0.2, 0.75, and 0.9.)*
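For context, `decoder_scale` is presumably the fixed standard deviation of the decoder likelihood; a minimal sketch of how it would enter the reconstruction term (names here are assumptions, not the project's actual code):

```python
import torch
from torch.distributions import Normal

def reconstruction_term(decoder_mu: torch.Tensor,
                        x: torch.Tensor,
                        decoder_scale: float = 0.75) -> torch.Tensor:
    """Gaussian log-likelihood with a fixed scale (std). A larger
    decoder_scale flattens the likelihood, trading reconstruction
    fidelity for smoother generations."""
    px_z = Normal(loc=decoder_mu, scale=decoder_scale)
    return px_z.log_prob(x).flatten(1).sum(-1)  # sum over pixels per sample
```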
Looking at `batch_size` samples drawn from the posterior, lower values of `decoder_scale` produce a clear dissimilarity between p(z) and q(z|x), which explains the poor generation at those values:
*(Samples from p(z) versus q(z|x) for `decoder_scale` = 0.2 and 0.75.)*
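A rough sketch of how such a comparison can be produced (names and shapes are assumptions; both sample sets would then be projected, e.g. with PCA, and overlaid):

```python
import torch
from torch.distributions import Normal

def prior_posterior_samples(qz_mu, qz_std, batch_size=256, latent_dim=20):
    """Draw batch_size samples from the prior p(z) and one sample per
    datapoint from the encoder posterior q(z|x) for visual comparison."""
    pz = Normal(torch.zeros(latent_dim), torch.ones(latent_dim))
    z_prior = pz.sample((batch_size,))       # samples from p(z)
    z_post = Normal(qz_mu, qz_std).sample()  # samples from q(z|x)
    return z_prior, z_post
```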
Using `elbo` as the objective, a batch size of 256, a latent dimension of 20, and `decoder_scale` set to 0.75, we tested the following architectures:
Fully Connected | Convolutional |
---|---|
in->512->20, 20->512->out | in->4x4(16)->4x4(32)->4x4(64)->3x3(20, 20)->4x4(64)->3x3(32)->2x2(16)->4x4 out |
in->1024->512->20, 20->1024->512->out | in->4x4(32)->4x4(64)->4x4(128)->3x3(20, 20)->4x4(128)->3x3(64)->2x2(32)->4x4 out |
in->1024->512->128->20, 20->1024->512->128->out | in->4x4(64)->4x4(128)->4x4(256)->3x3(20, 20)->4x4(256)->3x3(128)->2x2(64)->4x4 out |
The best model for MNIST was a fully connected model (input -> 512 -> latent -> 512 -> output), and the best model for SVHN is shown below.
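To make the table's notation concrete, here is a hedged PyTorch sketch of the mid-sized convolutional encoder row (kernel sizes and channel counts from the table; strides and padding are assumptions chosen so a 3x32x32 SVHN input maps to a 20-dimensional latent):

```python
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Sketch of in->4x4(32)->4x4(64)->4x4(128)->3x3(20, 20)."""
    def __init__(self, latent_dim=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32x32 -> 16x16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 8x8
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 8x8 -> 4x4
        )
        self.mu = nn.Conv2d(128, latent_dim, 3, stride=2)      # 4x4 -> 1x1
        self.logvar = nn.Conv2d(128, latent_dim, 3, stride=2)  # 4x4 -> 1x1

    def forward(self, x):
        h = self.features(x)
        return self.mu(h).flatten(1), self.logvar(h).flatten(1)
```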
We use the metrics proposed in the original paper, which are shown below:
Note that the synergy metric was proposed by [Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models](https://arxiv.org/abs/2007.01179) and is not used here.

Based on Appendix B of the original MMVAE paper, the basic MoE variational posterior for two modalities is:
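$$
q_{\Phi}(z \mid x_1, x_2) = \frac{1}{2}\, q_{\phi_1}(z \mid x_1) + \frac{1}{2}\, q_{\phi_2}(z \mid x_2)
$$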
Expanding the expectation over the mixture:
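$$
\mathbb{E}_{q_{\Phi}(z \mid x_1, x_2)}\!\left[\log \frac{p_{\Theta}(z, x_1, x_2)}{q_{\Phi}(z \mid x_1, x_2)}\right]
= \frac{1}{2} \sum_{m=1}^{2} \mathbb{E}_{q_{\phi_m}(z \mid x_m)}\!\left[\log \frac{p_{\Theta}(z, x_1, x_2)}{q_{\Phi}(z \mid x_1, x_2)}\right]
$$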
Using Equation 2 and Equation 3 from the original VAE paper, we have, for each component m:
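$$
\mathbb{E}_{q_{\phi_m}(z \mid x_m)}\Big[\sum_{n=1}^{2} \log p_{\theta_n}(x_n \mid z)\Big] - D_{\mathrm{KL}}\big(q_{\phi_m}(z \mid x_m) \,\|\, p(z)\big)
$$

(here $\log q_{\Phi}$ is replaced by $\log q_{\phi_m}$ inside each component's term, which is the naive simplification that gives `elbo_naive` its name).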
So the objective is:
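$$
\mathcal{L}_{\text{naive}}(x_1, x_2) = \frac{1}{2} \sum_{m=1}^{2} \left( \mathbb{E}_{q_{\phi_m}(z \mid x_m)}\Big[\sum_{n=1}^{2} \log p_{\theta_n}(x_n \mid z)\Big] - D_{\mathrm{KL}}\big(q_{\phi_m}(z \mid x_m) \,\|\, p(z)\big) \right)
$$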
When analyzing the latent space of the MMVAE trained with the naive ELBO, the MNIST and SVHN representations are separated, making cross-modal generation difficult:
This has a peculiar effect: the generated MNIST images look fine, which is supported by the 93% MNIST latent accuracy, while the generated SVHN images are roughly half decent and half noisy.
Cross-generation from SVHN to MNIST is accurate. However, the MNIST-to-SVHN transfer does not produce realistic images, as the SVHN decoder is not adequately trained. The encoder handles SVHN well, achieving 73% latent accuracy, but the decoder produces unrealistic images, reflected in the low 14% MNIST-to-SVHN cross coherence.

We tried two different solutions to address this issue. First, we decreased the latent dimension from 20 to 10, which restricted the model's ability to separate the modalities. Second, we attempted to improve the SVHN decoder by sampling K times from the SVHN posterior in each pass. Neither solution proved effective.
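For the second attempt, a hypothetical sketch of averaging the SVHN reconstruction term over K posterior samples (names are assumptions, not the project's actual code):

```python
import torch
from torch.distributions import Normal

def k_sample_recon(qz_mu, qz_std, decode, x_svhn, K=5, decoder_scale=0.75):
    """Average the SVHN reconstruction term over K posterior samples,
    giving the decoder more training signal per pass."""
    z = Normal(qz_mu, qz_std).rsample((K,))        # (K, batch, latent_dim)
    px_mu = decode(z)                              # assumed decoder network
    log_px = Normal(px_mu, decoder_scale).log_prob(x_svhn)
    return log_px.mean(0).flatten(1).sum(-1)       # avg over K, sum over pixels
```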
Our proposed setup involves the mutual training of two VAEs that focus on individual modalities and an MMVAE that captures information present in both modalities. The primary objective is to encourage the unimodal VAE of modality A to capture information primarily in modality A and the unimodal VAE of modality B to capture information primarily in modality B. At the same time, the MMVAE is trained to capture information that is present in both modalities.
This factorization of representations falls under the "Representation Fission" sub-challenge presented in *Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions*: "Representation Fission: Learning a new set of representations that reflects multimodal internal structure such as data factorization or clustering."
To achieve this, we add an InfoNCE term and some additional terms to the `elbo_naive` loss (the `m_infoNCE_naive()` function):
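As a rough illustration of the contrastive part (not the exact `m_infoNCE_naive()` implementation), an InfoNCE loss over paired latent codes from the two modalities might look like:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    """Symmetric InfoNCE: (z_a[i], z_b[i]) are positive pairs (same datapoint,
    different modality); all other pairs in the batch act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature               # (B, B) cosine similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```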
To evaluate the proposed method, we use the same metrics as the original MMVAE paper:
Metric | `elbo_naive` | `elbo_naive` + infoNCE (disentangled) | `elbo_naive` + infoNCE (uni) |
---|---|---|---|
Cross Coherence (SVHN -> MNIST) | 77.35% | 76.70% | - |
Cross Coherence (MNIST -> SVHN) | 14.50% | 12.22% | - |
Joint Coherence | 36.63% | 31.90% | - |
Latent Accuracy (SVHN) | 73.08% | 29.02% | 13.87% |
Latent Accuracy (MNIST) | 93.13% | 93.24% | 93.20% |
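For reference, latent accuracy fits a simple classifier on latent codes and reports digit-label accuracy; a minimal sketch (the classifier choice here is an assumption):

```python
from sklearn.linear_model import LogisticRegression

def latent_accuracy(z_train, y_train, z_test, y_test):
    """Fit a simple classifier on latent codes and report
    digit-label accuracy on held-out codes."""
    clf = LogisticRegression(max_iter=1000).fit(z_train, y_train)
    return clf.score(z_test, y_test)
```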
The following arguments were used to produce these results:
- `elbo_naive`:

  ```
  python main.py --experiment comparasion_final --model mnist_svhn --obj elbo_naive --batch-size 128 --epochs 10 --fBase 32 --max_d 10000 --dm 30
  ```

- `m_infoNCE_naive`:

  ```
  python main.py --experiment comparasion_final --model fummvae --obj infoNCE_naive --batch-size 128 --epochs 10 --fBase 32 --max_d 10000 --dm 30
  ```