107 - Multi-Modal Transformers, with Hao Tan and Mohit Bansal

38 minutes
**The podcast is currently on hiatus. For more ac…

Description

5 years ago
In this episode, we invite Hao Tan and Mohit Bansal to talk about
multi-modal training of transformers, focusing in particular on
their EMNLP 2019 paper that introduced LXMERT, a vision+language
transformer. We spend the first third of the episode talking about
why you might want to have multi-modal representations. We then
move to the specifics of LXMERT, including the model structure, the
losses that are used to encourage cross-modal representations, and
the data that is used. Along the way, we mention latent alignments
between images and captions and the granularity of captions, and
machine translation even comes up a few times. We conclude with
some speculation on the future of multi-modal representations.

Hao's website: http://www.cs.unc.edu/~airsplay/
Mohit's website: http://www.cs.unc.edu/~mbansal/
LXMERT paper: https://www.aclweb.org/anthology/D19-1514/
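
To give a rough feel for the cross-modality encoder idea discussed in the episode, here is a minimal PyTorch sketch. It is not the authors' implementation; the `CrossModalLayer` name, dimensions, and layer layout are illustrative assumptions. The idea it shows is that the language stream attends over visual region features and vice versa, before each stream refines itself with self-attention.

```python
# A minimal sketch (not LXMERT's actual code) of a cross-modality layer:
# each stream cross-attends to the other modality, then applies self-attention.
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_norm = nn.LayerNorm(dim)
        self.vis_norm = nn.LayerNorm(dim)

    def forward(self, lang, vis):
        # Cross-attention: language queries attend over visual keys/values, and vice versa.
        lang_x, _ = self.lang_to_vis(lang, vis, vis)
        vis_x, _ = self.vis_to_lang(vis, lang, lang)
        lang, vis = self.lang_norm(lang + lang_x), self.vis_norm(vis + vis_x)
        # Self-attention within each modality (feed-forward blocks omitted for brevity).
        lang = lang + self.lang_self(lang, lang, lang)[0]
        vis = vis + self.vis_self(vis, vis, vis)[0]
        return lang, vis

# Toy usage: 20 word-piece embeddings and 36 image-region features, both projected to 768-d.
lang = torch.randn(1, 20, 768)   # e.g. caption token embeddings
vis = torch.randn(1, 36, 768)    # e.g. detector region features after projection
lang, vis = CrossModalLayer()(lang, vis)
print(lang.shape, vis.shape)     # torch.Size([1, 20, 768]) torch.Size([1, 36, 768])
```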
