We will discuss the present and future of multimodal architectures built on language models, as applied to describing videos and answering questions about them.
Company: AIRI