Recently, many large language models (LLMs), both closed and open source, have become available, leading to the creation of combined models known as multimodal LLMs (MLLMs). Yet few, if any, of them disclose the design choices that went into building them, say Apple researchers, who have distilled principles and lessons for designing state-of-the-art (SOTA) multimodal LLMs.
Multimodal large language models are built by combining a large language model and a vision foundation model into a single model. MLLMs, which according to Apple researchers are “emerging as the next frontier in foundational models,” aim to take images and text as input and generate text in a way that outperforms the foundation models they are built on.
The Apple researchers focused on two aspects of the process that leads to the creation of MLLMs: model architecture decisions and pre-training data choices.
On the first front, they found that image resolution, visual encoder loss and capacity, and visual encoder pre-training data were the three most important design aspects. By contrast, architectural decisions about how visual data is fed into the LLM do not appear to affect the resulting model's performance.
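To give a rough sense of why image resolution is such a consequential setting, the sketch below counts the patch tokens a ViT-style visual encoder would produce for a square image. The function name `visual_token_count` and the 14-pixel patch size are illustrative assumptions that do not come from the article, and the actual MM1 configuration may differ.

```python
def visual_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image.

    Assumption: a ViT-style encoder that splits the image into a grid of
    patch_size x patch_size patches and emits one token per patch.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


if __name__ == "__main__":
    # Hypothetical resolutions, chosen only to show the scaling behaviour.
    for size in (224, 336, 448):
        print(size, visual_token_count(size))
```

Under these assumptions, doubling the side length of the image quadruples the number of visual tokens, so higher resolution gives the model more detail but at a noticeably higher compute cost.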
In terms of pre-training data, the researchers analyzed three different types (image-caption, interleaved image-text, and text-only data) and how each affects few-shot, zero-shot, and text-only performance. Zero-shot models are expected to recognize and classify objects or concepts without having seen any examples of them beforehand. In the few-shot setting, the focus is instead on models that can make accurate predictions given only a very small number of labeled examples.
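At evaluation time, the difference between the zero-shot and few-shot settings often comes down to how many worked examples are placed in the prompt ahead of the actual query. The following is a minimal sketch of that idea; the `<image>` placeholder, the `Q:`/`A:` layout, and the `build_prompt` helper are hypothetical and are not the format used by the MM1 authors.

```python
from typing import List, Tuple

IMAGE_TOKEN = "<image>"  # hypothetical placeholder marking an image slot


def build_prompt(question: str, examples: List[Tuple[str, str]]) -> str:
    """Build an interleaved image-text prompt.

    examples: (question, answer) pairs, each paired with its own image slot.
    An empty list gives a zero-shot prompt; k pairs give a k-shot prompt.
    """
    parts = []
    for ex_question, ex_answer in examples:
        parts.append(f"{IMAGE_TOKEN} Q: {ex_question} A: {ex_answer}")
    # The final entry is the actual query; its answer is left for the model.
    parts.append(f"{IMAGE_TOKEN} Q: {question} A:")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("What animal is this?", []))  # zero-shot
    print(build_prompt(
        "What animal is this?",
        [("What animal is this?", "a cat"), ("What animal is this?", "a dog")],
    ))  # 2-shot
```

Because the few-shot prompt interleaves several images with text, it resembles interleaved image-text documents much more closely than a single image-caption pair does.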
The researchers found that interleaved and text-only training data were key to few-shot and text-only performance, while caption data mattered most for zero-shot performance.
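One common way to combine several data types during pre-training is to sample each training example's source according to fixed mixture weights. The sketch below illustrates that idea only; the `sample_sources` helper and the weights in `MIXTURE` are placeholders chosen for illustration and are not the values reported for MM1.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_sources(weights: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw n data-source names in proportion to the given mixture weights."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


# Placeholder weights for illustration only; not the values used for MM1.
MIXTURE = {"caption": 0.4, "interleaved": 0.4, "text_only": 0.2}

if __name__ == "__main__":
    # Empirical counts should roughly follow the mixture weights.
    print(Counter(sample_sources(MIXTURE, 10_000)))
```

Changing the weights shifts how often the model sees each data type, which is the kind of knob the ablations on data composition explore.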
To validate their findings, the researchers built a family of models, called MM1, that outperform current state-of-the-art models, including Emu2, Flamingo, and IDEFICS. Benchmarking was done on captioning, where the model produces a description of an image, and visual question answering, where the model answers questions about an image, which helps in understanding its content.
Thanks to extensive multimodal pre-training […] MM1 has attractive properties such as in-context predictions, multi-image reasoning, and chain-of-thought reasoning. MM1 also shows strong few-shot learning capability after instruction tuning. These strong results demonstrate that the presented recipe for building MLLMs translates the design principles into a competitive model at scale.
As the researchers explain in their paper, to reach these levels of performance with MM1, they explored different image encoders and ways of connecting them to LLMs; different types of data and how to weight them; and how to train the MLLM, including its hyperparameters. Their results include insights into the importance of image resolution, model size, and training data composition, among other factors, which they hope can provide a solid foundation for the community to build stronger models across multiple architectures and data strategies.