Apple researchers reveal new AI breakthrough for training LLMs on images and text

In a new paper published this month, Apple researchers say they have developed new methods for training large language models using both text and visual information. According to the researchers, this approach is a way to obtain state-of-the-art results.

As first noted by VentureBeat, the idea behind the study is to demonstrate “how carefully combining different types of training data and model architectures can lead to state-of-the-art performance on a range of AI benchmarks.”

The paper was published last week and is titled “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.” Apple researchers explain in the abstract of the paper:

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons.

For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results.

MM1 is described as a “family of multimodal models” that are state-of-the-art and have “appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.”

The in-context learning capabilities of the MM1 model are particularly impressive:

MM1 can perform in-context predictions thanks to its large-scale multimodal pre-training. This allows MM1 to (a) count objects and follow custom formatting, (b) refer to parts of the images and perform OCR, (c) demonstrate common-sense and word knowledge about everyday objects, and (d) perform basic math functions. Images are from the COCO 2014 validation set.

The researchers conclude that this family of models “produces competitive performance on a wide range of benchmarks, while enabling multi-image reasoning and few-shot prompting.”

Read more:

  • Apple AI work continues: editing photos using text commands
  • Apple Keyframer generates AI animation from a still image and a text prompt
  • New AI features in iOS 18: everything we know so far
