APPLE

Apple is training an artificial intelligence system to understand app screens – this could become the basis for an advanced Siri

Apple's research paper describes how the company developed Ferret-UI, a generative AI system specifically designed to make sense of app screens.

The paper is somewhat vague about potential applications (probably deliberately), but the most exciting possibility would be powering a much more advanced Siri …

The challenges of going beyond ChatGPT

Large language models (LLMs) are the basis of systems such as ChatGPT. Their training material is text, mostly scraped from websites.

MLLMs – multimodal large language models – aim to expand an AI system's ability to understand non-textual information as well: images, video, and audio.

MLLMs currently don't have a good understanding of mobile app screens. There are several reasons for this, starting with a simple one: the aspect ratio of a smartphone screen differs from that of most training images.

In particular, many of the elements they need to recognize, such as icons and buttons, are very small.

Additionally, instead of being able to take in the information in a single pass, as when interpreting a static image, a model needs to be able to interact with the application.

Apple’s Ferret-UI

These are the problems Apple's researchers believe they have solved with an MLLM they call Ferret-UI (the UI standing for user interface).

Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features […]

We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference.
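To make that description a little more concrete, here is a purely hypothetical sketch of what an instruction-formatted sample with a region annotation might look like for one of those elementary tasks. The field names, coordinate convention, and values are our illustrative assumptions, not the paper's published format.

```python
# Hypothetical training sample for an elementary UI task ("find text"),
# formatted as an instruction with a region annotation (bounding box).
# All field names and the normalized-coordinate convention are assumptions.
sample = {
    "image": "screenshots/podcasts_home.png",
    "task": "find_text",
    "instruction": "Where is the text 'New & Noteworthy' on this screen?",
    "response": "The text 'New & Noteworthy' is at <box>.",
    # Bounding box as (x_min, y_min, x_max, y_max), normalized to [0, 1]
    "box": (0.05, 0.18, 0.62, 0.22),
}
```

Grounding answers to regions like this is what lets a model both understand references to parts of the screen and point at them in its own replies.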

The result, they say, outperforms both GPT-4V and other existing UI-focused MLLMs.
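As for the “any resolution” approach quoted above, one plausible reading is that an elongated screenshot is divided into sub-images before encoding, so that small elements like icons and text survive at a usable resolution. The sketch below illustrates that idea under our own assumptions – the function name and the exact splitting rule are ours, not Apple's.

```python
# Illustrative sketch of "any resolution" pre-processing: split an elongated
# UI screenshot into sub-images so small elements (icons, text) keep detail.
# This reflects our reading of the idea, not Apple's implementation.
from PIL import Image

def split_for_encoding(screenshot: Image.Image) -> list[Image.Image]:
    """Return the full screenshot plus aspect-ratio-based sub-images."""
    w, h = screenshot.size
    if h > w:   # portrait: cut horizontally into top and bottom halves
        subs = [screenshot.crop((0, 0, w, h // 2)),
                screenshot.crop((0, h // 2, w, h))]
    else:       # landscape: cut vertically into left and right halves
        subs = [screenshot.crop((0, 0, w // 2, h)),
                screenshot.crop((w // 2, 0, w, h))]
    # The full image preserves the global layout; each sub-image is encoded
    # separately at a higher effective resolution before features are merged.
    return [screenshot] + subs

# Example: images = split_for_encoding(Image.open("podcast_screen.png"))
```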

From user interface development to a highly advanced Siri

The paper describes what the researchers have achieved, not how it might be used. That's typical of research projects, and there may be several reasons for it.

First, the researchers themselves may not know how their work will be used; their focus is on solving a technical problem rather than on any particular application. It may take a product specialist to spot the potential uses.

Second, particularly where Apple is concerned, they may be asked not to disclose the intended application, or may deliberately avoid doing so.

But we could see three potential uses for this feature:

First, it could be a useful tool for assessing the effectiveness of a user interface. A developer can create a rough version of an application and then let Ferret-UI determine how easy or difficult it is to understand and use. This can be both faster and cheaper than human usability testing.

Second, it could power accessibility applications. For example, instead of a simple screen reader reading out everything on the iPhone's screen to a blind user, it could summarize what the screen is showing and list the available options. The user could then tell iOS what they want to do and let the system do it for them.

Apple provides an example of this in which Ferret-UI is shown a podcasts screen. The system's output: “The screen is for a podcast application where users can browse and play new and noteworthy podcasts, with options to play, download, and search for specific podcasts.”

Third – and most interestingly – it could be used to power a far more advanced version of Siri, where the user could give Siri an instruction like “Check tomorrow's flights from JFK to Boston, and book me a seat on one that gets me there by 10am with a total fare of less than $200.” Siri would then interact with the airline's app to complete the task.

Thanks, AK. 9to5Mac composite image: Solen Feyissa on Unsplash and Apple.
