Siri recently started attempting to describe images received in Messages when using CarPlay or the Announce Notifications feature. In typical Siri fashion, the feature works inconsistently and produces mixed results.
Still, Apple is pressing ahead with its AI promises. In a recently published research paper, Apple's AI gurus describe a system in which Siri can do much more than try to recognize what's in an image. The best part? The company believes one of its models benchmarks better than GPT-4 at this task.
In the paper (ReALM: Reference Resolution as Language Modeling), Apple describes something that could make a voice assistant enhanced with a large language model genuinely more useful. ReALM takes into account both what's on your screen and what tasks are active. Here is an excerpt from the paper describing the job:
1. On-screen entities: These are entities that are currently displayed on the user's screen.
2. Conversational entities: These are entities relevant to the conversation. They can come from a previous user turn (for example, when the user says "Call Mom", the contact for Mom is the relevant entity in question) or from the virtual assistant (for example, when the agent provides the user with a list of places or alarms to choose from).
3. Background entities: These are relevant entities that come from background processes and might not necessarily be a direct part of what the user sees on the screen or of their interaction with the virtual agent; for example, an alarm that starts ringing or music playing in the background.
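To make that taxonomy a bit more concrete, here is a minimal sketch (in Python) of how the three categories of candidate entities might be collected and flattened into a single text prompt for a reference-resolution model. The Entity class, the category labels, and the build_prompt helper are illustrative assumptions, not Apple's actual implementation; ReALM's real encoding of on-screen content is considerably more involved.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class EntitySource(Enum):
    # The three entity categories described in the ReALM paper.
    ON_SCREEN = "on-screen"            # currently visible on the user's screen
    CONVERSATIONAL = "conversational"  # from previous turns or assistant answers
    BACKGROUND = "background"          # background processes (alarm, music, ...)


@dataclass
class Entity:
    """Hypothetical candidate entity the assistant could resolve a reference to."""
    source: EntitySource
    kind: str   # e.g. "contact", "phone_number", "alarm"
    text: str   # textual rendering of the entity


def build_prompt(user_request: str, candidates: List[Entity]) -> str:
    """Flatten the user request plus all candidate entities into one text prompt.

    This mirrors the general idea of treating reference resolution as a
    language-modeling problem over a textual encoding of the context; the
    exact prompt format here is an assumption for illustration only.
    """
    lines = ["Candidate entities:"]
    for i, e in enumerate(candidates, start=1):
        lines.append(f"{i}. [{e.source.value}] {e.kind}: {e.text}")
    lines.append(f'User request: "{user_request}"')
    lines.append("Which numbered entities does the request refer to?")
    return "\n".join(lines)


if __name__ == "__main__":
    candidates = [
        Entity(EntitySource.ON_SCREEN, "phone_number", "(555) 010-4433 shown on a pizzeria's web page"),
        Entity(EntitySource.CONVERSATIONAL, "contact", "Mom"),
        Entity(EntitySource.BACKGROUND, "alarm", "Alarm ringing, set for 7:00 AM"),
    ]
    print(build_prompt("Call the one at the bottom of the screen", candidates))
```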
If it all works well, this sounds like a recipe for a smarter and more useful Siri. Apple is also confident it can perform such a task with impressive speed. Here is how the benchmarking against OpenAI's GPT-3.5 and GPT-4 is described:
As another baseline, we run the GPT-3.5 (Brown et al., 2020; Ouyang et al., 2022) and GPT-4 (Achiam et al., 2023) variants of ChatGPT, as available on January 24, 2024, with in-context learning. As in our setup, we aim to get both variants to predict a list of entities from a set that is available. In the case of GPT-3.5, which only accepts text, our input consists of the prompt alone; however, in the case of GPT-4, which also has the ability to contextualize on images, we provide the system with a screenshot for the task of on-screen reference resolution, which we find helps substantially improve performance.
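For comparison, here is a minimal sketch of how the two baseline inputs described in that passage might be assembled: a text-only prompt for GPT-3.5, and the same prompt plus a screenshot attachment for GPT-4. The message structure follows OpenAI's chat format for image inputs, but the helper names and exact payloads are assumptions for illustration, not the harness used in the paper.

```python
import base64
from typing import Dict, List


def text_only_request(prompt: str) -> List[Dict]:
    """GPT-3.5-style baseline: the candidate entities and the user request
    are serialized into a single text prompt (e.g. via build_prompt above)."""
    return [{"role": "user", "content": prompt}]


def text_plus_screenshot_request(prompt: str, screenshot_path: str) -> List[Dict]:
    """GPT-4-style baseline for on-screen reference resolution: the same text
    prompt, plus a screenshot encoded as an image attachment. The structure
    mirrors OpenAI's chat format for image inputs; the exact wording and
    file handling here are assumptions for illustration."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]
```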
How does Apple's model perform?
We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.
Substantially outperforming GPT-4, you say? The paper concludes, in part:
We show that ReALM outperforms previous approaches, and performs roughly as well as the state-of-the-art LLM today, GPT-4, despite consisting of far fewer parameters, even for on-screen references, and despite operating purely in the textual domain. It also outperforms GPT-4 for domain-specific user utterances, thus making ReALM an ideal choice for a practical reference resolution system that can exist on-device without compromising on performance.
On-device without compromising on performance appears to be key for Apple. The next few years of platform development should be interesting, starting with iOS 18 and WWDC 2024, which takes place on June 10.