The VL Model Behind the Doubao AI Phone

According to public reporting, the model used in the Doubao AI phone is a closed-source version of UI-TARS optimized for phones. UI-TARS itself is built by further training Alibaba’s Qwen2 VL, and its 7B version is currently open-sourced (Qwen2 VL itself is open-sourced at sizes from 2B to 72B). Qwen2 VL already has UI operation capabilities, so rather than dwelling on it, this post focuses on how UI-TARS improves on top of Qwen2 VL, split into two parts: data and training.

Data

The core of the approach is finer-grained data construction: each UI screenshot is annotated in detail from the bottom up, from the smallest button to the overall layout, and captions are even written to describe the state before and after a transition.

| Data Type | Description |
| --- | --- |
| Element Description | Information about a single element, including its type (button, input box, etc., much like component categories on the frontend), visual description (color, appearance, etc.), positional information (relative spatial information such as above/below/left/right), and the element’s function (e.g. deleting an email) |
| Dense Caption | A long passage of detailed text describing the entire interface |
| State Transition Caption | A set of images placed together, with a description of what changed between before and after, and whether an action such as a key press was performed |
| QA | Questions and answers about the UI |
| Set of Mark | Marks added within the UI (for example, boxing off a portion), with QA constructed based on those marks |
UI-TARS data construction illustration: multi-layer annotation from individual element descriptions to overall layout and state transitions
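
To make the layering concrete, here is a minimal sketch of how one annotated screenshot could be represented. The class and field names are hypothetical, chosen only to mirror the data types in the table above; the paper's actual storage format is not described in this post.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ElementDescription:
    # Hypothetical fields mirroring the "Element Description" row.
    element_type: str   # e.g. "button", "input box"
    visual: str         # color, appearance
    position: str       # relative spatial information
    function: str       # what the element does


@dataclass
class ScreenshotAnnotation:
    # One screenshot, annotated from single elements up to the whole layout.
    image_path: str
    elements: List[ElementDescription] = field(default_factory=list)
    dense_caption: str = ""
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class StateTransition:
    # Two screenshots plus a caption describing what changed between them.
    before: ScreenshotAnnotation
    after: ScreenshotAnnotation
    caption: str


inbox = ScreenshotAnnotation(
    image_path="inbox.png",  # illustrative file name
    elements=[
        ElementDescription(
            element_type="button",
            visual="red trash-can icon",
            position="to the right of the subject line",
            function="deletes the email",
        )
    ],
    dense_caption="An email inbox showing a list of messages.",
    qa_pairs=[("Which button deletes the email?", "The red trash-can icon.")],
)
```

Set-of-Mark samples would fit the same shape: the image carries the drawn marks and the QA pairs refer to them.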

The paper mentions that a total of 50 billion tokens of data were constructed to train the 7B and 72B models (the pretraining of Qwen2 VL already used 1.4 trillion tokens of data).

Beyond this data, the later SFT training stage additionally constructed error-correction data pairs (error + correction), telling the agent how to recover after it has already clicked the wrong thing on a UI. This is a major highlight too (constructing these complex, deeply annotated datasets looks like it cost a lot of money…).

Training

The UI-TARS training process can be divided into three steps: pretraining, SFT, and DPO.

Pretraining

All of the data mentioned above is used for pretraining, which essentially means continuing to train Qwen2 VL on this UI-specific data. A rough estimate from ChatGPT for pretraining the 7B and 72B models on 50 billion tokens, converted into H200 compute, comes out to:

  • 7B: ≈ 49.2 – 70.2 H200 GPU-days
  • 72B: ≈ 505.6 – 722.2 H200 GPU-days

That looks manageable: spread over 128 GPUs, even the 72B run would take only about 4 to 6 days.
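
The GPU-day figures above can be sanity-checked with a back-of-the-envelope script. This is only a sketch under stated assumptions, none of which come from the paper: training cost is approximated as 6 × parameters × tokens FLOPs, the H200 is assumed to peak at about 989 TFLOPS for dense BF16, and model FLOPs utilization (MFU) is assumed to lie between 35% and 50%. Under these assumptions the script lands on the same ranges as the estimate quoted above.

```python
SECONDS_PER_DAY = 86_400


def gpu_days(params, tokens, mfu, peak_flops=989e12):
    """Estimate single-GPU training time in days.

    Assumptions (not from the paper): total FLOPs ~= 6 * params * tokens,
    and peak_flops is an assumed dense BF16 peak for one H200.
    """
    total_flops = 6 * params * tokens
    return total_flops / (peak_flops * mfu) / SECONDS_PER_DAY


if __name__ == "__main__":
    tokens = 50e9  # 50 billion training tokens
    for name, params in [("7B", 7e9), ("72B", 72e9)]:
        fast = gpu_days(params, tokens, mfu=0.50)  # optimistic utilization
        slow = gpu_days(params, tokens, mfu=0.35)  # pessimistic utilization
        print(f"{name}: {fast:.1f} - {slow:.1f} H200 GPU-days, "
              f"{fast / 128:.1f} - {slow / 128:.1f} days on 128 GPUs")
```

The estimate ignores communication overhead, data loading, and restarts, so real wall-clock time would likely be somewhat longer.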

SFT

This stage is more fine-grained. It uses not only the high-quality portion of the data above, but also semi-automatically generated trace data and error-correction data, to further strengthen the model’s ability to perform sequential operations.

Trace

For sequential data like traces, native datasets are scarce, so a semi-automatic approach alternates between generating data and iterating on the model. In each iteration, the model runs a batch of newly created tasks, and high-quality traces are filtered out through human annotation, model scoring, and similar methods for the next round of training. The model is thus repeatedly improved on the high-quality data it produces itself.
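
The shape of this loop can be sketched with a toy simulation. Everything here is a hypothetical stand-in: the model is reduced to a single `skill` number, `collect_trace` fakes a rollout, the quality threshold stands in for human annotation and model scoring, and the "fine-tuning" update is invented purely so the loop runs. It illustrates the control flow only, not the paper's actual pipeline.

```python
import random


def collect_trace(skill, task_id, rng):
    """Hypothetical rollout: returns a trace with a quality score in [0, 1]."""
    quality = min(1.0, max(0.0, rng.gauss(skill, 0.2)))
    return {"task": task_id, "quality": quality}


def bootstrap(rounds=3, tasks_per_round=200, threshold=0.6, seed=0):
    rng = random.Random(seed)
    skill = 0.4    # toy proxy for the current model's ability
    dataset = []   # accumulated high-quality traces
    for r in range(rounds):
        # 1. The current model attempts a fresh batch of tasks.
        traces = [collect_trace(skill, t, rng) for t in range(tasks_per_round)]
        # 2. Filtering (stand-in for human annotation and model scoring).
        kept = [t for t in traces if t["quality"] >= threshold]
        dataset.extend(kept)
        # 3. Toy "fine-tuning": move skill toward the quality of kept traces.
        if kept:
            mean_kept = sum(t["quality"] for t in kept) / len(kept)
            skill += 0.5 * (mean_kept - skill)
        print(f"round {r}: kept {len(kept)}/{len(traces)}, skill={skill:.2f}")
    return dataset, skill


if __name__ == "__main__":
    bootstrap()
```

The key design point is that only filtered traces ever enter the training set, so each round trains on data at least as good as the threshold even when the model producing it is still weak.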

Reflection Tuning

Error-correction data is obtained by taking the model’s erroneous traces and re-labeling them into positive samples, which are then used as SFT training data. There are two ways to construct positive samples:

  • Directly changing the erroneous action into the correct action, so that the model tries its best not to make mistakes.

    $$
    \left\{
    \begin{aligned}
    \mathcal{T}_{-} &= \bigl( \text{instruction}, (o_1, t_1, a_1), (o_2, t_2, a_2), \ldots, (o_\tau, \textcolor{red}{t_\tau}, \textcolor{red}{a_\tau}) \bigr) \\[6pt]
    \mathcal{T}_{+} &= \bigl( \text{instruction}, (o_1, t_1, a_1), (o_2, t_2, a_2), \ldots, (o_\tau, \textcolor{green}{t_\tau^{*}}, \textcolor{green}{a_\tau^{*}}) \bigr)
    \end{aligned}
    \right.
    $$
  • Changing the step after the erroneous action into a corrective action, so that the model knows how to fix the mistake after making it.

    $$
    \left\{
    \begin{aligned}
    \mathcal{T}_{-} &= \bigl( \text{instruction}, (o_1, t_1, a_1), (o_2, t_2, a_2), \ldots, (o_\tau, \textcolor{red}{t_\tau}, \textcolor{red}{a_\tau}), (o_{\tau+1}, t_{\tau+1}, a_{\tau+1}) \bigr) \\[6pt]
    \mathcal{T}_{+} &= \bigl( \text{instruction}, (o_1, t_1, a_1), (o_2, t_2, a_2), \ldots, (o_\tau, \textcolor{red}{t_\tau}, \textcolor{red}{a_\tau}), (o_{\tau+1}, \textcolor{green}{t_{\tau+1}^{*}}, \textcolor{green}{a_{\tau+1}^{*}}) \bigr)
    \end{aligned}
    \right.
    $$
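
The two constructions can be sketched in a few lines of Python. The `Step` type, function names, and the example email trace are all hypothetical illustrations; the instruction is left out of the trace for brevity, and `tau` is a 0-based index here. In both cases the observation at the edited step stays unchanged, and only the thought and action are rewritten.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    observation: str  # o_i: what the agent sees
    thought: str      # t_i: its reasoning
    action: str       # a_i: what it does


Trace = List[Step]


def error_correction_pair(trace: Trace, tau: int,
                          thought: str, action: str) -> Tuple[Trace, Trace]:
    """Way 1: T- ends at the erroneous step tau; T+ rewrites that step."""
    negative = trace[: tau + 1]
    fixed = replace(trace[tau], thought=thought, action=action)
    positive = trace[:tau] + [fixed]
    return negative, positive


def post_reflection_pair(trace: Trace, tau: int,
                         thought: str, action: str) -> Tuple[Trace, Trace]:
    """Way 2: both keep the error at tau; T+ rewrites step tau+1 as a recovery."""
    negative = trace[: tau + 2]
    recovery = replace(trace[tau + 1], thought=thought, action=action)
    positive = trace[: tau + 1] + [recovery]
    return negative, positive


trace = [
    Step("inbox screen", "open the first email", "click(email_1)"),
    Step("email_1 open", "remove this email", "click(archive)"),  # the mistake
    Step("email archived", "the task is done", "finished()"),
]

neg1, pos1 = error_correction_pair(
    trace, 1, "the trash icon deletes the email", "click(delete)")
neg2, pos2 = post_reflection_pair(
    trace, 1, "I archived instead of deleting, so undo first", "click(undo)")
```

In `pos2` the wrong click is deliberately kept in the history, which is what lets the model see an example of recovering from a mistake rather than only avoiding it.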

DPO

In the preceding SFT, the erroneous samples were merely converted into positive samples for training, without leveraging the information in the negative samples themselves. The idea of DPO is a bit like an SVM: it not only separates the positive and negative samples, but also pushes them as far apart as possible.

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_\tau \Big[ \log \sigma\big( \beta \log \tfrac{\pi_\theta(a'_\tau|s_\tau)}{\pi_{\text{SFT}}(a'_\tau|s_\tau)} - \beta \log \tfrac{\pi_\theta(a_\tau|s_\tau)}{\pi_{\text{SFT}}(a_\tau|s_\tau)} \big) \Big]
$$

DPO trains with the loss function above, where $a'_\tau$ is the corrected (positive) action and $a_\tau$ is the erroneous (negative) action at state $s_\tau$. Each ratio inside a log measures how strongly the model being trained, $\pi_\theta$, prefers an action relative to the SFT model, $\pi_{\text{SFT}}$. The first ratio is for the positive sample and the second for the negative sample. The optimization objective is to make the first as large as possible and the second as small as possible: the model being trained should favor positive samples more than the old SFT model does, and move away from negative samples. This makes the model’s preference between positive and negative samples sharper.
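
For a single (positive, negative) action pair, the loss can be computed as in the minimal sketch below. The function and argument names are my own, the inputs are plain log-probabilities rather than real model outputs, and `beta=0.1` is just an illustrative default, not a value taken from the paper.

```python
import math


def dpo_step_loss(logp_theta_pos, logp_sft_pos,
                  logp_theta_neg, logp_sft_neg, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(margin)).

    logp_*_pos: log-probability of the corrected action a' under each model.
    logp_*_neg: log-probability of the erroneous action a under each model.
    """
    margin = (beta * (logp_theta_pos - logp_sft_pos)
              - beta * (logp_theta_neg - logp_sft_neg))
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the SFT model: margin is 0, so the loss is log(2).
baseline = dpo_step_loss(-2.0, -2.0, -1.0, -1.0)

# Policy raises the positive action and lowers the negative one: loss drops.
improved = dpo_step_loss(-1.0, -2.0, -3.0, -1.0)

print(f"baseline={baseline:.4f}, improved={improved:.4f}")
```

Because the loss depends only on the difference between the two log-ratios, it keeps decreasing as the gap between positive and negative samples widens, which is the margin-like behavior the SVM analogy points at.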

I won’t go into much detail on the experiments. The original paper has very thorough experiments on perception, grounding, and more. Overall, there is a considerable improvement over the original Qwen2 VL. The paper also uses methods such as reasoning to further boost performance, which are relatively general-purpose techniques and won’t be repeated here.