The VL Model Behind the Doubao AI Phone
According to public reporting, the model used in the Doubao AI phone is a closed-source version of UI-TARS optimized for phones. UI-TARS itself is built by further training Alibaba’s Qwen2 VL, and its 7B version is currently open-sourced (Qwen2 VL has open-sourced 2B, 7B, and 72B models). Qwen2 VL already has UI operation capabilities, so rather than dwelling on Qwen, this post focuses on how UI-TARS improves on top of it, split into data and training.
Data
The core of the work is finer-grained data construction: each UI screenshot is annotated in detail from the bottom up, from the smallest button to the overall layout, and captions are even added to describe the state before and after a transition.
| Data Type | Description |
|---|---|
| Element Description | Information about a single element, including its type (button, input box, etc.—much like component categories on the frontend), visual description (color, appearance, etc.), positional information (relative spatial information such as above/below/left/right), and the element’s function (e.g. deleting an email) |
| Dense Caption | A long passage of detailed text describing the entire interface |
| State Transition Caption | Placing a set of images together to describe the change between before and after, and whether an action such as a key press was performed |
| QA | Questions and answers about the UI |
| Set of Mark | Adding some marks within the UI (for example, boxing off a portion), and constructing QA based on those marks |

The paper mentions that a total of 50 billion tokens of data were constructed to train the 7B and 72B models (the pretraining of Qwen2 VL already used 1.4 trillion tokens of data).
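To make the Element Description row in the table above more concrete, here is a minimal sketch of what one annotation record might look like. The class name, field names, and example values are all hypothetical and chosen for illustration; the paper's actual storage format is not described in this post.

```python
from dataclasses import dataclass, asdict


@dataclass
class ElementDescription:
    """One annotated UI element (illustrative schema, not the paper's format)."""
    element_type: str        # e.g. "button", "input box"
    visual_description: str  # color, appearance, etc.
    position: str            # relative spatial information
    function: str            # what the element does


# A made-up example for a delete button in a mail client.
example = ElementDescription(
    element_type="button",
    visual_description="red trash-can icon",
    position="top right, to the left of the reply button",
    function="deletes the current email",
)

# Flatten to a plain dict, e.g. for writing out as one JSON line.
record = asdict(example)
```

Keeping the four aspects as separate fields mirrors the table: type, appearance, position, and function can each be turned into its own training question later.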
Beyond this data, the later SFT training stage additionally constructed error-correction data pairs (error + correction), telling the agent how to recover after it has already clicked the wrong thing on a UI. This is a major highlight too (constructing these complex, deeply annotated datasets looks like it cost a lot of money…).
Training
The UI-TARS training process can be divided into three steps: pretraining, SFT, and DPO.
Pretraining
All of the data mentioned above is used for pretraining, which essentially means continuing to train Qwen2 VL on this specific data. A rough ChatGPT-assisted estimate of pretraining the 7B and 72B models on 50 billion tokens, converted into H200 compute, comes out to:
- 7B: ≈ 49.2 – 70.2 H200 GPU-days
- 72B: ≈ 505.6 – 722.2 H200 GPU-days
That looks fine. With 128 GPUs it is actually quite fast: about half a day for the 7B model and roughly four to six days for the 72B model.
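The GPU-day figures above can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the common 6 × parameters × tokens rule for training FLOPs, a dense BF16 peak of about 989 TFLOPS per H200, and a utilization (MFU) between 35% and 50%. These inputs are my assumptions rather than anything stated in the paper, and they ignore the vision encoder and any multi-epoch training, but they do match the quoted ranges.

```python
H200_PEAK_FLOPS = 989e12  # assumed dense BF16 peak per GPU, in FLOP/s
SECONDS_PER_DAY = 86_400


def gpu_days(params: float, tokens: float, mfu: float,
             peak_flops: float = H200_PEAK_FLOPS) -> float:
    """Estimate GPU-days using the 6 * N * D approximation for training FLOPs."""
    total_flops = 6 * params * tokens
    achieved_flops_per_second = peak_flops * mfu
    return total_flops / achieved_flops_per_second / SECONDS_PER_DAY


TOKENS = 50e9  # 50 billion tokens, as quoted from the paper

for name, params in [("7B", 7e9), ("72B", 72e9)]:
    best = gpu_days(params, TOKENS, mfu=0.50)   # optimistic utilization
    worst = gpu_days(params, TOKENS, mfu=0.35)  # pessimistic utilization
    print(f"{name}: {best:.1f} to {worst:.1f} H200 GPU-days, "
          f"{best / 128:.1f} to {worst / 128:.1f} days on 128 GPUs")
```

Dividing by the GPU count assumes perfect scaling across 128 GPUs, so the wall-clock numbers are a lower bound.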
SFT
This stage is more fine-grained. It not only uses the high-quality portion of the data above, but also semi-automatically generated trace data + error-correction data to further strengthen the ability to perform sequential operations.
Trace
For sequential data like traces, native datasets are quite scarce, so a semi-automatic approach is used to generate data + iterate on the model. Each iteration creates a batch of tasks for the model to run, and then through methods such as human annotation and model scoring, high-quality trace data is filtered out for the next round of model training—repeatedly iterating with the high-quality data the model itself produces.
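The iterate-and-filter loop described above can be sketched roughly as follows. Every name here (`bootstrap_traces`, `make_tasks`, `run_task`, `score`, `retrain`) is a hypothetical stand-in, and the threshold-based filter is a simplification of the mix of human annotation and model scoring; this is not the paper's pipeline.

```python
from typing import Callable, List

Trace = List[str]  # a trace is modeled here as a list of action strings


def bootstrap_traces(
    make_tasks: Callable[[int], List[str]],
    run_task: Callable[[str], Trace],
    score: Callable[[Trace], float],
    retrain: Callable[[List[Trace]], None],
    rounds: int = 3,
    threshold: float = 0.8,
) -> List[Trace]:
    """Collect high-quality traces over several generate/filter/retrain rounds."""
    dataset: List[Trace] = []
    for round_idx in range(rounds):
        tasks = make_tasks(round_idx)                       # new batch of tasks
        candidates = [run_task(task) for task in tasks]     # current model acts
        kept = [t for t in candidates if score(t) >= threshold]  # quality filter
        dataset.extend(kept)
        retrain(dataset)                                    # train the next model
    return dataset


# Toy stand-ins so the sketch runs end to end.
def toy_make_tasks(round_idx: int) -> List[str]:
    return [f"task-{round_idx}-ok", f"task-{round_idx}-fail"]


def toy_run_task(task: str) -> Trace:
    return [task, "click", "done" if task.endswith("ok") else "stuck"]


def toy_score(trace: Trace) -> float:
    return 1.0 if trace[-1] == "done" else 0.0


retrain_sizes: List[int] = []


def toy_retrain(dataset: List[Trace]) -> None:
    retrain_sizes.append(len(dataset))  # a real version would update the model


traces = bootstrap_traces(toy_make_tasks, toy_run_task, toy_score, toy_retrain)
```

In a real setup `run_task` would call whichever model `retrain` last produced, so the quality of the generated traces improves from round to round.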
Reflection Tuning
Error-correction data is obtained by taking the model’s erroneous traces and re-labeling them into positive samples, which are then used as SFT training data. There are two ways to construct positive samples:
- Directly changing the erroneous action into the correct action, so that the model tries its best not to make mistakes.
- Changing the step after the erroneous action into a corrective action, so that the model knows how to fix the mistake after making it.
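The two constructions differ only in what the model sees as history and what it is asked to output. Below is a minimal sketch under the assumption that a trace is a flat list of action strings; the `SFTSample` class, function names, and example actions are hypothetical, and real samples would also carry screenshots and reasoning text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SFTSample:
    history: List[str]  # actions the model sees as context
    target: str         # action the model is trained to output


def error_correction_sample(actions: List[str], error_idx: int,
                            correct_action: str) -> SFTSample:
    """Way 1: replace the wrong action itself with the correct one."""
    return SFTSample(history=list(actions[:error_idx]), target=correct_action)


def post_reflection_sample(actions: List[str], error_idx: int,
                           recovery_action: str) -> SFTSample:
    """Way 2: keep the wrong action in the history and train on the recovery step."""
    return SFTSample(history=list(actions[:error_idx + 1]), target=recovery_action)


# Toy trace in which step 1 (deleting instead of archiving) is the mistake.
actions = ["open_inbox", "click_delete", "confirm"]
avoid = error_correction_sample(actions, error_idx=1, correct_action="click_archive")
recover = post_reflection_sample(actions, error_idx=1, recovery_action="click_undo")
```

In the first sample the mistake never appears in the context; in the second it does, which is what lets the model learn recovery rather than only avoidance.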
DPO
In the preceding SFT, the erroneous samples were merely converted into positive samples for training, without leveraging the information in the negative samples themselves. The idea of DPO is a bit like an SVM: it not only separates the positive and negative samples, but also pushes the distance between them as far apart as possible.
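DPO trains with the standard DPO loss, where $x$ is the context, $y^+$ the positive (corrected) sample, $y^-$ the negative (erroneous) sample, $\sigma$ the sigmoid function, and $\beta$ a scaling hyperparameter:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{SFT}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{SFT}}(y^- \mid x)}\right)\right]$$

The ratios inside the logs represent the preference of the model being trained, $\pi_\theta$, relative to the SFT model, $\pi_{\mathrm{SFT}}$. The former is the comparison of preference on positive samples, and the latter on negative samples. The optimization objective is to make the former as large as possible and the latter as small as possible. In other words, the model being trained should favor positive samples more than the old SFT model does, and stay away from negative samples. This makes the model’s preference between positive and negative samples clearer and more distinct.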
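For a single preference pair, the positive and negative log-ratio terms described above combine as in the sketch below. It uses the standard DPO form, in which the loss is the negative log-sigmoid of the scaled difference between the two log-ratios; the function name, argument names, and the default `beta=0.1` are my own illustrative choices, not values from the paper.

```python
import math


def dpo_loss(logp_pos: float, logp_neg: float,
             ref_logp_pos: float, ref_logp_neg: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (positive, negative) pair, given sequence log-probabilities.

    logp_*     : log-probabilities under the model being trained
    ref_logp_* : log-probabilities under the frozen SFT (reference) model
    """
    pos_log_ratio = logp_pos - ref_logp_pos  # log(pi_theta / pi_SFT) on the positive
    neg_log_ratio = logp_neg - ref_logp_neg  # log(pi_theta / pi_SFT) on the negative
    margin = beta * (pos_log_ratio - neg_log_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Identical to the SFT model: both log-ratios are zero, so the loss is log 2.
baseline = dpo_loss(-1.0, -2.0, ref_logp_pos=-1.0, ref_logp_neg=-2.0)

# Positive sample more likely and negative less likely than under SFT: lower loss.
improved = dpo_loss(-0.5, -3.0, ref_logp_pos=-1.0, ref_logp_neg=-2.0)
```

The loss only falls when the gap between the two log-ratios widens, which is the margin-like behavior that motivates the SVM comparison.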
I won’t go into much detail on the experiments. The original paper has very thorough experiments on perception, grounding, and more. Overall, there is a considerable improvement over the original Qwen2 VL. The paper also uses methods such as reasoning to further boost performance, which are relatively general-purpose techniques and won’t be repeated here.