
Alibaba's Qwen research team has released another open source artificial intelligence (AI) model in preview. Called QVQ-72B, it is a vision-based reasoning model that analyzes visual information in images and understands the context behind it. The tech giant also shared benchmark scores for the AI model and highlighted its ability to surpass OpenAI's o1 model on a specific test. Notably, Alibaba has recently released several open source AI models, including the QwQ-32B and Marco-o1 reasoning large language models (LLMs).
Alibaba's Vision-Based QVQ-72B AI Model Launched
In a listing on Hugging Face, Alibaba's Qwen team detailed the new open source AI model. The researchers call it an experimental research model and emphasize that QVQ-72B comes with enhanced visual reasoning capabilities. Notably, visual understanding and reasoning are two separate capabilities that the researchers have merged in this model.
Vision-based AI models are fairly common. They include image encoders that can analyze visual information and the context behind it. Reasoning-focused models (such as o1 and QwQ-32B), on the other hand, rely on test-time compute scaling, which lets them spend more processing time on a query. This allows a model to break a problem down, solve it step by step, evaluate its own output, and correct it against a verification step.
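As a rough illustration of how test-time compute scaling can work, the Python sketch below spends a configurable amount of inference-time effort on a problem by sampling several attempts and keeping the first one that passes a verification step. This is a generic pattern rather than a description of QVQ-72B's internals, and the propose_answer and verify functions are hypothetical placeholders.

```python
import random
from typing import Optional


def propose_answer(problem: str, rng: random.Random) -> int:
    # Hypothetical stand-in for one step-by-step reasoning attempt by a model;
    # here it simply guesses a small integer.
    return rng.randint(0, 9)


def verify(problem: str, candidate: int) -> bool:
    # Hypothetical validator that checks a candidate answer.
    # For this toy arithmetic example, Python evaluates the expression directly.
    return candidate == eval(problem)


def solve_with_more_compute(problem: str, budget: int = 64) -> Optional[int]:
    # A larger budget means more attempts, i.e. more test-time compute.
    rng = random.Random(0)
    for _ in range(budget):
        candidate = propose_answer(problem, rng)
        if verify(problem, candidate):
            return candidate
    return None  # budget exhausted without a verified answer


print(solve_with_more_compute("3 + 4"))  # prints 7 if an attempt verifies, else None
```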
With the QVQ-72B preview model, Alibaba combines these two capabilities. The model can now analyze the information in an image and answer complex queries about it using reasoning-centric structures. The team stressed that this greatly improves the model's performance.
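For readers who want to try the preview model, the sketch below shows how one might pass it an image and a question through Hugging Face's transformers library, assuming the model follows the Qwen2-VL interface. The model identifier, image path, and generation settings are assumptions and may need adjusting.

```python
# Minimal sketch, assuming QVQ-72B-Preview exposes the Qwen2-VL interface on Hugging Face.
# Requires the transformers and qwen-vl-utils packages; the image path is a placeholder.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/QVQ-72B-Preview"  # assumed repository name
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One user turn containing an image plus a reasoning-style question about it.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/diagram.png"},
            {"type": "text", "text": "Solve the problem shown in this figure step by step."},
        ],
    }
]

# Build the chat prompt and pack the image into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Reasoning models tend to produce long chains of thought, so allow many new tokens.
output_ids = model.generate(**inputs, max_new_tokens=4096)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```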
The researchers shared evaluation results from internal testing, claiming that QVQ-72B scored 71.4 percent on the MathVista (mini) benchmark, outperforming the o1 model (71.0 percent). It is also said to have scored 70.3 percent on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark.
Despite the performance gains, the experimental model also has some limitations. The Qwen team said the AI model occasionally mixes different languages or unexpectedly switches between them, a problem known as code-switching. The model is also prone to getting stuck in recursive reasoning loops, which can affect the final output.