
Last week, Hugging Face introduced two new variants of its SmolVLM vision language model. The new artificial intelligence (AI) models come in 256 million and 500 million parameter sizes, the former being called the world's smallest vision language model by the company. The new variants focus on retaining the efficiency of the 2 billion parameter model while significantly reducing its size. The company stressed that the new models can run locally on constrained devices and consumer laptops, and potentially even support browser-based inference.
Hugging Face Introduces Smaller SmolVLM AI Models
In a blog post, the company announced the SmolVLM-256M and SmolVLM-500M vision language models, which join the existing 2 billion parameter model. The release brings two base models and two instruction fine-tuned models in the above parameter sizes.
Hugging Face says these models can be loaded directly into Transformers, MLX (Apple's machine learning framework), and ONNX (Open Neural Network Exchange), and developers can build on top of the base models. It is worth noting that these are open-source models released under the Apache 2.0 licence and are available for both personal and commercial use.
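As an illustration of what loading such a model through Transformers might look like, here is a minimal sketch. The model ID HuggingFaceTB/SmolVLM-256M-Instruct, the placeholder image URL, and the chat-template prompt format are assumptions based on common Hugging Face conventions, not details confirmed in the announcement.

```python
# Minimal sketch: running the 256M instruct variant with Transformers (model ID assumed).
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed model identifier

# Load the processor (image preprocessing + chat templating) and the model weights.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Any local path or URL works; this one is a placeholder.
image = load_image("https://example.com/sample.jpg")

# Build a chat-style prompt containing one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image briefly."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize, generate, and decode the model's answer.
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```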
With the new AI models, Hugging Face aims to bring multimodal models focused on computer vision to portable devices. For example, the 256 million parameter model can run in under 1GB of GPU memory and, using 15GB of RAM, processes 16 images per second at a batch size of 64.
“For a mid-sized company processing 1 million images per month, this means significant savings in annual computational costs,” Andrés Marafioti, a machine learning research engineer at Hugging Face, told VentureBeat.
To reduce the size of the AI models, the researchers switched the vision encoder from the previous SigLIP 400M to a 93M parameter SigLIP base patch encoder. In addition, tokenization has also been optimised: the new vision models encode images at a rate of 4,096 pixels per token, compared with 1,820 pixels per token in the 2B model.
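To put those encoding rates in perspective, the short calculation below compares the approximate number of image tokens a 512 × 512 image would cost under each rate; the image size is an arbitrary example, not a figure from the announcement.

```python
# Rough token-count comparison for an example 512 x 512 image (arbitrary size).
width, height = 512, 512
pixels = width * height  # 262,144 pixels

tokens_small = pixels / 4096  # new 256M/500M models: 4,096 pixels per token -> ~64 tokens
tokens_2b = pixels / 1820     # existing 2B model: 1,820 pixels per token -> ~144 tokens

print(f"256M/500M models: ~{tokens_small:.0f} tokens per image")
print(f"2B model:         ~{tokens_2b:.0f} tokens per image")
```

Fewer tokens per image mean less memory and compute are spent on each picture, which is what allows the smaller variants to run on constrained hardware.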
It is worth noting that the smaller models lag slightly behind the 2B model in terms of performance, but the company says the trade-off is acceptable. According to Hugging Face, the 256M variant can be used for captioning images or short videos, answering questions about documents, and basic visual reasoning tasks.
Developers can use Transformers and MLX for inference and fine-tuning of the AI models, and existing SmolVLM code works with them out of the box. The models are also listed on Hugging Face.