Skywork Releases UniPic, a Unified 1.5B Vision Model

The new autoregressive model from the Chinese AI lab can understand, generate, and edit images within a single, compact framework.

Jul 29, 2025

AI lab Skywork, known for its Chinese-language LLMs, has released Skywork-UniPic-1.5B, a compact multimodal model designed to handle a variety of vision tasks within a single architecture. At just 1.5 billion parameters, UniPic uses a unified autoregressive approach to process and create images, a departure from more specialized, single-task models.

UniPic's key feature is its versatility. Rather than requiring a separate model for each function, it integrates image understanding, generation, and editing into one system. This multi-task design reflects a growing trend toward more efficient, general-purpose AI systems that can both reason about and manipulate visual data.

A Unified Vision Framework

UniPic performs three primary functions:

Image Understanding: The model can interpret the content of an image and answer questions about it.
Image Generation: It can create new images from descriptive text prompts.
Image Editing: It can modify existing images based on user instructions.

The model, code, and further details are available on the project's Hugging Face repository. It is released under the Skywork License Agreement, which allows for research and commercial use with certain restrictions. Its relatively small size could make it an accessible tool for researchers and developers experimenting with unified vision architectures.
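To put the "relatively small size" in concrete terms, here is a back-of-the-envelope sketch of how much memory the weights of a 1.5-billion-parameter model would occupy at common numeric precisions. The parameter count comes from the release; the per-parameter byte sizes are the standard widths of those data types, and the function name is purely illustrative. Which precisions UniPic is actually distributed in is an assumption not confirmed here, and the estimate covers weights only, not activations, any auxiliary image components, or framework overhead.

```python
# Rough weight-memory estimate for a 1.5B-parameter model (weights only).
PARAMS = 1_500_000_000  # parameter count stated in the article

# Standard storage width per parameter, in bytes.
# Assumption: these precisions are illustrative, not a statement of
# how the UniPic checkpoint is actually stored.
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for dtype, size in BYTES_PER_PARAM.items():
    print(f"{dtype:>17}: {weight_memory_gb(PARAMS, size):.1f} GB")
```

Under these assumptions the weights come to roughly 6 GB in 32-bit floats and 3 GB in 16-bit floats, which is within reach of a single consumer GPU.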

Sources

Skywork/Skywork-UniPic-1.5B (Hugging Face)

