Qwen-VL

Qwen-VL is Alibaba Cloud’s vision-language large model family, designed to integrate visual and linguistic modalities. It accepts image inputs (with optional bounding boxes) and text, and produces text (and sometimes bounding boxes) as output. The model variants (VL-Plus, VL-Max, etc.) have been upgraded for better visual reasoning, text recognition from images, fine-grained understanding, and support for high image resolutions / extreme aspect ratios. Qwen-VL supports multilingual inputs and conversation (e.g. Chinese, English), and is aimed at tasks like image captioning, question answering on images (VQA, DocVQA), grounding (detecting objects or regions from textual queries), etc.

Features

Strong performance on many vision-language tasks: image captioning, VQA, DocVQA, grounding, text recognition in images etc.
Very high resolution image input support (up to millions of pixels), and handling extreme aspect ratios for detailed visual content
Multilingual: supports Chinese, English, and other languages in image text / conversation tasks
Variants (VL-Plus, VL-Max) offer increasing capability: VL-Max is more capable in instruction following, visual reasoning, better model of cognition etc.
Fine-tuning and quantization options (e.g. Int4 modes, Q-LoRA) for lower resource usage
Supports multi-image interleaved conversations: comparing multiple images, storytelling, multi-image inputs in dialogues

Project Samples

Project Activity

See All Activity >

License

Apache License V2.0

Follow Qwen-VL

Qwen-VL Web Site

User Reviews

Be the first to post a review of Qwen-VL!

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Large Language Models (LLM), Python AI Models

Registered

2025-09-23

Similar Business Software

LM-Kit.NET

LM-Kit.NET is a cutting-edge, high-level inference SDK designed specifically to bring the advanced capabilities of Large Language Models (LLM) into the C# ecosystem. Tailored for developers working within .NET, LM-Kit.NET provides a comprehensive suite of powerful Generative AI tools, making...

See Software
Qwen3-VL

Qwen3-VL is the newest vision-language model in the Qwen family (by Alibaba Cloud), designed to fuse powerful text understanding/generation with advanced visual and video comprehension into one unified multimodal model. It accepts inputs in mixed modalities, text, images, and video, and handles...

See Software
Google AI Studio

Google AI Studio is a unified development platform that helps teams explore, build, and deploy applications using Google’s most advanced AI models, including Gemini 3. It brings text, image, audio, and video models together in one interactive playground. With vibe coding, developers can use...

See Software
Vertex AI

Build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case. Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. You can use BigQuery ML to create and execute machine learning models in BigQuery...

See Software
GLM-4.6V

GLM-4.6V is a state-of-the-art open source multimodal vision-language model from the Z.ai (GLM-V) family designed for reasoning, perception, and action. It ships in two variants: a full-scale version (106B parameters) for cloud or high-performance clusters, and a lightweight “Flash” variant (9B)...

See Software
Qwen2-VL

Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model familities. Compared with Qwen-VL, Qwen2-VL has the capabilities of: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding...

See Software