CogVLM2 is the second generation of the CogVLM vision-language model series, developed by ZhipuAI and released in 2024. Built on Meta-Llama-3-8B-Instruct, it delivers stronger performance than its predecessor on multimodal benchmarks such as TextVQA, DocVQA, and ChartQA, while extending the context length to 8K tokens and supporting image inputs at resolutions up to 1344×1344. The series covers both image and video understanding; CogVLM2-Video handles videos up to one minute long by analyzing extracted keyframes. The models support bilingual interaction (Chinese and English), and open-source versions are available for dialogue and video comprehension. Notably, the Int4 quantized version enables efficient inference on GPUs with only 16GB of memory. The repository offers demos, API servers with OpenAI API-compatible endpoints, and fine-tuning examples, making it accessible to both researchers and developers.
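As a rough sketch of how a client might query such an OpenAI-compatible server, the snippet below uses the standard `openai` Python client. The base URL, port, API key placeholder, and model name are assumptions for illustration, not fixed defaults of the repository; adjust them to match however you launch the API server demo.

```python
# Sketch: send an image + text prompt to an OpenAI-compatible endpoint.
# Assumes the server is reachable at http://localhost:8000/v1 and exposes a
# model named "cogvlm2" -- both are illustrative assumptions.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local image as a base64 data URL, a common way to pass images
# through OpenAI-style chat endpoints.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="cogvlm2",  # assumed model identifier exposed by the server
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```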
## Features
- Supports both image and video understanding with multimodal reasoning
- Up to 8K context length and 1344×1344 image resolution input
- Bilingual support for English and Chinese interactions
- Quantized Int4 version for efficient inference on 16GB GPUs (see the loading sketch after this list)
- Outperforms previous open-source models on TextVQA, DocVQA, and ChartQA
- Provides demos for CLI, Gradio, API server, and fine-tuning workflows
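A minimal sketch of loading the Int4 chat checkpoint with Hugging Face `transformers` is shown below. The checkpoint id is assumed to be the Int4 model published alongside the repository, and loading it is assumed to require `bitsandbytes`; the model's custom code is pulled in via `trust_remote_code`.

```python
# Sketch: load the Int4 quantized chat model for low-memory inference.
# MODEL_PATH is an assumed checkpoint id; verify it against the repository's
# model list before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B-int4"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,   # pulls in the model's custom modeling code
    low_cpu_mem_usage=True,   # stream weights in to keep host RAM usage low
).eval()

# Image and text prompts are assembled with the model's own conversation
# helpers; see the repository's basic_demo scripts for the full pipeline.
```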