Qwen2-VL - Vision Language Model

Qwen2-VL is the latest vision language model in the Qwen model family, built on Qwen2. Qwen-VL (Qwen Large Vision Language Model) is the visual multimodal variant of the Qwen model series, developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as input and produces text and bounding boxes as output.

Key features of Qwen-VL include:

Exceptional Performance: It outperforms existing open-source Large Vision Language Models (LVLMs) of similar scale across multiple English evaluation benchmarks, including Zero-shot Captioning, VQA, DocVQA, and Grounding.
Multilingual LVLM with Text Recognition: Qwen-VL supports multilingual conversations and excels at end-to-end recognition of bilingual text, particularly in Chinese and English, within images.
Multi-Image Interleaved Conversations: This capability allows for the comparison of multiple images, posing specific questions about them, and engaging in multi-image storytelling.
First Generalist Model Supporting Grounding in Chinese: It can detect bounding boxes through open-domain language expressions in both Chinese and English.
Fine-Grained Recognition and Understanding: With a resolution of 448, compared to the 224 used by other open-source LVLMs, Qwen-VL offers enhanced fine-grained text recognition, document QA, and bounding box annotation.
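
As a rough illustration of how grounded output might be consumed, the sketch below extracts labeled boxes from a response string and converts them to pixel coordinates. It assumes a `<ref>label</ref><box>(x1,y1),(x2,y2)</box>` markup with coordinates normalized to a 0-1000 range; that format, the function name `parse_boxes`, and the sample string are assumptions made for illustration, so check real model output before relying on it.

```python
import re

# Assumed output markup: <ref>label</ref><box>(x1,y1),(x2,y2)</box>,
# with coordinates normalized to 0-1000 (an assumption, not confirmed here).
BOX_PATTERN = re.compile(
    r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
)


def parse_boxes(response, image_width, image_height, scale=1000):
    """Return (label, (x1, y1, x2, y2)) pairs in pixel coordinates."""
    results = []
    for label, x1, y1, x2, y2 in BOX_PATTERN.findall(response):
        pixel_box = (
            int(x1) * image_width / scale,
            int(y1) * image_height / scale,
            int(x2) * image_width / scale,
            int(y2) * image_height / scale,
        )
        results.append((label.strip(), pixel_box))
    return results


# Hypothetical response text, written by hand for this example.
sample = "<ref>the dog</ref><box>(100,200),(500,800)</box>"
boxes = parse_boxes(sample, image_width=640, image_height=480)
print(boxes)
```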

Based on text prompts and input images, Qwen-VL can produce high-quality text responses, such as captions, answers, and grounded descriptions, for different industry-specific scenarios.

Qwen-VL analyzes the objects and text in images and generates new text content based on what it recognizes.

The Qwen Dev Team developed two models in the Qwen-VL series:

Qwen-VL

This pre-trained LVLM uses Qwen-7B as the base language model and OpenCLIP ViT-bigG as the visual encoder, connected by a randomly initialized cross-attention layer. It was trained on approximately 1.5 billion image-text pairs, with a final image input resolution of 448.

Qwen-VL-Chat

A multimodal LLM-based AI assistant, enhanced with alignment techniques for improved performance in interactive tasks.

Evaluation

The model’s capabilities were evaluated from two perspectives:

Standard Benchmarks

The model’s performance was assessed across four major multimodal task categories:

Zero-shot Caption: Evaluates the model’s ability to generate image captions without prior exposure to specific datasets.
General VQA: Assesses the model’s skill in answering general visual questions, including judgments about color, number, and categories.
Text-based VQA: Tests the model’s capacity to recognize and respond to text in images, such as document-based and chart-based questions.
Referring Expression Comprehension: Measures the model’s ability to localize a target object in an image based on a descriptive phrase.
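
Localization tasks like referring expression comprehension are commonly scored by comparing the predicted box with the ground-truth box using intersection over union (IoU). The sketch below shows that comparison; the 0.5 threshold is a common convention and an assumption here, since the text above does not state which threshold was used, and the function names are illustrative.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(predicted, ground_truth, threshold=0.5):
    """Count a prediction as correct when IoU reaches the threshold."""
    return box_iou(predicted, ground_truth) >= threshold


# Made-up boxes: the prediction is shifted by 10 pixels in each direction.
print(box_iou((10, 10, 110, 110), (20, 20, 120, 120)))
print(is_correct((10, 10, 110, 110), (20, 20, 120, 120)))
```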

TouchStone Benchmark

A comprehensive evaluation tool designed to assess the text-image dialogue capability and alignment with human responses. TouchStone includes:

300+ images, 800+ questions, and 27 categories: Covering a range of tasks like attribute-based Q&A, celebrity recognition, poetry writing, multi-image summarization, product comparison, and math problem-solving.
Fine-grained image annotations: Provides detailed human-labeled image annotations, which are given to GPT-4 together with the questions and model outputs so it can score the LVLM's accuracy and alignment.
English and Chinese versions: Ensures broad evaluation across different languages.

Zero-shot Captioning & General VQA

Model type	Model	Captioning: NoCaps	Captioning: Flickr30K	VQA: VQAv2 (dev)	VQA: OK-VQA	VQA: GQA	VQA: SciQA-Img (0-shot)	VQA: VizWiz (0-shot)
Generalist Models	Flamingo-9B	–	61.5	51.8	44.7	–	–	28.8
	Flamingo-80B	–	67.2	56.3	50.6	–	–	31.6
	Unified-IO-XL	100.0	–	77.9	54.0	–	–	–
	Kosmos-1	–	67.1	51.0	–	–	–	29.2
	Kosmos-2	–	66.7	45.6	–	–	–	–
	BLIP-2 (Vicuna-13B)	103.9	71.6	65.0	45.9	32.3	61.0	19.6
	InstructBLIP (Vicuna-13B)	121.9	82.8	–	–	49.5	63.1	33.4
	Shikra (Vicuna-13B)	–	73.9	77.36	47.16	–	–	–
	Qwen-VL (Qwen-7B)	121.4	85.8	78.8	58.6	59.3	67.1	35.2
	Qwen-VL-Chat	120.2	81.0	78.2	56.6	57.5	68.2	38.9
Previous SOTA (Per Task Fine-tuning)	–	127.0 (PALI-17B)	84.5 (InstructBLIP-FlanT5-XL)	86.1 (PALI-X-55B)	66.1 (PALI-X-55B)	72.1 (CFR)	92.53 (LLaVA + GPT-4)	70.9 (PALI-X-55B)

Qwen2-VL Model Benchmark Evaluation

Text-oriented VQA

Text-oriented VQA (focused on text understanding capabilities in images):

Model type	Model	TextVQA	DocVQA	ChartQA	AI2D	OCR-VQA
Generalist Models	BLIP-2 (Vicuna-13B)	42.4	–	–	–	–
	InstructBLIP (Vicuna-13B)	50.7	–	–	–	–
	mPLUG-DocOwl (LLaMA-7B)	52.6	62.2	57.4	–	–
	Pix2Struct-Large (1.3B)	–	76.6	58.6	42.1	71.3
	Qwen-VL (Qwen-7B)	63.8	65.1	65.7	62.3	75.7
Specialist SOTAs (Specialist/Finetuned)	PALI-X-55B (Single-task FT) (Without OCR Pipeline)	71.44	80.0	70.0	81.2	75.0

Chat

TouchStone is a benchmark designed to evaluate the abilities of the LVLM model in text-image dialogue and alignment with human responses, using GPT-4 for scoring. It encompasses over 300 images, 800 questions, and 27 categories, including tasks like attribute-based Q&A, celebrity recognition, poetry writing, multi-image summarization, product comparison, and math problem solving. For more detailed information, please refer to eval/EVALUATION.md.

Model	Score
PandaGPT	488.5
MiniGPT4	531.7
InstructBLIP	552.4
LLaMA-AdapterV2	590.1
mPLUG-Owl	605.4
LLaVA	602.7
Qwen-VL-Chat	645.2

Qwen2-VL Model Architecture

Qwen2-VL now supports arbitrary image resolutions, dynamically converting them into visual tokens. This enhancement provides a more human-like visual processing experience, allowing the model to adapt seamlessly to different image sizes and clarity levels.
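
To make the mapping from resolution to token count concrete, here is a minimal sketch. It assumes each visual token covers roughly a 28×28-pixel area, which is inferred from the `256*28*28` style pixel budgets in the Quickstart snippet. The function name `estimate_visual_tokens` and the rounding rules are illustrative simplifications, not the library's actual resize logic.

```python
import math

PIXELS_PER_TOKEN_SIDE = 28  # assumption inferred from the 256*28*28 style budgets


def estimate_visual_tokens(height, width, min_tokens=4, max_tokens=16384):
    """Estimate the visual token count for one image under a token budget."""
    side = PIXELS_PER_TOKEN_SIDE
    min_pixels = min_tokens * side * side
    max_pixels = max_tokens * side * side

    # Snap each dimension to a multiple of the token side length.
    new_h = max(side, round(height / side) * side)
    new_w = max(side, round(width / side) * side)

    if new_h * new_w > max_pixels:
        # Shrink while keeping the aspect ratio, rounding down to stay in budget.
        shrink = math.sqrt(height * width / max_pixels)
        new_h = max(side, math.floor(height / shrink / side) * side)
        new_w = max(side, math.floor(width / shrink / side) * side)
    elif new_h * new_w < min_pixels:
        # Enlarge while keeping the aspect ratio, rounding up to reach the budget.
        grow = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * grow / side) * side
        new_w = math.ceil(width * grow / side) * side

    return (new_h // side) * (new_w // side)


# A 448x448 image fits the default budget without rescaling.
print(estimate_visual_tokens(448, 448))
# A 1080p frame is scaled down to fit a tighter 256-1280 token budget.
print(estimate_visual_tokens(1080, 1920, min_tokens=256, max_tokens=1280))
```

Under these assumptions, larger images cost more tokens until the upper budget is reached, after which they are downscaled, which is the trade-off between detail and memory that the budget controls.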

Qwen2-VL also uses Multimodal Rotary Position Embedding (M-ROPE), which decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.
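
The decomposition can be pictured as giving every token three position ids (temporal, height, width) instead of one. The sketch below is a simplified illustration of that idea, not the model's implementation. It assumes text tokens reuse the same id in all three components, vision tokens take ids from their frame, row, and column, and each new segment starts after the largest id used so far. The function name `decomposed_position_ids` and the segment format are hypothetical.

```python
def decomposed_position_ids(segments):
    """Build (temporal, height, width) position ids for a mixed sequence.

    Each segment is ("text", n_tokens) or ("vision", frames, rows, cols).
    """
    positions = []
    next_start = 0
    for segment in segments:
        if segment[0] == "text":
            # Text tokens share one id across all three components (plain 1D).
            count = segment[1]
            positions.extend((next_start + i,) * 3 for i in range(count))
            next_start += count
        else:
            _, frames, rows, cols = segment
            for t in range(frames):
                for r in range(rows):
                    for c in range(cols):
                        positions.append(
                            (next_start + t, next_start + r, next_start + c)
                        )
            # Assumed rule: the next segment starts after the largest id used.
            next_start += max(frames, rows, cols)
    return positions


# Two text tokens, one image of 2x2 visual tokens, then one more text token.
ids = decomposed_position_ids([("text", 2), ("vision", 1, 2, 2), ("text", 1)])
for triple in ids:
    print(triple)
```

An image is treated here as a single-frame video, so its temporal id stays constant while the height and width ids vary across the grid.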

Requirements

Qwen2-VL is integrated into the latest version of Hugging Face Transformers. To avoid errors such as KeyError: 'qwen2_vl', it is recommended to install the package directly from the source using the following command:

pip install git+https://github.com/huggingface/transformers
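
If you prefer to check an existing installation before reinstalling, a version comparison along these lines may help. It is a minimal sketch that assumes Qwen2-VL support first shipped in Transformers 4.45; treat that threshold and the helper name `supports_qwen2_vl` as assumptions and confirm against the release notes.

```python
import re


def supports_qwen2_vl(version, minimum=(4, 45)):
    """Return True if a Transformers version string is at least `minimum`."""
    # Only the major and minor numbers are compared; suffixes such as
    # ".dev0" are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(supports_qwen2_vl("4.44.2"))
print(supports_qwen2_vl("4.45.0.dev0"))
```

The installed version string can be read with `importlib.metadata.version("transformers")` and passed to this helper.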

Quickstart

A toolkit is available to simplify handling various types of visual inputs, including base64, URLs, and interleaved images and videos. You can install it with the following command:

pip install qwen-vl-utils
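
The input forms mentioned above all travel through the same message structure; only the image reference changes. The sketch below builds such messages without calling the toolkit. The helper names are hypothetical, and the `data:image;base64,` prefix is an assumption about the reference format the toolkit accepts, so verify it against the toolkit's documentation.

```python
import base64


def image_message(image_ref, prompt):
    """Build a single-turn user message with one image and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def to_base64_ref(image_bytes):
    """Encode raw image bytes as a base64 reference string (assumed format)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:image;base64," + encoded


# A URL reference.
demo_url = (
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
)
from_url = image_message(demo_url, "Describe this image.")

# A base64 reference built from placeholder bytes (not a real image).
from_bytes = image_message(to_base64_ref(b"placeholder-bytes"), "Describe this image.")

print(from_url[0]["content"][0]["image"])
print(from_bytes[0]["content"][0]["image"][:18])
```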

Below is a code snippet demonstrating how to use the chat model with Transformers and qwen_vl_utils:

import torch  # needed only if you enable the torch.bfloat16 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
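
The slicing step in the snippet above is needed because the generated sequences begin with the prompt tokens, so decoding them directly would echo the prompt. The toy example below reproduces the same slicing on plain Python lists; the token ids are made up and the helper name `trim_prompt_tokens` is illustrative.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence in a batch."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Toy token ids standing in for real tensors (values are made up).
prompt_batch = [[101, 7, 8], [101, 9, 10]]
output_batch = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]
print(trim_prompt_tokens(prompt_batch, output_batch))  # prints [[42, 43], [44, 45]]
```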

Video Interface:

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternative to the block above (use one or the other): messages containing a video file and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
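
The `fps` field in the video messages above can be read as a target sampling rate. As a rough sketch of what sampling at that rate involves, the function below picks evenly spaced frame indices from a clip. It is an illustrative assumption about the idea, not the toolkit's actual sampling code, and the name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(total_frames, native_fps, target_fps=1.0):
    """Pick evenly spaced frame indices approximating `target_fps`."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame count and frame rates must be positive")
    duration_seconds = total_frames / native_fps
    # Number of frames to keep, bounded by what the clip actually contains.
    wanted = max(1, round(duration_seconds * target_fps))
    wanted = min(wanted, total_frames)
    step = total_frames / wanted
    return [min(total_frames - 1, int(i * step)) for i in range(wanted)]


# A 10-second clip recorded at 30 fps, sampled at 1 frame per second.
print(sample_frame_indices(total_frames=300, native_fps=30.0, target_fps=1.0))
```

Lower sampling rates and a smaller `max_pixels` both reduce the number of visual tokens per video, at the cost of temporal or spatial detail.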

Limitations

While Qwen2-VL is versatile across various visual tasks, it has several known limitations:

Lack of Audio Support: The model cannot process audio information within videos.
Data Timeliness: The image dataset is current up to June 2023, and may not reflect information beyond that date.
Constraints in Individuals and IP Recognition: The model’s ability to identify specific individuals or intellectual property is limited and may not cover all well-known figures or brands comprehensively.
Limited Capacity for Complex Instructions: The model struggles with understanding and executing complex, multi-step instructions.
Insufficient Counting Accuracy: In complex scenes, the model’s object counting accuracy is low and requires further enhancement.
Weak Spatial Reasoning: The model has difficulty inferring object positions in 3D spaces, making it challenging to accurately assess spatial relationships.

These limitations highlight areas for ongoing optimization, and efforts are continuously made to improve the model’s performance and applicability.

Demo

You can try Qwen2-VL in action below: