Tongyi Qianwen's open-source visual understanding large model Qwen-VL received a major update on December 1, 2023. The upgrade not only greatly improves its basic capabilities in general OCR, visual reasoning, and Chinese text understanding, but also lets it process images of various resolutions and specifications, and even solve exercises from a photo.
The upgraded Qwen-VL models (qwen-vl-plus/qwen-vl-max) have several major features:
- Greatly enhanced processing of text in images, making the model a capable productivity assistant: extracting, organizing, and summarizing text information is effortless.
- A wider range of processable resolutions: images of any resolution and aspect ratio can be handled, and even large or very long pictures are read clearly.
- Enhanced visual reasoning and decision-making ability, well suited for building visual Agents and further expanding what large-model Agents can do.
- Upgraded ability to solve problems from images: take a photo of an exercise and send it to Qwen-VL, and the model can help the user work through it step by step.
Quick start
Prerequisites
- The service has been activated and an API-KEY obtained: open DashScope and create an API-KEY.
- The latest version of the SDK is installed: install the DashScope SDK.
- DashScope also provides an OpenAI-compatible interface. For details, see OpenAI interface compatibility; a minimal sketch of that access path follows this list.
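As a rough illustration of the OpenAI-compatible path, the sketch below points the standard openai Python SDK at DashScope's compatible-mode base URL. The base URL and the vision message layout follow OpenAI conventions; treat both as assumptions and confirm them in OpenAI interface compatibility.
import os
from openai import OpenAI

# Route the standard OpenAI client to DashScope's compatible-mode endpoint
# (URL assumed here; verify it in the compatibility documentation).
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-plus",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"}},
            {"type": "text", "text": "What is this?"},
        ],
    }],
)
print(completion.choices[0].message.content)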
Sample code
The following example shows code that calls the Tongyi Qianwen VL API to respond to a user instruction.
Description
You need to replace YOUR_DASHSCOPE_API_KEY in the example with your own API-KEY for the code to work properly.
Set up API-KEY
export DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
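If an environment variable is inconvenient, the key can also be set in code. A minimal sketch using the SDK's module-level api_key attribute (keep real keys out of source control):
import dashscope

# Assign the API key on the SDK module instead of reading it from the environment.
dashscope.api_key = "YOUR_DASHSCOPE_API_KEY"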
Python
from http import HTTPStatus
import dashscope


def simple_multimodal_conversation_call():
    """Simple single-round multimodal conversation call."""
    messages = [
        {
            "role": "user",
            "content": [
                {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                {"text": "这是什么?"}  # "What is this?"
            ]
        }
    ]
    response = dashscope.MultiModalConversation.call(model='qwen-vl-plus',
                                                     messages=messages)
    # A status_code of HTTPStatus.OK indicates success; otherwise the request
    # failed, and the error code and message are available in response.code
    # and response.message.
    if response.status_code == HTTPStatus.OK:
        print(response)
    else:
        print(response.code)     # The error code.
        print(response.message)  # The error message.


if __name__ == '__main__':
    simple_multimodal_conversation_call()
After the Python call succeeds, a sample result like the following is returned.
JSON
{
    "status_code": 200,
    "request_id": "cd828016-bcf5-94c7-82ed-5b45bf06886c",
    "code": "",
    "message": "",
    "output": {
        "text": null,
        "finish_reason": null,
        "choices": [
            {
                "finish_reason": null,
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "图中是一名女子在沙滩上和狗玩耍,旁边有海水,应该是在海边。"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "input_tokens": 1276,
        "output_tokens": 19
    }
}
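The assistant's reply translates to: "The picture shows a woman playing with a dog on a beach, with the sea beside them; it appears to be at the seaside."
For a multi-round conversation, earlier turns are appended to messages before the next call. The following is a minimal sketch of that pattern under the same setup; the exact response fields (response.output.choices[0].message.content) are an assumption here, so confirm them on the API details page.
from http import HTTPStatus
import dashscope

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
            {"text": "What is this?"}
        ]
    }
]
response = dashscope.MultiModalConversation.call(model='qwen-vl-plus',
                                                 messages=messages)
if response.status_code == HTTPStatus.OK:
    # Append the assistant's reply, then ask a follow-up question about the image.
    messages.append({"role": "assistant",
                     "content": response.output.choices[0].message.content})
    messages.append({"role": "user",
                     "content": [{"text": "What breed is the dog in the picture?"}]})
    response = dashscope.MultiModalConversation.call(model='qwen-vl-plus',
                                                     messages=messages)
    print(response)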
Learn more
Detailed documentation for calling the Tongyi Qianwen VL API is available on the API details page.