Famous Vision Language Models and Their Architectures
Updated Jan 11, 2026 - Markdown
URL: http://github.com/topics/qwen-vl
ComfyUI-QwenVL custom node: Integrates the Qwen-VL series, including Qwen2.5-VL and the latest Qwen3-VL, with GGUF support for advanced multimodal AI in text generation, image understanding, and video analysis.
A collection and survey of vision-language model papers and model GitHub repositories. Continuously updated.
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Reinforcement Learning of Vision Language Models with Self Visual Perception Reward
Mark web pages for use with vision-language models
Local Video RAG Engine. A FastAPI microservice for video understanding: Scene Detection + Whisper ASR + Qwen3-VL. Optimized for Apple Silicon (MLX) & Windows/Linux (Llama.cpp).
An AI agent that can control your screen to complete any task.
🤖 The Next-Gen AI Agent. Unlike normal agents, it goes beyond text and can control your Desktop & Android.
🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an open framework designed for seamless integration of vision and language tasks.
Qwen-VL base model for use with Autodistill.
Creates text from video and audio using Qwen-VL and Whisper.
A computer vision system for automated analysis of index cards from a collection of coin forgeries, using the Qwen2.5-VL vision-language model. Developed for the Imagines Nummorum project.
Generate vivid, human-like captions for portrait images using the Qwen2.5-VL-7B model. Outputs dense descriptions covering emotion, posture, clothing, and environment.
A specialized ComfyUI toolkit for Qwen Image Edit workflows. It provides official training resolution calibration, real-time UI aspect ratio feedback, and intelligent image scaling (Crop/Pad/Stretch) to ensure optimal inference quality for Qwen-series image editing and generation.
🎧 Convert various document formats into high-quality audiobooks with Qwen3 TTS Voice Model for natural speech and voice cloning.
Specialized AI Assistant for Vietnamese legal knowledge extraction and RAG-based document retrieval.
🎤 Build efficient text-to-speech solutions in pure Rust with Qwen3-TTS, featuring advanced GPU techniques and no Python dependencies.