[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o's performance.
Effortless data labeling with AI support from Segment Anything and other awesome models.
🚀 Train a 67M-parameter vision-language model (VLM) from scratch in just 1 hour! 🌏
The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud.
MineContext is your proactive context-aware AI partner (Context-Engineering + ChatGPT Pulse)
Align Anything: Training All-modality Model with Feedback
DeepSeek-VL: Towards Real-World Vision-Language Understanding
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model based on linear attention.
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Collection of AWESOME vision-language models for vision tasks
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX (a minimal inference sketch follows this list).
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol (a retrieval scoring sketch follows this list).
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to master any computer task through strong reasoning, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
A curated list of awesome LLM/VLM/VLA/World Model resources for autonomous driving (LLM4AD), continually updated.
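
For the MLX-VLM entry above, here is a minimal inference sketch. It assumes the package's `load` and `generate` helpers as described in its README; the model ID and image path are placeholders, and argument names have shifted between mlx-vlm releases, so treat this as illustrative rather than a fixed API.

```python
# Minimal MLX-VLM inference sketch (assumes a recent `pip install mlx-vlm`
# on Apple silicon; model ID and image path are placeholders).
from mlx_vlm import load, generate

# Load a quantized vision-language model converted for MLX.
model, processor = load("mlx-community/llava-1.5-7b-4bit")

# Caption a local image. Keyword names/order have varied across
# mlx-vlm versions, so check the README for your installed release.
output = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="example.jpg",
)
print(output)
```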
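
For the ColVision entry above, the sketch below illustrates late-interaction retrieval scoring with colpali-engine, loosely following its README; the checkpoint name and version-specific details (e.g. `score_multi_vector`) are assumptions about a recent release.

```python
# ColPali retrieval sketch (assumes `pip install colpali-engine`;
# checkpoint name assumed from the vidore org on Hugging Face).
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # assumed checkpoint
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Toy inputs: one blank page image and one text query.
images = [Image.new("RGB", (448, 448), color="white")]
queries = ["What is shown in the document?"]

batch_images = processor.process_images(images)
batch_queries = processor.process_queries(queries)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # multi-vector page embeddings
    query_embeddings = model(**batch_queries)  # multi-vector query embeddings

# MaxSim late-interaction score between each query and each page.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```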