Vision Agent

Vision Language Model for Structured Scene Understanding

VLM • Computer Vision • ONNX Runtime

Tech Stack

Qwen2-VL • Transformers • Python • Gradio • Docker • ONNX Runtime

VLM: Multimodal
5: Categories
Real-Time: Inference

Overview

A Qwen2-VL pipeline that generates structured scene assessments with severity classification. It reasons about context, not just objects.

Problem

Traditional object detectors (YOLO, Faster R-CNN) localize and classify objects but miss behavioral context. Real-world scene understanding requires reasoning about actions and environment, not just bounding boxes.

Solution

A Vision Language Model pipeline built on Qwen2-VL processes images and generates structured assessments across five categories, each with a severity classification and actionable recommendations.
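To make "structured assessment" concrete, here is a minimal sketch of how one finding might be represented and validated once the model returns JSON. The field names (`category`, `severity`, `observation`, `recommendation`), the three severity levels, and the `parse_findings` helper are all assumptions for illustration; the project's actual schema and category names are not stated here.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    # Hypothetical levels; the project's real severity scale is not specified.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Finding:
    category: str        # one of the pipeline's assessment categories
    severity: Severity
    observation: str     # what the model saw and why it matters
    recommendation: str  # the suggested action


def parse_findings(raw: str) -> List[Finding]:
    """Parse a JSON array of findings; reject anything off-schema."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of findings")
    return [
        Finding(
            category=str(item["category"]),
            # Severity(...) raises ValueError for labels outside the enum.
            severity=Severity(str(item["severity"]).lower()),
            observation=str(item["observation"]),
            recommendation=str(item["recommendation"]),
        )
        for item in items
    ]
```

Validating at this boundary means a malformed or hallucinated severity label fails loudly instead of flowing silently into a report.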

Architecture

Image Input → Preprocessing → Qwen2-VL Inference (ONNX Runtime) → Structured Reasoning → Severity Classification → Report
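The stage order above can be sketched as plain function composition. This is an illustrative outline, not the project's code: `preprocess`, `overall_severity`, `run_pipeline`, and the `infer` and `reason` callables are hypothetical names, the severity labels are assumed, and the model call is injected as a function so that no particular Qwen2-VL or ONNX Runtime API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed ordering of severity labels, lowest to highest.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Report:
    findings: List[Dict[str, str]]
    overall_severity: str


def preprocess(image_bytes: bytes) -> bytes:
    """Placeholder stage: a real one would decode, resize, and normalize."""
    if not image_bytes:
        raise ValueError("empty image")
    return image_bytes


def overall_severity(findings: List[Dict[str, str]]) -> str:
    """Take the worst severity among the findings; 'low' if there are none."""
    if not findings:
        return "low"
    return max((f["severity"] for f in findings), key=SEVERITY_RANK.__getitem__)


def run_pipeline(
    image_bytes: bytes,
    infer: Callable[[bytes], str],                   # stand-in for VLM inference
    reason: Callable[[str], List[Dict[str, str]]],   # raw text -> findings
) -> Report:
    pixels = preprocess(image_bytes)
    raw_text = infer(pixels)
    findings = reason(raw_text)
    return Report(findings=findings, overall_severity=overall_severity(findings))
```

Injecting `infer` and `reason` keeps each stage testable with stubs, for example `run_pipeline(b"img", lambda px: "", lambda text: [])`, before a real model backend is wired in.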
