Vision Agent
Vision Language Model for Structured Scene Understanding
Tech Stack
Qwen2-VL • Transformers • Python • Gradio • Docker • ONNX Runtime
Overview
A Qwen2-VL pipeline that generates structured scene assessments with severity classification. It reasons about context, not just objects.
Problem
Traditional object detectors (YOLO, Faster R-CNN) localize and classify objects but miss behavioral context. Real-world scene understanding requires reasoning about actions and the environment, not just bounding boxes.
Solution
A Vision Language Model pipeline built on Qwen2-VL processes images and generates structured assessments across multiple categories, each with a severity classification and actionable recommendations.
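To make "structured assessment" concrete, here is a minimal sketch of how a model's text output might be parsed into typed findings. It assumes the model is prompted to emit a JSON array of objects with category, severity, and recommendation keys. The three-level severity scale, the field names, and the names Severity, Finding, and parse_findings are hypothetical illustrations, not details taken from this project.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    # Hypothetical severity scale; the project does not list its levels.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Finding:
    # Hypothetical schema for one entry of a structured assessment.
    category: str
    severity: Severity
    recommendation: str


def parse_findings(model_output: str) -> list[Finding]:
    """Extract the JSON array from raw model text and validate each entry."""
    # Generated text often wraps the JSON in extra prose, so take the
    # outermost [...] span rather than parsing the whole string.
    start = model_output.find("[")
    end = model_output.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model output")

    findings = []
    for item in json.loads(model_output[start:end + 1]):
        findings.append(
            Finding(
                category=str(item["category"]),
                # Severity(...) raises ValueError on an unknown level.
                severity=Severity(str(item["severity"]).lower()),
                recommendation=str(item["recommendation"]),
            )
        )
    return findings
```

Validating against a fixed enum means a malformed or unexpected severity label fails loudly at parse time instead of flowing into the report.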
Architecture
Image Input → Preprocessing → Qwen2-VL Inference (ONNX Runtime) → Structured Reasoning → Severity Classification → Report
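The stages above can be sketched as plain function composition. This is an illustrative skeleton only: the inference and structured-reasoning stages are collapsed into a caller-supplied infer callable, so no Qwen2-VL or ONNX Runtime calls appear here. All names (SceneReport, run_pipeline, SEVERITY_ORDER, and the others) and the rule that overall severity is the highest severity among the findings are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical severity scale, ordered from least to most severe.
SEVERITY_ORDER = ["low", "medium", "high"]


@dataclass
class SceneReport:
    findings: list[dict]
    overall_severity: str


def preprocess(image_bytes: bytes) -> bytes:
    # Placeholder stage: a real pipeline would decode, resize, and
    # normalize the image here. This sketch only rejects empty input.
    if not image_bytes:
        raise ValueError("empty image")
    return image_bytes


def classify_severity(findings: list[dict]) -> str:
    # Assumed rule: the report is as severe as its worst finding.
    if not findings:
        return SEVERITY_ORDER[0]
    return max((f["severity"] for f in findings), key=SEVERITY_ORDER.index)


def run_pipeline(
    image_bytes: bytes,
    infer: Callable[[bytes], list[dict]],
) -> SceneReport:
    # `infer` stands in for model inference plus structured reasoning and
    # must return dicts with "category" and "severity" keys.
    prepared = preprocess(image_bytes)
    findings = infer(prepared)
    return SceneReport(findings, classify_severity(findings))


def format_report(report: SceneReport) -> str:
    lines = [f"Overall severity: {report.overall_severity}"]
    for f in report.findings:
        lines.append(f"- [{f['severity']}] {f['category']}")
    return "\n".join(lines)
```

Passing the model call in as a parameter keeps the pipeline testable with a stub and leaves the choice of inference backend outside the orchestration code.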