Computer Vision


AI News, Computer Vision, Machine Learning

Zero-Shot Object Detection: Replacing YOLO Retraining with Generative VLMs

Generative VLMs enable zero-shot detection, reducing the 150x latency gap between YOLOv8 and Phi-3.5 for semantic industrial inspection.

May 22, 2026

AI News, Computer Vision, Artificial Intelligence

World-R1: Enhancing Video Foundation Models with Flow-GRPO and 3D-Aware Rewards

Microsoft Research's World-R1 achieves a 7.91 dB PSNR gain in geometric consistency for video generation without architectural changes.

Apr 30, 2026

AI News, Computer Vision, Machine Learning

Best of WACV 2026: Advances in Zero-Shot Sampling and OOD Detection

Join Voxel51 on April 30 for the Best of WACV 2026 virtual event featuring four technical talks on subspace sampling and MLLM robustness.

Apr 28, 2026

AI News, Machine Learning, Computer Vision

Meta AI Sapiens2: Scaling Human-Centric Vision Models to 5B Parameters and 4K Resolution

Meta AI's Sapiens2 scales to 5B parameters and 1B images, achieving 82.3 mAP in pose estimation and 82.5 mIoU in segmentation across 1K and 4K resolutions.

Apr 27, 2026

AI News, Agentic AI, Computer Vision

Building VLA-Inspired Embodied Agents via Latent World Modeling and MPC

Learn to build a lightweight Vision-Language-Action agent using NumPy-rendered RGB observations and PyTorch to perform latent state prediction and real-time MPC planning.

Apr 27, 2026

AI News, Computer Vision, Artificial Intelligence

Vision Banana: Google DeepMind’s Instruction-Tuned Model Outperforms SAM 3 and Depth Anything V3

Vision Banana beats SAM 3 on segmentation and Depth Anything V3 on metric depth by treating vision tasks as image generation problems.

Apr 25, 2026

AI News, Edge AI, Computer Vision

Liquid AI LFM2.5-VL-450M: Sub-250ms Edge Inference and Bounding Box Prediction

Liquid AI releases LFM2.5-VL-450M, a 450M-parameter VLM achieving sub-250ms latency on NVIDIA Jetson Orin with new bounding box prediction.

Apr 11, 2026

AI News, Computer Vision, Physical AI

Mastering Markerless 3D Human Kinematics with Pose2Sim, RTMPose, and OpenSim

Implement a complete Pose2Sim pipeline to convert multi-camera video into biomechanical data using RTMPose for 2D estimation and OpenSim for 3D joint angles.

Apr 10, 2026

AI News, Computer Vision, Machine Learning

Meta AI's EUPE: A <100M Parameter Universal Vision Encoder Rivaling Specialists

Meta AI introduces EUPE, a compact vision encoder under 100M parameters that matches domain-expert models in classification and dense prediction, achieving 55.2ms latency on iPhone 15 Pro.

Apr 6, 2026

AI News, Computer Vision, Artificial Intelligence

Building a Netflix VOID Video Object Removal Pipeline with CogVideoX

Implement Netflix's VOID model for video object removal, which requires 40GB+ of VRAM and uses CogVideoX-Fun-V1.5-5b-InP.

Apr 5, 2026

AI News, Computer Vision, Open Source

Netflix AI Open-Sources VOID: Physics-Aware Video Object Removal

Netflix AI and INSAIT release VOID, a 5B parameter model that removes video objects and their physical interactions using a novel quadmask and physics-aware conditioning.

Apr 4, 2026

AI News, Computer Vision, Large Language Model

TII Releases Falcon Perception: A Unified 0.6B-Parameter Early-Fusion Transformer

TII’s Falcon Perception 0.6B model achieves a +21.9 point gain in spatial understanding over SAM 3 using a unified early-fusion architecture.

Apr 3, 2026

AI News, Computer Vision

Self-Hosting Vision Models on Datacenter GPUs

Self-hosting the BAGEL-7B-MoT vision model on a Tesla V100 GPU.

Feb 19, 2026

AI News, Computer Vision, ML & Data Engineering

Google Enhances Gemini 3 Flash with Agentic Vision

Google adds agentic vision to Gemini 3 Flash, improving accuracy by 5-10% on vision tasks and unlocking new AI-driven behaviors.

Feb 6, 2026

AI News, Computer Vision, Deep Learning

Training Text-to-Image Models: Key Takeaways from Ablations

Researchers achieve significant gains in text-to-image model training with representation alignment and better latents/tokenizers, improving quality and reducing training time.

Feb 3, 2026

AI News, Robotics, Computer Vision

Introducing NVIDIA Cosmos Policy for Advanced Robot Control

NVIDIA introduces Cosmos Policy, a robot control policy that achieves state-of-the-art performance on the LIBERO and RoboCasa benchmarks with a 98.5% average success rate.

Jan 29, 2026

AI News, Computer Vision, Artificial Intelligence

Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flow Prediction Framework

FOFPred, a new framework from Salesforce AI, achieves state-of-the-art results on robot manipulation benchmarks, reaching a 78.7% Task 5 success rate on CALVIN.

Jan 21, 2026

AI News, Computer Vision, Robotics

NVIDIA Cosmos Reason 2 Brings Advanced Reasoning To Physical AI

NVIDIA released Cosmos Reason 2, a vision language model achieving #1 open model status on the Physical AI Bench and Physical Reasoning leaderboards.

Jan 6, 2026

AI News, Multimodal AI, Computer Vision

Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval

Meta AI released PE-AV, a multimodal encoder achieving state-of-the-art performance on audio and video benchmarks with a 10.4 R@1 improvement on AudioCaps.

Dec 22, 2025

AI News, Language Model, Computer Vision

Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling

Zhipu AI launched GLM-4.6V, a 106B parameter multimodal model with a 128K token context window, enabling native multimodal function calling for improved agent capabilities.

Dec 9, 2025

AI News, Deep Learning, Computer Vision

My Model Cheated: How Grad-CAM Exposed a 95% Accuracy Lie

Grad-CAM analysis exposed a deep learning model for car damage classification as biased, despite its 95% accuracy.

Nov 30, 2025

AI News, Smart Cities, Computer Vision

Unlocking Gridlock: AI That Sees Problems Before They Happen

AI predicts traffic bottlenecks before they occur, using hybrid neural networks for real-time anomaly detection.

Nov 29, 2025

AI News, Computer Vision, PyTorch

Meta's SAM 3 Enhances Segmentation Accuracy and Speed for Vision Workflows

Meta's SAM 3 improves segmentation accuracy and reduces inference latency for real-world vision tasks.

Nov 26, 2025

AI News, Computer Vision, Natural Language Processing

Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM

Tencent’s HunyuanOCR, a 1B parameter vision language model, achieves state-of-the-art OCR performance on OmniDocBench with a score of 94.1.

Nov 26, 2025