Computer Vision

32 articles in this category (Page 1 of 2)

AI News, Computer Vision, Machine Learning

Zero-Shot Object Detection: Replacing YOLO Retraining with Generative VLMs

Generative VLMs enable zero-shot detection, reducing the 150x latency gap between YOLOv8 and Phi-3.5 for semantic industrial inspection.

AI News, Computer Vision, Artificial Intelligence

World-R1: Enhancing Video Foundation Models with Flow-GRPO and 3D-Aware Rewards

Microsoft Research's World-R1 achieves a 7.91 dB PSNR gain in geometric consistency for video generation without architectural changes.

AI News, Computer Vision, Machine Learning

Best of WACV 2026: Advances in Zero-Shot Sampling and OOD Detection

Join Voxel51 on April 30 for the Best of WACV 2026 virtual event featuring four technical talks on subspace sampling and MLLM robustness.

AI News, Machine Learning, Computer Vision

Meta AI Sapiens2: Scaling Human-Centric Vision Models to 5B Parameters and 4K Resolution

Meta AI's Sapiens2 scales to 5B parameters and 1B images, achieving 82.3 mAP in pose estimation and 82.5 mIoU in segmentation across 1K and 4K resolutions.

AI News, Agentic AI, Computer Vision

Building VLA-Inspired Embodied Agents via Latent World Modeling and MPC

Learn to build a lightweight Vision-Language-Action agent using NumPy-rendered RGB observations and PyTorch to perform latent state prediction and real-time MPC planning.

AI News, Computer Vision, Artificial Intelligence

Vision Banana: Google DeepMind’s Instruction-Tuned Model Outperforms SAM 3 and Depth Anything V3

Vision Banana beats SAM 3 on segmentation and Depth Anything V3 on metric depth by treating vision tasks as image generation problems.

AI News, Edge AI, Computer Vision

Liquid AI LFM2.5-VL-450M: Sub-250ms Edge Inference and Bounding Box Prediction

Liquid AI releases LFM2.5-VL-450M, a 450M-parameter VLM achieving sub-250ms latency on NVIDIA Jetson Orin with new bounding box prediction.

AI News, Computer Vision, Physical AI

Mastering Markerless 3D Human Kinematics with Pose2Sim, RTMPose, and OpenSim

Implement a complete Pose2Sim pipeline to convert multi-camera video into biomechanical data using RTMPose for 2D estimation and OpenSim for 3D joint angles.

AI News, Computer Vision, Machine Learning

Meta AI's EUPE: A <100M Parameter Universal Vision Encoder Rivaling Specialists

Meta AI introduces EUPE, a compact vision encoder under 100M parameters that matches domain-expert models in classification and dense prediction, achieving 55.2ms latency on iPhone 15 Pro.

AI News, Computer Vision, Artificial Intelligence

Building a Netflix VOID Video Object Removal Pipeline with CogVideoX

Implement Netflix's VOID model for advanced video object removal requiring 40GB+ VRAM and utilizing CogVideoX-Fun-V1.5-5b-InP.

AI News, Computer Vision, Open Source

Netflix AI Open-Sources VOID: Physics-Aware Video Object Removal

Netflix AI and INSAIT release VOID, a 5B parameter model that removes video objects and their physical interactions using a novel quadmask and physics-aware conditioning.

AI News, Computer Vision, Large Language Model

TII Releases Falcon Perception: A Unified 0.6B-Parameter Early-Fusion Transformer

TII’s Falcon Perception 0.6B model achieves a +21.9 point gain in spatial understanding over SAM 3 using a unified early-fusion architecture.

AI News, Computer Vision

Self-Hosting Vision Models on Datacenter GPUs

Self-hosting the BAGEL-7B-MoT vision model on a Tesla V100 GPU.

AI News, Computer Vision, ML & Data Engineering

Google Enhances Gemini 3 Flash with Agentic Vision

Google adds agentic vision to Gemini 3 Flash, improving accuracy by 5-10% on vision tasks and unlocking new AI-driven behaviors.

AI News, Computer Vision, Deep Learning

Training Text-to-Image Models: Key Takeaways from Ablations

Researchers achieve significant gains in text-to-image model training with representation alignment and better latents/tokenizers, improving quality and reducing training time.

AI News, Robotics, Computer Vision

Introducing NVIDIA Cosmos Policy for Advanced Robot Control

NVIDIA introduces Cosmos Policy, a robot control policy that achieves state-of-the-art performance on the LIBERO and RoboCasa benchmarks with a 98.5% average success rate.

AI News, Computer Vision, Artificial Intelligence

Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flow Prediction Framework

FOFPred, a new framework from Salesforce AI, achieves state-of-the-art results on robot manipulation benchmarks, reaching a 78.7% Task 5 success rate on CALVIN.

AI News, Computer Vision, Robotics

NVIDIA Cosmos Reason 2 Brings Advanced Reasoning To Physical AI

NVIDIA released Cosmos Reason 2, a vision language model achieving #1 open model status on the Physical AI Bench and Physical Reasoning leaderboards.

AI News, Multimodal AI, Computer Vision

Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval

Meta AI released PE-AV, a multimodal encoder achieving state-of-the-art performance on audio and video benchmarks with a 10.4 R@1 improvement on AudioCaps.

AI News, Language Model, Computer Vision

Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling

Zhipu AI launched GLM-4.6V, a 106B parameter multimodal model with a 128K token context window, enabling native multimodal function calling for improved agent capabilities.

AI News, Deep Learning, Computer Vision

My Model Cheated: How Grad-CAM Exposed a 95% Accuracy Lie

Grad-CAM analysis exposed a deep learning model for car damage classification, which reported 95% accuracy, as biased.

AI News, Smart Cities, Computer Vision

Unlocking Gridlock: AI That Sees Problems Before They Happen

AI predicts traffic bottlenecks before they occur, using hybrid neural networks for real-time anomaly detection.

AI News, Computer Vision, PyTorch

Meta's SAM 3 Enhances Segmentation Accuracy and Speed for Vision Workflows

Meta's SAM 3 improves segmentation accuracy and reduces inference latency for real-world vision tasks.

AI News, Computer Vision, Natural Language Processing

Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM

Tencent’s HunyuanOCR, a 1B parameter vision language model, achieves state-of-the-art OCR performance on OmniDocBench with a score of 94.1.
