CVPR2026 Paper Notes TODO

Total: 2198 papers | Completed: 2198 | Pending update: 0

\(\varphi\)-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models | arXiv: 2602.22601
2ndmatch finetuning pruned diffusion models via second-order jacobian matching | arXiv: 2506.05398
3d gaussian splatting with self-constrained priors for high fidelity surface rec | arXiv: 2603.19682
3d sans 3d scans scalable pre-training from video-generated point clouds | arXiv: 2512.23042
3d-fixer coarse-to-fine in-place completion for 3d scenes from a single image | arXiv: 2604.04406
3d-ide 3d implicit depth emergent | arXiv: 2604.03296
3drawagent teaching llm to draw in 3d with early contrastive experience | arXiv: 2604.08042
3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion | arXiv: 2511.19117
4c4d 4 camera 4d gaussian splatting | arXiv: 2604.04063
4dequine disentangling motion and appearance for 4d equine reconstruction from m
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video | arXiv: 2603.10125
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks | arXiv: 2603.12998
a closer look at cross-domain few-shot object detection fine-tuning matters and | arXiv: 2603.28182
a frame is worth one token efficient generative world modeling with delta tokens | arXiv: 2604.04913
a mixed diet makes dino an omnivorous vision encoder | arXiv: 2602.24181
A Mixed Diet Makes DINO An Omnivorous Vision Encoder | arXiv: 2602.24181
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning | arXiv: 2603.14052
a paradigm shift fully end-to-end training for temporal sentence grounding in vi | arXiv: 2604.02860
A Prediction-as-Perception Framework for 3D Object Detection | arXiv: 2603.12599
A protocol for evaluating robustness to H&E staining variation in computational pathology models | arXiv: 2603.12886
a semantically disentangled unified model for multi-category 3d anomaly detectio | arXiv: 2603.25159
A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement | arXiv: 2603.06167
a unified perspective on adversarial membership manipulation in vision models | arXiv: 2604.02780
A2P: From 2D Alignment to 3D Plausibility for Occlusion-Robust Two-Hand Reconstruction | arXiv: 2503.17788
a2z-10m geometric deep learning with a-to-z brep annotations for ai-assisted cad | arXiv: 2603.12605
a3 towards advertising aesthetic assessment | arXiv: 2603.24037
A4VL: A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning | arXiv: 2603.14052
ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection | arXiv: 2603.12409
Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective | arXiv: 2507.05914
Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning | arXiv: 2603.13007
ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation | arXiv: 2603.02945
acetone bridging words and colors for conditional image grading | arXiv: 2604.00530
ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery | arXiv: 2603.16616
Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning | arXiv: 2603.00667
action-guided generation of 3d functionality segmentation data | arXiv: 2511.23230
actionmesh animated 3d mesh generation with temporal 3d diffusion | arXiv: 2601.16148
Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation | arXiv: 2602.23814
activation matters test-time activated negative labels for ood detection with vi | arXiv: 2603.25250
Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning | arXiv: 2603.07559
activityforensics a comprehensive benchmark for localizing manipulated activity | arXiv: 2604.03819
actta rethinking test-time adaptation via dynamic activation | arXiv: 2603.26096
Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation | arXiv: 2603.11984
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks | arXiv: 2510.03101
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation | arXiv: 2603.19157
Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions | arXiv: 2603.12468
adapting a pre-trained single-cell foundation model to spatial gene expression g | arXiv: 2603.19766
adapting point cloud analysis via multimodal bayesian distribution learning | arXiv: 2603.22070
adaptive action chunking at inference-time for vision-language-action models | arXiv: 2604.04161
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation | arXiv: 2603.19158
adaptive confidence regularization for multimodal failure detection | arXiv: 2603.02200
adaptive learned image compression with graph neural networks | arXiv: 2603.25316
Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration | arXiv: 2603.01623
Adaptive Vision-Language Model Routing for Computer Use Agents | arXiv: 2603.12823
adaptvision efficient vision-language models via adaptive visual acquisition | arXiv: 2512.03794
adaradar rate adaptive spectral compression for radar-based perception | arXiv: 2603.17979
adasformer adaptive serialized transformers for monocular semantic scene complet | arXiv: 2603.25494
Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding | arXiv: 2603.12514
AdvMark: Decoupling Defense Strategies for Robust Image Watermarking | arXiv: 2602.20053
AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction | arXiv: 2602.22376
affordgrasp cross-modal diffusion for affordance-aware grasp synthesis | arXiv: 2603.08021
affordmatcher affordance learning in 3d scenes from visual signifiers | arXiv: 2603.27970
AFRO: Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning | arXiv: 2512.00074
Agentic Retoucher for Text-To-Image Generation | arXiv: 2601.02046
agft alignment-guided fine-tuning for zero-shot adversarial robustness of vision | arXiv: 2603.29410
AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution | arXiv: 2603.00589
All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark | arXiv: 2602.23523
All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference | arXiv: 2603.08498
All-in-One Slider for Attribute Manipulation in Diffusion Models | arXiv: 2508.19195
An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS | arXiv: 2603.10671
an instance-centric panoptic occupancy prediction benchmark for autonomous drivi | arXiv: 2603.27238
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening | arXiv: 2603.17651
anchorsplat feed-forward 3d gaussian splatting with 3d geometric priors | arXiv: 2604.07053
ani3dhuman photorealistic 3d human animation with self-guided stochastic samplin | arXiv: 2602.19089
anomalyvfm -- transforming vision foundation models into zero-shot anomaly detec | arXiv: 2601.20524
anthrotap learning point tracking with real-world motion | arXiv: 2507.06233
anti-i2v safeguarding your photos from malicious image-to-video generation | arXiv: 2603.24570
Anticipatory Planning for Multimodal AI Agents | arXiv: 2603.16777
anydoc enhancing document generation via large-scale htmlcss data synthesis and | arXiv: 2603.25118
AnyPcc: Compressing Any Point Cloud with a Single Universal Model | arXiv: 2510.20331
ApET: Approximation-Error Guided Token Compression for Efficient VLMs | arXiv: 2602.19870
apple attribute-preserving pseudo-labeling for diffusion-based face swapping | arXiv: 2601.15288
ar2can an architect and an artist leveraging a canvas for multi-human generation | arXiv: 2511.22690
ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation | arXiv: 2603.10188
Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study | arXiv: 2603.13044
arthoi taming foundation models for monocular 4d reconstruction of hand-articula | arXiv: 2603.25791
artllm generating articulated assets via 3d llm | arXiv: 2603.01142
AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos | arXiv: 2603.07758
as language models scale low-order linear depth dynamics emerge | arXiv: 2603.12541
As Language Models Scale, Low-order Linear Depth Dynamics Emerge | arXiv: 2603.12541
AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys | arXiv: 2603.11928
asking like socrates socrates helps vlms understand remote sensing images | arXiv: 2511.22396
AssistMimic: Physics-Grounded Humanoid Assistance via Multi-Agent RL | arXiv: 2603.11346
association and consolidation evolutionary memory-enhanced incremental multi-vie | arXiv: 2509.14544
Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts | arXiv: 2603.09531
AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots | arXiv: 2603.07648
attend before attention efficient and scalable video understanding via autoregre | arXiv: 2603.12254
attention may i have your decision localizing generative choices in diffusion mo | arXiv: 2604.06052
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution | arXiv: 2603.10583
Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors | arXiv: 2603.15656
autocut end-to-end advertisement video editing based on multimodal discretizatio | arXiv: 2603.28366
AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models | arXiv: 2508.00445
AutoGaze: Attend Before Attention — Efficient and Scalable Video Understanding via Autoregressive Gazing | arXiv: 2603.12254
Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI | arXiv: 2603.11818
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models | arXiv: 2506.09082
avatar reinforcement learning to see hear and reason over video | arXiv: 2508.03100
avatarpointillist autoregressive 4d gaussian avatarization | arXiv: 2604.04787
AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network | arXiv: 2603.12659
AVR: Adaptive VLM Routing for Computer Use Agents | arXiv: 2603.12823
babyvlm-v2 toward developmentally grounded pretraining and benchmarking of visio | arXiv: 2512.10932
back to point exploring point-language models for zero-shot 3d anomaly detection | arXiv: 2603.21511
balm a model-agnostic framework for balanced multimodal learning under imbalance | arXiv: 2603.19718
banana100 breaking nr-iqa metrics by 100 iterative image replications with nano | arXiv: 2604.03400
bases of steerable kernels for equivariant cnns from 2d rotations to the lorentz | arXiv: 2603.12459
Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group | arXiv: 2603.12459
bd-merging bias-aware dynamic model merging with evidence-guided contrastive lea | arXiv: 2603.03920
beautygrpo aesthetic alignment for face retouching via dynamic path guidance and | arXiv: 2603.01163
Benchmarking Endoscopic Surgical Image Restoration and Beyond | arXiv: 2505.19161
benchmarking phd-level coding in 3d geometric computer vision | arXiv: 2603.30038
benchmarking vision-language models under contradictory virtual content attacks | arXiv: 2604.05510
BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending | arXiv: 2603.13102
better than average spatially-aware aggregation of segmentation uncertainty impr | arXiv: 2603.29941
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images | arXiv: 2603.17159
Beyond Caption-Based Queries for Video Moment Retrieval | arXiv: 2603.02363
Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D | arXiv: 2603.05906
beyond global similarity towards fine-grained multi-condition multimodal retriev | arXiv: 2603.01082
beyond ground-truth leveraging image quality priors for real-world image restora | arXiv: 2603.29773
beyond heuristic prompting a concept-guided bayesian framework for zero-shot ima | arXiv: 2603.07911
beyond loss values robust dynamic pruning via loss trajectory alignment | arXiv: 2604.07306
Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control | arXiv: 2512.21058
Beyond Prompt Degradation: Prototype-Guided Dual-Pool Prompting for Incremental Object Detection | arXiv: 2603.02286
beyond recognition evaluating visual perspective taking in vision language model | arXiv: 2505.03821
beyond semantic search towards referential anchoring in composed image retrieval | arXiv: 2604.05393
beyond semantics disentangling information scope in sparse autoencoders for clip | arXiv: 2604.05724
beyond single-sample reliable multi-sample distillation for video understanding | arXiv: 2603.11423
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding | arXiv: 2603.11423
beyond static artifacts a forensic benchmark for video deepfake reasoning in vis | arXiv: 2602.21779
beyond the fold quantifying split-level noise and the case for leave-one-dataset | arXiv: 2604.02162
beyond the golden data resolving the motion-vision quality dilemma via timestep | arXiv: 2603.25527
beyond the ground truth enhanced supervision for image restoration | arXiv: 2512.03932
beyond the mean modelling annotation distributions in continuous affect predicti | arXiv: 2604.07198
bhcast unlocking black hole plasma dynamics from a single blurry image with long | arXiv: 2603.26777
BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation | arXiv: 2603.00156
bidirectional multimodal prompt learning with scale-aware training for few-shot | arXiv: 2408.13516
bigain unified token compression for joint generation and classification | arXiv: 2603.12240
BiGain: Unified Token Compression for Joint Generation and Classification | arXiv: 2603.12240
Bilevel Layer-Positioning LoRA for Real Image Dehazing | arXiv: 2603.10872
bimotion b-spline motion for text-guided dynamic 3d character generation | arXiv: 2602.18873
BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers | arXiv: 2603.09582
biovita biological dataset model and benchmark for visual-textual-acoustic align | arXiv: 2603.23883
bipremanip learning affordance-based bimanual preparatory manipulation through a | arXiv: 2603.21679
blackmirror black-box backdoor detection for text-to-image models via instructio | arXiv: 2603.05921
blazefl fast and deterministic federated learning simulation | arXiv: 2604.03606
blink dynamic visual token resolution for enhanced multimodal understanding | arXiv: 2512.10548
BluRef: Unsupervised Image Deblurring with Dense-Matching References | arXiv: 2603.14176
Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting | arXiv: 2603.16129
boosting vision-language-action finetuning with feasible action neighborhood pri | arXiv: 2604.01570
BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning | arXiv: 2603.13109
Bounds on Agreement between Subjective and Objective Measurements | arXiv: 2603.13204
bounds on agreement between subjective and objective measurements | arXiv: 2603.13204
Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors | arXiv: 2603.13092
BRepGaussian: CAD Reconstruction from Multi-View Images with Gaussian Splatting | arXiv: 2602.21105
Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation | arXiv: 2602.19863
bridge multimodal-to-text retrieval via reinforcement-learned query alignment | arXiv: 2604.07201
bridging pixels and words mask-aware local semantic fusion for multimodal media | arXiv: 2603.26052
bridging the perception gap in image super-resolution evaluation | arXiv: 2503.13074
Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD | arXiv: 2603.10933
brima bridged modality adaptation for multi-modal continual action quality asses | arXiv: 2602.19170
BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy | arXiv: 2603.14361
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds | arXiv: 2602.23645
bulletgen improving 4d reconstruction with bullet-time generation | arXiv: 2506.18601
bussard normalizing flows for bijective universal scene-specific anomalous relat | arXiv: 2603.16645
ca-lora concept-aware lora for domain-aligned segmentation dataset generation | arXiv: 2503.22172
can natural image autoencoders compactly tokenize fmri volumes for long-range dy | arXiv: 2604.03619
can vision-language models count a synthetic benchmark and analysis of attention | arXiv: 2511.17722
capt confusion-aware prompt tuning for reducing vision-language misalignment | arXiv: 2603.02557
care a molecular-guided foundation model with adaptive region modeling for whole | arXiv: 2602.21637
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing | arXiv: 2603.08589
CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion | arXiv: 2602.19140
carepilot a multi-agent framework for long-horizon computer task automation in h | arXiv: 2603.24157
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation | arXiv: 2603.12766
Causal Motion Diffusion Models for Autoregressive Motion Generation | arXiv: 2602.22594
causalvad de-confounding end-to-end autonomous driving via causal intervention | arXiv: 2603.18561
cc-vqa conflict- and correlation-aware method for mitigating knowledge conflict | arXiv: 2602.23952
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning | arXiv: 2602.21655
ccf complementary collaborative fusion for domain generalized multi-modal 3d obj | arXiv: 2603.23276
cd-buffer complementary dual-buffer framework for test-time adaptation in advers | arXiv: 2603.26092
CDA-VSR: Compressed-Domain-Aware Online Video Super-Resolution | arXiv: 2603.07694
CDG: Guiding Diffusion Models with Semantically Degraded Conditions | arXiv: 2603.10780
Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images | arXiv: 2603.18461
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance | arXiv: 2603.03281
cghair compact gaussian hair reconstruction with card clustering | arXiv: 2604.03716
Chain of Event-Centric Causal Thought for Physically Plausible Video Generation | arXiv: 2603.09094
Chain of World: World Model Thinking in Latent Motion (CoWVLA) | arXiv: 2603.03195
changebridge spatiotemporal image generation with multimodal controls for remote | arXiv: 2507.04678
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion | arXiv: 2511.12370
chartnet a million-scale high-quality multimodal dataset for robust chart unders | arXiv: 2603.27064
cheem continual learning by reuse new adapt and skip -- a hierarchical explorati | arXiv: 2303.08250
chips efficient clip adaptation via curvature-aware hybrid influence-based data | arXiv: 2511.18519
CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection | arXiv: 2511.18519
chordedit one-step low-energy transport for image editing | arXiv: 2602.19083
CI-ICE: Intrinsic Concept Extraction Based on Compositional Interpretability | arXiv: 2603.11795
cigpose causal intervention graph neural network for whole-body pose estimation | arXiv: 2603.09418
cinematic audio source separation using visual cues | arXiv: 2603.26113
CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization | arXiv: 2603.16966
CIPHER: Countering Hallucinations with Counterfactuals (Diffusion-Guided LVLM Hallucination Suppression) | arXiv: 2603.10470
circuit mechanisms for spatial relation generation in diffusion transformers | arXiv: 2601.06338
circuit tracing in vision-language models understanding the internal mechanisms | arXiv: 2602.20330
CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning | arXiv: 2602.19605
cleaning the pool progressive filtering of unlabeled pools in deep active learni | arXiv: 2511.22344
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data | arXiv: 2512.02686
clip is shortsighted paying attention beyond the first sentence | arXiv: 2602.22419
CLIP Is Shortsighted: Paying Attention Beyond the First Sentence | arXiv: 2602.22419
CLIP-Free, Label-Free, Unsupervised Concept Bottleneck Models | arXiv: 2503.10981
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation | arXiv: 2602.20409
CLoE: Expert Consistency Learning for Missing Modality Segmentation | arXiv: 2603.09316
cluster-wise spatio-temporal masking for efficient video-language pretraining | arXiv: 2603.22953
cmhanet a cross-modal hybrid attention network for point cloud registration | arXiv: 2603.12721
CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration | arXiv: 2603.12721
CoD: A Diffusion Foundation Model for Image Compression | arXiv: 2511.18706
CodeBrain: Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code | arXiv: 2501.18328
coded-e2lf coded aperture light field imaging from events | arXiv: 2602.22620
codedance a dynamic tool-integrated mllm for executable visual reasoning | arXiv: 2512.17312
codepercept code-grounded visual stem perception for mllms | arXiv: 2603.10757
CodePercept: Code-Grounded Visual STEM Perception for MLLMs | arXiv: 2603.10757
coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation | arXiv: 2603.12829
cog confidence-aware optimal geometric correspondence for unsupervised single-re | arXiv: 2603.00493
CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment | arXiv: 2603.12722
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass | arXiv: 2603.12789
coin3d revisiting configuration-invariant multi-camera 3d object detection | arXiv: 2603.05042
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving | arXiv: 2512.22939
CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion | arXiv: 2603.00682
CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation | arXiv: 2602.22150
color when it counts grayscale-guided online triggering for always-on streaming | arXiv: 2603.22466
como learning continuous latent motion from internet videos for scalable robot l | arXiv: 2505.17006
compagent an agentic framework for visual compliance verification | arXiv: 2511.00171
Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging | arXiv: 2603.04796
Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper | arXiv: 2603.04796
Competition-Aware CPC Forecasting with Near-Market Coverage | arXiv: 2603.13059
CompoSIA: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation | arXiv: 2603.12864
Composing Concepts from Images and Videos via Concept-prompt Binding | arXiv: 2512.09824
Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation | arXiv: 2603.12864
Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression | arXiv: 2603.13795
concept-guided fine-tuning steering vits away from spurious correlations to impr | arXiv: 2603.08309
conceptprism concept disentanglement in personalized diffusion models via residu | arXiv: 2602.19575
conditional factuality controlled llms with generalization certificates via conf | arXiv: 2603.27403
consistcompose unified multimodal layout control for image composition | arXiv: 2511.18333
ConsistCompose: Unified Multimodal Layout Control for Image Composition | arXiv: 2511.18333
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation | arXiv: 2603.09506
continual learning with vision-language models via semantic-geometry preservatio | arXiv: 2603.12055
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation | arXiv: 2603.12055
Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis | arXiv: 2603.01398
COT-FM: Cluster-wise Optimal Transport Flow Matching | arXiv: 2603.13395
covft context-aware visual fine-tuning for multimodal large language models | arXiv: 2603.21077
covr-r reason-aware composed video retrieval | arXiv: 2603.20190
craterbench-r instance-level crater retrieval for planetary scale | arXiv: 2604.06245
crft consistent-recurrent feature flow transformer for cross-modal image registr | arXiv: 2604.05689
crit graph-based automatic data synthesis to enhance cross-modal multi-hop reaso | arXiv: 2604.01634
critical patch-aware sparse prompting with decoupled training for continual lear | arXiv: 2604.07399
cross-domain demo-to-code via neurosymbolic counterfactual reasoning | arXiv: 2603.18495
cross-instance gaussian splatting registration via geometry-aware feature-guided | arXiv: 2603.21936
cross-modal emotion transfer for emotion editing in talking face video | arXiv: 2604.07786
cross-modal fuzzy alignment network for text-aerial person retrieval and a large | arXiv: 2603.20721
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning | arXiv: 2603.01696
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark | arXiv: 2603.00543
cross-slice knowledge transfer via masked multi-modal heterogeneous graph contra | arXiv: 2603.22821
crossearth-sar a sar-centric and billion-scale geospatial foundation model for d | arXiv: 2603.12008
CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation | arXiv: 2603.12008
crosshoi-bench a unified benchmark for hoi evaluation across vision-language mod | arXiv: 2508.18753
CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image | arXiv: 2603.17779
cryohype reconstructing a thousand cryo-em structures with transformer-based hyp | arXiv: 2512.06332
cryosense compressive sensing enables high-throughput microscopy with sparse and | arXiv: 2511.12931
ctcal rethinking text-to-image diffusion models via cross-timestep self-calibrat | arXiv: 2603.20741
ctfs collaborative teacher framework for forward-looking sonar image semantic se | arXiv: 2603.21071
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video | arXiv: 2603.04291
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens | arXiv: 2603.19232
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation | arXiv: 2601.15408
customized visual storytelling with unified multimodal llms | arXiv: 2603.27690
CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization | arXiv: 2603.19121
Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events | arXiv: 2603.06213
cva context-aware video-text alignment for video temporal grounding | arXiv: 2603.24934
Cycle-Consistent Tuning for Layered Image Decomposition | arXiv: 2602.20989
cyclebev regularizing view transformation networks via view cycle consistency fo | arXiv: 2602.23575
D2C: Accelerating Diffusion Model Training under Minimal Budgets via Condensation | arXiv: 2507.05914
D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping | arXiv: 2507.08492
da-mamba learning domain-aware state space model for global-local alignment in d | arXiv: 2603.18757
da-vae plug-in latent compression for diffusion via detail alignment | arXiv: 2603.22125
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation | arXiv: 2603.03744
dark3r learning structure from motion in the dark | arXiv: 2603.05330
data warmup complexity-aware curricula for efficient diffusion training | arXiv: 2604.07397
DAWN: Pixel Motion Diffusion is What We Need for Robot Control | arXiv: 2509.22652
DC-Merge: Improving Model Merging with Directional Consistency | arXiv: 2603.06242
DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles | arXiv: 2603.01111
Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation | arXiv: 2603.12547
decompose and transfer cot-prompting enhanced alignment for open-vocabulary temp | arXiv: 2603.24030
deconstructing the failure of ideal noise correction a three-pillar diagnosis | arXiv: 2603.12997
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis | arXiv: 2603.12997
Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation | arXiv: 2603.00574
Decoupling Vision and Language: Codebook Anchored Visual Adaptation | arXiv: 2602.19449
decovln decoupling observation reasoning and correction for vision-and-language | arXiv: 2603.13133
dedelayed deleting remote inference delay via on-device correction | arXiv: 2510.13714
Deep Learning Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging | arXiv: 2603.12715
deep learning-based assessment of the relation between the third molar and mandi | arXiv: 2603.11850
Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning | arXiv: 2603.11850
Deep Learning–Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging | arXiv: 2603.12715
Defending Unauthorized Model Merging via Dual-Stage Weight Protection | arXiv: 2511.11851
deformation-based in-context learning for point cloud understanding | arXiv: 2604.02845
demographic fairness in multimodal llms a benchmark of gender and ethnicity bias | arXiv: 2603.25613
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache | arXiv: 2602.22654
designing to forget deep semi-parametric models for unlearning | arXiv: 2603.22870
Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification | arXiv: 2602.18842
detecting unknown objects via energy-based separation for open world object dete | arXiv: 2603.29954
developing foundation models for universal segmentation from 3d whole-body posit | arXiv: 2603.11627
Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography | arXiv: 2603.11627
Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models | arXiv: 2603.06049
DIAE: Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception | arXiv: 2603.11556
diagnose correct and learn from manipulation failures via visual symbols | arXiv: 2512.02787
diagnosing and repairing unsafe channels in vision-language models via causal di | arXiv: 2603.27240
diff4splat controllable 4d scene generation with latent dynamic reconstruction m | arXiv: 2511.00503
diffbmp differentiable rendering with bitmap primitives | arXiv: 2602.22625
diffusion mental averages | arXiv: 2603.29239
Diffusion Probe: Generated Image Result Prediction Using CNN Probes | arXiv: 2602.23783
Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification | arXiv: 2603.13182
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization | arXiv: 2603.14267
dino-qpm adapting visual foundation models for globally interpretable image clas | arXiv: 2604.07166
dip taming diffusion models in pixel space | arXiv: 2511.18822
direct segmentation without logits optimization for training-free open-vocabular | arXiv: 2604.07723
directfisheye-gs enabling native fisheye input in gaussian splatting with cross- | arXiv: 2604.00648
DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification | arXiv: 2603.12905
disca accelerating video diffusion transformers with distillation-compatible lea | arXiv: 2602.05449
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching | arXiv: 2602.05449
disentangle-then-align non-iterative hybrid multimodal image registration via cr | arXiv: 2603.19623
Disentangled Textual Priors for Diffusion-based Image Super-Resolution | arXiv: 2603.07430
disentangling to re-couple resolving the similarity-controllability paradox in s | arXiv: 2604.00849
Distilling Balanced Knowledge from a Biased Teacher | arXiv: 2506.18496
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression | arXiv: 2603.13162
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers | arXiv: 2603.04239
Diversity over Uniformity: Rethinking Representation in Generated Image Detection | arXiv: 2603.00717
divide then ground adapting frame selection to query types for long-form video u | arXiv: 2512.04000
dlwm dual latent world models enable holistic gaussian-centric pre-training in a | arXiv: 2604.00969
DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis | arXiv: 2602.23022
dmin scalable training data influence estimation for diffusion models | arXiv: 2412.08637
do vision-language models leak what they learn adaptive token-weighted model inv | arXiv: 2508.04097
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks | arXiv: 2508.04097
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering | arXiv: 2603.12533
Does YOLO Really Need to See Every Training Image in Every Epoch? | arXiv: 2603.17684
Domain-Skewed Federated Learning with Feature Decoupling and Calibration | arXiv: 2603.14238
downscaling intelligence exploring perception and reasoning bottlenecks in small | arXiv: 2511.17487
DPAD: Discriminative Perception via Anchored Description for Reasoning Segmentation | arXiv: 2603.04002
DPCache: Denoising as Path Planning for Training-Free Diffusion Model Acceleration | arXiv: 2602.22654
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras | arXiv: 2603.01007
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving | arXiv: 2603.01007
Draft and Refine with Visual Experts | arXiv: 2511.11005
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning | arXiv: 2603.12257
drift-resilient temporal priors for visual tracking | arXiv: 2604.02654
drive my way preference alignment of vision-language-action model for personaliz | arXiv: 2603.25740
DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance | arXiv: 2512.14266
DROID-W: DROID-SLAM in the Wild | arXiv: 2603.19076
dropping anchor and spherical harmonics for sparse-view gaussian splatting | arXiv: 2602.20933
dsca dynamic subspace concept alignment for lifelong vlm editing | arXiv: 2604.07965
DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime | arXiv: 2603.10538
DSS: Discover, Segment, and Select - A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation | arXiv: 2602.19944
DSS: Discover, Segment, and Select for Zero-shot Camouflaged Object Segmentation | arXiv: 2602.19944
DTR: Dynamic Token Reweighting for Robust Vision-Language Models | arXiv: 2505.17132
dual band thermal videography separating time-varying reflection and emission ne | arXiv: 2509.11334
dual-agent reinforcement learning for adaptive and cost-aware visual-inertial od | arXiv: 2511.21083
dual-imbalance continual learning for real-world food recognition | arXiv: 2603.29133
dualreg dual-space filtering and reinforcement for rigid registration | arXiv: 2508.17034
DUET-VLM: Dual Stage Unified Efficient Token Reduction for VLM Training and Inference | arXiv: 2602.18846
duo-vsr dual-stream distillation for one-step video super-resolution | arXiv: 2603.22271
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction | arXiv: 2603.03265
dynamic black-hole emission tomography with physics-informed neural fields | arXiv: 2602.08029
Dynamic Momentum Recalibration in Online Gradient Learning | arXiv: 2603.06120
Dynamic Token Reweighting for Robust Vision-Language Models | arXiv: 2505.17132
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs | arXiv: 2602.21864
dynavid learning to generate highly dynamic videos using synthetic motion data | arXiv: 2604.01666
e-3dpsm a state machine for event-based egocentric 3d human pose estimation | arXiv: 2604.08543
E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought | arXiv: 2602.21698
e-rayzer self-supervised 3d reconstruction as spatial visual pre-training | arXiv: 2512.10950
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction | arXiv: 2603.14684
eaglenet energy-aware fine-grained relationship learning network for text-video | arXiv: 2603.25267
eaglevision a dual-stage framework with bev-grounding-based chain-of-thought for | arXiv: 2512.15160
Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow | arXiv: 2602.21499
EB-JDAT: Energy-based Joint Distribution Adversarial Training | arXiv: 2505.19459
echoagent towards reliable echocardiography interpretation with eyeshands and mi | arXiv: 2604.05541
echoes of ownership adversarial-guided dual injection for copyright protection i | arXiv: 2602.18845
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models | arXiv: 2602.20981
echotrail-gui building actionable memory for gui agents via critic-guided self-e | arXiv: 2512.19396
ECKConv: Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant Point Cloud Analysis | arXiv: 2603.17538
edgedit hardware-aware diffusion transformers for efficient on-device image gene | arXiv: 2603.28405
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing | arXiv: 2603.17583
Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking | arXiv: 2603.12949
editing physiological signals in videos using latent representations | arXiv: 2509.25348
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing | arXiv: 2603.19224
Efficient Document Parsing via Parallel Token Prediction | arXiv: 2603.15206
efficient equivariant transformer for self-driving agent modeling | arXiv: 2604.01466
efficient hybrid se3-equivariant visuomotor flow policy via spherical harmonics | arXiv: 2603.23227
Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance | arXiv: 2603.07570
Ego-1K: A Large-Scale Multiview Video Dataset for Egocentric Vision | arXiv: 2603.13741
ego2web a web agent benchmark grounded in egocentric videos | arXiv: 2603.22529
egoflow gradient-guided flow matching for egocentric 6dof object motion generati | arXiv: 2604.01421
egomind activating spatial cognition through linguistic reasoning in mllms | arXiv: 2604.03318
EgoPointVQA: Gesture-Based Egocentric Video Question Answering | arXiv: 2603.12533
egoposeformer v2 accurate egocentric human motion estimation for arvr | arXiv: 2603.04090
egoxtreme a dataset for robust object pose estimation in egocentric views under | arXiv: 2603.25135
EI: Early Intervention for Multimodal Imaging based Disease Recognition | arXiv: 2603.17514
Elastic Weight Consolidation Done Right for Continual Learning | arXiv: 2603.18596
ELogitNorm: Enhancing OOD Detection with Extended Logit Normalization | arXiv: 2504.11434
Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models | arXiv: 2507.18534
elucidating the design space of arbitrary-noise-based diffusion models | arXiv: 2507.18534
elvis enhance low-light for video instance segmentation in the dark | arXiv: 2512.01495
EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease | arXiv: 2602.19178
embodiedsplat online feed-forward semantic 3dgs for open-vocabulary 3d scene und | arXiv: 2603.04254
EMDUL: Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets | arXiv: 2603.14507
EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy | arXiv: 2512.06684
emma concept erasure benchmark with comprehensive semantic metrics and diverse c | arXiv: 2512.17320
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models | arXiv: 2602.23802
emotag emotion-aware talking head synthesis on gaussian splatting with few-shot | arXiv: 2603.21332
EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis | arXiv: 2511.12554
Empowering Semantic-Sensitive Underwater Image Enhancement with VLM | arXiv: 2603.12773
enc-bench a benchmark for evaluating multimodal large language models in electro | arXiv: 2603.22763
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking | arXiv: 2501.14894
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration | arXiv: 2501.14894
Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator | arXiv: 2603.14726
Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception | arXiv: 2603.11556
Enhancing Out-of-Distribution Detection with Extended Logit Normalization | arXiv: 2504.11434
Enhancing Spatial Understanding in Image Generation via Reward Modeling | arXiv: 2602.24233
EquivAnIA: A Spectral Method for Rotation-Equivariant Anisotropic Image Analysis | arXiv: 2603.11294
erasure or erosion evaluating compositional degradation in unlearned text-to-ima | arXiv: 2604.04575
EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection | arXiv: 2603.11521
Evaluating Few-Shot Pill Recognition Under Visual Domain Shift | arXiv: 2603.10833
evatok adaptive length video tokenization for efficient visual autoregressive ge | arXiv: 2603.12267
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation | arXiv: 2603.12267
eventhub data factory for generalizable event-based stereo networks without acti | arXiv: 2604.02331
every error has its magnitude asymmetric mistake severity training for multiclas | arXiv: 2603.13682
EVLF: Early Vision-Language Fusion for Generative Dataset Distillation | arXiv: 2603.07476
evolmm self-evolving large multimodal models with continuous rewards | arXiv: 2511.16672
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards | arXiv: 2511.16672
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition | arXiv: 2603.03827
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory | arXiv: 2603.15800
Evolving Prompt Adaptation for Vision-Language Models | arXiv: 2603.09493
EvoPrompt: Evolving Prompt Adaptation for Vision-Language Models | arXiv: 2603.09493
ew-detr evolving world object detection via incremental low-rank detection trans | arXiv: 2602.20985
EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer | arXiv: 2602.20985
Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation | arXiv: 2603.12577
explaining clip zero-shot predictions through concepts | arXiv: 2603.28211
explore with long-term memory a benchmark and multimodal llm-based reinforcement | arXiv: 2601.10744
exploring conditions for diffusion models in robotic control | arXiv: 2510.15510
Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction | arXiv: 2603.00611
ExpPortrait: Expressive Portrait Generation via Personalized Representation | arXiv: 2602.19900
expressedit fast editing of stylized facial expressions with diffusion models in | arXiv: 2604.03448
extend3d town-scale 3d generation | arXiv: 2603.29387
extending zach-vit to robust medical imaging corruption and adversarial stress t | arXiv: 2604.06099
extrinsplat decoupling geometry and semantics for open-vocabulary understanding | arXiv: 2509.22225
f3dgs federated 3d gaussian splatting for decentralized multi-agent world modeli | arXiv: 2604.01605
faar efficient frequency-aware multi-task fine-tuning via automatic rank selecti | arXiv: 2603.20403
face time traveller travel through ages without losing identity | arXiv: 2602.22819
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration | arXiv: 2603.16570
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning | arXiv: 2603.05506
FaceCoT: Chain-of-Thought Reasoning in MLLMs for Face Anti-Spoofing | arXiv: 2506.01783
fact-gs frequency-aligned complexity-aware texture reparameterization for 2d gau | arXiv: 2511.23292
failure modes for deep learning-based online mapping how to measure and address | arXiv: 2603.19852
Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning | arXiv: 2603.12988
FAIR-Pruner: Leveraging Tolerance of Difference for Flexible Automatic Layer-Wise Neural Network Pruning | arXiv: 2508.02291
fairllava fairness-aware parameter-efficient fine-tuning for large vision-langua | arXiv: 2603.26008
falcon false-negative aware learning of contrastive negatives in vision-language | arXiv: 2505.11192
fast scenescript fast and accurate language-based 3d scene understanding via mul | arXiv: 2512.05597
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning | arXiv: 2601.09708
fast3dcache training-free 3d geometry synthesis acceleration | arXiv: 2511.22533
FastGS: Training 3D Gaussian Splatting in 100 Seconds | arXiv: 2511.04283
FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters | arXiv: 2603.01685
FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking | arXiv: 2603.12758
fcl-cod weakly supervised camouflaged object detection with frequency-aware and | arXiv: 2603.22969
fdeid-toolbox face de-identification toolbox | arXiv: 2603.13121
FDeID-Toolbox: Face De-Identification Toolbox | arXiv: 2603.13121
feature attribution stability suite how stable are post-hoc attributions | arXiv: 2604.02532
fecalfed privacy-preserving poultry disease detection via federated learning | arXiv: 2604.00559
Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift | arXiv: 2603.01040
FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation | arXiv: 2603.04890
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts | arXiv: 2603.12912
feddap domain-aware prototype learning for federated learning under domain shift | arXiv: 2604.06795
Federated Active Learning Under Extreme Non-IID and Global Class Imbalance | arXiv: 2603.10341
Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation | arXiv: 2603.04887
fedre a representation entanglement framework for model-heterogeneous federated | arXiv: 2511.22265
fedvg gradient-guided aggregation for enhanced federated learning | arXiv: 2602.21399
Few-shot Acoustic Synthesis with Multimodal Flow Matching | arXiv: 2603.19176
few-shot incremental 3d object detection in dynamic indoor environments | arXiv: 2604.07997
fg-portrait 3d flow guided editable portrait animation | arXiv: 2603.23381
FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution | arXiv: 2603.02692
Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression | arXiv: 2603.10470
Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks | arXiv: 2603.03907
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients | arXiv: 2603.17809
FINER: MLLMs Hallucinate under Fine-grained Negative Queries | arXiv: 2603.17662
first frame is the place to go for video content customization | arXiv: 2511.15700
Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation | arXiv: 2602.24144
Flash-Unified: Training-Free and Task-Aware Acceleration for Native Unified Models | arXiv: 2603.15271
FlashCache: Frequency-Domain-Guided Outlier-KV-Aware Multimodal KV Cache Compression | arXiv: 2511.16786
flashcap millisecond-accurate human motion capture via flashing leds and event-b | arXiv: 2603.19770
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance | arXiv: 2603.12146
flexavatar learning complete 3d head avatars with partial supervision | arXiv: 2512.15599
FlexHook: Rethinking Two-Stage Referring-by-Tracking in RMOT | arXiv: 2503.07516
flow3r factored flow prediction for scalable visual geometry learning | arXiv: 2602.20157
flowmotion training-free flow guidance for video motion transfer | arXiv: 2603.06289
fluidgaussian propagating simulation-based uncertainty toward functionally-intel | arXiv: 2603.21356
FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy | arXiv: 2602.23791
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | arXiv: 2603.02096
focus dont prune identifying instruction-relevant regions for information-rich i | arXiv: 2603.22815
focus-to-perceive representation learning a cognition-inspired hierarchical fram | arXiv: 2603.25778
Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning | arXiv: 2603.11460
fontcrafter high-fidelity element-driven artistic font creation with visual in-c | arXiv: 2603.22054
FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction | arXiv: 2509.21029
ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation | arXiv: 2603.15169
Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning | arXiv: 2603.12887
ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph | arXiv: 2603.09266
FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration | arXiv: 2603.01284
foundation model priors enhance object focus in feature space for source-free ob | arXiv: 2512.17514
foundry distilling 3d foundation models for the edge | arXiv: 2511.20721
Fourier Angle Alignment for Oriented Object Detection in Remote Sensing | arXiv: 2602.23790
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting | arXiv: 2602.24084
fozo forward-only zeroth-order prompt optimization for test-time adaptation | arXiv: 2603.04733
Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems | arXiv: 2603.13069
frame2freq spectral adapters for fine-grained video understanding | arXiv: 2602.18977
framer frequency-aligned self-distillation with adaptive modulation leveraging d | arXiv: 2512.01390
free-grained hierarchical visual recognition | arXiv: 2510.14737
free-lunch long video generation via layer-adaptive ood correction | arXiv: 2603.25209
freeartgs articulated gaussian splatting under free-moving scenario | arXiv: 2603.22102
frequency switching mechanism for parameter-efficient multi-task learning | arXiv: 2603.21111
From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction | arXiv: 2503.17788
from editor to dense geometry estimator | arXiv: 2509.04338
from fewer samples to fewer bits reframing dataset distillation as joint optimiz | arXiv: 2603.02411
from inpainting to layer decomposition repurposing generative inpainting models | arXiv: 2511.20996
from intuition to investigation a tool-augmented reasoning mllm framework for ge | arXiv: 2603.01038
from masks to pixels and meaning a new taxonomy benchmark and metrics for vlm im | arXiv: 2603.20193
from observation to action latent action-based primitive segmentation for vla pr | arXiv: 2511.21428
from orbit to ground generative city photogrammetry from extreme off-nadir satel | arXiv: 2512.07527
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection | arXiv: 2602.20630
from static to dynamic exploring self-supervised image-to-video representation t | arXiv: 2603.26597
from weights to concepts data-free interpretability of clip via singular vector | arXiv: 2603.24653
funrec reconstructing functional 3d scenes from egocentric interaction videos | arXiv: 2604.05621
fusionagent a multimodal agent with dynamic model selection for human recognitio | arXiv: 2603.26908
F²HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling | arXiv: 2603.14920
GACD: Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection | arXiv: 2509.03113
GAP: Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation | arXiv: 2602.23814
gardendesigner encoding aesthetic principles into jiangnan garden construction v | arXiv: 2604.01777
Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories | arXiv: 2603.14153
gaussfusion improving 3d reconstruction in the wild with a geometry-informed vid | arXiv: 2603.25053
gaussian shannon high-precision diffusion model watermarking based on communicat | arXiv: 2603.26167
gaussiangrow geometry-aware gaussian growing from 3d point clouds with text guid | arXiv: 2604.05721
gaussianpile a unified sparse gaussian splatting framework for slice-based volum | arXiv: 2603.20611
GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion | arXiv: 2603.17161
GeCo-SRT: Geometry-aware Continual Adaptation for Robotic Cross-Task Sim-to-Real Transfer | arXiv: 2602.20871
GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization | arXiv: 2603.05095
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation | arXiv: 2603.02554
Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction | arXiv: 2602.21552
generate analyze and refine training-free sound source localization via mllm met | arXiv: 2604.06824
generative adversarial perturbations with cross-paradigm transferability on loca | arXiv: 2603.24821
Generative Neural Video Compression via Video Diffusion Prior | arXiv: 2512.05016
Generative Video Compression with One-Dimensional Latent Representation | arXiv: 2603.15302
genmask adapting dit for segmentation via direct mask generation | arXiv: 2603.23906
GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration | arXiv: 2603.13068
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis | arXiv: 2603.01010
geoflow real-time fine-grained cross-view geolocalization via iterative flow pre | arXiv: 2603.21943
geofusion-cad structure-aware diffusion with geometric state space for parametri | arXiv: 2603.21978
geoguide hierarchical geometric guidance for open-vocabulary 3d semantic segment | arXiv: 2603.26260
Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context | arXiv: 2602.21929
Geometry-Guided Camera Motion Understanding in VideoLLMs | arXiv: 2603.13119
geosurge geo-localization using semantic fusion with hierarchy of geographic emb | arXiv: 2510.01448
geotikzbridge advancing multimodal code generation for geometric perception and | arXiv: 2603.22687
GeoWorld: Geometric World Models | arXiv: 2602.23058
GGPT: Geometry Grounded Point Transformer | arXiv: 2603.11174
ghost-fwl a large-scale full-waveform lidar dataset for ghost detection and remo | arXiv: 2603.28224
GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis | arXiv: 2603.09446
GKD: Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation | arXiv: 2603.02554
gleam a multimodal imaging dataset and hamm for glaucoma classification | arXiv: 2603.12800
GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification | arXiv: 2603.12800
glint modeling scene-scale transparency via gaussian radiance transport | arXiv: 2603.26181
Global-Aware Edge Prioritization for Pose Graph Initialization | arXiv: 2602.21963
glove2hand synthesizing natural hand-object interaction from multi-modal sensing | arXiv: 2603.20850
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering | arXiv: 2603.15616
goal force teaching video models to accomplish physics-conditioned goals | arXiv: 2601.05848
goal-driven reward by video diffusion models for reinforcement learning | arXiv: 2512.00961
gp-4dgs probabilistic 4d gaussian splatting from monocular video via variational | arXiv: 2604.02915
gQIR: Generative Quanta Image Reconstruction | arXiv: 2602.20417
graph-to-frame rag visual-space knowledge fusion for training-free and auditable | arXiv: 2604.04372
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs | arXiv: 2510.00507
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning | arXiv: 2603.13370
GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion | arXiv: 2602.22862
graze grounded refinement and motion-aware zero-shot event localization | arXiv: 2604.01383
groundvts visual token sampling in multimodal large language models for video te | arXiv: 2604.02093
group editing edit multiple images in one go | arXiv: 2603.22883
GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning | arXiv: 2602.19206
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training | arXiv: 2512.13043
GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents | arXiv: 2603.15039
guide a benchmark for understanding and assisting users in open-ended gui tasks | arXiv: 2603.25864
guide guided updates for in-context decision evolution in llm-driven spacecraft | arXiv: 2603.27306
guiding a diffusion model by swapping its tokens | arXiv: 2604.08048
guiding a diffusion transformer with the internal dynamics of itself | arXiv: 2512.24176
Guiding Diffusion Models with Semantically Degraded Conditions | arXiv: 2603.10780
HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation | arXiv: 2603.12696
ham a training-free style transfer approach via heterogeneous attention modulati | arXiv: 2603.24043
HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding | arXiv: 2603.02329
handvqa diagnosing and improving fine-grained spatial reasoning about hands in v | arXiv: 2603.26362
handx scaling bimanual motion and interaction generation | arXiv: 2603.28766
Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing | arXiv: 2506.01783
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents | arXiv: 2603.12138
hawk head importance-aware visual token pruning in multimodal models | arXiv: 2604.07812
hazematching dehazing light microscopy images with guided conditional flow match | arXiv: 2506.22397
hear what matters text-conditioned selective video-to-audio generation | arXiv: 2512.02650
herbench a benchmark for multi-evidence integration in video question answering | arXiv: 2512.14870
hess head sensitivity score for sparsity redistribution in vggt | arXiv: 2603.25336
Heterogeneous Decentralized Diffusion Models | arXiv: 2603.06741
heuristic self-paced learning for domain adaptive semantic segmentation under ad | arXiv: 2603.24322
hg-i2p bridging modalities for generalizable image-to-point-cloud registration v | arXiv: 2603.27969
HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation | arXiv: 2603.10128
HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers | arXiv: 2603.12222
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces | arXiv: 2503.07853
hieramamba video temporal grounding via hierarchical anchor-mamba pooling | arXiv: 2510.23043
hieramp coarse-to-fine autoregressive amplification for generative dataset disti | arXiv: 2603.06932
hierarchical visual relocalization with nearest view synthesis from feature gaus | arXiv: 2603.29185
hif-vla hindsight insight and foresight through motion representation for vision | arXiv: 2512.09928
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images | arXiv: 2603.02210
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks | arXiv: 2603.12760
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning | arXiv: 2503.22179
high-quality and efficient turbulence mitigation with events | arXiv: 2603.20708
hippomm hippocampal-inspired multimodal memory for long audiovisual event unders | arXiv: 2504.10739
hispatial taming hierarchical 3d spatial understanding in vision-language models | arXiv: 2603.25411
hive query hypothesize verify an llm framework for multimodal reasoning-intensiv | arXiv: 2604.07220
HoneyBee: Data Recipes for Vision-Language Reasoners | arXiv: 2510.12225
horizonforge driving scene editing with any trajectories and any vehicles | arXiv: 2602.21333
HouseMind: Tokenization Allows MLLMs to Understand, Generate and Edit Architectural Floor Plans | arXiv: 2603.11640
How to Take a Memorable Picture? Empowering Users with Actionable Feedback | arXiv: 2602.21877
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models | arXiv: 2602.22727
human interaction-aware 3d reconstruction from a single image | arXiv: 2604.05436
Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization | arXiv: 2603.12369
HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation | arXiv: 2602.24148
Hybrid eTFCE–GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry | arXiv: 2603.11344
Hyperbolic Busemann Neural Networks | arXiv: 2602.18858
hypergaussians high-dimensional gaussian splatting for high-fidelity animatable | arXiv: 2507.02803
HyperMVP: Hyperbolic Multiview Pretraining for Robotic Manipulation | arXiv: 2603.04848
HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition | arXiv: 2506.04764
I'm a Map! Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers | arXiv: 2603.02919
iag input-aware backdoor attack on vlm-based visual grounding | arXiv: 2508.09456
IAPL: Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning | arXiv: 2508.01603
ictpolarreal a polarized reflection and material dataset of real world objects | arXiv: 2603.24912
identity-preserving image-to-video generation via reward-guided optimization | arXiv: 2510.14255
IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations | arXiv: 2602.18831
IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration | arXiv: 2603.12719
image diffusion preview with consistency solver | arXiv: 2512.13592
Image Generation as a Visual Planner for Robotic Manipulation | arXiv: 2512.00532
imagine before concentration diffusion-guided registers enhance partially releva | arXiv: 2604.03653
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards | arXiv: 2603.00918
incarpose in-cabin relative camera pose estimation model and dataset | arXiv: 2604.03814
indoor asset detection in large scale 360 drone-captured imagery via 3d gaussian | arXiv: 2604.05316
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout | arXiv: 2511.20649
Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics | arXiv: 2603.13085
InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation | arXiv: 2603.05898
insid3 training-free in-context segmentation with dinov3 | arXiv: 2603.28480
inside-out measuring generalization in vision transformers through inner working | arXiv: 2604.08192
InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction | arXiv: 2603.11298
instruction-guided lesion segmentation for chest x-rays with automatically gener | arXiv: 2511.15186
Integration of Deep Generative Anomaly Detection Algorithm in High-Speed Industrial Line | arXiv: 2603.07577
InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing | arXiv: 2603.13082
interpretable and steerable concept bottleneck sparse autoencoders | arXiv: 2512.10805
Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment | arXiv: 2603.17655
Interpretable Debiasing of Vision-Language Models for Social Fairness | arXiv: 2602.24014
Intrinsic Concept Extraction Based on Compositional Interpretability | arXiv: 2603.11795
InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models | arXiv: 2504.05662
irisfp adversarial-example-based model fingerprinting with enhanced uniqueness a | arXiv: 2603.24996
it takes two a duet of periodicity and directionality for burst flicker removal | arXiv: 2603.22794
It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models | arXiv: 2603.08011
Joint and Streamwise Distributed MIMO Satellite Communications with Multi-Antenna Ground Users | arXiv: 2603.12914
Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild | arXiv: 2602.21736
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas | arXiv: 2603.06168
just-in-time training-free spatial acceleration for diffusion transformers | arXiv: 2603.10744
KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System | arXiv: 2512.20299
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing | arXiv: 2602.04268
kαlos finds consensus a meta-algorithm for evaluating inter-annotator agreement | arXiv: 2603.27197
L2GTX: From Local to Global Time Series Explanations | arXiv: 2603.13065
label-free cross-task lora merging with null-space compression | arXiv: 2603.26317
lamogen language to motion generation through llm-guided symbolic inference | arXiv: 2603.11605
lamp language-assisted motion planning for controllable video generation | arXiv: 2512.03619
language models can explain visual features via steering | arXiv: 2603.22593
language-free generative editing from one visual example | arXiv: 2603.25441
Language-Grounded Decoupled Action Representation for Robotic Manipulation (LaDA) | arXiv: 2603.12967
laof robust latent action learning with optical flow constraints | arXiv: 2511.16407
LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency | arXiv: 2602.18735
lasca language-conditioned scalable modelling of affective dynamics | arXiv: 2604.07193
laser layer-wise scale alignment for training-free streaming 4d reconstruction | arXiv: 2512.13680
layer consistency matters elegant latent transition discrepancy for generalizabl | arXiv: 2603.10598
le mumo jepa multi-modal self-supervised representation learning with learnable | arXiv: 2603.24327
Learnability-Driven Submodular Optimization for Active Roadside 3D Detection | arXiv: 2601.01695
learnability-guided diffusion for dataset distillation | arXiv: 2604.00519
learning by neighbor-aware semantics deciding by open-form flows towards robust | arXiv: 2511.09388
Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction | arXiv: 2602.18996
learning explicit continuous motion representation for dynamic gaussian splattin | arXiv: 2603.25058
Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting | arXiv: 2508.05059
learning from synthetic data via provenance-based input gradient guidance | arXiv: 2604.02946
Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision | arXiv: 2603.13660
Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization | arXiv: 2603.12663
Learning Latent Proxies for Controllable Single-Image Relighting | arXiv: 2603.15555
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal | arXiv: 2511.17353
learning like humans analogical concept learning for generalized category discov | arXiv: 2603.19918
Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection | arXiv: 2602.18811
learning multi-view spatial reasoning from cross-view relations | arXiv: 2603.27967
Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception | arXiv: 2602.19596
learning through creation a hash-free framework for on-the-fly category discover | arXiv: 2603.13858
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning | arXiv: 2603.11346
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos | arXiv: 2602.22091
Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models | arXiv: 2603.06043
Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation | arXiv: 2508.05186
learning to translate noise for robust image denoising | arXiv: 2412.04727
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection | arXiv: 2506.01085
lemma laplacian pyramids for efficient marine semantic segmentation | arXiv: 2603.25689
lenswalk agentic video understanding by planning how you see in videos | arXiv: 2603.24558
LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration | arXiv: 2602.20497
let it snow animating 3d gaussian scenes with dynamic weather effects via physic | arXiv: 2504.05296
let your image move with your motion -- implicit multi-object multi-motion trans | arXiv: 2603.01000
leveraging multispectral sensors for color correction in mobile cameras | arXiv: 2512.08441
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment | arXiv: 2603.10929
lifting unlabeled internet-level data for 3d scene understanding | arXiv: 2604.01907
lighting-grounded video generation with renderer-based agent reasoning | arXiv: 2604.07966
lightmover generative light movement with color and intensity controls | arXiv: 2603.27209
lightsplat fast and memory-efficient open-vocabulary 3d scene understanding in f | arXiv: 2603.24146
linking modality isolation in heterogeneous collaborative perception | arXiv: 2603.00609
Linking Perception, Confidence and Accuracy in MLLMs | arXiv: 2603.12149
LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation | arXiv: 2510.08318
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration | arXiv: 2602.21754
Lite Any Stereo: Efficient Zero-Shot Stereo Matching | arXiv: 2511.16555
litept lighter yet stronger point transformer | arXiv: 2512.13689
live interactive training for video segmentation | arXiv: 2603.26929
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models | arXiv: 2509.25896
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models | arXiv: 2603.14882
Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation | arXiv: 2603.16284
lod-loc v3 generalized aerial localization in dense cities using instance silhou | arXiv: 2603.19609
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry | arXiv: 2602.13172
longvideo-r1 smart navigation for low-cost long video understanding | arXiv: 2602.20913
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection | arXiv: 2507.16861
looking beyond the window global-local aligned clip for training-free open-vocab | arXiv: 2603.23030
LoST: Level of Semantics Tokenization for 3D Shapes | arXiv: 2603.17995
love me love my label rethinking the role of labels in prompt retrieval for visu | arXiv: 2604.03657
low-resolution editing is all you need for high-resolution editing | arXiv: 2511.19945
LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction | arXiv: 2603.12647
LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates | arXiv: 2510.09881
lumictrl learning illuminant prompts for lighting control in personalized text-t | arXiv: 2512.17489
LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol | arXiv: 2603.14644
Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels | arXiv: 2602.22140
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation | arXiv: 2509.23728
m4-rag a massive-scale multilingual multi-cultural multimodal rag | arXiv: 2512.05959
ma-bench towards fine-grained micro-action understanding | arXiv: 2603.26586
MAD-Avatar: Motion-Aware Animatable Gaussian Avatars Deblurring | arXiv: 2411.16758
MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness | arXiv: 2507.02314
magician efficient long-term planning with imagined gaussians for active mapping | arXiv: 2603.22650
Making Training-Free Diffusion Segmentors Scale with the Generative Power | arXiv: 2603.06178
mamba learns in context structure-aware domain generalization for multi-task poi | arXiv: 2603.20739
mamba-vmr multimodal query augmentation via generated videos for precise tempora | arXiv: 2603.22121
maniparena comprehensive real-world evaluation of reasoning-oriented generalist | arXiv: 2603.28545
MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction | arXiv: 2603.10688
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models | arXiv: 2511.20629
Mario: Multimodal Graph Reasoning with Large Language Models | arXiv: 2603.05181
Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies | arXiv: 2603.12667
Markovian Scale Prediction: A New Era of Visual Autoregressive Generation | arXiv: 2511.23334
markushgrapher-2 end-to-end multimodal recognition of chemical structures | arXiv: 2603.28550
MARVO: Marine-Adaptive Radiance-aware Visual Odometry | arXiv: 2511.22860
maskadapt learning flexible motion adaptation via mask-invariant prior for physi | arXiv: 2603.29272
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations | arXiv: 2602.18792
Masked Representation Modeling for Domain-Adaptive Segmentation | arXiv: 2509.13801
masking matters unlocking the spatial reasoning capabilities of llms for 3d scen | arXiv: 2512.02487
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models | arXiv: 2603.04800
Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning | arXiv: 2603.12606
matanyone 2 scaling video matting via a learned quality evaluator | arXiv: 2512.11782
Match-and-Fuse: Consistent Generation from Unstructured Image Sets | arXiv: 2511.22287
MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision | arXiv: 2602.20689
meanfuser fast one-step multi-modal trajectory generation and adaptive reconstru | arXiv: 2602.20060
measuring the unfaithfulness of concept-based explanations | arXiv: 2504.10833
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation | arXiv: 2602.20423
MedGEN-Bench: Contextually Entangled Benchmark for Open-Ended Multimodal Medical Generation | arXiv: 2511.13135
medgrpo multi-task reinforcement learning for heterogeneous medical video unders | arXiv: 2512.06581
MEDISEG: A Dataset of Medication Images with Instance Segmentation Masks for Preventing Adverse Drug Events | arXiv: 2603.10825
MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration | arXiv: 2603.09101
memo human-like crisp edge detection using masked edge prediction | arXiv: 2603.20782
memory-efficient fine-tuning diffusion transformers via dynamic patch sampling a | arXiv: 2603.20755
MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent | arXiv: 2511.18810
Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation | arXiv: 2603.00526
meta-learning in-context enables training-free cross subject brain decoding | arXiv: 2604.08537
MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating | arXiv: 2603.09419
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging | arXiv: 2603.09116
Miburi: Towards Expressive Interactive Gesture Synthesis | arXiv: 2603.03282
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models | arXiv: 2602.19497
MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification | arXiv: 2603.09374
mimicat mimic with correspondence-aware cascade-transformer for category-free 3d | arXiv: 2511.18370
Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning | arXiv: 2603.13341
mind the generative details direct localized detail preference optimization for | arXiv: 2601.04068
mind the hitch dynamic calibration and articulated perception for autonomous tru | arXiv: 2603.23711
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs | arXiv: 2603.02618
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving | arXiv: 2602.21952
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents | arXiv: 2511.23055
mine-jepa in-domain self-supervised learning for mine-like object classification | arXiv: 2604.00383
minerva-cultural a benchmark for cultural and multilingual long video reasoning | arXiv: 2601.10649
mining instance-centric vision-language contexts for human-object interaction de | arXiv: 2604.02071
Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared | arXiv: 2603.08018
mistake attribution fine-grained mistake understanding in egocentric videos | arXiv: 2511.20525
Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning | arXiv: 2603.04825
Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection | arXiv: 2603.13070
mitigating multimodal hallucinations via gradient-based self-reflection | arXiv: 2509.03113
mitigating object hallucinations in lvlms via attention imbalance rectification | arXiv: 2603.24058
MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention | arXiv: 2603.01361
Mixture of States (MoS): Routing Token-Level Dynamics for Multimodal Generation | arXiv: 2511.12207
mm-recoder advancing chart-to-code generation with reinforcement learning and se | arXiv: 2604.01600
mmtit-bench a multilingual and multi-scenario benchmark with cognition-perceptio | arXiv: 2603.23896
Mobile-VTON: High-Fidelity On-Device Virtual Try-On | arXiv: 2603.00947
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization | arXiv: 2603.03192
Model Merging in the Essential Subspace | arXiv: 2602.20208
modeling spatiotemporal neural frames for high resolution brain dynamic | arXiv: 2603.24176
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping | arXiv: 2511.15690
moe-grpo optimizing mixture-of-experts via reinforcement learning in vision-lang | arXiv: 2603.24984
MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection | arXiv: 2603.03101
MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization | arXiv: 2603.12743
molingo motion-language alignment for text-to-motion generation | arXiv: 2512.13840
Momentum Memory for Knowledge Distillation in Computational Pathology | arXiv: 2602.21395
momo mars orbital model foundation model for mars orbital applications | arXiv: 2604.02719
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes (LegoOcc) | arXiv: 2602.22667
monosaod monocular 3d object detection with sparsely annotated label | arXiv: 2604.01646
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes | arXiv: 2603.09573
MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer | arXiv: 2603.05078
morel long-range flicker-free 4d motion modeling via anchor relay-based bidirect | arXiv: 2512.09270
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing | arXiv: 2601.00204
MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification | arXiv: 2512.03404
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models | arXiv: 2603.16001
Motion-Aware Animatable Gaussian Avatars Deblurring | arXiv: 2411.16758
MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins | arXiv: 2603.12936
motionscale reconstructing appearance geometry and motion of dynamic scenes with | arXiv: 2603.29296
MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer | arXiv: 2508.14327
movierecapsqa a multimodal open-ended video question-answering benchmark | arXiv: 2601.02536
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second | arXiv: 2507.10065
mozzavid mozzarella volumetric image dataset | arXiv: 2412.04880
mpdit multi-patch global-to-local transformer architecture for efficient flow ma | arXiv: 2603.26357
mpm mutual pair merging for efficient vision transformers | arXiv: 2604.05718
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding | arXiv: 2512.02906
mri contrast enhancement kinetics world model | arXiv: 2602.19285
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation | arXiv: 2511.10376
MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding | arXiv: 2602.22932
msrl scaling generative multimodal reward modeling via multi-stage reinforcement | arXiv: 2603.25108
muco multi-turn contrastive learning for multimodal embedding model | arXiv: 2602.06393
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following | arXiv: 2511.21662
multi-modal image fusion via intervention-stable feature learning | arXiv: 2603.23272
multi-modal representation learning via semi-supervised rate reduction for gener | arXiv: 2602.19910
Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models | arXiv: 2603.04846
Multimodal Classification of Radiation-Induced Contrast Enhancements and Tumor Recurrence Using Deep Learning | arXiv: 2603.11827
Multimodal OCR: Parse Anything from Documents | arXiv: 2603.13032
Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation | arXiv: 2603.12845
MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning | arXiv: 2602.20223
Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation | arXiv: 2603.12581
muse harnessing precise and diverse semantics for few-shot whole slide image cla | arXiv: 2602.20873
must modality-specific representation-aware transformer for diffusion-enhanced s | arXiv: 2603.26071
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy | arXiv: 2602.24222
mv-roma from pairwise matching into multi-view track reconstruction | arXiv: 2603.27542
mvggt multimodal visual geometry grounded transformer for multiview 3d referring | arXiv: 2601.06874
MXNorm: Reusing MXFP block scales for efficient tensor normalisation | arXiv: 2603.13180
MXNorm: Reusing MXFP Block Scales for Efficient Tensor Normalisation | arXiv: 2603.13180
M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs | arXiv: 2603.09737
NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries | arXiv: 2603.05446
nanosd edge efficient foundation model for real time image restoration | arXiv: 2601.09823
NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval | arXiv: 2603.12824
Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning | arXiv: 2603.06688
near coupled neural asset-renderer stack | arXiv: 2511.18600
nec-diff noise-robust event-raw complementary diffusion for seeing motion in ext | arXiv: 2603.20005
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models | arXiv: 2511.16955
neighbor-aware localized concept erasure in text-to-image diffusion models | arXiv: 2603.25994
neoverse enhancing 4d world model with in-the-wild monocular videos | arXiv: 2601.00393
Nerfify: A Multi-Agent Framework for Turning NeRF Papers into Code | arXiv: 2603.00805
NERFIFY: A Multi-Agent Framework That Automatically Turns NeRF Papers into Runnable Code | arXiv: 2603.00805
NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training | arXiv: 2602.22059
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences | arXiv: 2602.22212
neural collapse in test-time adaptation | arXiv: 2512.10421
neural field-based 3d surface reconstruction of microstructures from multi-detec | arXiv: 2508.04728
Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion | arXiv: 2509.17704
neuroseg meets dinov3 transferring 2d self-supervised visual priors to 3d neuron | arXiv: 2603.23104
next-scale autoregressive models for text-to-motion generation | arXiv: 2604.03799
NI-Tex: Non-isometric Image-based Garment Texture Generation | arXiv: 2511.18765
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency | arXiv: 2602.23559
no hard negatives required concept centric learning leads to compositionality wi | arXiv: 2603.25722
No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors | arXiv: 2602.23141
No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection | arXiv: 2602.19248
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs | arXiv: 2603.12078
noise-aware few-shot learning through bi-directional multi-view prompt alignment | arXiv: 2603.11617
Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment | arXiv: 2603.11617
noovd novel category discovery and embedding for open-vocabulary object detectio | arXiv: 2603.21069
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning | arXiv: 2602.21172
NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing | arXiv: 2603.02802
novel anomaly detection scenarios and evaluation metrics to address the ambiguit | arXiv: 2604.07097
Novel Architecture of RPA In Oral Cancer Lesion Detection | arXiv: 2603.10928
NTK-Guided Implicit Neural Teaching | arXiv: 2511.15487
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction | arXiv: 2603.12144
oars process-aware online alignment for generative real-world image super-resolu | arXiv: 2603.12811
OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution | arXiv: 2603.12811
Object-WIPER: Training-Free Object and Associated Effect Removal in Videos | arXiv: 2601.06391
occany generalized unconstrained urban 3d occupancy | arXiv: 2603.23502
Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking | arXiv: 2603.06034
occufly a 3d vision benchmark for semantic scene completion from the aerial pers | arXiv: 2512.20770
OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models | arXiv: 2603.09326
off the grid detection of primitives for feed-forward 3d gaussian splatting | arXiv: 2512.15508
Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments | arXiv: 2602.22025
omg-bench a new challenging benchmark for skeleton-based online micro hand gestu | arXiv: 2512.16727
omni-mmsi toward identity-attributed social interaction understanding | arXiv: 2604.00267
omnifm toward modality-robust and task-agnostic federated learning for heterogen | arXiv: 2603.21660
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens | arXiv: 2603.02138
OmniRet: Efficient and High-Fidelity Omni Modality Retrieval | arXiv: 2603.02098
omnisonic towards universal and holistic audio generation from video and text | arXiv: 2604.04348
On the Feasibility and Opportunity of Autoregressive 3D Object Detection | arXiv: 2603.07985
On the Possible Detectability of Image-in-Image Steganography | arXiv: 2603.11876
on the robustness of diffusion-based image compression to bit-flip errors | arXiv: 2604.05743
on tokens dilemma dynamic moe with drift-aware token assignment for continual le | arXiv: 2603.27481
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers | arXiv: 2603.12245
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera | arXiv: 2511.03571
OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery | arXiv: 2603.17355
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting | arXiv: 2603.18510
Open-Vocabulary Domain Generalization in Urban-Scene Segmentation | arXiv: 2602.18853
opendpr open-vocabulary change detection via vision-centric diffusion-guided pro | arXiv: 2603.27645
OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis | arXiv: 2602.22949
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments | arXiv: 2603.02390
openvo open-world visual odometry with temporal dynamics awareness | arXiv: 2602.19035
opro orthogonal panel-relative operators for panel-aware in-context image genera | arXiv: 2603.27637
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation | arXiv: 2509.18600
Order Matters: 3D Shape Generation from Sequential VR Sketches | arXiv: 2512.04761
organizing unstructured image collections using natural language | arXiv: 2410.05217
\(\oslash\) source models leak what they shouldnt \(\nrightarrow\) unlearning zero-shot tr | arXiv: 2604.08238
OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport | arXiv: 2602.20205
out of sight out of track adversarial attacks on propagation-based multi-object | arXiv: 2604.00452
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models | arXiv: 2603.13215
Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models | arXiv: 2603.07619
pad-hand physics-aware diffusion for hand motion recovery | arXiv: 2603.26068
palm progress-aware policy learning via affordance reasoning for long-horizon ro | arXiv: 2601.07060
pam a pose-appearance-motion engine for sim-to-real hoi video generation | arXiv: 2603.22193
Pano360: Perspective to Panoramic Vision with Geometric Consistency | arXiv: 2603.12013
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image | arXiv: 2603.05908
PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments | arXiv: 2603.09760
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots | arXiv: 2603.13108
panoramic multimodal semantic occupancy prediction for quadruped robots | arXiv: 2603.13108
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery | arXiv: 2603.17571
Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression | arXiv: 2603.03615
Parallel In-context Learning for Large Vision Language Models | arXiv: 2603.16092
Parallelised Differentiable Straightest Geodesics for 3D Meshes | arXiv: 2603.15780
parameter-efficient prompt tuning and hierarchical textual guidance for few-shot | arXiv: 2603.21504
parameter-efficient semantic augmentation for enhancing open-vocabulary object d | arXiv: 2604.04444
particulate feed-forward 3d object articulation | arXiv: 2512.11798
ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis | arXiv: 2603.09611
pcstracker long-term scene flow estimation for point cloud sequences | arXiv: 2603.19762
pe3r perception-efficient 3d reconstruction | arXiv: 2503.07507
pearl geometry aligns semantics for training-free open-vocabulary semantic segme | arXiv: 2603.21528
perception characteristics distance measuring stability and robustness of percep | arXiv: 2506.09217
performrecast expression and head pose disentanglement for portrait video editin | arXiv: 2603.19731
perturb and recover fine-tuning for effective backdoor removal from clip | arXiv: 2412.00727
pet-dino unifying visual cues into grounding dino with prompt-enriched training | arXiv: 2604.00503
PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning | arXiv: 2602.20537
pgr-net prior-guided roi reasoning network for brain tumor mri segmentation | arXiv: 2603.21626
PHAC: Promptable Human Amodal Completion | arXiv: 2603.14741
phantasia context-adaptive backdoors in vision language models | arXiv: 2604.08395
phantom physics-infused video generation via joint modeling of visual and latent | arXiv: 2604.08503
PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement | arXiv: 2509.24850
phasr generalized image shadow removal with physically aligned priors | arXiv: 2601.17470
phrase-instance alignment for generalized referring segmentation | arXiv: 2411.15087
phygap physically-grounded gaussians with polarization cues | arXiv: 2603.14001
physgaia a physics-aware benchmark with multi-body interactions for dynamic nove | arXiv: 2506.02794
physgen physically grounded 3d shape generation for industrial design | arXiv: 2512.00422
physgm large physical gaussian model for feed-forward 4d synthesis | arXiv: 2508.13911
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis | arXiv: 2508.13911
PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation | arXiv: 2511.18570
physhead simulation-ready gaussian head avatars | arXiv: 2604.06467
Physical Simulator In-the-Loop Video Generation | arXiv: 2603.06408
physically inspired gaussian splatting for hdr novel view synthesis | arXiv: 2603.28020
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction | arXiv: 2603.00149
physmodpo physically-plausible humanoid motion with preference optimization | arXiv: 2603.13228
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization | arXiv: 2603.13228
physskin real-time and generalizable physics-based animation via self-supervised | arXiv: 2603.23194
physvid physics aware local conditioning for generative video models | arXiv: 2603.26285
PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing | arXiv: 2603.04598
pioneering perceptual video fluency assessment a novel task with benchmark datas | arXiv: 2603.26055
PIP-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching | arXiv: 2602.20496
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction | arXiv: 2603.05888
Pixel Motion Diffusion Is What We Need for Robot Control | arXiv: 2509.22652
pixel-level scene understanding in one token visual states need what-is-where co | arXiv: 2603.13904
Pixel2Phys: Distilling Governing Laws from Visual Dynamics | arXiv: 2602.19516
pixelrush ultra-fast training-free high-resolution image generation via one-step | arXiv: 2602.12769
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion | arXiv: 2602.12769
Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision | arXiv: 2602.19715
planareloc camera relocalization in 3d planar primitives via region-based struct | arXiv: 2603.20818
planning in 8 tokens a compact discrete tokenizer for latent world model | arXiv: 2603.05438
plant taxonomy meets plant counting a fine-grained taxonomic dataset for countin | arXiv: 2603.21229
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers | arXiv: 2511.16156
PNG: Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning | arXiv: 2603.04870
PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models | arXiv: 2603.00412
pointer-cad unifying b-rep and command sequences via pointer-based edges faces s | arXiv: 2603.04337
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors | arXiv: 2603.18782
pointtpa dynamic network parameter adaptation for 3d scene understanding | arXiv: 2604.04933
POLISH'ing the Sky: Wide-Field and High-Dynamic Range Interferometric Image Reconstruction | arXiv: 2603.09162
pose-dive pose-diversified augmentation with diffusion model for person re-ident | arXiv: 2406.16042
posemaster a unified 3d native framework for stylized pose generation | arXiv: 2506.21076
posteriq a design perspective benchmark for poster understanding and generation | arXiv: 2603.24078
PPCL: Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers | arXiv: 2511.16156
pr-iqa partial-reference image quality assessment for diffusion-based novel view | arXiv: 2604.04576
Precise Object and Effect Removal with Adaptive Target-Aware Attention | arXiv: 2505.22636
predictive regularization against visual representation degradation in multimoda | arXiv: 2603.20808
preference-aligned lora merging preserving subspace coverage and addressing dire | arXiv: 2603.26299
preserving source video realism high-fidelity face swapping for cinematic qualit | arXiv: 2512.07951
prime once then reprogram locally an efficient alternative to black-box service | arXiv: 2604.01474
principled steering via null-space projection for jailbreak defense in vision-la | arXiv: 2603.22094
prism video dataset condensation with progressive refinement and insertion for s | arXiv: 2505.22564
privi towards a general-purpose video model for primate behavior in the wild | arXiv: 2511.09675
probabilistic concept graph reasoning for multimodal misinformation detection | arXiv: 2603.25203
Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models | arXiv: 2602.20501
ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation | arXiv: 2603.05530
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars | arXiv: 2603.16447
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On | arXiv: 2603.11675
Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains | arXiv: 2603.12624
Prompt-Free Universal Region Proposal Network | arXiv: 2603.17554
PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts | arXiv: 2603.01650
Proof-of-Perception: Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees | arXiv: 2603.00324
proood prototype-guided out-of-distribution 3d occupancy prediction | arXiv: 2604.01081
Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting | arXiv: 2603.11938
Prototype-Guided Concept Erasure in Diffusion Models | arXiv: 2603.08271
ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning | arXiv: 2602.21078
prue a practical recipe for field boundary segmentation at scale | arXiv: 2603.27101
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives | arXiv: 2602.24136
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving | arXiv: 2508.13305
psdesigner automated graphic design with a human-like creative workflow | arXiv: 2603.25738
psr scaling multi-subject personalized image generation with pairwise subject-co | arXiv: 2512.01236
ptc-depth pose-refined monocular depth estimation with temporal consistency | arXiv: 2604.01791
pulse privileged knowledge transfer from rich to deployable sensors for embodied | arXiv: 2510.24058
PureCC: Pure Learning for Text-to-Image Concept Customization | arXiv: 2603.07561
purify-then-align towards robust human sensing under modality missing with knowl | arXiv: 2604.05584
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment | arXiv: 2603.03726
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition | arXiv: 2602.22639
Quant Experts: Token-aware Adaptive Error Reconstruction for Large VLM Quantization | arXiv: 2602.24059
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization | arXiv: 2602.24059
quantization with unified adaptive distillation to enable multi-lora based one-f | arXiv: 2603.29535
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models | arXiv: 2602.20309
question-guided visual compression with memory feedback for long-term video unde | arXiv: 2603.15167
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection | arXiv: 2603.11566
radar closed-loop robotic data generation via semantic planning and autonomous c | arXiv: 2603.11811
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset | arXiv: 2603.11811
ragtrack language-aware rgbt tracking with retrieval-augmented generation | arXiv: 2603.03617
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment | arXiv: 2603.00483
Random Wins All: Rethinking Grouping Strategies for Vision Tokens | arXiv: 2603.00486
RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing | arXiv: 2602.19753
rascene high-fidelity 3d scene imaging with mmwave communication signals | arXiv: 2604.02603
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought | arXiv: 2507.07685
RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution | arXiv: 2603.12493
RayNova: Scale-Temporal Autoregressive World Modeling in Ray Space | arXiv: 2602.20685
RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models | arXiv: 2603.14819
RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation | arXiv: 2603.11106
rdface a benchmark dataset for rare disease facial image analysis under extreme | arXiv: 2604.03454
rdnet region proportion-aware dynamic adaptive salient object detection network | arXiv: 2603.12215
RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images | arXiv: 2603.12215
Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting | arXiv: 2512.17908
reag reasoning-augmented generation for knowledge-based visual question answerin | arXiv: 2511.22715
real-world point tracking with verifier-guided pseudo-labeling | arXiv: 2603.12217
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling | arXiv: 2603.12217
real2edit2real generating robotic demonstrations via a 3d control interface | arXiv: 2512.19402
Reallocating Attention Across Layers to Reduce Multimodal Hallucination | arXiv: 2510.10285
realm an mllm-agent framework for open world 3d reasoning segmentation and editi | arXiv: 2510.16410
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting | arXiv: 2510.16410
realunify do unified models truly benefit from unification a comprehensive bench | arXiv: 2509.24897
realvlg-r1 a large-scale real-world visual-language grounding benchmark for robo | arXiv: 2603.14880
reason-svg enhancing structured reasoning for vector graphics generation with re | arXiv: 2505.24499
Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics | arXiv: 2601.13401
reasoning-driven anomaly detection and localization with image-level supervision | arXiv: 2603.27179
reasonmap towards fine-grained visual reasoning from transit maps | arXiv: 2505.18675
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps | arXiv: 2505.18675
recall recalibrating capability degradation for mllm-based composed image retrie | arXiv: 2602.01639
Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning | arXiv: 2603.05235
reconstruction-guided slot curriculum addressing object over-fragmentation in vi | arXiv: 2603.22758
recover to predict progressive retrospective learning for variable-length trajec | arXiv: 2603.10597
RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces | arXiv: 2602.20618
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress | arXiv: 2603.17312
Recursive Think-Answer Process for LLMs and VLMs | arXiv: 2603.02099
recyclelora rank-revealing qr-based dual-lora subspace adaptation for domain gen | arXiv: 2603.28142
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback | arXiv: 2603.13057
Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning | arXiv: 2505.20107
reflexsplit single image reflection separation via layer fusion-separation | arXiv: 2601.17468
reframing long-tailed learning via loss landscape geometry | arXiv: 2603.21217
refton reference person shot assist virtual try-on | arXiv: 2511.00956
Regularizing INR with Diffusion Prior for Self-Supervised 3D Reconstruction of Neutron Computed Tomography Data | arXiv: 2603.10947
rehark refined hybrid adaptive rbf kernels for robust one-shot vision-language a | arXiv: 2603.11542
ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation | arXiv: 2603.11542
RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model | arXiv: 2509.24948
reinforce to learn elect to reason a dual paradigm for video reasoning | arXiv: 2604.04379
reinforcing structured chain-of-thought for video understanding | arXiv: 2603.25942
Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration | arXiv: 2603.12951
REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion | arXiv: 2601.16788
Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing | arXiv: 2603.17531
ReLaGS: Relational Language Gaussian Splatting | arXiv: 2603.17605
Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection | arXiv: 2603.18541
remogen real-time human interaction-to-reaction generation via modular learning | arXiv: 2604.01082
ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding | arXiv: 2602.16412
ReMoT: Reinforcement Learning with Motion Contrast Triplets | arXiv: 2603.00461
renderflow single-step neural rendering via flow matching | arXiv: 2601.06928
Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery | arXiv: 2603.01034
Representation Learning for Spatiotemporal Physical Systems | arXiv: 2603.13227
RESBev: Making BEV Perception More Robust | arXiv: 2603.09529
rescene4d temporally consistent semantic instance segmentation of evolving indoo | arXiv: 2601.11508
residual decoding mitigating hallucinations in large vision-language models via | arXiv: 2602.01047
Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning | arXiv: 2603.12816
resolving the identity crisis in text-to-image generation | arXiv: 2510.01399
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation | arXiv: 2603.02139
Rethinking Concept Bottleneck Models: From Pitfalls to Solutions | arXiv: 2603.05629
Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token | arXiv: 2603.19026
rethinking pose refinement in 3d gaussian splatting under pose prior and geometr | arXiv: 2603.16538
rethinking position embedding as a context controller for multi-reference and mu | arXiv: 2604.03738
Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model | arXiv: 2410.07547
Rethinking VLMs for Image Forgery Detection and Localization | arXiv: 2603.12930
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting | arXiv: 2603.13783
Retrieving Counterfactuals Improves Visual In-Context Learning | arXiv: 2603.16737
Revisiting Model Stitching In the Foundation Model Era | arXiv: 2603.12433
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach | arXiv: 2511.16786
revisiting unknowns towards effective and efficient open-set active learning | arXiv: 2603.07898
Reviving ConvNeXt for Efficient Convolutional Diffusion Models | arXiv: 2603.09408
rewardflow generate images by optimizing what you reward | arXiv: 2604.08536
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction | arXiv: 2601.16672
rewis3d reconstruction improves weakly-supervised semantic segmentation | arXiv: 2603.06374
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation | arXiv: 2603.06374
rho robust holistic osm-based metric cross-view geo-localization | arXiv: 2603.27758
riskprop collision-anchored self-supervised risk propagation for early accident | arXiv: 2603.27165
rl-scaniqa reinforcement-learned scanpaths for blind 360image quality assessment
rng a unified transformer for complete 3d modeling from partial observations
roboagent chaining basic capabilities for embodied task planning | arXiv: 2604.07774
robotseg a model and dataset for segmenting robots in image and video | arXiv: 2511.22950
robust multi-source covid-19 detection in ct images | arXiv: 2604.03320
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations | arXiv: 2602.22013
Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods | arXiv: 2603.13077
rs-ssm refining forgotten specifics in state space model for video semantic segm | arXiv: 2603.24295
rsonet region-guided selective optimization network for rgb-t salient object det | arXiv: 2603.12685
RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection | arXiv: 2603.12685
S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds | arXiv: 2512.00995
saber spatially consistent 3d universal adversarial objects for bev detectors | arXiv: 2505.22499
SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World | arXiv: 2602.18887
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning | arXiv: 2603.05437
saliency-r1 enforcing interpretable and faithful vision-language reasoning via s | arXiv: 2604.04500
salmubench a benchmark for sensitive association-level multimodal unlearning | arXiv: 2603.26316
sampling-aware 3d spatial analysis in multiplexed imaging | arXiv: 2604.07890
SAP: Segment Any 4K Panorama | arXiv: 2603.12759
sapave towards active perception and manipulation in vision-language-action mode | arXiv: 2603.12193
SaPaVe: Towards Active Perception and Manipulation in VLA Models for Robotics | arXiv: 2603.12193
sarmae masked autoencoder for sar representation learning | arXiv: 2512.16635
sattc structure-aware label-free test-time calibration for cross-subject eeg-to- | arXiv: 2603.20738
sava-x ego-to-exo imitation error detection via scene-adaptive view alignment an | arXiv: 2603.12764
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion | arXiv: 2603.12764
save speech-aware video representation learning for video-text retrieval | arXiv: 2603.08224
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval | arXiv: 2603.08224
scalable object relation encoding for better 3d spatial reasoning in large langu | arXiv: 2603.24721
scaling spatial intelligence with multimodal foundation models | arXiv: 2511.13719
Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework | arXiv: 2603.07659
scaling the long video understanding of multimodal large language models via vis | arXiv: 2603.29252
Scaling View Synthesis Transformers (SVSM) | arXiv: 2602.21341
scaling-aware data selection for end-to-end autonomous driving systems | arXiv: 2604.08366
scene grounding in the wild | arXiv: 2603.26584
scene-vlm multimodal video scene segmentation via vision-language models | arXiv: 2512.21778
sceneassistant a visual feedback agent for open-vocabulary 3d scene generation | arXiv: 2603.12238
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation | arXiv: 2603.12238
scenescribe-1m a large-scale video dataset with comprehensive geometric and sema | arXiv: 2604.07990
SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation | arXiv: 2603.06572
SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated Learning | arXiv: 2603.12976
SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated Learning | arXiv: 2603.12976
score2instruct scaling up video quality-centric instructions via automated dimen | arXiv: 2506.21011
sdf-net structure-aware disentangled feature learning for optical-sar ship re-i | arXiv: 2603.12588
SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification | arXiv: 2603.12588
SEA-Vision: A Multilingual Benchmark for Document and Scene Text Understanding in Southeast Asia | arXiv: 2603.15409
seacache spectral-evolution-aware cache for accelerating diffusion models | arXiv: 2602.18993
searchad large-scale rare image retrieval dataset for autonomous driving | arXiv: 2604.08008
see it say it sorted an iterative training-free framework for visually-grounded | arXiv: 2602.21497
See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs | arXiv: 2602.21497
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles | arXiv: 2509.13615
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles (StaR) | arXiv: 2509.13615
Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation | arXiv: 2603.15475
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness | arXiv: 2602.19615
seeing is improving visual feedback for iterative text layout refinement | arXiv: 2603.22187
seeing without pixels perception from camera trajectories | arXiv: 2511.21681
seethrough3d occlusion aware 3d control in text-to-image generation | arXiv: 2602.23359
seeu seeing the unseen world via 4d dynamics-aware generation | arXiv: 2512.03350
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models | arXiv: 2507.14811
select hypothesize and verify towards verified neuron concept interpretation | arXiv: 2603.24953
self-consistency for llm-based motion trajectory generation and verification | arXiv: 2603.29301
self-corrected image generation with explainable latent rewards | arXiv: 2603.24965
semantic audio-visual navigation in continuous environments | arXiv: 2603.19660
Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation | arXiv: 2603.05202
Semantic Satellite Communications for Synchronized Audiovisual Reconstruction | arXiv: 2603.10791
semantic satellite communications for synchronized audiovisual reconstruction | arXiv: 2603.10791
Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score | arXiv: 2505.21147
SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation | arXiv: 2603.11616
semlayer semantic-aware generative segmentation and layer construction for abstr | arXiv: 2603.24039
SG-NLF: Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis | arXiv: 2603.12903
sgad-slam splatting gaussians at adjusted depth for better radiance fields in rg | arXiv: 2603.21055
sgi structured 2d gaussians for efficient and compact large image representation
SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data | arXiv: 2603.02505
shape-of-you fused gromov-wasserstein optimal transport for semantic corresponde | arXiv: 2603.11618
sharp short-window streaming for accurate and robust prediction in motion foreca | arXiv: 2603.28091
ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration | arXiv: 2603.00906
shoe semantic hoi open-vocabulary evaluation metric | arXiv: 2604.01586
shoe style-invariant and ground-aware learning for dense foot contact estimation | arXiv: 2511.22184
Show, Don't Tell: Detecting Novel Objects by Watching Human Videos | arXiv: 2603.12751
show3d capturing scenes of 3d hands and objects in the wild | arXiv: 2603.28760
showtable unlocking creative table visualization with collaborative reflection a | arXiv: 2512.13303
SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules | arXiv: 2603.12307
Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning | arXiv: 2602.18867
SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images | arXiv: 2602.20412
simpact simulation-enabled action planning using vision-language models | arXiv: 2512.05955
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos | arXiv: 2603.02133
SimScale: Learning to Drive via Real-World Simulation at Scale | arXiv: 2511.23369
SineProject: Machine Unlearning for Stable Vision–Language Alignment | arXiv: 2511.18444
Single Pixel Image Classification using an Ultrafast Digital Light Projector | arXiv: 2603.12036
SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation | arXiv: 2603.18599
skeletoncontext skeleton-side context prompt learning for zero-shot skeleton-bas | arXiv: 2603.29692
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation | arXiv: 2603.02190
sketchdeco training-free latent composition for precise sketch colourisation | arXiv: 2405.18716
sky2ground a benchmark for site modeling under varying altitude | arXiv: 2603.13740
sldprtnet a large-scale multimodal dataset for cad generation in language-driven | arXiv: 2603.13098
SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design | arXiv: 2603.13098
slice semantic latent injection via compartmentalized embedding for image waterm | arXiv: 2603.12749
SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking | arXiv: 2603.12749
slotvtg object-centric adapter for generalizable video temporal grounding | arXiv: 2603.25733
slvmeval synthetic meta evaluation benchmark for text-to-long video generation | arXiv: 2603.29186
small target detection based on mask-enhanced attention fusion of visible and in | arXiv: 2603.06925
Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images | arXiv: 2603.06925
soda sensitivity-oriented dynamic acceleration for diffusion transformer | arXiv: 2603.07057
SOLACE: Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards | arXiv: 2603.00918
Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion | arXiv: 2603.16939
solution for 10th competition on ambivalencehesitancy ah video recognition chall | arXiv: 2603.16939
Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors | arXiv: 2603.00882
sonoworld from one image to a 3d audio-visual scene | arXiv: 2603.28757
SoPE: Spherical Coordinate-Based Positional Embedding for 3D LVLMs | arXiv: 2602.22716
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs | arXiv: 2602.22716
souple enhancing audio-visual localization and segmentation with learnable promp | arXiv: 2603.22732
SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection | arXiv: 2511.06702
spar single-pass any-resolution vit for open-vocabulary segmentation | arXiv: 2604.02252
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs | arXiv: 2603.12382
Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis | arXiv: 2603.10526
sparsecam4d spatio-temporally consistent 4d reconstruction from sparse cameras | arXiv: 2603.26481
sparsity-aware voxel attention and foreground modulation for 3d semantic scene c | arXiv: 2604.05780
sparvar exploring sparsity in visual autoregressive modeling for training-free a | arXiv: 2602.04361
spatial-ssrl enhancing spatial understanding via self-supervised reinforcement l | arXiv: 2510.27606
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning | arXiv: 2510.27606
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models | arXiv: 2602.20901
spatialstack layered geometry-language fusion for 3d vlm spatial reasoning | arXiv: 2603.27437
Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation | arXiv: 2603.12538
spdmark selective parameter displacement for robust video watermarking | arXiv: 2512.12090
specificity-aware reinforcement learning for fine-grained open-world classificat | arXiv: 2603.03197
spectral defense against resource-targeting attack in 3d gaussian splatting | arXiv: 2603.12796
Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting | arXiv: 2603.12796
Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization | arXiv: 2603.00920
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis | arXiv: 2603.12903
Speed3R: Sparse Feed-forward 3D Reconstruction Models | arXiv: 2603.08055
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists | arXiv: 2603.09277
SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation | arXiv: 2603.11492
SpHOR: A Representation Learning Perspective on Open-set Recognition | arXiv: 2503.08049
SpHOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Neural Networks | arXiv: 2503.08049
SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking | arXiv: 2602.23963
spiraldiff spiral diffusion with lora for rgb-to-raw conversion across cameras | arXiv: 2603.14885
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting | arXiv: 2602.24020
SSR2-GCD: Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery | arXiv: 2602.19910
stable spike dual consistency optimization via bitwise and operations for spikin | arXiv: 2603.11676
stac plug-and-play spatio-temporal aware cache compression for streaming 3d reco | arXiv: 2603.20284
Stake the Points: Structure-Faithful Instance Unlearning | arXiv: 2603.12915
Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging | arXiv: 2603.18834
STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction | arXiv: 2511.19854
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning | arXiv: 2603.11439
STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting | arXiv: 2509.25210
steeldefectx a coarse-to-fine vision-language dataset and benchmark for generali | arXiv: 2603.21824
Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering | arXiv: 2603.13878
STEPH: Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in WSI Prognosis | arXiv: 2603.10526
stepper stepwise immersive scene generation with multiview panoramas | arXiv: 2603.28980
StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives | arXiv: 2602.21273
streamavatar streaming diffusion models for real-time interactive human avatars | arXiv: 2512.22065
streamdit real-time streaming text-to-video generation | arXiv: 2507.03745
streamgaze gaze-guided temporal reasoning and proactive understanding in streami | arXiv: 2512.01707
StreamingTOM: Streaming Token Compression for Efficient Video Understanding | arXiv: 2510.18269
StreamReady: Learning What to Answer and When in Long Streaming Videos | arXiv: 2603.08620
stronger normalization-free transformers | arXiv: 2512.10938
StructXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues | arXiv: 2602.20089
subflot submodel extraction for efficient and personalized federated learning vi | arXiv: 2604.06631
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling | arXiv: 2602.23013
suppressing non-semantic noise in masked image modeling representations | arXiv: 2604.00172
svc 2026 the second multimodal deception detection challenge and the first domai | arXiv: 2604.05748
swift sliding window reconstruction for few-shot training-free generated video a | arXiv: 2603.08536
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation | arXiv: 2603.19053
SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls | arXiv: 2602.23956
symphomotion joint control of camera motion and object dynamics for coherent vid | arXiv: 2604.03723
Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos | arXiv: 2503.22174
t-gated adapter a lightweight temporal adapter for vision-language medical segme | arXiv: 2604.08167
tacsim a dataset and benchmark for football tactical style imitation | arXiv: 2603.25199
tag-moe task-aware gating for unified generative mixture-of-experts | arXiv: 2601.08881
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking | arXiv: 2512.01329
Talking Together: Synthesizing Co-Located 3D Conversations from Audio | arXiv: 2603.08674
TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction | arXiv: 2512.02341
talon test-time adaptive learning for on-the-fly category discovery | arXiv: 2603.08075
Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning | arXiv: 2512.24146
taming sampling perturbations with variance expansion loss for latent diffusion | arXiv: 2603.21085
Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework | arXiv: 2603.10281
taming video models for 3d and 4d generation via zero-shot camera control | arXiv: 2509.15130
TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration | arXiv: 2603.03792
task-oriented data synthesis and control-rectify sampling for remote sensing sem | arXiv: 2512.16740
TAUE: Training-free Noise Transplant and Cultivation Diffusion Model | arXiv: 2511.02580
Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models | arXiv: 2603.00431
TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration | arXiv: 2603.02943
tdatr improving end-to-end table recognition via table detail-aware learning and | arXiv: 2603.22819
team leya in 10th abaw competition multimodal ambivalencehesitancy recognition a | arXiv: 2603.12848
Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach | arXiv: 2603.12848
team ras in 10th abaw competition multimodal valence and arousal estimation appr | arXiv: 2603.13056
Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach | arXiv: 2603.13056
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size | arXiv: 2603.07988
TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models | arXiv: 2511.21145
TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation | arXiv: 2602.19053
TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures | arXiv: 2602.19679
tell model where to look mitigating hallucinations in mllms by vision-guided att | arXiv: 2511.20032
Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model | arXiv: 2603.05012
temporal imbalance of positive and negative supervision in class-incremental lea | arXiv: 2603.02280
terraseg self-supervised ground segmentation for any lidar | arXiv: 2603.27344
Test-Time Attention Purification for Backdoored Large Vision Language Models | arXiv: 2603.12989
test-time ego-exo-centric adaptation for action anticipation via multi-label pro | arXiv: 2603.09798
test-time instance-specific parameter composition a new paradigm for adaptive ge | arXiv: 2603.27665
text-guided fine-grained video anomaly understanding | arXiv: 2511.00524
text-image conditioned 3d generation | arXiv: 2603.21295
Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction | arXiv: 2512.04309
Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval | arXiv: 2603.12711
f2hdr two-stage hdr video reconstruction via flow adapter and physical m | arXiv: 2603.14920
4dsurf high-fidelity dynamic scene surface reconstruction | arXiv: 2603.28064
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering | arXiv: 2602.20903
The Coherence Trap: MLLM-Crafted Narratives Exploit Manipulated Visual Contexts | arXiv: 2505.17476
The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts | arXiv: 2505.17476
the cote score a decomposable framework for evaluating document layout analysis | arXiv: 2603.12718
The Devil is in the Details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection | arXiv: 2512.20340
the golden subspace where efficiency meets generalization in continual test-time | arXiv: 2603.21928
The Invisible Gorilla Effect in Out-of-distribution Detection | arXiv: 2602.20068
the llm bottleneck why open-source vision llms struggle with hierarchical visual | arXiv: 2505.24840
the more the merrier contrastive fusion for higher-order multimodal alignment | arXiv: 2511.21331
The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers | arXiv: 2602.19096
the surprising effectiveness of noise pretraining for implicit neural representa | arXiv: 2603.29034
the universal normal embedding | arXiv: 2603.21786
think 360 evaluating the width-centric reasoning capability of mllms beyond dept | arXiv: 2603.22689
Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding | arXiv: 2603.04977
thinking diffusion penalize and guide visual-grounded reasoning in diffusion mul | arXiv: 2604.05497
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World | arXiv: 2603.12746
TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking | arXiv: 2602.18863
tiger a unified framework for time images and geo-location retrieval | arXiv: 2603.24749
timelens rethinking video temporal grounding with multimodal llms | arXiv: 2512.14698
TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models | arXiv: 2603.17828
tiny inference-time scaling with latent verifiers | arXiv: 2603.22492
tm-bsn triangular-masked blind-spot network for real-world self-supervised image | arXiv: 2604.04484
token reduction via local and global contexts optimization for efficient video l | arXiv: 2603.01400
token warping helps mllms look from nearby viewpoints | arXiv: 2604.02870
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans (HouseMind) | arXiv: 2603.11640
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity | arXiv: 2603.10990
Topo-R1: Detecting Topological Anomalies via Vision-Language Models | arXiv: 2603.13054
topomaskv3 3d mask head with dense offset and height predictions for road topolo | arXiv: 2603.01558
topomesh high-fidelity mesh autoencoding via topological unification | arXiv: 2603.24278
toward generalizable whole brain representations with high-resolution light-shee | arXiv: 2603.29842
toward real-world infrared image super-resolution a unified autoregressive frame
towards balanced multi modal learning in 3d human pose estimation
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation | arXiv: 2501.05264
Towards Calibrating Prompt Tuning of Vision-Language Models | arXiv: 2602.19024
towards context-aware image anonymization with multi-agent reasoning | arXiv: 2603.27817
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data | arXiv: 2508.01450
Towards Faithful Multimodal Concept Bottleneck Models | arXiv: 2603.13163
towards generalizable ai-generated image detection via image-adaptive prompt lea | arXiv: 2508.01603
Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning | arXiv: 2508.01603
towards gui agents vision-language diffusion models for gui grounding | arXiv: 2603.26211
towards high-quality image segmentation improving topology accuracy by penalizin | arXiv: 2603.18671
towards highly transferable vision-language attack via semantic-augmented dynami | arXiv: 2603.04839
Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction | arXiv: 2603.04839
towards intrinsic-aware monocular 3d object detection | arXiv: 2603.27059
Towards Multimodal Domain Generalization with Few Labels | arXiv: 2602.22917
towards open environments and instructions general vision-language navigation vi | arXiv: 2601.09111
towards real-world document parsing via realistic scene synthesis and document-a | arXiv: 2603.23885
towards robust content watermarking against removal and forgery attacks | arXiv: 2604.06662
Towards Source-Aware Object Swapping with Initial Noise Perturbation | arXiv: 2602.23697
Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos | arXiv: 2603.13185
towards training-free scene text editing | arXiv: 2603.24571
towards universal computational aberration correction in photographic cameras a
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast | arXiv: 2506.13387
trace structure-aware character encoding for robust and generalizable document w | arXiv: 2603.12873
TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking | arXiv: 2603.12873
trackmae video representation learning via track mask and predict | arXiv: 2603.27268
training high-level schedulers with execution-feedback reinforcement learning fo | arXiv: 2511.22235
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods | arXiv: 2603.15026
Training-free Motion Factorization for Compositional Video Generation | arXiv: 2603.09104
trajtok learning trajectory tokens enables better video understanding | arXiv: 2602.22779
TrajTok: Learning Trajectory Tokens Enables Better Video Understanding | arXiv: 2602.22779
transformer-based multi-region segmentation and radiomic analysis of hr-pqct ima | arXiv: 2603.09137
Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis | arXiv: 2602.19585
tridf evaluating perception detection and hallucination for interpretable deepfa | arXiv: 2512.10652
TriLite: Efficient WSOL with Universal Visual Features and Tri-Region Disentanglement | arXiv: 2602.23120
trivia self-supervised fine-tuning of vision-language models for table recogniti | arXiv: 2512.01248
TT-Occ: Test-Time 3D Occupancy Prediction | arXiv: 2503.08485
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction | arXiv: 2602.20160
tutor-student reinforcement learning a dynamic curriculum for robust deepfake de | arXiv: 2603.24139
U-F²-CBM: CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models | arXiv: 2503.10981
U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation | arXiv: 2602.23739
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences | arXiv: 2512.02982
ucan unified convolutional attention network for expansive receptive fields in l
uetrack a unified and efficient framework for single object tracking
ufvideo towards unified fine-grained video cooperative understanding with large | arXiv: 2512.11336
ultrasound-clip semantic-aware contrastive pre-training for ultrasound image-tex | arXiv: 2604.01749
unblur-slam dense neural slam for blurry inputs | arXiv: 2603.26810
Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos | arXiv: 2603.00881
uncertainty-aware knowledge distillation for multimodal large language models | arXiv: 2603.21426
uncertainty-guided compositional alignment with part-to-whole semantic represent | arXiv: 2603.22042
understanding and mitigating hallucinations in multimodal chain-of-thought model | arXiv: 2603.27201
understanding task transfer in vision-language models | arXiv: 2511.18787
understanding temporal logic consistency in video-language models through cross- | arXiv: 2510.08138
understanding the role of hallucination in reinforcement post-training of multim | arXiv: 2604.03179
Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation | arXiv: 2511.18281
uniavgen unified audio and video generation with asymmetric cross-modal interact | arXiv: 2511.03334
UNICBench: UNIfied Counting Benchmark for MLLM | arXiv: 2603.00595
UniComp: Rethinking Video Compression Through Informational Uniqueness | arXiv: 2512.03575
unidex a robot foundation suite for universal dexterous hand control from egocen | arXiv: 2603.22264
unified primitive proxies for structured shape completion | arXiv: 2601.00759
unified spatiotemporal token compression for video-llms at ultra-low retention | arXiv: 2603.21957
unified spherical frontend learning rotation-equivariant representations of sphe | arXiv: 2511.18174
unified vector floorplan generation via markup representation | arXiv: 2604.04859
UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation | arXiv: 2603.14214
unigame turning a unified multimodal model into its own adversary | arXiv: 2511.19413
unils end-to-end audio-driven avatars for unified listening and speaking | arXiv: 2512.09327
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark | arXiv: 2603.05075
unimmad unified multi-modal and multi-class anomaly detection via moe-driven fea | arXiv: 2509.25934
UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression | arXiv: 2509.25934
unirain unified image deraining with rag-based dataset distillation and multi-ob
unispector towards universal open-set defect recognition via spectral-contrastiv | arXiv: 2604.02905
unistainnet foundation-model-guided virtual staining of he to ihc | arXiv: 2603.12716
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation | arXiv: 2603.01418
Universal 3D Shape Matching via Coarse-to-Fine Language Guidance | arXiv: 2602.19112
unleashing video language models for fine-grained hrct report generation | arXiv: 2603.12469
unleashing vision-language semantics for deepfake video detection | arXiv: 2603.24454
Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation | arXiv: 2603.05729
unlocking multi-site clinical data a federated approach to privacy-first child a | arXiv: 2604.02616
unlocking positive transfer in incrementally learning surgical instruments a sel | arXiv: 2604.02877
unlocking strong supervision a data-centric study of general-purpose audio pre-t | arXiv: 2603.25767
UnrealPose: Leveraging Game Engine Kinematics for Large-Scale Synthetic Human Pose Data | arXiv: 2601.00991
unsafe2safe controllable image anonymization for downstream utility | arXiv: 2603.28605
unsupervised domain adaptation with target-only margin disparity discrepancy | arXiv: 2603.09932
using gaussian splats to create high-fidelity facial geometry and texture | arXiv: 2512.16397
utptrack towards simple and unified token pruning for visual tracking | arXiv: 2602.23734
UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes | arXiv: 2512.04421
V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs | arXiv: 2511.20223
v-bridge bridging video generative priors to versatile few-shot image restoratio | arXiv: 2603.13089
V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration | arXiv: 2603.13089
V2Drop: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models | arXiv: 2509.01552
vanast virtual try-on with human image animation via synthetic triplet supervisi | arXiv: 2604.04934
Variation-Aware Vision Token Dropping for Faster Large Vision-Language Models | arXiv: 2509.01552
variational garrote for sparse inverse problems
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM | arXiv: 2603.09673
vecattention vector-wise sparse attention for accelerating long context inferenc | arXiv: 2603.29494
VecGlypher: Unified Vector Glyph Generation with Language Models | arXiv: 2602.21461
VeCoR — Velocity Contrastive Regularization for Flow Matching | arXiv: 2511.18942
Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping | arXiv: 2602.23980
verify claimed text-to-image models via boundary-aware prompt optimization | arXiv: 2603.26328
versecrafter dynamic realistic video world model with 4d geometric control | arXiv: 2601.05138
VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale | arXiv: 2602.23361
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving | arXiv: 2602.20794
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection | arXiv: 2603.00912
vggt-slam | arXiv: 2604.06830
video-only tom enhancing theory of mind in multimodal large language models | arXiv: 2603.24484
videoarm agentic reasoning over hierarchical memory for long-form video understa | arXiv: 2512.12360
videoauto-r1 video auto reasoning via thinking once answering twice | arXiv: 2601.05175
videochat-m1 collaborative policy planning for video understanding via multi-age | arXiv: 2511.19524
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning | arXiv: 2511.19524
videocof unified video editing with temporal reasoner | arXiv: 2512.07469
VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion | arXiv: 2503.23359
videomt your vit is secretly also a video segmentation model | arXiv: 2602.17807
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model | arXiv: 2602.17807
videoseek long-horizon video agent with tool-guided seeking | arXiv: 2603.20185
vihoi human-object interaction synthesis with visual priors | arXiv: 2603.24383
Vinedresser3D: Agentic Text-guided 3D Editing | arXiv: 2602.19542
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking | arXiv: 2512.14654
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation | arXiv: 2603.12918
viro robust and efficient neuro-symbolic reasoning with verification for referri | arXiv: 2601.12781
VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection | arXiv: 2603.17470
virst video-instructed reasoning assistant for spatiotemporal segmentation | arXiv: 2603.27060
Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code | arXiv: 2501.18328
VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding | arXiv: 2603.07071
vision on request enhanced vllm efficiency with sparse dynamically selected visi | arXiv: 2603.23495
Vision Transformers Need More Than Registers | arXiv: 2602.22394
vision-language attribute disentanglement and reinforcement for lifelong person | arXiv: 2603.19678
Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning | arXiv: 2603.08921
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models | arXiv: 2603.00207
vistorybench comprehensive benchmark suite for story visualization | arXiv: 2505.24862
visualad language-free zero-shot anomaly detection via vision transformer | arXiv: 2603.07952
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos | arXiv: 2603.04265
VL-RouterBench: A Benchmark for Vision-Language Model Routing | arXiv: 2512.23562
VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery | arXiv: 2602.19180
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models | arXiv: 2603.09826
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm | arXiv: 2512.02700
vrr-qa visual relational reasoning in videos beyond explicit cues | arXiv: 2506.21742
vt-intrinsic physics-based decomposition of reflectance and shading using a sing | arXiv: 2509.10388
WaDi: Weight Direction-aware Distillation for One-step Image Synthesis | arXiv: 2603.08258
walkgpt grounded vision-language conversation with depth-aware segmentation for | arXiv: 2603.10703
wan-weaver interleaved multi-modal generation via decoupled training | arXiv: 2603.25706
wanderland geometrically grounded simulation for open-world embodied ai | arXiv: 2511.20620
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI | arXiv: 2511.20620
watch and learn learning to use computers from online videos | arXiv: 2510.04673
Watch and Learn: Learning to Use Computers from Online Videos | arXiv: 2510.04673
wavelet-based frame selection by detecting semantic boundary for long video unde | arXiv: 2603.00512
weakly supervised teacher-student framework with progressive pseudo-mask refinem | arXiv: 2603.08605
Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation | arXiv: 2603.08605
Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning | arXiv: 2603.00550
WeaveTime: Frame-Level Progressive Memory for Streaming Video LLMs | arXiv: 2602.22142
What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models | arXiv: 2603.00510
what is the optimal ranking score between precision and recall we can always fin | arXiv: 2511.22442
what is wrong with synthetic data for scene text recognition a strong synthetic | arXiv: 2602.06450
What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching? | arXiv: 2504.16930
when identities collapse a stress-test benchmark for multi-subject personalizati | arXiv: 2603.26078
when numbers speak aligning textual numerals and visual instances in text-to-vid | arXiv: 2604.08546
When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models | arXiv: 2511.21192
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance | arXiv: 2602.20880
When to Lock Attention: Training-Free KV Control in Video Diffusion | arXiv: 2603.09657
when to think and when to look uncertainty-guided lookback | arXiv: 2511.15613
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs | arXiv: 2512.07580
when understanding becomes a risk authenticity and safety risks in the emerging | arXiv: 2603.24079
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation | arXiv: 2509.22496
Where, What, Why: Toward Explainable 3D-GS Watermarking | arXiv: 2603.08809
which concepts to forget and how to refuse decomposing concepts for continual un | arXiv: 2603.21484
Why Does It Look There? Structured Explanations for Image Classification | arXiv: 2603.10234
widget2code from visual widgets to ui code via multimodal llms | arXiv: 2512.19918
wildcap facial albedo capture in the wild via hybrid inverse rendering | arXiv: 2512.11237
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval | arXiv: 2602.23029
WMGStereo: What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching? | arXiv: 2504.16930
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training | arXiv: 2509.24948
worldmm dynamic multimodal memory agent for long video reasoning | arXiv: 2512.02425
x-win building chest radiograph world model via predictive sensing | arXiv: 2511.14918
x2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space | arXiv: 2603.16671
xseg a large-scale x-ray contraband segmentation benchmark for real-world securi | arXiv: 2604.03706
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion | arXiv: 2511.18734
Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation | arXiv: 2505.19459
Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image | arXiv: 2603.14772
zina multimodal fine-grained hallucination detection and editing | arXiv: 2506.13130
ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training | arXiv: 2603.13115
ada3drift adaptive training-time drifting for one-step 3d visuomotor robotic man | arXiv: 2603.11984
ada3drift adaptive trainingtime drifting for onest | arXiv: 2603.11984
affostruction 3d affordance grounding with generative reconstruction | arXiv: 2601.09211
apc adversarial point counterattack | arXiv: 2604.15708
cari4d category agnostic 4d reconstruction of human object interaction | arXiv: 2512.11988
cube bspline 3d faces | arXiv: 2604.12894
deepshapematchingkit accelerated functional map solver | arXiv: 2604.10377
fall risk gait analysis hmr | arXiv: 2604.11961
ff3r feedforward feature 3d reconstruction from unconstrained views | arXiv: 2604.09862
freescale scaling 3d scenes | arXiv: 2604.10512
iris bringing realworld priors into diffusion model for monocular depth estimation | arXiv: 2603.16340
long scope fully sparse long range cooperative 3d perception | arXiv: 2604.09206
lumimotion gaussian relighting dynamics | arXiv: 2604.10994
marco semantic correspondence | arXiv: 2604.18267
neural gabor splatting | arXiv: 2604.15941
ng gs nerf guided 3d gaussian splatting segmentation | arXiv: 2604.14706
nimbusgs unified 3d scene reconstruction under hybrid weather | arXiv: 2603.27228
pointins instance-aware self-supervised learning for point clouds | arXiv: 2603.25165
reliev3r relieving feed-forward 3d reconstruction from multi-view geometric annot | arXiv: 2604.00548
rewis3d reconstruction improves weakly-supervised semantic segmentation | arXiv: 2603.06374
rewis3d reconstruction improves weaklysupervised s | arXiv: 2603.06374
rng unified transformer complete 3d modeling partial observations | arXiv: 2603.01194
sasnet spatially adaptive sinusoidal networks for inrs | arXiv: 2503.09750
sepatch3d revisiting token compression for accelerating vit based sparse 3d detectors | arXiv: 2604.14563
sgi structured 2d gaussians large image representation | arXiv: 2603.07789
sgs-intrinsic semantic-invariant gaussian splatting for sparse-view indoor invers | arXiv: 2603.27516
sts mixer 4d point cloud | arXiv: 2604.11637
tco learning 3d reconstruction with priors in test time | arXiv: 2604.03878
towards spatio-temporal world scene graph generation from monocular videos | arXiv: 2603.13185
unisplat 3d representations unposed | arXiv: 2604.10573
clustermark robust watermarking autoregressive image generators | arXiv: 2508.06656
clustermark towards robust watermarking for autoregressive image generators with | arXiv: 2508.06656
logitdynamics vit error detection | arXiv: 2604.10643
mcsd uncertainty estimation | arXiv: 2604.12719
one-to-more high-fidelity training-free anomaly generation with attention control | arXiv: 2603.18093
team leya in 10th abaw competition multimodal ambi | arXiv: 2603.12848
unim a unified any-to-any interleaved multimodal benchmark | arXiv: 2603.05075
vidscribe multimodal ai customizing audio description videos | arXiv: 2603.14662
vidscribe multimodal ai for customizing audio description and question answering | arXiv: 2603.14662
a prediction-as-perception framework for 3d object detection | arXiv: 2603.12599
a predictionasperception framework for 3d object d | arXiv: 2603.12599
c2t llm traffic coordination | arXiv: 2604.13098
climaood improving anomaly segmentation via physically realistic synthetic data | arXiv: 2512.02686
den tp a density balanced data curation and evaluation framework for trajectory | arXiv: 2409.17385
fedbprompt federated domain generalization person | arXiv: 2603.12912
fedbprompt federated domain generalization person re-identification via body dis | arXiv: 2603.12912
igasa integrated geometry-aware and skip-attention modules for enhanced point cl | arXiv: 2603.12719
igasa integrated geometryaware and skipattention m | arXiv: 2603.12719
leader lidar relocalization | arXiv: 2604.11355
mapgclr geospatial contrastive learning of represe | arXiv: 2603.10688
mapgclr geospatial contrastive learning of representations for online vectorized | arXiv: 2603.10688
neural distribution prior for lidar ood detection | arXiv: 2604.09232
open-vocabulary domain generalization in urban-scene segmentation | arXiv: 2602.18853
sparseworld tc trajectory conditioned sparse occupancy world model | arXiv: 2511.22039
traffic scene generation from natural language description for autonomous vehicl | arXiv: 2409.09575
ttsg text to traffic scene generation from natural language | arXiv: 2409.09575
vla world learning vision language action world models for autonomous driving | arXiv: 2604.09059
cipher counterfactual diffusion hallucination sup | arXiv: 2603.10470
codepercept code-grounded visual stem perception for mllms | arXiv: 2603.10757
codepercept codegrounded visual stem perception fo | arXiv: 2603.10757
geotikzbridge advancing multimodal code generation for geometric perception and | arXiv: 2603.22687
mm-recoder advancing chart-to-code generation with reinforcement learning and se | arXiv: 2604.01600
evolutionary multimodal reasoning via hierarchical semantic representation for i | arXiv: 2603.03827
m3kg rag multi hop multimodal knowledge graph enhanced retrieval augmented genera | arXiv: 2512.20136
a two stage dual modality model for facial expression recognition | arXiv: 2603.12221
decovln decoupling observation reasoning and correction for vision-and-language | arXiv: 2603.13133
efficient onboard spacecraft pose estimation with event cameras and neuromorphic hardware | arXiv: 2604.04117
from 2d alignment to 3d plausibility unifying hete | arXiv: 2503.17788
fsmc-pose frequency and spatial fusion with multiscale selfcalibration for cattle | arXiv: 2603.16596
fsmc pose cattle mounting pose estimation | arXiv: 2603.16596
fsmc pose frequency spatial cattle mounting pose | arXiv: 2603.16596
handdreamer zero shot text to 3d hand model generation | arXiv: 2604.04425
hum4d markerless motion capture | arXiv: 2604.12765
l2gtx from local to global time series explanation | arXiv: 2603.13065
l2gtx from local to global time series explanations | arXiv: 2603.13065
lca large-scale codec avatars the unreasonable effectiveness of large-scale avata | arXiv: 2604.02320
mmgait multi modal gait recognition | arXiv: 2604.15979
molingo motion-language alignment for text-to-motion generation | arXiv: 2512.13840
quantvla scale-calibrated post-training quantization for vision-language-action | arXiv: 2602.20309
ram recover any 3d human motion in-the-wild | arXiv: 2603.19929
reference-free image quality assessment for virtual try-on via human feedback | arXiv: 2603.13057
referencefree image quality assessment for virtual | arXiv: 2603.13057
regformer transferable relational grounding for weakly-supervised hoi detection | arXiv: 2604.00507
rppg vqa video quality assessment | arXiv: 2604.11156
team ras in 10th abaw competition multimodal valen | arXiv: 2603.13056
4dsurf high-fidelity dynamic scene surface reconstruction | arXiv: 2603.28064
vibes a conversational agent with behaviorally intelligent 3d virtual body | arXiv: 2512.14234
ahs adaptive head synthesis | arXiv: 2604.15857
circuit mechanisms for spatial relation generation in diffusion models | arXiv: 2601.06338
cognitioncapturerpro towards high-fidelity visual decoding from eegmeg via multi | arXiv: 2603.12722
cognitioncapturerpro towards highfidelity visual d | arXiv: 2603.12722
craft aligning diffusion models with finetuning is easier than you think | arXiv: 2603.18991
dcw snr t bias diffusion | arXiv: 2604.16044
deco frequency-decoupled pixel diffusion for end-to-end image generation | arXiv: 2511.19365
depthvar depth adaptive var | arXiv: 2604.17286
dit-ic aligned diffusion transformer for efficient image compression | arXiv: 2603.13162
ditic aligned diffusion transformer for efficient | arXiv: 2603.13162
editing away the evidence diffusion-based image manipulation and the failure mod | arXiv: 2603.12949
editing away the evidence diffusionbased image man | arXiv: 2603.12949
emf meanflow text to image | arXiv: 2604.18168
evatok adaptive length video tokenization for eff | arXiv: 2603.12267
fdeidtoolbox face deidentification toolbox | arXiv: 2603.13121
fractals made practical denoising diffusion as par | arXiv: 2603.13069
fractals made practical denoising diffusion as partitioned iterated function sys | arXiv: 2603.13069
freqflow frequency aware flow matching | arXiv: 2604.15521
gist towards design compositing | arXiv: 2604.14605
groce graph-guided online concept erasure for text-to-image diffusion models | arXiv: 2511.12968
haltnav reactive visual halting over lightweight t | arXiv: 2603.12696
haltnav reactive visual halting over lightweight topological priors for robust v | arXiv: 2603.12696
intra finger variability of diffusion based latent fingerprint generation | arXiv: 2604.10040
leapalign post training flow matching models at any generation step | arXiv: 2604.15311
multibanana a challenging benchmark for multi reference text to image generation | arXiv: 2511.22989
oars processaware online alignment for generative | arXiv: 2603.12811
pixeldit pixel diffusion transformers for image generation | arXiv: 2511.20645
smoothing score function generalization diffusion models | arXiv: 2601.19285
smoothing the score function for generalization in diffusion models | arXiv: 2601.19285
tokenlight precise lighting control in images using attribute tokens | arXiv: 2604.15310
vosr a vision only generative model for image super resolution | arXiv: 2604.03225
yoeo you only erase once erasing anything without bringing unexpected content | arXiv: 2603.27599
drfusion degradation robust fusion via degradation aware diffusion framework | arXiv: 2604.08922
evlf early vision-language fusion for generative dataset distillation | arXiv: 2603.07476
finpercep rm a fine grained reward model and co evolutionary curriculum for rl ba | arXiv: 2512.22647
finpercep rm fine grained reward model rl super resolution | arXiv: 2512.22647
gsnr graph smooth null space representation for inverse problems | arXiv: 2602.20328
ia clahe image adaptive clip limit | arXiv: 2604.16010
ntire 2026 ai flash portrait challenge | arXiv: 2604.11230
ntire 2026 raindrop removal challenge | arXiv: 2604.10634
rar restore assess repeat a unified framework for iterative image restoration | arXiv: 2603.26385
real iisr infrared image super resolution autoregressive | arXiv: 2603.04745
sat selective aggregation transformer for image super resolution | arXiv: 2604.07994
selfhvd self-supervised handheld video deblurring | arXiv: 2508.08605
shadow removal cascaded refinement | arXiv: 2604.16177
ucan unified convolutional attention lightweight sr | arXiv: 2603.11680
udapose unsupervised domain adaptation for low light human pose estimation | arXiv: 2604.10485
uniblendnet unified global multi scale and region adaptive modeling for ambient lighting normalization | arXiv: 2604.13383
unicac universal computational aberration correction | arXiv: 2603.12083
unicac universal computational aberration correction benchmark | arXiv: 2603.12083
unirain unified image deraining rag dataset distillation | arXiv: 2603.03967
unirain unified image deraining with rag based dataset distillation and multi obje | arXiv: 2603.03967
beyond global similarity towards fine-grained multi-condition multimodal retriev | arXiv: 2603.01082
cc-vqa conflict- and correlation-aware method for mitigating knowledge conflict | arXiv: 2602.23952
explaining clip zero-shot predictions through concepts | arXiv: 2603.28211
m4-rag a massive-scale multilingual multi-cultural multimodal rag | arXiv: 2512.05959
mind the way you select negative texts pursuing the distance consistency in ood | arXiv: 2603.02618
muco multi-turn contrastive learning for multimodal embedding model | arXiv: 2602.06393
nanovdr distilling a 2b vision-language retriever into a 70m text-only encoder f | arXiv: 2603.12824
nanovdr distilling a 2b visionlanguage retriever i | arXiv: 2603.12824
robustvisrag causality-aware vision-based retrieval-augmented generation under v | arXiv: 2602.22013
beyond semantics disentangling information scope in sparse autoencoders for clip | arXiv: 2604.05724
beyond the fold quantifying split-level noise and the case for leave-one-dataset | arXiv: 2604.02162
ciice intrinsic concept extraction compositional | arXiv: 2603.11795
cut to the chase training-free multimodal summarization via chain-of-events | arXiv: 2603.06213
dino-qpm adapting visual foundation models for globally interpretable image clas | arXiv: 2604.07166
draft and refine with visual experts | arXiv: 2511.11005
edit-as-act goal-regressive planning for open-vocabulary 3d indoor scene editing | arXiv: 2603.17583
emoverse a mllms-driven emotion representation dataset for interpretable visual | arXiv: 2511.12554
emoverse mllm emotion representation dataset | arXiv: 2511.12554
ermoe eigen-reparameterized mixture-of-experts for stable routing | arXiv: 2511.10971
feature attribution stability suite how stable are post-hoc attributions | arXiv: 2604.02532
finer mllms hallucinate under fine-grained negative queries | arXiv: 2603.17662
from weights to concepts data-free interpretability of clip via singular vector | arXiv: 2603.24653
geometry-guided camera motion understanding in videollms | arXiv: 2603.13119
geometryguided camera motion understanding in vide | arXiv: 2603.13119
how to take a memorable picture empowering users with actionable feedback | arXiv: 2602.21877
inside-out measuring generalization in vision transformers through inner working | arXiv: 2604.08192
language models can explain visual features via steering | arXiv: 2603.22593
measuring the unfaithfulness of concept-based explanations | arXiv: 2504.10833
missing no more dictionary-guided cross-modal image fusion under missing infrare | arXiv: 2603.08018
neurodynamics-driven coupled neural p systems for multi-focus image fusion | arXiv: 2509.17704
on the possible detectability of image-in-image steganography | arXiv: 2603.11876
on the possible detectability of imageinimage steg | arXiv: 2603.11876
pixel2phys distilling governing laws from visual dynamics | arXiv: 2602.19516
reallocating attention across layers to reduce multimodal hallucination | arXiv: 2510.10285
reallocating attention reduce hallucination | arXiv: 2510.10285
recursive think-answer process for llms and vlms | arXiv: 2603.02099
safedrive fine-grained safety reasoning for end-to-end driving in a sparse world | arXiv: 2602.18887
subspacead training-free few-shot anomaly detection via subspace modeling | arXiv: 2602.23013
tdatr improving end-to-end table recognition via table detail-aware learning and | arXiv: 2603.22819
text-guided fine-grained video anomaly understanding | arXiv: 2511.00524
towards faithful multimodal concept bottleneck models | arXiv: 2603.13163
viro robust and efficient neuro-symbolic reasoning with verification for referri | arXiv: 2601.12781
where mllms attend and what they rely on explaining autoregressive token generat | arXiv: 2509.22496
why does it look there structured explanations for image classification | arXiv: 2603.10234
attribution-guided model rectification of unreliable neural network behaviors | arXiv: 2603.15656
argos agentic multi camera person search | arXiv: 2604.12762
echotrail-gui building actionable memory for gui agents | arXiv: 2512.19396
epiagent agent centric system for ancient inscription restoration | arXiv: 2604.09367
gen n val agentic image data generation and validation | arXiv: 2506.04676
haven hierarchical long video understanding audiovisual entity | arXiv: 2601.13719
haven hierarchical long video understanding with audiovisual entity cohesion | arXiv: 2601.13719
nerfify multiagent nerf paper to code | arXiv: 2603.00805
bias reward models t2i | arXiv: 2604.13305
adabet gradient-free layer selection for efficient training of deep neural netwo | arXiv: 2510.03101
cross-scale pansharpening via scaleformer and the panscale benchmark | arXiv: 2603.00543
cryohype reconstructing a thousand cryo-em structures with transformer-based hyp | arXiv: 2512.06332
enhancing out-of-distribution detection with extended logit normalization | arXiv: 2504.11434
flow3r factored flow prediction for scalable visual geometry learning | arXiv: 2602.20157
free-grained hierarchical visual recognition | arXiv: 2510.14737
hess head sensitivity score for sparsity redistribution in vggt | arXiv: 2603.25336
hier-cos making deep features hierarchy-aware via composition of orthogonal subs | arXiv: 2503.07853
hiercos making deep features hierarchyaware via co | arXiv: 2503.07853
hycal training free prototype calibration for cross discipline fscil | arXiv: 2604.15678
out of sight out of mind evaluating state evolutio | arXiv: 2603.13215
out of sight out of mind evaluating state evolution in video world models | arXiv: 2603.13215
pioneering perceptual video fluency assessment a novel task with benchmark datas | arXiv: 2603.26055
r2g multi view circuit graph benchmark suite from rtl to gdsii | arXiv: 2604.08810
reflexsplit single image reflection separation via layer fusion-separation | arXiv: 2601.17468
reframing long-tailed learning via loss landscape geometry | arXiv: 2603.21217
sattc structure-aware label-free test-time calibration for cross-subject eeg-to- | arXiv: 2603.20738
semi-supervised conformal prediction with unlabeled nonconformity score | arXiv: 2505.21147
sparsecam4d spatio-temporally consistent 4d reconstruction from sparse cameras | arXiv: 2603.26481
tacsim a dataset and benchmark for football tactical style imitation | arXiv: 2603.25199
temporal imbalance of positive and negative supervision in class-incremental lea | arXiv: 2603.02280
vga bench unified benchmark for video aesthetics and generation quality | arXiv: 2604.10127
weakly supervised video anomaly detection with anomaly-connected components and | arXiv: 2603.00550
bi cmpstereo bidirectional cross modal prompting for event frame asymmetric stereo | arXiv: 2604.15312
cops conditional prompt synthesis for zero-shot anomaly detection | arXiv: 2508.03447
perception programs visual tool reasoning | arXiv: 2604.12896
sign language recognition llms | arXiv: 2604.11225
defending unauthorized model merging via dual-stage weight protection | arXiv: 2511.11851
evidential transformation network post hoc uncertainty estimation | arXiv: 2604.08627
flowmotion training-free flow guidance for video motion transfer | arXiv: 2603.06289
linking modality isolation in heterogeneous collaborative perception | arXiv: 2603.00609
lottiegpt vector animation generation | arXiv: 2604.11792
mxnorm reusing mxfp block scales for efficient ten | arXiv: 2603.13180
mxnorm reusing mxfp block scales for efficient tensor normalisation | arXiv: 2603.13180
watch and learn computer use from videos | arXiv: 2510.04673
watch and learn learning to use computers from online videos | arXiv: 2510.04673
graze grounded refinement and motion-aware zero-shot generation | arXiv: 2604.01383
latent chain-of-thought world modeling for end-to-end autonomous driving | arXiv: 2512.10226
association and consolidation evolutionary memory-enhanced incremental multi-vie | arXiv: 2509.14544
blind spot of adaptation quantifying and mitigating forgetting in fine tuned driving models | arXiv: 2604.04857
damp class unlearning via depth aware removal of forget specific directions | arXiv: 2604.15166
designing to forget deep semi-parametric models for unlearning | arXiv: 2603.22870
elastic weight consolidation done right for continual learning | arXiv: 2603.18596
learning from oblivion predicting knowledge overflowed weights via retrodiction | arXiv: 2508.05059
∅ source models leak what they shouldnt ↛ unlearning zero-shot tr | arXiv: 2604.08238
select hypothesize and verify towards verified neuron concept interpretation | arXiv: 2603.24953
sineproject machine unlearning for stable vision language alignment | arXiv: 2511.18444
addressing data scarcity in 3d trauma detection th | arXiv: 2603.12514
addressing data scarcity in 3d trauma detection through self-supervised and semi | arXiv: 2603.12514
apex adaptive visual prompting | arXiv: 2604.17455
cloe expert consistency learning for missing modal | arXiv: 2603.09316
cloe expert consistency learning for missing modality segmentation | arXiv: 2603.09316
decoupling vision and language codebook anchored visual adaptation | arXiv: 2602.19449
deep learningbased assessment of the relation betw | arXiv: 2603.11850
developing foundation models for universal segment | arXiv: 2603.11627
developing foundation models for universal segmentation from 3d whole-body posit | arXiv: 2603.11627
emad evidence-centric grounded multimodal diagnosis for alzheimers disease | arXiv: 2602.19178
equivania a spectral method for rotation-equivariant anisotropic image analysis | arXiv: 2603.11294
equivania a spectral method for rotationequivarian | arXiv: 2603.11294
event level detection of surgical instrument handovers in videos | arXiv: 2604.07577
forecasting epileptic seizures from contactless ca | arXiv: 2603.12887
forecasting epileptic seizures from contactless camera via cross-species transfe | arXiv: 2603.12887
forge continual learning for fmri based brain disorder diagnosis | arXiv: 2604.14259
gleam a multimodal imaging dataset and hamm for gl | arXiv: 2603.12800
human knowledge integrated multi-modal learning for single source domain general | arXiv: 2603.12369
human knowledge integrated multimodal learning for | arXiv: 2603.12369
invad inversion-based reconstruction-free anomaly detection with diffusion model | arXiv: 2504.05662
invad inversionbased reconstructionfree anomaly de | arXiv: 2504.05662
lemon a large endoscopic monocular dataset and foundation model for perception in | arXiv: 2503.19740
lemon large endoscopic monocular dataset foundation model surgical | arXiv: 2503.19740
relativeflow taming medical image denoising learning with noisy reference | arXiv: 2604.15459
residual sodap residual self-organizing domain-adaptive prompting with structura | arXiv: 2603.12816
residual sodap residual selforganizing domainadapt | arXiv: 2603.12816
robust fair disease diagnosis in ct images | arXiv: 2604.09710
sd fsmis adapting stable diffusion for few shot medical image segmentation | arXiv: 2604.03134
semitooth a generalizable semi-supervised framework for multi-source tooth segme | arXiv: 2603.11616
semitooth a generalizable semisupervised framework | arXiv: 2603.11616
transformer-based multi-region segmentation and radiomic analysis of hr-pqct ima | arXiv: 2603.09137
uncertainty-aware concept and motion segmentation for semi-supervised angiograph | arXiv: 2603.00881
4d rgpt toward region level 4d understanding via perceptual distillation | arXiv: 2512.17012
adversarial concept distillation for one-step diffusion personalization | arXiv: 2510.20512
batch loss score for dynamic data pruning | arXiv: 2604.04681
enhancing mixture of experts specialization via cluster aware upcycling | arXiv: 2604.13508
flashvggt efficient and scalable visual geometry transformers with compressed descr | arXiv: 2512.01540
frequency switching mechanism for parameter-efficient multi-task learning | arXiv: 2603.21111
iapl ai-generated image detection adaptive prompt | arXiv: 2508.01603
llava-le large language-and-vision assistant for lunar exploration | arXiv: 2603.24696
mame and mare matrix based token merging and restoration for efficient visual perception and synthesis | arXiv: 2604.13432
memory efficient transfer learning with fading side networks | arXiv: 2604.09088
mine-jepa in-domain self-supervised learning for mine-like object classification | arXiv: 2604.00383
opad adversarial concept distillation for one-step diffusion personalization | arXiv: 2510.20512
rdvq differentiable vq image compression | arXiv: 2604.10546
understanding and enforcing weight disentanglement in task arithmetic | arXiv: 2604.17078
wpt world-to-policy transfer via online world model distillation | arXiv: 2511.20095
mmtit-bench a multilingual and multi-scenario benchmark with cognition-perceptio | arXiv: 2603.23896
sea-vision a multilingual benchmark for comprehensive document and scene text un | arXiv: 2603.15409
aif adaptive information flow vlm | arXiv: 2604.15809
av speakerbench audiovisual human speech understanding mllms | arXiv: 2512.02231
ava vla improving vision language action models with active visual attention | arXiv: 2511.18960
biclip domain canonicalization via structured geometric transformation | arXiv: 2603.08942
coat cbm concept wise attention | arXiv: 2604.15748
comp collaborative multi-mode pruning for vision-language models | arXiv: 2604.02956
cropvlm learning to zoom for fine grained vision language perception | arXiv: 2511.19820
dictionary aligned concept control for safeguarding multimodal llms | arXiv: 2604.08846
do vision language models need to process image tokens | arXiv: 2604.09425
docseeker long document understanding | arXiv: 2604.12812
dsert roll robust multi modal perception for diverse driving conditions | arXiv: 2604.03685
ebmc multimodal sentiment analysis | arXiv: 2604.12518
fairllava fairness-aware parameter-efficient fine-tuning for large vision-langua | arXiv: 2603.26008
flowcomposer composable flows for compositional zero-shot learning | arXiv: 2603.16641
flowhijack dynamics aware backdoor attack on flow matching vla models | arXiv: 2604.09651
g mixer geodesic mixup based implicit semantic expansion for zero shot cir | arXiv: 2604.14710
hog layout hierarchical 3d scene generation optimization and editing | arXiv: 2604.10772
isoclip decomposing clip projectors for efficient intramodal alignment | arXiv: 2603.19862
kec hierarchical textual knowledge clustering | arXiv: 2604.11144
lfpc learning to focus and precise cropping for mllms | arXiv: 2603.27494
medic-ad towards medical vision-language models clinical intelligence | arXiv: 2603.27176
mmrad multimodal anomaly detection | arXiv: 2604.10971
modix positional index scaling | arXiv: 2604.12537
mupo all roads lead to rome incentivizing divergent thinking in vlms | arXiv: 2604.00479
nano-emox unifying multimodal emotional intelligence from perception to empathy | arXiv: 2603.02123
noiseaware fewshot learning through bidirectional | arXiv: 2603.11617
paddleocr-vl boosting document parsing efficiency and performance with coarse | arXiv: 2603.24326
paddleocr vl coarse to fine document parsing | arXiv: 2603.24326
paddleocr vl document parsing coarse to fine visual processing | arXiv: 2603.24326
personavlm long term personalized multimodal llms | arXiv: 2604.13074
physisinone visual physics learning and reasoning in one suite | arXiv: 2604.09415
pop proof of perception conformal reasoning | arXiv: 2603.00324
rehearsevla simulated post-training for vlas with physically-consistent world mo | arXiv: 2509.24948
rehearsevla simulated posttraining world model | arXiv: 2509.24948
relational visual similarity | arXiv: 2512.07833
responses fall short of understanding gap between internal representations and responses in vdu | arXiv: 2604.04411
scipostgen bridging the gap between scientific papers and poster layouts | arXiv: 2511.22490
seatrack multimodal tracker | arXiv: 2604.12502
see hear and understand benchmarking audiovisual human speech understanding in mul | arXiv: 2512.02231
seeing through touch tactile localization | arXiv: 2604.11579
spatialscore towards comprehensive evaluation for spatial intelligence | arXiv: 2505.17012
think 360 evaluating the width-centric reasoning capability of mllms beyond dept | arXiv: 2603.22689
tipsv2 patch text alignment | arXiv: 2604.12012
treeteaming autonomous red-teaming of vision-language models via hierarchical s | arXiv: 2603.22882
treeteaming autonomous red teaming vlm strategy exploration | arXiv: 2603.22882
treeteaming autonomous red teaming vlm strategy tree | arXiv: 2603.22882
unbiased dynamic multimodal fusion | arXiv: 2603.19681
vecglypher unified vector glyph generation with language models | arXiv: 2602.21461
vikey enhancing temporal understanding in videos via visual prompting | arXiv: 2603.23186
vs bench evaluating vlms for strategic abilities in multi agent environments | arXiv: 2506.02387
weavetime streaming video llm memory | arXiv: 2602.22142
beyond global scores fine grained token grounding as robust detector of lvlm hallucinations | arXiv: 2604.04863
detecting unknown objects via energy-based separation | arXiv: 2603.29954
dreamvideo-omni omni-motion controlled multi-subject video customization with la | arXiv: 2603.12257
dreamvideoomni omnimotion controlled multisubject | arXiv: 2603.12257
geobridge semantic-anchored multi-view foundation model for geo-localization | arXiv: 2512.02697
herod heuristic inspired reasoning data efficient rod | arXiv: 2603.24166
mitigating memorization in text-to-image diffusion via region-aware prompt augme | arXiv: 2603.13070
mitigating memorization in texttoimage diffusion v | arXiv: 2603.13070
paq-detr learning pattern and quality-aware dynamic queries for object detection | arXiv: 2603.06917
radar closedloop robotic data generation via seman | arXiv: 2603.11811
rehark refined hybrid adaptive rbf kernels for rob | arXiv: 2603.11542
slice semantic latent injection via compartmentali | arXiv: 2603.12749
uavgen visual prototype conditioned focal region generation for uav based object detection | arXiv: 2604.02966
enhancing visual representation with textual semantics textual semantics powered p | arXiv: 2503.13543
fedtsp textual semantics powered prototypes heterogeneous fl | arXiv: 2503.13543
otprune distribution-aligned visual token pruning via optimal transport | arXiv: 2602.20205
crowdsourcing of real world image annotation via visual properties | arXiv: 2604.14449
do vision models perceive illusory motion in static images like humans | arXiv: 2604.09853
feat federated geometry aware correction for exemplar replay under continual dynamic heterogeneity | arXiv: 2604.08617
lovif 2026 semantic quality assessment challenge | arXiv: 2604.11207
myovision a mobile research tool and neatboost attention ensemble framework | arXiv: 2604.13456
omnifood8k nutrition estimation | arXiv: 2604.12356
sldprtnet a largescale multimodal dataset for cad | arXiv: CAD generation
v nutri nutrition estimation cooking videos | arXiv: 2604.11913
vit3 unlocking test time training in vision | arXiv: 2512.01643
qkd quantum gated incremental learning | arXiv: 2604.11112
linking perception confidence and accuracy in mllms | arXiv: 2603.12149
msrl scaling generative multimodal reward modeling | arXiv: 2603.25108
conflated inverse urban vegetation | arXiv: 2604.13028
geoflow real-time fine-grained cross-view geolocalization | arXiv: 2603.21943
geommbench and geommagent toward expert level multimodal intelligence in geoscience and remote sensing | arXiv: 2604.08896
pretrained image matchers for sar optical satellite registration | arXiv: 2604.10217
cyclemanip enabling cyclic task manipulation via effective historical percepti | arXiv: 2512.01022
deepsketcher internalizing visual manipulation for multimodal reasoning | arXiv: 2509.25866
diagnose correct and learn from manipulation failures | arXiv: 2512.02787
enc-bench a benchmark for evaluating multimodal large language models in electro | arXiv: 2603.22763
finecog nav fine grained cognitive modules for zero shot uav navigation | arXiv: 2604.16298
igen scalable data generation for robot learning from open-world images | arXiv: 2512.01773
sapave active perception manipulation vla roboti | arXiv: 2603.12193
strnet visual navigation with spatio-temporal representation through dynamic gra | arXiv: 2604.02829
boundary segment action segmentation | arXiv: 2604.01859
empowering semantic-sensitive underwater image enhancement with vlm | arXiv: 2603.12773
empowering semanticsensitive underwater image enha | arXiv: 2603.12773
geomprompt rgbd segmentation | arXiv: 2604.11585
low data supervised adaptation outperforms prompting for cloud segmentation | arXiv: 2604.08956
occsam bench occlusion robustness segmentation | arXiv: 2604.11711
pca-seg revisiting cost aggregation for open-vocabulary semantic and part segmentat | arXiv: 2603.17520
pca seg cost aggregation open vocabulary segmentation | arXiv: 2603.17520
pca seg parallel cost aggregation open vocabulary segmentation | arXiv: 2603.17520
pixdlm uav reasoning segmentation | arXiv: 2604.15670
sddf specificity-driven dynamic focusing for open-vocabulary camouflaged object | arXiv: 2603.26109
wsrvos weakly supervised rvos | arXiv: 2604.17797
a stitch in time learning procedural workflow via self supervised plackett luce r | arXiv: 2511.17805
an optimal transport driven approach for cultivating latent space in online incr | arXiv: 2211.16780
com pt chain of models pretraining | arXiv: 2604.12391
group dinomics incorporating people dynamics into dino for self supervised group activity feature learning | arXiv: 2604.04467
momo mars orbital model foundation model for mars orbital applications | arXiv: 2604.02719
omnigcd abstracting generalized category discovery for modality agnosticism | arXiv: 2604.14762
otc optimal transport cultivating latent space online incremental learning | arXiv: 2211.16780
redepth anything test-time depth refinement via self-supervised re-lighting | arXiv: 2512.17908
robustness of vision foundation models to common perturbations | arXiv: 2604.14973
unigeoclip geospatial contrastive | arXiv: 2604.11668
zero ablation overstates register content dependence in dino vision transformers | arXiv: 2604.14433
clay conditional visual similarity | arXiv: 2604.11539
as language models scale low-order linear depth dynamics emerge | arXiv: 2603.12541
as language models scale loworder linear depth dyn | arXiv: 2603.12541
learning from synthetic data via provenance-based input gradient guidance | arXiv: 2604.02946
revisiting unknowns towards effective and efficient open-set active learning | arXiv: 2603.07898
activityforensics a comprehensive benchmark for localizing manipulated activity | arXiv: 2604.03819
anti-i2v safeguarding your photos from malicious image-to-video generation | arXiv: 2603.24570
autocut end-to-end advertisement video editing based on multimodal discretizatio | arXiv: 2603.28366
chain of event-centric causal thought for physically plausible video generation | arXiv: 2603.09094
compressed-domain-aware online video super-resolution | arXiv: 2603.07694
cubecomposer spatio-temporal autoregressive 4k 360 video generation from perspec | arXiv: 2603.04291
diff4splat controllable 4d scene generation with latent dynamic reconstruction m | arXiv: 2511.00503
disca accelerating video diffusion transformers wi | arXiv: 2602.05449
disca accelerating video diffusion transformers with distillation-compatible lea | arXiv: 2602.05449
dreamshot storyboard synthesis | arXiv: 2604.17195
drivelaw unifying planning and video generation in a latent driving world | arXiv: 2512.23421
fastlightgen fast and light video generation with fewer steps and parameters | arXiv: 2603.01685
first frame is the place to go for video content customization | arXiv: 2511.15700
flashmotion few-step controllable video generation with trajectory guidance | arXiv: 2603.12146
flashmotion fewstep controllable video generation | arXiv: 2603.12146
free-lunch long video generation via layer-adaptive ood correction | arXiv: 2603.25209
from static to dynamic exploring self-supervised image-to-video representation t | arXiv: 2603.26597
generative neural video compression via video diffusion prior | arXiv: 2512.05016
geometry-as-context modulating explicit 3d in scene-consistent video generation | arXiv: 2602.21929
gloria consistent character video generation via content anchors | arXiv: 2603.29931
goal-driven reward by video diffusion models for reinforcement learning | arXiv: 2512.00961
identity-preserving image-to-video generation via reward-guided optimization | arXiv: 2510.14255
infinity-rope action-controllable infinite video generation emerges from autoreg | arXiv: 2511.20649
interpretable motion-attentive maps spatio-temporally localizing concepts in vid | arXiv: 2603.02919
lamp language-assisted motion planning for controllable video generation | arXiv: 2512.03619
let your image move with your motion -- implicit multi-object multi-motion trans | arXiv: 2603.01000
lighting-grounded video generation with renderer-based agent reasoning | arXiv: 2604.07966
lightmover generative light movement with color and intensity controls | arXiv: 2603.27209
linvideo a post-training framework towards O(n) attention in efficient video gener | arXiv: 2510.08318
linvideo linear attention video generation | arXiv: 2510.08318
moviedrive multimodal multiview video diffusion | arXiv: 2508.14327
moviedrive urban scene synthesis with multi-modal multi-view video diffusion tra | arXiv: 2508.14327
neoverse enhancing 4d world model with in-the-wild monocular videos | arXiv: 2601.00393
nova sparse control dense synthesis for pair-free video editing | arXiv: 2603.02802
orbital video 3d foundation priors | arXiv: 2604.12309
pam a pose-appearance-motion engine for sim-to-real hoi video generation | arXiv: 2603.22193
performrecast expression and head pose disentanglement for portrait video editin | arXiv: 2603.19731
phantom physics-infused video generation via joint modeling of visual and latent | arXiv: 2604.08503
physical simulator in-the-loop video generation | arXiv: 2603.06408
posegen in-context lora finetuning for pose-controllable long human video genera | arXiv: 2508.05091
rethinking position embedding as a context controller for multi-reference and mu | arXiv: 2604.03738
seeu seeing the unseen world via 4d dynamics-aware generation | arXiv: 2512.03350
semantic satellite communications for synchronized | arXiv: 2603.10791
semantic satellite communications for synchronized audiovisual reconstruction | arXiv: 2603.10791
slvmeval synthetic meta evaluation benchmark for text-to-long video generation | arXiv: 2603.29186
streamdit real-time streaming text-to-video generation | arXiv: 2507.03745
swift sliding window reconstruction for few-shot training-free generated video a | arXiv: 2603.08536
switchcraft training-free multi-event video generation with attention controls | arXiv: 2602.23956
symphomotion joint control of camera motion and object dynamics for coherent vid | arXiv: 2604.03723
tear temporal-aware automated red-teaming for text-to-video models | arXiv: 2511.21145
the devil is in the details enhancing video virtual try-on via keyframe-driven d | arXiv: 2512.20340
training-free motion factorization for compositional video generation | arXiv: 2603.09104
u-mind a unified framework for real-time multimodal interaction with audiovisual | arXiv: 2602.23739
uniavgen unified audio and video generation with asymmetric cross-modal interact | arXiv: 2511.03334
unified camera positional encoding for controlled video generation | arXiv: 2512.07237
unitalking a unified audio-video framework for talking portrait generation | arXiv: 2603.01418
vanast virtual try-on with human image animation via synthetic triplet supervisi | arXiv: 2604.04934
videocof unified video editing with temporal reasoner | arXiv: 2512.07469
when numbers speak aligning textual numerals and visual instances in text-to-vid | arXiv: 2604.08546
when to lock attention training-free kv control in video diffusion | arXiv: 2603.09657
adaspark adaptive sparsity for efficient long video understanding | arXiv: 2604.08077
chronotrack temporally consistent long term memory for 3d single object tracking | arXiv: 2604.13789
dual-level adaptation for multi-object tracking building test-time calibration from | arXiv: 2603.21629
envisioning the future one step at a time | arXiv: 2604.09527
event6d event-based novel object 6d pose tracking | arXiv: 2603.28045
how should video llms output time | arXiv: 2604.08966
humanvbench probing human centric video understanding in mllms with automatica | arXiv: 2412.17574
humanvbench probing human centric video understanding mllms | arXiv: 2412.17574
ninja codes neurally generated fiducial markers for stealthy 6-dof tracking | arXiv: 2510.18976
seen to scene keep the seen generate the unseen for video outpainting | arXiv: 2604.14648
storm referring multi object tracking | arXiv: 2604.10527
svagent storyline guided long video understanding via cross modal multi agent collaboration | arXiv: 2604.05079
tcei dual level adaptation multi object tracking | arXiv: 2603.21629
tcei test time calibration experience intuition mot | arXiv: 2603.21629
u2flow uncertainty aware unsupervised optical flow estimation | arXiv: 2604.10056
vidtag video gps geolocalization | arXiv: 2604.12159
vsi visual-subtitle integration for keyframe selection to enhance long video un | arXiv: 2508.06869