CVPR 2026 Paper Notes TODO
Total: 2198 papers | Completed: 2198 | Pending update: 0
- \(\varphi\)-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models | arXiv: 2602.22601
- 2ndmatch finetuning pruned diffusion models via second-order jacobian matching | arXiv: 2506.05398
- 3d gaussian splatting with self-constrained priors for high fidelity surface rec | arXiv: 2603.19682
- 3d sans 3d scans scalable pre-training from video-generated point clouds | arXiv: 2512.23042
- 3d-fixer coarse-to-fine in-place completion for 3d scenes from a single image | arXiv: 2604.04406
- 3d-ide 3d implicit depth emergent | arXiv: 2604.03296
- 3drawagent teaching llm to draw in 3d with early contrastive experience | arXiv: 2604.08042
- 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion | arXiv: 2511.19117
- 4c4d 4 camera 4d gaussian splatting | arXiv: 2604.04063
- 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video | arXiv: 2603.10125
- A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks | arXiv: 2603.12998
- a closer look at cross-domain few-shot object detection fine-tuning matters and | arXiv: 2603.28182
- a frame is worth one token efficient generative world modeling with delta tokens | arXiv: 2604.04913
- A Mixed Diet Makes DINO An Omnivorous Vision Encoder | arXiv: 2602.24181
- A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning | arXiv: 2603.14052
- a paradigm shift fully end-to-end training for temporal sentence grounding in vi | arXiv: 2604.02860
- A Prediction-as-Perception Framework for 3D Object Detection | arXiv: 2603.12599
- A protocol for evaluating robustness to H&E staining variation in computational pathology models | arXiv: 2603.12886
- a semantically disentangled unified model for multi-category 3d anomaly detectio | arXiv: 2603.25159
- A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement | arXiv: 2603.06167
- a unified perspective on adversarial membership manipulation in vision models | arXiv: 2604.02780
- A2P: From 2D Alignment to 3D Plausibility for Occlusion-Robust Two-Hand Reconstruction | arXiv: 2503.17788
- a2z-10m geometric deep learning with a-to-z brep annotations for ai-assisted cad | arXiv: 2603.12605
- a3 towards advertising aesthetic assessment | arXiv: 2603.24037
- A4VL: A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning | arXiv: 2603.14052
- ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection | arXiv: 2603.12409
- Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective | arXiv: 2507.05914
- Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning | arXiv: 2603.13007
- ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation | arXiv: 2603.02945
- acetone bridging words and colors for conditional image grading | arXiv: 2604.00530
- ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery | arXiv: 2603.16616
- Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning | arXiv: 2603.00667
- action-guided generation of 3d functionality segmentation data | arXiv: 2511.23230
- actionmesh animated 3d mesh generation with temporal 3d diffusion | arXiv: 2601.16148
- Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation | arXiv: 2602.23814
- activation matters test-time activated negative labels for ood detection with vi | arXiv: 2603.25250
- Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning | arXiv: 2603.07559
- activityforensics a comprehensive benchmark for localizing manipulated activity | arXiv: 2604.03819
- actta rethinking test-time adaptation via dynamic activation | arXiv: 2603.26096
- Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation | arXiv: 2603.11984
- AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks | arXiv: 2510.03101
- ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation | arXiv: 2603.19157
- Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions | arXiv: 2603.12468
- adapting a pre-trained single-cell foundation model to spatial gene expression g | arXiv: 2603.19766
- adapting point cloud analysis via multimodal bayesian distribution learning | arXiv: 2603.22070
- adaptive action chunking at inference-time for vision-language-action models | arXiv: 2604.04161
- Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation | arXiv: 2603.19158
- adaptive confidence regularization for multimodal failure detection | arXiv: 2603.02200
- adaptive learned image compression with graph neural networks | arXiv: 2603.25316
- Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration | arXiv: 2603.01623
- Adaptive Vision-Language Model Routing for Computer Use Agents | arXiv: 2603.12823
- adaptvision efficient vision-language models via adaptive visual acquisition | arXiv: 2512.03794
- adaradar rate adaptive spectral compression for radar-based perception | arXiv: 2603.17979
- adasformer adaptive serialized transformers for monocular semantic scene complet | arXiv: 2603.25494
- Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding | arXiv: 2603.12514
- AdvMark: Decoupling Defense Strategies for Robust Image Watermarking | arXiv: 2602.20053
- AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction | arXiv: 2602.22376
- affordgrasp cross-modal diffusion for affordance-aware grasp synthesis | arXiv: 2603.08021
- affordmatcher affordance learning in 3d scenes from visual signifiers | arXiv: 2603.27970
- AFRO: Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning | arXiv: 2512.00074
- Agentic Retoucher for Text-To-Image Generation | arXiv: 2601.02046
- agft alignment-guided fine-tuning for zero-shot adversarial robustness of vision | arXiv: 2603.29410
- AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution | arXiv: 2603.00589
- All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark | arXiv: 2602.23523
- All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference | arXiv: 2603.08498
- All-in-One Slider for Attribute Manipulation in Diffusion Models | arXiv: 2508.19195
- An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS | arXiv: 2603.10671
- an instance-centric panoptic occupancy prediction benchmark for autonomous drivi | arXiv: 2603.27238
- Anchoring and Rescaling Attention for Semantically Coherent Inbetweening | arXiv: 2603.17651
- anchorsplat feed-forward 3d gaussian splatting with 3d geometric priors | arXiv: 2604.07053
- ani3dhuman photorealistic 3d human animation with self-guided stochastic samplin | arXiv: 2602.19089
- anomalyvfm -- transforming vision foundation models into zero-shot anomaly detec | arXiv: 2601.20524
- anthrotap learning point tracking with real-world motion | arXiv: 2507.06233
- anti-i2v safeguarding your photos from malicious image-to-video generation | arXiv: 2603.24570
- Anticipatory Planning for Multimodal AI Agents | arXiv: 2603.16777
- anydoc enhancing document generation via large-scale htmlcss data synthesis and | arXiv: 2603.25118
- AnyPcc: Compressing Any Point Cloud with a Single Universal Model | arXiv: 2510.20331
- ApET: Approximation-Error Guided Token Compression for Efficient VLMs | arXiv: 2602.19870
- apple attribute-preserving pseudo-labeling for diffusion-based face swapping | arXiv: 2601.15288
- ar2can an architect and an artist leveraging a canvas for multi-human generation | arXiv: 2511.22690
- ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation | arXiv: 2603.10188
- Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study | arXiv: 2603.13044
- arthoi taming foundation models for monocular 4d reconstruction of hand-articula | arXiv: 2603.25791
- artllm generating articulated assets via 3d llm | arXiv: 2603.01142
- AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos | arXiv: 2603.07758
- As Language Models Scale, Low-order Linear Depth Dynamics Emerge | arXiv: 2603.12541
- AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys | arXiv: 2603.11928
- asking like socrates socrates helps vlms understand remote sensing images | arXiv: 2511.22396
- AssistMimic: Physics-Grounded Humanoid Assistance via Multi-Agent RL | arXiv: 2603.11346
- association and consolidation evolutionary memory-enhanced incremental multi-vie | arXiv: 2509.14544
- Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts | arXiv: 2603.09531
- AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots | arXiv: 2603.07648
- attend before attention efficient and scalable video understanding via autoregre | arXiv: 2603.12254
- attention may i have your decision localizing generative choices in diffusion mo | arXiv: 2604.06052
- Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution | arXiv: 2603.10583
- Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors | arXiv: 2603.15656
- autocut end-to-end advertisement video editing based on multimodal discretizatio | arXiv: 2603.28366
- AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models | arXiv: 2508.00445
- AutoGaze: Attend Before Attention — Efficient and Scalable Video Understanding via Autoregressive Gazing | arXiv: 2603.12254
- Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI | arXiv: 2603.11818
- AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models | arXiv: 2506.09082
- avatar reinforcement learning to see hear and reason over video | arXiv: 2508.03100
- avatarpointillist autoregressive 4d gaussian avatarization | arXiv: 2604.04787
- AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network | arXiv: 2603.12659
- AVR: Adaptive VLM Routing for Computer Use Agents | arXiv: 2603.12823
- babyvlm-v2 toward developmentally grounded pretraining and benchmarking of visio | arXiv: 2512.10932
- back to point exploring point-language models for zero-shot 3d anomaly detection | arXiv: 2603.21511
- balm a model-agnostic framework for balanced multimodal learning under imbalance | arXiv: 2603.19718
- banana100 breaking nr-iqa metrics by 100 iterative image replications with nano | arXiv: 2604.03400
- Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group | arXiv: 2603.12459
- bd-merging bias-aware dynamic model merging with evidence-guided contrastive lea | arXiv: 2603.03920
- beautygrpo aesthetic alignment for face retouching via dynamic path guidance and | arXiv: 2603.01163
- Benchmarking Endoscopic Surgical Image Restoration and Beyond | arXiv: 2505.19161
- benchmarking phd-level coding in 3d geometric computer vision | arXiv: 2603.30038
- benchmarking vision-language models under contradictory virtual content attacks | arXiv: 2604.05510
- BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending | arXiv: 2603.13102
- better than average spatially-aware aggregation of segmentation uncertainty impr | arXiv: 2603.29941
- BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images | arXiv: 2603.17159
- Beyond Caption-Based Queries for Video Moment Retrieval | arXiv: 2603.02363
- Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D | arXiv: 2603.05906
- beyond global similarity towards fine-grained multi-condition multimodal retriev | arXiv: 2603.01082
- beyond ground-truth leveraging image quality priors for real-world image restora | arXiv: 2603.29773
- beyond heuristic prompting a concept-guided bayesian framework for zero-shot ima | arXiv: 2603.07911
- beyond loss values robust dynamic pruning via loss trajectory alignment | arXiv: 2604.07306
- Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control | arXiv: 2512.21058
- Beyond Prompt Degradation: Prototype-Guided Dual-Pool Prompting for Incremental Object Detection | arXiv: 2603.02286
- beyond recognition evaluating visual perspective taking in vision language model | arXiv: 2505.03821
- beyond semantic search towards referential anchoring in composed image retrieval | arXiv: 2604.05393
- beyond semantics disentangling information scope in sparse autoencoders for clip | arXiv: 2604.05724
- Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding | arXiv: 2603.11423
- beyond static artifacts a forensic benchmark for video deepfake reasoning in vis | arXiv: 2602.21779
- beyond the fold quantifying split-level noise and the case for leave-one-dataset | arXiv: 2604.02162
- beyond the golden data resolving the motion-vision quality dilemma via timestep | arXiv: 2603.25527
- beyond the ground truth enhanced supervision for image restoration | arXiv: 2512.03932
- beyond the mean modelling annotation distributions in continuous affect predicti | arXiv: 2604.07198
- bhcast unlocking black hole plasma dynamics from a single blurry image with long | arXiv: 2603.26777
- BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation | arXiv: 2603.00156
- bidirectional multimodal prompt learning with scale-aware training for few-shot | arXiv: 2408.13516
- BiGain: Unified Token Compression for Joint Generation and Classification | arXiv: 2603.12240
- Bilevel Layer-Positioning LoRA for Real Image Dehazing | arXiv: 2603.10872
- bimotion b-spline motion for text-guided dynamic 3d character generation | arXiv: 2602.18873
- BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers | arXiv: 2603.09582
- biovita biological dataset model and benchmark for visual-textual-acoustic align | arXiv: 2603.23883
- bipremanip learning affordance-based bimanual preparatory manipulation through a | arXiv: 2603.21679
- blackmirror black-box backdoor detection for text-to-image models via instructio | arXiv: 2603.05921
- blazefl fast and deterministic federated learning simulation | arXiv: 2604.03606
- blink dynamic visual token resolution for enhanced multimodal understanding | arXiv: 2512.10548
- BluRef: Unsupervised Image Deblurring with Dense-Matching References | arXiv: 2603.14176
- Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting | arXiv: 2603.16129
- boosting vision-language-action finetuning with feasible action neighborhood pri | arXiv: 2604.01570
- BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning | arXiv: 2603.13109
- Bounds on Agreement between Subjective and Objective Measurements | arXiv: 2603.13204
- Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors | arXiv: 2603.13092
- BRepGaussian: CAD Reconstruction from Multi-View Images with Gaussian Splatting | arXiv: 2602.21105
- Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation | arXiv: 2602.19863
- bridge multimodal-to-text retrieval via reinforcement-learned query alignment | arXiv: 2604.07201
- bridging pixels and words mask-aware local semantic fusion for multimodal media | arXiv: 2603.26052
- bridging the perception gap in image super-resolution evaluation | arXiv: 2503.13074
- Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD | arXiv: 2603.10933
- brima bridged modality adaptation for multi-modal continual action quality asses | arXiv: 2602.19170
- BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy | arXiv: 2603.14361
- BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds | arXiv: 2602.23645
- bulletgen improving 4d reconstruction with bullet-time generation | arXiv: 2506.18601
- bussard normalizing flows for bijective universal scene-specific anomalous relat | arXiv: 2603.16645
- ca-lora concept-aware lora for domain-aligned segmentation dataset generation | arXiv: 2503.22172
- can natural image autoencoders compactly tokenize fmri volumes for long-range dy | arXiv: 2604.03619
- can vision-language models count a synthetic benchmark and analysis of attention | arXiv: 2511.17722
- capt confusion-aware prompt tuning for reducing vision-language misalignment | arXiv: 2603.02557
- care a molecular-guided foundation model with adaptive region modeling for whole | arXiv: 2602.21637
- CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing | arXiv: 2603.08589
- CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion | arXiv: 2602.19140
- carepilot a multi-agent framework for long-horizon computer task automation in h | arXiv: 2603.24157
- Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation | arXiv: 2603.12766
- Causal Motion Diffusion Models for Autoregressive Motion Generation | arXiv: 2602.22594
- causalvad de-confounding end-to-end autonomous driving via causal intervention | arXiv: 2603.18561
- cc-vqa conflict- and correlation-aware method for mitigating knowledge conflict | arXiv: 2602.23952
- CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning | arXiv: 2602.21655
- ccf complementary collaborative fusion for domain generalized multi-modal 3d obj | arXiv: 2603.23276
- cd-buffer complementary dual-buffer framework for test-time adaptation in advers | arXiv: 2603.26092
- CDA-VSR: Compressed-Domain-Aware Online Video Super-Resolution | arXiv: 2603.07694
- CDG: Guiding Diffusion Models with Semantically Degraded Conditions | arXiv: 2603.10780
- Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images | arXiv: 2603.18461
- CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance | arXiv: 2603.03281
- cghair compact gaussian hair reconstruction with card clustering | arXiv: 2604.03716
- Chain of Event-Centric Causal Thought for Physically Plausible Video Generation | arXiv: 2603.09094
- Chain of World: World Model Thinking in Latent Motion (CoWVLA) | arXiv: 2603.03195
- changebridge spatiotemporal image generation with multimodal controls for remote | arXiv: 2507.04678
- Changes in Real Time: Online Scene Change Detection with Multi-View Fusion | arXiv: 2511.12370
- chartnet a million-scale high-quality multimodal dataset for robust chart unders | arXiv: 2603.27064
- cheem continual learning by reuse new adapt and skip -- a hierarchical explorati | arXiv: 2303.08250
- CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection | arXiv: 2511.18519
- chordedit one-step low-energy transport for image editing | arXiv: 2602.19083
- CI-ICE: Intrinsic Concept Extraction Based on Compositional Interpretability | arXiv: 2603.11795
- cigpose causal intervention graph neural network for whole-body pose estimation | arXiv: 2603.09418
- cinematic audio source separation using visual cues | arXiv: 2603.26113
- CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization | arXiv: 2603.16966
- CIPHER: Countering Hallucination with Counterfactuals: Diffusion-Guided Hallucination Suppression for LVLMs | arXiv: 2603.10470
- circuit mechanisms for spatial relation generation in diffusion transformers | arXiv: 2601.06338
- circuit tracing in vision-language models understanding the internal mechanisms | arXiv: 2602.20330
- CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning | arXiv: 2602.19605
- cleaning the pool progressive filtering of unlabeled pools in deep active learni | arXiv: 2511.22344
- ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data | arXiv: 2512.02686
- CLIP Is Shortsighted: Paying Attention Beyond the First Sentence | arXiv: 2602.22419
- CLIP-Free, Label-Free, Unsupervised Concept Bottleneck Models | arXiv: 2503.10981
- CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation | arXiv: 2602.20409
- CLoE: Expert Consistency Learning for Missing Modality Segmentation | arXiv: 2603.09316
- cluster-wise spatio-temporal masking for efficient video-language pretraining | arXiv: 2603.22953
- CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration | arXiv: 2603.12721
- CoD: A Diffusion Foundation Model for Image Compression | arXiv: 2511.18706
- CodeBrain: Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code | arXiv: 2501.18328
- coded-e2lf coded aperture light field imaging from events | arXiv: 2602.22620
- codedance a dynamic tool-integrated mllm for executable visual reasoning | arXiv: 2512.17312
- CodePercept: Code-Grounded Visual STEM Perception for MLLMs | arXiv: 2603.10757
- coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation | arXiv: 2603.12829
- cog confidence-aware optimal geometric correspondence for unsupervised single-re | arXiv: 2603.00493
- CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment | arXiv: 2603.12722
- Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass | arXiv: 2603.12789
- coin3d revisiting configuration-invariant multi-camera 3d object detection | arXiv: 2603.05042
- ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving | arXiv: 2512.22939
- CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion | arXiv: 2603.00682
- CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation | arXiv: 2602.22150
- color when it counts grayscale-guided online triggering for always-on streaming | arXiv: 2603.22466
- como learning continuous latent motion from internet videos for scalable robot l | arXiv: 2505.17006
- compagent an agentic framework for visual compliance verification | arXiv: 2511.00171
- Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging | arXiv: 2603.04796
- Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper | arXiv: 2603.04796
- Competition-Aware CPC Forecasting with Near-Market Coverage | arXiv: 2603.13059
- CompoSIA: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation | arXiv: 2603.12864
- Composing Concepts from Images and Videos via Concept-prompt Binding | arXiv: 2512.09824
- Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation | arXiv: 2603.12864
- Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression | arXiv: 2603.13795
- concept-guided fine-tuning steering vits away from spurious correlations to impr | arXiv: 2603.08309
- conceptprism concept disentanglement in personalized diffusion models via residu | arXiv: 2602.19575
- conditional factuality controlled llms with generalization certificates via conf | arXiv: 2603.27403
- ConsistCompose: Unified Multimodal Layout Control for Image Composition | arXiv: 2511.18333
- Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation | arXiv: 2603.09506
- Continual Learning with Vision-Language Models via Semantic-Geometry Preservation | arXiv: 2603.12055
- Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis | arXiv: 2603.01398
- COT-FM: Cluster-wise Optimal Transport Flow Matching | arXiv: 2603.13395
- covft context-aware visual fine-tuning for multimodal large language models | arXiv: 2603.21077
- covr-r reason-aware composed video retrieval | arXiv: 2603.20190
- craterbench-r instance-level crater retrieval for planetary scale | arXiv: 2604.06245
- crft consistent-recurrent feature flow transformer for cross-modal image registr | arXiv: 2604.05689
- crit graph-based automatic data synthesis to enhance cross-modal multi-hop reaso | arXiv: 2604.01634
- critical patch-aware sparse prompting with decoupled training for continual lear | arXiv: 2604.07399
- cross-domain demo-to-code via neurosymbolic counterfactual reasoning | arXiv: 2603.18495
- cross-instance gaussian splatting registration via geometry-aware feature-guided | arXiv: 2603.21936
- cross-modal emotion transfer for emotion editing in talking face video | arXiv: 2604.07786
- cross-modal fuzzy alignment network for text-aerial person retrieval and a large | arXiv: 2603.20721
- Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning | arXiv: 2603.01696
- Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark | arXiv: 2603.00543
- cross-slice knowledge transfer via masked multi-modal heterogeneous graph contra | arXiv: 2603.22821
- CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation | arXiv: 2603.12008
- crosshoi-bench a unified benchmark for hoi evaluation across vision-language mod | arXiv: 2508.18753
- CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image | arXiv: 2603.17779
- cryohype reconstructing a thousand cryo-em structures with transformer-based hyp | arXiv: 2512.06332
- cryosense compressive sensing enables high-throughput microscopy with sparse and | arXiv: 2511.12931
- ctcal rethinking text-to-image diffusion models via cross-timestep self-calibrat | arXiv: 2603.20741
- ctfs collaborative teacher framework for forward-looking sonar image semantic se | arXiv: 2603.21071
- CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video | arXiv: 2603.04291
- Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens | arXiv: 2603.19232
- CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation | arXiv: 2601.15408
- customized visual storytelling with unified multimodal llms | arXiv: 2603.27690
- CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization | arXiv: 2603.19121
- Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events | arXiv: 2603.06213
- cva context-aware video-text alignment for video temporal grounding | arXiv: 2603.24934
- Cycle-Consistent Tuning for Layered Image Decomposition | arXiv: 2602.20989
- cyclebev regularizing view transformation networks via view cycle consistency fo | arXiv: 2602.23575
- D2C: Accelerating Diffusion Model Training under Minimal Budgets via Condensation | arXiv: 2507.05914
- D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping | arXiv: 2507.08492
- da-mamba learning domain-aware state space model for global-local alignment in d | arXiv: 2603.18757
- da-vae plug-in latent compression for diffusion via detail alignment | arXiv: 2603.22125
- DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation | arXiv: 2603.03744
- dark3r learning structure from motion in the dark | arXiv: 2603.05330
- data warmup complexity-aware curricula for efficient diffusion training | arXiv: 2604.07397
- DAWN: Pixel Motion Diffusion is What We Need for Robot Control | arXiv: 2509.22652
- DC-Merge: Improving Model Merging with Directional Consistency | arXiv: 2603.06242
- DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles | arXiv: 2603.01111
- Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation | arXiv: 2603.12547
- decompose and transfer cot-prompting enhanced alignment for open-vocabulary temp | arXiv: 2603.24030
- deconstructing the failure of ideal noise correction a three-pillar diagnosis | arXiv: 2603.12997
- Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis | arXiv: 2603.12997
- Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation | arXiv: 2603.00574
- Decoupling Vision and Language: Codebook Anchored Visual Adaptation | arXiv: 2602.19449
- decovln decoupling observation reasoning and correction for vision-and-language | arXiv: 2603.13133
- dedelayed deleting remote inference delay via on-device correction | arXiv: 2510.13714
- Deep Learning Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging | arXiv: 2603.12715
- deep learning-based assessment of the relation between the third molar and mandi | arXiv: 2603.11850
- Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning | arXiv: 2603.11850
- Deep Learning–Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging | arXiv: 2603.12715
- Defending Unauthorized Model Merging via Dual-Stage Weight Protection | arXiv: 2511.11851
- deformation-based in-context learning for point cloud understanding | arXiv: 2604.02845
- demographic fairness in multimodal llms a benchmark of gender and ethnicity bias | arXiv: 2603.25613
- Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache | arXiv: 2602.22654
- designing to forget deep semi-parametric models for unlearning | arXiv: 2603.22870
- Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification | arXiv: 2602.18842
- detecting unknown objects via energy-based separation for open world object dete | arXiv: 2603.29954
- developing foundation models for universal segmentation from 3d whole-body posit | arXiv: 2603.11627
- Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography | arXiv: 2603.11627
- Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models | arXiv: 2603.06049
- DIAE: Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception | arXiv: 2603.11556
- diagnose correct and learn from manipulation failures via visual symbols | arXiv: 2512.02787
- diagnosing and repairing unsafe channels in vision-language models via causal di | arXiv: 2603.27240
- diff4splat controllable 4d scene generation with latent dynamic reconstruction m | arXiv: 2511.00503
- diffbmp differentiable rendering with bitmap primitives | arXiv: 2602.22625
- diffusion mental averages | arXiv: 2603.29239
- Diffusion Probe: Generated Image Result Prediction Using CNN Probes | arXiv: 2602.23783
- Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification | arXiv: 2603.13182
- DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization | arXiv: 2603.14267
- dino-qpm adapting visual foundation models for globally interpretable image clas | arXiv: 2604.07166
- dip taming diffusion models in pixel space | arXiv: 2511.18822
- direct segmentation without logits optimization for training-free open-vocabular | arXiv: 2604.07723
- directfisheye-gs enabling native fisheye input in gaussian splatting with cross- | arXiv: 2604.00648
- DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification | arXiv: 2603.12905
- disca accelerating video diffusion transformers with distillation-compatible lea | arXiv: 2602.05449
- DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching | arXiv: 2602.05449
- disentangle-then-align non-iterative hybrid multimodal image registration via cr | arXiv: 2603.19623
- Disentangled Textual Priors for Diffusion-based Image Super-Resolution | arXiv: 2603.07430
- disentangling to re-couple resolving the similarity-controllability paradox in s | arXiv: 2604.00849
- Distilling Balanced Knowledge from a Biased Teacher | arXiv: 2506.18496
- DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression | arXiv: 2603.13162
- DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers | arXiv: 2603.04239
- Diversity over Uniformity: Rethinking Representation in Generated Image Detection | arXiv: 2603.00717
- divide then ground adapting frame selection to query types for long-form video u | arXiv: 2512.04000
- dlwm dual latent world models enable holistic gaussian-centric pre-training in a | arXiv: 2604.00969
- DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis | arXiv: 2602.23022
- dmin scalable training data influence estimation for diffusion models | arXiv: 2412.08637
- do vision-language models leak what they learn adaptive token-weighted model inv | arXiv: 2508.04097
- Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks | arXiv: 2508.04097
- Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering | arXiv: 2603.12533
- Does YOLO Really Need to See Every Training Image in Every Epoch? | arXiv: 2603.17684
- Domain-Skewed Federated Learning with Feature Decoupling and Calibration | arXiv: 2603.14238
- downscaling intelligence exploring perception and reasoning bottlenecks in small | arXiv: 2511.17487
- DPAD: Discriminative Perception via Anchored Description for Reasoning Segmentation | arXiv: 2603.04002
- DPCache: Denoising as Path Planning for Training-Free Diffusion Model Acceleration | arXiv: 2602.22654
- Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras | arXiv: 2603.01007
- Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving | arXiv: 2603.01007
- Draft and Refine with Visual Experts | arXiv: 2511.11005
- DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning | arXiv: 2603.12257
- drift-resilient temporal priors for visual tracking | arXiv: 2604.02654
- drive my way preference alignment of vision-language-action model for personaliz | arXiv: 2603.25740
- DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance | arXiv: 2512.14266
- DROID-W: DROID-SLAM in the Wild | arXiv: 2603.19076
- dropping anchor and spherical harmonics for sparse-view gaussian splatting | arXiv: 2602.20933
- dsca dynamic subspace concept alignment for lifelong vlm editing | arXiv: 2604.07965
- DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime | arXiv: 2603.10538
- DSS: Discover, Segment, and Select - A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation | arXiv: 2602.19944
- DSS: Discover, Segment, and Select for Zero-shot Camouflaged Object Segmentation | arXiv: 2602.19944
- DTR: Dynamic Token Reweighting for Robust Vision-Language Models | arXiv: 2505.17132
- dual band thermal videography separating time-varying reflection and emission ne | arXiv: 2509.11334
- dual-agent reinforcement learning for adaptive and cost-aware visual-inertial od | arXiv: 2511.21083
- dual-imbalance continual learning for real-world food recognition | arXiv: 2603.29133
- dualreg dual-space filtering and reinforcement for rigid registration | arXiv: 2508.17034
- DUET-VLM: Dual Stage Unified Efficient Token Reduction for VLM Training and Inference | arXiv: 2602.18846
- duo-vsr dual-stream distillation for one-step video super-resolution | arXiv: 2603.22271
- DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction | arXiv: 2603.03265
- dynamic black-hole emission tomography with physics-informed neural fields | arXiv: 2602.08029
- Dynamic Momentum Recalibration in Online Gradient Learning | arXiv: 2603.06120
- Dynamic Token Reweighting for Robust Vision-Language Models | arXiv: 2505.17132
- DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs | arXiv: 2602.21864
- dynavid learning to generate highly dynamic videos using synthetic motion data | arXiv: 2604.01666
- e-3dpsm a state machine for event-based egocentric 3d human pose estimation | arXiv: 2604.08543
- E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought | arXiv: 2602.21698
- e-rayzer self-supervised 3d reconstruction as spatial visual pre-training | arXiv: 2512.10950
- E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction | arXiv: 2603.14684
- eaglenet energy-aware fine-grained relationship learning network for text-video | arXiv: 2603.25267
- eaglevision a dual-stage framework with bev-grounding-based chain-of-thought for | arXiv: 2512.15160
- Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow | arXiv: 2602.21499
- EB-JDAT: Energy-based Joint Distribution Adversarial Training | arXiv: 2505.19459
- echoagent towards reliable echocardiography interpretation with eyeshands and mi | arXiv: 2604.05541
- echoes of ownership adversarial-guided dual injection for copyright protection i | arXiv: 2602.18845
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models | arXiv: 2602.20981
- echotrail-gui building actionable memory for gui agents via critic-guided self-e | arXiv: 2512.19396
- ECKConv: Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant Point Cloud Analysis | arXiv: 2603.17538
- edgedit hardware-aware diffusion transformers for efficient on-device image gene | arXiv: 2603.28405
- Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing | arXiv: 2603.17583
- Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking | arXiv: 2603.12949
- editing physiological signals in videos using latent representations | arXiv: 2509.25348
- EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing | arXiv: 2603.19224
- Efficient Document Parsing via Parallel Token Prediction | arXiv: 2603.15206
- efficient equivariant transformer for self-driving agent modeling | arXiv: 2604.01466
- efficient hybrid se3-equivariant visuomotor flow policy via spherical harmonics | arXiv: 2603.23227
- Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance | arXiv: 2603.07570
- Ego-1K: A Large-Scale Multiview Video Dataset for Egocentric Vision | arXiv: 2603.13741
- ego2web a web agent benchmark grounded in egocentric videos | arXiv: 2603.22529
- egoflow gradient-guided flow matching for egocentric 6dof object motion generati | arXiv: 2604.01421
- egomind activating spatial cognition through linguistic reasoning in mllms | arXiv: 2604.03318
- EgoPointVQA: Gesture-Based Egocentric Video Question Answering | arXiv: 2603.12533
- egoposeformer v2 accurate egocentric human motion estimation for arvr | arXiv: 2603.04090
- egoxtreme a dataset for robust object pose estimation in egocentric views under | arXiv: 2603.25135
- EI: Early Intervention for Multimodal Imaging based Disease Recognition | arXiv: 2603.17514
- Elastic Weight Consolidation Done Right for Continual Learning | arXiv: 2603.18596
- ELogitNorm: Enhancing OOD Detection with Extended Logit Normalization | arXiv: 2504.11434
- Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models | arXiv: 2507.18534
- elucidating the design space of arbitrary-noise-based diffusion models | arXiv: 2507.18534
- elvis enhance low-light for video instance segmentation in the dark | arXiv: 2512.01495
- EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease | arXiv: 2602.19178
- embodiedsplat online feed-forward semantic 3dgs for open-vocabulary 3d scene und | arXiv: 2603.04254
- EMDUL: Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets | arXiv: 2603.14507
- EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy | arXiv: 2512.06684
- emma concept erasure benchmark with comprehensive semantic metrics and diverse c | arXiv: 2512.17320
- EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models | arXiv: 2602.23802
- emotag emotion-aware talking head synthesis on gaussian splatting with few-shot | arXiv: 2603.21332
- EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis | arXiv: 2511.12554
- Empowering Semantic-Sensitive Underwater Image Enhancement with VLM | arXiv: 2603.12773
- enc-bench a benchmark for evaluating multimodal large language models in electro | arXiv: 2603.22763
- Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking | arXiv: 2501.14894
- Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration | arXiv: 2501.14894
- Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator | arXiv: 2603.14726
- Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception | arXiv: 2603.11556
- Enhancing Out-of-Distribution Detection with Extended Logit Normalization | arXiv: 2504.11434
- Enhancing Spatial Understanding in Image Generation via Reward Modeling | arXiv: 2602.24233
- EquivAnIA: A Spectral Method for Rotation-Equivariant Anisotropic Image Analysis | arXiv: 2603.11294
- erasure or erosion evaluating compositional degradation in unlearned text-to-ima | arXiv: 2604.04575
- EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection | arXiv: 2603.11521
- Evaluating Few-Shot Pill Recognition Under Visual Domain Shift | arXiv: 2603.10833
- evatok adaptive length video tokenization for efficient visual autoregressive ge | arXiv: 2603.12267
- EVATok: Adaptive-Length Video Tokenization for Efficient Visual Autoregressive Generation | arXiv: 2603.12267
- eventhub data factory for generalizable event-based stereo networks without acti | arXiv: 2604.02331
- every error has its magnitude asymmetric mistake severity training for multiclas | arXiv: 2603.13682
- EVLF: Early Vision-Language Fusion for Generative Dataset Distillation | arXiv: 2603.07476
- evolmm self-evolving large multimodal models with continuous rewards | arXiv: 2511.16672
- EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards | arXiv: 2511.16672
- Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition | arXiv: 2603.03827
- Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory | arXiv: 2603.15800
- Evolving Prompt Adaptation for Vision-Language Models | arXiv: 2603.09493
- EvoPrompt: Evolving Prompt Adaptation for Vision-Language Models | arXiv: 2603.09493
- ew-detr evolving world object detection via incremental low-rank detection trans | arXiv: 2602.20985
- EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer | arXiv: 2602.20985
- Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation | arXiv: 2603.12577
- explaining clip zero-shot predictions through concepts | arXiv: 2603.28211
- explore with long-term memory a benchmark and multimodal llm-based reinforcement | arXiv: 2601.10744
- exploring conditions for diffusion models in robotic control | arXiv: 2510.15510
- Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction | arXiv: 2603.00611
- ExpPortrait: Expressive Portrait Generation via Personalized Representation | arXiv: 2602.19900
- expressedit fast editing of stylized facial expressions with diffusion models in | arXiv: 2604.03448
- extend3d town-scale 3d generation | arXiv: 2603.29387
- extending zach-vit to robust medical imaging corruption and adversarial stress t | arXiv: 2604.06099
- extrinsplat decoupling geometry and semantics for open-vocabulary understanding | arXiv: 2509.22225
- f3dgs federated 3d gaussian splatting for decentralized multi-agent world modeli | arXiv: 2604.01605
- faar efficient frequency-aware multi-task fine-tuning via automatic rank selecti | arXiv: 2603.20403
- face time traveller travel through ages without losing identity | arXiv: 2602.22819
- Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration | arXiv: 2603.16570
- FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning | arXiv: 2603.05506
- FaceCoT: Chain-of-Thought Reasoning in MLLMs for Face Anti-Spoofing | arXiv: 2506.01783
- fact-gs frequency-aligned complexity-aware texture reparameterization for 2d gau | arXiv: 2511.23292
- failure modes for deep learning-based online mapping how to measure and address | arXiv: 2603.19852
- Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning | arXiv: 2603.12988
- FAIR-Pruner: Leveraging Tolerance of Difference for Flexible Automatic Layer-Wise Neural Network Pruning | arXiv: 2508.02291
- fairllava fairness-aware parameter-efficient fine-tuning for large vision-langua | arXiv: 2603.26008
- falcon false-negative aware learning of contrastive negatives in vision-language | arXiv: 2505.11192
- fast scenescript fast and accurate language-based 3d scene understanding via mul | arXiv: 2512.05597
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning | arXiv: 2601.09708
- fast3dcache training-free 3d geometry synthesis acceleration | arXiv: 2511.22533
- FastGS: Training 3D Gaussian Splatting in 100 Seconds | arXiv: 2511.04283
- FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters | arXiv: 2603.01685
- FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking | arXiv: 2603.12758
- fcl-cod weakly supervised camouflaged object detection with frequency-aware and | arXiv: 2603.22969
- fdeid-toolbox face de-identification toolbox | arXiv: 2603.13121
- FDeID-Toolbox: Face De-Identification Toolbox | arXiv: 2603.13121
- feature attribution stability suite how stable are post-hoc attributions | arXiv: 2604.02532
- fecalfed privacy-preserving poultry disease detection via federated learning | arXiv: 2604.00559
- Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift | arXiv: 2603.01040
- FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation | arXiv: 2603.04890
- FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts | arXiv: 2603.12912
- feddap domain-aware prototype learning for federated learning under domain shift | arXiv: 2604.06795
- Federated Active Learning Under Extreme Non-IID and Global Class Imbalance | arXiv: 2603.10341
- Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation | arXiv: 2603.04887
- fedre a representation entanglement framework for model-heterogeneous federated | arXiv: 2511.22265
- fedvg gradient-guided aggregation for enhanced federated learning | arXiv: 2602.21399
- Few-shot Acoustic Synthesis with Multimodal Flow Matching | arXiv: 2603.19176
- few-shot incremental 3d object detection in dynamic indoor environments | arXiv: 2604.07997
- fg-portrait 3d flow guided editable portrait animation | arXiv: 2603.23381
- FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution | arXiv: 2603.02692
- Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression | arXiv: 2603.10470
- Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks | arXiv: 2603.03907
- Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients | arXiv: 2603.17809
- FINER: MLLMs Hallucinate under Fine-grained Negative Queries | arXiv: 2603.17662
- first frame is the place to go for video content customization | arXiv: 2511.15700
- Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation | arXiv: 2602.24144
- Flash-Unified: Training-Free and Task-Aware Acceleration for Native Unified Models | arXiv: 2603.15271
- FlashCache: Frequency-Domain-Guided Outlier-KV-Aware Multimodal KV Cache Compression | arXiv: 2511.16786
- flashcap millisecond-accurate human motion capture via flashing leds and event-b | arXiv: 2603.19770
- FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance | arXiv: 2603.12146
- flexavatar learning complete 3d head avatars with partial supervision | arXiv: 2512.15599
- FlexHook: Rethinking Two-Stage Referring-by-Tracking in RMOT | arXiv: 2503.07516
- flow3r factored flow prediction for scalable visual geometry learning | arXiv: 2602.20157
- flowmotion training-free flow guidance for video motion transfer | arXiv: 2603.06289
- fluidgaussian propagating simulation-based uncertainty toward functionally-intel | arXiv: 2603.21356
- FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy | arXiv: 2602.23791
- FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | arXiv: 2603.02096
- focus dont prune identifying instruction-relevant regions for information-rich i | arXiv: 2603.22815
- focus-to-perceive representation learning a cognition-inspired hierarchical fram | arXiv: 2603.25778
- Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning | arXiv: 2603.11460
- fontcrafter high-fidelity element-driven artistic font creation with visual in-c | arXiv: 2603.22054
- FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction | arXiv: 2509.21029
- ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation | arXiv: 2603.15169
- Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning | arXiv: 2603.12887
- ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph | arXiv: 2603.09266
- FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration | arXiv: 2603.01284
- foundation model priors enhance object focus in feature space for source-free ob | arXiv: 2512.17514
- foundry distilling 3d foundation models for the edge | arXiv: 2511.20721
- Fourier Angle Alignment for Oriented Object Detection in Remote Sensing | arXiv: 2602.23790
- FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting | arXiv: 2602.24084
- fozo forward-only zeroth-order prompt optimization for test-time adaptation | arXiv: 2603.04733
- Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems | arXiv: 2603.13069
- frame2freq spectral adapters for fine-grained video understanding | arXiv: 2602.18977
- framer frequency-aligned self-distillation with adaptive modulation leveraging d | arXiv: 2512.01390
- free-grained hierarchical visual recognition | arXiv: 2510.14737
- free-lunch long video generation via layer-adaptive ood correction | arXiv: 2603.25209
- freeartgs articulated gaussian splatting under free-moving scenario | arXiv: 2603.22102
- frequency switching mechanism for parameter-efficient multi-task learning | arXiv: 2603.21111
- From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction | arXiv: 2503.17788
- from editor to dense geometry estimator | arXiv: 2509.04338
- from fewer samples to fewer bits reframing dataset distillation as joint optimiz | arXiv: 2603.02411
- from inpainting to layer decomposition repurposing generative inpainting models | arXiv: 2511.20996
- from intuition to investigation a tool-augmented reasoning mllm framework for ge | arXiv: 2603.01038
- from masks to pixels and meaning a new taxonomy benchmark and metrics for vlm im | arXiv: 2603.20193
- from observation to action latent action-based primitive segmentation for vla pr | arXiv: 2511.21428
- from orbit to ground generative city photogrammetry from extreme off-nadir satel | arXiv: 2512.07527
- From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection | arXiv: 2602.20630
- from static to dynamic exploring self-supervised image-to-video representation t | arXiv: 2603.26597
- from weights to concepts data-free interpretability of clip via singular vector | arXiv: 2603.24653
- funrec reconstructing functional 3d scenes from egocentric interaction videos | arXiv: 2604.05621
- fusionagent a multimodal agent with dynamic model selection for human recognitio | arXiv: 2603.26908
- F²HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling | arXiv: 2603.14920
- GACD: Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection | arXiv: 2509.03113
- GAP: Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation | arXiv: 2602.23814
- gardendesigner encoding aesthetic principles into jiangnan garden construction v | arXiv: 2604.01777
- Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories | arXiv: 2603.14153
- gaussfusion improving 3d reconstruction in the wild with a geometry-informed vid | arXiv: 2603.25053
- gaussian shannon high-precision diffusion model watermarking based on communicat | arXiv: 2603.26167
- gaussiangrow geometry-aware gaussian growing from 3d point clouds with text guid | arXiv: 2604.05721
- gaussianpile a unified sparse gaussian splatting framework for slice-based volum | arXiv: 2603.20611
- GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion | arXiv: 2603.17161
- GeCo-SRT: Geometry-aware Continual Adaptation for Robotic Cross-Task Sim-to-Real Transfer | arXiv: 2602.20871
- GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization | arXiv: 2603.05095
- Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation | arXiv: 2603.02554
- Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction | arXiv: 2602.21552
- generate analyze and refine training-free sound source localization via mllm met | arXiv: 2604.06824
- generative adversarial perturbations with cross-paradigm transferability on loca | arXiv: 2603.24821
- Generative Neural Video Compression via Video Diffusion Prior | arXiv: 2512.05016
- Generative Video Compression with One-Dimensional Latent Representation | arXiv: 2603.15302
- genmask adapting dit for segmentation via direct mask generation | arXiv: 2603.23906
- GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration | arXiv: 2603.13068
- GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis | arXiv: 2603.01010
- geoflow real-time fine-grained cross-view geolocalization via iterative flow pre | arXiv: 2603.21943
- geofusion-cad structure-aware diffusion with geometric state space for parametri | arXiv: 2603.21978
- geoguide hierarchical geometric guidance for open-vocabulary 3d semantic segment | arXiv: 2603.26260
- Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context | arXiv: 2602.21929
- Geometry-Guided Camera Motion Understanding in VideoLLMs | arXiv: 2603.13119
- geosurge geo-localization using semantic fusion with hierarchy of geographic emb | arXiv: 2510.01448
- geotikzbridge advancing multimodal code generation for geometric perception and | arXiv: 2603.22687
- GeoWorld: Geometric World Models | arXiv: 2602.23058
- GGPT: Geometry Grounded Point Transformer | arXiv: 2603.11174
- ghost-fwl a large-scale full-waveform lidar dataset for ghost detection and remo | arXiv: 2603.28224
- GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis | arXiv: 2603.09446
- GKD: Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation | arXiv: 2603.02554
- gleam a multimodal imaging dataset and hamm for glaucoma classification | arXiv: 2603.12800
- GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification | arXiv: 2603.12800
- glint modeling scene-scale transparency via gaussian radiance transport | arXiv: 2603.26181
- Global-Aware Edge Prioritization for Pose Graph Initialization | arXiv: 2602.21963
- glove2hand synthesizing natural hand-object interaction from multi-modal sensing | arXiv: 2603.20850
- GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering | arXiv: 2603.15616
- goal force teaching video models to accomplish physics-conditioned goals | arXiv: 2601.05848
- goal-driven reward by video diffusion models for reinforcement learning | arXiv: 2512.00961
- gp-4dgs probabilistic 4d gaussian splatting from monocular video via variational | arXiv: 2604.02915
- gQIR: Generative Quanta Image Reconstruction | arXiv: 2602.20417
- graph-to-frame rag visual-space knowledge fusion for training-free and auditable | arXiv: 2604.04372
- Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs | arXiv: 2510.00507
- GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning | arXiv: 2603.13370
- GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion | arXiv: 2602.22862
- graze grounded refinement and motion-aware zero-shot event localization | arXiv: 2604.01383
- groundvts visual token sampling in multimodal large language models for video te | arXiv: 2604.02093
- group editing edit multiple images in one go | arXiv: 2603.22883
- GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning | arXiv: 2602.19206
- GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training | arXiv: 2512.13043
- GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents | arXiv: 2603.15039
- guide a benchmark for understanding and assisting users in open-ended gui tasks | arXiv: 2603.25864
- guide guided updates for in-context decision evolution in llm-driven spacecraft | arXiv: 2603.27306
- guiding a diffusion model by swapping its tokens | arXiv: 2604.08048
- guiding a diffusion transformer with the internal dynamics of itself | arXiv: 2512.24176
- Guiding Diffusion Models with Semantically Degraded Conditions | arXiv: 2603.10780
- HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation | arXiv: 2603.12696
- ham a training-free style transfer approach via heterogeneous attention modulati | arXiv: 2603.24043
- HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding | arXiv: 2603.02329
- handvqa diagnosing and improving fine-grained spatial reasoning about hands in v | arXiv: 2603.26362
- handx scaling bimanual motion and interaction generation | arXiv: 2603.28766
- Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing | arXiv: 2506.01783
- HATS: Hardness-Aware Trajectory Synthesis for GUI Agents | arXiv: 2603.12138
- hawk head importance-aware visual token pruning in multimodal models | arXiv: 2604.07812
- hazematching dehazing light microscopy images with guided conditional flow match | arXiv: 2506.22397
- hear what matters text-conditioned selective video-to-audio generation | arXiv: 2512.02650
- herbench a benchmark for multi-evidence integration in video question answering | arXiv: 2512.14870
- hess head sensitivity score for sparsity redistribution in vggt | arXiv: 2603.25336
- Heterogeneous Decentralized Diffusion Models | arXiv: 2603.06741
- heuristic self-paced learning for domain adaptive semantic segmentation under ad | arXiv: 2603.24322
- hg-i2p bridging modalities for generalizable image-to-point-cloud registration v | arXiv: 2603.27969
- HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation | arXiv: 2603.10128
- HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers | arXiv: 2603.12222
- Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces | arXiv: 2503.07853
- hieramamba video temporal grounding via hierarchical anchor-mamba pooling | arXiv: 2510.23043
- hieramp coarse-to-fine autoregressive amplification for generative dataset disti | arXiv: 2603.06932
- hierarchical visual relocalization with nearest view synthesis from feature gaus | arXiv: 2603.29185
- hif-vla hindsight insight and foresight through motion representation for vision | arXiv: 2512.09928
- HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images | arXiv: 2603.02210
- HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks | arXiv: 2603.12760
- High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning | arXiv: 2503.22179
- high-quality and efficient turbulence mitigation with events | arXiv: 2603.20708
- hippomm hippocampal-inspired multimodal memory for long audiovisual event unders | arXiv: 2504.10739
- hispatial taming hierarchical 3d spatial understanding in vision-language models | arXiv: 2603.25411
- hive query hypothesize verify an llm framework for multimodal reasoning-intensiv | arXiv: 2604.07220
- HoneyBee: Data Recipes for Vision-Language Reasoners | arXiv: 2510.12225
- horizonforge driving scene editing with any trajectories and any vehicles | arXiv: 2602.21333
- HouseMind: Tokenization Allows MLLMs to Understand, Generate and Edit Architectural Floor Plans | arXiv: 2603.11640
- How to Take a Memorable Picture? Empowering Users with Actionable Feedback | arXiv: 2602.21877
- HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models | arXiv: 2602.22727
- HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in LVLMs | arXiv: 2602.22727
- human interaction-aware 3d reconstruction from a single image | arXiv: 2604.05436
- human knowledge integrated multi-modal learning for single source domain general | arXiv: 2603.12369
- Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization | arXiv: 2603.12369
- HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation | arXiv: 2602.24148
- Hybrid eTFCE–GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry | arXiv: 2603.11344
- Hyperbolic Busemann Neural Networks | arXiv: 2602.18858
- hypergaussians high-dimensional gaussian splatting for high-fidelity animatable | arXiv: 2507.02803
- HyperMVP: Hyperbolic Multiview Pretraining for Robotic Manipulation | arXiv: 2603.04848
- HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition | arXiv: 2506.04764
- I'm a Map! Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers | arXiv: 2603.02919
- iag input-aware backdoor attack on vlm-based visual grounding | arXiv: 2508.09456
- IAPL: Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning | arXiv: 2508.01603
- ictpolarreal a polarized reflection and material dataset of real world objects | arXiv: 2603.24912
- identity-preserving image-to-video generation via reward-guided optimization | arXiv: 2510.14255
- IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations | arXiv: 2602.18831
- igasa integrated geometry-aware and skip-attention modules for enhanced point cl | arXiv: 2603.12719
- IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration | arXiv: 2603.12719
- image diffusion preview with consistency solver | arXiv: 2512.13592
- Image Generation as a Visual Planner for Robotic Manipulation | arXiv: 2512.00532
- imagine before concentration diffusion-guided registers enhance partially releva | arXiv: 2604.03653
- Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards | arXiv: 2603.00918
- incarpose in-cabin relative camera pose estimation model and dataset | arXiv: 2604.03814
- indoor asset detection in large scale 360 drone-captured imagery via 3d gaussian | arXiv: 2604.05316
- Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout | arXiv: 2511.20649
- influence malleability in linearized attention dual implications of non-converge | arXiv: 2603.13085
- Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics | arXiv: 2603.13085
- InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation | arXiv: 2603.05898
- insid3 training-free in-context segmentation with dinov3 | arXiv: 2603.28480
- inside-out measuring generalization in vision transformers through inner working | arXiv: 2604.08192
- InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction | arXiv: 2603.11298
- instruction-guided lesion segmentation for chest x-rays with automatically gener | arXiv: 2511.15186
- Integration of Deep Generative Anomaly Detection Algorithm in High-Speed Industrial Line | arXiv: 2603.07577
- InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing | arXiv: 2603.13082
- interpretable and steerable concept bottleneck sparse autoencoders | arXiv: 2512.10805
- Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment | arXiv: 2603.17655
- Interpretable Debiasing of Vision-Language Models for Social Fairness | arXiv: 2602.24014
- Intrinsic Concept Extraction Based on Compositional Interpretability | arXiv: 2603.11795
- InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models | arXiv: 2504.05662
- irisfp adversarial-example-based model fingerprinting with enhanced uniqueness a | arXiv: 2603.24996
- it takes two a duet of periodicity and directionality for burst flicker removal | arXiv: 2603.22794
- It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models | arXiv: 2603.08011
- joint and streamwise distributed mimo satellite communications with multi-antenn | arXiv: 2603.12914
- Joint and Streamwise Distributed MIMO Satellite Communications with Multi-Antenna Ground Users | arXiv: 2603.12914
- Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild | arXiv: 2602.21736
- JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas | arXiv: 2603.06168
- just-in-time training-free spatial acceleration for diffusion transformers | arXiv: 2603.10744
- KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System | arXiv: 2512.20299
- KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing | arXiv: 2602.04268
- kαlos finds consensus a meta-algorithm for evaluating inter-annotator agreement | arXiv: 2603.27197
- L2GTX: From Local to Global Time Series Explanations | arXiv: 2603.13065
- label-free cross-task lora merging with null-space compression | arXiv: 2603.26317
- lamogen language to motion generation through llm-guided symbolic inference | arXiv: 2603.11605
- lamp language-assisted motion planning for controllable video generation | arXiv: 2512.03619
- language models can explain visual features via steering | arXiv: 2603.22593
- language-free generative editing from one visual example | arXiv: 2603.25441
- Language-Grounded Decoupled Action Representation for Robotic Manipulation | arXiv: 2603.12967
- Language-Grounded Decoupled Action Representation for Robotic Manipulation (LaDA) | arXiv: 2603.12967
- laof robust latent action learning with optical flow constraints | arXiv: 2511.16407
- LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency | arXiv: 2602.18735
- lasca language-conditioned scalable modelling of affective dynamics | arXiv: 2604.07193
- laser layer-wise scale alignment for training-free streaming 4d reconstruction | arXiv: 2512.13680
- layer consistency matters elegant latent transition discrepancy for generalizabl | arXiv: 2603.10598
- le mumo jepa multi-modal self-supervised representation learning with learnable | arXiv: 2603.24327
- Learnability-Driven Submodular Optimization for Active Roadside 3D Detection | arXiv: 2601.01695
- learnability-guided diffusion for dataset distillation | arXiv: 2604.00519
- learning by neighbor-aware semantics deciding by open-form flows towards robust | arXiv: 2511.09388
- Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction | arXiv: 2602.18996
- learning explicit continuous motion representation for dynamic gaussian splattin | arXiv: 2603.25058
- Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting | arXiv: 2508.05059
- learning from synthetic data via provenance-based input gradient guidance | arXiv: 2604.02946
- Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision | arXiv: 2603.13660
- Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization | arXiv: 2603.12663
- Learning Latent Proxies for Controllable Single-Image Relighting | arXiv: 2603.15555
- Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal | arXiv: 2511.17353
- learning like humans analogical concept learning for generalized category discov | arXiv: 2603.19918
- Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection | arXiv: 2602.18811
- learning multi-view spatial reasoning from cross-view relations | arXiv: 2603.27967
- Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception | arXiv: 2602.19596
- learning through creation a hash-free framework for on-the-fly category discover | arXiv: 2603.13858
- Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning | arXiv: 2603.11346
- Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos | arXiv: 2602.22091
- Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models | arXiv: 2603.06043
- Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation | arXiv: 2508.05186
- learning to translate noise for robust image denoising | arXiv: 2412.04727
- Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection | arXiv: 2506.01085
- lemma laplacian pyramids for efficient marine semantic segmentation | arXiv: 2603.25689
- lenswalk agentic video understanding by planning how you see in videos | arXiv: 2603.24558
- LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration | arXiv: 2602.20497
- let it snow animating 3d gaussian scenes with dynamic weather effects via physic | arXiv: 2504.05296
- let your image move with your motion -- implicit multi-object multi-motion trans | arXiv: 2603.01000
- leveraging multispectral sensors for color correction in mobile cameras | arXiv: 2512.08441
- Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment | arXiv: 2603.10929
- lifting unlabeled internet-level data for 3d scene understanding | arXiv: 2604.01907
- lighting-grounded video generation with renderer-based agent reasoning | arXiv: 2604.07966
- lightmover generative light movement with color and intensity controls | arXiv: 2603.27209
- lightsplat fast and memory-efficient open-vocabulary 3d scene understanding in f | arXiv: 2603.24146
- linking modality isolation in heterogeneous collaborative perception | arXiv: 2603.00609
- Linking Perception, Confidence and Accuracy in MLLMs | arXiv: 2603.12149
- LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation | arXiv: 2510.08318
- LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration | arXiv: 2602.21754
- Lite Any Stereo: Efficient Zero-Shot Stereo Matching | arXiv: 2511.16555
- litept lighter yet stronger point transformer | arXiv: 2512.13689
- live interactive training for video segmentation | arXiv: 2603.26929
- LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models | arXiv: 2509.25896
- LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models | arXiv: 2603.14882
- Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation | arXiv: 2603.16284
- lod-loc v3 generalized aerial localization in dense cities using instance silhou | arXiv: 2603.19609
- LongStream: Long-Sequence Streaming Autoregressive Visual Geometry | arXiv: 2602.13172
- longvideo-r1 smart navigation for low-cost long video understanding | arXiv: 2602.20913
- Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection | arXiv: 2507.16861
- looking beyond the window global-local aligned clip for training-free open-vocab | arXiv: 2603.23030
- LoST: Level of Semantics Tokenization for 3D Shapes | arXiv: 2603.17995
- love me love my label rethinking the role of labels in prompt retrieval for visu | arXiv: 2604.03657
- low-resolution editing is all you need for high-resolution editing | arXiv: 2511.19945
- LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction | arXiv: 2603.12647
- LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates | arXiv: 2510.09881
- lumictrl learning illuminant prompts for lighting control in personalized text-t | arXiv: 2512.17489
- LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol | arXiv: 2603.14644
- Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels | arXiv: 2602.22140
- M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation | arXiv: 2509.23728
- m4-rag a massive-scale multilingual multi-cultural multimodal rag | arXiv: 2512.05959
- ma-bench towards fine-grained micro-action understanding | arXiv: 2603.26586
- MAD-Avatar: Motion-Aware Animatable Gaussian Avatars Deblurring | arXiv: 2411.16758
- MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness | arXiv: 2507.02314
- magician efficient long-term planning with imagined gaussians for active mapping | arXiv: 2603.22650
- Making Training-Free Diffusion Segmentors Scale with the Generative Power | arXiv: 2603.06178
- mamba learns in context structure-aware domain generalization for multi-task poi | arXiv: 2603.20739
- mamba-vmr multimodal query augmentation via generated videos for precise tempora | arXiv: 2603.22121
- maniparena comprehensive real-world evaluation of reasoning-oriented generalist | arXiv: 2603.28545
- MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction | arXiv: 2603.10688
- MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models | arXiv: 2511.20629
- Mario: Multimodal Graph Reasoning with Large Language Models | arXiv: 2603.05181
- marker-based 3d reconstruction of aggregates with a comparative analysis of 2d a | arXiv: 2603.12667
- Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies | arXiv: 2603.12667
- Markovian Scale Prediction: A New Era of Visual Autoregressive Generation | arXiv: 2511.23334
- markushgrapher-2 end-to-end multimodal recognition of chemical structures | arXiv: 2603.28550
- MARVO: Marine-Adaptive Radiance-aware Visual Odometry | arXiv: 2511.22860
- maskadapt learning flexible motion adaptation via mask-invariant prior for physi | arXiv: 2603.29272
- MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations | arXiv: 2602.18792
- Masked Representation Modeling for Domain-Adaptive Segmentation | arXiv: 2509.13801
- masking matters unlocking the spatial reasoning capabilities of llms for 3d scen | arXiv: 2512.02487
- MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models | arXiv: 2603.04800
- mastering negation boosting grounding models via grouped opposition-based learni | arXiv: 2603.12606
- Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning | arXiv: 2603.12606
- matanyone 2 scaling video matting via a learned quality evaluator | arXiv: 2512.11782
- Match-and-Fuse: Consistent Generation from Unstructured Image Sets | arXiv: 2511.22287
- MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision | arXiv: 2602.20689
- meanfuser fast one-step multi-modal trajectory generation and adaptive reconstru | arXiv: 2602.20060
- measuring the unfaithfulness of concept-based explanations | arXiv: 2504.10833
- MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation | arXiv: 2602.20423
- MedGEN-Bench: Contextually Entangled Benchmark for Open-Ended Multimodal Medical Generation | arXiv: 2511.13135
- medgrpo multi-task reinforcement learning for heterogeneous medical video unders | arXiv: 2512.06581
- MEDISEG: A Dataset of Medication Images with Instance Segmentation Masks for Preventing Adverse Drug Events | arXiv: 2603.10825
- MEDISEG: A Medication Image Instance Segmentation Dataset for Preventing Adverse Drug Events | arXiv: 2603.10825
- MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration | arXiv: 2603.09101
- memo human-like crisp edge detection using masked edge prediction | arXiv: 2603.20782
- memory-efficient fine-tuning diffusion transformers via dynamic patch sampling a | arXiv: 2603.20755
- MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent | arXiv: 2511.18810
- Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation | arXiv: 2603.00526
- meta-learning in-context enables training-free cross subject brain decoding | arXiv: 2604.08537
- MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating | arXiv: 2603.09419
- MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging | arXiv: 2603.09116
- Miburi: Towards Expressive Interactive Gesture Synthesis | arXiv: 2603.03282
- MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models | arXiv: 2602.19497
- MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification | arXiv: 2603.09374
- mimicat mimic with correspondence-aware cascade-transformer for category-free 3d | arXiv: 2511.18370
- Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning | arXiv: 2603.13341
- mind the generative details direct localized detail preference optimization for | arXiv: 2601.04068
- mind the hitch dynamic calibration and articulated perception for autonomous tru | arXiv: 2603.23711
- Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs | arXiv: 2603.02618
- MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving | arXiv: 2602.21952
- MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents | arXiv: 2511.23055
- mine-jepa in-domain self-supervised learning for mine-like object classification | arXiv: 2604.00383
- minerva-cultural a benchmark for cultural and multilingual long video reasoning | arXiv: 2601.10649
- mining instance-centric vision-language contexts for human-object interaction de | arXiv: 2604.02071
- Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared | arXiv: 2603.08018
- mistake attribution fine-grained mistake understanding in egocentric videos | arXiv: 2511.20525
- Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning | arXiv: 2603.04825
- Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection | arXiv: 2603.13070
- mitigating multimodal hallucinations via gradient-based self-reflection | arXiv: 2509.03113
- mitigating object hallucinations in lvlms via attention imbalance rectification | arXiv: 2603.24058
- MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention | arXiv: 2603.01361
- Mixture of States (MoS): Routing Token-Level Dynamics for Multimodal Generation | arXiv: 2511.12207
- mixture of states routing token-level dynamics for multimodal generation | arXiv: 2511.12207
- mm-recoder advancing chart-to-code generation with reinforcement learning and se | arXiv: 2604.01600
- mmtit-bench a multilingual and multi-scenario benchmark with cognition-perceptio | arXiv: 2603.23896
- Mobile-VTON: High-Fidelity On-Device Virtual Try-On | arXiv: 2603.00947
- MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization | arXiv: 2603.03192
- Model Merging in the Essential Subspace | arXiv: 2602.20208
- modeling spatiotemporal neural frames for high resolution brain dynamic | arXiv: 2603.24176
- modes accelerating mixture-of-experts multimodal large language models via dynam | arXiv: 2511.15690
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping | arXiv: 2511.15690
- moe-grpo optimizing mixture-of-experts via reinforcement learning in vision-lang | arXiv: 2603.24984
- MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection | arXiv: 2603.03101
- MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization | arXiv: 2603.12743
- molingo motion-language alignment for text-to-motion generation | arXiv: 2512.13840
- Momentum Memory for Knowledge Distillation in Computational Pathology | arXiv: 2602.21395
- momo mars orbital model foundation model for mars orbital applications | arXiv: 2604.02719
- Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes (LegoOcc) | arXiv: 2602.22667
- monosaod monocular 3d object detection with sparsely annotated label | arXiv: 2604.01646
- More than the Sum: Panorama-Language Models for Adverse Omni-Scenes | arXiv: 2603.09573
- MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer | arXiv: 2603.05078
- morel long-range flicker-free 4d motion modeling via anchor relay-based bidirect | arXiv: 2512.09270
- MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing | arXiv: 2601.00204
- MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification | arXiv: 2512.03404
- Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models | arXiv: 2603.16001
- Motion-Aware Animatable Gaussian Avatars Deblurring | arXiv: 2411.16758
- MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins | arXiv: 2603.12936
- motionscale reconstructing appearance geometry and motion of dynamic scenes with | arXiv: 2603.29296
- MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer | arXiv: 2508.14327
- movierecapsqa a multimodal open-ended video question-answering benchmark | arXiv: 2601.02536
- MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second | arXiv: 2507.10065
- mozzavid mozzarella volumetric image dataset | arXiv: 2412.04880
- mpdit multi-patch global-to-local transformer architecture for efficient flow ma | arXiv: 2603.26357
- mpm mutual pair merging for efficient vision transformers | arXiv: 2604.05718
- MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding | arXiv: 2512.02906
- mri contrast enhancement kinetics world model | arXiv: 2602.19285
- MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation | arXiv: 2511.10376
- MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding | arXiv: 2602.22932
- msrl scaling generative multimodal reward modeling via multi-stage reinforcement | arXiv: 2603.25108
- muco multi-turn contrastive learning for multimodal embedding model | arXiv: 2602.06393
- Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following | arXiv: 2511.21662
- multi-modal image fusion via intervention-stable feature learning | arXiv: 2603.23272
- multi-modal representation learning via semi-supervised rate reduction for gener | arXiv: 2602.19910
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models | arXiv: 2603.04846
- Multimodal Classification of Radiation-Induced Contrast Enhancements and Tumor Recurrence Using Deep Learning | arXiv: 2603.11827
- Multimodal OCR: Parse Anything from Documents | arXiv: 2603.13032
- Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation | arXiv: 2603.12845
- MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning | arXiv: 2602.20223
- Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation | arXiv: 2603.12581
- muse harnessing precise and diverse semantics for few-shot whole slide image cla | arXiv: 2602.20873
- must modality-specific representation-aware transformer for diffusion-enhanced s | arXiv: 2603.26071
- MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy | arXiv: 2602.24222
- mv-roma from pairwise matching into multi-view track reconstruction | arXiv: 2603.27542
- mvggt multimodal visual geometry grounded transformer for multiview 3d referring | arXiv: 2601.06874
- MXNorm: Reusing MXFP block scales for efficient tensor normalisation | arXiv: 2603.13180
- MXNorm: Reusing MXFP Block Scales for Efficient Tensor Normalisation | arXiv: 2603.13180
- M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs | arXiv: 2603.09737
- NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries | arXiv: 2603.05446
- nanosd edge efficient foundation model for real time image restoration | arXiv: 2601.09823
- NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval | arXiv: 2603.12824
- Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning | arXiv: 2603.06688
- near coupled neural asset-renderer stack | arXiv: 2511.18600
- nec-diff noise-robust event-raw complementary diffusion for seeing motion in ext | arXiv: 2603.20005
- Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models | arXiv: 2511.16955
- neighbor-aware localized concept erasure in text-to-image diffusion models | arXiv: 2603.25994
- neoverse enhancing 4d world model with in-the-wild monocular videos | arXiv: 2601.00393
- Nerfify: A Multi-Agent Framework for Turning NeRF Papers into Code | arXiv: 2603.00805
- NERFIFY: A Multi-Agent Framework for Automatically Turning NeRF Papers into Runnable Code | arXiv: 2603.00805
- NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training | arXiv: 2602.22059
- Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences | arXiv: 2602.22212
- neural collapse in test-time adaptation | arXiv: 2512.10421
- neural field-based 3d surface reconstruction of microstructures from multi-detec | arXiv: 2508.04728
- Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion | arXiv: 2509.17704
- neuroseg meets dinov3 transferring 2d self-supervised visual priors to 3d neuron | arXiv: 2603.23104
- next-scale autoregressive models for text-to-motion generation | arXiv: 2604.03799
- NI-Tex: Non-isometric Image-based Garment Texture Generation | arXiv: 2511.18765
- No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency | arXiv: 2602.23559
- no hard negatives required concept centric learning leads to compositionality wi | arXiv: 2603.25722
- No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors | arXiv: 2602.23141
- No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection | arXiv: 2602.19248
- Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs | arXiv: 2603.12078
- noise-aware few-shot learning through bi-directional multi-view prompt alignment | arXiv: 2603.11617
- Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment | arXiv: 2603.11617
- noovd novel category discovery and embedding for open-vocabulary object detectio | arXiv: 2603.21069
- NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning | arXiv: 2602.21172
- NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing | arXiv: 2603.02802
- novel anomaly detection scenarios and evaluation metrics to address the ambiguit | arXiv: 2604.07097
- Novel Architecture of RPA In Oral Cancer Lesion Detection | arXiv: 2603.10928
- NTK-Guided Implicit Neural Teaching | arXiv: 2511.15487
- O3N: Omnidirectional Open-Vocabulary Occupancy Prediction | arXiv: 2603.12144
- oars process-aware online alignment for generative real-world image super-resolu
- oars process-aware online alignment for generative real-world image super-resolu | arXiv: 2603.12811
- OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution | arXiv: 2603.12811
- Object-WIPER: Training-Free Object and Associated Effect Removal in Videos | arXiv: 2601.06391
- occany generalized unconstrained urban 3d occupancy | arXiv: 2603.23502
- Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking | arXiv: 2603.06034
- occufly a 3d vision benchmark for semantic scene completion from the aerial pers | arXiv: 2512.20770
- OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models | arXiv: 2603.09326
- off the grid detection of primitives for feed-forward 3d gaussian splatting | arXiv: 2512.15508
- Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments | arXiv: 2602.22025
- omg-bench a new challenging benchmark for skeleton-based online micro hand gestu | arXiv: 2512.16727
- omni-mmsi toward identity-attributed social interaction understanding | arXiv: 2604.00267
- omnifm toward modality-robust and task-agnostic federated learning for heterogen | arXiv: 2603.21660
- OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens | arXiv: 2603.02138
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval | arXiv: 2603.02098
- omnisonic towards universal and holistic audio generation from video and text | arXiv: 2604.04348
- On the Feasibility and Opportunity of Autoregressive 3D Object Detection | arXiv: 2603.07985
- On the Possible Detectability of Image-in-Image Steganography | arXiv: 2603.11876
- on the robustness of diffusion-based image compression to bit-flip errors | arXiv: 2604.05743
- on tokens dilemma dynamic moe with drift-aware token assignment for continual le | arXiv: 2603.27481
- One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers | arXiv: 2603.12245
- OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera | arXiv: 2511.03571
- OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery | arXiv: 2603.17355
- OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting | arXiv: 2603.18510
- Open-Vocabulary Domain Generalization in Urban-Scene Segmentation | arXiv: 2602.18853
- opendpr open-vocabulary change detection via vision-centric diffusion-guided pro | arXiv: 2603.27645
- OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis | arXiv: 2602.22949
- OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments | arXiv: 2603.02390
- openvo open-world visual odometry with temporal dynamics awareness | arXiv: 2602.19035
- opro orthogonal panel-relative operators for panel-aware in-context image genera | arXiv: 2603.27637
- OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation | arXiv: 2509.18600
- Order Matters: 3D Shape Generation from Sequential VR Sketches | arXiv: 2512.04761
- organizing unstructured image collections using natural language | arXiv: 2410.05217
- oslash source models leak what they shouldnt nrightarrow unlearning zero-shot tr | arXiv: 2604.08238
- OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport | arXiv: 2602.20205
- out of sight out of track adversarial attacks on propagation-based multi-object | arXiv: 2604.00452
- Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models | arXiv: 2603.13215
- Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models | arXiv: 2603.07619
- pad-hand physics-aware diffusion for hand motion recovery | arXiv: 2603.26068
- palm progress-aware policy learning via affordance reasoning for long-horizon ro | arXiv: 2601.07060
- pam a pose-appearance-motion engine for sim-to-real hoi video generation | arXiv: 2603.22193
- Pano360: Perspective to Panoramic Vision with Geometric Consistency | arXiv: 2603.12013
- Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image | arXiv: 2603.05908
- PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments | arXiv: 2603.09760
- Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots | arXiv: 2603.13108
- panoramic multimodal semantic occupancy prediction for quadruped robots | arXiv: 2603.13108
- PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery | arXiv: 2603.17571
- Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression | arXiv: 2603.03615
- Parallel In-context Learning for Large Vision Language Models | arXiv: 2603.16092
- Parallelised Differentiable Straightest Geodesics for 3D Meshes | arXiv: 2603.15780
- parameter-efficient prompt tuning and hierarchical textual guidance for few-shot | arXiv: 2603.21504
- parameter-efficient semantic augmentation for enhancing open-vocabulary object d | arXiv: 2604.04444
- particulate feed-forward 3d object articulation | arXiv: 2512.11798
- ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis | arXiv: 2603.09611
- pcstracker long-term scene flow estimation for point cloud sequences | arXiv: 2603.19762
- pe3r perception-efficient 3d reconstruction | arXiv: 2503.07507
- pearl geometry aligns semantics for training-free open-vocabulary semantic segme | arXiv: 2603.21528
- perception characteristics distance measuring stability and robustness of percep | arXiv: 2506.09217
- performrecast expression and head pose disentanglement for portrait video editin | arXiv: 2603.19731
- perturb and recover fine-tuning for effective backdoor removal from clip | arXiv: 2412.00727
- pet-dino unifying visual cues into grounding dino with prompt-enriched training | arXiv: 2604.00503
- PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning | arXiv: 2602.20537
- pgr-net prior-guided roi reasoning network for brain tumor mri segmentation | arXiv: 2603.21626
- PHAC: Promptable Human Amodal Completion | arXiv: 2603.14741
- phantasia context-adaptive backdoors in vision language models | arXiv: 2604.08395
- phantom physics-infused video generation via joint modeling of visual and latent | arXiv: 2604.08503
- PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement | arXiv: 2509.24850
- phasr generalized image shadow removal with physically aligned priors | arXiv: 2601.17470
- phrase-instance alignment for generalized referring segmentation | arXiv: 2411.15087
- phygap physically-grounded gaussians with polarization cues | arXiv: 2603.14001
- physgaia a physics-aware benchmark with multi-body interactions for dynamic nove | arXiv: 2506.02794
- physgen physically grounded 3d shape generation for industrial design | arXiv: 2512.00422
- physgm large physical gaussian model for feed-forward 4d synthesis | arXiv: 2508.13911
- PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis | arXiv: 2508.13911
- PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation | arXiv: 2511.18570
- physhead simulation-ready gaussian head avatars | arXiv: 2604.06467
- Physical Simulator In-the-Loop Video Generation | arXiv: 2603.06408
- physically inspired gaussian splatting for hdr novel view synthesis | arXiv: 2603.28020
- Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction | arXiv: 2603.00149
- physmodpo physically-plausible humanoid motion with preference optimization | arXiv: 2603.13228
- PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization | arXiv: 2603.13228
- physskin real-time and generalizable physics-based animation via self-supervised | arXiv: 2603.23194
- physvid physics aware local conditioning for generative video models | arXiv: 2603.26285
- PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing | arXiv: 2603.04598
- pioneering perceptual video fluency assessment a novel task with benchmark datas | arXiv: 2603.26055
- PIP-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching | arXiv: 2602.20496
- PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction | arXiv: 2603.05888
- Pixel Motion Diffusion Is What We Need for Robot Control | arXiv: 2509.22652
- pixel-level scene understanding in one token visual states need what-is-where co | arXiv: 2603.13904
- Pixel2Phys: Distilling Governing Laws from Visual Dynamics | arXiv: 2602.19516
- pixelrush ultra-fast training-free high-resolution image generation via one-step
- PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion | arXiv: 2602.12769
- Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision | arXiv: 2602.19715
- planareloc camera relocalization in 3d planar primitives via region-based struct | arXiv: 2603.20818
- planning in 8 tokens a compact discrete tokenizer for latent world model | arXiv: 2603.05438
- plant taxonomy meets plant counting a fine-grained taxonomic dataset for countin | arXiv: 2603.21229
- Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers | arXiv: 2511.16156
- PNG: Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning | arXiv: 2603.04870
- PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models | arXiv: 2603.00412
- pointer-cad unifying b-rep and command sequences via pointer-based edges faces s | arXiv: 2603.04337
- Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors | arXiv: 2603.18782
- pointtpa dynamic network parameter adaptation for 3d scene understanding | arXiv: 2604.04933
- POLISH'ing the Sky: Wide-Field and High-Dynamic Range Interferometric Image Reconstruction | arXiv: 2603.09162
- pose-dive pose-diversified augmentation with diffusion model for person re-ident | arXiv: 2406.16042
- posemaster a unified 3d native framework for stylized pose generation | arXiv: 2506.21076
- posteriq a design perspective benchmark for poster understanding and generation | arXiv: 2603.24078
- PPCL: Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers | arXiv: 2511.16156
- pr-iqa partial-reference image quality assessment for diffusion-based novel view | arXiv: 2604.04576
- Precise Object and Effect Removal with Adaptive Target-Aware Attention | arXiv: 2505.22636
- predictive regularization against visual representation degradation in multimoda | arXiv: 2603.20808
- preference-aligned lora merging preserving subspace coverage and addressing dire | arXiv: 2603.26299
- preserving source video realism high-fidelity face swapping for cinematic qualit | arXiv: 2512.07951
- prime once then reprogram locally an efficient alternative to black-box service | arXiv: 2604.01474
- principled steering via null-space projection for jailbreak defense in vision-la | arXiv: 2603.22094
- prism video dataset condensation with progressive refinement and insertion for s | arXiv: 2505.22564
- privi towards a general-purpose video model for primate behavior in the wild | arXiv: 2511.09675
- probabilistic concept graph reasoning for multimodal misinformation detection | arXiv: 2603.25203
- Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models | arXiv: 2602.20501
- ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation | arXiv: 2603.05530
- ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars | arXiv: 2603.16447
- PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On | arXiv: 2603.11675
- Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains | arXiv: 2603.12624
- Prompt-Free Universal Region Proposal Network | arXiv: 2603.17554
- PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts | arXiv: 2603.01650
- Proof-of-Perception: Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees | arXiv: 2603.00324
- proood prototype-guided out-of-distribution 3d occupancy prediction | arXiv: 2604.01081
- Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting | arXiv: 2603.11938
- Prototype-Guided Concept Erasure in Diffusion Models | arXiv: 2603.08271
- ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning | arXiv: 2602.21078
- prue a practical recipe for field boundary segmentation at scale | arXiv: 2603.27101
- Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives | arXiv: 2602.24136
- Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving | arXiv: 2508.13305
- psdesigner automated graphic design with a human-like creative workflow | arXiv: 2603.25738
- psr scaling multi-subject personalized image generation with pairwise subject-co | arXiv: 2512.01236
- ptc-depth pose-refined monocular depth estimation with temporal consistency | arXiv: 2604.01791
- pulse privileged knowledge transfer from rich to deployable sensors for embodied | arXiv: 2510.24058
- PureCC: Pure Learning for Text-to-Image Concept Customization | arXiv: 2603.07561
- purify-then-align towards robust human sensing under modality missing with knowl | arXiv: 2604.05584
- QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment | arXiv: 2603.03726
- QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition | arXiv: 2602.22639
- Quant Experts: Token-aware Adaptive Error Reconstruction for Large VLM Quantization | arXiv: 2602.24059
- Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization | arXiv: 2602.24059
- quantization with unified adaptive distillation to enable multi-lora based one-f | arXiv: 2603.29535
- QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models | arXiv: 2602.20309
- question-guided visual compression with memory feedback for long-term video unde | arXiv: 2603.15167
- R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection | arXiv: 2603.11566
- radar closed-loop robotic data generation via semantic planning and autonomous c | arXiv: 2603.11811
- RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset | arXiv: 2603.11811
- ragtrack language-aware rgbt tracking with retrieval-augmented generation | arXiv: 2603.03617
- RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment | arXiv: 2603.00483
- Random Wins All: Rethinking Grouping Strategies for Vision Tokens | arXiv: 2603.00486
- RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing | arXiv: 2602.19753
- rascene high-fidelity 3d scene imaging with mmwave communication signals | arXiv: 2604.02603
- Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought | arXiv: 2507.07685
- RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution | arXiv: 2603.12493
- RayNova: Scale-Temporal Autoregressive World Modeling in Ray Space | arXiv: 2602.20685
- RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models | arXiv: 2603.14819
- RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation | arXiv: 2603.11106
- rdface a benchmark dataset for rare disease facial image analysis under extreme | arXiv: 2604.03454
- rdnet region proportion-aware dynamic adaptive salient object detection network | arXiv: 2603.12215
- RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images | arXiv: 2603.12215
- Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting | arXiv: 2512.17908
- reag reasoning-augmented generation for knowledge-based visual question answerin | arXiv: 2511.22715
- real-world point tracking with verifier-guided pseudo-labeling | arXiv: 2603.12217
- Real-World Point Tracking with Verifier-Guided Pseudo-Labeling | arXiv: 2603.12217
- real2edit2real generating robotic demonstrations via a 3d control interface | arXiv: 2512.19402
- Reallocating Attention Across Layers to Reduce Multimodal Hallucination | arXiv: 2510.10285
- realm an mllm-agent framework for open world 3d reasoning segmentation and editi | arXiv: 2510.16410
- REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting | arXiv: 2510.16410
- realunify do unified models truly benefit from unification a comprehensive bench | arXiv: 2509.24897
- realvlg-r1 a large-scale real-world visual-language grounding benchmark for robo | arXiv: 2603.14880
- reason-svg enhancing structured reasoning for vector graphics generation with re | arXiv: 2505.24499
- Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics | arXiv: 2601.13401
- reasoning-driven anomaly detection and localization with image-level supervision | arXiv: 2603.27179
- reasonmap towards fine-grained visual reasoning from transit maps | arXiv: 2505.18675
- ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps | arXiv: 2505.18675
- recall recalibrating capability degradation for mllm-based composed image retrie | arXiv: 2602.01639
- Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning | arXiv: 2603.05235
- reconstruction-guided slot curriculum addressing object over-fragmentation in vi | arXiv: 2603.22758
- recover to predict progressive retrospective learning for variable-length trajec | arXiv: 2603.10597
- RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces | arXiv: 2602.20618
- Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress | arXiv: 2603.17312
- Recursive Think-Answer Process for LLMs and VLMs | arXiv: 2603.02099
- recyclelora rank-revealing qr-based dual-lora subspace adaptation for domain gen | arXiv: 2603.28142
- Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback | arXiv: 2603.13057
- Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning | arXiv: 2505.20107
- reflexsplit single image reflection separation via layer fusion-separation | arXiv: 2601.17468
- reframing long-tailed learning via loss landscape geometry | arXiv: 2603.21217
- refton reference person shot assist virtual try-on | arXiv: 2511.00956
- Regularizing INR with Diffusion Prior for Self-Supervised 3D Reconstruction of Neutron Computed Tomography Data | arXiv: 2603.10947
- rehark refined hybrid adaptive rbf kernels for robust one-shot vision-language a | arXiv: 2603.11542
- ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation | arXiv: 2603.11542
- RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model | arXiv: 2509.24948
- reinforce to learn elect to reason a dual paradigm for video reasoning | arXiv: 2604.04379
- reinforcing structured chain-of-thought for video understanding | arXiv: 2603.25942
- Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration | arXiv: 2603.12951
- REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion | arXiv: 2601.16788
- Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing | arXiv: 2603.17531
- ReLaGS: Relational Language Gaussian Splatting | arXiv: 2603.17605
- Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection | arXiv: 2603.18541
- remogen real-time human interaction-to-reaction generation via modular learning | arXiv: 2604.01082
- ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding | arXiv: 2602.16412
- ReMoT: Reinforcement Learning with Motion Contrast Triplets | arXiv: 2603.00461
- renderflow single-step neural rendering via flow matching | arXiv: 2601.06928
- Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery | arXiv: 2603.01034
- Representation Learning for Spatiotemporal Physical Systems | arXiv: 2603.13227
- RESBev: Making BEV Perception More Robust | arXiv: 2603.09529
- rescene4d temporally consistent semantic instance segmentation of evolving indoo | arXiv: 2601.11508
- residual decoding mitigating hallucinations in large vision-language models via | arXiv: 2602.01047
- Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning | arXiv: 2603.12816
- resolving the identity crisis in text-to-image generation | arXiv: 2510.01399
- Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation | arXiv: 2603.02139
- Rethinking Concept Bottleneck Models: From Pitfalls to Solutions | arXiv: 2603.05629
- Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token | arXiv: 2603.19026
- rethinking pose refinement in 3d gaussian splatting under pose prior and geometr | arXiv: 2603.16538
- rethinking position embedding as a context controller for multi-reference and mu | arXiv: 2604.03738
- Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model | arXiv: 2410.07547
- Rethinking VLMs for Image Forgery Detection and Localization | arXiv: 2603.12930
- RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting | arXiv: 2603.13783
- Retrieving Counterfactuals Improves Visual In-Context Learning | arXiv: 2603.16737
- Revisiting Model Stitching In the Foundation Model Era | arXiv: 2603.12433
- Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach | arXiv: 2511.16786
- revisiting unknowns towards effective and efficient open-set active learning | arXiv: 2603.07898
- Reviving ConvNeXt for Efficient Convolutional Diffusion Models | arXiv: 2603.09408
- rewardflow generate images by optimizing what you reward | arXiv: 2604.08536
- ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction | arXiv: 2601.16672
- rewis3d reconstruction improves weakly-supervised semantic segmentation | arXiv: 2603.06374
- Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation | arXiv: 2603.06374
- rho robust holistic osm-based metric cross-view geo-localization | arXiv: 2603.27758
- riskprop collision-anchored self-supervised risk propagation for early accident | arXiv: 2603.27165
- rl-scaniqa reinforcement-learned scanpaths for blind 360image quality assessment
- rng a unified transformer for complete 3d modeling from partial observations
- roboagent chaining basic capabilities for embodied task planning | arXiv: 2604.07774
- robotseg a model and dataset for segmenting robots in image and video | arXiv: 2511.22950
- robust multi-source covid-19 detection in ct images | arXiv: 2604.03320
- RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations | arXiv: 2602.22013
- Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods | arXiv: 2603.13077
- rs-ssm refining forgotten specifics in state space model for video semantic segm | arXiv: 2603.24295
- rsonet region-guided selective optimization network for rgb-t salient object det | arXiv: 2603.12685
- RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection | arXiv: 2603.12685
- S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds | arXiv: 2512.00995
- saber spatially consistent 3d universal adversarial objects for bev detectors | arXiv: 2505.22499
- SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World | arXiv: 2602.18887
- SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning | arXiv: 2603.05437
- saliency-r1 enforcing interpretable and faithful vision-language reasoning via s | arXiv: 2604.04500
- salmubench a benchmark for sensitive association-level multimodal unlearning | arXiv: 2603.26316
- sampling-aware 3d spatial analysis in multiplexed imaging | arXiv: 2604.07890
- SAP: Segment Any 4K Panorama | arXiv: 2603.12759
- sapave towards active perception and manipulation in vision-language-action mode | arXiv: 2603.12193
- SaPaVe: Towards Active Perception and Manipulation in VLA Models for Robotics | arXiv: 2603.12193
- sarmae masked autoencoder for sar representation learning | arXiv: 2512.16635
- sattc structure-aware label-free test-time calibration for cross-subject eeg-to- | arXiv: 2603.20738
- sava-x ego-to-exo imitation error detection via scene-adaptive view alignment an | arXiv: 2603.12764
- SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion | arXiv: 2603.12764
- save speech-aware video representation learning for video-text retrieval | arXiv: 2603.08224
- SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval | arXiv: 2603.08224
- scalable object relation encoding for better 3d spatial reasoning in large langu | arXiv: 2603.24721
- scaling spatial intelligence with multimodal foundation models | arXiv: 2511.13719
- Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework | arXiv: 2603.07659
- scaling the long video understanding of multimodal large language models via vis | arXiv: 2603.29252
- Scaling View Synthesis Transformers (SVSM) | arXiv: 2602.21341
- scaling-aware data selection for end-to-end autonomous driving systems | arXiv: 2604.08366
- scene grounding in the wild | arXiv: 2603.26584
- scene-vlm multimodal video scene segmentation via vision-language models | arXiv: 2512.21778
- sceneassistant a visual feedback agent for open-vocabulary 3d scene generation | arXiv: 2603.12238
- SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation | arXiv: 2603.12238
- scenescribe-1m a large-scale video dataset with comprehensive geometric and sema | arXiv: 2604.07990
- SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation | arXiv: 2603.06572
- SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated Learning | arXiv: 2603.12976
- score2instruct scaling up video quality-centric instructions via automated dimen | arXiv: 2506.21011
- sdf-net structure-aware disentangled feature learning for optical-sar ship re-i | arXiv: 2603.12588
- SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification | arXiv: 2603.12588
- SEA-Vision: A Multilingual Benchmark for Document and Scene Text Understanding in Southeast Asia | arXiv: 2603.15409
- seacache spectral-evolution-aware cache for accelerating diffusion models | arXiv: 2602.18993
- searchad large-scale rare image retrieval dataset for autonomous driving | arXiv: 2604.08008
- see it say it sorted an iterative training-free framework for visually-grounded | arXiv: 2602.21497
- See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs | arXiv: 2602.21497
- See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles | arXiv: 2509.13615
- See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles (StaR) | arXiv: 2509.13615
- Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation | arXiv: 2603.15475
- Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness | arXiv: 2602.19615
- seeing is improving visual feedback for iterative text layout refinement | arXiv: 2603.22187
- seeing without pixels perception from camera trajectories | arXiv: 2511.21681
- seethrough3d occlusion aware 3d control in text-to-image generation | arXiv: 2602.23359
- seeu seeing the unseen world via 4d dynamics-aware generation | arXiv: 2512.03350
- SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models | arXiv: 2507.14811
- select hypothesize and verify towards verified neuron concept interpretation | arXiv: 2603.24953
- self-consistency for llm-based motion trajectory generation and verification | arXiv: 2603.29301
- self-corrected image generation with explainable latent rewards | arXiv: 2603.24965
- semantic audio-visual navigation in continuous environments | arXiv: 2603.19660
- Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation | arXiv: 2603.05202
- Semantic Satellite Communications for Synchronized Audiovisual Reconstruction | arXiv: 2603.10791
- semantic satellite communications for synchronized audiovisual reconstruction | arXiv: 2603.10791
- Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score | arXiv: 2505.21147
- SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation | arXiv: 2603.11616
- semlayer semantic-aware generative segmentation and layer construction for abstr | arXiv: 2603.24039
- SG-NLF: Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis | arXiv: 2603.12903
- sgad-slam splatting gaussians at adjusted depth for better radiance fields in rg | arXiv: 2603.21055
- sgi structured 2d gaussians for efficient and compact large image representation
- SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data | arXiv: 2603.02505
- shape-of-you fused gromov-wasserstein optimal transport for semantic corresponde | arXiv: 2603.11618
- sharp short-window streaming for accurate and robust prediction in motion foreca | arXiv: 2603.28091
- ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration | arXiv: 2603.00906
- shoe semantic hoi open-vocabulary evaluation metric | arXiv: 2604.01586
- shoe style-invariant and ground-aware learning for dense foot contact estimation | arXiv: 2511.22184
- Show, Don't Tell: Detecting Novel Objects by Watching Human Videos | arXiv: 2603.12751
- show3d capturing scenes of 3d hands and objects in the wild | arXiv: 2603.28760
- showtable unlocking creative table visualization with collaborative reflection a | arXiv: 2512.13303
- SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules | arXiv: 2603.12307
- Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning | arXiv: 2602.18867
- SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images | arXiv: 2602.20412
- simpact simulation-enabled action planning using vision-language models | arXiv: 2512.05955
- SimRecon: SimReady Compositional Scene Reconstruction from Real Videos | arXiv: 2603.02133
- SimScale: Learning to Drive via Real-World Simulation at Scale | arXiv: 2511.23369
- SineProject: Machine Unlearning for Stable Vision–Language Alignment | arXiv: 2511.18444
- Single Pixel Image Classification using an Ultrafast Digital Light Projector | arXiv: 2603.12036
- SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation | arXiv: 2603.18599
- skeletoncontext skeleton-side context prompt learning for zero-shot skeleton-bas | arXiv: 2603.29692
- Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation | arXiv: 2603.02190
- sketchdeco training-free latent composition for precise sketch colourisation | arXiv: 2405.18716
- sky2ground a benchmark for site modeling under varying altitude | arXiv: 2603.13740
- sldprtnet a large-scale multimodal dataset for cad generation in language-driven | arXiv: 2603.13098
- SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design | arXiv: 2603.13098
- slice semantic latent injection via compartmentalized embedding for image waterm | arXiv: 2603.12749
- SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking | arXiv: 2603.12749
- slotvtg object-centric adapter for generalizable video temporal grounding | arXiv: 2603.25733
- slvmeval synthetic meta evaluation benchmark for text-to-long video generation | arXiv: 2603.29186
- small target detection based on mask-enhanced attention fusion of visible and in | arXiv: 2603.06925
- Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images | arXiv: 2603.06925
- soda sensitivity-oriented dynamic acceleration for diffusion transformer | arXiv: 2603.07057
- SOLACE: Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards | arXiv: 2603.00918
- Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion | arXiv: 2603.16939
- solution for 10th competition on ambivalencehesitancy ah video recognition chall | arXiv: 2603.16939
- Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors | arXiv: 2603.00882
- sonoworld from one image to a 3d audio-visual scene | arXiv: 2603.28757
- SoPE: Spherical Coordinate-Based Positional Embedding for 3D LVLMs | arXiv: 2602.22716
- SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs | arXiv: 2602.22716
- souple enhancing audio-visual localization and segmentation with learnable promp | arXiv: 2603.22732
- SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection | arXiv: 2511.06702
- spar single-pass any-resolution vit for open-vocabulary segmentation | arXiv: 2604.02252
- SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs | arXiv: 2603.12382
- Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis | arXiv: 2603.10526
- sparsecam4d spatio-temporally consistent 4d reconstruction from sparse cameras | arXiv: 2603.26481
- sparsity-aware voxel attention and foreground modulation for 3d semantic scene c | arXiv: 2604.05780
- sparvar exploring sparsity in visual autoregressive modeling for training-free a | arXiv: 2602.04361
- spatial-ssrl enhancing spatial understanding via self-supervised reinforcement l | arXiv: 2510.27606
- Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning | arXiv: 2510.27606
- SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models | arXiv: 2602.20901
- spatialstack layered geometry-language fusion for 3d vlm spatial reasoning | arXiv: 2603.27437
- Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation | arXiv: 2603.12538
- spdmark selective parameter displacement for robust video watermarking | arXiv: 2512.12090
- specificity-aware reinforcement learning for fine-grained open-world classificat | arXiv: 2603.03197
- spectral defense against resource-targeting attack in 3d gaussian splatting | arXiv: 2603.12796
- Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting | arXiv: 2603.12796
- Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization | arXiv: 2603.00920
- Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis | arXiv: 2603.12903
- Speed3R: Sparse Feed-forward 3D Reconstruction Models | arXiv: 2603.08055
- Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists | arXiv: 2603.09277
- SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation | arXiv: 2603.11492
- SpHOR: A Representation Learning Perspective on Open-set Recognition | arXiv: 2503.08049
- SpHOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Neural Networks | arXiv: 2503.08049
- SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking | arXiv: 2602.23963
- spiraldiff spiral diffusion with lora for rgb-to-raw conversion across cameras | arXiv: 2603.14885
- SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting | arXiv: 2602.24020
- SSR2-GCD: Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery | arXiv: 2602.19910
- stable spike dual consistency optimization via bitwise and operations for spikin | arXiv: 2603.11676
- stac plug-and-play spatio-temporal aware cache compression for streaming 3d reco | arXiv: 2603.20284
- Stake the Points: Structure-Faithful Instance Unlearning | arXiv: 2603.12915
- Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging | arXiv: 2603.18834
- STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction | arXiv: 2511.19854
- Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning | arXiv: 2603.11439
- STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting | arXiv: 2509.25210
- steeldefectx a coarse-to-fine vision-language dataset and benchmark for generali | arXiv: 2603.21824
- Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering | arXiv: 2603.13878
- STEPH: Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in WSI Prognosis | arXiv: 2603.10526
- stepper stepwise immersive scene generation with multiview panoramas | arXiv: 2603.28980
- StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives | arXiv: 2602.21273
- streamavatar streaming diffusion models for real-time interactive human avatars | arXiv: 2512.22065
- streamdit real-time streaming text-to-video generation | arXiv: 2507.03745
- streamgaze gaze-guided temporal reasoning and proactive understanding in streami | arXiv: 2512.01707
- StreamingTOM: Streaming Token Compression for Efficient Video Understanding | arXiv: 2510.18269
- StreamReady: Learning What to Answer and When in Long Streaming Videos | arXiv: 2603.08620
- stronger normalization-free transformers | arXiv: 2512.10938
- StructXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues | arXiv: 2602.20089
- subflot submodel extraction for efficient and personalized federated learning vi | arXiv: 2604.06631
- SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling | arXiv: 2602.23013
- suppressing non-semantic noise in masked image modeling representations | arXiv: 2604.00172
- svc 2026 the second multimodal deception detection challenge and the first domai | arXiv: 2604.05748
- swift sliding window reconstruction for few-shot training-free generated video a | arXiv: 2603.08536
- SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation | arXiv: 2603.19053
- SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls | arXiv: 2602.23956
- symphomotion joint control of camera motion and object dynamics for coherent vid | arXiv: 2604.03723
- Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos | arXiv: 2503.22174
- t-gated adapter a lightweight temporal adapter for vision-language medical segme | arXiv: 2604.08167
- tacsim a dataset and benchmark for football tactical style imitation | arXiv: 2603.25199
- tag-moe task-aware gating for unified generative mixture-of-experts | arXiv: 2601.08881
- TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking | arXiv: 2512.01329
- Talking Together: Synthesizing Co-Located 3D Conversations from Audio | arXiv: 2603.08674
- TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction | arXiv: 2512.02341
- talon test-time adaptive learning for on-the-fly category discovery | arXiv: 2603.08075
- Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning | arXiv: 2512.24146
- taming sampling perturbations with variance expansion loss for latent diffusion | arXiv: 2603.21085
- Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework | arXiv: 2603.10281
- taming video models for 3d and 4d generation via zero-shot camera control | arXiv: 2509.15130
- TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration | arXiv: 2603.03792
- task-oriented data synthesis and control-rectify sampling for remote sensing sem | arXiv: 2512.16740
- TAUE: Training-free Noise Transplant and Cultivation Diffusion Model | arXiv: 2511.02580
- Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models | arXiv: 2603.00431
- TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration | arXiv: 2603.02943
- tdatr improving end-to-end table recognition via table detail-aware learning and | arXiv: 2603.22819
- team leya in 10th abaw competition multimodal ambivalencehesitancy recognition a | arXiv: 2603.12848
- Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach | arXiv: 2603.12848
- team ras in 10th abaw competition multimodal valence and arousal estimation appr | arXiv: 2603.13056
- Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach | arXiv: 2603.13056
- TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size | arXiv: 2603.07988
- TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models | arXiv: 2511.21145
- TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation | arXiv: 2602.19053
- TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures | arXiv: 2602.19679
- tell model where to look mitigating hallucinations in mllms by vision-guided att | arXiv: 2511.20032
- Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model | arXiv: 2603.05012
- temporal imbalance of positive and negative supervision in class-incremental lea | arXiv: 2603.02280
- terraseg self-supervised ground segmentation for any lidar | arXiv: 2603.27344
- Test-Time Attention Purification for Backdoored Large Vision Language Models | arXiv: 2603.12989
- test-time ego-exo-centric adaptation for action anticipation via multi-label pro | arXiv: 2603.09798
- test-time instance-specific parameter composition a new paradigm for adaptive ge | arXiv: 2603.27665
- text-guided fine-grained video anomaly understanding | arXiv: 2511.00524
- text-image conditioned 3d generation | arXiv: 2603.21295
- Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction | arXiv: 2512.04309
- Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval | arXiv: 2603.12711
- f2hdr two-stage hdr video reconstruction via flow adapter and physical m | arXiv: 2603.14920
- 4dsurf high-fidelity dynamic scene surface reconstruction | arXiv: 2603.28064
- TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering | arXiv: 2602.20903
- The Coherence Trap: MLLM-Crafted Narratives Exploit Manipulated Visual Contexts | arXiv: 2505.17476
- The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts | arXiv: 2505.17476
- the cote score a decomposable framework for evaluating document layout analysis | arXiv: 2603.12718
- The Devil is in the Details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection | arXiv: 2512.20340
- the golden subspace where efficiency meets generalization in continual test-time | arXiv: 2603.21928
- The Invisible Gorilla Effect in Out-of-distribution Detection | arXiv: 2602.20068
- the llm bottleneck why open-source vision llms struggle with hierarchical visual | arXiv: 2505.24840
- the more the merrier contrastive fusion for higher-order multimodal alignment | arXiv: 2511.21331
- The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers | arXiv: 2602.19096
- the surprising effectiveness of noise pretraining for implicit neural representa | arXiv: 2603.29034
- the universal normal embedding | arXiv: 2603.21786
- think 360 evaluating the width-centric reasoning capability of mllms beyond dept | arXiv: 2603.22689
- Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding | arXiv: 2603.04977
- thinking diffusion penalize and guide visual-grounded reasoning in diffusion mul | arXiv: 2604.05497
- Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World | arXiv: 2603.12746
- TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking | arXiv: 2602.18863
- tiger a unified framework for time images and geo-location retrieval | arXiv: 2603.24749
- timelens rethinking video temporal grounding with multimodal llms | arXiv: 2512.14698
- TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models | arXiv: 2603.17828
- tiny inference-time scaling with latent verifiers | arXiv: 2603.22492
- tm-bsn triangular-masked blind-spot network for real-world self-supervised image | arXiv: 2604.04484
- token reduction via local and global contexts optimization for efficient video l | arXiv: 2603.01400
- token warping helps mllms look from nearby viewpoints | arXiv: 2604.02870
- Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans (HouseMind) | arXiv: 2603.11640
- Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity | arXiv: 2603.10990
- Topo-R1: Detecting Topological Anomalies via Vision-Language Models | arXiv: 2603.13054
- topomaskv3 3d mask head with dense offset and height predictions for road topolo | arXiv: 2603.01558
- topomesh high-fidelity mesh autoencoding via topological unification | arXiv: 2603.24278
- toward generalizable whole brain representations with high-resolution light-shee | arXiv: 2603.29842
- toward real-world infrared image super-resolution a unified autoregressive frame
- towards balanced multi modal learning in 3d human pose estimation
- Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation | arXiv: 2501.05264
- Towards Calibrating Prompt Tuning of Vision-Language Models | arXiv: 2602.19024
- towards context-aware image anonymization with multi-agent reasoning | arXiv: 2603.27817
- Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data | arXiv: 2508.01450
- Towards Faithful Multimodal Concept Bottleneck Models | arXiv: 2603.13163
- towards generalizable ai-generated image detection via image-adaptive prompt lea | arXiv: 2508.01603
- Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning | arXiv: 2508.01603
- towards gui agents vision-language diffusion models for gui grounding | arXiv: 2603.26211
- towards high-quality image segmentation improving topology accuracy by penalizin | arXiv: 2603.18671
- towards highly transferable vision-language attack via semantic-augmented dynami | arXiv: 2603.04839
- Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction | arXiv: 2603.04839
- towards intrinsic-aware monocular 3d object detection | arXiv: 2603.27059
- Towards Multimodal Domain Generalization with Few Labels | arXiv: 2602.22917
- towards open environments and instructions general vision-language navigation vi | arXiv: 2601.09111
- towards real-world document parsing via realistic scene synthesis and document-a | arXiv: 2603.23885
- towards robust content watermarking against removal and forgery attacks | arXiv: 2604.06662
- Towards Source-Aware Object Swapping with Initial Noise Perturbation | arXiv: 2602.23697
- Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos | arXiv: 2603.13185
- towards training-free scene text editing | arXiv: 2603.24571
- towards universal computational aberration correction in photographic cameras a
- TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast | arXiv: 2506.13387
- trace structure-aware character encoding for robust and generalizable document w | arXiv: 2603.12873
- TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking | arXiv: 2603.12873
- trackmae video representation learning via track mask and predict | arXiv: 2603.27268
- training high-level schedulers with execution-feedback reinforcement learning fo | arXiv: 2511.22235
- Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods | arXiv: 2603.15026
- Training-free Motion Factorization for Compositional Video Generation | arXiv: 2603.09104
- trajtok learning trajectory tokens enables better video understanding | arXiv: 2602.22779
- TrajTok: Learning Trajectory Tokens Enables Better Video Understanding | arXiv: 2602.22779
- transformer-based multi-region segmentation and radiomic analysis of hr-pqct ima | arXiv: 2603.09137
- Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis | arXiv: 2602.19585
- tridf evaluating perception detection and hallucination for interpretable deepfa | arXiv: 2512.10652
- TriLite: Efficient WSOL with Universal Visual Features and Tri-Region Disentanglement | arXiv: 2602.23120
- trivia self-supervised fine-tuning of vision-language models for table recogniti | arXiv: 2512.01248
- TT-Occ: Test-Time 3D Occupancy Prediction | arXiv: 2503.08485
- tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction | arXiv: 2602.20160
- tutor-student reinforcement learning a dynamic curriculum for robust deepfake de | arXiv: 2603.24139
- U-F²-CBM: CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models | arXiv: 2503.10981
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation | arXiv: 2602.23739
- U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences | arXiv: 2512.02982
- ucan unified convolutional attention network for expansive receptive fields in l
- uetrack a unified and efficient framework for single object tracking
- ufvideo towards unified fine-grained video cooperative understanding with large | arXiv: 2512.11336
- ultrasound-clip semantic-aware contrastive pre-training for ultrasound image-tex | arXiv: 2604.01749
- unblur-slam dense neural slam for blurry inputs | arXiv: 2603.26810
- Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos | arXiv: 2603.00881
- uncertainty-aware knowledge distillation for multimodal large language models | arXiv: 2603.21426
- uncertainty-guided compositional alignment with part-to-whole semantic represent | arXiv: 2603.22042
- understanding and mitigating hallucinations in multimodal chain-of-thought model | arXiv: 2603.27201
- understanding task transfer in vision-language models | arXiv: 2511.18787
- understanding temporal logic consistency in video-language models through cross- | arXiv: 2510.08138
- understanding the role of hallucination in reinforcement post-training of multim | arXiv: 2604.03179
- Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation | arXiv: 2511.18281
- uniavgen unified audio and video generation with asymmetric cross-modal interact | arXiv: 2511.03334
- UNICBench: UNIfied Counting Benchmark for MLLM | arXiv: 2603.00595
- UniComp: Rethinking Video Compression Through Informational Uniqueness | arXiv: 2512.03575
- unidex a robot foundation suite for universal dexterous hand control from egocen | arXiv: 2603.22264
- unified primitive proxies for structured shape completion | arXiv: 2601.00759
- unified spatiotemporal token compression for video-llms at ultra-low retention | arXiv: 2603.21957
- unified spherical frontend learning rotation-equivariant representations of sphe | arXiv: 2511.18174
- unified vector floorplan generation via markup representation | arXiv: 2604.04859
- UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation | arXiv: 2603.14214
- unigame turning a unified multimodal model into its own adversary | arXiv: 2511.19413
- unils end-to-end audio-driven avatars for unified listening and speaking | arXiv: 2512.09327
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark | arXiv: 2603.05075
- unimmad unified multi-modal and multi-class anomaly detection via moe-driven fea | arXiv: 2509.25934
- UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression | arXiv: 2509.25934
- unirain unified image deraining with rag-based dataset distillation and multi-ob
- unispector towards universal open-set defect recognition via spectral-contrastiv | arXiv: 2604.02905
- unistainnet foundation-model-guided virtual staining of he to ihc | arXiv: 2603.12716
- UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation | arXiv: 2603.01418
- Universal 3D Shape Matching via Coarse-to-Fine Language Guidance | arXiv: 2602.19112
- unleashing video language models for fine-grained hrct report generation | arXiv: 2603.12469
- unleashing vision-language semantics for deepfake video detection | arXiv: 2603.24454
- Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation | arXiv: 2603.05729
- unlocking multi-site clinical data a federated approach to privacy-first child a | arXiv: 2604.02616
- unlocking positive transfer in incrementally learning surgical instruments a sel | arXiv: 2604.02877
- unlocking strong supervision a data-centric study of general-purpose audio pre-t | arXiv: 2603.25767
- UnrealPose: Leveraging Game Engine Kinematics for Large-Scale Synthetic Human Pose Data | arXiv: 2601.00991
- unsafe2safe controllable image anonymization for downstream utility | arXiv: 2603.28605
- unsupervised domain adaptation with target-only margin disparity discrepancy | arXiv: 2603.09932
- using gaussian splats to create high-fidelity facial geometry and texture | arXiv: 2512.16397
- utptrack towards simple and unified token pruning for visual tracking | arXiv: 2602.23734
- UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes | arXiv: 2512.04421
- V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs | arXiv: 2511.20223
- v-bridge bridging video generative priors to versatile few-shot image restoratio | arXiv: 2603.13089
- V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration | arXiv: 2603.13089
- V2Drop: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models | arXiv: 2509.01552
- vanast virtual try-on with human image animation via synthetic triplet supervisi | arXiv: 2604.04934
- Variation-Aware Vision Token Dropping for Faster Large Vision-Language Models | arXiv: 2509.01552
- variational garrote for sparse inverse problems
- VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM | arXiv: 2603.09673
- vecattention vector-wise sparse attention for accelerating long context inferenc | arXiv: 2603.29494
- VecGlypher: Unified Vector Glyph Generation with Language Models | arXiv: 2602.21461
- VeCoR — Velocity Contrastive Regularization for Flow Matching | arXiv: 2511.18942
- Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping | arXiv: 2602.23980
- verify claimed text-to-image models via boundary-aware prompt optimization | arXiv: 2603.26328
- versecrafter dynamic realistic video world model with 4d geometric control | arXiv: 2601.05138
- VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale | arXiv: 2602.23361
- VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving | arXiv: 2602.20794
- VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection | arXiv: 2603.00912
- vggt-slam | arXiv: 2604.06830
- video-only tom enhancing theory of mind in multimodal large language models | arXiv: 2603.24484
- videoarm agentic reasoning over hierarchical memory for long-form video understa | arXiv: 2512.12360
- videoauto-r1 video auto reasoning via thinking once answering twice | arXiv: 2601.05175
- videochat-m1 collaborative policy planning for video understanding via multi-age | arXiv: 2511.19524
- VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning | arXiv: 2511.19524
- videocof unified video editing with temporal reasoner | arXiv: 2512.07469
- VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion | arXiv: 2503.23359
- videomt your vit is secretly also a video segmentation model | arXiv: 2602.17807
- VidEoMT: Your ViT is Secretly Also a Video Segmentation Model | arXiv: 2602.17807
- videoseek long-horizon video agent with tool-guided seeking | arXiv: 2603.20185
- vihoi human-object interaction synthesis with visual priors | arXiv: 2603.24383
- Vinedresser3D: Agentic Text-guided 3D Editing | arXiv: 2602.19542
- ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking | arXiv: 2512.14654
- VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation | arXiv: 2603.12918
- viro robust and efficient neuro-symbolic reasoning with verification for referri | arXiv: 2601.12781
- VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection | arXiv: 2603.17470
- virst video-instructed reasoning assistant for spatiotemporal segmentation | arXiv: 2603.27060
- Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code | arXiv: 2501.18328
- VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding | arXiv: 2603.07071
- vision on request enhanced vllm efficiency with sparse dynamically selected visi | arXiv: 2603.23495
- Vision Transformers Need More Than Registers | arXiv: 2602.22394
- vision-language attribute disentanglement and reinforcement for lifelong person | arXiv: 2603.19678
- Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning | arXiv: 2603.08921
- VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models | arXiv: 2603.00207
- vistorybench comprehensive benchmark suite for story visualization | arXiv: 2505.24862
- visualad language-free zero-shot anomaly detection via vision transformer | arXiv: 2603.07952
- ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos | arXiv: 2603.04265
- VL-RouterBench: A Benchmark for Vision-Language Model Routing | arXiv: 2512.23562
- VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery | arXiv: 2602.19180
- VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models | arXiv: 2603.09826
- VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm | arXiv: 2512.02700
- vrr-qa visual relational reasoning in videos beyond explicit cues | arXiv: 2506.21742
- vt-intrinsic physics-based decomposition of reflectance and shading using a sing | arXiv: 2509.10388
- WaDi: Weight Direction-aware Distillation for One-step Image Synthesis | arXiv: 2603.08258
- walkgpt grounded vision-language conversation with depth-aware segmentation for | arXiv: 2603.10703
- wan-weaver interleaved multi-modal generation via decoupled training | arXiv: 2603.25706
- wanderland geometrically grounded simulation for open-world embodied ai | arXiv: 2511.20620
- Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI | arXiv: 2511.20620
- watch and learn learning to use computers from online videos | arXiv: 2510.04673
- Watch and Learn: Learning to Use Computers from Online Videos | arXiv: 2510.04673
- wavelet-based frame selection by detecting semantic boundary for long video unde | arXiv: 2603.00512
- weakly supervised teacher-student framework with progressive pseudo-mask refinem | arXiv: 2603.08605
- Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation | arXiv: 2603.08605
- Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning | arXiv: 2603.00550
- WeaveTime: Frame-Level Step-by-Step Memory for Streaming Video LLMs | arXiv: 2602.22142
- What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models | arXiv: 2603.00510
- what is the optimal ranking score between precision and recall we can always fin | arXiv: 2511.22442
- what is wrong with synthetic data for scene text recognition a strong synthetic | arXiv: 2602.06450
- What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching? | arXiv: 2504.16930
- when identities collapse a stress-test benchmark for multi-subject personalizati | arXiv: 2603.26078
- when numbers speak aligning textual numerals and visual instances in text-to-vid | arXiv: 2604.08546
- When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models | arXiv: 2511.21192
- When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance | arXiv: 2602.20880
- When to Lock Attention: Training-Free KV Control in Video Diffusion | arXiv: 2603.09657
- when to think and when to look uncertainty-guided lookback | arXiv: 2511.15613
- When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs | arXiv: 2512.07580
- when understanding becomes a risk authenticity and safety risks in the emerging | arXiv: 2603.24079
- Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation | arXiv: 2509.22496
- Where, What, Why: Toward Explainable 3D-GS Watermarking | arXiv: 2603.08809
- which concepts to forget and how to refuse decomposing concepts for continual un | arXiv: 2603.21484
- Why Does It Look There? Structured Explanations for Image Classification | arXiv: 2603.10234
- widget2code from visual widgets to ui code via multimodal llms | arXiv: 2512.19918
- wildcap facial albedo capture in the wild via hybrid inverse rendering | arXiv: 2512.11237
- WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval | arXiv: 2602.23029
- WMGStereo: What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching? | arXiv: 2504.16930
- World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training | arXiv: 2509.24948
- worldmm dynamic multimodal memory agent for long video reasoning | arXiv: 2512.02425
- x-win building chest radiograph world model via predictive sensing | arXiv: 2511.14918
- x2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space | arXiv: 2603.16671
- xseg a large-scale x-ray contraband segmentation benchmark for real-world securi | arXiv: 2604.03706
- Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion | arXiv: 2511.18734
- Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation | arXiv: 2505.19459
- Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image | arXiv: 2603.14772
- zina multimodal fine-grained hallucination detection and editing | arXiv: 2506.13130
- ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training | arXiv: 2603.13115
- ada3drift adaptive training-time drifting for one-step 3d visuomotor robotic man | arXiv: 2603.11984
- ada3drift adaptive trainingtime drifting for onest | arXiv: 2603.11984
- affostruction 3d affordance grounding with generative reconstruction | arXiv: 2601.09211
- apc adversarial point counterattack | arXiv: 2604.15708
- cari4d category agnostic 4d reconstruction of human object interaction | arXiv: 2512.11988
- cube bspline 3d faces | arXiv: 2604.12894
- deepshapematchingkit accelerated functional map solver | arXiv: 2604.10377
- fall risk gait analysis hmr | arXiv: 2604.11961
- ff3r feedforward feature 3d reconstruction from unconstrained views | arXiv: 2604.09862
- freescale scaling 3d scenes | arXiv: 2604.10512
- iris bringing realworld priors into diffusion model for monocular depth estimation | arXiv: 2603.16340
- long scope fully sparse long range cooperative 3d perception | arXiv: 2604.09206
- lumimotion gaussian relighting dynamics | arXiv: 2604.10994
- marco semantic correspondence | arXiv: 2604.18267
- neural gabor splatting | arXiv: 2604.15941
- ng gs nerf guided 3d gaussian splatting segmentation | arXiv: 2604.14706
- nimbusgs unified 3d scene reconstruction under hybrid weather | arXiv: 2603.27228
- pointins instance-aware self-supervised learning for point clouds | arXiv: 2603.25165
- reliev3r relieving feed-forward 3d reconstruction from multi-view geometric annot | arXiv: 2604.00548
- rewis3d reconstruction improves weakly-supervised semantic segmentation | arXiv: 2603.06374
- rewis3d reconstruction improves weaklysupervised s | arXiv: 2603.06374
- rng unified transformer complete 3d modeling partial observations | arXiv: 2603.01194
- sasnet spatially adaptive sinusoidal networks for inrs | arXiv: 2503.09750
- sepatch3d revisiting token compression for accelerating vit based sparse 3d detectors | arXiv: 2604.14563
- sgi structured 2d gaussians large image representation | arXiv: 2603.07789
- sgs-intrinsic semantic-invariant gaussian splatting for sparse-view indoor invers | arXiv: 2603.27516
- sts mixer 4d point cloud | arXiv: 2604.11637
- tco learning 3d reconstruction with priors in test time | arXiv: 2604.03878
- towards spatio-temporal world scene graph generation from monocular videos | arXiv: 2603.13185
- unisplat 3d representations unposed | arXiv: 2604.10573
- clustermark robust watermarking autoregressive image generators | arXiv: 2508.06656
- clustermark towards robust watermarking for autoregressive image generators with | arXiv: 2508.06656
- logitdynamics vit error detection | arXiv: 2604.10643
- mcsd uncertainty estimation | arXiv: 2604.12719
- one-to-more high-fidelity training-free anomaly generation with attention control | arXiv: 2603.18093
- team leya in 10th abaw competition multimodal ambi | arXiv: 2603.12848
- unim a unified any-to-any interleaved multimodal benchmark | arXiv: 2603.05075
- vidscribe multimodal ai customizing audio description videos | arXiv: 2603.14662
- vidscribe multimodal ai for customizing audio description and question answering | arXiv: 2603.14662
- a prediction-as-perception framework for 3d object detection | arXiv: 2603.12599
- a predictionasperception framework for 3d object d | arXiv: 2603.12599
- c2t llm traffic coordination | arXiv: 2604.13098
- climaood improving anomaly segmentation via physically realistic synthetic data | arXiv: 2512.02686
- den tp a density balanced data curation and evaluation framework for trajectory | arXiv: 2409.17385
- fedbprompt federated domain generalization person | arXiv: 2603.12912
- fedbprompt federated domain generalization person re-identification via body dis | arXiv: 2603.12912
- igasa integrated geometry-aware and skip-attention modules for enhanced point cl | arXiv: 2603.12719
- igasa integrated geometryaware and skipattention m | arXiv: 2603.12719
- leader lidar relocalization | arXiv: 2604.11355
- mapgclr geospatial contrastive learning of represe | arXiv: 2603.10688
- mapgclr geospatial contrastive learning of representations for online vectorized | arXiv: 2603.10688
- neural distribution prior for lidar ood detection | arXiv: 2604.09232
- open-vocabulary domain generalization in urban-scene segmentation | arXiv: 2602.18853
- sparseworld tc trajectory conditioned sparse occupancy world model | arXiv: 2511.22039
- traffic scene generation from natural language description for autonomous vehicl | arXiv: 2409.09575
- ttsg text to traffic scene generation from natural language | arXiv: 2409.09575
- vla world learning vision language action world models for autonomous driving | arXiv: 2604.09059
- cipher counterfactual diffusion hallucination sup | arXiv: 2603.10470
- codepercept code-grounded visual stem perception for mllms | arXiv: 2603.10757
- codepercept codegrounded visual stem perception fo | arXiv: 2603.10757
- geotikzbridge advancing multimodal code generation for geometric perception and | arXiv: 2603.22687
- mm-recoder advancing chart-to-code generation with reinforcement learning and se | arXiv: 2604.01600
- evolutionary multimodal reasoning via hierarchical semantic representation for i | arXiv: 2603.03827
- m3kg rag multi hop multimodal knowledge graph enhanced retrieval augmented genera | arXiv: 2512.20136
- a two stage dual modality model for facial expression recognition | arXiv: 2603.12221
- decovln decoupling observation reasoning and correction for vision-and-language | arXiv: 2603.13133
- efficient onboard spacecraft pose estimation with event cameras and neuromorphic hardware | arXiv: 2604.04117
- from 2d alignment to 3d plausibility unifying hete | arXiv: 2503.17788
- fsmc-pose frequency and spatial fusion with multiscale selfcalibration for cattle | arXiv: 2603.16596
- fsmc pose cattle mounting pose estimation | arXiv: 2603.16596
- fsmc pose frequency spatial cattle mounting pose | arXiv: 2603.16596
- handdreamer zero shot text to 3d hand model generation | arXiv: 2604.04425
- hum4d markerless motion capture | arXiv: 2604.12765
- l2gtx from local to global time series explanation | arXiv: 2603.13065
- l2gtx from local to global time series explanations | arXiv: 2603.13065
- lca large-scale codec avatars the unreasonable effectiveness of large-scale avata | arXiv: 2604.02320
- mmgait multi modal gait recognition | arXiv: 2604.15979
- molingo motion-language alignment for text-to-motion generation | arXiv: 2512.13840
- quantvla scale-calibrated post-training quantization for vision-language-action | arXiv: 2602.20309
- ram recover any 3d human motion in-the-wild | arXiv: 2603.19929
- reference-free image quality assessment for virtual try-on via human feedback | arXiv: 2603.13057
- referencefree image quality assessment for virtual | arXiv: 2603.13057
- regformer transferable relational grounding for weakly-supervised hoi detection | arXiv: 2604.00507
- rppg vqa video quality assessment | arXiv: 2604.11156
- team ras in 10th abaw competition multimodal valen | arXiv: 2603.13056
- 4dsurf high-fidelity dynamic scene surface reconstruction | arXiv: 2603.28064
- vibes a conversational agent with behaviorally intelligent 3d virtual body | arXiv: 2512.14234
- ahs adaptive head synthesis | arXiv: 2604.15857
- circuit mechanisms for spatial relation generation in diffusion models | arXiv: 2601.06338
- cognitioncapturerpro towards high-fidelity visual decoding from eegmeg via multi | arXiv: 2603.12722
- cognitioncapturerpro towards highfidelity visual d | arXiv: 2603.12722
- craft aligning diffusion models with finetuning is easier than you think | arXiv: 2603.18991
- dcw snr t bias diffusion | arXiv: 2604.16044
- deco frequency-decoupled pixel diffusion for end-to-end image generation | arXiv: 2511.19365
- depthvar depth adaptive var | arXiv: 2604.17286
- dit-ic aligned diffusion transformer for efficient image compression | arXiv: 2603.13162
- ditic aligned diffusion transformer for efficient | arXiv: 2603.13162
- editing away the evidence diffusion-based image manipulation and the failure mod | arXiv: 2603.12949
- editing away the evidence diffusionbased image man | arXiv: 2603.12949
- emf meanflow text to image | arXiv: 2604.18168
- evatok adaptive length video tokenization for eff | arXiv: 2603.12267
- fdeidtoolbox face deidentification toolbox | arXiv: 2603.13121
- fractals made practical denoising diffusion as par | arXiv: 2603.13069
- fractals made practical denoising diffusion as partitioned iterated function sys | arXiv: 2603.13069
- freqflow frequency aware flow matching | arXiv: 2604.15521
- gist towards design compositing | arXiv: 2604.14605
- groce graph-guided online concept erasure for text-to-image diffusion models | arXiv: 2511.12968
- haltnav reactive visual halting over lightweight t | arXiv: 2603.12696
- haltnav reactive visual halting over lightweight topological priors for robust v | arXiv: 2603.12696
- intra finger variability of diffusion based latent fingerprint generation | arXiv: 2604.10040
- leapalign post training flow matching models at any generation step | arXiv: 2604.15311
- multibanana a challenging benchmark for multi reference text to image generation | arXiv: 2511.22989
- oars processaware online alignment for generative | arXiv: 2603.12811
- pixeldit pixel diffusion transformers for image generation | arXiv: 2511.20645
- smoothing score function generalization diffusion models | arXiv: 2601.19285
- smoothing the score function for generalization in diffusion models | arXiv: 2601.19285
- tokenlight precise lighting control in images using attribute tokens | arXiv: 2604.15310
- vosr a vision only generative model for image super resolution | arXiv: 2604.03225
- yoeo you only erase once erasing anything without bringing unexpected content | arXiv: 2603.27599
- drfusion degradation robust fusion via degradation aware diffusion framework | arXiv: 2604.08922
- evlf early vision-language fusion for generative dataset distillation | arXiv: 2603.07476
- finpercep rm a fine grained reward model and co evolutionary curriculum for rl ba | arXiv: 2512.22647
- finpercep rm fine grained reward model rl super resolution | arXiv: 2512.22647
- gsnr graph smooth null space representation for inverse problems | arXiv: 2602.20328
- ia clahe image adaptive clip limit | arXiv: 2604.16010
- ntire 2026 ai flash portrait challenge | arXiv: 2604.11230
- ntire 2026 raindrop removal challenge | arXiv: 2604.10634
- rar restore assess repeat a unified framework for iterative image restoration | arXiv: 2603.26385
- real iisr infrared image super resolution autoregressive | arXiv: 2603.04745
- sat selective aggregation transformer for image super resolution | arXiv: 2604.07994
- selfhvd self-supervised handheld video deblurring | arXiv: 2508.08605
- shadow removal cascaded refinement | arXiv: 2604.16177
- ucan unified convolutional attention lightweight sr | arXiv: 2603.11680
- udapose unsupervised domain adaptation for low light human pose estimation | arXiv: 2604.10485
- uniblendnet unified global multi scale and region adaptive modeling for ambient lighting normalization | arXiv: 2604.13383
- unicac universal computational aberration correction | arXiv: 2603.12083
- unicac universal computational aberration correction benchmark | arXiv: 2603.12083
- unirain unified image deraining rag dataset distillation | arXiv: 2603.03967
- unirain unified image deraining with rag based dataset distillation and multi obje | arXiv: 2603.03967
- beyond global similarity towards fine-grained multi-condition multimodal retriev | arXiv: 2603.01082
- cc-vqa conflict- and correlation-aware method for mitigating knowledge conflict | arXiv: 2602.23952
- explaining clip zero-shot predictions through concepts | arXiv: 2603.28211
- m4-rag a massive-scale multilingual multi-cultural multimodal rag | arXiv: 2512.05959
- mind the way you select negative texts pursuing the distance consistency in ood | arXiv: 2603.02618
- muco multi-turn contrastive learning for multimodal embedding model | arXiv: 2602.06393
- nanovdr distilling a 2b vision-language retriever into a 70m text-only encoder f | arXiv: 2603.12824
- nanovdr distilling a 2b visionlanguage retriever i | arXiv: 2603.12824
- robustvisrag causality-aware vision-based retrieval-augmented generation under v | arXiv: 2602.22013
- beyond semantics disentangling information scope in sparse autoencoders for clip | arXiv: 2604.05724
- beyond the fold quantifying split-level noise and the case for leave-one-dataset | arXiv: 2604.02162
- ciice intrinsic concept extraction compositional | arXiv: 2603.11795
- cut to the chase training-free multimodal summarization via chain-of-events | arXiv: 2603.06213
- dino-qpm adapting visual foundation models for globally interpretable image clas | arXiv: 2604.07166
- draft and refine with visual experts | arXiv: 2511.11005
- edit-as-act goal-regressive planning for open-vocabulary 3d indoor scene editing | arXiv: 2603.17583
- emoverse a mllms-driven emotion representation dataset for interpretable visual | arXiv: 2511.12554
- emoverse mllm emotion representation dataset | arXiv: 2511.12554
- ermoe eigen-reparameterized mixture-of-experts for stable routing | arXiv: 2511.10971
- feature attribution stability suite how stable are post-hoc attributions | arXiv: 2604.02532
- finer mllms hallucinate under fine-grained negative queries | arXiv: 2603.17662
- from weights to concepts data-free interpretability of clip via singular vector | arXiv: 2603.24653
- geometry-guided camera motion understanding in videollms | arXiv: 2603.13119
- geometryguided camera motion understanding in vide | arXiv: 2603.13119
- how to take a memorable picture empowering users with actionable feedback | arXiv: 2602.21877
- inside-out measuring generalization in vision transformers through inner working | arXiv: 2604.08192
- language models can explain visual features via steering | arXiv: 2603.22593
- measuring the unfaithfulness of concept-based explanations | arXiv: 2504.10833
- missing no more dictionary-guided cross-modal image fusion under missing infrare | arXiv: 2603.08018
- neurodynamics-driven coupled neural p systems for multi-focus image fusion | arXiv: 2509.17704
- on the possible detectability of image-in-image steganography | arXiv: 2603.11876
- on the possible detectability of imageinimage steg | arXiv: 2603.11876
- pixel2phys distilling governing laws from visual dynamics | arXiv: 2602.19516
- reallocating attention across layers to reduce multimodal hallucination | arXiv: 2510.10285
- reallocating attention reduce hallucination | arXiv: 2510.10285
- recursive think-answer process for llms and vlms | arXiv: 2603.02099
- safedrive fine-grained safety reasoning for end-to-end driving in a sparse world | arXiv: 2602.18887
- subspacead training-free few-shot anomaly detection via subspace modeling | arXiv: 2602.23013
- tdatr improving end-to-end table recognition via table detail-aware learning and | arXiv: 2603.22819
- text-guided fine-grained video anomaly understanding | arXiv: 2511.00524
- towards faithful multimodal concept bottleneck models | arXiv: 2603.13163
- viro robust and efficient neuro-symbolic reasoning with verification for referri | arXiv: 2601.12781
- where mllms attend and what they rely on explaining autoregressive token generat | arXiv: 2509.22496
- why does it look there structured explanations for image classification | arXiv: 2603.10234
- attribution-guided model rectification of unreliable neural network behaviors | arXiv: 2603.15656
- argos agentic multi camera person search | arXiv: 2604.12762
- echotrail-gui building actionable memory for gui agents | arXiv: 2512.19396
- epiagent agent centric system for ancient inscription restoration | arXiv: 2604.09367
- gen n val agentic image data generation and validation | arXiv: 2506.04676
- haven hierarchical long video understanding audiovisual entity | arXiv: 2601.13719
- haven hierarchical long video understanding with audiovisual entity cohesion | arXiv: 2601.13719
- nerfify multiagent nerf paper to code | arXiv: 2603.00805
- bias reward models t2i | arXiv: 2604.13305
- adabet gradient-free layer selection for efficient training of deep neural netwo | arXiv: 2510.03101
- cross-scale pansharpening via scaleformer and the panscale benchmark | arXiv: 2603.00543
- cryohype reconstructing a thousand cryo-em structures with transformer-based hyp | arXiv: 2512.06332
- enhancing out-of-distribution detection with extended logit normalization | arXiv: 2504.11434
- flow3r factored flow prediction for scalable visual geometry learning | arXiv: 2602.20157
- free-grained hierarchical visual recognition | arXiv: 2510.14737
- hess head sensitivity score for sparsity redistribution in vggt | arXiv: 2603.25336
- hier-cos making deep features hierarchy-aware via composition of orthogonal subs | arXiv: 2503.07853
- hiercos making deep features hierarchyaware via co | arXiv: 2503.07853
- hycal training free prototype calibration for cross discipline fscil | arXiv: 2604.15678
- out of sight out of mind evaluating state evolutio | arXiv: 2603.13215
- out of sight out of mind evaluating state evolution in video world models | arXiv: 2603.13215
- pioneering perceptual video fluency assessment a novel task with benchmark datas | arXiv: 2603.26055
- r2g multi view circuit graph benchmark suite from rtl to gdsii | arXiv: 2604.08810
- reflexsplit single image reflection separation via layer fusion-separation | arXiv: 2601.17468
- reframing long-tailed learning via loss landscape geometry | arXiv: 2603.21217
- sattc structure-aware label-free test-time calibration for cross-subject eeg-to- | arXiv: 2603.20738
- semi-supervised conformal prediction with unlabeled nonconformity score | arXiv: 2505.21147
- sparsecam4d spatio-temporally consistent 4d reconstruction from sparse cameras | arXiv: 2603.26481
- tacsim a dataset and benchmark for football tactical style imitation | arXiv: 2603.25199
- temporal imbalance of positive and negative supervision in class-incremental lea | arXiv: 2603.02280
- vga bench unified benchmark for video aesthetics and generation quality | arXiv: 2604.10127
- weakly supervised video anomaly detection with anomaly-connected components and | arXiv: 2603.00550
- bi cmpstereo bidirectional cross modal prompting for event frame asymmetric stereo | arXiv: 2604.15312
- cops conditional prompt synthesis for zero-shot anomaly detection | arXiv: 2508.03447
- perception programs visual tool reasoning | arXiv: 2604.12896
- sign language recognition llms | arXiv: 2604.11225
- defending unauthorized model merging via dual-stage weight protection | arXiv: 2511.11851
- evidential transformation network post hoc uncertainty estimation | arXiv: 2604.08627
- flowmotion training-free flow guidance for video motion transfer | arXiv: 2603.06289
- linking modality isolation in heterogeneous collaborative perception | arXiv: 2603.00609
- lottiegpt vector animation generation | arXiv: 2604.11792
- mxnorm reusing mxfp block scales for efficient ten | arXiv: 2603.13180
- mxnorm reusing mxfp block scales for efficient tensor normalisation | arXiv: 2603.13180
- watch and learn computer use from videos | arXiv: 2510.04673
- watch and learn learning to use computers from online videos | arXiv: 2510.04673
- graze grounded refinement and motion-aware zero-shot generation | arXiv: 2604.01383
- latent chain-of-thought world modeling for end-to-end autonomous driving | arXiv: 2512.10226
- association and consolidation evolutionary memory-enhanced incremental multi-vie | arXiv: 2509.14544
- blind spot of adaptation quantifying and mitigating forgetting in fine tuned driving models | arXiv: 2604.04857
- damp class unlearning via depth aware removal of forget specific directions | arXiv: 2604.15166
- designing to forget deep semi-parametric models for unlearning | arXiv: 2603.22870
- elastic weight consolidation done right for continual learning | arXiv: 2603.18596
- learning from oblivion predicting knowledge overflowed weights via retrodiction | arXiv: 2508.05059
- \(\oslash\) source models leak what they shouldn't \(\nrightarrow\) unlearning zero-shot tr | arXiv: 2604.08238
- select hypothesize and verify towards verified neuron concept interpretation | arXiv: 2603.24953
- sineproject machine unlearning for stable vision language alignment | arXiv: 2511.18444
- addressing data scarcity in 3d trauma detection th | arXiv: 2603.12514
- addressing data scarcity in 3d trauma detection through self-supervised and semi | arXiv: 2603.12514
- apex adaptive visual prompting | arXiv: 2604.17455
- cloe expert consistency learning for missing modal | arXiv: 2603.09316
- cloe expert consistency learning for missing modality segmentation | arXiv: 2603.09316
- decoupling vision and language codebook anchored visual adaptation | arXiv: 2602.19449
- deep learning-based assessment of the relation betw | arXiv: 2603.11850
- developing foundation models for universal segment | arXiv: 2603.11627
- developing foundation models for universal segmentation from 3d whole-body posit | arXiv: 2603.11627
- emad evidence-centric grounded multimodal diagnosis for alzheimers disease | arXiv: 2602.19178
- equivania a spectral method for rotation-equivariant anisotropic image analysis | arXiv: 2603.11294
- equivania a spectral method for rotationequivarian | arXiv: 2603.11294
- event level detection of surgical instrument handovers in videos | arXiv: 2604.07577
- forecasting epileptic seizures from contactless ca | arXiv: 2603.12887
- forecasting epileptic seizures from contactless camera via cross-species transfe | arXiv: 2603.12887
- forge continual learning for fmri based brain disorder diagnosis | arXiv: 2604.14259
- gleam a multimodal imaging dataset and hamm for gl | arXiv: 2603.12800
- human knowledge integrated multi-modal learning for single source domain general | arXiv: 2603.12369
- human knowledge integrated multimodal learning for | arXiv: 2603.12369
- invad inversion-based reconstruction-free anomaly detection with diffusion model | arXiv: 2504.05662
- invad inversionbased reconstructionfree anomaly de | arXiv: 2504.05662
- lemon a large endoscopic monocular dataset and foundation model for perception in | arXiv: 2503.19740
- lemon large endoscopic monocular dataset foundation model surgical | arXiv: 2503.19740
- relativeflow taming medical image denoising learning with noisy reference | arXiv: 2604.15459
- residual sodap residual self-organizing domain-adaptive prompting with structura | arXiv: 2603.12816
- residual sodap residual selforganizing domainadapt | arXiv: 2603.12816
- robust fair disease diagnosis in ct images | arXiv: 2604.09710
- sd fsmis adapting stable diffusion for few shot medical image segmentation | arXiv: 2604.03134
- semitooth a generalizable semi-supervised framework for multi-source tooth segme | arXiv: 2603.11616
- semitooth a generalizable semisupervised framework | arXiv: 2603.11616
- transformer-based multi-region segmentation and radiomic analysis of hr-pqct ima | arXiv: 2603.09137
- uncertainty-aware concept and motion segmentation for semi-supervised angiograph | arXiv: 2603.00881
- 4d rgpt toward region level 4d understanding via perceptual distillation | arXiv: 2512.17012
- adversarial concept distillation for one-step diffusion personalization | arXiv: 2510.20512
- batch loss score for dynamic data pruning | arXiv: 2604.04681
- enhancing mixture of experts specialization via cluster aware upcycling | arXiv: 2604.13508
- flashvggt efficient and scalable visual geometry transformers with compressed descr | arXiv: 2512.01540
- frequency switching mechanism for parameter-efficient multi-task learning | arXiv: 2603.21111
- iapl ai-generated image detection adaptive prompt | arXiv: 2508.01603
- llava-le large language-and-vision assistant for lunar exploration | arXiv: 2603.24696
- mame and mare matrix based token merging and restoration for efficient visual perception and synthesis | arXiv: 2604.13432
- memory efficient transfer learning with fading side networks | arXiv: 2604.09088
- mine-jepa in-domain self-supervised learning for mine-like object classification | arXiv: 2604.00383
- opad adversarial concept distillation for one-step diffusion personalization | arXiv: 2510.20512
- rdvq differentiable vq image compression | arXiv: 2604.10546
- understanding and enforcing weight disentanglement in task arithmetic | arXiv: 2604.17078
- wpt world-to-policy transfer via online world model distillation | arXiv: 2511.20095
- mmtit-bench a multilingual and multi-scenario benchmark with cognition-perceptio | arXiv: 2603.23896
- sea-vision a multilingual benchmark for comprehensive document and scene text un | arXiv: 2603.15409
- aif adaptive information flow vlm | arXiv: 2604.15809
- av speakerbench audiovisual human speech understanding mllms | arXiv: 2512.02231
- ava vla improving vision language action models with active visual attention | arXiv: 2511.18960
- biclip domain canonicalization via structured geometric transformation | arXiv: 2603.08942
- coat cbm concept wise attention | arXiv: 2604.15748
- comp collaborative multi-mode pruning for vision-language models | arXiv: 2604.02956
- cropvlm learning to zoom for fine grained vision language perception | arXiv: 2511.19820
- dictionary aligned concept control for safeguarding multimodal llms | arXiv: 2604.08846
- do vision language models need to process image tokens | arXiv: 2604.09425
- docseeker long document understanding | arXiv: 2604.12812
- dsert roll robust multi modal perception for diverse driving conditions | arXiv: 2604.03685
- ebmc multimodal sentiment analysis | arXiv: 2604.12518
- fairllava fairness-aware parameter-efficient fine-tuning for large vision-langua | arXiv: 2603.26008
- flowcomposer composable flows for compositional zero-shot learning | arXiv: 2603.16641
- flowhijack dynamics aware backdoor attack on flow matching vla models | arXiv: 2604.09651
- g mixer geodesic mixup based implicit semantic expansion for zero shot cir | arXiv: 2604.14710
- hog layout hierarchical 3d scene generation optimization and editing | arXiv: 2604.10772
- isoclip decomposing clip projectors for efficient intramodal alignment | arXiv: 2603.19862
- kec hierarchical textual knowledge clustering | arXiv: 2604.11144
- lfpc learning to focus and precise cropping for mllms | arXiv: 2603.27494
- medic-ad towards medical vision-language models clinical intelligence | arXiv: 2603.27176
- mmrad multimodal anomaly detection | arXiv: 2604.10971
- modix positional index scaling | arXiv: 2604.12537
- mupo all roads lead to rome incentivizing divergent thinking in vlms | arXiv: 2604.00479
- nano-emox unifying multimodal emotional intelligence from perception to empathy | arXiv: 2603.02123
- noise-aware few-shot learning through bidirectional | arXiv: 2603.11617
- paddleocr-vl boosting document parsing efficiency and performance with coarse | arXiv: 2603.24326
- paddleocr vl coarse to fine document parsing | arXiv: 2603.24326
- paddleocr vl document parsing coarse to fine visual processing | arXiv: 2603.24326
- personavlm long term personalized multimodal llms | arXiv: 2604.13074
- physisinone visual physics learning and reasoning in one suite | arXiv: 2604.09415
- pop proof of perception conformal reasoning | arXiv: 2603.00324
- rehearsevla simulated post-training for vlas with physically-consistent world mo | arXiv: 2509.24948
- rehearsevla simulated posttraining world model | arXiv: 2509.24948
- relational visual similarity | arXiv: 2512.07833
- responses fall short of understanding gap between internal representations and responses in vdu | arXiv: 2604.04411
- scipostgen bridging the gap between scientific papers and poster layouts | arXiv: 2511.22490
- seatrack multimodal tracker | arXiv: 2604.12502
- see hear and understand benchmarking audiovisual human speech understanding in mul | arXiv: 2512.02231
- seeing through touch tactile localization | arXiv: 2604.11579
- spatialscore towards comprehensive evaluation for spatial intelligence | arXiv: 2505.17012
- think 360 evaluating the width-centric reasoning capability of mllms beyond dept | arXiv: 2603.22689
- tipsv2 patch text alignment | arXiv: 2604.12012
- treeteaming autonomous red-teaming of vision-language models via hierarchical s | arXiv: 2603.22882
- treeteaming autonomous red teaming vlm strategy exploration | arXiv: 2603.22882
- treeteaming autonomous red teaming vlm strategy tree | arXiv: 2603.22882
- unbiased dynamic multimodal fusion | arXiv: 2603.19681
- vecglypher unified vector glyph generation with language models | arXiv: 2602.21461
- vikey enhancing temporal understanding in videos via visual prompting | arXiv: 2603.23186
- vs bench evaluating vlms for strategic abilities in multi agent environments | arXiv: 2506.02387
- weavetime streaming video llm memory | arXiv: 2602.22142
- beyond global scores fine grained token grounding as robust detector of lvlm hallucinations | arXiv: 2604.04863
- detecting unknown objects via energy-based separation | arXiv: 2603.29954
- dreamvideo-omni omni-motion controlled multi-subject video customization with la | arXiv: 2603.12257
- dreamvideoomni omnimotion controlled multisubject | arXiv: 2603.12257
- geobridge semantic-anchored multi-view foundation model for geo-localization | arXiv: 2512.02697
- herod heuristic inspired reasoning data efficient rod | arXiv: 2603.24166
- mitigating memorization in text-to-image diffusion via region-aware prompt augme | arXiv: 2603.13070
- mitigating memorization in texttoimage diffusion v | arXiv: 2603.13070
- paq-detr learning pattern and quality-aware dynamic queries for object detection | arXiv: 2603.06917
- radar closed-loop robotic data generation via seman | arXiv: 2603.11811
- rehark refined hybrid adaptive rbf kernels for rob | arXiv: 2603.11542
- slice semantic latent injection via compartmentali | arXiv: 2603.12749
- uavgen visual prototype conditioned focal region generation for uav based object detection | arXiv: 2604.02966
- enhancing visual representation with textual semantics textual semantics powered p | arXiv: 2503.13543
- fedtsp textual semantics powered prototypes heterogeneous fl | arXiv: 2503.13543
- otprune distribution-aligned visual token pruning via optimal transport | arXiv: 2602.20205
- crowdsourcing of real world image annotation via visual properties | arXiv: 2604.14449
- do vision models perceive illusory motion in static images like humans | arXiv: 2604.09853
- feat federated geometry aware correction for exemplar replay under continual dynamic heterogeneity | arXiv: 2604.08617
- lovif 2026 semantic quality assessment challenge | arXiv: 2604.11207
- myovision a mobile research tool and neatboost attention ensemble framework | arXiv: 2604.13456
- omnifood8k nutrition estimation | arXiv: 2604.12356
- sldprtnet a largescale multimodal dataset for cad | arXiv: CAD generation
- v nutri nutrition estimation cooking videos | arXiv: 2604.11913
- vit3 unlocking test time training in vision | arXiv: 2512.01643
- qkd quantum gated incremental learning | arXiv: 2604.11112
- linking perception confidence and accuracy in mllms | arXiv: 2603.12149
- msrl scaling generative multimodal reward modeling | arXiv: 2603.25108
- conflated inverse urban vegetation | arXiv: 2604.13028
- geoflow real-time fine-grained cross-view geolocalization | arXiv: 2603.21943
- geommbench and geommagent toward expert level multimodal intelligence in geoscience and remote sensing | arXiv: 2604.08896
- pretrained image matchers for sar optical satellite registration | arXiv: 2604.10217
- cyclemanip enabling cyclic task manipulation via effective historical percepti | arXiv: 2512.01022
- deepsketcher internalizing visual manipulation for multimodal reasoning | arXiv: 2509.25866
- diagnose correct and learn from manipulation failures | arXiv: 2512.02787
- enc-bench a benchmark for evaluating multimodal large language models in electro | arXiv: 2603.22763
- finecog nav fine grained cognitive modules for zero shot uav navigation | arXiv: 2604.16298
- igen scalable data generation for robot learning from open-world images | arXiv: 2512.01773
- sapave active perception manipulation vla roboti | arXiv: 2603.12193
- strnet visual navigation with spatio-temporal representation through dynamic gra | arXiv: 2604.02829
- boundary segment action segmentation | arXiv: 2604.01859
- empowering semantic-sensitive underwater image enhancement with vlm | arXiv: 2603.12773
- empowering semanticsensitive underwater image enha | arXiv: 2603.12773
- geomprompt rgbd segmentation | arXiv: 2604.11585
- low data supervised adaptation outperforms prompting for cloud segmentation | arXiv: 2604.08956
- occsam bench occlusion robustness segmentation | arXiv: 2604.11711
- pca-seg revisiting cost aggregation for open-vocabulary semantic and part segmentat | arXiv: 2603.17520
- pca seg cost aggregation open vocabulary segmentation | arXiv: 2603.17520
- pca seg parallel cost aggregation open vocabulary segmentation | arXiv: 2603.17520
- pixdlm uav reasoning segmentation | arXiv: 2604.15670
- sddf specificity-driven dynamic focusing for open-vocabulary camouflaged object | arXiv: 2603.26109
- wsrvos weakly supervised rvos | arXiv: 2604.17797
- a stitch in time learning procedural workflow via self supervised plackett luce r | arXiv: 2511.17805
- an optimal transport driven approach for cultivating latent space in online incr | arXiv: 2211.16780
- com pt chain of models pretraining | arXiv: 2604.12391
- group dinomics incorporating people dynamics into dino for self supervised group activity feature learning | arXiv: 2604.04467
- momo mars orbital model foundation model for mars orbital applications | arXiv: 2604.02719
- omnigcd abstracting generalized category discovery for modality agnosticism | arXiv: 2604.14762
- otc optimal transport cultivating latent space online incremental learning | arXiv: 2211.16780
- redepth anything test-time depth refinement via self-supervised re-lighting | arXiv: 2512.17908
- robustness of vision foundation models to common perturbations | arXiv: 2604.14973
- unigeoclip geospatial contrastive | arXiv: 2604.11668
- zero ablation overstates register content dependence in dino vision transformers | arXiv: 2604.14433
- clay conditional visual similarity | arXiv: 2604.11539
- as language models scale low-order linear depth dynamics emerge | arXiv: 2603.12541
- as language models scale loworder linear depth dyn | arXiv: 2603.12541
- learning from synthetic data via provenance-based input gradient guidance | arXiv: 2604.02946
- revisiting unknowns towards effective and efficient open-set active learning | arXiv: 2603.07898
- activityforensics a comprehensive benchmark for localizing manipulated activity | arXiv: 2604.03819
- anti-i2v safeguarding your photos from malicious image-to-video generation | arXiv: 2603.24570
- autocut end-to-end advertisement video editing based on multimodal discretizatio | arXiv: 2603.28366
- chain of event-centric causal thought for physically plausible video generation | arXiv: 2603.09094
- compressed-domain-aware online video super-resolution | arXiv: 2603.07694
- cubecomposer spatio-temporal autoregressive 4k 360 video generation from perspec | arXiv: 2603.04291
- diff4splat controllable 4d scene generation with latent dynamic reconstruction m | arXiv: 2511.00503
- disca accelerating video diffusion transformers wi | arXiv: 2602.05449
- disca accelerating video diffusion transformers with distillation-compatible lea | arXiv: 2602.05449
- dreamshot storyboard synthesis | arXiv: 2604.17195
- drivelaw unifying planning and video generation in a latent driving world | arXiv: 2512.23421
- fastlightgen fast and light video generation with fewer steps and parameters | arXiv: 2603.01685
- first frame is the place to go for video content customization | arXiv: 2511.15700
- flashmotion few-step controllable video generation with trajectory guidance | arXiv: 2603.12146
- flashmotion fewstep controllable video generation | arXiv: 2603.12146
- free-lunch long video generation via layer-adaptive ood correction | arXiv: 2603.25209
- from static to dynamic exploring self-supervised image-to-video representation t | arXiv: 2603.26597
- generative neural video compression via video diffusion prior | arXiv: 2512.05016
- geometry-as-context modulating explicit 3d in scene-consistent video generation | arXiv: 2602.21929
- gloria consistent character video generation via content anchors | arXiv: 2603.29931
- goal-driven reward by video diffusion models for reinforcement learning | arXiv: 2512.00961
- identity-preserving image-to-video generation via reward-guided optimization | arXiv: 2510.14255
- infinity-rope action-controllable infinite video generation emerges from autoreg | arXiv: 2511.20649
- interpretable motion-attentive maps spatio-temporally localizing concepts in vid | arXiv: 2603.02919
- lamp language-assisted motion planning for controllable video generation | arXiv: 2512.03619
- let your image move with your motion -- implicit multi-object multi-motion trans | arXiv: 2603.01000
- lighting-grounded video generation with renderer-based agent reasoning | arXiv: 2604.07966
- lightmover generative light movement with color and intensity controls | arXiv: 2603.27209
- linvideo a post-training framework towards o(n) attention in efficient video gener | arXiv: 2510.08318
- linvideo linear attention video generation | arXiv: 2510.08318
- moviedrive multimodal multiview video diffusion | arXiv: 2508.14327
- moviedrive urban scene synthesis with multi-modal multi-view video diffusion tra | arXiv: 2508.14327
- neoverse enhancing 4d world model with in-the-wild monocular videos | arXiv: 2601.00393
- nova sparse control dense synthesis for pair-free video editing | arXiv: 2603.02802
- orbital video 3d foundation priors | arXiv: 2604.12309
- pam a pose-appearance-motion engine for sim-to-real hoi video generation | arXiv: 2603.22193
- performrecast expression and head pose disentanglement for portrait video editin | arXiv: 2603.19731
- phantom physics-infused video generation via joint modeling of visual and latent | arXiv: 2604.08503
- physical simulator in-the-loop video generation | arXiv: 2603.06408
- posegen in-context lora finetuning for pose-controllable long human video genera | arXiv: 2508.05091
- rethinking position embedding as a context controller for multi-reference and mu | arXiv: 2604.03738
- seeu seeing the unseen world via 4d dynamics-aware generation | arXiv: 2512.03350
- semantic satellite communications for synchronized | arXiv: 2603.10791
- semantic satellite communications for synchronized audiovisual reconstruction | arXiv: 2603.10791
- slvmeval synthetic meta evaluation benchmark for text-to-long video generation | arXiv: 2603.29186
- streamdit real-time streaming text-to-video generation | arXiv: 2507.03745
- swift sliding window reconstruction for few-shot training-free generated video a | arXiv: 2603.08536
- switchcraft training-free multi-event video generation with attention controls | arXiv: 2602.23956
- symphomotion joint control of camera motion and object dynamics for coherent vid | arXiv: 2604.03723
- tear temporal-aware automated red-teaming for text-to-video models | arXiv: 2511.21145
- the devil is in the details enhancing video virtual try-on via keyframe-driven d | arXiv: 2512.20340
- training-free motion factorization for compositional video generation | arXiv: 2603.09104
- u-mind a unified framework for real-time multimodal interaction with audiovisual | arXiv: 2602.23739
- uniavgen unified audio and video generation with asymmetric cross-modal interact | arXiv: 2511.03334
- unified camera positional encoding for controlled video generation | arXiv: 2512.07237
- unitalking a unified audio-video framework for talking portrait generation | arXiv: 2603.01418
- vanast virtual try-on with human image animation via synthetic triplet supervisi | arXiv: 2604.04934
- videocof unified video editing with temporal reasoner | arXiv: 2512.07469
- when numbers speak aligning textual numerals and visual instances in text-to-vid | arXiv: 2604.08546
- when to lock attention training-free kv control in video diffusion | arXiv: 2603.09657
- adaspark adaptive sparsity for efficient long video understanding | arXiv: 2604.08077
- chronotrack temporally consistent long term memory for 3d single object tracking | arXiv: 2604.13789
- dual-level adaptation for multi-object tracking building test-time calibration from | arXiv: 2603.21629
- envisioning the future one step at a time | arXiv: 2604.09527
- event6d event-based novel object 6d pose tracking | arXiv: 2603.28045
- how should video llms output time | arXiv: 2604.08966
- humanvbench probing human centric video understanding in mllms with automatica | arXiv: 2412.17574
- humanvbench probing human centric video understanding mllms | arXiv: 2412.17574
- ninja codes neurally generated fiducial markers for stealthy 6-dof tracking | arXiv: 2510.18976
- seen to scene keep the seen generate the unseen for video outpainting | arXiv: 2604.14648
- storm referring multi object tracking | arXiv: 2604.10527
- svagent storyline guided long video understanding via cross modal multi agent collaboration | arXiv: 2604.05079
- tcei dual level adaptation multi object tracking | arXiv: 2603.21629
- tcei test time calibration experience intuition mot | arXiv: 2603.21629
- u2flow uncertainty aware unsupervised optical flow estimation | arXiv: 2604.10056
- vidtag video gps geolocalization | arXiv: 2604.12159
- vsi visual-subtitle integration for keyframe selection to enhance long video un | arXiv: 2508.06869