ICCV2025 Paper Notes TODO
Total: 2019 papers | Completed: 1518 | Pending update: 501
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining | arXiv: 2501.00958
- 2handedafforder learning precise actionable bimanual affordances from human vide | arXiv: 2503.09320
- 3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
- 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation | arXiv: 2507.01367
- 3D Mesh Editing using Masked LRMs | arXiv: 2412.08641
- 3D Test-time Adaptation via Graph Spectral Driven Point Shift | arXiv: 2507.18225
- 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection | arXiv: 2507.23567
- 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding | arXiv: 2412.18450
- 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt | arXiv: 2409.12892
- 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views | arXiv: 2406.04875
- 3DSR: Bridging Diffusion Models and 3D Representations for 3D Consistent Super-Resolution | arXiv: 2508.04090
- 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark | arXiv: 2412.07825
- 4D Gaussian Splatting SLAM | arXiv: 2503.16710
- 4D Visual Pre-training for Robot Learning | arXiv: 2508.17230
- 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding | arXiv: 2503.17827
- 4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads | arXiv: 2510.17664
- 6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting | arXiv: 2412.01543
- 7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting | arXiv: 2503.07946
- A Conditional Probability Framework for Compositional Zero-shot Learning | arXiv: 2507.17377
- A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy LiDAR Point Clouds | arXiv: 2504.09129
- A Framework for Double-Blind Federated Adaptation of Foundation Models | arXiv: 2502.01289
- A Good Teacher Adapts Their Knowledge for Distillation
- A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention | arXiv: 2507.14315
- a hyperdimensional one place signature to represent them all stackable descripto | arXiv: 2412.06153
- A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision | arXiv: 2412.00623
- a linear n-point solver for structure and motion from asynchronous tracks | arXiv: 2507.22733
- A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions | arXiv: 2412.17377
- A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition | arXiv: 2508.00053
- A Real-world Display Inverse Rendering Dataset | arXiv: 2508.14411
- A Recipe for Generating 3D Worlds from a Single Image | arXiv: 2503.16611
- A Simple yet Mighty Hartley Diffusion Versatilist for Generalizable Dense Vision Tasks
- a tiny change a giant leap long-tailed class-incremental learning via geometric
- A Token-level Text Image Foundation Model for Document Understanding (TokenFD/TokenVL) | arXiv: 2503.02304
- a unified framework for industrial cel-animation colorization with temporal-stru
- a unified framework for motion reasoning and generation in human interaction | arXiv: 2410.05628
- a unified interpretation of training-time out-of-distribution detection
- A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets | arXiv: 2507.04699
- a0 an affordance-aware hierarchical model for general robotic manipulation | arXiv: 2504.12636
- A3GS: Arbitrary Artistic Style into Arbitrary 3D Gaussian Splatting
- AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering | arXiv: 2504.12811
- acam-kd adaptive and cooperative attention masking for knowledge distillation | arXiv: 2503.06307
- accelerate 3d object detection models via zero-shot attention key pruning | arXiv: 2503.08101
- accelerating diffusion sampling via exploiting local transition coherence | arXiv: 2503.09675
- ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training | arXiv: 2510.11605
- achieving more with less additive prompt tuning for rehearsal-free class-increme
- acknowledging focus ambiguity in visual questions | arXiv: 2501.02201
- Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning | arXiv: 2509.07879
- AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images
- ad-gs object-aware b-spline gaussian splatting for self-supervised autonomous dr | arXiv: 2507.12137
- adadcp learning an adapter with discrete cosine prior for clear-to-adverse domai
- adadrive self-adaptive slow-fast system for language-grounded autonomous driving | arXiv: 2511.06253
- AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion | arXiv: 2505.24877
- adapt foundational segmentation models with heterogeneous searching space
- adaptive articulated object manipulation on the fly with foundation model reason | arXiv: 2507.18276
- Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts | arXiv: 2508.20488
- adaptive hyper-graph convolution network for skeleton-based human action recogni
- adaptive learning of high-value regions for semi-supervised medical image segmen
- adaptive prompt learning via gaussian outlier synthesis for out-of-distribution
- adaptive routing of text-to-image generation requests between large cloud model
- AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes | arXiv: 2508.13503
- addressing representation collapse in vector quantized models with one linear la | arXiv: 2411.02038
- addressing text embedding leakage in diffusion-based image editing | arXiv: 2412.04715
- adiee automatic dataset creation and scorer for instruction-guided image editing | arXiv: 2507.07317
- advancing text-to-3d generation with linearized lookahead variational score dist | arXiv: 2507.09748
- advancing textual prompt learning with anchored attributes | arXiv: 2412.09442
- advancing visual large language model for multi-granular versatile perception | arXiv: 2507.16213
- AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations? | arXiv: 2412.03002
- adversarial attention perturbations for large object detection transformers | arXiv: 2508.02987
- adversarial data augmentation for single domain generalization via lyapunov expo | arXiv: 2507.04302
- adversarial distribution matching for diffusion distillation towards efficient i | arXiv: 2507.18569
- Adversarial Exploitation of Data Diversity Improves Visual Localization | arXiv: 2412.00138
- adversarial robust memory-based continual learner | arXiv: 2311.17608
- adversarial training for probabilistic robustness
- Aether: Geometric-Aware Unified World Modeling | arXiv: 2503.18945
- AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm | arXiv: 2506.23537
- AGO: Adaptive Grounding for Open World 3D Occupancy Prediction | arXiv: 2504.10117
- ahcptq accurate and hardware-compatible post-training quantization for segment a
- aicomposer any style and content image composition via feature integration | arXiv: 2507.20721
- aid adapting image2video diffusion models for instruction-guided video predictio | arXiv: 2406.06465
- aigi-holmes towards explainable and generalizable ai-generated image detection v
- aim adaptive inference of multi-modal llms via token merging and pruning | arXiv: 2412.03248
- aim amending inherent interpretability via self-supervised masking | arXiv: 2508.11502
- aira activation-informed low-rank adaptation for large models
- aircache activating inter-modal relevancy kv cache compression for efficient lar
- AJAHR: Amputated Joint Aware 3D Human Mesh Recovery | arXiv: 2509.19939
- align your rhythm generating highly aligned dance poses with gating-enhanced rhy | arXiv: 2503.17340
- aligning effective tokens with video anomaly in large language models | arXiv: 2508.06350
- aligning information capacity between vision and language via dense-to-sparse fe
- aligning moments in time using video queries | arXiv: 2508.15439
- alleviating textual reliance in medical language-guided segmentation via prototy | arXiv: 2507.11055
- alltracker efficient dense point tracking at high resolution | arXiv: 2506.07310
- ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions | arXiv: 2411.07725
- always skip attention | arXiv: 2505.01996
- am-adapter appearance matching adapter for exemplar-based semantic image synthes
- Amodal Depth Anything: Amodal Depth Estimation in the Wild | arXiv: 2412.02336
- Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images | arXiv: 2503.13439
- an empirical study of autoregressive pre-training from videos | arXiv: 2501.05453
- An OpenMind for 3D Medical Vision Self-supervised Learning | arXiv: 2412.17041
- analyzing finetuning representation shift for multimodal llms steering | arXiv: 2501.03012
- anchor token matching implicit structure locking for training-free ar image edit | arXiv: 2504.10434
- animalclue recognizing animals by their traces | arXiv: 2507.20240
- AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation | arXiv: 2506.09982
- animegamer infinite anime life simulation with next game state prediction | arXiv: 2504.01014
- annofreeod detecting all classes at low frame rates without human annotations
- anomaly detection of integrated circuits package substrates using the large visi
- anti-tamper protection for unauthorized individual image generation | arXiv: 2508.06325
- any-ssr how recursive least squares works in continual learning of large languag
- any2anytryon leveraging adaptive position embeddings for versatile virtual cloth
- anybimanual transferring unimanual policy for general bimanual manipulation | arXiv: 2412.06779
- AnyI2V: Animating Any Conditional Image with Motion Control | arXiv: 2507.02857
- anyportal zero-shot consistent video background replacement | arXiv: 2509.07472
- AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction | arXiv: 2503.12929
- ar-vrm imitating human motions for visual robot manipulation with analogical rea | arXiv: 2508.07626
- are they the same exploring visual correspondence shortcomings of multimodal llm | arXiv: 2501.04670
- are vlms ready for autonomous driving an empirical study from the reliability da
- argmatch adaptive refinement gathering for efficient dense matching
- argotweak towards self-updating hd maps through structured priors | arXiv: 2509.08764
- arteditor learning customized instructional image editor from few-shot examples
- Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description | arXiv: 2412.01398
- ascent annotation-free self-supervised contrastive embeddings for 3d neuron trac
- asgs single-domain generalizable open-set object detection via adaptive subgraph
- ask and remember a questions-only replay strategy for continual visual question | arXiv: 2502.04469
- astroloc robust space to ground image localizer | arXiv: 2502.07003
- asynchronous event error-minimizing noise for safeguarding event dataset | arXiv: 2507.05728
- ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling | arXiv: 2508.15767
- attention to neural plagiarism diffusion models can plagiarize your copyrighted | arXiv: 2603.00150
- attention to the burstiness in visual prompt tuning | arXiv: 2506.22908
- attention to trajectory trajectory-aware open-vocabulary tracking | arXiv: 2503.08145
- augmenting moment retrieval zero-dependency two-stage learning | arXiv: 2510.19622
- authentic 4d driving simulation with a video generation model
- auto-controlled image perception in mllms via visual perception tokens
- Auto-Regressively Generating Multi-View Consistent Images (MV-AR) | arXiv: 2506.18527
- auto-vocabulary semantic segmentation | arXiv: 2312.04539
- autocompose automatic generation of pose transition descriptions for composed po | arXiv: 2503.22884
- automated model evaluation for object detection via prediction consistency and r | arXiv: 2508.12082
- automated red teaming for text-to-image models through feedback-guided prompt it
- AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting | arXiv: 2502.04981
- autoprompt automated red-teaming of text-to-image models via llm-driven adversar | arXiv: 2510.24034
- avam a universal training-free adaptive visual anchoring embedded into multimoda
- Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars | arXiv: 2502.20220
- b-vllm a vision large language model with balanced spatio-temporal tokens | arXiv: 2412.09919
- babyvlm data-efficient pretraining of vlms inspired by infant learning | arXiv: 2504.09426
- Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction | arXiv: 2504.14516
- backdoor attacks on neural networks via one-bit flip
- backdoor defense via enhanced splitting and trap isolation
- backdoor mitigation by distance-driven detoxification | arXiv: 2411.09585
- backdooring self-supervised contrastive learning by noisy alignment | arXiv: 2508.14015
- background invariance testing according to semantic proximity | arXiv: 2208.09286
- badvideo stealthy backdoor attack against text-to-video generation | arXiv: 2504.16907
- Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction | arXiv: 2411.14384
- balanced image stylization with style matching score | arXiv: 2503.07601
- balancing conservatism and aggressiveness prototype-affinity hybrid network for
- balancing task-invariant interaction and task-specific adaptation for unified im | arXiv: 2504.05164
- BANet: Bilateral Aggregation Network for Mobile Stereo Matching | arXiv: 2503.03259
- basic boosting visual alignment with intrinsic refined embeddings in multimodal | arXiv: 2508.06895
- batclip bimodal online test-time adaptation for clip | arXiv: 2412.02837
- Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation | arXiv: 2412.11170
- Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis | arXiv: 2503.18705
- Benchmarking Egocentric Visual-Inertial SLAM at City Scale | arXiv: 2509.26639
- benchmarking multimodal large language models against image corruptions
- benefit from seen enhancing open-vocabulary object detection by bridging visual
- Beyond Brain Decoding: Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI | arXiv: 2403.06361
- beyond isolated words diffusion brush for handwritten text-line generation | arXiv: 2508.03256
- beyond label semantics language-guided action anatomy for few-shot action recogn | arXiv: 2507.16287
- beyond losses reweighting empowering multi-task learning via the generalization | arXiv: 2211.13723
- beyond low-rank tuning model prior-guided rank allocation for effective transfer | arXiv: 2507.00327
- beyond one shot beyond one perspective cross-view and long-horizon distillation | arXiv: 2507.05260
- beyond pixel uncertainty bounding the ood objects in road scenes
- beyond single images retrieval self-augmented unsupervised camouflaged object de | arXiv: 2510.18437
- beyond the frame generating 360deg panoramic videos from perspective videos | arXiv: 2504.07940
- beziergs dynamic urban scene reconstruction with bezier curve gaussian splatting | arXiv: 2506.22099
- bi-level optimization for self-supervised ai-generated face detection | arXiv: 2507.22824
- bias-resilient weakly supervised semantic segmentation using normalizing flows
- bidirectional likelihood estimation with multi-modal large language models for t | arXiv: 2507.23284
- BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis | arXiv: 2411.08508
- bitrate-controlled diffusion for disentangling motion and content in video | arXiv: 2509.08376
- Blended Point Cloud Diffusion for Localized Text-guided Shape Editing | arXiv: 2507.15399
- Blind Noisy Image Deblurring Using Residual Guidance Strategy
- blind2sound self-supervised image denoising without residual noise | arXiv: 2303.05183
- blinktrack feature tracking over 80 fps via events and images | arXiv: 2409.17981
- blueneg a 35mm negative film dataset for restoring channel-heterogeneous deterio
- bokehdiff neural lens blur with one-step diffusion | arXiv: 2507.18060
- Bolt3D: Generating 3D Scenes in Seconds | arXiv: 2503.14445
- Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration | arXiv: 2411.17240
- boosting mllm reasoning with text-debiased hint-grpo | arXiv: 2503.23905
- Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction | arXiv: 2507.18331
- boosting multimodal learning via disentangled gradient learning | arXiv: 2507.10213
- Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training | arXiv: 2508.03742
- Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data | arXiv: 2406.00093
- bootstrapping grounded chain-of-thought in multimodal llms for data-efficient mo
- borrowing eyes for the blind spot overcoming data scarcity in malicious video de
- boundary probing for input privacy protection when using lmm services
- BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation | arXiv: 2504.07955
- breaking rectangular shackles cross-view object segmentation for fine-grained ob
- breaking the encoder barrier for seamless video-language understanding | arXiv: 2503.18422
- bridging 3d anomaly localization and repair via high-quality continuous geometri
- bridging continuous and discrete tokens for autoregressive visual generation | arXiv: 2503.16430
- bridging diffusion models and 3d representations a 3d consistent super-resolutio | arXiv: 2508.04090
- bridging domain generalization to multimodal domain generalization via unified r | arXiv: 2507.03304
- bridging local inductive bias and long-range dependencies with pixel-mamba for e
- bridging the gap between ideal and real-world evaluation benchmarking ai-generat
- bridging the skeleton-text modality gap diffusion-powered modality alignment for | arXiv: 2411.10745
- bridging the sky and ground towards view-invariant feature learning for aerial-g
- Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation | arXiv: 2503.11652
- BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes | arXiv: 2503.07940
- bvinet unlocking blind video inpainting with zero annotations | arXiv: 2502.01181
- BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting | arXiv: 2506.22099
- c2mil synchronizing semantic and topological causalities in multiple instance le
- C4D: 4D Made from 3D through Dual Correspondences | arXiv: 2510.14960
- ca-i2p channel-adaptive registration network with global optimal selection | arXiv: 2506.21364
- ca2c a prior-knowledge-free approach for robust label noise learning via asymmet
- cad-assistant tool-augmented vllms as generic cad task solvers | arXiv: 2412.13810
- CAD-Recode: Reverse Engineering CAD Code from Point Clouds | arXiv: 2412.14042
- calibrating mllm-as-a-judge via multimodal bayesian prompt ensembles | arXiv: 2509.08777
- can generative geospatial diffusion models excel as discriminative geospatial fo | arXiv: 2503.07890
- Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians | arXiv: 2508.01464
- cao2 rectifying inconsistencies in diffusion-based dataset distillation | arXiv: 2506.22637
- cap evaluation of persuasive and creative image generation | arXiv: 2412.10426
- capellm support-free category-agnostic pose estimation with multimodal large lan | arXiv: 2411.06869
- captionsmiths flexibly controlling language pattern in image captioning | arXiv: 2507.01409
- capture evaluating spatial reasoning in vision language models via occluded obje | arXiv: 2504.15485
- CarGait: Cross-Attention based Re-ranking for Gait Recognition | arXiv: 2503.03501
- carl causality-guided architecture representation learning for an interpretable
- casp improving semi-dense feature matching pipeline leveraging cascaded correspo | arXiv: 2507.17312
- cassic towards content-adaptive state-space models for learned image compression
- category-specific selective feature enhancement for long-tailed multi-label imag
- CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image | arXiv: 2412.12906
- causal disentanglement and cross-modal alignment for enhanced few-shot learning | arXiv: 2508.03102
- causal-entity reflected egocentric traffic accident video synthesis | arXiv: 2506.23263
- cavis context-aware video instance segmentation | arXiv: 2407.03010
- ccl-lgs contrastive codebook learning for 3d language gaussian splatting | arXiv: 2505.20469
- ce-fam concept-based explanation via fusion of activation maps | arXiv: 2509.23849
- Certifiably Optimal Anisotropic Rotation Averaging | arXiv: 2503.07353
- cf3 compact and fast 3d feature fields | arXiv: 2508.05254
- characonsist fine-grained consistent character generation | arXiv: 2507.11533
- CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector | arXiv: 2508.11185
- chartcap mitigating hallucination of dense chart captioning | arXiv: 2508.03164
- chartpoint guiding mllms with grounding reflection for chart reasoning | arXiv: 2512.00305
- chatreid open-ended interactive person retrieval via hierarchical progressive tu
- chimera improving generalist model with domain-specific experts | arXiv: 2412.05983
- chords diffusion sampling accelerator with multi-core hierarchical ode solvers | arXiv: 2507.15260
- ciard cyclic iterative adversarial robustness distillation | arXiv: 2509.12633
- citynav a large-scale dataset for real-world aerial navigation | arXiv: 2406.14240
- cl-splats continual learning of gaussian splatting with local optimization | arXiv: 2506.21117
- class token as proxy optimal transport-assisted proxy learning for weakly superv
- class-wise federated averaging for efficient personalization | arXiv: 2406.07800
- cleanpose category-level object pose estimation via causal learning and knowledg | arXiv: 2502.01312
- client2vec improving federated learning by distribution shifts aware client inde | arXiv: 2405.16233
- clip-adapted region-to-text learning for generative open-vocabulary semantic seg
- clip-gs unifying vision-language representation with 3d gaussian splatting | arXiv: 2412.19142
- clipsym delving into symmetry detection with clip | arXiv: 2508.14197
- closed-loop transfer for weakly-supervised affordance grounding | arXiv: 2510.17384
- clot closed loop optimal transport for unsupervised action segmentation | arXiv: 2507.03539
- cmad correlation-aware and modalities-aware distillation for multimodal sentimen
- cmb-ml a cosmic microwave background dataset for the oldest possible computer vi
- cmt a cascade mar with topology predictor for multimodal conditional cad generat | arXiv: 2504.20830
- cns-bench benchmarking image classifier robustness under continuous nuisance shi | arXiv: 2507.17651
- co-painter fine-grained controllable image stylization via implicit decoupling a
- co2-net a physics-informed spatio-temporal model for global surface co2 reconstr
- CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance | arXiv: 2412.20451
- cobl toward zero-shot ordinal layering without user prompting | arXiv: 2508.08498
- coda-4dgs dynamic gaussian splatting with context and deformation awareness for | arXiv: 2503.06744
- cohd a counting-aware hierarchical decoding framework for generalized referring
- coin confidence score-guided distillation for annotation-free cell segmentation | arXiv: 2503.11439
- colmdriver llm-based negotiation benefits cooperative autonomous driving | arXiv: 2503.08683
- color matching using hypernetwork-based kolmogorov-arnold networks | arXiv: 2503.11781
- colors see colors ignore clothes changing reid with color disentanglement | arXiv: 2507.07230
- comatch dynamic covisibility-aware transformer for bilateral subpixel-level semi
- combatvla an efficient vision-language-action model for combat tasks in 3d actio | arXiv: 2503.09527
- combinative matching for geometric shape assembly | arXiv: 2508.09780
- communication-efficient multi-vehicle collaborative semantic segmentation via sp
- CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images | arXiv: 2503.05332
- compass enhancing spatial understanding in text-to-image diffusion models | arXiv: 2412.13195
- compcap improving multimodal large language models with composite captions | arXiv: 2412.05243
- competitive distillation a simple learning strategy for improving visual classif | arXiv: 2506.23285
- completeme reference-based human image completion | arXiv: 2504.20042
- completing 3d partial assemblies with view-consistent 2d-3d correspondence
- compression of 3d gaussian splatting with optimized feature planes and standard | arXiv: 2501.03399
- compression-aware one-step diffusion model for jpeg artifact removal | arXiv: 2502.09873
- compslider compositional slider for disentangled multiple-attribute image genera | arXiv: 2509.01028
- conditional visual autoregressive modeling for pathological image restoration
- conformalsam unlocking the potential of foundational segmentation models in semi | arXiv: 2507.15803
- confound from all sides distill with resilience multi-objective adversarial path
- consistent time-of-flight depth denoising via graph-informed geometric attention | arXiv: 2506.23542
- consistentcity semantic flow-guided occupancy dit for temporally consistent driv
- constraint-aware feature learning for parametric point cloud | arXiv: 2411.07747
- constructing ophthalmic mllm for positioning-diagnosis collaboration through cli
- conststyle robust domain generalization with unified style transformation | arXiv: 2509.05975
- contact-aware amodal completion for human-object interaction via multi-regional | arXiv: 2508.00427
- contact-aware refinement of human pose pseudo-ground truth via bioimpedance sens | arXiv: 2512.04862
- context guided transformer entropy modeling for video compression | arXiv: 2508.01852
- contextface generating facial expressions from emotional contexts
- continuous-time human motion field from event cameras
- contrags codebook-condensed and trainable gaussian splatting for fast memory-eff
- contrastive flow matching | arXiv: 2506.05350
- Controllable 3D Outdoor Scene Generation via Scene Graphs | arXiv: 2503.07152
- controllable and expressive one-shot video head swapping | arXiv: 2506.16852
- controllable feature whitening for hyperparameter-free bias mitigation | arXiv: 2507.20284
- controllable latent space augmentation for digital pathology | arXiv: 2508.14588
- Controlling Multimodal LLMs via Reward-guided Decoding | arXiv: 2508.11616
- cooperative pseudo labeling for unsupervised federated classification | arXiv: 2510.10100
- cooptrack exploring end-to-end learning for efficient cooperative sequential per | arXiv: 2507.19239
- coordinate-based speed of sound recovery for aberration-corrected photoacoustic | arXiv: 2409.10876
- coralsrt revisiting coral reef semantic segmentation by feature rectification vi
- CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation | arXiv: 2411.10086
- Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild | arXiv: 2508.07759
- Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning | arXiv: 2507.07424
- cosmic continual self-supervised learning for multi-domain medical imaging via c
- cosmo combination of selective memorization for low-cost vision-and-language nav | arXiv: 2503.24065
- costodet-ddpm collaborative training of stochastic and deterministic models impr
- cotmr chain-of-thought multi-scale reasoning for training-free zero-shot compose
- Counting Stacked Objects | arXiv: 2411.19149
- countse soft exemplar open-set object counting
- covtrack continuous open-vocabulary tracking via adaptive multi-cue fusion
- cram large scale video continual learning with bootstrapped compression
- cross-architecture distillation made simple with redundancy suppression | arXiv: 2507.21844
- cross-category subjectivity generalization for style-adaptive sketch re-id
- cross-granularity online optimization with masked compensated information for le
- cross-view isolated sign language recognition via view synthesis and feature dis
- CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy | arXiv: 2506.05864
- csd-var content-style decomposition in visual autoregressive models | arXiv: 2507.13984
- culture3d a large-scale and diverse dataset of cultural landmarks and terrains f
- cumperlay learning cubical multiparameter persistence vectorizations | arXiv: 2510.12795
- cure cultural gaps in the long tail of text-to-image systems | arXiv: 2506.08071
- curve-aware gaussian splatting for 3d parametric curve reconstruction | arXiv: 2506.21401
- cuts3d cutting semantics in 3d for 2d unsupervised instance segmentation | arXiv: 2411.16319
- cvfusion cross-view fusion of 4d radar and camera for 3d object detection | arXiv: 2507.04587
- cvpt cross visual prompt tuning | arXiv: 2408.14961
- cwnet causal wavelet network for low-light image enhancement | arXiv: 2507.10689
- Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences | arXiv: 2506.02095
- cycle-consistent learning for joint layout-to-image generation and object detect
- d-attn decomposed attention for large vision-and-language model
- d2st-adapter disentangled-and-deformable spatio-temporal adapter for few-shot ac
- d3 training-free ai-generated video detection using second-order features | arXiv: 2508.00701
- d3qe learning discrete distribution discrepancy-aware quantization error for aut
- dacon dino for anime paint bucket colorization with any number of reference imag | arXiv: 2509.14685
- dadet safeguarding image conditional diffusion models against adversarial and ba
- dadm dual alignment of domain and modality for face anti-spoofing | arXiv: 2503.00429
- damap distance-aware mapnet for high quality hd map construction | arXiv: 2510.22675
- danceeditor towards iterative editable music-driven dance generation with open-v
- DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning | arXiv: 2510.21635
- dash detection and assessment of systematic hallucinations of vlms | arXiv: 2503.23573
- dataset distillation via the wasserstein metric | arXiv: 2311.18531
- dataset ownership verification for pre-trained masked models | arXiv: 2507.12022
- david data-efficient and accurate vision models from synthetic data | arXiv: 2507.15365
- dc-ar efficient masked autoregressive image generation with deep compression hyb | arXiv: 2507.04947
- dc-controlnet decoupling inter- and intra-element conditions in image generation
- dchm depth-consistent human modeling for multiview detection | arXiv: 2507.14505
- dct-shield a robust frequency domain defense against malicious image editing | arXiv: 2504.17894
- ddb diffusion driven balancing to address spurious correlations | arXiv: 2503.17226
- debiased curriculum adaptation for safe transfer learning in chest x-ray classif
- debiased teacher for day-to-night domain adaptive object detection
- decad decoupling anomalies in latent space for multi-class unsupervised anomaly
- deciphering cross-modal alignment in large vision-language models via modality i
- decoding correlation-induced misalignment in the stable diffusion workflow for t
- decouple and track benchmarking and improving video diffusion transformers for m | arXiv: 2503.17350
- decouple to reconstruct high quality uhd restoration via active feature disentan | arXiv: 2503.12764
- decoupled diffusion sparks adaptive scene generation | arXiv: 2504.10485
- deep adaptive unfolded network via spatial morphology stripping and spectral fil
- deep incomplete multi-view clustering with distribution dual-consistency recover
- deeply supervised flow-based generative models | arXiv: 2503.14494
- deepmesh auto-regressive artist-mesh creation with reinforcement learning | arXiv: 2503.15265
- deepshield fortifying deepfake video detection with local and global forgery ana | arXiv: 2510.25237
- degauss dynamic-static decomposition with gaussian splatting for distractor-free | arXiv: 2503.13176
- degradation-modeled multipath diffusion for tunable metalens photography | arXiv: 2506.22753
- demeter a parametric model of crop plant morphology from the real world | arXiv: 2510.16377
- denoising token prediction in masked autoregressive models
- dense policy bidirectional autoregressive learning of actions | arXiv: 2503.13217
- Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation | arXiv: 2510.09094
- depth anyevent a cross-modal distillation paradigm for event-based monocular dep | arXiv: 2509.15224
- deris decoupling perception and cognition for enhanced referring image segmentat | arXiv: 2507.01738
- derm1m a million-scale vision-language dataset aligned with clinical ontology kn
- describe adapt and combine empowering clip encoders for open-set 3d object retri | arXiv: 2507.21489
- describe dont dictate semantic image editing with natural language intent | arXiv: 2508.20505
- despite exploring contrastive deep skeleton-pointcloud-imu-text embeddings for a | arXiv: 2506.13897
- detect anything 3d in the wild | arXiv: 2504.07958
- devil is in the uniformity exploring diverse learners within transformer for ima | arXiv: 2503.20174
- dexvlg dexterous vision-language-grasp model at scale | arXiv: 2507.02747
- dgtalker disentangled generative latent space learning for audio-driven gaussian
- dh-facevid-1k a large-scale high-quality dataset for face video generation | arXiv: 2410.07151
- dia the adversarial exposure of deterministic inversion in diffusion models | arXiv: 2510.00778
- diagnosing pretrained models for out-of-distribution detection
- dice staleness-centric optimizations for parallel diffusion moe inference | arXiv: 2411.16786
- dictas a framework for class-generalizable few-shot anomaly segmentation via dic | arXiv: 2508.13560
- diffdoctor diagnosing image diffusion models before treating | arXiv: 2501.12382
- diffpci large motion point cloud frame interpolation with diffusion model
- diffsim taming diffusion models for evaluating visual similarity | arXiv: 2412.14580
- difftell a high-quality dataset for describing image manipulation changes
- diffuman4d 4d consistent human view synthesis from sparse-view videos with spati
- diffumatch category-agnostic spectral diffusion priors for robust non-rigid shap | arXiv: 2507.23715
- diffusion curriculum synthetic-to-real data curriculum via image-guided diffusio | arXiv: 2410.13674
- diffusion guided adaptive augmentation for generalization in visual reinforcemen
- diffusion image prior | arXiv: 2503.21410
- diffusion-based 3d hand motion recovery with intuitive physics | arXiv: 2508.01835
- diffusion-based source-biased model for single domain generalized object detecti
- diffvsr revealing an effective recipe for taming robust video super-resolution a
- dimcim a quantitative evaluation framework for default-mode diversity and genera
- Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling | arXiv: 2411.19492
- dirichlet-constrained variational codebook learning for temporally coherent vide
- discontinuity-aware normal integration for generic central camera models | arXiv: 2507.06075
- discopatch taming adversarially-driven batch statistics for improved out-of-dist | arXiv: 2501.08005
- discovering divergent representations between text-to-image models | arXiv: 2509.08940
- discretized gaussian representation for tomographic reconstruction | arXiv: 2411.04844
- disenq disentangling q-former for activity-biometrics | arXiv: 2507.07262
- disentangled world models learning to transfer semantic knowledge from distracti | arXiv: 2503.08751
- disentangling instance and scene contexts for 3d semantic scene completion | arXiv: 2507.08555
- disrupting model merging a parameter-level defense without sacrificing accuracy | arXiv: 2503.07661
- dist-4d disentangled spatiotemporal diffusion with metric depth for 4d driving s | arXiv: 2503.15208
- dista-net dynamic closely-spaced infrared small target unmixing | arXiv: 2505.19148
- distil data-free inversion of suspicious trojan inputs via latent diffusion | arXiv: 2507.22813
- distilling diffusion models to efficient 3d lidar scene completion | arXiv: 2412.03515
- distime distribution-based time representation for video large language models | arXiv: 2505.24329
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy | arXiv: 2503.19757
- ditfastattnv2 head-wise attention compression for multi-modality diffusion trans | arXiv: 2503.22796
- dive taming dino for subject-driven video editing | arXiv: 2412.03347
- diversity-enhanced distribution alignment for dataset distillation
- diving into the fusion of monocular priors for generalized stereo matching | arXiv: 2505.14414
- dlf extreme image compression with dual-generative latent fusion | arXiv: 2503.01428
- dlfr-gen diffusion-based video generation with dynamic latent frame rate
- dm-efs dynamically multiplexed expanded features set form for robust and efficie
- dmesh an efficient differentiable mesh for complex shapes | arXiv: 2412.16776
- dmq dissecting outliers of diffusion models for post-training quantization | arXiv: 2507.12933
- do it yourself learning semantic correspondence from pseudo-labels | arXiv: 2506.05312
- DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding | arXiv: 2508.08589
- dogr towards versatile visual document grounding and referring | arXiv: 2411.17125
- DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization | arXiv: 2412.15689
- domain generalizable portrait style transfer | arXiv: 2507.04243
- donut a decoder-only model for trajectory prediction | arXiv: 2506.06854
- doodle your keypoints sketch-based few-shot keypoint detection | arXiv: 2507.07994
- dposer-x diffusion model as robust 3d whole-body human pose prior | arXiv: 2508.00599
- dram-lhm a quaternion framework for iterative camera pose estimation
- drawing developmental trajectory from cortical surface reconstruction
- dreamactor-m1 holistic expressive and robust human image animation with hybrid g | arXiv: 2504.01724
- dreamdance animating human images by enriching 3d geometry cues from 2d poses | arXiv: 2412.00397
- dreamlayer simultaneous multi-layer generation via diffusion model
- dreamrelation relation-centric video customization | arXiv: 2503.07602
- drivex omni scene modeling for learning generalizable world knowledge in autonom | arXiv: 2505.19239
- driving view synthesis on free-form trajectories with generative prior | arXiv: 2412.01717
- drivinggpt unifying driving world modeling and planning with multi-modal autoreg
- dropletvideo a dataset and approach to explore integral spatio-temporal consiste
- dso aligning 3d generators with simulation feedback for physical soundness | arXiv: 2503.22677
- dual domain control via active learning for remote sensing domain incremental ob
- dual reciprocal learning of language-based human motion understanding and genera
- dual recursive feedback on generation and appearance latents for pose-robust tex | arXiv: 2508.09575
- dual-expert consistency model for efficient and high-quality video generation | arXiv: 2506.03123
- dual-level prototype learning for composite degraded image restoration
- dual-rate dynamic teacher for source-free domain adaptive object detection
- dual-temporal exemplar representation network for video semantic segmentation
- dualreal adaptive joint training for lossless identity-motion fusion in video cu | arXiv: 2505.02192
- duet dual incremental object detection via exemplar-free task arithmetic | arXiv: 2506.21260
- duolora cycle-consistent and rank-disentangled content-style personalization | arXiv: 2504.13206
- dwim towards tool-aware visual reasoning via discrepancy-aware workflow generati | arXiv: 2503.19263
- dygs-slam real-time accurate localization and gaussian reconstruction for dynami
- dynamic dictionary learning for remote sensing image segmentation | arXiv: 2503.06683
- dynamic group detection using vlm-augmented temporal groupness graph | arXiv: 2509.04758
- dynamic multimodal prototype learning in vision-language models | arXiv: 2507.03657
- dynamic point maps a versatile representation for dynamic 3d reconstruction | arXiv: 2503.16318
- dynamic reconstruction of hand-object interaction with distributed force-aware c | arXiv: 2411.09572
- Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection | arXiv: 2507.17436
- dynamic-vlm simple dynamic visual token compression for videollm | arXiv: 2412.09530
- dynamicid zero-shot multi-id image personalization with flexible facial editabil | arXiv: 2503.06505
- dynfacerestore balancing fidelity and quality in diffusion-guided blind face res | arXiv: 2507.13797
- dynimg key frames with visual prompts are good representation for multi-modal vi | arXiv: 2507.15569
- e-nemf event-based neural motion field for novel space-time view synthesis of dy
- e-sam training-free segment every entity model | arXiv: 2503.12094
- ea-kd entropy-based adaptive knowledge distillation | arXiv: 2311.13621
- ea-vit efficient adaptation for elastic vision transformer | arXiv: 2507.19360
- eamamba efficient all-around vision state space model for image restoration | arXiv: 2506.22246
- early timestep zero-shot candidate selection for instruction-guided image editin | arXiv: 2504.13490
- easi3r estimating disentangled motion from dust3r without training | arXiv: 2503.24391
- easy3d a simple yet effective method for 3d interactive segmentation | arXiv: 2504.11024
- ec-flow enabling versatile robotic manipulation from action-unlabeled videos via | arXiv: 2507.06224
- edffdnet towards accurate and efficient unsupervised multi-grid image registrati | arXiv: 2509.07662
- edit efficient diffusion transformers with linear compressed attention | arXiv: 2503.16726
- eedit rethinking the spatial and temporal redundancy for efficient image editing | arXiv: 2503.10270
- effective training data synthesis for improving mllm chart understanding | arXiv: 2508.06492
- efficient adaptation of pre-trained vision transformer underpinned by approximat | arXiv: 2507.13260
- efficient autoregressive shape generation via octree-based adaptive tokenization | arXiv: 2504.02817
- efficient concertormer for image deblurring and beyond | arXiv: 2404.06135
- efficient fine-tuning of large models via nested low-rank adaptation
- efficient input-level backdoor defense on text-to-image synthesis via neuron act | arXiv: 2503.06453
- efficient spiking point mamba for point cloud analysis | arXiv: 2504.14371
- efficient visual place recognition through multimodal semantic knowledge integra
- efficientmt efficient temporal adaptation for motion transfer in text-to-video d | arXiv: 2503.19369
- egoadapt adaptive multisensory distillation and policy learning for efficient eg | arXiv: 2506.21080
- egoagent a joint predictive agent model in egocentric worlds | arXiv: 2502.05857
- egocentric action-aware inertial localization in point clouds with vision-langua | arXiv: 2505.14346
- egom2p egocentric multimodal multitask pretraining | arXiv: 2506.07886
- egoppg heart rate estimation from eye-tracking cameras in egocentric systems to | arXiv: 2502.20879
- embodied image captioning self-supervised learning agents for spatially coherent | arXiv: 2504.08531
- embodied navigation with auxiliary task of action description prediction | arXiv: 2510.21809
- embodied representation alignment with mirror neurons | arXiv: 2509.21136
- embodied videoagent persistent memory from egocentric videos and embodied sensor
- embodiedocc embodied 3d occupancy prediction for vision-based online scene under | arXiv: 2412.04380
- embodiedsplat personalized real-to-sim-to-real navigation with gaussian splats f | arXiv: 2509.17430
- emd explicit motion modeling for high-quality street gaussian splatting | arXiv: 2411.15582
- emoticrafter text-to-emotional-image generation based on valence-arousal model | arXiv: 2501.05710
- emotive event-guided trajectory modeling for 3d motion estimation | arXiv: 2503.11371
- emulating self-attention with convolution for efficient image super-resolution | arXiv: 2503.06671
- end-to-end entity-predicate association reasoning for dynamic scene graph genera
- end-to-end multi-modal diffusion mamba | arXiv: 2510.13253
- engage for all making ordinary image descriptions appealing again
- enhanced event-based dense stereo via cross-sensor knowledge distillation
- enhanced pansharpening via quaternion spatial-spectral interactions
- enhancing adversarial transferability by balancing exploration and exploitation | arXiv: 2511.00411
- enhancing few-shot vision-language classification with large multimodal model fe | arXiv: 2412.00142
- enhancing image restoration transformer via adaptive translation equivariance | arXiv: 2506.18520
- enhancing prompt generation with adaptive refinement for camouflaged object dete
- enhancing reward models for high-quality image generation beyond text-image alig | arXiv: 2507.19002
- enhancing transferability of targeted adversarial examples via inverse target gr
- enhancing transformers through conditioned embedded tokens | arXiv: 2505.12789
- enhancing zero-shot object counting via text-guided local ranking and number-evo
- enrich and detect video temporal grounding with multimodal llms | arXiv: 2510.17023
- ensemble foreground management for unsupervised object discovery | arXiv: 2507.20860
- epipolar consistent attention aggregation network for unsupervised light field d
- epona autoregressive diffusion world model for autonomous driving | arXiv: 2506.24113
- equipping vision foundation model with mixture of experts for out-of-distributio
- erasing more than intended how concept erasure degrades the generation of non-ta | arXiv: 2501.09833
- error recognition in procedural videos using generalized task graph
- escnet edge-semantic collaborative network for camouflaged object detection
- estimating 2d camera motion with hybrid motion basis | arXiv: 2507.22480
- eta efficiency through thinking ahead a dual approach to self-driving with large | arXiv: 2506.07725
- eta energy-based test-time adaptation for depth completion | arXiv: 2508.05989
- etch generalizing body fitting to clothed humans via equivariant tightness | arXiv: 2503.10624
- etva evaluation of text-to-video alignment via fine-grained question generation | arXiv: 2503.16867
- evading data provenance in deep neural networks | arXiv: 2508.01074
- evagaussians event stream assisted gaussian splatting from blurry images | arXiv: 2405.20224
- event-based tiny object detection a benchmark dataset and baseline | arXiv: 2506.23575
- event-based visual vibrometry
- event-boosted deformable 3d gaussians for dynamic scene reconstruction | arXiv: 2411.16180
- event-driven storytelling with multiple lifelike humans in a 3d scene | arXiv: 2507.19232
- event-guided unified framework for low-light video enhancement frame interpolati
- eventups uncalibrated photometric stereo using an event camera
- everything is a video unifying modalities through next-frame prediction | arXiv: 2411.10503
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models | arXiv: 2502.06788
- evidential knowledge distillation
- evolvinggrasp evolutionary grasp generation via efficient preference alignment | arXiv: 2503.14329
- evrt-detr latent space adaptation of image detectors for event-based vision | arXiv: 2412.02890
- evt efficient view transformation for multi-modal 3d object detection | arXiv: 2411.10715
- excap3d expressive 3d scene understanding via object captioning with varying det | arXiv: 2503.17044
- exploiting diffusion prior for task-driven image restoration | arXiv: 2507.22459
- exploiting domain properties in language-driven domain generalization for semant | arXiv: 2512.03508
- exploiting vision language model for training-free 3d point cloud ood detection | arXiv: 2506.22375
- exploring multimodal diffusion transformers for enhanced prompt-based image edit | arXiv: 2508.07519
- exploring probabilistic modeling beyond domain generalization for semantic segme | arXiv: 2507.21367
- exploring view consistency for scene-adaptive low-light light field image enhanc
- exploring weather-aware aggregation and adaptation for semantic segmentation und
- expressive talking human from single-image with imperfect priors
- external knowledge injection for clip-based class-incremental learning | arXiv: 2503.08510
- extrapolated urban view synthesis benchmark | arXiv: 2412.05256
- f-bench rethinking human preference evaluation metrics for benchmarking face gen
- fa forced prompt learning of vision-language models for out-of-distribution dete | arXiv: 2507.04511
- facecraft4d animated 3d facial avatar generation from a single image | arXiv: 2504.15179
- facelift learning generalizable single image 3d face reconstruction from synthet | arXiv: 2412.17812
- factorized learning for temporally grounded video-language models | arXiv: 2512.24097
- failure cases are better learned but boundary says sorry facilitating smooth per | arXiv: 2508.02186
- fair generation without unfair distortions debiasing text-to-image generation wi | arXiv: 2506.13298
- fairgen enhancing fairness in text-to-image diffusion models via self-discoverin
- fakeradar probing forgery outliers to detect unknown deepfake videos | arXiv: 2512.14601
- FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers | arXiv: 2501.16297
- fast image super-resolution via consistency rectified flow
- faster and better 3d splatting via group training | arXiv: 2412.07608
- fastjsma accelerating jacobian-based saliency map attacks through gradient decou
- fastvar linear visual autoregressive modeling via cached token pruning | arXiv: 2503.23367
- fdpt federated discrete prompt tuning for black-box visual-language models
- fe-clip frequency enhanced clip model for zero-shot anomaly detection and segmen
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration | arXiv: 2412.13180
- feature extraction and representation of pre-training point cloud based on diffu
- feature purification matters suppressing outlier propagation for training-free o
- fedagc federated continual learning with asymmetric gradient correction
- feddifrc unlocking the potential of text-to-image diffusion models in heterogene | arXiv: 2507.06482
- federated continual instruction tuning | arXiv: 2503.12897
- federated prompt-tuning with heterogeneous and incomplete multimodal client data | arXiv: 2602.07081
- federated representation angle learning
- fedmenf privacy-preserving federated meta-learning for neural fields | arXiv: 2508.06301
- fedmvp federated multimodal visual prompt tuning for vision-language models | arXiv: 2504.20860
- fedpall prototype-based adversarial and collaborative learning for federated lea
- fedvla federated vision-language-action learning with dual gating mixture-of-exp | arXiv: 2508.02190
- few-shot pattern detection via template matching and regression | arXiv: 2508.17636
- fewer denoising steps or cheaper per-step inference towards compute-optimal diff | arXiv: 2508.06160
- ficgen frequency-inspired contextual disentanglement for layout-driven degraded | arXiv: 2509.01107
- fiffdepth feed-forward transformation of diffusion-based generators for detailed | arXiv: 2412.00671
- find a scapegoat poisoning membership inference attack and defense to federated | arXiv: 2507.00423
- find any part in 3d | arXiv: 2411.13550
- find few-shot anomaly inspection with normal-only multi-modal data
- fine-grained evaluation of large vision-language models in autonomous driving | arXiv: 2503.21505
- fine-grained spatiotemporal grounding on egocentric videos | arXiv: 2508.00518
- finemotion a dataset and benchmark with both spatial and temporal annotation for
- finmmr make financial numerical reasoning more multimodal comprehensive and chal | arXiv: 2508.04625
- fish2mesh transformer 3d human mesh recovery from egocentric vision | arXiv: 2503.06089
- fix-clip dual-branch hierarchical contrastive learning via synthetic captions fo | arXiv: 2507.10095
- fixtalk taming identity leakage for high-quality talking head generation in extr | arXiv: 2507.01390
- flashdepth real-time streaming video depth estimation at 2k resolution | arXiv: 2504.07093
- flexgen flexible multi-view generation from text and image inputs | arXiv: 2410.10745
- float generative motion latent flow matching for audio-driven talking portrait | arXiv: 2412.01064
- FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation | arXiv: 2504.10487
- flow to the mode mode-seeking diffusion autoencoders for state-of-the-art image | arXiv: 2503.11056
- flow-mil constructing highly-expressive latent feature space for whole slide ima
- flow4agent long-form video understanding via motion prior from optical flow | arXiv: 2510.05836
- flowdps flow-driven posterior sampling for inverse problems | arXiv: 2503.08136
- flowedit inversion-free text-based editing using pre-trained flow models | arXiv: 2412.08629
- flowseek optical flow made easier with depth foundation models and motion bases | arXiv: 2509.05297
- flowstyler artistic video stylization via transformation fields transports
- flowtok flowing seamlessly across text and image tokens | arXiv: 2503.10772
- focal plane visual feature generation and matching on a pixel processor array
- folder accelerating multi-modal large language models with enhanced performance | arXiv: 2501.02430
- fontanimate high quality few-shot font generation via animating font transfer pr
- forcennet foreground-centric network for document image rectification | arXiv: 2507.19804
- forensic-moe exploring comprehensive synthetic image detection traces with mixtu
- foresight in motion reinforcing trajectory prediction with reward heuristics | arXiv: 2507.12083
- forgelens data-efficient forgery focus for generalizable forgery image detection | arXiv: 2408.13697
- forgetting through transforming enabling federated unlearning via class-aware re | arXiv: 2410.06848
- foundir unleashing million-scale training data to advance foundation models for | arXiv: 2412.01427
- fpem face prior enhanced facial attractiveness prediction for live videos with f
- free-form motion control controlling the 6d poses of camera and objects in video | arXiv: 2501.01425
- free-merging fourier transform for efficient model merging | arXiv: 2411.16815
- free-moref instantly multiplexing context perception capabilities of video-mllms | arXiv: 2508.02134
- free-running vs synchronous single-photon lidar for high-flux 3d imaging | arXiv: 2507.09386
- free4d tuning-free 4d scene generation with spatial-temporal consistency | arXiv: 2503.20785
- freecus free lunch subject-driven customization in diffusion transformers | arXiv: 2507.15249
- freedance towards harmonic free-number group dance generation via a unified fram
- freedna endowing domain adaptation of diffusion-based dense prediction with trai
- freeflux understanding and exploiting layer-specific roles in rope-based mmdit f
- freemorph tuning-free generalized image morphing with diffusion model | arXiv: 2507.01953
- freescale unleashing the resolution of diffusion models via tuning-free scale fu | arXiv: 2412.09626
- frequency-aligned knowledge distillation for lightweight spatiotemporal forecast | arXiv: 2507.02939
- frequency-guided diffusion for training-free text-driven image translation
- frequency-semantic enhanced variational autoencoder for zero-shot skeleton-based | arXiv: 2506.22179
- fret feature redundancy elimination for test time adaptation | arXiv: 2505.10641
- from easy to hard progressive active learning framework for infrared small targe | arXiv: 2412.11154
- from easy to hard the mir benchmark for progressive interleaved multi-image reas | arXiv: 2509.17040
- from gallery to wrist realistic 3d bracelet insertion in videos | arXiv: 2507.20331
- from gaze to movement predicting visual attention for autonomous driving human-m
- from holistic to localized local enhanced adapters for efficient visual instruct | arXiv: 2411.12787
- from image to video an empirical study of diffusion representations | arXiv: 2502.07001
- from imitation to innovation the emergence of ais unique artistic styles and the
- from linearity to non-linearity how masked autoencoders capture spatial correlat | arXiv: 2508.15404
- from objects to events unlocking complex visual understanding in object detector
- from one to more contextual part latents for 3d generation | arXiv: 2507.08772
- from reflection to perfection scaling inference-time optimization for text-to-im
- from reusing to forecasting accelerating diffusion models with taylorseers | arXiv: 2503.06923
- from sharp to blur unsupervised domain adaptation for 2d human pose estimation u
- from trial to triumph advancing long video understanding via visual context samp
- fross faster-than-real-time online 3d semantic scene graph generation from rgb-d | arXiv: 2507.19993
- fuse before transfer knowledge fusion for heterogeneous distillation | arXiv: 2410.12342
- fusion meets diverse conditions a high-diversity benchmark and baseline for uav-
- fusionphys a flexible framework for fusing complementary sensing modalities in r
- future-aware interaction network for motion forecasting | arXiv: 2503.06565
- fuxi-rtm a physics-guided prediction framework with radiative transfer modeling | arXiv: 2503.19940
- fuzzy contrastive decoding to alleviate object hallucination in large vision-lan
- fvgen accelerating novel-view synthesis with adversarial video diffusion distill | arXiv: 2508.06392
- fw-merging scaling model merging with frank-wolfe optimization | arXiv: 2503.12649
- g2d boosting multimodal learning with gradient-guided distillation | arXiv: 2506.21514
- g2pdiffusion cross-species genotype-to-phenotype prediction via evolutionary dif | arXiv: 2502.04684
- g2sf geometry-guided score fusion for multimodal industrial anomaly detection | arXiv: 2503.10091
- gain-mlp improving hdr gain map encoding via a lightweight mlp | arXiv: 2503.11883
- gait-x exploring x modality for generalized gait recognition
- gamefactory creating new games with generative interactive videos | arXiv: 2501.08325
- gap gaussianize any point clouds with text guidance | arXiv: 2508.05631
- gas generative avatar synthesis from a single image | arXiv: 2502.06957
- gaussian splatting with discretized sdf for relightable assets | arXiv: 2507.15629
- gaussian variation field diffusion for high-fidelity video-to-4d synthesis | arXiv: 2507.23785
- gaussian-based world model gaussian priors for voxel-based occupancy prediction
- gaussianflowocc sparse and weakly supervised occupancy estimation using gaussian | arXiv: 2502.17288
- gaussianocc fully self-supervised and efficient 3d occupancy estimation with gau
- gaussianproperty integrating physical properties to 3d gaussians with lmms | arXiv: 2412.11258
- gaussianupdate continual 3d gaussian splatting update for changing environments | arXiv: 2508.08867
- gaussrender learning 3d occupancy with gaussian rendering | arXiv: 2502.05040
- gauupdate new object insertion in 3d gaussian fields with consistent global illu
- gaze-language alignment for zero-shot prediction of visual search targets from h
- gazegaussian high-fidelity gaze redirection with 3d gaussian splatting | arXiv: 2411.12981
- gdkvm echocardiography video segmentation via spatiotemporal key-value memory wi | arXiv: 2512.10252
- gecko gigapixel vision-concept contrastive pretraining in histopathology | arXiv: 2504.01009
- gemex a large-scale groundable and explainable medical vqa benchmark for chest x | arXiv: 2411.16778
- geminio language-guided gradient inversion attacks in federated learning | arXiv: 2411.14937
- gendop auto-regressive camera trajectory generation as a director of photography | arXiv: 2504.07083
- general compression framework for efficient transformer object tracking | arXiv: 2409.17564
- generalizable non-line-of-sight imaging with learnable physical priors | arXiv: 2409.14011
- generalizable object re-identification via visual in-context prompting | arXiv: 2508.21222
- generalized deep multi-view clustering via causal learning with partially aligne
- generalized tensor-based parameter-efficient fine-tuning via lie group transform | arXiv: 2504.00851
- generate refine and encode leveraging synthesized novel samples for on-the-fly f | arXiv: 2507.04051
- generate transduct adapt iterative transduction with vlms | arXiv: 2501.06031
- generating fast and slow scalable parallel video generation with video interface | arXiv: 2503.17539
- generating multi-image synthetic data for text-to-image customization | arXiv: 2502.01720
- generating physically stable and buildable brick structures from text | arXiv: 2505.05469
- generative active learning for long-tail trajectory prediction via controllable | arXiv: 2507.22615
- generative modeling of shape-dependent self-contact human poses | arXiv: 2509.23393
- generative zoo | arXiv: 2412.08101
- generic event boundary detection via denoising diffusion | arXiv: 2508.12084
- genflow3d generative scene flow estimation and prediction on point cloud sequenc
- genflowrl shaping rewards with generative object-centric flow in visual reinforc | arXiv: 2508.11049
- genhancer imperfect generative models are secretly strong vision-centric enhance | arXiv: 2503.19480
- genhaze pioneering controllable one-step realistic haze generation for real-worl
- genieblue integrating both linguistic and multimodal capabilities for large lang
- genm3 generative pretrained multi-path motion model for text conditional human m | arXiv: 2503.14919
- genmo a generalist model for human motion | arXiv: 2505.01425
- geo4d leveraging video generators for geometric 4d scene reconstruction | arXiv: 2504.07961
- geobench-vlm benchmarking vision-language models for geospatial tasks | arXiv: 2411.19325
- geodistill geometry-guided self-distillation for weakly supervised cross-view lo | arXiv: 2507.10935
- geoexplorer active geo-localization with curiosity-driven exploration | arXiv: 2508.00152
- geoformer geometry point encoder for 3d object detection with graph-based transf
- geometry distributions | arXiv: 2411.16076
- geometrycrafter consistent geometry estimation for open-world videos with diffus
- geoprog3d compositional visual reasoning for city-scale 3d language fields | arXiv: 2506.23352
- geosplatting towards geometry guided gaussian splatting for physically-based inv | arXiv: 2410.24204
- gesturehydra semantic co-speech gesture synthesis via hybrid modality diffusion | arXiv: 2507.22731
- gfpack attention-driven gradient fields for optimizing 2d irregular packing
- ggtalker talking head systhesis with generalizable gaussian priors and identity- | arXiv: 2506.21513
- global and local entailment learning for natural world imagery | arXiv: 2506.21476
- global motion corresponder for 3d point-based scene interpolation under large mo | arXiv: 2508.20136
- global-aware monocular semantic scene completion with state space models | arXiv: 2503.06569
- gm-moe low-light enhancement with gated-mechanism mixture-of-experts | arXiv: 2503.07417
- gmmamba group masking mamba for whole slide image classification
- golden noise for diffusion models a learning framework | arXiv: 2411.09502
- grab a challenging graph analysis benchmark for large multimodal models | arXiv: 2408.11817
- gradient decomposition and alignment for incremental object detection
- gradient extrapolation for debiased representation learning | arXiv: 2503.13236
- gradient short-circuit efficient out-of-distribution detection via feature inter | arXiv: 2507.01417
- gradient-reweighted adversarial camouflage for physical object detection evasion
- granular concept circuits toward a fine-grained circuit discovery for concept re | arXiv: 2508.01728
- graph domain adaptation with dual-branch encoder and two-level alignment for who
- greg geometry-aware region refinement for sign language video generation
- grouped speculative decoding for autoregressive image generation | arXiv: 2508.07747
- growing a twig to accelerate large vision-language models | arXiv: 2503.14075
- gs-id illumination decomposition on gaussian splatting via adaptive light aggreg
- gs-livm real-time photo-realistic lidar-inertial-visual mapping with gaussian sp | arXiv: 2410.17084
- gs-occ3d scaling vision-only occupancy reconstruction with gaussian splatting | arXiv: 2507.19451
- gsot3d towards generic 3d single object tracking in the wild | arXiv: 2412.02129
- gsv3d gaussian splatting-based geometric distillation with stable video diffusio
- gt-mean loss a simple yet effective solution for brightness mismatch in low-ligh
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training | arXiv: 2503.08525
- guava generalizable upper body 3d gaussian avatar | arXiv: 2505.03351
- guiding diffusion-based articulated object generation by partial point cloud ali | arXiv: 2508.00558
- guiding noisy label conditional diffusion models with score-based discriminator | arXiv: 2508.19581
- guiodyssey a comprehensive dataset for cross-app gui navigation on mobile device | arXiv: 2406.08451
- hades human avatar with dynamic explicit hair strands
- haircup hair compositional universal prior for 3d gaussian avatars | arXiv: 2507.19481
- hallucinatory image tokens a training-free eazy approach to detecting and mitiga
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation | arXiv: 2503.21979
- harmonyseg tubular structure segmentation with deep-shallow feature fusion and g
- harnessing massive satellite imagery with efficient masked image modeling | arXiv: 2406.11933
- harnessing vision foundation models for high-performance training-free open voca
- hcceposebf predicting front back surfaces to construct ultra-dense 2d-3d corresp | arXiv: 2510.10177
- hdr image generation via gain map decomposed diffusion
- head2body body pose generation from multi-sensory head-mounted inputs
- heavy labels out dataset distillation with label space lightening | arXiv: 2408.08201
- hermes a unified self-driving world model for simultaneous 3d scene understandin | arXiv: 2501.14729
- hermes temporal-coherent long-form understanding with episodes and semantics | arXiv: 2408.17443
- heuristic-induced multimodal risk distribution jailbreak attack for multimodal l | arXiv: 2412.05934
- hfd-teacher high-frequency depth distillation from depth foundation models for e
- hi-gaussian hierarchical gaussians under normalized spherical projection for sin
- hi3dgen high-fidelity 3d geometry generation from images via normal bridging | arXiv: 2503.22236
- hierarchical 3d scene graphs construction outdoors
- hierarchical event memory for accurate and low-latency online video temporal gro | arXiv: 2508.04546
- hierarchical material recognition from local appearance | arXiv: 2505.22911
- hierarchical variational test-time prompt generation for zero-shot generalizatio
- hierarchical visual prompt learning for continual video instance segmentation | arXiv: 2508.08612
- hierarchical-aware orthogonal disentanglement framework for fine-grained skeleto
- hiero understanding the hierarchy of human behavior enhances reasoning on egocen | arXiv: 2505.12911
- high-resolution spatiotemporal modeling with global-local state space models for | arXiv: 2510.11017
- himtok learning hierarchical mask tokens for image segmentation with large multi | arXiv: 2503.13026
- hineus high-fidelity neural surface mitigating low-texture and reflective ambigu | arXiv: 2506.23854
- hints of prompt enhancing visual representation for multimodal llms in autonomou | arXiv: 2411.13076
- hipandas hyperspectral image joint denoising and super-resolution by image fusio
- his-gpt towards 3d human-in-scene multimodal understanding | arXiv: 2503.12955
- holistic tokenizer for autoregressive image generation | arXiv: 2507.02358
- holistic unlearning benchmark a multi-faceted evaluation for text-to-image diffu | arXiv: 2410.05664
- hort monocular hand-held objects reconstruction with transformers | arXiv: 2503.21313
- housetour a virtual real estate aigent | arXiv: 2510.18054
- how do multimodal large language models handle complex multimodal reasoning plac
- how do optical flow and textual prompts collaborate to assist in audio-visual se | arXiv: 2601.08133
- how far are ai-generated videos from simulating the 3d visual world a learned 3d | arXiv: 2406.19568
- how would it sound material-controlled multimodal acoustic profile generation fo | arXiv: 2508.02905
- hpsv3 towards wide-spectrum human preference score | arXiv: 2508.03789
- hq-clip leveraging large vision-language models to create high-quality image-tex
- hrscene how far are vlms from effective high-resolution image understanding | arXiv: 2504.18406
- humanolat a large-scale dataset for full-body human relighting and novel-view sy | arXiv: 2508.09137
- humans as checkerboards calibrating camera motion scale for world-coordinate hum
- humoto a 4d dataset of mocap human object interactions | arXiv: 2504.10414
- hust high-fidelity unbiased skin tone estimation via texture quantization
- hvpunet hybrid-voxel point-cloud upsampling network
- hybrid layout control for diffusion transformer fewer annotations superior aesth
- hybrid-tower fine-grained pseudo-query interaction and generation for text-to-vi
- hybrid-tta continual test-time adaptation via dynamic domain shift detection | arXiv: 2409.08566
- hypdae hyperbolic diffusion autoencoders for hierarchical few-shot image generat | arXiv: 2411.17784
- hyper-depth hypergraph-based multi-scale representation fusion for monocular dep
- hypidecoder hybrid pixel decoder for efficient segmentation and detection
- hytip hybrid temporal information propagation for masked conditional residual vi | arXiv: 2508.02072
- i am big you are little i am right you are wrong | arXiv: 2507.23509
- i2-world intra-inter tokenization for efficient dynamic 4d scene forecasting | arXiv: 2507.09144
- iap invisible adversarial patch attack through perceptibility-aware localization | arXiv: 2507.06856
- IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves | arXiv: 2411.00827
- identity preserving 3d head stylization with multiview score distillation | arXiv: 2411.13536
- identity-aware language gaussian splatting for open-vocabulary 3d semantic segme
- idf iterative dynamic filtering networks for generalizable image denoising | arXiv: 2508.19649
- idface face template protection for efficient and secure identification | arXiv: 2507.12050
- igl-nav incremental 3d gaussian localization for image-goal navigation | arXiv: 2508.00823
- illume illuminating your llms to see draw and self-enhance | arXiv: 2412.06673
- im-lut interpolation mixing look-up tables for image super-resolution | arXiv: 2507.09923
- im360 large-scale indoor mapping with 360 cameras | arXiv: 2502.12545
- image as an imu estimating camera motion from a single motion-blurred image | arXiv: 2503.17358
- image intrinsic scale assessment bridging the gap between quality and resolution | arXiv: 2502.06476
- image-guided shape-from-template using mesh inextensibility constraints | arXiv: 2507.22699
- imagegem in-the-wild generative image interaction dataset for generative model p | arXiv: 2510.18433
- imanip skill-incremental learning for robotic manipulation | arXiv: 2503.07087
- imbalance in balance online concept balancing in generation models | arXiv: 2507.13345
- imhead a large-scale implicit morphable model for localized head modeling | arXiv: 2510.10793
- implicit counterfactual learning for audio-visual segmentation | arXiv: 2507.20740
- improved noise schedule for diffusion training | arXiv: 2407.03297
- improving large vision and language models by learning from a panel of peers | arXiv: 2509.01610
- incremental few-shot semantic segmentation via multi-level switchable visual pro
- inference-time diffusion model distillation | arXiv: 2412.08871
- infgen a resolution-agnostic paradigm for scalable image synthesis | arXiv: 2509.10441
- infinidreamer arbitrarily long human motion generation via segment score distill | arXiv: 2411.18303
- information density principle for mllm benchmarks | arXiv: 2503.10079
- information-bottleneck driven binary neural network for change detection | arXiv: 2507.03504
- inpaint4drag repurposing inpainting models for drag-based image editing via bidi | arXiv: 2509.04582
- insideout integrated rgb-radiative gaussian splatting for comprehensive 3d objec | arXiv: 2510.17864
- instadrive instance-aware driving world models for realistic and consistent vide
- instance-level video depth in groups beyond occlusions
- instascene towards complete 3d instance decomposition and reconstruction from cl | arXiv: 2507.08416
- instinct instance-level interaction architecture for query-based collaborative p | arXiv: 2509.23700
- instruction-grounded visual projectors for continual learning of generative visi | arXiv: 2508.00260
- instruction-oriented preference alignment for enhancing multi-modal comprehensio | arXiv: 2503.20309
- integrating biological knowledge for robust microscopy image profiling on de nov | arXiv: 2507.10737
- integrating task-specific and universal adapters for pre-trained model-based cla | arXiv: 2508.08165
- integrating visual interpretation and linguistic reasoning for geometric problem
- inter2former dynamic hybrid attention for efficient high-precision interactive s | arXiv: 2507.09612
- interactavatar modeling hand-face interaction in photorealistic avatars with def
- interaction-merged motion planning effectively leveraging diverse motion dataset | arXiv: 2507.04790
- intergsedit interactive 3d gaussian splatting editing with 3d geometry-consisten
- interpretable point cloud classification using multiple instance learning
- interpretable zero-shot learning with locally-aligned vision-language model | arXiv: 2506.23822
- intersyn interleaved learning for dynamic motion synthesis in the wild | arXiv: 2508.10297
- intervening in black box concept bottleneck model for enhancing human neural net | arXiv: 2506.22803
- intra-modal and cross-modal synchronization for audio-visual deepfake detection
- intra-view and inter-view correlation guided multi-view novel class discovery | arXiv: 2507.12029
- introstyle training-free introspective style attribution using diffusion feature | arXiv: 2412.14432
- invisible watermarks visible gains steering machine unlearning with bi-level wat | arXiv: 2508.10065
- irgpt understanding real-world infrared image with bi-cross-modal curriculum on | arXiv: 2507.14449
- iris breaking gui complexity with adaptive focus and self-refining | arXiv: 2412.10342
- is less more exploring token condensation as training-free test-time adaptation | arXiv: 2410.14729
- is meta-learning out rethinking unsupervised few-shot classification with limite | arXiv: 2509.13185
- jailbreaking multimodal large language models via shuffle inconsistency | arXiv: 2501.04931
- jigsaw imagining complete shape priors for object reassembly | arXiv: 2410.11816
- joint asymmetric loss for learning with noisy labels | arXiv: 2507.17692
- joint diffusion models in continual learning | arXiv: 2411.08224
- joint self-supervised video alignment and action segmentation | arXiv: 2503.16832
- jointdit enhancing rgb-depth joint modeling with diffusion transformers | arXiv: 2505.00482
- jpeg processing neural operator for backward-compatible coding | arXiv: 2507.23521
- kaputt a large-scale dataset for visual defect detection | arXiv: 2510.05903
- kda knowledge diffusion alignment with enhanced context for video temporal groun
- keep your friends close and your enemies farther distance-aware voxel-wise contr
- keyframe-oriented vision token pruning enhancing efficiency of large vision lang
- kh symmetry understanding of 3d shapes via chirality disentanglement | arXiv: 2508.05505
- kinmo kinematic-aware human motion understanding and generation | arXiv: 2411.15472
- know no better a data-driven approach for enhancing negation awareness in clip | arXiv: 2501.10913
- know your attention maps class-specific token masking for weakly supervised sema | arXiv: 2507.06848
- knowledge distillation for learned image compression
- knowledge distillation with refined logits | arXiv: 2408.07703
- knowledge-guided part segmentation
- la-motr end-to-end multi-object tracking by learnable association
- laconic a 3d layout adapter for controllable image creation | arXiv: 2507.03257
- lacoot layer collapse through optimal transport | arXiv: 2406.08933
- langbridge interpreting image as a combination of language embeddings | arXiv: 2503.19404
- langtraj diffusion model and dataset for language-conditioned trajectory simulat | arXiv: 2504.11521
- language decoupling with fine-grained knowledge guidance for referring multi-obj
- language driven occupancy prediction | arXiv: 2411.16072
- larender training-free occlusion control in image generation via latent renderin | arXiv: 2508.07647
- large multi-modal models can interpret features in large multi-modal models | arXiv: 2411.14982
- large scene generation with cube-absorb discrete diffusion
- large-scale pre-training for grounded video caption generation | arXiv: 2503.10781
- lark low-rank updates after knowledge localization for few-shot class-incrementa
- latent diffusion models with masked autoencoders | arXiv: 2507.09984
- latent expression generation for referring image segmentation and grounding | arXiv: 2508.05123
- latent swap joint diffusion for 2d long-form latent generation | arXiv: 2502.05130
- latino-pro latent consistency inverse solver with prompt optimization | arXiv: 2503.12615
- latte collaborative test-time adaptation of vision-language models in federated | arXiv: 2507.21494
- lawdis language-window-based controllable dichotomous image segmentation | arXiv: 2508.01152
- lay-your-scene natural scene layout generation with diffusion transformers | arXiv: 2505.04718
- lay2story extending diffusion transformers for layout-togglable story generation | arXiv: 2508.08949
- layeranimate layer-level control for animation | arXiv: 2501.08295
- layerd decomposing raster graphic designs into layers | arXiv: 2509.25134
- layerlock non-collapsing representation learning with progressive freezing | arXiv: 2509.10156
- layertracer cognitive-aligned layered svg synthesis via diffusion transformer | arXiv: 2502.01105
- lazymar accelerating masked autoregressive models via feature caching | arXiv: 2503.12450
- ld-rps zero-shot unified image restoration via latent diffusion recurrent poster | arXiv: 2507.00790
- ldip long distance information propagation for video super-resolution
- leanvae an ultra-efficient reconstruction vae for video diffusion models | arXiv: 2503.14325
- leaps and bounds an improved point cloud winding number formulation for fast nor
- learn2synth learning optimal data synthesis using hypergradients for brain image | arXiv: 2411.16719
- learnable feature patches and vectors for boosting low-light image enhancement w
- learnable fractional reaction-diffusion dynamics for under-display tof imaging a | arXiv: 2511.01704
- learnable logit adjustment for imbalanced semi-supervised learning under class d
- learnable retrieval enhanced visual-text alignment and fusion for radiology repo
- learned image compression with hierarchical progressive context modeling | arXiv: 2507.19125
- learning 3d object spatial relationships from pre-trained 2d diffusion models | arXiv: 2503.19914
- learning 3d scene analogies with neural contextual scene maps | arXiv: 2503.15897
- learning 4d embodied world models | arXiv: 2504.20995
- learning a unified template for gait recognition
- learning deblurring texture prior from unpaired data with diffusion model | arXiv: 2507.13599
- learning few-step diffusion models by trajectory distribution matching | arXiv: 2503.06674
- learning hierarchical line buffer for image processing
- learning implicit features with flow-infused transformations for realistic virtu
- learning interpretable queries for explainable image classification with informa | arXiv: 2312.11548
- learning neural scene representation from itof imaging
- learning normal flow directly from events
- learning on the go a meta-learning object navigation model
- learning pixel-adaptive multi-layer perceptrons for real-time image enhancement | arXiv: 2507.12135
- learning precise affordances from egocentric videos for robotic manipulation | arXiv: 2408.10123
- learning robust image watermarking with lossless cover recovery
- learning robust stereo matching in the wild with selective mixture-of-experts | arXiv: 2507.04631
- learning separable fine-grained representation via dendrogram construction from
- learning to generalize without bias for open-vocabulary action recognition | arXiv: 2502.20158
- learning to see in the extremely dark | arXiv: 2506.21132
- learning to see inside opaque liquid containers using speckle vibrometry | arXiv: 2507.20757
- learning visual hierarchies in hyperbolic space for image retrieval | arXiv: 2411.17490
- learning visual proxy for compositional zero-shot learning | arXiv: 2501.13859
- legion learning to ground and explain for synthetic image detection | arXiv: 2503.15264
- lego-maker a semantic-driven algorithm for text-to-3d generation
- legrad an explainability method for vision transformers via feature formation se | arXiv: 2404.03214
- less is more empowering gui agent with context-aware simplification | arXiv: 2507.03730
- less is more improving motion diffusion models with sparse keyframes | arXiv: 2503.13859
- less-to-more generalization unlocking more controllability by in-context generat | arXiv: 2504.02160
- leveraging 2d priors and sdf guidance for urban scene rendering | arXiv: 2510.13381
- leveraging bev paradigm for ground-to-aerial image synthesis | arXiv: 2408.01812
- leveraging panoptic scene graph for evaluating fine-grained text-to-image genera
- leveraging spatial invariance to boost adversarial transferability
- lga-net learning local and global affinities for sparse scribble based image col
- lhm large animatable human reconstruction model for single image to 3d in second
- liberated-gs 3d gaussian splatting independent from sfm point clouds
- lift latent implicit functions for task- and data-agnostic encoding | arXiv: 2503.15420
- lifting the structural morphing for wide-angle images rectification unified cont
- lightcity an urban dataset for outdoor inverse rendering and reconstruction unde
- lightsout diffusion-based outpainting for enhanced lens flare removal | arXiv: 2510.15868
- lightweight and fast real-time image enhancement via decomposition of the spatia | arXiv: 2508.16121
- lightweight gradient-aware upscaling of 3d gaussian splatting images | arXiv: 2503.14171
- linr-pcgc lossless implicit neural representations for point cloud geometry comp | arXiv: 2507.15686
- lion-lora rethinking lora fusion to unify controllable spatial and temporal gene
- lira reasoning reconstruction via multimodal large language models
- lit delving into a simple linear diffusion transformer for image generation | arXiv: 2501.12976
- llava-3d a simple yet effective pathway to empowering lmms with 3d capabilities | arXiv: 2409.18125
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | arXiv: 2411.10440
- llava-kd a framework of distilling multimodal large language models | arXiv: 2410.16236
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | arXiv: 2403.15388
- llm thought divergence and convergence for dialogue-based image generation contr
- llm-assisted entropy-based adaptive distillation for unsupervised fine-grained v
- lmm-det make large multimodal models excel in object detection | arXiv: 2507.18300
- local dense logit relations for enhanced knowledge distillation | arXiv: 2507.15911
- localdygs multi-view global dynamic scene modeling via adaptive local implicit f | arXiv: 2507.02363
- LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models | arXiv: 2504.14032
- long context tuning for video generation | arXiv: 2503.10589
- long-context state-space video world models | arXiv: 2505.20171
- long-term traffic simulation with interleaved autoregressive motion and scenario | arXiv: 2506.17213
- long3r long sequence streaming 3d reconstruction | arXiv: 2507.18255
- longsplat robust unposed 3d gaussian splatting for casual long videos | arXiv: 2508.14041
- looking in the mirror a faithful counterfactual explanation method for interpret | arXiv: 2509.16822
- lookout real-world humanoid egocentric navigation | arXiv: 2508.14466
- lora-fair federated lora fine-tuning with aggregation and initialization refinem | arXiv: 2411.14961
- loraverse a submodular framework to retrieve diverse adapters for diffusion mode | arXiv: 2510.15022
- loss functions for predictor-based neural architecture search | arXiv: 2506.05869
- low-light image enhancement using event-based illumination estimation | arXiv: 2504.09379
- lusd localized update score distillation for text-guided image editing | arXiv: 2503.11054
- lvface progressive cluster optimization for large vision models in face recognit | arXiv: 2501.13420
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | arXiv: 2412.09501
- m-net mri brain tumor sequential segmentation network via mesh-cast | arXiv: 2507.20582
- m2eit multi-domain mixture of experts for robust neural inertial tracking
- m2sformer multi-spectral and multi-scale attention with edge-aware difficulty gu | arXiv: 2506.20922
- ma-cir a multimodal arithmetic benchmark for composed image retrieval
- maestro task-relevant optimization via adaptive feature enhancement and suppress | arXiv: 2509.17462
- magic insert style-aware drag-and-drop | arXiv: 2407.02489
- magiccity geometry-aware 3d city generation from satellite imagery with multi-vi
- magicdrive-v2 high-resolution long video generation for autonomous driving with | arXiv: 2411.13807
- magichoi leveraging 3d priors for accurate hand-object reconstruction from short
- magicid hybrid preference optimization for id-consistent and dynamic-preserved v | arXiv: 2503.12689
- magicmirror id-preserved video generation in video diffusion transformers | arXiv: 2501.03931
- mags reconstructing and simulating dynamic 3d objects with mesh-adsorbed gaussia
- magshield towards better robustness in sparse inertial motion capture under magn | arXiv: 2506.22907
- make me happier evoking emotions through image diffusion models | arXiv: 2403.08255
- make your training flexible towards deployment-efficient video models | arXiv: 2503.14237
- mambaml exploring state space models for multi-label image classification
- mamtiff-cad multi-scale latent diffusion with mamba for complex parametric seque | arXiv: 2511.17647
- manual-pa learning 3d part assembly from instruction diagrams | arXiv: 2411.18011
- maskcontrol spatio-temporal control for masked motion synthesis | arXiv: 2410.10780
- maskhand generative masked modeling for robust hand mesh reconstruction in the w | arXiv: 2412.13393
- masksam auto-prompt sam with mask classification for volumetric medical image se
- mastering collaborative multi-modal data selection a focus on informativeness un | arXiv: 2412.06293
- matchdiffusion training-free generation of match-cuts | arXiv: 2411.18677
- mate images are all you need for material transfer via diffusion transformer
- materialmvp illumination-invariant material generation via multi-view pbr diffus | arXiv: 2503.10289
- matvlm hybrid mamba-transformer for efficient vision-language modeling | arXiv: 2503.13440
- mavflow preserving paralinguistic elements with conditional flow matching for ze | arXiv: 2503.11026
- mavias mitigate any visual bias | arXiv: 2412.06632
- mbti masked blending transformers with implicit positional encoding for frame-ra
- mc-bench a benchmark for multi-context visual grounding in the era of mllms | arXiv: 2410.12332
- mcam multimodal causal analysis model for ego-vehicle-level driving video unders | arXiv: 2507.06072
- mcid multi-aspect copyright infringement detection for generated images
- mdd a dataset for text-and-music conditioned duet dance generation | arXiv: 2508.16911
- mdp-omni parameter-free multimodal depth prior-based sampling for omnidirectiona
- mdp3 a training-free approach for list-wise frame selection in video-llms | arXiv: 2501.02885
- measurexpert automatic anthropometric measurement extraction from two unregister
- measuring the impact of rotation equivariance on aerial object detection | arXiv: 2507.09896
- mega memory-efficient 4d gaussian splatting for dynamic scenes | arXiv: 2410.13613
- meh a multi-style dataset and toolkit for advancing egyptian hieroglyph recognit
- membership inference attacks with false discovery rate control | arXiv: 2508.07066
- memdistill distilling lidar knowledge into memory for camera-only 3d object dete
- memfof high-resolution training for memory-efficient multi-frame optical flow es | arXiv: 2506.23151
- memory-efficient 4-bit preconditioned stochastic optimization | arXiv: 2412.10663
- memory-efficient generative models via product quantization
- memorytalker personalized speech-driven 3d facial animation via audio-guided sty | arXiv: 2507.20562
- meshanything v2 artist-created mesh generation with adjacent mesh tokenization | arXiv: 2408.02555
- meshllm empowering large language models to progressively understand and generat
- meshmamba state space models for articulated 3d mesh generation and reconstructi | arXiv: 2507.15212
- meshpad interactive sketch-conditioned artist-reminiscent mesh generation and ed | arXiv: 2503.01425
- met2net a decoupled two-stage spatio-temporal forecasting model for complex mete
- meta-learning dynamic center distance hard sample mining for learning with noisy
- meta-unlearning on diffusion models preventing relearning unlearned concepts | arXiv: 2410.12777
- metamorph multimodal understanding and generation via instruction tuning | arXiv: 2412.14164
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv: 2412.14164
- meteor multi-encoder collaborative token pruning for efficient vision language m | arXiv: 2507.20842
- metric convolutions a unifying theory to adaptive image convolutions | arXiv: 2406.05400
- mgsfm multi-camera geometry driven global structure-from-motion | arXiv: 2507.03306
- mh-lvc multi-hypothesis temporal prediction for learned conditional residual vid
- mikudance animating character art with mixed motion dynamics | arXiv: 2411.08656
- mincd-pnp learning 2d-3d correspondences with approximate blind pnp | arXiv: 2507.15257
- mind the cost of scaffold benign clients may even become accomplices of backdoor | arXiv: 2411.16167
- mind the gap aligning vision foundation models to image feature matching | arXiv: 2507.10318
- minerva evaluating complex video reasoning | arXiv: 2505.00681
- miore var-miore benchmarks to push the boundaries of restoration | arXiv: 2509.06803
- missrag addressing the missing modality challenge in multimodal large language m
- mistsense versatile online detection of procedural and execution mistakes
- mitigating catastrophic overfitting in fast adversarial training via label infor
- mitigating object hallucinations via sentence-level early intervention | arXiv: 2507.12455
- mixa-q revisiting activation sparsity for vision transformers from a mixed-preci | arXiv: 2507.19131
- mixant observation-dependent memory propagation for stochastic dense action anti | arXiv: 2509.11394
- mixed signals a diverse point cloud dataset for heterogeneous lidar v2x collabor | arXiv: 2502.14156
- mixri mixing features of reference images for novel object pose estimation | arXiv: 2601.06883
- mixture-of-scores robust image-text data valuation via three lines of code
- mm-ifengine towards multimodal instruction following | arXiv: 2504.07957
- mm-spatial exploring 3d spatial understanding in multimodal llms | arXiv: 2503.13111
- mmaif multi-task and multi-degradation all-in-one for image fusion with language | arXiv: 2503.14944
- MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning | arXiv: 2507.21924
- mmone representing multiple modalities in one scene | arXiv: 2507.11129
- mmreason an open-ended multi-modal multi-step reasoning benchmark for mllms towa
- mobileie an extremely lightweight and effective convnet for real-time image enha | arXiv: 2507.01838
- mobileviclip an efficient video-text model for mobile devices | arXiv: 2508.07312
- modaltune fine-tuning slide-level foundation models with multi-modal information
- moerl when mixture-of-experts meet reinforcement learning for adverse weather im
- mofrr mixture of diffusion models for face retouching restoration | arXiv: 2507.19770
- moga 3d generative avatar prior for monocular gaussian avatar reconstruction | arXiv: 2507.23597
- molparser end-to-end visual recognition of molecule structures in the wild | arXiv: 2411.11098
- moma-kitchen a 100k benchmark for affordance-grounded last-mile navigation in mo
- moment quantization for video temporal grounding | arXiv: 2504.02286
- momentum-gs momentum gaussian self-distillation for high-quality large scene rec | arXiv: 2412.04887
- monocular facial appearance capture in the wild | arXiv: 2412.12765
- monocular semantic scene completion via masked recurrent networks | arXiv: 2507.17661
- monomobility zero-shot 3d mobility analysis from monocular videos | arXiv: 2505.11868
- monosowa scalable monocular 3d object detector without human annotations | arXiv: 2501.09481
- monovln bridging the observation gap between monocular and panoramic vision and
- monster a unified model for motion scene text retrieval | arXiv: 2510.03200
- morphogen efficient unconditional generation of long-range projection neuronal m
- mosaic generating consistent privacy-preserving scenes from multiple depth views
- mosaicdiff training-free structural pruning for diffusion model acceleration ref | arXiv: 2510.11962
- mosic optimal-transport motion trajectory for dense self-supervised learning | arXiv: 2506.08694
- motal unsupervised 3d object detection by modality and task-specific knowledge t
- motion-2-to-3 leveraging 2d motion data for 3d motion generations
- motionagent fine-grained controllable video generation via motion field agent | arXiv: 2502.03207
- motionctrl a real-time controllable vision-language-motion model
- motiondiff training-free zero-shot interactive motion editing via flow-assisted | arXiv: 2503.17695
- motionfollower editing video motion via score-guided diffusion | arXiv: 2405.20325
- motionshot adaptive motion transfer across arbitrary objects for text-to-video g | arXiv: 2507.16310
- motionstreamer streaming motion generation via diffusion-based autoregressive mo | arXiv: 2503.15451
- moto latent motion token as the bridging language for learning robot manipulatio | arXiv: 2412.04445
- move motion-guided few-shot video object segmentation | arXiv: 2507.22061
- mp-hsir a multi-prompt framework for universal hyperspectral image restoration | arXiv: 2503.09131
- mr-fiqa face image quality assessment with multi-reference representations from
- mrgen segmentation data engine for underrepresented mri modalities | arXiv: 2412.04106
- ms3d high-quality 3d generation via multi-scale representation modeling
- msa2 multi-task framework with structure-aware and style-adaptive character repr
- msq memory-efficient bit sparsification quantization | arXiv: 2507.22349
- mug pseudo labeling augmented audio-visual mamba network for audio-visual video | arXiv: 2507.01384
- mugs multi-baseline generalizable gaussian splatting reconstruction | arXiv: 2508.04297
- multi-cache enhanced prototype learning for test-time generalization of vision-l | arXiv: 2508.01225
- multi-identity human image animation with structural video diffusion | arXiv: 2504.04126
- multi-modal few-shot temporal action segmentation
- multi-modal multi-platform person re-identification benchmark and method | arXiv: 2503.17096
- multi-modal multi-task unified embedding model m3t-uem a task-adaptive represent
- multi-modal segment anything model for camouflaged scene segmentation
- multi-object sketch animation by scene decomposition and motion planning | arXiv: 2503.19351
- multi-scenario overlapping text segmentation with depth awareness
- multi-turn consistent image editing | arXiv: 2505.04320
- multi-view 3d point tracking | arXiv: 2508.21060
- multi-view gaze target estimation | arXiv: 2508.05857
- multimodal action conditioned video simulation
- multimodal latent diffusion model for complex sewing pattern generation | arXiv: 2412.14453
- multimodal llms as customized reward models for text-to-image generation | arXiv: 2507.21391
- multiverse a multi-turn conversation benchmark for evaluating large vision and l | arXiv: 2510.16641
- multiverseg scalable interactive segmentation of biomedical imaging datasets wit | arXiv: 2412.15058
- munba machine unlearning via nash bargaining | arXiv: 2411.15537
- muse-vl modeling unified vlm through semantic discrete encoding | arXiv: 2411.17762
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv: 2411.17762
- music-aligned holistic 3d dance generation via hierarchical motion modeling | arXiv: 2507.14915
- mv-adapter multi-view consistent image generation made easy | arXiv: 2412.03632
- mvgbench a comprehensive benchmark for multi-view generation models | arXiv: 2507.00006
- nappure adversarial purification for robust image classification under non-addit | arXiv: 2510.14025
- natra noise-agnostic framework for trajectory prediction with noisy observations
- nautilus locality-aware autoencoder for scalable mesh generation | arXiv: 2501.14317
- navmorph a self-evolving world model for vision-and-language navigation in conti | arXiv: 2506.23468
- navq learning a q-model for foresighted vision-and-language navigation | arXiv: 2510.16457
- negrefine refining negative label-based zero-shot ood detection | arXiv: 2507.09795
- netracer a topology-aware iterative tracing approach for tubular structure extra
- neuframeq neural frame fields for scalable and generalizable anisotropic quadran
- neural architecture search driven by locally guided diffusion for personalized f
- neural compression for 3d geometry sets | arXiv: 2405.15034
- neural inverse rendering for high-accuracy 3d measurement of moving objects with
- neural multi-view self-calibrated photometric stereo without photometric stereo | arXiv: 2507.23162
- neural solver of dichromatic reflection model for specular highlight removal
- neuraleaf neural parametric leaf models with shape and deformation disentangleme | arXiv: 2507.12714
- neuromanifold-regularized kans for shape-fair feature representations
- neurons emulating the human visual cortex improves fidelity and interpretability | arXiv: 2503.11167
- ngd neural gradient based deformation for monocular garment reconstruction | arXiv: 2508.17712
- no more sibling rivalry debiasing human-object interaction detection | arXiv: 2509.00760
- no pose at all self-supervised pose-free 3d gaussian splatting from sparse views | arXiv: 2508.01171
- noise2score3d tweedies approach for unsupervised point cloud denoising | arXiv: 2503.09283
- noisecontroller towards consistent multi-view video generation via noise decompo
- normalcrafter learning temporally consistent normals from video diffusion priors | arXiv: 2504.11427
- not all degradations are equal a targeted feature denoising framework for genera
- not all frame features are equal video-to-4d generation via decoupling dynamic-s | arXiv: 2502.08377
- not only vision evolve visual speech recognition via peripheral information
- nuiscene exploring efficient generation of unbounded outdoor scenes | arXiv: 2503.16375
- nullswap proactive identity cloaking against deepfake face swapping | arXiv: 2503.18678
- o-mama learning object mask matching between egocentric and exocentric views | arXiv: 2506.06026
- oasis one image is all you need for multimodal instruction data synthesis | arXiv: 2503.08741
- object-level correlation for few-shot segmentation | arXiv: 2509.07917
- objectrelator enabling cross-view object relation understanding across ego-centr
- occlugaussian occlusion-aware gaussian splatting for large scene reconstruction | arXiv: 2503.16177
- occupancy learning with spatiotemporal memory | arXiv: 2508.04705
- ock unsupervised dynamic video prediction with object-centric kinematics | arXiv: 2404.18423
- ocr hinders rag evaluating the cascading impact of ocr on retrieval-augmented ge | arXiv: 2412.02592
- od-rase ontology-driven risk assessment and safety enhancement for autonomous dr | arXiv: 2603.05936
- odp-bench benchmarking out-of-distribution performance prediction | arXiv: 2510.27263
- omegance a single parameter for various granularities in diffusion-based synthes | arXiv: 2411.17769
- ominicontrol minimal and universal control for diffusion transformer | arXiv: 2411.15098
- omni-dc highly robust depth completion with multiresolution depth integration | arXiv: 2411.19278
- omni-scene perception-oriented point cloud geometry enhancement for coordinate q
- omnidiff a comprehensive benchmark for fine-grained image difference captioning | arXiv: 2503.11093
- omnihuman-1 rethinking the scaling-up of one-stage conditioned human animation m | arXiv: 2502.01061
- omnipaint mastering object-oriented editing via disentangled insertion-removal i | arXiv: 2503.08677
- omnisam omnidirectional segment anything model for uda in panoramic semantic seg | arXiv: 2503.07098
- omnivton training-free universal virtual try-on | arXiv: 2507.15037
- on large multimodal models as open-world image classifiers | arXiv: 2503.21851
- on the complexity-faithfulness trade-off of gradient-based explanations | arXiv: 2508.10490
- on the generalization of representation uncertainty in earth observation | arXiv: 2503.07082
- on the provable importance of gradients for autonomous language-assisted image c
- on the recovery of cameras from fundamental matrices
- on the robustness tradeoff in fine-tuning | arXiv: 2503.14836
- one look is enough seamless patchwise refinement for zero-shot monocular depth e | arXiv: 2503.22351
- one perturbation is enough on generating universal adversarial perturbations aga | arXiv: 2406.05491
- one polyp identifies all one-shot polyp segmentation with sam via cascaded prior
- one-shot knowledge transfer for scalable person re-identification | arXiv: 2511.06016
- onegt one-shot geometry-texture neural rendering for head avatars
- online dense point tracking with streaming memory | arXiv: 2503.06471
- online generic event boundary detection | arXiv: 2510.06855
- online language splatting | arXiv: 2503.09447
- online reasoning video segmentation with just-in-time digital twins | arXiv: 2503.21056
- ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models | arXiv: 2507.00898
- open-vocabulary octree-graph for 3d scene understanding | arXiv: 2411.16253
- open-world skill discovery from unsegmented demonstration videos | arXiv: 2503.10684
- openanimals revisiting person re-identification for animals towards better gener | arXiv: 2410.00204
- openrsd towards open-prompts for object detection in remote sensing images | arXiv: 2503.06146
- openvision a fully-open cost-effective family of advanced vision encoders for mu | arXiv: 2505.04601
- ophclip hierarchical retrieval-augmented learning for ophthalmic surgical video-
- optical model-driven sharpness mapping for autofocus in small depth-of-field and
- optimal transport for brain-image alignment unveiling redundancy and synergy in
- oraclefusion assisting the decipherment of oracle bone script with structurally | arXiv: 2506.21101
- orderchain towards general instruct-tuning for stimulating the ordinal understan | arXiv: 2504.04801
- orion a holistic end-to-end autonomous driving framework by vision-language inst
- ouroboros single-step diffusion models for cycle-consistent forward and inverse | arXiv: 2508.14461
- ouromamba a data-free quantization framework for vision mamba | arXiv: 2503.10959
- outdoor monocular slam with global scale-consistent 3d gaussian pointmaps | arXiv: 2507.03737
- outlier-aware post-training quantization for image super-resolution | arXiv: 2511.00682
- ov-scan semantically consistent alignment for novel object discovery in open-voc
- ovg-hq online video grounding with hybrid-modal queries | arXiv: 2508.11903
- p-avas can physics-integrated audio-visual modeling boost neural acoustic synthe
- pacgdc label-efficient generalizable depth completion with projection ambiguity | arXiv: 2507.07374
- pan-crafter learning modality-consistent alignment for pan-sharpening | arXiv: 2505.23367
- panollama generating endless and coherent panoramas with next-token-prediction l | arXiv: 2411.15867
- panst3r multi-view consistent panoptic segmentation | arXiv: 2506.21348
- partfield learning 3d feature fields for part segmentation and beyond | arXiv: 2504.11451
- partial forward blocking a novel data pruning paradigm for lossless training acc | arXiv: 2506.23674
- PASDF: Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation | arXiv: 2505.24431
- pasg a closed-loop framework for automated geometric primitive extraction and se | arXiv: 2508.05976
- passing the driving knowledge test | arXiv: 2508.21824
- pasta part-aware sketch-to-3d shape generation with text-aligned prior | arXiv: 2503.12834
- patchscaler an efficient patch-independent diffusion model for image super-resol | arXiv: 2405.17158
- pathfinder a multi-modal multi-agent system for medical diagnostic decision-maki
- pbcat patch-based composite adversarial training against physically realizable a | arXiv: 2506.23581
- pcr-gs colmap-free 3d gaussian splatting via pose co-regularizations | arXiv: 2507.13891
- penalizing boundary activation for object completeness in diffusion models | arXiv: 2509.16968
- perception-as-control fine-grained controllable image animation with 3d-aware mo
- personacraft personalized and controllable full-body multi-human scene generatio
- personalvideo high id-fidelity video customization without dynamic and semantic | arXiv: 2411.17048
- perspective-aware reasoning in vision-language models via mental imagery simulat | arXiv: 2504.17207
- perspective-aware teaching adapting knowledge for heterogeneous distillation | arXiv: 2501.08885
- perspose 3d human pose estimation with perspective encoding and perspective rota | arXiv: 2508.17239
- ph-gan physics-inspired gan for generating sar images under limited data | arXiv: 2503.02242
- phatnet a physics-guided haze transfer network for domain-adaptive real-world im | arXiv: 2507.14826
- phd personalized 3d human body fitting with point diffusion | arXiv: 2508.21257
- photolithography overlay map generation with implicit knowledge distillation dif
- physical degradation model-guided interferometric hyperspectral reconstruction w
- physics context builders a modular framework for physical reasoning in vision-la | arXiv: 2412.08619
- physsplat efficient physics simulation for 3d scenes via mllm-guided gaussian sp | arXiv: 2411.12789
- phystwin physics-informed reconstruction and simulation of deformable objects fr
- pi-gps enhancing geometry problem solving by unleashing the power of diagrammati | arXiv: 2503.05543
- pinco position-induced consistent adapter for diffusion transformer in foregroun
- pino person-interaction noise optimization for long-duration and customizable mo | arXiv: 2507.19292
- pixelstitch structure-preserving pixel-wise bidirectional warps for unsupervised
- pla prompt learning attack against text-to-image generative models | arXiv: 2508.03696
- placeit3d language-guided object placement in real 3d scenes | arXiv: 2505.05288
- plan proactive low-rank allocation for continual learning | arXiv: 2510.21188
- planar affine rectification from local change of scale and orientation
- planeras learning planar primitives for 3d plane recovery
- plangen towards unified layout planning and image generation in auto-regressive
- plmp - point-line minimal problems for projective sfm | arXiv: 2503.04351
- polaranything diffusion-based polarimetric image synthesis | arXiv: 2507.17268
- polarimetric neural field via unified complex-valued wave representation
- poseanchor robust root position estimation for 3d human pose estimation
- posesyn synthesizing diverse 3d pose data from in-the-wild 2d data | arXiv: 2503.13025
- possloss a reliable and sensitive facial landmark detection loss function
- pre-mamba a 4d state space model for ultra-high-frequent event camera deraining | arXiv: 2505.05307
- predict-optimize-distill a self-improving cycle for 4d object understanding | arXiv: 2504.17441
- pretrained reversible generation as unsupervised visual representation learning | arXiv: 2412.01787
- primhoi compositional human-object interaction via reusable primitives
- principal components enable a new language of images | arXiv: 2503.08685
- prior-aware dynamic temporal modeling framework for sequential 3d hand pose esti
- prior-flow enhancing primitive panoramic optical flow with orthogonal view | arXiv: 2506.23897
- prior2former - evidential modeling of mask transformers for assumption-free open
- priormotion generative class-agnostic motion prediction with raster-vector motio
- privacy-centric deep motion retargeting for anonymization of skeleton-based moti
- pro-vpt distribution-adaptive visual prompt tuning via prompt relocation | arXiv: 2503.06901
- proactive scene decomposition and reconstruction | arXiv: 2510.16272
- probabilistic inertial poser probip uncertainty-aware human motion modeling from
- probres probabilistic jump diffusion for open-world egocentric activity recognit | arXiv: 2504.03948
- processing and acquisition traces in visual encoders what does clip know about y | arXiv: 2508.10637
- progait a multi-purpose video dataset and benchmark for transfemoral prosthesis | arXiv: 2507.10223
- progressive artwork outpainting via latent diffusion models
- progressive test time energy adaptation for medical image segmentation | arXiv: 2503.16616
- progressor a perceptually guided reward estimator with self-supervised online re | arXiv: 2411.17764
- prompt guidance and human proximal perception for hot prediction with regional j | arXiv: 2507.01630
- prompt-a-video prompt your video diffusion model via preference-aligned llm | arXiv: 2412.15156
- prompt-driven transferable adversarial attack on person re-identification with a
- promptdresser improving the quality and controllability of virtual try-on via ge
- propvg end-to-end proposal-driven visual grounding with multi-granularity discri | arXiv: 2509.04833
- prototype guided backdoor defense via activation space manipulation
- prototype-based contrastive learning with stage-wise progressive augmentation fo
- proxy-bridged game transformer for interactive extreme motion prediction
- pruning all-rounder rethinking and improving inference efficiency for large visi
- pseudo-sd pseudo controlled stable diffusion for semi-supervised and cross-domai
- pseudomaptrainer learning online mapping without hd maps | arXiv: 2508.18788
- purge-gate backpropagation-free test-time adaptation for point clouds classifica
- pvchat personalized video chat with one-shot learning | arXiv: 2503.17069
- pvmamba parallelizing vision mamba via dynamic state aggregation
- q-frame query-aware frame selection and multi-resolution adaptation for video-ll | arXiv: 2506.22139
- qk-edit revisiting attention-based injection in mm-dit for image and video editi
- quadratic gaussian splatting high quality surface reconstruction with second-ord
- quantcache adaptive importance-guided quantization with hierarchical latent and
- quantifying and narrowing the unknown interactive text-to-video retrieval via un | arXiv: 2507.15504
- r-livit a lidar-visual-thermal dataset enabling vulnerable road user focused roa
- r1-onevision advancing generalized multimodal reasoning through cross-modal form | arXiv: 2503.10615
- r1-vl learning to reason with multimodal large language models via step-wise gro | arXiv: 2503.12937
- ra-busseg relation-aware semi-supervised breast ultrasound image segmentation vi
- radarsplat radar gaussian splatting for high-fidelity data synthesis and 3d reco
- radgpt constructing 3d image-text tumor datasets | arXiv: 2501.04678
- radiant foam real-time differentiable ray tracing | arXiv: 2502.01157
- ragnet large-scale reasoning-based affordance segmentation benchmark towards gen | arXiv: 2507.23734
- rainbowprompt diversity-enhanced prompt-evolving for continual learning | arXiv: 2507.22553
- raloc enhancing outdoor lidar localization via rotation awareness
- randomized autoregressive visual generation | arXiv: 2411.00776
- rapverse coherent vocals and whole-body motion generation from text | arXiv: 2405.20336
- rareclip rarity-aware online zero-shot industrial anomaly detection
- raygaussx accelerating gaussian-based ray marching for real-time and high-qualit
- rayletdf raylet distance fields for generalizable 3d surface reconstruction from | arXiv: 2508.09830
- raypose ray bundling diffusion for template views in unseen 6d object pose estim | arXiv: 2510.18521
- rayzer a self-supervised large view synthesis model | arXiv: 2505.00702
- real3d towards scaling large reconstruction models with real images
- realcam-i2v real-world image-to-video generation with interactive complex camera | arXiv: 2502.10059
- reangle-a-video 4d video generation as video-to-video translation | arXiv: 2503.09151
- reasonvqa a multi-hop reasoning benchmark with structural knowledge for visual q | arXiv: 2507.16403
- recammaster camera-controlled generative rendering from a single video | arXiv: 2503.11647
- recondreamer harmonizing generative and reconstructive models for driving scene | arXiv: 2503.18438
- recot reflective self-correction training for mitigating confirmation bias in la
- recover biological structure from sparse-view diffraction images with neural vol | arXiv: 2510.16391
- recovering parametric scenes from very few time-of-flight pixels | arXiv: 2509.16132
- rectifying magnitude neglect in linear attention | arXiv: 2507.00698
- reducing unimodal bias in multi-modal semantic segmentation with multi-scale fun
- reducio generating 1k video within 16 seconds using extremely compressed motion | arXiv: 2411.13552
- refedit a benchmark and method for improving instruction-based image editing mod
- refer to any segmentation mask group with vision-language prompts | arXiv: 2506.05342
- referdino referring video object segmentation with visual grounding foundations | arXiv: 2501.14607
- reference-based super-resolution via image-based retrieval-augmented generation
- refereverything towards segmenting everything we can speak of in videos | arXiv: 2410.23287
- referring expression comprehension for small objects | arXiv: 2510.03701
- reflex text-guided editing of real images in rectified flow via mid-step feature | arXiv: 2507.01496
- regen learning compact video embedding with re-generative decoder | arXiv: 2503.08665
- reggs unposed sparse views gaussian splatting with 3dgs registration | arXiv: 2507.08136
- region-aware anchoring mechanism for efficient referring visual grounding
- region-based cluster discrimination for visual representation learning | arXiv: 2507.20025
- region-level data attribution for text-to-image generative models
- registration beyond points general affine subspace alignment via geodesic distan
- reinforcement learning-guided data selection via redundancy assessment | arXiv: 2506.21037
- relative illumination fields learning medium and light independent underwater sc | arXiv: 2504.10024
- removing out-of-focus reflective flares via color alignment
- remp-ad retrieval-enhanced multi-modal prompt fusion for few-shot industrial vis
- rep-mtl unleashing the power of representation-level task saliency for multi-tas | arXiv: 2507.21049
- repa-e unlocking vae for end-to-end tuning of latent diffusion transformers | arXiv: 2504.10483
- REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers | arXiv: 2504.10483
- reparo compositional 3d assets generation with differentiable 3d layout alignmen | arXiv: 2405.18525
- reposed efficient relative pose estimation with known depth information | arXiv: 2501.07742
- representation shift unifying token compression with flashattention | arXiv: 2508.00367
- representing 3d shapes with 64 latent vectors for 3d diffusion models | arXiv: 2503.08737
- repurposing 2d diffusion models with gaussian atlas for 3d generation | arXiv: 2503.15877
- rescue crowd evacuation simulation via controlling sdm-united characters | arXiv: 2507.20117
- resgs residual densification of 3d gaussian for efficient detail recovery | arXiv: 2412.07494
- residualvit for efficient temporally dense video encoding | arXiv: 2509.13255
- resolving token-space gradient conflicts token space manipulation for transforme | arXiv: 2507.07485
- resonance learning to predict social-aware pedestrian trajectories as co-vibrati | arXiv: 2412.02447
- resq a novel framework to implement residual neural networks on analog rydberg a | arXiv: 2506.21537
- rethink sparse signals for pose-guided text-to-image generation | arXiv: 2506.20983
- rethinking cross-modal interaction in multimodal diffusion transformers | arXiv: 2506.07986
- rethinking detecting salient and camouflaged objects in unconstrained scenes | arXiv: 2412.10943
- rethinking dpo-style diffusion aligning frameworks
- rethinking few shot clip benchmarks a critical analysis in the inductive setting | arXiv: 2507.20834
- rethinking key-frame-based micro-expression recognition a robust and accurate fr
- rethinking layered graphic design generation with a top-down approach | arXiv: 2507.05601
- rethinking multi-modal object detection from the perspective of mono-modality fe
- rethinking the embodied gap in vision-and-language navigation a holistic study o | arXiv: 2507.13019
- rethinking the upsampling process in light field super-resolution with spatial-e
- retinexmcnet a memory controller dominated network for low-light video enhanceme
- revelio interpreting and leveraging semantic information in diffusion models | arXiv: 2411.16725
- revisiting adversarial patch defenses on object detectors unified evaluation lar | arXiv: 2508.00649
- revisiting image fusion for multi-illuminant white-balance correction | arXiv: 2503.14774
- revisiting point cloud completion are we ready for the real-world | arXiv: 2411.17580
- rhythmgaussian repurposing generalizable gaussian model for remote physiological
- ri3d few-shot gaussian splatting with repair and inpainting diffusion priors | arXiv: 2503.10860
- riocc efficient cross-modal fusion transformer with collaborative feature refine
- rmultiplex200k toward reliable multimodal process supervision for visual languag
- roadwork a dataset and benchmark for learning to recognize observe analyze and d | arXiv: 2406.07661
- robava a large-scale dataset and baseline towards video based robotic arm action
- robofactory exploring embodied agent collaboration with compositional constraint | arXiv: 2503.16408
- robopearls editable video simulation for robot manipulation | arXiv: 2506.22756
- robotron-nav a unified framework for embodied navigation integrating perception
- robotron-mani all-in-one multimodal large model for robotic manipulation | arXiv: 2412.07215
- robotron-sim improving real-world driving via simulated hard-case | arXiv: 2508.04642
- robridge a hierarchical architecture bridging cognition and execution for genera
- robust 3d object detection using probabilistic point clouds from single-photon l | arXiv: 2508.00169
- robust 3d-masked part-level editing in 3d gaussian splatting with regularized sc
- robust adverse weather removal via spectral-based spatial grouping | arXiv: 2507.22498
- robust and efficient 3d gaussian splatting for urban scene reconstruction | arXiv: 2507.23006
- robust dataset condensation using supervised contrastive learning
- robust machine unlearning for quantized neural networks via adaptive gradient re
- robust multi-view learning via representation fusion of sample-level attention a
- robustereo robust zero-shot stereo matching under adverse weather | arXiv: 2507.01653
- robustsplat decoupling densification and dynamics for transient-free 3dgs | arXiv: 2506.02751
- roco-sim enhancing roadside collaborative perception through foreground simulati | arXiv: 2503.10410
- ross3d reconstructive visual instruction tuning with 3d-awareness | arXiv: 2504.01901
- rs-vheat heat conduction guided efficient remote sensing foundation model | arXiv: 2411.17984
- rtmap real-time recursive mapping with change detection and localization | arXiv: 2507.00980
- s2m2 scalable stereo matching model for reliable depth estimation
- s3e self-supervised state estimation for radar-inertial system | arXiv: 2509.25984
- s3r-gs streamlining the pipeline for large-scale street scene reconstruction | arXiv: 2503.08217
- s4m boosting semi-supervised instance segmentation with sam
- sa-lut spatial adaptive 4d look-up table for photorealistic style transfer | arXiv: 2506.13465
- sa-occ satellite-assisted 3d occupancy prediction in real world | arXiv: 2503.16399
- sac-gnc sample consensus for adaptive graduated non-convexity
- safeguarding vision-language models mitigating vulnerabilities to gaussian noise | arXiv: 2504.01308
- saft shape and appearance of fabrics from template via differentiable physical s
- saliency-aware quantized imitation learning for efficient robotic control | arXiv: 2505.15304
- salvaging the overlooked leveraging class-aware contrastive learning for multi-c
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree | arXiv: 2410.16268
- sam4d segment anything in camera and lidar streams | arXiv: 2506.21547
- samo a lightweight sharpness-aware approach for multi-task optimization with joi | arXiv: 2507.07883
- sample semantic alignment through temporal-adaptive multimodal prompt learning f
- sana-sprint one-step diffusion with continuous-time consistency distillation | arXiv: 2503.09641
- SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation | arXiv: 2503.09641
- sas segment any 3d scene with integrated 2d priors | arXiv: 2503.08512
- sat2city 3d city generation from a single satellite image with cascaded latent d | arXiv: 2507.04403
- sauce selective concept unlearning in vision-language models with sparse autoenc | arXiv: 2503.14530
- sc-captioner improving image captioning with self-correction by reinforcement le | arXiv: 2508.06125
- scaling 3d compositional models for robust classification and pose estimation
- scaling action detection adatad with transformer-enhanced temporal-spatial adapt
- scaling inference-time search with vision value model for improved visual compre | arXiv: 2412.03704
- Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension | arXiv: 2412.03704
- Scaling Language-Free Visual Representation Learning | arXiv: 2504.01017
- Scaling Laws for Native Multimodal Models | arXiv: 2504.07951
- scaling omni-modal pretraining with multimodal context advancing universal repre
- scaling tumor segmentation best lessons from real and synthetic data | arXiv: 2510.14831
- scan bootstrapping contrastive pre-training for data efficiency | arXiv: 2411.09126
- scene coordinate reconstruction priors | arXiv: 2510.12387
- scenemi motion in-betweening for modeling human-scene interaction | arXiv: 2503.16289
- scenepainter semantically consistent perpetual 3d scene generation with concept
- scflow implicitly learning style and content disentanglement with flow models | arXiv: 2508.03402
- scheduling weight transitions for quantization-aware training | arXiv: 2404.19248
- scivid cross-domain evaluation of video models in scientific applications | arXiv: 2507.03578
- score scene context matters in open-vocabulary remote sensing instance segmentat | arXiv: 2507.12857
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation | arXiv: 2507.12857
- scorehoi physically plausible reconstruction of human-object interaction via sco | arXiv: 2509.07920
- sculpting memory multi-concept forgetting in diffusion models via dynamic mask a
- sd2actor continuous state decomposition via diffusion embeddings for robotic man
- sdmatte grafting diffusion models for interactive matting | arXiv: 2508.00443
- seeing and seeing through the glass real and synthetic data for multi-layer dept | arXiv: 2503.11633
- seganypet universal promptable segmentation from positron emission tomography im | arXiv: 2502.14351
- segmentdreamer towards high-fidelity text-to-3d synthesis with segmented consist | arXiv: 2507.05256
- sehdr single-exposure hdr novel view synthesis via 3d gaussian bracketing | arXiv: 2509.20400
- selective contrastive learning for weakly supervised affordance grounding | arXiv: 2508.07877
- self-calibrated variance-stabilizing transformations for real-world image denois | arXiv: 2407.17399
- self-calibrating gaussian splatting for large field-of-view reconstruction
- self-ensembling gaussian splatting for few-shot novel view synthesis | arXiv: 2411.00144
- self-supervised learning of hybrid part-aware 3d representations of 2d gaussians | arXiv: 2408.10789
- self-supervised sparse sensor fusion for long range perception | arXiv: 2508.13995
- semantic alignment and reinforcement for data-free quantization of vision transf | arXiv: 2412.16553
- semantic causality-aware vision-based 3d occupancy prediction | arXiv: 2509.08388
- semantic discrepancy-aware detector for image forgery identification | arXiv: 2508.12341
- semantic watermarking reinvented enhancing robustness and generation quality wit | arXiv: 2509.07647
- semges semantics-aware co-speech gesture generation using semantic coherence and | arXiv: 2507.19359
- semi-supervised deep transfer for regression without domain alignment | arXiv: 2509.05092
- semivisbooster boosting semi-supervised learning for fine-grained classification
- semtalk holistic co-speech motion generation with frame-level semantic emphasis | arXiv: 2412.16563
- separation for better integration disentangling edge and motion in event-based d
- seqgrowgraph learning lane topology as a chain of graph expansions | arXiv: 2507.04822
- sequential gaussian avatars with hierarchical motion context | arXiv: 2411.16768
- sequential keypoint density estimator an overlooked baseline of skeleton-based v | arXiv: 2506.18368
- serep semantic facial expression representation for robust in-the-wild capture a
- serialization based point cloud oversegmentation
- sfuod source-free unknown object detection | arXiv: 2507.17373
- shadowhack hacking shadows via luminance-color divide and conquer | arXiv: 2412.02545
- shape of motion 4d reconstruction from a single video | arXiv: 2407.13764
- sheap self-supervised head geometry predictor learned via 2d gaussians | arXiv: 2504.12292
- shift smoothing hallucinations by information flow tuning for multimodal large l
- shortft diffusion model alignment via shortcut-based fine-tuning | arXiv: 2507.22604
- shortv efficient multimodal large language models by freezing visual tokens in i | arXiv: 2504.00502
- ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers | arXiv: 2504.00502
- sibai a few-shot meta-classifier for poisoning detection in federated learning
- sic similarity-based interpretable image classification with neural networks | arXiv: 2501.17328
- signrep enhancing self-supervised sign representations | arXiv: 2503.08529
- signs as tokens a retrieval-enhanced multilingual sign language generator | arXiv: 2411.17799
- sim-detr unlock detr for temporal sentence grounding | arXiv: 2509.23867
- sim3d single-instance multiview multimodal and multisetup 3d anomaly detection b | arXiv: 2506.21549
- simmlm a simple framework for multi-modal learning with missing modality | arXiv: 2507.19264
- simplevqa multimodal factuality evaluation for multimodal large language models | arXiv: 2502.13059
- simulating dual-pixel images from ray tracing for depth estimation | arXiv: 2503.11213
- simultaneous motion and noise estimation with event cameras | arXiv: 2504.04029
- single-scanline relative pose estimation for rolling shutter cameras | arXiv: 2506.22069
- site towards spatial intelligence thorough evaluation | arXiv: 2505.05456
- skeleton motion words for unsupervised skeleton-based temporal action segmentati | arXiv: 2508.04513
- sketchsplat 3d edge reconstruction via differentiable multi-view sketch splattin | arXiv: 2503.14786
- skip-vision efficient and scalable acceleration of vision-language models via ad
- skysense v2 a unified foundation model for multi-modal remote sensing | arXiv: 2507.13812
- sl2a-inr single-layer learnable activation for implicit neural representation | arXiv: 2409.10836
- sliderspace decomposing the visual capabilities of diffusion models | arXiv: 2502.01639
- smarties spectrum-aware multi-sensor auto-encoder for remote sensing images | arXiv: 2506.19585
- smgdiff soccer motion generation using diffusion probabilistic models | arXiv: 2411.16216
- smolora exploring and defying dual catastrophic forgetting in continual visual i | arXiv: 2411.13949
- social debiasing for fair multi-modal llms | arXiv: 2408.06569
- soft separation and distillation toward global uniformity in federated unsupervi | arXiv: 2508.01251
- spade spatial-aware denoising network for open-vocabulary panoptic scene graph g | arXiv: 2507.05798
- sparfels fast reconstruction from sparse unposed imagery | arXiv: 2505.02178
- sparse-dense side-tuner for efficient video temporal grounding | arXiv: 2507.07744
- sparselanestp leveraging spatio-temporal priors with sparse transformers for 3d | arXiv: 2601.04968
- sparsemm head sparsity emerges from visual concept responses in mllms | arXiv: 2506.05344
- SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs | arXiv: 2506.05344
- sparsevila decoupling visual sparsity for efficient vlm inference | arXiv: 2510.17777
- SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference | arXiv: 2510.17777
- sparsity outperforms low-rank projections in few-shot adaptation | arXiv: 2504.12436
- spatial preference rewarding for mllms spatial understanding | arXiv: 2510.14374
- spatial-temporal aware visuomotor diffusion policy learning | arXiv: 2507.06710
- spatial-temporal forgery trace based forgery image identification
- spatially-varying autofocus
- spatialsplat efficient semantic 3d from sparse unposed images | arXiv: 2505.23044
- spatialtrackerv2 advancing 3d point tracking with explicit camera motion
- specguard spectral projection-based advanced invisible watermarking | arXiv: 2510.07302
- spectral image tokenizer | arXiv: 2412.09607
- spectral sensitivity estimation with an uncalibrated diffraction grating | arXiv: 2508.00330
- spherical epipolar rectification for deep two-view absolute depth estimation
- spikediff zero-shot high-quality video reconstruction from chromatic spike camer
- spinmeround consistent multi-view identity generation using diffusion models | arXiv: 2504.10716
- splat-based 3d scene reconstruction with extreme motion-blur
- splat-loam gaussian splatting lidar odometry and mapping | arXiv: 2503.17491
- splattalk 3d vqa with gaussian splatting | arXiv: 2503.06271
- split-and-combine enhancing style augmentation for single domain generalization
- srefiner soft-braid attention for multi-agent trajectory refinement | arXiv: 2507.04263
- ssvq unleashing the potential of vector quantization with sign-splitting | arXiv: 2503.08668
- stable score distillation | arXiv: 2507.09168
- staining and locking computer vision models without retraining | arXiv: 2507.22000
- star spatial-temporal augmentation with text-to-video models for real-world vide
- std-gs exploring frame-event interaction for spatiotemporal-disentangled gaussia
- stealthattack robust 3d gaussian splatting poisoning via density-guided illusion | arXiv: 2510.02314
- stealthy backdoor attack in federated learning via adaptive layer-wise gradient
- steerx creating any camera-free 3d and 4d scenes with geometric steering | arXiv: 2503.12024
- step-detr advancing detr-based semi-supervised object detection with super teach
- stepping out of similar semantic space for open-vocabulary segmentation | arXiv: 2506.16058
- stereo any video temporally consistent stereo matching | arXiv: 2503.05549
- sti-bench are mllms ready for precise spatial-temporal world understanding | arXiv: 2503.23765
- stiv scalable text and image conditioned video generation | arXiv: 2412.07730
- stochastic interpolants for revealing stylistic flows across the history of art
- stochasticsplats stochastic rasterization for sorting-free 3d gaussian splatting | arXiv: 2503.24366
- stolenlora exploring lora extraction attacks via synthetic data | arXiv: 2509.23594
- straighten viscous rectified flow via noise optimization | arXiv: 2507.10218
- strandhead text to hair-disentangled 3d head avatars using human-centric priors | arXiv: 2412.11586
- streamdiffusion a pipeline-level solution for real-time interactive generation | arXiv: 2312.12491
- streamgs online generalizable gaussian splatting reconstruction for unposed imag
- streaming videollms for real-time procedural video understanding
- streammind unlocking full frame rate streaming video dialogue through event-gate | arXiv: 2503.06220
- stroke2sketch harnessing stroke attributes for training-free sketch generation | arXiv: 2510.16319
- structure-aware semantic discrepancy and consistency for 3d medical image self-s
- structure-guided diffusion models for high-fidelity portrait shadow removal | arXiv: 2507.04692
- strumamba3d exploring structural mamba for self-supervised point cloud represent | arXiv: 2506.21541
- stylekeeper prevent content leakage using negative visual query guidance | arXiv: 2510.06827
- stylemotif multi-modal motion stylization using style-content cross fusion | arXiv: 2503.21775
- stylesrn scene text image super-resolution with text style embedding
- stylized-face a million-level stylized face dataset for face recognition
- su-rgs relightable 3d gaussian splatting from sparse views under unconstrained i
- subjective camera 10 bridging human cognition and visual reconstruction through
- suma a subspace mapping approach for robust and effective concept erasure in tex
- summdiff generative modeling of video summarization with diffusion | arXiv: 2510.08458
- supercharging floorplan localization with semantic rays | arXiv: 2507.09291
- superdec 3d scene decomposition with superquadrics primitives | arXiv: 2504.00992
- superedit rectifying and facilitating supervision for instruction-based image ed | arXiv: 2505.02370
- supermat physically consistent pbr material estimation at interactive rates | arXiv: 2411.17515
- supervised exploratory learning for long-tailed visual recognition
- surfacesplat connecting surface reconstruction and gaussian splatting | arXiv: 2507.15602
- sv4d 20 enhancing spatio-temporal consistency in multi-view video diffusion for
- svg-head hybrid surface-volumetric gaussians for high-fidelity head reconstructi | arXiv: 2508.09597
- svip semantically contextualized visual patches for zero-shot learning | arXiv: 2503.10252
- svtrv2 ctc beats encoder-decoder models in scene text recognition | arXiv: 2411.15858
- sweettok semantic-aware spatial-temporal tokenizer for compact video discretizat | arXiv: 2412.10443
- switch-a-view view selection learned from unlabeled in-the-wild videos | arXiv: 2412.18386
- synad enhancing real-world end-to-end autonomous driving models through syntheti
- syncdiff synchronized motion diffusion for multi-body human-object interaction s | arXiv: 2412.20104
- synchronization of multiple videos | arXiv: 2510.14051
- syncity training-free generation of 3d worlds | arXiv: 2503.16420
- synergistic prompting for robust visual recognition with missing modalities | arXiv: 2507.07802
- synfer towards boosting facial expression recognition with synthetic data | arXiv: 2410.09865
- syntag enhancing the geometric robustness of inversion-based generative image wa
- synthesizing near-boundary ood samples for out-of-distribution detection | arXiv: 2507.10225
- tab transformer attention bottlenecks enable user intervention and debugging in | arXiv: 2412.18675
- taming the untamed graph-based knowledge retrieval and reasoning for mllms to co | arXiv: 2506.17589
- tapnext tracking any point tap as next token prediction | arXiv: 2504.05579
- tar3d creating high-quality 3d assets via next-part prediction | arXiv: 2412.16919
- target bias is all you need zero-shot debiasing of vision-language models with b
- tars traffic-aware radar scene flow estimation | arXiv: 2503.10210
- task vector quantization for memory-efficient model merging | arXiv: 2503.06921
- task-aware prompt gradient projection for parameter-efficient tuning federated c
- tavis text-bridged audio-visual segmentation with foundation models | arXiv: 2506.11436
- taxadiffusion progressively trained diffusion model for fine-grained species gen | arXiv: 2506.01923
- tcfg truncated classifier-free guidance for efficient and scalable text-to-image
- teaching ai the anatomy behind the scan addressing anatomical flaws in medical i
- teefusion blending text embeddings to distill classifier-free guidance | arXiv: 2507.18192
- teeth reconstruction and performance capture using a phone camera
- teethgenerator a two-stage framework for paired pre- and post-orthodontic 3d den | arXiv: 2507.04685
- temperature in cosine-based softmax loss
- temporal overlapping prediction a self-supervised pre-training method for lidar
- temporal rate reduction clustering for human motion segmentation | arXiv: 2506.21249
- temporal unlearnable examples preventing personal video data from unauthorized e | arXiv: 2507.07483
- temporal-aware query routing for real-time video instance segmentation
- tera rethinking text-guided realistic 3d avatar generation | arXiv: 2509.02466
- test-time prompt tuning for zero-shot depth completion
- test-time retrieval-augmented adaptation for vision-language models
- text embedding knows how to quantize text-guided diffusion models | arXiv: 2507.10340
- text2outfit controllable outfit generation with multimodal language models
- text2vdm text to vector displacement maps for expressive and interactive 3d scul | arXiv: 2502.20045
- textured 3d regenerative morphing with 3d diffusion prior | arXiv: 2502.14316
- the best of both worlds integrating language models and diffusion models for vid
- the curse of conditions analyzing and improving optimal transport for conditiona | arXiv: 2503.10636
- the devil is in the spurious correlations boosting moment retrieval with dynamic | arXiv: 2501.07305
- the inter-intra modal measure a predictive lens on fine-tuning outcomes in visio | arXiv: 2407.15731
- the scalability of simplicity empirical analysis of vision-language learning wit
- the silent assistant noisequery as implicit guidance for goal-driven image gener | arXiv: 2412.05101
- thermal polarimetric multi-view stereo | arXiv: 2510.20972
- tikzero zero-shot text-guided graphics program synthesis | arXiv: 2503.11509
- time-aware auto white balance in mobile photography | arXiv: 2504.05623
- timeexpert an expert-guided video llm for video temporal grounding | arXiv: 2508.01699
- timeformer capturing temporal relationships of deformable 3d gaussians for robus | arXiv: 2411.11941
- timestep-aware diffusion model for extreme image rescaling | arXiv: 2408.09151
- tinyvim frequency decoupling for tiny hybrid vision mamba | arXiv: 2411.17473
- tip-i2v a million-scale real text and image prompt dataset for image-to-video ge | arXiv: 2411.04709
- tlb-vfi temporal-aware latent brownian bridge diffusion for video frame interpol | arXiv: 2507.04984
- to label or not to label palm - a predictive model for evaluating sample efficie | arXiv: 2507.15381
- toga temporally grounded open-ended video qa with weak supervision | arXiv: 2506.09445
- token-efficient vlm high-resolution image understanding via dynamic region propo
- TokenBridge: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation | arXiv: 2503.16430
- tokensgen harnessing condensed tokens for long video generation
- tokenunify scaling up autoregressive pretraining for neuron segmentation | arXiv: 2405.16847
- toolvqa a dataset for multi-step reasoning vqa with external tools | arXiv: 2508.03284
- ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools | arXiv: 2508.03284
- topotta topology-enhanced test-time adaptation for tubular structure segmentatio | arXiv: 2508.00442
- totp transferable online pedestrian trajectory prediction with temporal-adaptive
- toward better out-painting improving the image composition with initialization p
- toward long-tailed online anomaly detection through class-agnostic concepts | arXiv: 2507.16946
- toward material-agnostic system identification from videos | arXiv: 2508.01112
- Towards a Unified Copernicus Foundation Model for Earth Vision | arXiv: 2503.11849
- towards a unified copernicus foundation model for earth vision | arXiv: 2503.11849
- towards a universal 3d medical multi-modality generalization via learning person
- towards a universal image degradation model via content-degradation disentanglem | arXiv: 2505.12860
- towards adversarial robustness via debiased high-confidence logit alignment | arXiv: 2408.06079
- towards comprehensive lecture slides understanding large-scale dataset and effec
- towards cross-modal backward-compatible representation learning for vision-langu
- towards efficient general feature prediction in masked skeleton modeling | arXiv: 2509.03609
- towards long-horizon vision-language-action system reasoning acting and memory
- towards more diverse and challenging pre-training for point cloud learning self- | arXiv: 2509.01250
- towards omnimodal expressions and reasoning in referring audio-visual segmentati | arXiv: 2507.22886
- towards open-world generation of stereo images and unsupervised matching | arXiv: 2503.12720
- towards performance consistency in multi-level model collaboration
- towards privacy-preserved pre-training of remote sensing foundation models with
- towards robust defense against customization via protective perturbation resista | arXiv: 2509.13922
- towards robustness of person search against corruptions
- towards scalable spatial intelligence via 2d-to-3d data lifting | arXiv: 2507.18678
- towards stabilized and efficient diffusion transformers through long-skip-connec
- towards video thinking test a holistic benchmark for advanced video reasoning an | arXiv: 2507.15028
- tpg-inr target prior-guided implicit 3d ct reconstruction for enhanced sparse-vi
- tr-pts task-relevant parameter and token selection for efficient tuning | arXiv: 2507.22872
- trace learning 3d gaussian physical dynamics from multi-view videos | arXiv: 2508.09811
- trace3d consistent segmentation lifting via gaussian instance tracing | arXiv: 2508.03227
- trackany3d transferring pretrained 3d models for category-unified 3d point cloud | arXiv: 2507.19908
- tracking tiny drones against clutter large-scale infrared benchmark with motion-
- trade-offs in image generation how do different dimensions interact | arXiv: 2507.22100
- trafficloc localizing traffic surveillance cameras in 3d scenes | arXiv: 2412.10308
- training-free class purification for open-vocabulary semantic segmentation | arXiv: 2508.00557
- training-free generation of temporally consistent rewards from vlms | arXiv: 2507.04789
- training-free industrial defect generation with diffusion models
- training-free personalization via retrieval and reasoning on fingerprints | arXiv: 2503.18623
- TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update | arXiv: 2507.11069
- trans-adapter a plug-and-play framework for transparent image inpainting | arXiv: 2508.01098
- transformed low-rank adaptation via tensor decomposition and its applications to | arXiv: 2501.08727
- transit transient transformer for non-line-of-sight videography | arXiv: 2503.11328
- transparent vision a theory of hierarchical invariant representations
- trce towards reliable malicious concept erasure in text-to-image diffusion model | arXiv: 2503.07389
- trial-oriented visual rearrangement
- tridi trilateral diffusion of 3d humans objects and interactions | arXiv: 2412.06334
- trokens semantic-aware relational trajectory tokens for few-shot action recognit | arXiv: 2508.03695
- trust but verify programmatic vlm evaluation in the wild | arXiv: 2410.13121
- trustmark robust watermarking and watermark removal for arbitrary resolution ima
- tryon-refiner conditional rectified-flow-based tryon refiner for more accurate d
- tune-your-style intensity-tunable 3d style transfer with gaussian splatting | arXiv: 2602.00618
- turboreg turboclique for robust and efficient point cloud registration | arXiv: 2507.01439
- twist scout grounding multimodal llm-experts by forget-free tuning
- two losses one goal balancing conflict gradients for semi-supervised semantic se
- u-vilar uncertainty-aware visual localization for autonomous driving via differe
- uavscenes a multi-modal dataset for uavs | arXiv: 2507.22412
- udc-vit a real-world video dataset for under-display cameras | arXiv: 2501.18545
- uipro unleashing superior interaction capability for gui agents | arXiv: 2509.17328
- ukbob one billion mri labeled masks for generalizable 3d medical image segmentat | arXiv: 2504.06908
- ultho ultra-lightweight yet efficient hyperparameter optimization in deep reinfo
- ultra-precision 6dof pose estimation using 2-d interpolated discrete fourier tra
- umdatrack unified multi-domain adaptive tracking under adverse weather condition | arXiv: 2507.00648
- uncalibrated structure from motion on a sphere
- uncertainty-aware gradient stabilization for small object detection | arXiv: 2303.01803
- understanding co-speech gestures in-the-wild | arXiv: 2503.22668
- understanding flatness in generative models its role and benefits | arXiv: 2503.11078
- understanding museum exhibits using vision-language reasoning | arXiv: 2412.01370
- understanding personal concept in open-vocabulary semantic segmentation | arXiv: 2507.11030
- unfolding-associative encoder-decoder network with progressive alignment for pan
- unicombine unified multi-conditional combination with diffusion transformer | arXiv: 2503.09277
- uniconvnet expanding effective receptive field while maintaining asymptotically | arXiv: 2508.09000
- unidxmd towards unified representation for cross-modal unsupervised domain adapt
- uniegomotion a unified model for egocentric motion reconstruction forecasting an | arXiv: 2508.01126
- unified category-level object detection and pose estimation from rgb images usin | arXiv: 2508.02157
- unified multi-agent trajectory modeling with masked trajectory diffusion
- unified multimodal understanding via byte-pair visual encoding | arXiv: 2506.23639
- uniglyph unified segmentation-conditioned diffusion for precise visual text synt | arXiv: 2507.00992
- uniocc a unified benchmark for occupancy forecasting and prediction in autonomou | arXiv: 2503.24381
- uniphys unified planner and controller with diffusion for flexible physics-based | arXiv: 2504.12540
- uniportrait a unified framework for identity-preserving single- and multi-human
- unires universal image restoration for complex degradations | arXiv: 2506.05599
- universe unleashing the scene prior of video diffusion models for robust radianc
- univg a generalist diffusion model for unified image generation and editing | arXiv: 2503.12652
- unleashing high-quality image generation in diffusion sampling using second-orde
- unleashing the temporal potential of stereo event cameras for continuous-time 3d | arXiv: 2508.02288
- unleashing vecset diffusion model for fast shape generation | arXiv: 2503.16302
- unlocking the potential of diffusion priors in blind face restoration | arXiv: 2508.08556
- unraveling the effects of synthetic data on end-to-end autonomous driving | arXiv: 2503.18108
- unsupervised identification of protein compositions and conformations via implic
- unsupervised imaging inverse problems with diffusion distribution matching | arXiv: 2506.14605
- unsupervised joint learning of optical flow and intensity with event cameras | arXiv: 2503.17262
- unsupervised rgb-d point cloud registration for scenes with low overlap and phot
- unsupervised visible-infrared person re-identification under unpaired settings
- unsupervised visual chain-of-thought reasoning via preference optimization | arXiv: 2504.18397
- unziplora separating content and style from a single image | arXiv: 2412.04465
- upp unified point-level prompting for robust point cloud analysis | arXiv: 2507.18997
- upre zero-shot domain adaptation for object detection via unified prompt and rep | arXiv: 2507.00721
- ust-ssm unified spatio-temporal state space models for point cloud video modelin | arXiv: 2508.14604
- v2pe improving multimodal long-context capability of vision-language models with
- v2xpnp vehicle-to-everything spatio-temporal fusion for multi-agent perception a | arXiv: 2412.01812
- va-moe variables-adaptive mixture of experts for incremental weather forecasting | arXiv: 2412.02503
- vace all-in-one video creation and editing | arXiv: 2503.07598
- VACE: All-in-One Video Creation and Editing | arXiv: 2503.07598
- vaflow video-to-audio generation with cross-modality flow matching
- vamba understanding hour-long videos with hybrid mamba-transformers | arXiv: 2503.11579
- variance-based pruning for accelerating and compressing trained networks | arXiv: 2507.12988
- vector contrastive learning for pixel-wise pretraining in medical vision | arXiv: 2506.20850
- veggie instructional editing and reasoning video concepts with grounded generati | arXiv: 2503.14350
- versatile transition generation with image-to-video diffusion | arXiv: 2508.01698
- vertexregen mesh generation with continuous level of detail | arXiv: 2508.09062
- vggsounder audio-visual evaluations for foundation models | arXiv: 2508.08237
- vgmamba attribute-to-location clue reasoning for quantity-agnostic 3d visual gro
- victr vital consistency transfer for pathology aware image synthesis | arXiv: 2505.04963
- vid-group temporal video grounding pretraining from unlabeled videos in the wild
- video color grading via look-up table generation | arXiv: 2508.00548
- video motion graphs | arXiv: 2503.20218
- video-t1 test-time scaling for video generation | arXiv: 2503.18942
- videollamb long streaming video understanding with recurrent memory bridges | arXiv: 2409.01071
- videominer iteratively grounding key frames of hour-long videos via tree-based g | arXiv: 2510.06040
- videosetdiff identifying and reasoning similarities and differences in similar v
- videovae large motion video autoencoding with cross-modal video vae
- viewsrd 3d visual grounding via structured multi-view decomposition | arXiv: 2507.11261
- vigface virtual identity generation for privacy-free face recognition dataset | arXiv: 2403.08277
- vilu learning vision-language uncertainties for failure prediction | arXiv: 2507.07620
- vip iterative online preference distillation for efficient video diffusion model | arXiv: 2508.03254
- vishall3d monocular semantic scene completion from reconstructing the visible re
- vision-language interactive relation mining for open-vocabulary scene graph gene
- vision-language models cant see the obvious | arXiv: 2507.04741
- vision-language neural graph featurization for extracting retinal lesions
- visionmath vision-form mathematical problem-solving
- visnumbench evaluating number sense of multimodal large language models | arXiv: 2503.14939
- visrl intention-driven visual perception via reinforced reasoning | arXiv: 2503.07523
- visual chronicles using multimodal llms to analyze massive collections of images | arXiv: 2504.08727
- visual intention grounding for egocentric assistants | arXiv: 2504.13621
- visual interestingness decoded how gpt-4o mirrors human interests | arXiv: 2510.13316
- visual modality prompt for adapting vision-language object detectors | arXiv: 2412.00622
- visual relation diffusion for human-object interaction detection
- visual surface wave elastography revealing subsurface physical properties via vi | arXiv: 2507.09207
- visual-oriented fine-grained knowledge editing for multimodal large language mod | arXiv: 2411.12790
- visual-rft visual reinforcement fine-tuning | arXiv: 2503.01785
- visualcloze a universal image generation framework via visual in-context learnin | arXiv: 2504.07960
- VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning | arXiv: 2504.07960
- vit-ensembleattack augmenting ensemble models for stronger adversarial transfera
- vit-linearizer distilling quadratic knowledge into linear-time vision models | arXiv: 2504.00037
- vit-split unleashing the power of vision foundation models via efficient splitti | arXiv: 2506.03433
- vital more understandable feature visualization through distribution alignment a | arXiv: 2503.22399
- vivid4d improving 4d reconstruction from monocular video by video inpainting | arXiv: 2504.11092
- vlabench a large-scale benchmark for language-conditioned robotics manipulation
- vlipp towards physically plausible video generation with vision and language inf
- vlr-driver large vision-language-reasoning models for embodied autonomous drivin
- vlrmbench a comprehensive and challenging benchmark for vision-language reward m | arXiv: 2503.07478
- vmbench a benchmark for perception-aligned video motion generation | arXiv: 2503.10076
- voccl3d a video benchmark dataset for 3d human pose and shape estimation under r | arXiv: 2508.06757
- volume - authentic 3d video calls from live gaussian splat prediction | arXiv: 2507.21311
- volumetricsmpl a neural volumetric body model for efficient interactions contact | arXiv: 2506.23236
- vovtrack exploring the potentiality in raw videos for open-vocabulary multi-obje
- vpo aligning text-to-video generation models with prompt optimization | arXiv: 2503.20491
- vq-sgen a vector quantized stroke representation for creative sketch generation | arXiv: 2411.16446
- vq-vla improving vision-language-action models via scaling vector-quantized acti | arXiv: 2507.01016
- vsc visual search compositional text-to-image diffusion model | arXiv: 2505.01104
- vsp diagnosing the dual challenges of perception and reasoning in spatial planni
- vsrm a robust mamba-based framework for video super-resolution | arXiv: 2506.22762
- vssd vision mamba with non-causal state space duality | arXiv: 2407.18559
- vtimecot thinking by drawing for video temporal grounding and reasoning | arXiv: 2510.14672
- vulnerability-aware spatio-temporal learning for generalizable deepfake video de | arXiv: 2501.01184
- walkvlm aid visually impaired people walking by vision language model
- wasserstein style distribution analysis and transform for stylized image generat
- wave-mambaad wavelet-driven state space model for multi-class unsupervised anoma
- wavelet policy lifting scheme for policy learning in long-horizon tasks | arXiv: 2507.04331
- weakly supervised visible-infrared person re-identification via heterogeneous ex | arXiv: 2507.12942
- weakly-supervised learning of dense functional correspondences | arXiv: 2509.03893
- weaveseg iterative contrast-weaving and spectral feature-refining for nuclei ins
- what changed and what could have changed state-change counterfactuals for proced
- what changed detecting and evaluating instruction-guided image edits with multim
- what if understanding motion through sparse interactions | arXiv: 2510.12777
- what makes for text to 360-degree panorama generation with stable diffusion | arXiv: 2505.22129
- what you have is what you track adaptive and robust multimodal tracking | arXiv: 2507.05899
- whats in a latent leveraging diffusion latent space for domain generalization | arXiv: 2503.06698
- whats making that sound right now video-centric audio-visual localization | arXiv: 2507.04667
- when large vision-language model meets large remote sensing imagery coarse-to-fi
- when lighting deceives exposing vision-language models illumination vulnerabilit
- when pixel difference patterns meet vit pidivit for few-shot object detection
- where am i cross-view geo-localization with natural language descriptions | arXiv: 2412.17007
- where what why towards explainable driver attention prediction | arXiv: 2506.23088
- who controls the authorization invertible networks for copyright protection in t
- who is a better talker subjective and objective quality assessment for ai-genera
- why lvlms are more prone to hallucinations in longer responses the role of conte | arXiv: 2510.20229
- wikiautogen towards multi-modal wikipedia-style article generation | arXiv: 2503.19065
- wildsat learning satellite image representations from wildlife observations | arXiv: 2412.14428
- wildseg3d segment any 3d objects in the wild from 2d images | arXiv: 2503.08407
- wins winograd structured pruning for fast winograd convolution
- wir3d visually-informed and geometry-aware 3d shape abstraction | arXiv: 2505.04813
- wonderplay dynamic 3d scene generation from a single image and actions | arXiv: 2505.18151
- wonderturbo generating interactive 3d world in 0.72 seconds | arXiv: 2504.02261
- world4drive end-to-end autonomous driving via intention-aware physical latent wo | arXiv: 2507.00603
- worldscore a unified evaluation benchmark for world generation | arXiv: 2504.00983
- x-dancer expressive music to human dance video generation | arXiv: 2502.17414
- x-prompt generalizable auto-regressive visual learning with in-context prompting
- xtrack multimodal training boosts rgb-x video object trackers | arXiv: 2405.17773
- yolo-count differentiable object counting for text-to-image generation | arXiv: 2508.00728
- YOLOE: Real-Time Seeing Anything | arXiv: 2503.07465
- you share beliefs i adapt progressive heterogeneous collaborative perception | arXiv: 2509.09310
- your text encoder can be an object-level watermarking controller | arXiv: 2503.11945
- zero-avsr zero-shot audio-visual speech recognition with llms by learning langua | arXiv: 2503.06273
- zero-shot depth aware image editing with diffusion models
- zero-shot inexact cad model alignment from a single image | arXiv: 2507.03292
- zerostereo zero-shot stereo matching from single images | arXiv: 2501.08654
- zeroth-order fine-tuning of llms in random subspaces | arXiv: 2410.08989
- zfusion efficient deep compositional zero-shot learning for blind image super-re
- zim zero-shot image matting for anything | arXiv: 2411.00626
- zipvl accelerating vision-language models through dynamic token sparsity
- 3dgs lm faster gaussian splatting optimization with levenberg marquardt
- aaa gaussians anti aliased artifact free 3d gaussian rendering
- alltracker efficient dense point tracking at high resolution
- argmatch adaptive refinement gathering for efficient dense matching
- beziergs dynamic urban scene reconstruction with bezier curve gaussian splatting | arXiv: 2506.22099
- boosting multi-view indoor 3d object detection via adaptive 3d volume | arXiv: 2507.18331
- bridging 3d anomaly localization and repair via high-qualit | arXiv: 2505.24431
- dap-mae domain-adaptive point cloud masked autoencoder for e | arXiv: 2510.21635
- ask and remember a questions only replay strategy for continual visual question answering
- backdoor attacks on neural networks via one bit flip
- acam kd adaptive cooperative attention masking knowledge distillation | arXiv: 2503.06307
- ad gs object aware bspline gaussian splatting self supervised autonomous driving
- resonance learning to predict social aware pedestrian trajectories as co vibrations | arXiv: 2412.02447
- tikzero zero-shot text-guided graphics program synthesis | arXiv: 2503.11509
- cargait cross attention based re ranking for gait recognition | arXiv: 2503.03501
- dynfacerestore balancing fidelity and quality in diffusion-guided blind face res | arXiv: 2507.13797
- a0 affordance aware hierarchical model robotic manipulation | arXiv: 2504.12636
- adaptive routing of text to image generation requests between large cloud model and light weight edge model
- addressing text embedding leakage in diffusion based image editing
- adiee automatic dataset creation and scorer for instruction guided image editing evaluation
- ale attribute leakage free editing | arXiv: 2412.04715
- bridging diffusion models and 3d representations a 3d consis | arXiv: 2508.04090
- bridging the skeleton text modality gap diffusion powered modality alignment for | arXiv: 2411.10745
- chords diffusion sampling accelerator with multi core hierarchical ode solvers | arXiv: 2507.15260
- ec-flow enabling versatile robotic manipulation from action-unlabeled videos via | arXiv: 2507.06224
- aligning information capacity between vision and language via dense-to-sparse fe
- aligning information capacity between vision and language via dense to sparse feature distillation
- langbridge interpreting image as a combination of language embeddings
- monster a unified model for motion scene text retrieval
- ocr hinders rag evaluating the cascading impact of ocr on retrieval-augmented ge | arXiv: 2412.02592
- representation shift unifying token compression with flashattention | arXiv: 2508.00367
- aim amending inherent interpretability via self-supervised masking | arXiv: 2508.11502
- argotweak towards self-updating hd maps through structured priors | arXiv: 2509.08764
- ce-fam concept-based explanation via fusion of activation maps | arXiv: 2509.23849
- granular concept circuits toward a fine-grained circuit discovery for concept re | arXiv: 2508.01728
- learnable fractional reaction-diffusion dynamics for under-display tof imaging a | arXiv: 2511.01704
- minerva evaluating complex video reasoning | arXiv: 2505.00681
- principal components enable a new language of images | arXiv: 2503.08685
- svip semantically contextualized visual patches for zero-shot learning | arXiv: 2503.10252
- 3dsrbench a comprehensive 3d spatial reasoning benchmark | arXiv: 2412.07825
- a conditional probability framework for compositional zero-shot learning | arXiv: 2507.17377
- a conditional probability framework for compositional zerosh | arXiv: 2507.17377
- a real-world display inverse rendering dataset | arXiv: 2508.14411
- a realworld display inverse rendering dataset | arXiv: 2508.14411
- batclip bimodal online test-time adaptation for clip | arXiv: 2412.02837
- discopatch taming adversarially-driven batch statistics for improved out-of-dist | arXiv: 2501.08005
- dista-net dynamic closely-spaced infrared small target unmixing | arXiv: 2505.19148
- forcennet foreground-centric network for document image rectification | arXiv: 2507.19804
- generative zoo | arXiv: 2412.08101
- hiero understanding the hierarchy of human behavior enhances reasoning on egocen | arXiv: 2505.12911
- imbalance in balance online concept balancing in generation models | arXiv: 2507.13345
- intersyn interleaved learning for dynamic motion synthesis in the wild | arXiv: 2508.10297
- odp-bench benchmarking out-of-distribution performance prediction | arXiv: 2510.27263
- omnidiff a comprehensive benchmark for fine-grained image difference captioning | arXiv: 2503.11093
- on the robustness tradeoff in fine-tuning | arXiv: 2503.14836
- rethinking few shot clip benchmarks a critical analysis in the inductive setting | arXiv: 2507.20834
- shadowhack hacking shadows via luminance-color divide and conquer | arXiv: 2412.02545
- spectral sensitivity estimation with an uncalibrated diffraction grating | arXiv: 2508.00330
- supercharging floorplan localization with semantic rays | arXiv: 2507.09291
- svtrv2 ctc beats encoder-decoder models in scene text recognition | arXiv: 2411.15858
- any-ssr how recursive least squares works in continual learning of large languag
- any ssr how recursive least squares works in continual learning of large language models
- va gpt aligning effective tokens video anomaly | arXiv: 2508.06350
- vim versatile interactive motion language model | arXiv: 2410.05628
- ace-g improving generalization of scene coordinate regression through query pre- | arXiv: 2510.11605
- aceg improving generalization of scene coordinate regression | arXiv: 2510.11605
- conststyle robust domain generalization with unified style transformation | arXiv: 2509.05975
- dataset ownership verification for pre-trained masked models | arXiv: 2507.12022
- eta energy-based test-time adaptation for depth completion | arXiv: 2508.05989
- flow to the mode mode-seeking diffusion autoencoders for state-of-the-art image | arXiv: 2503.11056
- image intrinsic scale assessment bridging the gap between quality and resolution | arXiv: 2502.06476
- make your training flexible towards deployment-efficient video models | arXiv: 2503.14237
- adversarial robust memory-based continual learner | arXiv: 2311.17608
- chartcap mitigating hallucination of dense chart captioning | arXiv: 2508.03164
- forgetting through transforming enabling federated unlearning via class-aware re | arXiv: 2410.06848
- temporal unlearnable examples preventing personal video data from unauthorized e | arXiv: 2507.07483
- b vllm a vision large language model with balanced spatio temporal tokens
- motionfollower editing video motion via score-guided diffusion | arXiv: 2405.20325
- adaptive prompt learning via gaussian outlier synthesis for out of distribution detection
- aigi holmes towards explainable and generalizable ai generated image detection via mllm
- aircache activating inter modal relevancy kv cache compression for efficient large vision language model
- coa-vla improving vision-language-action models via visual-text chain-of-afforda | arXiv: 2412.20451
- gtr guided thought reinforcement prevents thought collapse in rl-based vlm agent | arXiv: 2503.08525
- vq focusambiguity acknowledging focus ambiguity visual questions | arXiv: 2501.02201
- learning 4d embodied world models | arXiv: 2504.20995
- a plug-and-play physical motion restoration approach for in- | arXiv: 2412.17377
- lawdis language-window-based controllable dichotomous image segmentati | arXiv: 2508.01152
- gradient extrapolation for debiased representation learning | arXiv: 2503.13236
- propvg end-to-end proposal-driven visual grounding with multi-granularity discri | arXiv: 2509.04833
- i2-world intra-inter tokenization for efficient dynamic 4d scene forecasting | arXiv: 2507.09144
- adversarial distribution matching for diffusion distillation towards efficient i | arXiv: 2507.18569
- adversarial distribution matching for diffusion distillation towards efficient image and video synthesis
- aid adapting image2video diffusion models for instruction-guided video predictio | arXiv: 2406.06465
- aligning moments in time using video queries | arXiv: 2508.15439
- badvideo stealthy backdoor attack against text-to-video generation | arXiv: 2504.16907
- causal-entity reflected egocentric traffic accident video synthesis | arXiv: 2506.23263
- d3 training-free ai-generated video detection using second-order features | arXiv: 2508.00701
- dacon dino for anime paint bucket colorization with any number of reference imag | arXiv: 2509.14685
- decouple and track benchmarking and improving video diffusion transformers for m | arXiv: 2503.17350
- dh-facevid-1k a large-scale high-quality dataset for face video generation | arXiv: 2410.07151
- disentangled world models learning to transfer semantic knowledge from distracti | arXiv: 2503.08751
- dive taming dino for subject-driven video editing
- dollar few-step video generation via distillation and latent reward optimization | arXiv: 2412.15689
- dollar fewstep video generation via distillation and latent | arXiv: 2412.15689
- dreamrelation relation-centric video customization | arXiv: 2503.07602
- dual-expert consistency model for efficient and high-quality video generation | arXiv: 2506.03123
- dualreal adaptive joint training for lossless identity-motion fusion in video cu | arXiv: 2505.02192
- efficientmt efficient temporal adaptation for motion transfer in text-to-video d
- etva evaluation of text-to-video alignment via fine-grained question generation | arXiv: 2503.16867
- free-form motion control controlling the 6d poses of camera and objects in video | arXiv: 2501.01425
- fuxi-rtm a physics-guided prediction framework with radiative transfer modeling | arXiv: 2503.19940
- fvgen accelerating novel-view synthesis with adversarial video diffusion distill | arXiv: 2508.06392
- generating fast and slow scalable parallel video generation with video interface | arXiv: 2503.17539
- leanvae an ultra-efficient reconstruction vae for video diffusion models | arXiv: 2503.14325
- long context tuning for video generation | arXiv: 2503.10589
- magicdrive-v2 high-resolution long video generation for autonomous driving with | arXiv: 2411.13807
- magicmirror id-preserved video generation in video diffusion transformers | arXiv: 2501.03931
- motionagent fine-grained controllable video generation via motion field agent | arXiv: 2502.03207
- motionshot adaptive motion transfer across arbitrary objects for text-to-video g | arXiv: 2507.16310
- multi-identity human image animation with structural video diffusion | arXiv: 2504.04126
- normalcrafter learning temporally consistent normals from video diffusion priors | arXiv: 2504.11427
- ock unsupervised dynamic video prediction with object-centric kinematics | arXiv: 2404.18423
- omnihuman-1 rethinking the scaling-up of one-stage conditioned human animation m | arXiv: 2502.01061
- prompt-a-video prompt your video diffusion model via preference-aligned llm | arXiv: 2412.15156
- quantifying and narrowing the unknown interactive text-to-video retrieval via un | arXiv: 2507.15504
- realcam-i2v real-world image-to-video generation with interactive complex camera | arXiv: 2502.10059
- reangle-a-video 4d video generation as video-to-video translation | arXiv: 2503.09151
- recammaster camera-controlled generative rendering from a single video | arXiv: 2503.11647
- steerx creating any camera-free 3d and 4d scenes with geometric steering | arXiv: 2503.12024
- stiv scalable text and image conditioned video generation | arXiv: 2412.07730
- sweettok semantic-aware spatial-temporal tokenizer for compact video discretizat | arXiv: 2412.10443
- tip-i2v a million-scale real text and image prompt dataset for image-to-video ge | arXiv: 2411.04709
- vace all-in-one video creation and editing | arXiv: 2503.07598
- vace allinone video creation and editing | arXiv: 2503.07598
- versatile transition generation with image-to-video diffusion | arXiv: 2508.01698
- 4d bench benchmarking multimodal llms for 4d object understanding
- adaptive hyper graph convolution network skeleton action recognition
- aim adaptive inference multimodal llms token merging pruning | arXiv: 2412.03248
- aim adaptive inference of multi modal llms via token merging and pruning
- despite exploring contrastive deep skeleton-pointcloud-imu-text embeddings for a | arXiv: 2506.13897
- prior-flow enhancing primitive panoramic optical flow with o | arXiv: 2506.23897