ICCV 2025 Paper Notes TODO

Total: 2019 papers | Completed: 1518 | Pending update: 501

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining | arXiv: 2501.00958
25 years in class a multimodal textbook for vision-language pretraining | arXiv: 2501.00958
2handedafforder learning precise actionable bimanual affordances from human vide | arXiv: 2503.09320
3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
3d gaussian splatting driven multi-view robust physical adversarial camouflage g | arXiv: 2507.01367
3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation | arXiv: 2507.01367
3D Mesh Editing using Masked LRMs | arXiv: 2412.08641
3d test-time adaptation via graph spectral driven point shift | arXiv: 2507.18225
3D Test-time Adaptation via Graph Spectral Driven Point Shift | arXiv: 2507.18225
3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection | arXiv: 2507.23567
3dgraphllm combining semantic graphs and large language models for 3d scene unde | arXiv: 2412.18450
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding | arXiv: 2412.18450
3dgs-lm faster gaussian-splatting optimization with levenberg-marquardt | arXiv: 2409.12892
3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt | arXiv: 2409.12892
3drealcar an in-the-wild rgb-d car dataset with 360-degree views | arXiv: 2406.04875
3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views | arXiv: 2406.04875
3DSR: Bridging Diffusion Models and 3D Representations for 3D Consistent Super-Resolution | arXiv: 2508.04090
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark | arXiv: 2412.07825
4D Gaussian Splatting SLAM | arXiv: 2503.16710
4d visual pre-training for robot learning | arXiv: 2508.17230
4D Visual Pre-training for Robot Learning | arXiv: 2508.17230
4d-bench benchmarking multi-modal large language models for 4d object understand | arXiv: 2503.17827
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding | arXiv: 2503.17827
4dsegstreamer streaming 4d panoptic segmentation via dual threads | arXiv: 2510.17664
4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads | arXiv: 2510.17664
6dope-gs online 6d object pose estimation using gaussian splatting | arXiv: 2412.01543
6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting | arXiv: 2412.01543
7dgs unified spatial-temporal-angular gaussian splatting | arXiv: 2503.07946
7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting | arXiv: 2503.07946
a conditional probability framework for compositional zero-shot learning | arXiv: 2507.17377
A Conditional Probability Framework for Compositional Zero-shot Learning | arXiv: 2507.17377
a constrained optimization approach for gaussian splatting from coarsely-posed i
A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy LiDAR Point Clouds | arXiv: 2504.09129
a framework for double-blind federated adaptation of foundation models | arXiv: 2502.01289
A Framework for Double-Blind Federated Adaptation of Foundation Models | arXiv: 2502.01289
A Good Teacher Adapts Their Knowledge for Distillation
A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention | arXiv: 2507.14315
a hyperdimensional one place signature to represent them all stackable descripto | arXiv: 2412.06153
a lesson in splats teacher-guided diffusion for 3d gaussian splats generation wi | arXiv: 2412.00623
A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision | arXiv: 2412.00623
a linear n-point solver for structure and motion from asynchronous tracks | arXiv: 2507.22733
a plug-and-play physical motion restoration approach for in-the-wild high-diffic
A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions | arXiv: 2412.17377
a quality-guided mixture of score-fusion experts framework for human recognition | arXiv: 2508.00053
A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition | arXiv: 2508.00053
a real-world display inverse rendering dataset | arXiv: 2508.14411
A Real-world Display Inverse Rendering Dataset | arXiv: 2508.14411
A Recipe for Generating 3D Worlds from a Single Image | arXiv: 2503.16611
a simple yet mighty hartley diffusion versatilist for generalizable dense vision
A Simple yet Mighty Hartley Diffusion Versatilist for Generalizable Dense Vision Tasks
a tiny change a giant leap long-tailed class-incremental learning via geometric
A Token-level Text Image Foundation Model for Document Understanding (TokenFD/TokenVL) | arXiv: 2503.02304
a unified framework for industrial cel-animation colorization with temporal-stru
a unified framework for motion reasoning and generation in human interaction | arXiv: 2410.05628
a unified interpretation of training-time out-of-distribution detection
A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets | arXiv: 2507.04699
a0 an affordance-aware hierarchical model for general robotic manipulation | arXiv: 2504.12636
a3gs arbitrary artistic style into arbitrary 3d gaussian splatting
A3GS: Arbitrary Artistic Style into Arbitrary 3D Gaussian Splatting
aaa-gaussians anti-aliased and artifact-free 3d gaussian rendering | arXiv: 2504.12811
AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering | arXiv: 2504.12811
acam-kd adaptive and cooperative attention masking for knowledge distillation | arXiv: 2503.06307
accelerate 3d object detection models via zero-shot attention key pruning | arXiv: 2503.08101
accelerating diffusion sampling via exploiting local transition coherence | arXiv: 2503.09675
ace-g improving generalization of scene coordinate regression through query pre- | arXiv: 2510.11605
ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training | arXiv: 2510.11605
achieving more with less additive prompt tuning for rehearsal-free class-increme
acknowledging focus ambiguity in visual questions | arXiv: 2501.02201
Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning | arXiv: 2509.07879
active membership inference test amint enhancing model auditability with multi-t | arXiv: 2509.07879
AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images
ad-gs object-aware b-spline gaussian splatting for self-supervised autonomous dr | arXiv: 2507.12137
adadcp learning an adapter with discrete cosine prior for clear-to-adverse domai
adadrive self-adaptive slow-fast system for language-grounded autonomous driving | arXiv: 2511.06253
adahuman animatable detailed 3d human generation with compositional multiview di | arXiv: 2505.24877
AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion | arXiv: 2505.24877
adapt foundational segmentation models with heterogeneous searching space
adaptive articulated object manipulation on the fly with foundation model reason | arXiv: 2507.18276
adaptive dual uncertainty optimization boosting monocular 3d object detection un | arXiv: 2508.20488
Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts | arXiv: 2508.20488
adaptive hyper-graph convolution network for skeleton-based human action recogni
adaptive learning of high-value regions for semi-supervised medical image segmen
adaptive prompt learning via gaussian outlier synthesis for out-of-distribution
adaptive routing of text-to-image generation requests between large cloud model
AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes | arXiv: 2508.13503
addressing representation collapse in vector quantized models with one linear la | arXiv: 2411.02038
addressing text embedding leakage in diffusion-based image editing | arXiv: 2412.04715
adiee automatic dataset creation and scorer for instruction-guided image editing | arXiv: 2507.07317
advancing text-to-3d generation with linearized lookahead variational score dist | arXiv: 2507.09748
advancing textual prompt learning with anchored attributes | arXiv: 2412.09442
advancing visual large language model for multi-granular versatile perception | arXiv: 2507.16213
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations? | arXiv: 2412.03002
adversarial attention perturbations for large object detection transformers | arXiv: 2508.02987
adversarial data augmentation for single domain generalization via lyapunov expo | arXiv: 2507.04302
adversarial distribution matching for diffusion distillation towards efficient i | arXiv: 2507.18569
Adversarial Exploitation of Data Diversity Improves Visual Localization | arXiv: 2412.00138
adversarial exploitation of data diversity improves visual localization | arXiv: 2412.00138
adversarial robust memory-based continual learner | arXiv: 2311.17608
adversarial training for probabilistic robustness
aether geometric-aware unified world modeling | arXiv: 2503.18945
Aether: Geometric-Aware Unified World Modeling | arXiv: 2503.18945
AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm | arXiv: 2506.23537
AGO: Adaptive Grounding for Open World 3D Occupancy Prediction | arXiv: 2504.10117
ahcptq accurate and hardware-compatible post-training quantization for segment a
aicomposer any style and content image composition via feature integration | arXiv: 2507.20721
aid adapting image2video diffusion models for instruction-guided video predictio | arXiv: 2406.06465
aigi-holmes towards explainable and generalizable ai-generated image detection v
aim adaptive inference of multi-modal llms via token merging and pruning | arXiv: 2412.03248
aim amending inherent interpretability via self-supervised masking | arXiv: 2508.11502
aira activation-informed low-rank adaptation for large models
aircache activating inter-modal relevancy kv cache compression for efficient lar
AJAHR: Amputated Joint Aware 3D Human Mesh Recovery | arXiv: 2509.19939
align your rhythm generating highly aligned dance poses with gating-enhanced rhy | arXiv: 2503.17340
aligning effective tokens with video anomaly in large language models | arXiv: 2508.06350
aligning information capacity between vision and language via dense-to-sparse fe
aligning moments in time using video queries | arXiv: 2508.15439
alleviating textual reliance in medical language-guided segmentation via prototy | arXiv: 2507.11055
alltracker efficient dense point tracking at high resolution | arXiv: 2506.07310
alocc adaptive lifting-based 3d semantic occupancy and cost volume-based flow pr | arXiv: 2411.07725
ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions | arXiv: 2411.07725
always skip attention | arXiv: 2505.01996
am-adapter appearance matching adapter for exemplar-based semantic image synthes
Amodal Depth Anything: Amodal Depth Estimation in the Wild | arXiv: 2412.02336
Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images | arXiv: 2503.13439
an empirical study of autoregressive pre-training from videos | arXiv: 2501.05453
an openmind for 3d medical vision self-supervised learning | arXiv: 2412.17041
An OpenMind for 3D Medical Vision Self-supervised Learning | arXiv: 2412.17041
analyzing finetuning representation shift for multimodal llms steering | arXiv: 2501.03012
anchor token matching implicit structure locking for training-free ar image edit | arXiv: 2504.10434
animalclue recognizing animals by their traces | arXiv: 2507.20240
AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation | arXiv: 2506.09982
animegamer infinite anime life simulation with next game state prediction | arXiv: 2504.01014
annofreeod detecting all classes at low frame rates without human annotations
anomaly detection of integrated circuits package substrates using the large visi
anti-tamper protection for unauthorized individual image generation | arXiv: 2508.06325
any-ssr how recursive least squares works in continual learning of large languag
any2anytryon leveraging adaptive position embeddings for versatile virtual cloth
anybimanual transferring unimanual policy for general bimanual manipulation | arXiv: 2412.06779
AnyI2V: Animating Any Conditional Image with Motion Control | arXiv: 2507.02857
anyportal zero-shot consistent video background replacement | arXiv: 2509.07472
AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction | arXiv: 2503.12929
ar-vrm imitating human motions for visual robot manipulation with analogical rea | arXiv: 2508.07626
are they the same exploring visual correspondence shortcomings of multimodal llm | arXiv: 2501.04670
are vlms ready for autonomous driving an empirical study from the reliability da
argmatch adaptive refinement gathering for efficient dense matching
argotweak towards self-updating hd maps through structured priors | arXiv: 2509.08764
arteditor learning customized instructional image editor from few-shot examples
articulate3d holistic understanding of 3d scenes as universal scene description | arXiv: 2412.01398
Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description | arXiv: 2412.01398
ascent annotation-free self-supervised contrastive embeddings for 3d neuron trac
asgs single-domain generalizable open-set object detection via adaptive subgraph
ask and remember a questions-only replay strategy for continual visual question | arXiv: 2502.04469
astroloc robust space to ground image localizer | arXiv: 2502.07003
asynchronous event error-minimizing noise for safeguarding event dataset | arXiv: 2507.05728
atlas decoupling skeletal and shape parameters for expressive parametric human m | arXiv: 2508.15767
ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling | arXiv: 2508.15767
attention to neural plagiarism diffusion models can plagiarize your copyrighted | arXiv: 2603.00150
attention to the burstiness in visual prompt tuning | arXiv: 2506.22908
attention to trajectory trajectory-aware open-vocabulary tracking | arXiv: 2503.08145
augmenting moment retrieval zero-dependency two-stage learning | arXiv: 2510.19622
authentic 4d driving simulation with a video generation model
auto-controlled image perception in mllms via visual perception tokens
auto-regressively generating multi-view consistent images | arXiv: 2506.18527
Auto-Regressively Generating Multi-View Consistent Images (MV-AR) | arXiv: 2506.18527
auto-vocabulary semantic segmentation | arXiv: 2312.04539
autocompose automatic generation of pose transition descriptions for composed po | arXiv: 2503.22884
automated model evaluation for object detection via prediction consistency and r | arXiv: 2508.12082
automated red teaming for text-to-image models through feedback-guided prompt it
AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting | arXiv: 2502.04981
autoprompt automated red-teaming of text-to-image models via llm-driven adversar | arXiv: 2510.24034
avam a universal training-free adaptive visual anchoring embedded into multimoda
Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars | arXiv: 2502.20220
b-vllm a vision large language model with balanced spatio-temporal tokens | arXiv: 2412.09919
babyvlm data-efficient pretraining of vlms inspired by infant learning | arXiv: 2504.09426
back on track bundle adjustment for dynamic scene reconstruction | arXiv: 2504.14516
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction | arXiv: 2504.14516
backdoor attacks on neural networks via one-bit flip
backdoor defense via enhanced splitting and trap isolation
backdoor mitigation by distance-driven detoxification | arXiv: 2411.09585
backdooring self-supervised contrastive learning by noisy alignment | arXiv: 2508.14015
background invariance testing according to semantic proximity | arXiv: 2208.09286
badvideo stealthy backdoor attack against text-to-video generation | arXiv: 2504.16907
baking gaussian splatting into diffusion denoiser for fast and scalable single-s | arXiv: 2411.14384
Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction | arXiv: 2411.14384
balanced image stylization with style matching score | arXiv: 2503.07601
balancing conservatism and aggressiveness prototype-affinity hybrid network for
balancing task-invariant interaction and task-specific adaptation for unified im | arXiv: 2504.05164
banet bilateral aggregation network for mobile stereo matching | arXiv: 2503.03259
BANet: Bilateral Aggregation Network for Mobile Stereo Matching | arXiv: 2503.03259
basic boosting visual alignment with intrinsic refined embeddings in multimodal | arXiv: 2508.06895
batclip bimodal online test-time adaptation for clip | arXiv: 2412.02837
benchmarking and learning multi-dimensional quality evaluator for text-to-3d gen | arXiv: 2412.11170
Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation | arXiv: 2412.11170
benchmarking burst super-resolution for polarization images noise dataset and an | arXiv: 2503.18705
Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis | arXiv: 2503.18705
Benchmarking Egocentric Visual-Inertial SLAM at City Scale | arXiv: 2509.26639
benchmarking multimodal large language models against image corruptions
benefit from seen enhancing open-vocabulary object detection by bridging visual
Beyond Brain Decoding: Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI | arXiv: 2403.06361
beyond isolated words diffusion brush for handwritten text-line generation | arXiv: 2508.03256
beyond label semantics language-guided action anatomy for few-shot action recogn | arXiv: 2507.16287
beyond losses reweighting empowering multi-task learning via the generalization | arXiv: 2211.13723
beyond low-rank tuning model prior-guided rank allocation for effective transfer | arXiv: 2507.00327
beyond one shot beyond one perspective cross-view and long-horizon distillation | arXiv: 2507.05260
beyond pixel uncertainty bounding the ood objects in road scenes
beyond single images retrieval self-augmented unsupervised camouflaged object de | arXiv: 2510.18437
beyond the frame generating 360deg panoramic videos from perspective videos | arXiv: 2504.07940
beziergs dynamic urban scene reconstruction with bezier curve gaussian splatting | arXiv: 2506.22099
bi-level optimization for self-supervised ai-generated face detection | arXiv: 2507.22824
bias-resilient weakly supervised semantic segmentation using normalizing flows
bidirectional likelihood estimation with multi-modal large language models for t | arXiv: 2507.23284
BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis | arXiv: 2411.08508
bitrate-controlled diffusion for disentangling motion and content in video | arXiv: 2509.08376
Blended Point Cloud Diffusion for Localized Text-guided Shape Editing | arXiv: 2507.15399
Blind Noisy Image Deblurring Using Residual Guidance Strategy
blind noisy image deblurring using residual guidance strategy
blind2sound self-supervised image denoising without residual noise | arXiv: 2303.05183
blinktrack feature tracking over 80 fps via events and images | arXiv: 2409.17981
blueneg a 35mm negative film dataset for restoring channel-heterogeneous deterio
bokehdiff neural lens blur with one-step diffusion | arXiv: 2507.18060
Bolt3D: Generating 3D Scenes in Seconds | arXiv: 2503.14445
boost 3d reconstruction using diffusion-based monocular camera calibration | arXiv: 2411.17240
Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration | arXiv: 2411.17240
boosting mllm reasoning with text-debiased hint-grpo | arXiv: 2503.23905
boosting multi-view indoor 3d object detection via adaptive 3d volume constructi | arXiv: 2507.18331
Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction | arXiv: 2507.18331
boosting multimodal learning via disentangled gradient learning | arXiv: 2507.10213
boosting vision semantic density with anatomy normality modeling for medical vis | arXiv: 2508.03742
Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training | arXiv: 2508.03742
bootstrap3d improving multi-view diffusion model with synthetic data | arXiv: 2406.00093
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data | arXiv: 2406.00093
bootstrapping grounded chain-of-thought in multimodal llms for data-efficient mo
borrowing eyes for the blind spot overcoming data scarcity in malicious video de
boundary probing for input privacy protection when using lmm services
boxdreamer dreaming box corners for generalizable object pose estimation | arXiv: 2504.07955
BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation | arXiv: 2504.07955
breaking rectangular shackles cross-view object segmentation for fine-grained ob
breaking the encoder barrier for seamless video-language understanding | arXiv: 2503.18422
bridging 3d anomaly localization and repair via high-quality continuous geometri
bridging continuous and discrete tokens for autoregressive visual generation | arXiv: 2503.16430
bridging diffusion models and 3d representations a 3d consistent super-resolutio | arXiv: 2508.04090
bridging domain generalization to multimodal domain generalization via unified r | arXiv: 2507.03304
bridging local inductive bias and long-range dependencies with pixel-mamba for e
bridging the gap between ideal and real-world evaluation benchmarking ai-generat
bridging the skeleton-text modality gap diffusion-powered modality alignment for | arXiv: 2411.10745
bridging the sky and ground towards view-invariant feature learning for aerial-g
Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation | arXiv: 2503.11652
bring your rear cameras for egocentric 3d human pose estimation | arXiv: 2503.11652
buffer-x towards zero-shot point cloud registration in diverse scenes | arXiv: 2503.07940
BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes | arXiv: 2503.07940
bvinet unlocking blind video inpainting with zero annotations | arXiv: 2502.01181
BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting | arXiv: 2506.22099
c2mil synchronizing semantic and topological causalities in multiple instance le
C4D: 4D Made from 3D through Dual Correspondences | arXiv: 2510.14960
ca-i2p channel-adaptive registration network with global optimal selection | arXiv: 2506.21364
ca2c a prior-knowledge-free approach for robust label noise learning via asymmet
cad-assistant tool-augmented vllms as generic cad task solvers | arXiv: 2412.13810
cad-recode reverse engineering cad code from point clouds | arXiv: 2412.14042
CAD-Recode: Reverse Engineering CAD Code from Point Clouds | arXiv: 2412.14042
calibrating mllm-as-a-judge via multimodal bayesian prompt ensembles | arXiv: 2509.08777
can generative geospatial diffusion models excel as discriminative geospatial fo | arXiv: 2503.07890
can3tok canonical 3d tokenization and latent modeling of scene-level 3d gaussian | arXiv: 2508.01464
Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians | arXiv: 2508.01464
cao2 rectifying inconsistencies in diffusion-based dataset distillation | arXiv: 2506.22637
cap evaluation of persuasive and creative image generation | arXiv: 2412.10426
capellm support-free category-agnostic pose estimation with multimodal large lan | arXiv: 2411.06869
captionsmiths flexibly controlling language pattern in image captioning | arXiv: 2507.01409
capture evaluating spatial reasoning in vision language models via occluded obje | arXiv: 2504.15485
cargait cross-attention based re-ranking for gait recognition | arXiv: 2503.03501
CarGait: Cross-Attention based Re-ranking for Gait Recognition | arXiv: 2503.03501
carl causality-guided architecture representation learning for an interpretable
casp improving semi-dense feature matching pipeline leveraging cascaded correspo | arXiv: 2507.17312
cassic towards content-adaptive state-space models for learned image compression
category-specific selective feature enhancement for long-tailed multi-label imag
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image | arXiv: 2412.12906
causal disentanglement and cross-modal alignment for enhanced few-shot learning | arXiv: 2508.03102
causal-entity reflected egocentric traffic accident video synthesis | arXiv: 2506.23263
cavis context-aware video instance segmentation | arXiv: 2407.03010
ccl-lgs contrastive codebook learning for 3d language gaussian splatting | arXiv: 2505.20469
ce-fam concept-based explanation via fusion of activation maps | arXiv: 2509.23849
Certifiably Optimal Anisotropic Rotation Averaging | arXiv: 2503.07353
cf3 compact and fast 3d feature fields | arXiv: 2508.05254
characonsist fine-grained consistent character generation | arXiv: 2507.11533
charm3r towards unseen camera height robust monocular 3d detector | arXiv: 2508.11185
CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector | arXiv: 2508.11185
chartcap mitigating hallucination of dense chart captioning | arXiv: 2508.03164
chartpoint guiding mllms with grounding reflection for chart reasoning | arXiv: 2512.00305
chatreid open-ended interactive person retrieval via hierarchical progressive tu
chimera improving generalist model with domain-specific experts | arXiv: 2412.05983
chords diffusion sampling accelerator with multi-core hierarchical ode solvers | arXiv: 2507.15260
ciard cyclic iterative adversarial robustness distillation | arXiv: 2509.12633
citynav a large-scale dataset for real-world aerial navigation | arXiv: 2406.14240
cl-splats continual learning of gaussian splatting with local optimization | arXiv: 2506.21117
class token as proxy optimal transport-assisted proxy learning for weakly superv
class-wise federated averaging for efficient personalization | arXiv: 2406.07800
cleanpose category-level object pose estimation via causal learning and knowledg | arXiv: 2502.01312
client2vec improving federated learning by distribution shifts aware client inde | arXiv: 2405.16233
clip-adapted region-to-text learning for generative open-vocabulary semantic seg
clip-gs unifying vision-language representation with 3d gaussian splatting | arXiv: 2412.19142
clipsym delving into symmetry detection with clip | arXiv: 2508.14197
closed-loop transfer for weakly-supervised affordance grounding | arXiv: 2510.17384
clot closed loop optimal transport for unsupervised action segmentation | arXiv: 2507.03539
cmad correlation-aware and modalities-aware distillation for multimodal sentimen
cmb-ml a cosmic microwave background dataset for the oldest possible computer vi
cmt a cascade mar with topology predictor for multimodal conditional cad generat | arXiv: 2504.20830
cns-bench benchmarking image classifier robustness under continuous nuisance shi | arXiv: 2507.17651
co-painter fine-grained controllable image stylization via implicit decoupling a
co2-net a physics-informed spatio-temporal model for global surface co2 reconstr
coa-vla improving vision-language-action models via visual-text chain-of-afforda | arXiv: 2412.20451
CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance | arXiv: 2412.20451
cobl toward zero-shot ordinal layering without user prompting | arXiv: 2508.08498
coda-4dgs dynamic gaussian splatting with context and deformation awareness for | arXiv: 2503.06744
cohd a counting-aware hierarchical decoding framework for generalized referring
coin confidence score-guided distillation for annotation-free cell segmentation | arXiv: 2503.11439
colmdriver llm-based negotiation benefits cooperative autonomous driving | arXiv: 2503.08683
color matching using hypernetwork-based kolmogorov-arnold networks | arXiv: 2503.11781
colors see colors ignore clothes changing reid with color disentanglement | arXiv: 2507.07230
comatch dynamic covisibility-aware transformer for bilateral subpixel-level semi
combatvla an efficient vision-language-action model for combat tasks in 3d actio | arXiv: 2503.09527
combinative matching for geometric shape assembly | arXiv: 2508.09780
communication-efficient multi-vehicle collaborative semantic segmentation via sp
CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images | arXiv: 2503.05332
compass enhancing spatial understanding in text-to-image diffusion models | arXiv: 2412.13195
compcap improving multimodal large language models with composite captions | arXiv: 2412.05243
competitive distillation a simple learning strategy for improving visual classif | arXiv: 2506.23285
completeme reference-based human image completion | arXiv: 2504.20042
completing 3d partial assemblies with view-consistent 2d-3d correspondence
compression of 3d gaussian splatting with optimized feature planes and standard | arXiv: 2501.03399
compression-aware one-step diffusion model for jpeg artifact removal | arXiv: 2502.09873
compslider compositional slider for disentangled multiple-attribute image genera | arXiv: 2509.01028
conditional visual autoregressive modeling for pathological image restoration
conformalsam unlocking the potential of foundational segmentation models in semi | arXiv: 2507.15803
confound from all sides distill with resilience multi-objective adversarial path
consistent time-of-flight depth denoising via graph-informed geometric attention | arXiv: 2506.23542
consistentcity semantic flow-guided occupancy dit for temporally consistent driv
constraint-aware feature learning for parametric point cloud | arXiv: 2411.07747
constructing ophthalmic mllm for positioning-diagnosis collaboration through cli
conststyle robust domain generalization with unified style transformation | arXiv: 2509.05975
contact-aware amodal completion for human-object interaction via multi-regional | arXiv: 2508.00427
contact-aware refinement of human pose pseudo-ground truth via bioimpedance sens | arXiv: 2512.04862
context guided transformer entropy modeling for video compression | arXiv: 2508.01852
contextface generating facial expressions from emotional contexts
continuous-time human motion field from event cameras
contrags codebook-condensed and trainable gaussian splatting for fast memory-eff
contrastive flow matching | arXiv: 2506.05350
Controllable 3D Outdoor Scene Generation via Scene Graphs | arXiv: 2503.07152
controllable and expressive one-shot video head swapping | arXiv: 2506.16852
controllable feature whitening for hyperparameter-free bias mitigation | arXiv: 2507.20284
controllable latent space augmentation for digital pathology | arXiv: 2508.14588
controlling multimodal llms via reward-guided decoding | arXiv: 2508.11616
Controlling Multimodal LLMs via Reward-guided Decoding | arXiv: 2508.11616
cooperative pseudo labeling for unsupervised federated classification | arXiv: 2510.10100
cooptrack exploring end-to-end learning for efficient cooperative sequential per | arXiv: 2507.19239
coordinate-based speed of sound recovery for aberration-corrected photoacoustic | arXiv: 2409.10876
coralsrt revisiting coral reef semantic segmentation by feature rectification vi
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation | arXiv: 2411.10086
correspondence as video test-time adaption on sam2 for reference segmentation in | arXiv: 2508.07759
Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild | arXiv: 2508.07759
corvid improving multimodal large language models towards chain-of-thought reaso | arXiv: 2507.07424
Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning | arXiv: 2507.07424
cosmic continual self-supervised learning for multi-domain medical imaging via c
cosmo combination of selective memorization for low-cost vision-and-language nav | arXiv: 2503.24065
costodet-ddpm collaborative training of stochastic and deterministic models impr
cotmr chain-of-thought multi-scale reasoning for training-free zero-shot compose
Counting Stacked Objects | arXiv: 2411.19149
countse soft exemplar open-set object counting
covtrack continuous open-vocabulary tracking via adaptive multi-cue fusion
cram large scale video continual learning with bootstrapped compression
cross-architecture distillation made simple with redundancy suppression | arXiv: 2507.21844
cross-category subjectivity generalization for style-adaptive sketch re-id
cross-granularity online optimization with masked compensated information for le
cross-view isolated sign language recognition via view synthesis and feature dis
CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy | arXiv: 2506.05864
csd-var content-style decomposition in visual autoregressive models | arXiv: 2507.13984
culture3d a large-scale and diverse dataset of cultural landmarks and terrains f
cumperlay learning cubical multiparameter persistence vectorizations | arXiv: 2510.12795
cure cultural gaps in the long tail of text-to-image systems | arXiv: 2506.08071
curve-aware gaussian splatting for 3d parametric curve reconstruction | arXiv: 2506.21401
cuts3d cutting semantics in 3d for 2d unsupervised instance segmentation | arXiv: 2411.16319
cvfusion cross-view fusion of 4d radar and camera for 3d object detection | arXiv: 2507.04587
cvpt cross visual prompt tuning | arXiv: 2408.14961
cwnet causal wavelet network for low-light image enhancement | arXiv: 2507.10689
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences | arXiv: 2506.02095
cycle-consistent learning for joint layout-to-image generation and object detect
d-attn decomposed attention for large vision-and-language model
d2st-adapter disentangled-and-deformable spatio-temporal adapter for few-shot ac
d3 training-free ai-generated video detection using second-order features | arXiv: 2508.00701
d3qe learning discrete distribution discrepancy-aware quantization error for aut
dacon dino for anime paint bucket colorization with any number of reference imag | arXiv: 2509.14685
dadet safeguarding image conditional diffusion models against adversarial and ba
dadm dual alignment of domain and modality for face anti-spoofing | arXiv: 2503.00429
damap distance-aware mapnet for high quality hd map construction | arXiv: 2510.22675
danceeditor towards iterative editable music-driven dance generation with open-v
DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning | arXiv: 2510.21635
dash detection and assessment of systematic hallucinations of vlms | arXiv: 2503.23573
dataset distillation via the wasserstein metric | arXiv: 2311.18531
dataset ownership verification for pre-trained masked models | arXiv: 2507.12022
david data-efficient and accurate vision models from synthetic data | arXiv: 2507.15365
dc-ar efficient masked autoregressive image generation with deep compression hyb | arXiv: 2507.04947
dc-controlnet decoupling inter- and intra-element conditions in image generation
dchm depth-consistent human modeling for multiview detection | arXiv: 2507.14505
dct-shield a robust frequency domain defense against malicious image editing | arXiv: 2504.17894
ddb diffusion driven balancing to address spurious correlations | arXiv: 2503.17226
debiased curriculum adaptation for safe transfer learning in chest x-ray classif
debiased teacher for day-to-night domain adaptive object detection
decad decoupling anomalies in latent space for multi-class unsupervised anomaly
deciphering cross-modal alignment in large vision-language models via modality i
decoding correlation-induced misalignment in the stable diffusion workflow for t
decouple and track benchmarking and improving video diffusion transformers for m | arXiv: 2503.17350
decouple to reconstruct high quality uhd restoration via active feature disentan | arXiv: 2503.12764
decoupled diffusion sparks adaptive scene generation | arXiv: 2504.10485
deep adaptive unfolded network via spatial morphology stripping and spectral fil
deep incomplete multi-view clustering with distribution dual-consistency recover
deeply supervised flow-based generative models | arXiv: 2503.14494
deepmesh auto-regressive artist-mesh creation with reinforcement learning | arXiv: 2503.15265
deepshield fortifying deepfake video detection with local and global forgery ana | arXiv: 2510.25237
degauss dynamic-static decomposition with gaussian splatting for distractor-free | arXiv: 2503.13176
degradation-modeled multipath diffusion for tunable metalens photography | arXiv: 2506.22753
demeter a parametric model of crop plant morphology from the real world | arXiv: 2510.16377
denoising token prediction in masked autoregressive models
dense policy bidirectional autoregressive learning of actions | arXiv: 2503.13217
Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation | arXiv: 2510.09094
depth anyevent a cross-modal distillation paradigm for event-based monocular dep | arXiv: 2509.15224
deris decoupling perception and cognition for enhanced referring image segmentat | arXiv: 2507.01738
derm1m a million-scale vision-language dataset aligned with clinical ontology kn
describe adapt and combine empowering clip encoders for open-set 3d object retri | arXiv: 2507.21489
describe dont dictate semantic image editing with natural language intent | arXiv: 2508.20505
despite exploring contrastive deep skeleton-pointcloud-imu-text embeddings for a | arXiv: 2506.13897
detect anything 3d in the wild | arXiv: 2504.07958
devil is in the uniformity exploring diverse learners within transformer for ima | arXiv: 2503.20174
dexvlg dexterous vision-language-grasp model at scale | arXiv: 2507.02747
dgtalker disentangled generative latent space learning for audio-driven gaussian
dh-facevid-1k a large-scale high-quality dataset for face video generation | arXiv: 2410.07151
dia the adversarial exposure of deterministic inversion in diffusion models | arXiv: 2510.00778
diagnosing pretrained models for out-of-distribution detection
dice staleness-centric optimizations for parallel diffusion moe inference | arXiv: 2411.16786
dictas a framework for class-generalizable few-shot anomaly segmentation via dic | arXiv: 2508.13560
diffdoctor diagnosing image diffusion models before treating | arXiv: 2501.12382
diffpci large motion point cloud frame interpolation with diffusion model
diffsim taming diffusion models for evaluating visual similarity | arXiv: 2412.14580
difftell a high-quality dataset for describing image manipulation changes
diffuman4d 4d consistent human view synthesis from sparse-view videos with spati
diffumatch category-agnostic spectral diffusion priors for robust non-rigid shap | arXiv: 2507.23715
diffusion curriculum synthetic-to-real data curriculum via image-guided diffusio | arXiv: 2410.13674
diffusion guided adaptive augmentation for generalization in visual reinforcemen
diffusion image prior | arXiv: 2503.21410
diffusion-based 3d hand motion recovery with intuitive physics | arXiv: 2508.01835
diffusion-based source-biased model for single domain generalized object detecti
diffvsr revealing an effective recipe for taming robust video super-resolution a
dimcim a quantitative evaluation framework for default-mode diversity and genera
Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling | arXiv: 2411.19492
dirichlet-constrained variational codebook learning for temporally coherent vide
discontinuity-aware normal integration for generic central camera models | arXiv: 2507.06075
discopatch taming adversarially-driven batch statistics for improved out-of-dist | arXiv: 2501.08005
discovering divergent representations between text-to-image models | arXiv: 2509.08940
discretized gaussian representation for tomographic reconstruction | arXiv: 2411.04844
disenq disentangling q-former for activity-biometrics | arXiv: 2507.07262
disentangled world models learning to transfer semantic knowledge from distracti | arXiv: 2503.08751
disentangling instance and scene contexts for 3d semantic scene completion | arXiv: 2507.08555
disrupting model merging a parameter-level defense without sacrificing accuracy | arXiv: 2503.07661
dist-4d disentangled spatiotemporal diffusion with metric depth for 4d driving s | arXiv: 2503.15208
dista-net dynamic closely-spaced infrared small target unmixing | arXiv: 2505.19148
distil data-free inversion of suspicious trojan inputs via latent diffusion | arXiv: 2507.22813
distilling diffusion models to efficient 3d lidar scene completion | arXiv: 2412.03515
distime distribution-based time representation for video large language models | arXiv: 2505.24329
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy | arXiv: 2503.19757
ditfastattnv2 head-wise attention compression for multi-modality diffusion trans | arXiv: 2503.22796
dive taming dino for subject-driven video editing | arXiv: 2412.03347
diversity-enhanced distribution alignment for dataset distillation
diving into the fusion of monocular priors for generalized stereo matching | arXiv: 2505.14414
dlf extreme image compression with dual-generative latent fusion | arXiv: 2503.01428
dlfr-gen diffusion-based video generation with dynamic latent frame rate
dm-efs dynamically multiplexed expanded features set form for robust and efficie
dmesh an efficient differentiable mesh for complex shapes | arXiv: 2412.16776
dmq dissecting outliers of diffusion models for post-training quantization | arXiv: 2507.12933
do it yourself learning semantic correspondence from pseudo-labels | arXiv: 2506.05312
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding | arXiv: 2508.08589
dogr towards versatile visual document grounding and referring | arXiv: 2411.17125
DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization | arXiv: 2412.15689
domain generalizable portrait style transfer | arXiv: 2507.04243
donut a decoder-only model for trajectory prediction | arXiv: 2506.06854
doodle your keypoints sketch-based few-shot keypoint detection | arXiv: 2507.07994
dposer-x diffusion model as robust 3d whole-body human pose prior | arXiv: 2508.00599
dram-lhm a quaternion framework for iterative camera pose estimation
drawing developmental trajectory from cortical surface reconstruction
dreamactor-m1 holistic expressive and robust human image animation with hybrid g | arXiv: 2504.01724
dreamdance animating human images by enriching 3d geometry cues from 2d poses | arXiv: 2412.00397
dreamlayer simultaneous multi-layer generation via diffusion model
dreamrelation relation-centric video customization | arXiv: 2503.07602
drivex omni scene modeling for learning generalizable world knowledge in autonom | arXiv: 2505.19239
driving view synthesis on free-form trajectories with generative prior | arXiv: 2412.01717
drivinggpt unifying driving world modeling and planning with multi-modal autoreg
dropletvideo a dataset and approach to explore integral spatio-temporal consiste
dso aligning 3d generators with simulation feedback for physical soundness | arXiv: 2503.22677
dual domain control via active learning for remote sensing domain incremental ob
dual reciprocal learning of language-based human motion understanding and genera
dual recursive feedback on generation and appearance latents for pose-robust tex | arXiv: 2508.09575
dual-expert consistency model for efficient and high-quality video generation | arXiv: 2506.03123
dual-level prototype learning for composite degraded image restoration
dual-rate dynamic teacher for source-free domain adaptive object detection
dual-temporal exemplar representation network for video semantic segmentation
dualreal adaptive joint training for lossless identity-motion fusion in video cu | arXiv: 2505.02192
duet dual incremental object detection via exemplar-free task arithmetic | arXiv: 2506.21260
duolora cycle-consistent and rank-disentangled content-style personalization | arXiv: 2504.13206
dwim towards tool-aware visual reasoning via discrepancy-aware workflow generati | arXiv: 2503.19263
dygs-slam real-time accurate localization and gaussian reconstruction for dynami
dynamic dictionary learning for remote sensing image segmentation | arXiv: 2503.06683
dynamic group detection using vlm-augmented temporal groupness graph | arXiv: 2509.04758
dynamic multimodal prototype learning in vision-language models | arXiv: 2507.03657
dynamic point maps a versatile representation for dynamic 3d reconstruction | arXiv: 2503.16318
dynamic reconstruction of hand-object interaction with distributed force-aware c | arXiv: 2411.09572
Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection | arXiv: 2507.17436
dynamic-vlm simple dynamic visual token compression for videollm | arXiv: 2412.09530
dynamicid zero-shot multi-id image personalization with flexible facial editabil | arXiv: 2503.06505
dynfacerestore balancing fidelity and quality in diffusion-guided blind face res | arXiv: 2507.13797
dynimg key frames with visual prompts are good representation for multi-modal vi | arXiv: 2507.15569
e-nemf event-based neural motion field for novel space-time view synthesis of dy
e-sam training-free segment every entity model | arXiv: 2503.12094
ea-kd entropy-based adaptive knowledge distillation | arXiv: 2311.13621
ea-vit efficient adaptation for elastic vision transformer | arXiv: 2507.19360
eamamba efficient all-around vision state space model for image restoration | arXiv: 2506.22246
early timestep zero-shot candidate selection for instruction-guided image editin | arXiv: 2504.13490
easi3r estimating disentangled motion from dust3r without training | arXiv: 2503.24391
easy3d a simple yet effective method for 3d interactive segmentation | arXiv: 2504.11024
ec-flow enabling versatile robotic manipulation from action-unlabeled videos via | arXiv: 2507.06224
edffdnet towards accurate and efficient unsupervised multi-grid image registrati | arXiv: 2509.07662
edit efficient diffusion transformers with linear compressed attention | arXiv: 2503.16726
eedit rethinking the spatial and temporal redundancy for efficient image editing | arXiv: 2503.10270
effective training data synthesis for improving mllm chart understanding | arXiv: 2508.06492
efficient adaptation of pre-trained vision transformer underpinned by approximat | arXiv: 2507.13260
efficient autoregressive shape generation via octree-based adaptive tokenization | arXiv: 2504.02817
efficient concertormer for image deblurring and beyond | arXiv: 2404.06135
efficient fine-tuning of large models via nested low-rank adaptation
efficient input-level backdoor defense on text-to-image synthesis via neuron act | arXiv: 2503.06453
efficient spiking point mamba for point cloud analysis | arXiv: 2504.14371
efficient visual place recognition through multimodal semantic knowledge integra
efficientmt efficient temporal adaptation for motion transfer in text-to-video d | arXiv: 2503.19369
egoadapt adaptive multisensory distillation and policy learning for efficient eg | arXiv: 2506.21080
egoagent a joint predictive agent model in egocentric worlds | arXiv: 2502.05857
egocentric action-aware inertial localization in point clouds with vision-langua | arXiv: 2505.14346
egom2p egocentric multimodal multitask pretraining | arXiv: 2506.07886
egoppg heart rate estimation from eye-tracking cameras in egocentric systems to | arXiv: 2502.20879
embodied image captioning self-supervised learning agents for spatially coherent | arXiv: 2504.08531
embodied navigation with auxiliary task of action description prediction | arXiv: 2510.21809
embodied representation alignment with mirror neurons | arXiv: 2509.21136
embodied videoagent persistent memory from egocentric videos and embodied sensor
embodiedocc embodied 3d occupancy prediction for vision-based online scene under | arXiv: 2412.04380
embodiedsplat personalized real-to-sim-to-real navigation with gaussian splats f | arXiv: 2509.17430
emd explicit motion modeling for high-quality street gaussian splatting | arXiv: 2411.15582
emoticrafter text-to-emotional-image generation based on valence-arousal model | arXiv: 2501.05710
emotive event-guided trajectory modeling for 3d motion estimation | arXiv: 2503.11371
emulating self-attention with convolution for efficient image super-resolution | arXiv: 2503.06671
end-to-end entity-predicate association reasoning for dynamic scene graph genera
end-to-end multi-modal diffusion mamba | arXiv: 2510.13253
engage for all making ordinary image descriptions appealing again
enhanced event-based dense stereo via cross-sensor knowledge distillation
enhanced pansharpening via quaternion spatial-spectral interactions
enhancing adversarial transferability by balancing exploration and exploitation | arXiv: 2511.00411
enhancing few-shot vision-language classification with large multimodal model fe | arXiv: 2412.00142
enhancing image restoration transformer via adaptive translation equivariance | arXiv: 2506.18520
enhancing prompt generation with adaptive refinement for camouflaged object dete
enhancing reward models for high-quality image generation beyond text-image alig | arXiv: 2507.19002
enhancing transferability of targeted adversarial examples via inverse target gr
enhancing transformers through conditioned embedded tokens | arXiv: 2505.12789
enhancing zero-shot object counting via text-guided local ranking and number-evo
enrich and detect video temporal grounding with multimodal llms | arXiv: 2510.17023
ensemble foreground management for unsupervised object discovery | arXiv: 2507.20860
epipolar consistent attention aggregation network for unsupervised light field d
epona autoregressive diffusion world model for autonomous driving | arXiv: 2506.24113
equipping vision foundation model with mixture of experts for out-of-distributio
erasing more than intended how concept erasure degrades the generation of non-ta | arXiv: 2501.09833
error recognition in procedural videos using generalized task graph
escnet edge-semantic collaborative network for camouflaged object detection
estimating 2d camera motion with hybrid motion basis | arXiv: 2507.22480
eta efficiency through thinking ahead a dual approach to self-driving with large | arXiv: 2506.07725
eta energy-based test-time adaptation for depth completion | arXiv: 2508.05989
etch generalizing body fitting to clothed humans via equivariant tightness | arXiv: 2503.10624
etva evaluation of text-to-video alignment via fine-grained question generation | arXiv: 2503.16867
evading data provenance in deep neural networks | arXiv: 2508.01074
evagaussians event stream assisted gaussian splatting from blurry images | arXiv: 2405.20224
event-based tiny object detection a benchmark dataset and baseline | arXiv: 2506.23575
event-based visual vibrometry
event-boosted deformable 3d gaussians for dynamic scene reconstruction | arXiv: 2411.16180
event-driven storytelling with multiple lifelike humans in a 3d scene | arXiv: 2507.19232
event-guided unified framework for low-light video enhancement frame interpolati
eventups uncalibrated photometric stereo using an event camera
everything is a video unifying modalities through next-frame prediction | arXiv: 2411.10503
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models | arXiv: 2502.06788
evidential knowledge distillation
evolvinggrasp evolutionary grasp generation via efficient preference alignment | arXiv: 2503.14329
evrt-detr latent space adaptation of image detectors for event-based vision | arXiv: 2412.02890
evt efficient view transformation for multi-modal 3d object detection | arXiv: 2411.10715
excap3d expressive 3d scene understanding via object captioning with varying det | arXiv: 2503.17044
exploiting diffusion prior for task-driven image restoration | arXiv: 2507.22459
exploiting domain properties in language-driven domain generalization for semant | arXiv: 2512.03508
exploiting vision language model for training-free 3d point cloud ood detection | arXiv: 2506.22375
exploring multimodal diffusion transformers for enhanced prompt-based image edit | arXiv: 2508.07519
exploring probabilistic modeling beyond domain generalization for semantic segme | arXiv: 2507.21367
exploring view consistency for scene-adaptive low-light light field image enhanc
exploring weather-aware aggregation and adaptation for semantic segmentation und
expressive talking human from single-image with imperfect priors
external knowledge injection for clip-based class-incremental learning | arXiv: 2503.08510
extrapolated urban view synthesis benchmark | arXiv: 2412.05256
f-bench rethinking human preference evaluation metrics for benchmarking face gen
fa forced prompt learning of vision-language models for out-of-distribution dete | arXiv: 2507.04511
facecraft4d animated 3d facial avatar generation from a single image | arXiv: 2504.15179
facelift learning generalizable single image 3d face reconstruction from synthet | arXiv: 2412.17812
factorized learning for temporally grounded video-language models | arXiv: 2512.24097
failure cases are better learned but boundary says sorry facilitating smooth per | arXiv: 2508.02186
fair generation without unfair distortions debiasing text-to-image generation wi | arXiv: 2506.13298
fairgen enhancing fairness in text-to-image diffusion models via self-discoverin
fakeradar probing forgery outliers to detect unknown deepfake videos | arXiv: 2512.14601
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers | arXiv: 2501.16297
fast image super-resolution via consistency rectified flow
faster and better 3d splatting via group training | arXiv: 2412.07608
fastjsma accelerating jacobian-based saliency map attacks through gradient decou
fastvar linear visual autoregressive modeling via cached token pruning | arXiv: 2503.23367
fdpt federated discrete prompt tuning for black-box visual-language models
fe-clip frequency enhanced clip model for zero-shot anomaly detection and segmen
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration | arXiv: 2412.13180
feature extraction and representation of pre-training point cloud based on diffu
feature purification matters suppressing outlier propagation for training-free o
fedagc federated continual learning with asymmetric gradient correction
feddifrc unlocking the potential of text-to-image diffusion models in heterogene | arXiv: 2507.06482
federated continual instruction tuning | arXiv: 2503.12897
federated prompt-tuning with heterogeneous and incomplete multimodal client data | arXiv: 2602.07081
federated representation angle learning
fedmenf privacy-preserving federated meta-learning for neural fields | arXiv: 2508.06301
fedmvp federated multimodal visual prompt tuning for vision-language models | arXiv: 2504.20860
fedpall prototype-based adversarial and collaborative learning for federated lea
fedvla federated vision-language-action learning with dual gating mixture-of-exp | arXiv: 2508.02190
few-shot pattern detection via template matching and regression | arXiv: 2508.17636
fewer denoising steps or cheaper per-step inference towards compute-optimal diff | arXiv: 2508.06160
ficgen frequency-inspired contextual disentanglement for layout-driven degraded | arXiv: 2509.01107
fiffdepth feed-forward transformation of diffusion-based generators for detailed | arXiv: 2412.00671
find a scapegoat poisoning membership inference attack and defense to federated | arXiv: 2507.00423
find any part in 3d | arXiv: 2411.13550
find few-shot anomaly inspection with normal-only multi-modal data
fine-grained evaluation of large vision-language models in autonomous driving | arXiv: 2503.21505
fine-grained spatiotemporal grounding on egocentric videos | arXiv: 2508.00518
finemotion a dataset and benchmark with both spatial and temporal annotation for
finmmr make financial numerical reasoning more multimodal comprehensive and chal | arXiv: 2508.04625
fish2mesh transformer 3d human mesh recovery from egocentric vision | arXiv: 2503.06089
fix-clip dual-branch hierarchical contrastive learning via synthetic captions fo | arXiv: 2507.10095
fixtalk taming identity leakage for high-quality talking head generation in extr | arXiv: 2507.01390
flashdepth real-time streaming video depth estimation at 2k resolution | arXiv: 2504.07093
flexgen flexible multi-view generation from text and image inputs | arXiv: 2410.10745
float generative motion latent flow matching for audio-driven talking portrait | arXiv: 2412.01064
FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation | arXiv: 2504.10487
flow to the mode mode-seeking diffusion autoencoders for state-of-the-art image | arXiv: 2503.11056
flow-mil constructing highly-expressive latent feature space for whole slide ima
flow4agent long-form video understanding via motion prior from optical flow | arXiv: 2510.05836
flowdps flow-driven posterior sampling for inverse problems | arXiv: 2503.08136
flowedit inversion-free text-based editing using pre-trained flow models | arXiv: 2412.08629
flowseek optical flow made easier with depth foundation models and motion bases | arXiv: 2509.05297
flowstyler artistic video stylization via transformation fields transports
flowtok flowing seamlessly across text and image tokens | arXiv: 2503.10772
focal plane visual feature generation and matching on a pixel processor array
folder accelerating multi-modal large language models with enhanced performance | arXiv: 2501.02430
fontanimate high quality few-shot font generation via animating font transfer pr
forcennet foreground-centric network for document image rectification | arXiv: 2507.19804
forensic-moe exploring comprehensive synthetic image detection traces with mixtu
foresight in motion reinforcing trajectory prediction with reward heuristics | arXiv: 2507.12083
forgelens data-efficient forgery focus for generalizable forgery image detection | arXiv: 2408.13697
forgetting through transforming enabling federated unlearning via class-aware re | arXiv: 2410.06848
foundir unleashing million-scale training data to advance foundation models for | arXiv: 2412.01427
fpem face prior enhanced facial attractiveness prediction for live videos with f
free-form motion control controlling the 6d poses of camera and objects in video | arXiv: 2501.01425
free-merging fourier transform for efficient model merging | arXiv: 2411.16815
free-moref instantly multiplexing context perception capabilities of video-mllms | arXiv: 2508.02134
free-running vs synchronous single-photon lidar for high-flux 3d imaging | arXiv: 2507.09386
free4d tuning-free 4d scene generation with spatial-temporal consistency | arXiv: 2503.20785
freecus free lunch subject-driven customization in diffusion transformers | arXiv: 2507.15249
freedance towards harmonic free-number group dance generation via a unified fram
freedna endowing domain adaptation of diffusion-based dense prediction with trai
freeflux understanding and exploiting layer-specific roles in rope-based mmdit f
freemorph tuning-free generalized image morphing with diffusion model | arXiv: 2507.01953
freescale unleashing the resolution of diffusion models via tuning-free scale fu | arXiv: 2412.09626
frequency-aligned knowledge distillation for lightweight spatiotemporal forecast | arXiv: 2507.02939
frequency-guided diffusion for training-free text-driven image translation
frequency-semantic enhanced variational autoencoder for zero-shot skeleton-based | arXiv: 2506.22179
fret feature redundancy elimination for test time adaptation | arXiv: 2505.10641
from easy to hard progressive active learning framework for infrared small targe | arXiv: 2412.11154
from easy to hard the mir benchmark for progressive interleaved multi-image reas | arXiv: 2509.17040
from gallery to wrist realistic 3d bracelet insertion in videos | arXiv: 2507.20331
from gaze to movement predicting visual attention for autonomous driving human-m
from holistic to localized local enhanced adapters for efficient visual instruct | arXiv: 2411.12787
from image to video an empirical study of diffusion representations | arXiv: 2502.07001
from imitation to innovation the emergence of ais unique artistic styles and the
from linearity to non-linearity how masked autoencoders capture spatial correlat | arXiv: 2508.15404
from objects to events unlocking complex visual understanding in object detector
from one to more contextual part latents for 3d generation | arXiv: 2507.08772
from reflection to perfection scaling inference-time optimization for text-to-im
from reusing to forecasting accelerating diffusion models with taylorseers | arXiv: 2503.06923
from sharp to blur unsupervised domain adaptation for 2d human pose estimation u
from trial to triumph advancing long video understanding via visual context samp
fross faster-than-real-time online 3d semantic scene graph generation from rgb-d | arXiv: 2507.19993
fuse before transfer knowledge fusion for heterogeneous distillation | arXiv: 2410.12342
fusion meets diverse conditions a high-diversity benchmark and baseline for uav-
fusionphys a flexible framework for fusing complementary sensing modalities in r
future-aware interaction network for motion forecasting | arXiv: 2503.06565
fuxi-rtm a physics-guided prediction framework with radiative transfer modeling | arXiv: 2503.19940
fuzzy contrastive decoding to alleviate object hallucination in large vision-lan
fvgen accelerating novel-view synthesis with adversarial video diffusion distill | arXiv: 2508.06392
fw-merging scaling model merging with frank-wolfe optimization | arXiv: 2503.12649
g2d boosting multimodal learning with gradient-guided distillation | arXiv: 2506.21514
g2pdiffusion cross-species genotype-to-phenotype prediction via evolutionary dif | arXiv: 2502.04684
g2sf geometry-guided score fusion for multimodal industrial anomaly detection | arXiv: 2503.10091
gain-mlp improving hdr gain map encoding via a lightweight mlp | arXiv: 2503.11883
gait-x exploring x modality for generalized gait recognition
gamefactory creating new games with generative interactive videos | arXiv: 2501.08325
gap gaussianize any point clouds with text guidance | arXiv: 2508.05631
gas generative avatar synthesis from a single image | arXiv: 2502.06957
gaussian splatting with discretized sdf for relightable assets | arXiv: 2507.15629
gaussian variation field diffusion for high-fidelity video-to-4d synthesis | arXiv: 2507.23785
gaussian-based world model gaussian priors for voxel-based occupancy prediction
gaussianflowocc sparse and weakly supervised occupancy estimation using gaussian | arXiv: 2502.17288
gaussianocc fully self-supervised and efficient 3d occupancy estimation with gau
gaussianproperty integrating physical properties to 3d gaussians with lmms | arXiv: 2412.11258
gaussianupdate continual 3d gaussian splatting update for changing environments | arXiv: 2508.08867
gaussrender learning 3d occupancy with gaussian rendering | arXiv: 2502.05040
gauupdate new object insertion in 3d gaussian fields with consistent global illu
gaze-language alignment for zero-shot prediction of visual search targets from h
gazegaussian high-fidelity gaze redirection with 3d gaussian splatting | arXiv: 2411.12981
gdkvm echocardiography video segmentation via spatiotemporal key-value memory wi | arXiv: 2512.10252
gecko gigapixel vision-concept contrastive pretraining in histopathology | arXiv: 2504.01009
gemex a large-scale groundable and explainable medical vqa benchmark for chest x | arXiv: 2411.16778
geminio language-guided gradient inversion attacks in federated learning | arXiv: 2411.14937
gendop auto-regressive camera trajectory generation as a director of photography | arXiv: 2504.07083
general compression framework for efficient transformer object tracking | arXiv: 2409.17564
generalizable non-line-of-sight imaging with learnable physical priors | arXiv: 2409.14011
generalizable object re-identification via visual in-context prompting | arXiv: 2508.21222
generalized deep multi-view clustering via causal learning with partially aligne
generalized tensor-based parameter-efficient fine-tuning via lie group transform | arXiv: 2504.00851
generate refine and encode leveraging synthesized novel samples for on-the-fly f | arXiv: 2507.04051
generate transduct adapt iterative transduction with vlms | arXiv: 2501.06031
generating fast and slow scalable parallel video generation with video interface | arXiv: 2503.17539
generating multi-image synthetic data for text-to-image customization | arXiv: 2502.01720
generating physically stable and buildable brick structures from text | arXiv: 2505.05469
generative active learning for long-tail trajectory prediction via controllable | arXiv: 2507.22615
generative modeling of shape-dependent self-contact human poses | arXiv: 2509.23393
generative zoo | arXiv: 2412.08101
generic event boundary detection via denoising diffusion | arXiv: 2508.12084
genflow3d generative scene flow estimation and prediction on point cloud sequenc
genflowrl shaping rewards with generative object-centric flow in visual reinforc | arXiv: 2508.11049
genhancer imperfect generative models are secretly strong vision-centric enhance | arXiv: 2503.19480
genhaze pioneering controllable one-step realistic haze generation for real-worl
genieblue integrating both linguistic and multimodal capabilities for large lang
genm3 generative pretrained multi-path motion model for text conditional human m | arXiv: 2503.14919
genmo a generalist model for human motion | arXiv: 2505.01425
geo4d leveraging video generators for geometric 4d scene reconstruction | arXiv: 2504.07961
geobench-vlm benchmarking vision-language models for geospatial tasks | arXiv: 2411.19325
geodistill geometry-guided self-distillation for weakly supervised cross-view lo | arXiv: 2507.10935
geoexplorer active geo-localization with curiosity-driven exploration | arXiv: 2508.00152
geoformer geometry point encoder for 3d object detection with graph-based transf
geometry distributions | arXiv: 2411.16076
geometrycrafter consistent geometry estimation for open-world videos with diffus
geoprog3d compositional visual reasoning for city-scale 3d language fields | arXiv: 2506.23352
geosplatting towards geometry guided gaussian splatting for physically-based inv | arXiv: 2410.24204
gesturehydra semantic co-speech gesture synthesis via hybrid modality diffusion | arXiv: 2507.22731
gfpack attention-driven gradient fields for optimizing 2d irregular packing
ggtalker talking head systhesis with generalizable gaussian priors and identity- | arXiv: 2506.21513
global and local entailment learning for natural world imagery | arXiv: 2506.21476
global motion corresponder for 3d point-based scene interpolation under large mo | arXiv: 2508.20136
global-aware monocular semantic scene completion with state space models | arXiv: 2503.06569
gm-moe low-light enhancement with gated-mechanism mixture-of-experts | arXiv: 2503.07417
gmmamba group masking mamba for whole slide image classification
golden noise for diffusion models a learning framework | arXiv: 2411.09502
grab a challenging graph analysis benchmark for large multimodal models | arXiv: 2408.11817
gradient decomposition and alignment for incremental object detection
gradient extrapolation for debiased representation learning | arXiv: 2503.13236
gradient short-circuit efficient out-of-distribution detection via feature inter | arXiv: 2507.01417
gradient-reweighted adversarial camouflage for physical object detection evasion
granular concept circuits toward a fine-grained circuit discovery for concept re | arXiv: 2508.01728
graph domain adaptation with dual-branch encoder and two-level alignment for who
greg geometry-aware region refinement for sign language video generation
grouped speculative decoding for autoregressive image generation | arXiv: 2508.07747
growing a twig to accelerate large vision-language models | arXiv: 2503.14075
gs-id illumination decomposition on gaussian splatting via adaptive light aggreg
gs-livm real-time photo-realistic lidar-inertial-visual mapping with gaussian sp | arXiv: 2410.17084
gs-occ3d scaling vision-only occupancy reconstruction with gaussian splatting | arXiv: 2507.19451
gsot3d towards generic 3d single object tracking in the wild | arXiv: 2412.02129
gsv3d gaussian splatting-based geometric distillation with stable video diffusio
gt-mean loss a simple yet effective solution for brightness mismatch in low-ligh
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training | arXiv: 2503.08525
guava generalizable upper body 3d gaussian avatar | arXiv: 2505.03351
guiding diffusion-based articulated object generation by partial point cloud ali | arXiv: 2508.00558
guiding noisy label conditional diffusion models with score-based discriminator | arXiv: 2508.19581
guiodyssey a comprehensive dataset for cross-app gui navigation on mobile device | arXiv: 2406.08451
hades human avatar with dynamic explicit hair strands
haircup hair compositional universal prior for 3d gaussian avatars | arXiv: 2507.19481
hallucinatory image tokens a training-free eazy approach to detecting and mitiga
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation | arXiv: 2503.21979
harmonyseg tubular structure segmentation with deep-shallow feature fusion and g
harnessing massive satellite imagery with efficient masked image modeling | arXiv: 2406.11933
harnessing vision foundation models for high-performance training-free open voca
hcceposebf predicting front back surfaces to construct ultra-dense 2d-3d corresp | arXiv: 2510.10177
hdr image generation via gain map decomposed diffusion
head2body body pose generation from multi-sensory head-mounted inputs
heavy labels out dataset distillation with label space lightening | arXiv: 2408.08201
hermes a unified self-driving world model for simultaneous 3d scene understandin | arXiv: 2501.14729
hermes temporal-coherent long-form understanding with episodes and semantics | arXiv: 2408.17443
heuristic-induced multimodal risk distribution jailbreak attack for multimodal l | arXiv: 2412.05934
hfd-teacher high-frequency depth distillation from depth foundation models for e
hi-gaussian hierarchical gaussians under normalized spherical projection for sin
hi3dgen high-fidelity 3d geometry generation from images via normal bridging | arXiv: 2503.22236
hierarchical 3d scene graphs construction outdoors
hierarchical event memory for accurate and low-latency online video temporal gro | arXiv: 2508.04546
hierarchical material recognition from local appearance | arXiv: 2505.22911
hierarchical variational test-time prompt generation for zero-shot generalizatio
hierarchical visual prompt learning for continual video instance segmentation | arXiv: 2508.08612
hierarchical-aware orthogonal disentanglement framework for fine-grained skeleto
hiero understanding the hierarchy of human behavior enhances reasoning on egocen | arXiv: 2505.12911
high-resolution spatiotemporal modeling with global-local state space models for | arXiv: 2510.11017
himtok learning hierarchical mask tokens for image segmentation with large multi | arXiv: 2503.13026
hineus high-fidelity neural surface mitigating low-texture and reflective ambigu | arXiv: 2506.23854
hints of prompt enhancing visual representation for multimodal llms in autonomou | arXiv: 2411.13076
hipandas hyperspectral image joint denoising and super-resolution by image fusio
his-gpt towards 3d human-in-scene multimodal understanding | arXiv: 2503.12955
holistic tokenizer for autoregressive image generation | arXiv: 2507.02358
holistic unlearning benchmark a multi-faceted evaluation for text-to-image diffu | arXiv: 2410.05664
hort monocular hand-held objects reconstruction with transformers | arXiv: 2503.21313
housetour a virtual real estate aigent | arXiv: 2510.18054
how do multimodal large language models handle complex multimodal reasoning plac
how do optical flow and textual prompts collaborate to assist in audio-visual se | arXiv: 2601.08133
how far are ai-generated videos from simulating the 3d visual world a learned 3d | arXiv: 2406.19568
how would it sound material-controlled multimodal acoustic profile generation fo | arXiv: 2508.02905
hpsv3 towards wide-spectrum human preference score | arXiv: 2508.03789
hq-clip leveraging large vision-language models to create high-quality image-tex
hrscene how far are vlms from effective high-resolution image understanding | arXiv: 2504.18406
humanolat a large-scale dataset for full-body human relighting and novel-view sy | arXiv: 2508.09137
humans as checkerboards calibrating camera motion scale for world-coordinate hum
humoto a 4d dataset of mocap human object interactions | arXiv: 2504.10414
hust high-fidelity unbiased skin tone estimation via texture quantization
hvpunet hybrid-voxel point-cloud upsampling network
hybrid layout control for diffusion transformer fewer annotations superior aesth
hybrid-tower fine-grained pseudo-query interaction and generation for text-to-vi
hybrid-tta continual test-time adaptation via dynamic domain shift detection | arXiv: 2409.08566
hypdae hyperbolic diffusion autoencoders for hierarchical few-shot image generat | arXiv: 2411.17784
hyper-depth hypergraph-based multi-scale representation fusion for monocular dep
hypidecoder hybrid pixel decoder for efficient segmentation and detection
hytip hybrid temporal information propagation for masked conditional residual vi | arXiv: 2508.02072
i am big you are little i am right you are wrong | arXiv: 2507.23509
i2-world intra-inter tokenization for efficient dynamic 4d scene forecasting | arXiv: 2507.09144
iap invisible adversarial patch attack through perceptibility-aware localization | arXiv: 2507.06856
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves | arXiv: 2411.00827
identity preserving 3d head stylization with multiview score distillation | arXiv: 2411.13536
identity-aware language gaussian splatting for open-vocabulary 3d semantic segme
idf iterative dynamic filtering networks for generalizable image denoising | arXiv: 2508.19649
idface face template protection for efficient and secure identification | arXiv: 2507.12050
igl-nav incremental 3d gaussian localization for image-goal navigation | arXiv: 2508.00823
illume illuminating your llms to see draw and self-enhance | arXiv: 2412.06673
im-lut interpolation mixing look-up tables for image super-resolution | arXiv: 2507.09923
im360 large-scale indoor mapping with 360 cameras | arXiv: 2502.12545
image as an imu estimating camera motion from a single motion-blurred image | arXiv: 2503.17358
image intrinsic scale assessment bridging the gap between quality and resolution | arXiv: 2502.06476
image-guided shape-from-template using mesh inextensibility constraints | arXiv: 2507.22699
imagegem in-the-wild generative image interaction dataset for generative model p | arXiv: 2510.18433
imanip skill-incremental learning for robotic manipulation | arXiv: 2503.07087
imbalance in balance online concept balancing in generation models | arXiv: 2507.13345
imhead a large-scale implicit morphable model for localized head modeling | arXiv: 2510.10793
implicit counterfactual learning for audio-visual segmentation | arXiv: 2507.20740
improved noise schedule for diffusion training | arXiv: 2407.03297
improving large vision and language models by learning from a panel of peers | arXiv: 2509.01610
incremental few-shot semantic segmentation via multi-level switchable visual pro
inference-time diffusion model distillation | arXiv: 2412.08871
infgen a resolution-agnostic paradigm for scalable image synthesis | arXiv: 2509.10441
infinidreamer arbitrarily long human motion generation via segment score distill | arXiv: 2411.18303
information density principle for mllm benchmarks | arXiv: 2503.10079
information-bottleneck driven binary neural network for change detection | arXiv: 2507.03504
inpaint4drag repurposing inpainting models for drag-based image editing via bidi | arXiv: 2509.04582
insideout integrated rgb-radiative gaussian splatting for comprehensive 3d objec | arXiv: 2510.17864
instadrive instance-aware driving world models for realistic and consistent vide
instance-level video depth in groups beyond occlusions
instascene towards complete 3d instance decomposition and reconstruction from cl | arXiv: 2507.08416
instinct instance-level interaction architecture for query-based collaborative p | arXiv: 2509.23700
instruction-grounded visual projectors for continual learning of generative visi | arXiv: 2508.00260
instruction-oriented preference alignment for enhancing multi-modal comprehensio | arXiv: 2503.20309
integrating biological knowledge for robust microscopy image profiling on de nov | arXiv: 2507.10737
integrating task-specific and universal adapters for pre-trained model-based cla | arXiv: 2508.08165
integrating visual interpretation and linguistic reasoning for geometric problem
inter2former dynamic hybrid attention for efficient high-precision interactive s | arXiv: 2507.09612
interactavatar modeling hand-face interaction in photorealistic avatars with def
interaction-merged motion planning effectively leveraging diverse motion dataset | arXiv: 2507.04790
intergsedit interactive 3d gaussian splatting editing with 3d geometry-consisten
interpretable point cloud classification using multiple instance learning
interpretable zero-shot learning with locally-aligned vision-language model | arXiv: 2506.23822
intersyn interleaved learning for dynamic motion synthesis in the wild | arXiv: 2508.10297
intervening in black box concept bottleneck model for enhancing human neural net | arXiv: 2506.22803
intra-modal and cross-modal synchronization for audio-visual deepfake detection
intra-view and inter-view correlation guided multi-view novel class discovery | arXiv: 2507.12029
introstyle training-free introspective style attribution using diffusion feature | arXiv: 2412.14432
invisible watermarks visible gains steering machine unlearning with bi-level wat | arXiv: 2508.10065
irgpt understanding real-world infrared image with bi-cross-modal curriculum on | arXiv: 2507.14449
iris breaking gui complexity with adaptive focus and self-refining | arXiv: 2412.10342
is less more exploring token condensation as training-free test-time adaptation | arXiv: 2410.14729
is meta-learning out rethinking unsupervised few-shot classification with limite | arXiv: 2509.13185
jailbreaking multimodal large language models via shuffle inconsistency | arXiv: 2501.04931
jigsaw imagining complete shape priors for object reassembly | arXiv: 2410.11816
joint asymmetric loss for learning with noisy labels | arXiv: 2507.17692
joint diffusion models in continual learning | arXiv: 2411.08224
joint self-supervised video alignment and action segmentation | arXiv: 2503.16832
jointdit enhancing rgb-depth joint modeling with diffusion transformers | arXiv: 2505.00482
jpeg processing neural operator for backward-compatible coding | arXiv: 2507.23521
kaputt a large-scale dataset for visual defect detection | arXiv: 2510.05903
kda knowledge diffusion alignment with enhanced context for video temporal groun
keep your friends close and your enemies farther distance-aware voxel-wise contr
keyframe-oriented vision token pruning enhancing efficiency of large vision lang
kh symmetry understanding of 3d shapes via chirality disentanglement | arXiv: 2508.05505
kinmo kinematic-aware human motion understanding and generation | arXiv: 2411.15472
know no better a data-driven approach for enhancing negation awareness in clip | arXiv: 2501.10913
know your attention maps class-specific token masking for weakly supervised sema | arXiv: 2507.06848
knowledge distillation for learned image compression
knowledge distillation with refined logits | arXiv: 2408.07703
knowledge-guided part segmentation
la-motr end-to-end multi-object tracking by learnable association
laconic a 3d layout adapter for controllable image creation | arXiv: 2507.03257
lacoot layer collapse through optimal transport | arXiv: 2406.08933
langbridge interpreting image as a combination of language embeddings | arXiv: 2503.19404
langtraj diffusion model and dataset for language-conditioned trajectory simulat | arXiv: 2504.11521
language decoupling with fine-grained knowledge guidance for referring multi-obj
language driven occupancy prediction | arXiv: 2411.16072
larender training-free occlusion control in image generation via latent renderin | arXiv: 2508.07647
large multi-modal models can interpret features in large multi-modal models | arXiv: 2411.14982
large scene generation with cube-absorb discrete diffusion
large-scale pre-training for grounded video caption generation | arXiv: 2503.10781
lark low-rank updates after knowledge localization for few-shot class-incrementa
latent diffusion models with masked autoencoders | arXiv: 2507.09984
latent expression generation for referring image segmentation and grounding | arXiv: 2508.05123
latent swap joint diffusion for 2d long-form latent generation | arXiv: 2502.05130
latino-pro latent consistency inverse solver with prompt optimization | arXiv: 2503.12615
latte collaborative test-time adaptation of vision-language models in federated | arXiv: 2507.21494
lawdis language-window-based controllable dichotomous image segmentation | arXiv: 2508.01152
lay-your-scene natural scene layout generation with diffusion transformers | arXiv: 2505.04718
lay2story extending diffusion transformers for layout-togglable story generation | arXiv: 2508.08949
layeranimate layer-level control for animation | arXiv: 2501.08295
layerd decomposing raster graphic designs into layers | arXiv: 2509.25134
layerlock non-collapsing representation learning with progressive freezing | arXiv: 2509.10156
layertracer cognitive-aligned layered svg synthesis via diffusion transformer | arXiv: 2502.01105
lazymar accelerating masked autoregressive models via feature caching | arXiv: 2503.12450
ld-rps zero-shot unified image restoration via latent diffusion recurrent poster | arXiv: 2507.00790
ldip long distance information propagation for video super-resolution
leanvae an ultra-efficient reconstruction vae for video diffusion models | arXiv: 2503.14325
leaps and bounds an improved point cloud winding number formulation for fast nor
learn2synth learning optimal data synthesis using hypergradients for brain image | arXiv: 2411.16719
learnable feature patches and vectors for boosting low-light image enhancement w
learnable fractional reaction-diffusion dynamics for under-display tof imaging a | arXiv: 2511.01704
learnable logit adjustment for imbalanced semi-supervised learning under class d
learnable retrieval enhanced visual-text alignment and fusion for radiology repo
learned image compression with hierarchical progressive context modeling | arXiv: 2507.19125
learning 3d object spatial relationships from pre-trained 2d diffusion models | arXiv: 2503.19914
learning 3d scene analogies with neural contextual scene maps | arXiv: 2503.15897
learning 4d embodied world models | arXiv: 2504.20995
learning a unified template for gait recognition
learning deblurring texture prior from unpaired data with diffusion model | arXiv: 2507.13599
learning few-step diffusion models by trajectory distribution matching | arXiv: 2503.06674
learning hierarchical line buffer for image processing
learning implicit features with flow-infused transformations for realistic virtu
learning interpretable queries for explainable image classification with informa | arXiv: 2312.11548
learning neural scene representation from itof imaging
learning normal flow directly from events
learning on the go a meta-learning object navigation model
learning pixel-adaptive multi-layer perceptrons for real-time image enhancement | arXiv: 2507.12135
learning precise affordances from egocentric videos for robotic manipulation | arXiv: 2408.10123
learning robust image watermarking with lossless cover recovery
learning robust stereo matching in the wild with selective mixture-of-experts | arXiv: 2507.04631
learning separable fine-grained representation via dendrogram construction from
learning to generalize without bias for open-vocabulary action recognition | arXiv: 2502.20158
learning to see in the extremely dark | arXiv: 2506.21132
learning to see inside opaque liquid containers using speckle vibrometry | arXiv: 2507.20757
learning visual hierarchies in hyperbolic space for image retrieval | arXiv: 2411.17490
learning visual proxy for compositional zero-shot learning | arXiv: 2501.13859
legion learning to ground and explain for synthetic image detection | arXiv: 2503.15264
lego-maker a semantic-driven algorithm for text-to-3d generation
legrad an explainability method for vision transformers via feature formation se | arXiv: 2404.03214
less is more empowering gui agent with context-aware simplification | arXiv: 2507.03730
less is more improving motion diffusion models with sparse keyframes | arXiv: 2503.13859
less-to-more generalization unlocking more controllability by in-context generat | arXiv: 2504.02160
leveraging 2d priors and sdf guidance for urban scene rendering | arXiv: 2510.13381
leveraging bev paradigm for ground-to-aerial image synthesis | arXiv: 2408.01812
leveraging panoptic scene graph for evaluating fine-grained text-to-image genera
leveraging spatial invariance to boost adversarial transferability
lga-net learning local and global affinities for sparse scribble based image col
lhm large animatable human reconstruction model for single image to 3d in second
liberated-gs 3d gaussian splatting independent from sfm point clouds
lift latent implicit functions for task- and data-agnostic encoding | arXiv: 2503.15420
lifting the structural morphing for wide-angle images rectification unified cont
lightcity an urban dataset for outdoor inverse rendering and reconstruction unde
lightsout diffusion-based outpainting for enhanced lens flare removal | arXiv: 2510.15868
lightweight and fast real-time image enhancement via decomposition of the spatia | arXiv: 2508.16121
lightweight gradient-aware upscaling of 3d gaussian splatting images | arXiv: 2503.14171
linr-pcgc lossless implicit neural representations for point cloud geometry comp | arXiv: 2507.15686
lion-lora rethinking lora fusion to unify controllable spatial and temporal gene
lira reasoning reconstruction via multimodal large language models
lit delving into a simple linear diffusion transformer for image generation | arXiv: 2501.12976
llava-3d a simple yet effective pathway to empowering lmms with 3d capabilities | arXiv: 2409.18125
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | arXiv: 2411.10440
llava-kd a framework of distilling multimodal large language models | arXiv: 2410.16236
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | arXiv: 2403.15388
llm thought divergence and convergence for dialogue-based image generation contr
llm-assisted entropy-based adaptive distillation for unsupervised fine-grained v
lmm-det make large multimodal models excel in object detection | arXiv: 2507.18300
local dense logit relations for enhanced knowledge distillation | arXiv: 2507.15911
localdygs multi-view global dynamic scene modeling via adaptive local implicit f | arXiv: 2507.02363
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models | arXiv: 2504.14032
long context tuning for video generation | arXiv: 2503.10589
long-context state-space video world models | arXiv: 2505.20171
long-term traffic simulation with interleaved autoregressive motion and scenario | arXiv: 2506.17213
long3r long sequence streaming 3d reconstruction | arXiv: 2507.18255
longsplat robust unposed 3d gaussian splatting for casual long videos | arXiv: 2508.14041
looking in the mirror a faithful counterfactual explanation method for interpret | arXiv: 2509.16822
lookout real-world humanoid egocentric navigation | arXiv: 2508.14466
lora-fair federated lora fine-tuning with aggregation and initialization refinem | arXiv: 2411.14961
loraverse a submodular framework to retrieve diverse adapters for diffusion mode | arXiv: 2510.15022
loss functions for predictor-based neural architecture search | arXiv: 2506.05869
low-light image enhancement using event-based illumination estimation | arXiv: 2504.09379
lusd localized update score distillation for text-guided image editing | arXiv: 2503.11054
lvface progressive cluster optimization for large vision models in face recognit | arXiv: 2501.13420
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | arXiv: 2412.09501
m-net mri brain tumor sequential segmentation network via mesh-cast | arXiv: 2507.20582
m2eit multi-domain mixture of experts for robust neural inertial tracking
m2sformer multi-spectral and multi-scale attention with edge-aware difficulty gu | arXiv: 2506.20922
ma-cir a multimodal arithmetic benchmark for composed image retrieval
maestro task-relevant optimization via adaptive feature enhancement and suppress | arXiv: 2509.17462
magic insert style-aware drag-and-drop | arXiv: 2407.02489
magiccity geometry-aware 3d city generation from satellite imagery with multi-vi
magicdrive-v2 high-resolution long video generation for autonomous driving with | arXiv: 2411.13807
magichoi leveraging 3d priors for accurate hand-object reconstruction from short
magicid hybrid preference optimization for id-consistent and dynamic-preserved v | arXiv: 2503.12689
magicmirror id-preserved video generation in video diffusion transformers | arXiv: 2501.03931
mags reconstructing and simulating dynamic 3d objects with mesh-adsorbed gaussia
magshield towards better robustness in sparse inertial motion capture under magn | arXiv: 2506.22907
make me happier evoking emotions through image diffusion models | arXiv: 2403.08255
make your training flexible towards deployment-efficient video models | arXiv: 2503.14237
mambaml exploring state space models for multi-label image classification
mamtiff-cad multi-scale latent diffusion with mamba for complex parametric seque | arXiv: 2511.17647
manual-pa learning 3d part assembly from instruction diagrams | arXiv: 2411.18011
maskcontrol spatio-temporal control for masked motion synthesis | arXiv: 2410.10780
maskhand generative masked modeling for robust hand mesh reconstruction in the w | arXiv: 2412.13393
masksam auto-prompt sam with mask classification for volumetric medical image se
mastering collaborative multi-modal data selection a focus on informativeness un | arXiv: 2412.06293
matchdiffusion training-free generation of match-cuts | arXiv: 2411.18677
mate images are all you need for material transfer via diffusion transformer
materialmvp illumination-invariant material generation via multi-view pbr diffus | arXiv: 2503.10289
matvlm hybrid mamba-transformer for efficient vision-language modeling | arXiv: 2503.13440
mavflow preserving paralinguistic elements with conditional flow matching for ze | arXiv: 2503.11026
mavias mitigate any visual bias | arXiv: 2412.06632
mbti masked blending transformers with implicit positional encoding for frame-ra
mc-bench a benchmark for multi-context visual grounding in the era of mllms | arXiv: 2410.12332
mcam multimodal causal analysis model for ego-vehicle-level driving video unders | arXiv: 2507.06072
mcid multi-aspect copyright infringement detection for generated images
mdd a dataset for text-and-music conditioned duet dance generation | arXiv: 2508.16911
mdp-omni parameter-free multimodal depth prior-based sampling for omnidirectiona
mdp3 a training-free approach for list-wise frame selection in video-llms | arXiv: 2501.02885
measurexpert automatic anthropometric measurement extraction from two unregister
measuring the impact of rotation equivariance on aerial object detection | arXiv: 2507.09896
mega memory-efficient 4d gaussian splatting for dynamic scenes | arXiv: 2410.13613
meh a multi-style dataset and toolkit for advancing egyptian hieroglyph recognit
membership inference attacks with false discovery rate control | arXiv: 2508.07066
memdistill distilling lidar knowledge into memory for camera-only 3d object dete
memfof high-resolution training for memory-efficient multi-frame optical flow es | arXiv: 2506.23151
memory-efficient 4-bit preconditioned stochastic optimization | arXiv: 2412.10663
memory-efficient generative models via product quantization
memorytalker personalized speech-driven 3d facial animation via audio-guided sty | arXiv: 2507.20562
meshanything v2 artist-created mesh generation with adjacent mesh tokenization | arXiv: 2408.02555
meshllm empowering large language models to progressively understand and generat
meshmamba state space models for articulated 3d mesh generation and reconstructi | arXiv: 2507.15212
meshpad interactive sketch-conditioned artist-reminiscent mesh generation and ed | arXiv: 2503.01425
met2net a decoupled two-stage spatio-temporal forecasting model for complex mete
meta-learning dynamic center distance hard sample mining for learning with noisy
meta-unlearning on diffusion models preventing relearning unlearned concepts | arXiv: 2410.12777
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv: 2412.14164
meteor multi-encoder collaborative token pruning for efficient vision language m | arXiv: 2507.20842
metric convolutions a unifying theory to adaptive image convolutions | arXiv: 2406.05400
mgsfm multi-camera geometry driven global structure-from-motion | arXiv: 2507.03306
mh-lvc multi-hypothesis temporal prediction for learned conditional residual vid
mikudance animating character art with mixed motion dynamics | arXiv: 2411.08656
mincd-pnp learning 2d-3d correspondences with approximate blind pnp | arXiv: 2507.15257
mind the cost of scaffold benign clients may even become accomplices of backdoor | arXiv: 2411.16167
mind the gap aligning vision foundation models to image feature matching | arXiv: 2507.10318
minerva evaluating complex video reasoning | arXiv: 2505.00681
miore var-miore benchmarks to push the boundaries of restoration | arXiv: 2509.06803
missrag addressing the missing modality challenge in multimodal large language m
mistsense versatile online detection of procedural and execution mistakes
mitigating catastrophic overfitting in fast adversarial training via label infor
mitigating object hallucinations via sentence-level early intervention | arXiv: 2507.12455
mixa-q revisiting activation sparsity for vision transformers from a mixed-preci | arXiv: 2507.19131
mixant observation-dependent memory propagation for stochastic dense action anti | arXiv: 2509.11394
mixed signals a diverse point cloud dataset for heterogeneous lidar v2x collabor | arXiv: 2502.14156
mixri mixing features of reference images for novel object pose estimation | arXiv: 2601.06883
mixture-of-scores robust image-text data valuation via three lines of code
mm-ifengine towards multimodal instruction following | arXiv: 2504.07957
mm-spatial exploring 3d spatial understanding in multimodal llms | arXiv: 2503.13111
mmaif multi-task and multi-degradation all-in-one for image fusion with language | arXiv: 2503.14944
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning | arXiv: 2507.21924
mmone representing multiple modalities in one scene | arXiv: 2507.11129
mmreason an open-ended multi-modal multi-step reasoning benchmark for mllms towa
mobileie an extremely lightweight and effective convnet for real-time image enha | arXiv: 2507.01838
mobileviclip an efficient video-text model for mobile devices | arXiv: 2508.07312
modaltune fine-tuning slide-level foundation models with multi-modal information
moerl when mixture-of-experts meet reinforcement learning for adverse weather im
mofrr mixture of diffusion models for face retouching restoration | arXiv: 2507.19770
moga 3d generative avatar prior for monocular gaussian avatar reconstruction | arXiv: 2507.23597
molparser end-to-end visual recognition of molecule structures in the wild | arXiv: 2411.11098
moma-kitchen a 100k benchmark for affordance-grounded last-mile navigation in mo
moment quantization for video temporal grounding | arXiv: 2504.02286
momentum-gs momentum gaussian self-distillation for high-quality large scene rec | arXiv: 2412.04887
monocular facial appearance capture in the wild | arXiv: 2412.12765
monocular semantic scene completion via masked recurrent networks | arXiv: 2507.17661
monomobility zero-shot 3d mobility analysis from monocular videos | arXiv: 2505.11868
monosowa scalable monocular 3d object detector without human annotations | arXiv: 2501.09481
monovln bridging the observation gap between monocular and panoramic vision and
monster a unified model for motion scene text retrieval | arXiv: 2510.03200
morphogen efficient unconditional generation of long-range projection neuronal m
mosaic generating consistent privacy-preserving scenes from multiple depth views
mosaicdiff training-free structural pruning for diffusion model acceleration ref | arXiv: 2510.11962
mosic optimal-transport motion trajectory for dense self-supervised learning | arXiv: 2506.08694
motal unsupervised 3d object detection by modality and task-specific knowledge t
motion-2-to-3 leveraging 2d motion data for 3d motion generations
motionagent fine-grained controllable video generation via motion field agent | arXiv: 2502.03207
motionctrl a real-time controllable vision-language-motion model
motiondiff training-free zero-shot interactive motion editing via flow-assisted | arXiv: 2503.17695
motionfollower editing video motion via score-guided diffusion | arXiv: 2405.20325
motionshot adaptive motion transfer across arbitrary objects for text-to-video g | arXiv: 2507.16310
motionstreamer streaming motion generation via diffusion-based autoregressive mo | arXiv: 2503.15451
moto latent motion token as the bridging language for learning robot manipulatio | arXiv: 2412.04445
move motion-guided few-shot video object segmentation | arXiv: 2507.22061
mp-hsir a multi-prompt framework for universal hyperspectral image restoration | arXiv: 2503.09131
mr-fiqa face image quality assessment with multi-reference representations from
mrgen segmentation data engine for underrepresented mri modalities | arXiv: 2412.04106
ms3d high-quality 3d generation via multi-scale representation modeling
msa2 multi-task framework with structure-aware and style-adaptive character repr
msq memory-efficient bit sparsification quantization | arXiv: 2507.22349
mug pseudo labeling augmented audio-visual mamba network for audio-visual video | arXiv: 2507.01384
mugs multi-baseline generalizable gaussian splatting reconstruction | arXiv: 2508.04297
multi-cache enhanced prototype learning for test-time generalization of vision-l | arXiv: 2508.01225
multi-identity human image animation with structural video diffusion | arXiv: 2504.04126
multi-modal few-shot temporal action segmentation
multi-modal multi-platform person re-identification benchmark and method | arXiv: 2503.17096
multi-modal multi-task unified embedding model m3t-uem a task-adaptive represent
multi-modal segment anything model for camouflaged scene segmentation
multi-object sketch animation by scene decomposition and motion planning | arXiv: 2503.19351
multi-scenario overlapping text segmentation with depth awareness
multi-turn consistent image editing | arXiv: 2505.04320
multi-view 3d point tracking | arXiv: 2508.21060
multi-view gaze target estimation | arXiv: 2508.05857
multimodal action conditioned video simulation
multimodal latent diffusion model for complex sewing pattern generation | arXiv: 2412.14453
multimodal llms as customized reward models for text-to-image generation | arXiv: 2507.21391
multiverse a multi-turn conversation benchmark for evaluating large vision and l | arXiv: 2510.16641
multiverseg scalable interactive segmentation of biomedical imaging datasets wit | arXiv: 2412.15058
munba machine unlearning via nash bargaining | arXiv: 2411.15537
muse-vl modeling unified vlm through semantic discrete encoding | arXiv: 2411.17762
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv: 2411.17762
music-aligned holistic 3d dance generation via hierarchical motion modeling | arXiv: 2507.14915
mv-adapter multi-view consistent image generation made easy | arXiv: 2412.03632
mvgbench a comprehensive benchmark for multi-view generation models | arXiv: 2507.00006
nappure adversarial purification for robust image classification under non-addit | arXiv: 2510.14025
natra noise-agnostic framework for trajectory prediction with noisy observations
nautilus locality-aware autoencoder for scalable mesh generation | arXiv: 2501.14317
navmorph a self-evolving world model for vision-and-language navigation in conti | arXiv: 2506.23468
navq learning a q-model for foresighted vision-and-language navigation | arXiv: 2510.16457
negrefine refining negative label-based zero-shot ood detection | arXiv: 2507.09795
netracer a topology-aware iterative tracing approach for tubular structure extra
neuframeq neural frame fields for scalable and generalizable anisotropic quadran
neural architecture search driven by locally guided diffusion for personalized f
neural compression for 3d geometry sets | arXiv: 2405.15034
neural inverse rendering for high-accuracy 3d measurement of moving objects with
neural multi-view self-calibrated photometric stereo without photometric stereo | arXiv: 2507.23162
neural solver of dichromatic reflection model for specular highlight removal
neuraleaf neural parametric leaf models with shape and deformation disentangleme | arXiv: 2507.12714
neuromanifold-regularized kans for shape-fair feature representations
neurons emulating the human visual cortex improves fidelity and interpretability | arXiv: 2503.11167
ngd neural gradient based deformation for monocular garment reconstruction | arXiv: 2508.17712
no more sibling rivalry debiasing human-object interaction detection | arXiv: 2509.00760
no pose at all self-supervised pose-free 3d gaussian splatting from sparse views | arXiv: 2508.01171
noise2score3d tweedies approach for unsupervised point cloud denoising | arXiv: 2503.09283
noisecontroller towards consistent multi-view video generation via noise decompo
normalcrafter learning temporally consistent normals from video diffusion priors | arXiv: 2504.11427
not all degradations are equal a targeted feature denoising framework for genera
not all frame features are equal video-to-4d generation via decoupling dynamic-s | arXiv: 2502.08377
not only vision evolve visual speech recognition via peripheral information
nuiscene exploring efficient generation of unbounded outdoor scenes | arXiv: 2503.16375
nullswap proactive identity cloaking against deepfake face swapping | arXiv: 2503.18678
o-mama learning object mask matching between egocentric and exocentric views | arXiv: 2506.06026
oasis one image is all you need for multimodal instruction data synthesis | arXiv: 2503.08741
object-level correlation for few-shot segmentation | arXiv: 2509.07917
objectrelator enabling cross-view object relation understanding across ego-centr
occlugaussian occlusion-aware gaussian splatting for large scene reconstruction | arXiv: 2503.16177
occupancy learning with spatiotemporal memory | arXiv: 2508.04705
ock unsupervised dynamic video prediction with object-centric kinematics | arXiv: 2404.18423
ocr hinders rag evaluating the cascading impact of ocr on retrieval-augmented ge | arXiv: 2412.02592
od-rase ontology-driven risk assessment and safety enhancement for autonomous dr | arXiv: 2603.05936
odp-bench benchmarking out-of-distribution performance prediction | arXiv: 2510.27263
omegance a single parameter for various granularities in diffusion-based synthes | arXiv: 2411.17769
ominicontrol minimal and universal control for diffusion transformer | arXiv: 2411.15098
omni-dc highly robust depth completion with multiresolution depth integration | arXiv: 2411.19278
omni-scene perception-oriented point cloud geometry enhancement for coordinate q
omnidiff a comprehensive benchmark for fine-grained image difference captioning | arXiv: 2503.11093
omnihuman-1 rethinking the scaling-up of one-stage conditioned human animation m | arXiv: 2502.01061
omnipaint mastering object-oriented editing via disentangled insertion-removal i | arXiv: 2503.08677
omnisam omnidirectional segment anything model for uda in panoramic semantic seg | arXiv: 2503.07098
omnivton training-free universal virtual try-on | arXiv: 2507.15037
on large multimodal models as open-world image classifiers | arXiv: 2503.21851
on the complexity-faithfulness trade-off of gradient-based explanations | arXiv: 2508.10490
on the generalization of representation uncertainty in earth observation | arXiv: 2503.07082
on the provable importance of gradients for autonomous language-assisted image c
on the recovery of cameras from fundamental matrices
on the robustness tradeoff in fine-tuning | arXiv: 2503.14836
one look is enough seamless patchwise refinement for zero-shot monocular depth e | arXiv: 2503.22351
one perturbation is enough on generating universal adversarial perturbations aga | arXiv: 2406.05491
one polyp identifies all one-shot polyp segmentation with sam via cascaded prior
one-shot knowledge transfer for scalable person re-identification | arXiv: 2511.06016
onegt one-shot geometry-texture neural rendering for head avatars
online dense point tracking with streaming memory | arXiv: 2503.06471
online generic event boundary detection | arXiv: 2510.06855
online language splatting | arXiv: 2503.09447
online reasoning video segmentation with just-in-time digital twins | arXiv: 2503.21056
ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models | arXiv: 2507.00898
open-vocabulary octree-graph for 3d scene understanding | arXiv: 2411.16253
open-world skill discovery from unsegmented demonstration videos | arXiv: 2503.10684
openanimals revisiting person re-identification for animals towards better gener | arXiv: 2410.00204
openrsd towards open-prompts for object detection in remote sensing images | arXiv: 2503.06146
openvision a fully-open cost-effective family of advanced vision encoders for mu | arXiv: 2505.04601
ophclip hierarchical retrieval-augmented learning for ophthalmic surgical video-
optical model-driven sharpness mapping for autofocus in small depth-of-field and
optimal transport for brain-image alignment unveiling redundancy and synergy in
oraclefusion assisting the decipherment of oracle bone script with structurally | arXiv: 2506.21101
orderchain towards general instruct-tuning for stimulating the ordinal understan | arXiv: 2504.04801
orion a holistic end-to-end autonomous driving framework by vision-language inst
ouroboros single-step diffusion models for cycle-consistent forward and inverse | arXiv: 2508.14461
ouromamba a data-free quantization framework for vision mamba | arXiv: 2503.10959
outdoor monocular slam with global scale-consistent 3d gaussian pointmaps | arXiv: 2507.03737
outlier-aware post-training quantization for image super-resolution | arXiv: 2511.00682
ov-scan semantically consistent alignment for novel object discovery in open-voc
ovg-hq online video grounding with hybrid-modal queries | arXiv: 2508.11903
p-avas can physics-integrated audio-visual modeling boost neural acoustic synthe
pacgdc label-efficient generalizable depth completion with projection ambiguity | arXiv: 2507.07374
pan-crafter learning modality-consistent alignment for pan-sharpening | arXiv: 2505.23367
panollama generating endless and coherent panoramas with next-token-prediction l | arXiv: 2411.15867
panst3r multi-view consistent panoptic segmentation | arXiv: 2506.21348
partfield learning 3d feature fields for part segmentation and beyond | arXiv: 2504.11451
partial forward blocking a novel data pruning paradigm for lossless training acc | arXiv: 2506.23674
PASDF: Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation | arXiv: 2505.24431
pasg a closed-loop framework for automated geometric primitive extraction and se | arXiv: 2508.05976
passing the driving knowledge test | arXiv: 2508.21824
pasta part-aware sketch-to-3d shape generation with text-aligned prior | arXiv: 2503.12834
patchscaler an efficient patch-independent diffusion model for image super-resol | arXiv: 2405.17158
pathfinder a multi-modal multi-agent system for medical diagnostic decision-maki
pbcat patch-based composite adversarial training against physically realizable a | arXiv: 2506.23581
pcr-gs colmap-free 3d gaussian splatting via pose co-regularizations | arXiv: 2507.13891
penalizing boundary activation for object completeness in diffusion models | arXiv: 2509.16968
perception-as-control fine-grained controllable image animation with 3d-aware mo
personacraft personalized and controllable full-body multi-human scene generatio
personalvideo high id-fidelity video customization without dynamic and semantic | arXiv: 2411.17048
perspective-aware reasoning in vision-language models via mental imagery simulat | arXiv: 2504.17207
perspective-aware teaching adapting knowledge for heterogeneous distillation | arXiv: 2501.08885
perspose 3d human pose estimation with perspective encoding and perspective rota | arXiv: 2508.17239
ph-gan physics-inspired gan for generating sar images under limited data | arXiv: 2503.02242
phatnet a physics-guided haze transfer network for domain-adaptive real-world im | arXiv: 2507.14826
phd personalized 3d human body fitting with point diffusion | arXiv: 2508.21257
photolithography overlay map generation with implicit knowledge distillation dif
physical degradation model-guided interferometric hyperspectral reconstruction w
physics context builders a modular framework for physical reasoning in vision-la | arXiv: 2412.08619
physsplat efficient physics simulation for 3d scenes via mllm-guided gaussian sp | arXiv: 2411.12789
phystwin physics-informed reconstruction and simulation of deformable objects fr
pi-gps enhancing geometry problem solving by unleashing the power of diagrammati | arXiv: 2503.05543
pinco position-induced consistent adapter for diffusion transformer in foregroun
pino person-interaction noise optimization for long-duration and customizable mo | arXiv: 2507.19292
pixelstitch structure-preserving pixel-wise bidirectional warps for unsupervised
pla prompt learning attack against text-to-image generative models | arXiv: 2508.03696
placeit3d language-guided object placement in real 3d scenes | arXiv: 2505.05288
plan proactive low-rank allocation for continual learning | arXiv: 2510.21188
planar affine rectification from local change of scale and orientation
planeras learning planar primitives for 3d plane recovery
plangen towards unified layout planning and image generation in auto-regressive
plmp - point-line minimal problems for projective sfm | arXiv: 2503.04351
polaranything diffusion-based polarimetric image synthesis | arXiv: 2507.17268
polarimetric neural field via unified complex-valued wave representation
poseanchor robust root position estimation for 3d human pose estimation
posesyn synthesizing diverse 3d pose data from in-the-wild 2d data | arXiv: 2503.13025
possloss a reliable and sensitive facial landmark detection loss function
pre-mamba a 4d state space model for ultra-high-frequent event camera deraining | arXiv: 2505.05307
predict-optimize-distill a self-improving cycle for 4d object understanding | arXiv: 2504.17441
pretrained reversible generation as unsupervised visual representation learning | arXiv: 2412.01787
primhoi compositional human-object interaction via reusable primitives
principal components enable a new language of images | arXiv: 2503.08685
prior-aware dynamic temporal modeling framework for sequential 3d hand pose esti
prior-flow enhancing primitive panoramic optical flow with orthogonal view | arXiv: 2506.23897
prior2former - evidential modeling of mask transformers for assumption-free open
priormotion generative class-agnostic motion prediction with raster-vector motio
privacy-centric deep motion retargeting for anonymization of skeleton-based moti
pro-vpt distribution-adaptive visual prompt tuning via prompt relocation | arXiv: 2503.06901
proactive scene decomposition and reconstruction | arXiv: 2510.16272
probabilistic inertial poser probip uncertainty-aware human motion modeling from
probres probabilistic jump diffusion for open-world egocentric activity recognit | arXiv: 2504.03948
processing and acquisition traces in visual encoders what does clip know about y | arXiv: 2508.10637
progait a multi-purpose video dataset and benchmark for transfemoral prosthesis | arXiv: 2507.10223
progressive artwork outpainting via latent diffusion models
progressive test time energy adaptation for medical image segmentation | arXiv: 2503.16616
progressor a perceptually guided reward estimator with self-supervised online re | arXiv: 2411.17764
prompt guidance and human proximal perception for hot prediction with regional j | arXiv: 2507.01630
prompt-a-video prompt your video diffusion model via preference-aligned llm | arXiv: 2412.15156
prompt-driven transferable adversarial attack on person re-identification with a
promptdresser improving the quality and controllability of virtual try-on via ge
propvg end-to-end proposal-driven visual grounding with multi-granularity discri | arXiv: 2509.04833
prototype guided backdoor defense via activation space manipulation
prototype-based contrastive learning with stage-wise progressive augmentation fo
proxy-bridged game transformer for interactive extreme motion prediction
pruning all-rounder rethinking and improving inference efficiency for large visi
pseudo-sd pseudo controlled stable diffusion for semi-supervised and cross-domai
pseudomaptrainer learning online mapping without hd maps | arXiv: 2508.18788
purge-gate backpropagation-free test-time adaptation for point clouds classifica
pvchat personalized video chat with one-shot learning | arXiv: 2503.17069
pvmamba parallelizing vision mamba via dynamic state aggregation
q-frame query-aware frame selection and multi-resolution adaptation for video-ll | arXiv: 2506.22139
qk-edit revisiting attention-based injection in mm-dit for image and video editi
quadratic gaussian splatting high quality surface reconstruction with second-ord
quantcache adaptive importance-guided quantization with hierarchical latent and
quantifying and narrowing the unknown interactive text-to-video retrieval via un | arXiv: 2507.15504
r-livit a lidar-visual-thermal dataset enabling vulnerable road user focused roa
r1-onevision advancing generalized multimodal reasoning through cross-modal form | arXiv: 2503.10615
r1-vl learning to reason with multimodal large language models via step-wise gro | arXiv: 2503.12937
ra-busseg relation-aware semi-supervised breast ultrasound image segmentation vi
radarsplat radar gaussian splatting for high-fidelity data synthesis and 3d reco
radgpt constructing 3d image-text tumor datasets | arXiv: 2501.04678
radiant foam real-time differentiable ray tracing | arXiv: 2502.01157
ragnet large-scale reasoning-based affordance segmentation benchmark towards gen | arXiv: 2507.23734
rainbowprompt diversity-enhanced prompt-evolving for continual learning | arXiv: 2507.22553
raloc enhancing outdoor lidar localization via rotation awareness
randomized autoregressive visual generation | arXiv: 2411.00776
rapverse coherent vocals and whole-body motion generation from text | arXiv: 2405.20336
rareclip rarity-aware online zero-shot industrial anomaly detection
raygaussx accelerating gaussian-based ray marching for real-time and high-qualit
rayletdf raylet distance fields for generalizable 3d surface reconstruction from | arXiv: 2508.09830
raypose ray bundling diffusion for template views in unseen 6d object pose estim | arXiv: 2510.18521
rayzer a self-supervised large view synthesis model | arXiv: 2505.00702
real3d towards scaling large reconstruction models with real images
realcam-i2v real-world image-to-video generation with interactive complex camera | arXiv: 2502.10059
reangle-a-video 4d video generation as video-to-video translation | arXiv: 2503.09151
reasonvqa a multi-hop reasoning benchmark with structural knowledge for visual q | arXiv: 2507.16403
recammaster camera-controlled generative rendering from a single video | arXiv: 2503.11647
recondreamer harmonizing generative and reconstructive models for driving scene | arXiv: 2503.18438
recot reflective self-correction training for mitigating confirmation bias in la
recover biological structure from sparse-view diffraction images with neural vol | arXiv: 2510.16391
recovering parametric scenes from very few time-of-flight pixels | arXiv: 2509.16132
rectifying magnitude neglect in linear attention | arXiv: 2507.00698
reducing unimodal bias in multi-modal semantic segmentation with multi-scale fun
reducio generating 1k video within 16 seconds using extremely compressed motion | arXiv: 2411.13552
refedit a benchmark and method for improving instruction-based image editing mod
refer to any segmentation mask group with vision-language prompts | arXiv: 2506.05342
referdino referring video object segmentation with visual grounding foundations | arXiv: 2501.14607
reference-based super-resolution via image-based retrieval-augmented generation
refereverything towards segmenting everything we can speak of in videos | arXiv: 2410.23287
referring expression comprehension for small objects | arXiv: 2510.03701
reflex text-guided editing of real images in rectified flow via mid-step feature | arXiv: 2507.01496
regen learning compact video embedding with re-generative decoder | arXiv: 2503.08665
reggs unposed sparse views gaussian splatting with 3dgs registration | arXiv: 2507.08136
region-aware anchoring mechanism for efficient referring visual grounding
region-based cluster discrimination for visual representation learning | arXiv: 2507.20025
region-level data attribution for text-to-image generative models
registration beyond points general affine subspace alignment via geodesic distan
reinforcement learning-guided data selection via redundancy assessment | arXiv: 2506.21037
relative illumination fields learning medium and light independent underwater sc | arXiv: 2504.10024
removing out-of-focus reflective flares via color alignment
remp-ad retrieval-enhanced multi-modal prompt fusion for few-shot industrial vis
rep-mtl unleashing the power of representation-level task saliency for multi-tas | arXiv: 2507.21049
repa-e unlocking vae for end-to-end tuning of latent diffusion transformers | arXiv: 2504.10483
REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers | arXiv: 2504.10483
reparo compositional 3d assets generation with differentiable 3d layout alignmen | arXiv: 2405.18525
reposed efficient relative pose estimation with known depth information | arXiv: 2501.07742
representation shift unifying token compression with flashattention | arXiv: 2508.00367
representing 3d shapes with 64 latent vectors for 3d diffusion models | arXiv: 2503.08737
repurposing 2d diffusion models with gaussian atlas for 3d generation | arXiv: 2503.15877
rescue crowd evacuation simulation via controlling sdm-united characters | arXiv: 2507.20117
resgs residual densification of 3d gaussian for efficient detail recovery | arXiv: 2412.07494
residualvit for efficient temporally dense video encoding | arXiv: 2509.13255
resolving token-space gradient conflicts token space manipulation for transforme | arXiv: 2507.07485
resonance learning to predict social-aware pedestrian trajectories as co-vibrati | arXiv: 2412.02447
resq a novel framework to implement residual neural networks on analog rydberg a | arXiv: 2506.21537
rethink sparse signals for pose-guided text-to-image generation | arXiv: 2506.20983
rethinking cross-modal interaction in multimodal diffusion transformers | arXiv: 2506.07986
rethinking detecting salient and camouflaged objects in unconstrained scenes | arXiv: 2412.10943
rethinking dpo-style diffusion aligning frameworks
rethinking few shot clip benchmarks a critical analysis in the inductive setting | arXiv: 2507.20834
rethinking key-frame-based micro-expression recognition a robust and accurate fr
rethinking layered graphic design generation with a top-down approach | arXiv: 2507.05601
rethinking multi-modal object detection from the perspective of mono-modality fe
rethinking the embodied gap in vision-and-language navigation a holistic study o | arXiv: 2507.13019
rethinking the upsampling process in light field super-resolution with spatial-e
retinexmcnet a memory controller dominated network for low-light video enhanceme
revelio interpreting and leveraging semantic information in diffusion models | arXiv: 2411.16725
revisiting adversarial patch defenses on object detectors unified evaluation lar | arXiv: 2508.00649
revisiting image fusion for multi-illuminant white-balance correction | arXiv: 2503.14774
revisiting point cloud completion are we ready for the real-world | arXiv: 2411.17580
rhythmguassian repurposing generalizable gaussian model for remote physiological
ri3d few-shot gaussian splatting with repair and inpainting diffusion priors | arXiv: 2503.10860
riocc efficient cross-modal fusion transformer with collaborative feature refine
rmultiplex200k toward reliable multimodal process supervision for visual languag
roadwork a dataset and benchmark for learning to recognize observe analyze and d | arXiv: 2406.07661
robava a large-scale dataset and baseline towards video based robotic arm action
robofactory exploring embodied agent collaboration with compositional constraint | arXiv: 2503.16408
robopearls editable video simulation for robot manipulation | arXiv: 2506.22756
robotron-nav a unified framework for embodied navigation integrating perception
robotron-mani all-in-one multimodal large model for robotic manipulation | arXiv: 2412.07215
robotron-sim improving real-world driving via simulated hard-case | arXiv: 2508.04642
robridge a hierarchical architecture bridging cognition and execution for genera
robust 3d object detection using probabilistic point clouds from single-photon l | arXiv: 2508.00169
robust 3d-masked part-level editing in 3d gaussian splatting with regularized sc
robust adverse weather removal via spectral-based spatial grouping | arXiv: 2507.22498
robust and efficient 3d gaussian splatting for urban scene reconstruction | arXiv: 2507.23006
robust dataset condensation using supervised contrastive learning
robust machine unlearning for quantized neural networks via adaptive gradient re
robust multi-view learning via representation fusion of sample-level attention a
robustereo robust zero-shot stereo matching under adverse weather | arXiv: 2507.01653
robustsplat decoupling densification and dynamics for transient-free 3dgs | arXiv: 2506.02751
roco-sim enhancing roadside collaborative perception through foreground simulati | arXiv: 2503.10410
ross3d reconstructive visual instruction tuning with 3d-awareness | arXiv: 2504.01901
rs-vheat heat conduction guided efficient remote sensing foundation model | arXiv: 2411.17984
rtmap real-time recursive mapping with change detection and localization | arXiv: 2507.00980
s2m2 scalable stereo matching model for reliable depth estimation
s3e self-supervised state estimation for radar-inertial system | arXiv: 2509.25984
s3r-gs streamlining the pipeline for large-scale street scene reconstruction | arXiv: 2503.08217
s4m boosting semi-supervised instance segmentation with sam
sa-lut spatial adaptive 4d look-up table for photorealistic style transfer | arXiv: 2506.13465
sa-occ satellite-assisted 3d occupancy prediction in real world | arXiv: 2503.16399
sac-gnc sample consensus for adaptive graduated non-convexity
safeguarding vision-language models mitigating vulnerabilities to gaussian noise | arXiv: 2504.01308
saft shape and appearance of fabrics from template via differentiable physical s
saliency-aware quantized imitation learning for efficient robotic control | arXiv: 2505.15304
salvaging the overlooked leveraging class-aware contrastive learning for multi-c
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree | arXiv: 2410.16268
sam4d segment anything in camera and lidar streams | arXiv: 2506.21547
samo a lightweight sharpness-aware approach for multi-task optimization with joi | arXiv: 2507.07883
sample semantic alignment through temporal-adaptive multimodal prompt learning f
sana-sprint one-step diffusion with continuous-time consistency distillation | arXiv: 2503.09641
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation | arXiv: 2503.09641
sas segment any 3d scene with integrated 2d priors | arXiv: 2503.08512
sat2city 3d city generation from a single satellite image with cascaded latent d | arXiv: 2507.04403
sauce selective concept unlearning in vision-language models with sparse autoenc | arXiv: 2503.14530
sc-captioner improving image captioning with self-correction by reinforcement le | arXiv: 2508.06125
scaling 3d compositional models for robust classification and pose estimation
scaling action detection adatad with transformer-enhanced temporal-spatial adapt
scaling inference-time search with vision value model for improved visual compre | arXiv: 2412.03704
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension | arXiv: 2412.03704
Scaling Language-Free Visual Representation Learning | arXiv: 2504.01017
Scaling Laws for Native Multimodal Models | arXiv: 2504.07951
scaling omni-modal pretraining with multimodal context advancing universal repre
scaling tumor segmentation best lessons from real and synthetic data | arXiv: 2510.14831
scan bootstrapping contrastive pre-training for data efficiency | arXiv: 2411.09126
scene coordinate reconstruction priors | arXiv: 2510.12387
scenemi motion in-betweening for modeling human-scene interaction | arXiv: 2503.16289
scenepainter semantically consistent perpetual 3d scene generation with concept
scflow implicitly learning style and content disentanglement with flow models | arXiv: 2508.03402
scheduling weight transitions for quantization-aware training | arXiv: 2404.19248
scivid cross-domain evaluation of video models in scientific applications | arXiv: 2507.03578
score scene context matters in open-vocabulary remote sensing instance segmentat | arXiv: 2507.12857
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation | arXiv: 2507.12857
scorehoi physically plausible reconstruction of human-object interaction via sco | arXiv: 2509.07920
sculpting memory multi-concept forgetting in diffusion models via dynamic mask a
sd2actor continuous state decomposition via diffusion embeddings for robotic man
sdmatte grafting diffusion models for interactive matting | arXiv: 2508.00443
seeing and seeing through the glass real and synthetic data for multi-layer dept | arXiv: 2503.11633
seganypet universal promptable segmentation from positron emission tomography im | arXiv: 2502.14351
segmentdreamer towards high-fidelity text-to-3d synthesis with segmented consist | arXiv: 2507.05256
sehdr single-exposure hdr novel view synthesis via 3d gaussian bracketing | arXiv: 2509.20400
selective contrastive learning for weakly supervised affordance grounding | arXiv: 2508.07877
self-calibrated variance-stabilizing transformations for real-world image denois | arXiv: 2407.17399
self-calibrating gaussian splatting for large field-of-view reconstruction
self-ensembling gaussian splatting for few-shot novel view synthesis | arXiv: 2411.00144
self-supervised learning of hybrid part-aware 3d representations of 2d gaussians | arXiv: 2408.10789
self-supervised sparse sensor fusion for long range perception | arXiv: 2508.13995
semantic alignment and reinforcement for data-free quantization of vision transf | arXiv: 2412.16553
semantic causality-aware vision-based 3d occupancy prediction | arXiv: 2509.08388
semantic discrepancy-aware detector for image forgery identification | arXiv: 2508.12341
semantic watermarking reinvented enhancing robustness and generation quality wit | arXiv: 2509.07647
semges semantics-aware co-speech gesture generation using semantic coherence and | arXiv: 2507.19359
semi-supervised deep transfer for regression without domain alignment | arXiv: 2509.05092
semivisbooster boosting semi-supervised learning for fine-grained classification
semtalk holistic co-speech motion generation with frame-level semantic emphasis | arXiv: 2412.16563
separation for better integration disentangling edge and motion in event-based d
seqgrowgraph learning lane topology as a chain of graph expansions | arXiv: 2507.04822
sequential gaussian avatars with hierarchical motion context | arXiv: 2411.16768
sequential keypoint density estimator an overlooked baseline of skeleton-based v | arXiv: 2506.18368
serep semantic facial expression representation for robust in-the-wild capture a
serialization based point cloud oversegmentation
sfuod source-free unknown object detection | arXiv: 2507.17373
shadowhack hacking shadows via luminance-color divide and conquer | arXiv: 2412.02545
shape of motion 4d reconstruction from a single video | arXiv: 2407.13764
sheap self-supervised head geometry predictor learned via 2d gaussians | arXiv: 2504.12292
shift smoothing hallucinations by information flow tuning for multimodal large l
shortft diffusion model alignment via shortcut-based fine-tuning | arXiv: 2507.22604
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers | arXiv: 2504.00502
sibai a few-shot meta-classifier for poisoning detection in federated learning
sic similarity-based interpretable image classification with neural networks | arXiv: 2501.17328
signrep enhancing self-supervised sign representations | arXiv: 2503.08529
signs as tokens a retrieval-enhanced multilingual sign language generator | arXiv: 2411.17799
sim-detr unlock detr for temporal sentence grounding | arXiv: 2509.23867
sim3d single-instance multiview multimodal and multisetup 3d anomaly detection b | arXiv: 2506.21549
simmlm a simple framework for multi-modal learning with missing modality | arXiv: 2507.19264
simplevqa multimodal factuality evaluation for multimodal large language models | arXiv: 2502.13059
simulating dual-pixel images from ray tracing for depth estimation | arXiv: 2503.11213
simultaneous motion and noise estimation with event cameras | arXiv: 2504.04029
single-scanline relative pose estimation for rolling shutter cameras | arXiv: 2506.22069
site towards spatial intelligence thorough evaluation | arXiv: 2505.05456
skeleton motion words for unsupervised skeleton-based temporal action segmentati | arXiv: 2508.04513
sketchsplat 3d edge reconstruction via differentiable multi-view sketch splattin | arXiv: 2503.14786
skip-vision efficient and scalable acceleration of vision-language models via ad
skysense v2 a unified foundation model for multi-modal remote sensing | arXiv: 2507.13812
sl2a-inr single-layer learnable activation for implicit neural representation | arXiv: 2409.10836
sliderspace decomposing the visual capabilities of diffusion models | arXiv: 2502.01639
smarties spectrum-aware multi-sensor auto-encoder for remote sensing images | arXiv: 2506.19585
smgdiff soccer motion generation using diffusion probabilistic models | arXiv: 2411.16216
smolora exploring and defying dual catastrophic forgetting in continual visual i | arXiv: 2411.13949
social debiasing for fair multi-modal llms | arXiv: 2408.06569
soft separation and distillation toward global uniformity in federated unsupervi | arXiv: 2508.01251
spade spatial-aware denoising network for open-vocabulary panoptic scene graph g | arXiv: 2507.05798
sparfels fast reconstruction from sparse unposed imagery | arXiv: 2505.02178
sparse-dense side-tuner for efficient video temporal grounding | arXiv: 2507.07744
sparselanestp leveraging spatio-temporal priors with sparse transformers for 3d | arXiv: 2601.04968
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs | arXiv: 2506.05344
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference | arXiv: 2510.17777
sparsity outperforms low-rank projections in few-shot adaptation | arXiv: 2504.12436
spatial preference rewarding for mllms spatial understanding | arXiv: 2510.14374
spatial-temporal aware visuomotor diffusion policy learning | arXiv: 2507.06710
spatial-temporal forgery trace based forgery image identification
spatially-varying autofocus
spatialsplat efficient semantic 3d from sparse unposed images | arXiv: 2505.23044
spatialtrackerv2 advancing 3d point tracking with explicit camera motion
specguard spectral projection-based advanced invisible watermarking | arXiv: 2510.07302
spectral image tokenizer | arXiv: 2412.09607
spectral sensitivity estimation with an uncalibrated diffraction grating | arXiv: 2508.00330
spherical epipolar rectification for deep two-view absolute depth estimation
spikediff zero-shot high-quality video reconstruction from chromatic spike camer
spinmeround consistent multi-view identity generation using diffusion models | arXiv: 2504.10716
splat-based 3d scene reconstruction with extreme motion-blur
splat-loam gaussian splatting lidar odometry and mapping | arXiv: 2503.17491
splattalk 3d vqa with gaussian splatting | arXiv: 2503.06271
split-and-combine enhancing style augmentation for single domain generalization
srefiner soft-braid attention for multi-agent trajectory refinement | arXiv: 2507.04263
ssvq unleashing the potential of vector quantization with sign-splitting | arXiv: 2503.08668
stable score distillation | arXiv: 2507.09168
staining and locking computer vision models without retraining | arXiv: 2507.22000
star spatial-temporal augmentation with text-to-video models for real-world vide
std-gs exploring frame-event interaction for spatiotemporal-disentangled gaussia
stealthattack robust 3d gaussian splatting poisoning via density-guided illusion | arXiv: 2510.02314
stealthy backdoor attack in federated learning via adaptive layer-wise gradient
steerx creating any camera-free 3d and 4d scenes with geometric steering | arXiv: 2503.12024
step-detr advancing detr-based semi-supervised object detection with super teach
stepping out of similar semantic space for open-vocabulary segmentation | arXiv: 2506.16058
stereo any video temporally consistent stereo matching | arXiv: 2503.05549
sti-bench are mllms ready for precise spatial-temporal world understanding | arXiv: 2503.23765
stiv scalable text and image conditioned video generation | arXiv: 2412.07730
stochastic interpolants for revealing stylistic flows across the history of art
stochasticsplats stochastic rasterization for sorting-free 3d gaussian splatting | arXiv: 2503.24366
stolenlora exploring lora extraction attacks via synthetic data | arXiv: 2509.23594
straighten viscous rectified flow via noise optimization | arXiv: 2507.10218
strandhead text to hair-disentangled 3d head avatars using human-centric priors | arXiv: 2412.11586
streamdiffusion a pipeline-level solution for real-time interactive generation | arXiv: 2312.12491
streamgs online generalizable gaussian splatting reconstruction for unposed imag
streaming videollms for real-time procedural video understanding
streammind unlocking full frame rate streaming video dialogue through event-gate | arXiv: 2503.06220
stroke2sketch harnessing stroke attributes for training-free sketch generation | arXiv: 2510.16319
structure-aware semantic discrepancy and consistency for 3d medical image self-s
structure-guided diffusion models for high-fidelity portrait shadow removal | arXiv: 2507.04692
strumamba3d exploring structural mamba for self-supervised point cloud represent | arXiv: 2506.21541
stylekeeper prevent content leakage using negative visual query guidance | arXiv: 2510.06827
stylemotif multi-modal motion stylization using style-content cross fusion | arXiv: 2503.21775
stylesrn scene text image super-resolution with text style embedding
stylized-face a million-level stylized face dataset for face recognition
su-rgs relightable 3d gaussian splatting from sparse views under unconstrained i
subjective camera 1.0 bridging human cognition and visual reconstruction through
suma a subspace mapping approach for robust and effective concept erasure in tex
summdiff generative modeling of video summarization with diffusion | arXiv: 2510.08458
supercharging floorplan localization with semantic rays | arXiv: 2507.09291
superdec 3d scene decomposition with superquadrics primitives | arXiv: 2504.00992
superedit rectifying and facilitating supervision for instruction-based image ed | arXiv: 2505.02370
supermat physically consistent pbr material estimation at interactive rates | arXiv: 2411.17515
supervised exploratory learning for long-tailed visual recognition
surfacesplat connecting surface reconstruction and gaussian splatting | arXiv: 2507.15602
sv4d 2.0 enhancing spatio-temporal consistency in multi-view video diffusion for
svg-head hybrid surface-volumetric gaussians for high-fidelity head reconstructi | arXiv: 2508.09597
svip semantically contextualized visual patches for zero-shot learning | arXiv: 2503.10252
svtrv2 ctc beats encoder-decoder models in scene text recognition | arXiv: 2411.15858
sweettok semantic-aware spatial-temporal tokenizer for compact video discretizat | arXiv: 2412.10443
switch-a-view view selection learned from unlabeled in-the-wild videos | arXiv: 2412.18386
synad enhancing real-world end-to-end autonomous driving models through syntheti
syncdiff synchronized motion diffusion for multi-body human-object interaction s | arXiv: 2412.20104
synchronization of multiple videos | arXiv: 2510.14051
syncity training-free generation of 3d worlds | arXiv: 2503.16420
synergistic prompting for robust visual recognition with missing modalities | arXiv: 2507.07802
synfer towards boosting facial expression recognition with synthetic data | arXiv: 2410.09865
syntag enhancing the geometric robustness of inversion-based generative image wa
synthesizing near-boundary ood samples for out-of-distribution detection | arXiv: 2507.10225
tab transformer attention bottlenecks enable user intervention and debugging in | arXiv: 2412.18675
taming the untamed graph-based knowledge retrieval and reasoning for mllms to co | arXiv: 2506.17589
tapnext tracking any point tap as next token prediction | arXiv: 2504.05579
tar3d creating high-quality 3d assets via next-part prediction | arXiv: 2412.16919
target bias is all you need zero-shot debiasing of vision-language models with b
tars traffic-aware radar scene flow estimation | arXiv: 2503.10210
task vector quantization for memory-efficient model merging | arXiv: 2503.06921
task-aware prompt gradient projection for parameter-efficient tuning federated c
tavis text-bridged audio-visual segmentation with foundation models | arXiv: 2506.11436
taxadiffusion progressively trained diffusion model for fine-grained species gen | arXiv: 2506.01923
tcfg truncated classifier-free guidance for efficient and scalable text-to-image
teaching ai the anatomy behind the scan addressing anatomical flaws in medical i
teefusion blending text embeddings to distill classifier-free guidance | arXiv: 2507.18192
teeth reconstruction and performance capture using a phone camera
teethgenerator a two-stage framework for paired pre- and post-orthodontic 3d den | arXiv: 2507.04685
temperature in cosine-based softmax loss
temporal overlapping prediction a self-supervised pre-training method for lidar
temporal rate reduction clustering for human motion segmentation | arXiv: 2506.21249
temporal unlearnable examples preventing personal video data from unauthorized e | arXiv: 2507.07483
temporal-aware query routing for real-time video instance segmentation
tera rethinking text-guided realistic 3d avatar generation | arXiv: 2509.02466
test-time prompt tuning for zero-shot depth completion
test-time retrieval-augmented adaptation for vision-language models
text embedding knows how to quantize text-guided diffusion models | arXiv: 2507.10340
text2outfit controllable outfit generation with multimodal language models
text2vdm text to vector displacement maps for expressive and interactive 3d scul | arXiv: 2502.20045
textured 3d regenerative morphing with 3d diffusion prior | arXiv: 2502.14316
the best of both worlds integrating language models and diffusion models for vid
the curse of conditions analyzing and improving optimal transport for conditiona | arXiv: 2503.10636
the devil is in the spurious correlations boosting moment retrieval with dynamic | arXiv: 2501.07305
the inter-intra modal measure a predictive lens on fine-tuning outcomes in visio | arXiv: 2407.15731
the scalability of simplicity empirical analysis of vision-language learning wit
the silent assistant noisequery as implicit guidance for goal-driven image gener | arXiv: 2412.05101
thermal polarimetric multi-view stereo | arXiv: 2510.20972
tikzero zero-shot text-guided graphics program synthesis | arXiv: 2503.11509
time-aware auto white balance in mobile photography | arXiv: 2504.05623
timeexpert an expert-guided video llm for video temporal grounding | arXiv: 2508.01699
timeformer capturing temporal relationships of deformable 3d gaussians for robus | arXiv: 2411.11941
timestep-aware diffusion model for extreme image rescaling | arXiv: 2408.09151
tinyvim frequency decoupling for tiny hybrid vision mamba | arXiv: 2411.17473
tip-i2v a million-scale real text and image prompt dataset for image-to-video ge | arXiv: 2411.04709
tlb-vfi temporal-aware latent brownian bridge diffusion for video frame interpol | arXiv: 2507.04984
to label or not to label palm - a predictive model for evaluating sample efficie | arXiv: 2507.15381
toga temporally grounded open-ended video qa with weak supervision | arXiv: 2506.09445
token-efficient vlm high-resolution image understanding via dynamic region propo
TokenBridge: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation | arXiv: 2503.16430
tokensgen harnessing condensed tokens for long video generation
tokenunify scaling up autoregressive pretraining for neuron segmentation | arXiv: 2405.16847
ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools | arXiv: 2508.03284
topotta topology-enhanced test-time adaptation for tubular structure segmentatio | arXiv: 2508.00442
totp transferable online pedestrian trajectory prediction with temporal-adaptive
toward better out-painting improving the image composition with initialization p
toward long-tailed online anomaly detection through class-agnostic concepts | arXiv: 2507.16946
toward material-agnostic system identification from videos | arXiv: 2508.01112
Towards a Unified Copernicus Foundation Model for Earth Vision | arXiv: 2503.11849
towards a universal 3d medical multi-modality generalization via learning person
towards a universal image degradation model via content-degradation disentanglem | arXiv: 2505.12860
towards adversarial robustness via debiased high-confidence logit alignment | arXiv: 2408.06079
towards comprehensive lecture slides understanding large-scale dataset and effec
towards cross-modal backward-compatible representation learning for vision-langu
towards efficient general feature prediction in masked skeleton modeling | arXiv: 2509.03609
towards long-horizon vision-language-action system reasoning acting and memory
towards more diverse and challenging pre-training for point cloud learning self- | arXiv: 2509.01250
towards omnimodal expressions and reasoning in referring audio-visual segmentati | arXiv: 2507.22886
towards open-world generation of stereo images and unsupervised matching | arXiv: 2503.12720
towards performance consistency in multi-level model collaboration
towards privacy-preserved pre-training of remote sensing foundation models with
towards robust defense against customization via protective perturbation resista | arXiv: 2509.13922
towards robustness of person search against corruptions
towards scalable spatial intelligence via 2d-to-3d data lifting | arXiv: 2507.18678
towards stabilized and efficient diffusion transformers through long-skip-connec
towards video thinking test a holistic benchmark for advanced video reasoning an | arXiv: 2507.15028
tpg-inr target prior-guided implicit 3d ct reconstruction for enhanced sparse-vi
tr-pts task-relevant parameter and token selection for efficient tuning | arXiv: 2507.22872
trace learning 3d gaussian physical dynamics from multi-view videos | arXiv: 2508.09811
trace3d consistent segmentation lifting via gaussian instance tracing | arXiv: 2508.03227
trackany3d transferring pretrained 3d models for category-unified 3d point cloud | arXiv: 2507.19908
tracking tiny drones against clutter large-scale infrared benchmark with motion-
trade-offs in image generation how do different dimensions interact | arXiv: 2507.22100
trafficloc localizing traffic surveillance cameras in 3d scenes | arXiv: 2412.10308
training-free class purification for open-vocabulary semantic segmentation | arXiv: 2508.00557
training-free generation of temporally consistent rewards from vlms | arXiv: 2507.04789
training-free industrial defect generation with diffusion models
training-free personalization via retrieval and reasoning on fingerprints | arXiv: 2503.18623
TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update | arXiv: 2507.11069
trans-adapter a plug-and-play framework for transparent image inpainting | arXiv: 2508.01098
transformed low-rank adaptation via tensor decomposition and its applications to | arXiv: 2501.08727
transit transient transformer for non-line-of-sight videography | arXiv: 2503.11328
transparent vision a theory of hierarchical invariant representations
trce towards reliable malicious concept erasure in text-to-image diffusion model | arXiv: 2503.07389
trial-oriented visual rearrangement
tridi trilateral diffusion of 3d humans objects and interactions | arXiv: 2412.06334
trokens semantic-aware relational trajectory tokens for few-shot action recognit | arXiv: 2508.03695
trust but verify programmatic vlm evaluation in the wild | arXiv: 2410.13121
trustmark robust watermarking and watermark removal for arbitrary resolution ima
tryon-refiner conditional rectified-flow-based tryon refiner for more accurate d
tune-your-style intensity-tunable 3d style transfer with gaussian splatting | arXiv: 2602.00618
turboreg turboclique for robust and efficient point cloud registration | arXiv: 2507.01439
twist scout grounding multimodal llm-experts by forget-free tuning
two losses one goal balancing conflict gradients for semi-supervised semantic se
u-vilar uncertainty-aware visual localization for autonomous driving via differe
uavscenes a multi-modal dataset for uavs | arXiv: 2507.22412
udc-vit a real-world video dataset for under-display cameras | arXiv: 2501.18545
uipro unleashing superior interaction capability for gui agents | arXiv: 2509.17328
ukbob one billion mri labeled masks for generalizable 3d medical image segmentat | arXiv: 2504.06908
ultho ultra-lightweight yet efficient hyperparameter optimization in deep reinfo
ultra-precision 6dof pose estimation using 2-d interpolated discrete fourier tra
umdatrack unified multi-domain adaptive tracking under adverse weather condition | arXiv: 2507.00648
uncalibrated structure from motion on a sphere
uncertainty-aware gradient stabilization for small object detection | arXiv: 2303.01803
understanding co-speech gestures in-the-wild | arXiv: 2503.22668
understanding flatness in generative models its role and benefits | arXiv: 2503.11078
understanding museum exhibits using vision-language reasoning | arXiv: 2412.01370
understanding personal concept in open-vocabulary semantic segmentation | arXiv: 2507.11030
unfolding-associative encoder-decoder network with progressive alignment for pan
unicombine unified multi-conditional combination with diffusion transformer | arXiv: 2503.09277
uniconvnet expanding effective receptive field while maintaining asymptotically | arXiv: 2508.09000
unidxmd towards unified representation for cross-modal unsupervised domain adapt
uniegomotion a unified model for egocentric motion reconstruction forecasting an | arXiv: 2508.01126
unified category-level object detection and pose estimation from rgb images usin | arXiv: 2508.02157
unified multi-agent trajectory modeling with masked trajectory diffusion
unified multimodal understanding via byte-pair visual encoding | arXiv: 2506.23639
uniglyph unified segmentation-conditioned diffusion for precise visual text synt | arXiv: 2507.00992
uniocc a unified benchmark for occupancy forecasting and prediction in autonomou | arXiv: 2503.24381
uniphys unified planner and controller with diffusion for flexible physics-based | arXiv: 2504.12540
uniportrait a unified framework for identity-preserving single- and multi-human
unires universal image restoration for complex degradations | arXiv: 2506.05599
universe unleashing the scene prior of video diffusion models for robust radianc
univg a generalist diffusion model for unified image generation and editing | arXiv: 2503.12652
unleashing high-quality image generation in diffusion sampling using second-orde
unleashing the temporal potential of stereo event cameras for continuous-time 3d | arXiv: 2508.02288
unleashing vecset diffusion model for fast shape generation | arXiv: 2503.16302
unlocking the potential of diffusion priors in blind face restoration | arXiv: 2508.08556
unraveling the effects of synthetic data on end-to-end autonomous driving | arXiv: 2503.18108
unsupervised identification of protein compositions and conformations via implic
unsupervised imaging inverse problems with diffusion distribution matching | arXiv: 2506.14605
unsupervised joint learning of optical flow and intensity with event cameras | arXiv: 2503.17262
unsupervised rgb-d point cloud registration for scenes with low overlap and phot
unsupervised visible-infrared person re-identification under unpaired settings
unsupervised visual chain-of-thought reasoning via preference optimization | arXiv: 2504.18397
unziplora separating content and style from a single image | arXiv: 2412.04465
upp unified point-level prompting for robust point cloud analysis | arXiv: 2507.18997
upre zero-shot domain adaptation for object detection via unified prompt and rep | arXiv: 2507.00721
ust-ssm unified spatio-temporal state space models for point cloud video modelin | arXiv: 2508.14604
v2pe improving multimodal long-context capability of vision-language models with
v2xpnp vehicle-to-everything spatio-temporal fusion for multi-agent perception a | arXiv: 2412.01812
va-moe variables-adaptive mixture of experts for incremental weather forecasting | arXiv: 2412.02503
VACE: All-in-One Video Creation and Editing | arXiv: 2503.07598
vaflow video-to-audio generation with cross-modality flow matching
vamba understanding hour-long videos with hybrid mamba-transformers | arXiv: 2503.11579
variance-based pruning for accelerating and compressing trained networks | arXiv: 2507.12988
vector contrastive learning for pixel-wise pretraining in medical vision | arXiv: 2506.20850
veggie instructional editing and reasoning video concepts with grounded generati | arXiv: 2503.14350
versatile transition generation with image-to-video diffusion | arXiv: 2508.01698
vertexregen mesh generation with continuous level of detail | arXiv: 2508.09062
vggsounder audio-visual evaluations for foundation models | arXiv: 2508.08237
vgmamba attribute-to-location clue reasoning for quantity-agnostic 3d visual gro
victr vital consistency transfer for pathology aware image synthesis | arXiv: 2505.04963
vid-group temporal video grounding pretraining from unlabeled videos in the wild
video color grading via look-up table generation | arXiv: 2508.00548
video motion graphs | arXiv: 2503.20218
video-t1 test-time scaling for video generation | arXiv: 2503.18942
videollamb long streaming video understanding with recurrent memory bridges | arXiv: 2409.01071
videominer iteratively grounding key frames of hour-long videos via tree-based g | arXiv: 2510.06040
videosetdiff identifying and reasoning similarities and differences in similar v
videovae large motion video autoencoding with cross-modal video vae
viewsrd 3d visual grounding via structured multi-view decomposition | arXiv: 2507.11261
vigface virtual identity generation for privacy-free face recognition dataset | arXiv: 2403.08277
vilu learning vision-language uncertainties for failure prediction | arXiv: 2507.07620
vip iterative online preference distillation for efficient video diffusion model | arXiv: 2508.03254
vishall3d monocular semantic scene completion from reconstructing the visible re
vision-language interactive relation mining for open-vocabulary scene graph gene
vision-language models can't see the obvious | arXiv: 2507.04741
vision-language neural graph featurization for extracting retinal lesions
visionmath vision-form mathematical problem-solving
visnumbench evaluating number sense of multimodal large language models | arXiv: 2503.14939
visrl intention-driven visual perception via reinforced reasoning | arXiv: 2503.07523
visual chronicles using multimodal llms to analyze massive collections of images | arXiv: 2504.08727
visual intention grounding for egocentric assistants | arXiv: 2504.13621
visual interestingness decoded how gpt-4o mirrors human interests | arXiv: 2510.13316
visual modality prompt for adapting vision-language object detectors | arXiv: 2412.00622
visual relation diffusion for human-object interaction detection
visual surface wave elastography revealing subsurface physical properties via vi | arXiv: 2507.09207
visual-oriented fine-grained knowledge editing for multimodal large language mod | arXiv: 2411.12790
visual-rft visual reinforcement fine-tuning | arXiv: 2503.01785
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning | arXiv: 2504.07960
vit-ensembleattack augmenting ensemble models for stronger adversarial transfera
vit-linearizer distilling quadratic knowledge into linear-time vision models | arXiv: 2504.00037
vit-split unleashing the power of vision foundation models via efficient splitti | arXiv: 2506.03433
vital more understandable feature visualization through distribution alignment a | arXiv: 2503.22399
vivid4d improving 4d reconstruction from monocular video by video inpainting | arXiv: 2504.11092
vlabench a large-scale benchmark for language-conditioned robotics manipulation
vlipp towards physically plausible video generation with vision and language inf
vlr-driver large vision-language-reasoning models for embodied autonomous drivin
vlrmbench a comprehensive and challenging benchmark for vision-language reward m | arXiv: 2503.07478
vmbench a benchmark for perception-aligned video motion generation | arXiv: 2503.10076
voccl3d a video benchmark dataset for 3d human pose and shape estimation under r | arXiv: 2508.06757
volume - authentic 3d video calls from live gaussian splat prediction | arXiv: 2507.21311
volumetricsmpl a neural volumetric body model for efficient interactions contact | arXiv: 2506.23236
vovtrack exploring the potentiality in raw videos for open-vocabulary multi-obje
vpo aligning text-to-video generation models with prompt optimization | arXiv: 2503.20491
vq-sgen a vector quantized stroke representation for creative sketch generation | arXiv: 2411.16446
vq-vla improving vision-language-action models via scaling vector-quantized acti | arXiv: 2507.01016
vsc visual search compositional text-to-image diffusion model | arXiv: 2505.01104
vsp diagnosing the dual challenges of perception and reasoning in spatial planni
vsrm a robust mamba-based framework for video super-resolution | arXiv: 2506.22762
vssd vision mamba with non-causal state space duality | arXiv: 2407.18559
vtimecot thinking by drawing for video temporal grounding and reasoning | arXiv: 2510.14672
vulnerability-aware spatio-temporal learning for generalizable deepfake video de | arXiv: 2501.01184
walkvlm aid visually impaired people walking by vision language model
wasserstein style distribution analysis and transform for stylized image generat
wave-mambaad wavelet-driven state space model for multi-class unsupervised anoma
wavelet policy lifting scheme for policy learning in long-horizon tasks | arXiv: 2507.04331
weakly supervised visible-infrared person re-identification via heterogeneous ex | arXiv: 2507.12942
weakly-supervised learning of dense functional correspondences | arXiv: 2509.03893
weaveseg iterative contrast-weaving and spectral feature-refining for nuclei ins
what changed and what could have changed state-change counterfactuals for proced
what changed detecting and evaluating instruction-guided image edits with multim
what if understanding motion through sparse interactions | arXiv: 2510.12777
what makes for text to 360-degree panorama generation with stable diffusion | arXiv: 2505.22129
what you have is what you track adaptive and robust multimodal tracking | arXiv: 2507.05899
what's in a latent leveraging diffusion latent space for domain generalization | arXiv: 2503.06698
what's making that sound right now video-centric audio-visual localization | arXiv: 2507.04667
when large vision-language model meets large remote sensing imagery coarse-to-fi
when lighting deceives exposing vision-language models illumination vulnerabilit
when pixel difference patterns meet vit pidivit for few-shot object detection
where am i cross-view geo-localization with natural language descriptions | arXiv: 2412.17007
where what why towards explainable driver attention prediction | arXiv: 2506.23088
who controls the authorization invertible networks for copyright protection in t
who is a better talker subjective and objective quality assessment for ai-genera
why lvlms are more prone to hallucinations in longer responses the role of conte | arXiv: 2510.20229
wikiautogen towards multi-modal wikipedia-style article generation | arXiv: 2503.19065
wildsat learning satellite image representations from wildlife observations | arXiv: 2412.14428
wildseg3d segment any 3d objects in the wild from 2d images | arXiv: 2503.08407
wins winograd structured pruning for fast winograd convolution
wir3d visually-informed and geometry-aware 3d shape abstraction | arXiv: 2505.04813
wonderplay dynamic 3d scene generation from a single image and actions | arXiv: 2505.18151
wonderturbo generating interactive 3d world in 072 seconds | arXiv: 2504.02261
world4drive end-to-end autonomous driving via intention-aware physical latent wo | arXiv: 2507.00603
worldscore a unified evaluation benchmark for world generation | arXiv: 2504.00983
x-dancer expressive music to human dance video generation | arXiv: 2502.17414
x-prompt generalizable auto-regressive visual learning with in-context prompting
xtrack multimodal training boosts rgb-x video object trackers | arXiv: 2405.17773
yolo-count differentiable object counting for text-to-image generation | arXiv: 2508.00728
YOLOE: Real-Time Seeing Anything | arXiv: 2503.07465
you share beliefs i adapt progressive heterogeneous collaborative perception | arXiv: 2509.09310
your text encoder can be an object-level watermarking controller | arXiv: 2503.11945
zero-avsr zero-shot audio-visual speech recognition with llms by learning langua | arXiv: 2503.06273
zero-shot depth aware image editing with diffusion models
zero-shot inexact cad model alignment from a single image | arXiv: 2507.03292
zerostereo zero-shot stereo matching from single images | arXiv: 2501.08654
zeroth-order fine-tuning of llms in random subspaces | arXiv: 2410.08989
zfusion efficient deep compositional zero-shot learning for blind image super-re
zim zero-shot image matting for anything | arXiv: 2411.00626
zipvl accelerating vision-language models through dynamic token sparsity
3dgs lm faster gaussian splatting optimization with levenberg marquardt
aaa gaussians anti aliased artifact free 3d gaussian rendering
alltracker efficient dense point tracking at high resolution
argmatch adaptive refinement gathering for efficient dense matching
beziergs dynamic urban scene reconstruction with bezier curve gaussian splatting | arXiv: 2506.22099
boosting multi-view indoor 3d object detection via adaptive 3d volume | arXiv: 2507.18331
bridging 3d anomaly localization and repair via high-qualit | arXiv: 2505.24431
dap-mae domain-adaptive point cloud masked autoencoder for e | arXiv: 2510.21635
ask and remember a questions only replay strategy for continual visual question answering
backdoor attacks on neural networks via one bit flip
acam kd adaptive cooperative attention masking knowledge distillation | arXiv: 2503.06307
ad gs object aware bspline gaussian splatting self supervised autonomous driving
resonance learning to predict social aware pedestrian trajectories as co vibrations | arXiv: 2412.02447
tikzero zero-shot text-guided graphics program synthesis | arXiv: 2503.11509
cargait cross attention based re ranking for gait recognition | arXiv: 2503.03501
dynfacerestore balancing fidelity and quality in diffusion-guided blind face res | arXiv: 2507.13797
a0 affordance aware hierarchical model robotic manipulation | arXiv: 2504.12636
adaptive routing of text to image generation requests between large cloud model and light weight edge model
addressing text embedding leakage in diffusion based image editing
adiee automatic dataset creation and scorer for instruction guided image editing evaluation
ale attribute leakage free editing | arXiv: 2412.04715
bridging diffusion models and 3d representations a 3d consis | arXiv: 2508.04090
bridging the skeleton text modality gap diffusion powered modality alignment for | arXiv: 2411.10745
chords diffusion sampling accelerator with multi core hierarchical ode solvers | arXiv: 2507.15260
ec-flow enabling versatile robotic manipulation from action-unlabeled videos via | arXiv: 2507.06224
aligning information capacity between vision and language via dense-to-sparse fe
aligning information capacity between vision and language via dense to sparse feature distillation
langbridge interpreting image as a combination of language embeddings
monster a unified model for motion scene text retrieval
ocr hinders rag evaluating the cascading impact of ocr on retrieval-augmented ge | arXiv: 2412.02592
representation shift unifying token compression with flashattention | arXiv: 2508.00367
vilu learning vision-language uncertainties for failure prediction | arXiv: 2507.07620
aim amending inherent interpretability via self-supervised masking | arXiv: 2508.11502
argotweak towards self-updating hd maps through structured priors | arXiv: 2509.08764
ce-fam concept-based explanation via fusion of activation maps | arXiv: 2509.23849
granular concept circuits toward a fine-grained circuit discovery for concept re | arXiv: 2508.01728
learnable fractional reaction-diffusion dynamics for under-display tof imaging a | arXiv: 2511.01704
minerva evaluating complex video reasoning | arXiv: 2505.00681
principal components enable a new language of images | arXiv: 2503.08685
svip semantically contextualized visual patches for zero-shot learning | arXiv: 2503.10252
vital more understandable feature visualization through distribution alignment a | arXiv: 2503.22399
3dsrbench a comprehensive 3d spatial reasoning benchmark | arXiv: 2412.07825
a conditional probability framework for compositional zero-shot learning | arXiv: 2507.17377
a conditional probability framework for compositional zerosh | arXiv: 2507.17377
a real-world display inverse rendering dataset | arXiv: 2508.14411
a realworld display inverse rendering dataset | arXiv: 2508.14411
batclip bimodal online test-time adaptation for clip | arXiv: 2412.02837
discopatch taming adversarially-driven batch statistics for improved out-of-dist | arXiv: 2501.08005
dista-net dynamic closely-spaced infrared small target unmixing | arXiv: 2505.19148
forcennet foreground-centric network for document image rectification | arXiv: 2507.19804
generative zoo | arXiv: 2412.08101
hiero understanding the hierarchy of human behavior enhances reasoning on egocen | arXiv: 2505.12911
imbalance in balance online concept balancing in generation models | arXiv: 2507.13345
intersyn interleaved learning for dynamic motion synthesis in the wild | arXiv: 2508.10297
odp-bench benchmarking out-of-distribution performance prediction | arXiv: 2510.27263
omnidiff a comprehensive benchmark for fine-grained image difference captioning | arXiv: 2503.11093
on the robustness tradeoff in fine-tuning | arXiv: 2503.14836
rethinking few shot clip benchmarks a critical analysis in the inductive setting | arXiv: 2507.20834
shadowhack hacking shadows via luminance-color divide and conquer | arXiv: 2412.02545
spectral sensitivity estimation with an uncalibrated diffraction grating | arXiv: 2508.00330
supercharging floorplan localization with semantic rays | arXiv: 2507.09291
svtrv2 ctc beats encoder-decoder models in scene text recognition | arXiv: 2411.15858
any-ssr how recursive least squares works in continual learning of large languag
any ssr how recursive least squares works in continual learning of large language models
va gpt aligning effective tokens video anomaly | arXiv: 2508.06350
vim versatile interactive motion language model | arXiv: 2410.05628
ace-g improving generalization of scene coordinate regression through query pre- | arXiv: 2510.11605
aceg improving generalization of scene coordinate regression | arXiv: 2510.11605
conststyle robust domain generalization with unified style transformation | arXiv: 2509.05975
dataset ownership verification for pre-trained masked models | arXiv: 2507.12022
eta energy-based test-time adaptation for depth completion | arXiv: 2508.05989
flow to the mode mode-seeking diffusion autoencoders for state-of-the-art image | arXiv: 2503.11056
image intrinsic scale assessment bridging the gap between quality and resolution | arXiv: 2502.06476
make your training flexible towards deployment-efficient video models | arXiv: 2503.14237
adversarial robust memory-based continual learner | arXiv: 2311.17608
chartcap mitigating hallucination of dense chart captioning | arXiv: 2508.03164
forgetting through transforming enabling federated unlearning via class-aware re | arXiv: 2410.06848
temporal unlearnable examples preventing personal video data from unauthorized e | arXiv: 2507.07483
b vllm a vision large language model with balanced spatio temporal tokens
motionfollower editing video motion via score-guided diffusion | arXiv: 2405.20325
adaptive prompt learning via gaussian outlier synthesis for out of distribution detection
aigi holmes towards explainable and generalizable ai generated image detection via mllm
aircache activating inter modal relevancy kv cache compression for efficient large vision language model
coa-vla improving vision-language-action models via visual-text chain-of-afforda | arXiv: 2412.20451
gtr guided thought reinforcement prevents thought collapse in rl-based vlm agent | arXiv: 2503.08525
vq focusambiguity acknowledging focus ambiguity visual questions | arXiv: 2501.02201
learning 4d embodied world models | arXiv: 2504.20995
a plug-and-play physical motion restoration approach for in- | arXiv: 2412.17377
lawdis language-window-based controllable dichotomous image segmentati | arXiv: 2508.01152
gradient extrapolation for debiased representation learning | arXiv: 2503.13236
propvg end-to-end proposal-driven visual grounding with multi-granularity discri | arXiv: 2509.04833
i2-world intra-inter tokenization for efficient dynamic 4d scene forecasting | arXiv: 2507.09144
adversarial distribution matching for diffusion distillation towards efficient i | arXiv: 2507.18569
adversarial distribution matching for diffusion distillation towards efficient image and video synthesis
aid adapting image2video diffusion models for instruction-guided video predictio | arXiv: 2406.06465
aligning moments in time using video queries | arXiv: 2508.15439
badvideo stealthy backdoor attack against text-to-video generation | arXiv: 2504.16907
causal-entity reflected egocentric traffic accident video synthesis | arXiv: 2506.23263
d3 training-free ai-generated video detection using second-order features | arXiv: 2508.00701
dacon dino for anime paint bucket colorization with any number of reference imag | arXiv: 2509.14685
decouple and track benchmarking and improving video diffusion transformers for m | arXiv: 2503.17350
dh-facevid-1k a large-scale high-quality dataset for face video generation | arXiv: 2410.07151
disentangled world models learning to transfer semantic knowledge from distracti | arXiv: 2503.08751
dive taming dino for subject-driven video editing
dollar few-step video generation via distillation and latent reward optimization | arXiv: 2412.15689
dollar fewstep video generation via distillation and latent | arXiv: 2412.15689
dreamrelation relation-centric video customization | arXiv: 2503.07602
dual-expert consistency model for efficient and high-quality video generation | arXiv: 2506.03123
dualreal adaptive joint training for lossless identity-motion fusion in video cu | arXiv: 2505.02192
efficientmt efficient temporal adaptation for motion transfer in text-to-video d
etva evaluation of text-to-video alignment via fine-grained question generation | arXiv: 2503.16867
free-form motion control controlling the 6d poses of camera and objects in video | arXiv: 2501.01425
fuxi-rtm a physics-guided prediction framework with radiative transfer modeling | arXiv: 2503.19940
fvgen accelerating novel-view synthesis with adversarial video diffusion distill | arXiv: 2508.06392
generating fast and slow scalable parallel video generation with video interface | arXiv: 2503.17539
leanvae an ultra-efficient reconstruction vae for video diffusion models | arXiv: 2503.14325
long context tuning for video generation | arXiv: 2503.10589
magicdrive-v2 high-resolution long video generation for autonomous driving with | arXiv: 2411.13807
magicmirror id-preserved video generation in video diffusion transformers | arXiv: 2501.03931
motionagent fine-grained controllable video generation via motion field agent | arXiv: 2502.03207
motionshot adaptive motion transfer across arbitrary objects for text-to-video g | arXiv: 2507.16310
multi-identity human image animation with structural video diffusion | arXiv: 2504.04126
normalcrafter learning temporally consistent normals from video diffusion priors | arXiv: 2504.11427
ock unsupervised dynamic video prediction with object-centric kinematics | arXiv: 2404.18423
omnihuman-1 rethinking the scaling-up of one-stage conditioned human animation m | arXiv: 2502.01061
prompt-a-video prompt your video diffusion model via preference-aligned llm | arXiv: 2412.15156
quantifying and narrowing the unknown interactive text-to-video retrieval via un | arXiv: 2507.15504
realcam-i2v real-world image-to-video generation with interactive complex camera | arXiv: 2502.10059
reangle-a-video 4d video generation as video-to-video translation | arXiv: 2503.09151
recammaster camera-controlled generative rendering from a single video | arXiv: 2503.11647
steerx creating any camera-free 3d and 4d scenes with geometric steering | arXiv: 2503.12024
stiv scalable text and image conditioned video generation | arXiv: 2412.07730
sweettok semantic-aware spatial-temporal tokenizer for compact video discretizat | arXiv: 2412.10443
tip-i2v a million-scale real text and image prompt dataset for image-to-video ge | arXiv: 2411.04709
vace all-in-one video creation and editing | arXiv: 2503.07598
vace allinone video creation and editing | arXiv: 2503.07598
versatile transition generation with image-to-video diffusion | arXiv: 2508.01698
vip iterative online preference distillation for efficient video diffusion model | arXiv: 2508.03254
vmbench a benchmark for perception-aligned video motion generation | arXiv: 2503.10076
vpo aligning text-to-video generation models with prompt optimization | arXiv: 2503.20491
4d bench benchmarking multimodal llms for 4d object understanding
adaptive hyper graph convolution network skeleton action recognition
aim adaptive inference multimodal llms token merging pruning | arXiv: 2412.03248
aim adaptive inference of multi modal llms via token merging and pruning
despite exploring contrastive deep skeleton-pointcloud-imu-text embeddings for a | arXiv: 2506.13897
prior-flow enhancing primitive panoramic optical flow with o | arXiv: 2506.23897