CVPR 2025 Paper Notes TODO

Total: 3299 papers | Completed: 2019 | Pending update: 1280

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification | arXiv: 2412.00678
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes | arXiv: 2411.14974
3D Dental Model Segmentation with Geometrical Boundary Preserving | arXiv: 2503.23702
3D Face Reconstruction From Radar Images | arXiv: 2412.02403
3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations | arXiv: 2504.14967
3D Gaussian Inpainting with Depth-Guided Cross-View Consistency | arXiv: 2502.11801
3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation | arXiv: 2502.04074
3D Student Splatting and Scooping (SSS) | arXiv: 2503.10148
3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation | arXiv: 2406.09126
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination | arXiv: 2406.05132
3D-GSW: 3D Gaussian Splatting for Robust Watermarking | arXiv: 2409.13222
3D-HGS: 3D Half-Gaussian Splatting | arXiv: 2406.02720
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer | arXiv: 2501.01163
3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning | arXiv: 2411.17735
3D-MVP: 3D Multiview Pretraining for Robotic Manipulation | arXiv: 2406.18158
3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping
3denhancer consistent multi-view diffusion for 3d enhancement | arXiv: 2412.18565
3dgut enabling distorted cameras and secondary rays in gaussian splatting | arXiv: 2412.12507
3dtopia-xl scaling high-quality 3d asset generation via primitive diffusion | arXiv: 2409.12957
4d langsplat 4d language gaussian splatting via multimodal large language models | arXiv: 2503.10437
4d-fly fast 4d reconstruction from a single monocular video
4deform neural surface deformation for robust shape interpolation | arXiv: 2502.20208
4dequine disentangling motion and appearance for 4d equine reconstruction from m | arXiv: 2603.10125
4dgc rate-aware 4d gaussian compression for efficient streamable free-viewpoint | arXiv: 2503.18421
4dtam non-rigid tracking and mapping via dynamic surface gaussians | arXiv: 2505.22859
4real-video learning generalizable photo-realistic 4d video diffusion | arXiv: 2412.04462
5%>100%: breaking performance shackles of full fine-tuning on visual recognition tas
a bias-free training paradigm for more general ai-generated image detection | arXiv: 2412.17671
a closed-form solution for debiasing vision-language models with utility guarant | arXiv: 2603.12998
a closer look at time steps is worthy of triple speed-up for diffusion model tra
a comprehensive study of decoder-only llms for text-to-image generation | arXiv: 2506.08210
a data-centric revisit of pre-trained vision models for robot learning | arXiv: 2503.06960
a dataset for semantic segmentation in the presence of unknowns | arXiv: 2503.22309
a distractor-aware memory for visual object tracking with sam2 | arXiv: 2411.17576
a flag decomposition for hierarchical datasets | arXiv: 2502.07782
a focused human body model for accurate anthropometric measurements extraction
a general adaptive dual-level weighting mechanism for remote sensing pansharpeni
a hubness perspective on representation learning for graph-based multi-view clus
a lightweight udf learning framework for 3d reconstruction based on local shape | arXiv: 2407.01330
a neuro-symbolic framework combining inductive and deductive reasoning for auton | arXiv: 2603.12421
a new statistical model of star speckles for learning to detect and characterize
a physics-informed blur learning framework for imaging systems | arXiv: 2502.11382
a polarization-aided transformer for image deblurring via motion vector decompos
a prediction-as-perception framework for 3d object detection | arXiv: 2603.12599
a regularization-guided equivariant approach for image restoration | arXiv: 2505.19799
a selective re-learning mechanism for hyperspectral fusion imaging
a semantic knowledge complementarity based decoupling framework for semi-supervi
a semi-supervised framework for breast ultrasound segmentation with training-fre | arXiv: 2603.06167
a simple data augmentation for feature distribution skewed federated learning | arXiv: 2306.09363
a simple yet effective layout token in large language models for document unders
a stitch in time saves nine small vlm is a precise guidance for accelerating lar
a tale of two classes adapting supervised contrastive learning to binary imbalan
a theory of learning unified model via knowledge integration from label space va
a unified approach to interpreting self-supervised pre-training methods for 3d p
a unified framework for heterogeneous semi-supervised learning | arXiv: 2503.00286
a unified image-dense annotation generation model for underwater scenes | arXiv: 2503.21771
a unified latent schrodinger bridge diffusion model for unsupervised anomaly det
a unified model for compressed sensing mri across undersampling patterns
a unified resilient and explainable adversarial patch detector
a universal scale-adaptive deformable transformer for image restoration across d
a2z-10m geometric deep learning with a-to-z brep annotations for ai-assisted cad | arXiv: 2603.12605
a3 few-shot prompt learning of unlearnable examples with cross-modal adversarial
a4a adapter for adapter transfer via all-for-all mapping for cross-architecture
aa-clip enhancing zero-shot anomaly detection via anomaly-aware clip | arXiv: 2503.06661
abbspo adaptive bounding box scaling and symmetric prior based orientation predi
abc-former auxiliary bimodal cross-domain transformer with interactive channel a
abra teleporting fine-tuned knowledge across domains for open-vocabulary object | arXiv: 2603.12409
ac3d analyzing and improving 3d camera control in video diffusion transformers
acattack adaptive cross attacking rgb-t tracker via multi-modal response decoupl
acc3d accelerating single image to 3d diffusion models via edge consistency guid
accelerating diffusion transformer via increment-calibrated caching with channel
accelerating multimodal large language models by searching optimal vision token
accelerating stroke mri with diffusion probabilistic models through large-scale | arXiv: 2603.13007
accurate differential operators for hybrid neural fields
accurate scene text recognition with efficient model scaling and cloze self-dist
ace anti-editing concept erasure in text-to-image models
acl activating capability of linear attention for image restoration
acquire and then adapt squeezing out text-to-image model for image restoration
action detail matters refining video recognition with local action queries
activating sparse part concepts for 3d class incremental learning
active data curation effectively distills large-scale multimodal models | arXiv: 2411.18674
active event-based stereo vision
active hyperspectral imaging using an event camera
activegamer active gaussian mapping through efficient rendering | arXiv: 2501.06897
adacm2 on understanding extremely long-term video with adaptive cross-modality m
adadare-gamma balancing stability and plasticity in multi-modal llms through eff
adamms model merging for heterogeneous multimodal large language models with uns
adaptation of weakly supervised localization in histopathology by debiasing pred | arXiv: 2603.12468
adaptcmvc robust adaption to incremental views in continual multi-view clusterin
adapter merging with centroid prototype mapping for scalable class-incremental l | arXiv: 2412.18219
adapting dense matching for homography estimation with grid-based acceleration
adapting pre-trained 3d models for point cloud video understanding via cross-fra
adapting text-to-image generation with feature difference instruction for generi
adapting to observation length of trajectory prediction via contrastive learning
adapting to the unknown training-free audio-visual event perception with dynamic
adaptive dropout unleashing dropout across layers for generalizable image super-
adaptive keyframe sampling for long video understanding
adaptive markup language generation for contextually-grounded visual document un
adaptive non-uniform timestep sampling for accelerating diffusion model training
adaptive parameter selection for tuning vision-language models
adaptive part learning for fine-grained generalized category discovery a plug-an
adaptive rectangular convolution for remote sensing pansharpening
adaptive unimodal regulation for balanced multimodal information acquisition
add attribution-driven data augmentation framework for boosting image super-reso
addressing data scarcity in 3d trauma detection through self-supervised and semi | arXiv: 2603.12514
admit adaptive multi-source tuning in dynamic environments
adu adaptive detection of unknown categories in black-box domain adaptation
adv-cpg a customized portrait generation framework with facial adversarial attac
advancing adversarial robustness in gnerfs the il2-nerf attack
advancing generalizable tumor segmentation with anomaly-aware open-vocabulary at
advancing manga analysis comprehensive segmentation annotations for the manga109
advancing multiple instance learning with continual learning for whole slide ima
advancing myopia to holism fully contrastive language-image pre-training | arXiv: 2412.00440
advancing semantic future prediction through multimodal visual sequence transfor
adventurer optimizing vision mamba architecture designs for efficiency | arXiv: 2410.07599
adversarial diffusion compression for real-world image super-resolution | arXiv: 2411.13383
adversarial domain prompt tuning and generation for single domain generalization
aerialmegadepth learning aerial-ground reconstruction and view synthesis | arXiv: 2504.13157
aerogen enhancing remote sensing object detection with diffusion-driven data gen
aespa attention-guided self-supervised parallel imaging for mri reconstruction
aesthetic post-training diffusion models from generic preferences with step-by-s
aesthetiq enhancing graphic layout design via aesthetic-aware preference alignme
afforddp generalizable diffusion policy with transferable affordance
afl a single-round analytic approach for federated learning with pre-trained mod
ag-vpreid a challenging large-scale benchmark for aerial-ground video-based pers
ai-face a million-scale demographically annotated ai-generated face dataset and
aigv-assessor benchmarking and evaluating the perceptual quality of text-to-vide
aim-fair advancing algorithmic fairness via selectively fine-tuning biased model
aipparel a multimodal foundation model for digital garments
airroom objects matter in room reidentification
akira augmentation kit on rays for optical video generation
alias-free latent diffusion models improving fractional shift equivariance of di
alien implicit neural representations for human motion prediction under arbitrar
align-a-video deterministic reward tuning of image diffusion models for consiste
align-kd distilling cross-modal alignment knowledge for mobile vision-language l
align3r aligned monocular depth estimation for dynamic videos
alignmamba enhancing multimodal mamba with local and global cross-modal alignmen
alignment mining and fusion representation alignment with hard negative mining a
all languages matter evaluating lmms on culturally diverse 100 languages
all-day multi-camera multi-target tracking
all-directional disparity estimation for real-world qpd images
all-optical nonlinear diffractive deep network for ultrafast image denoising
alternating gradient flow utility a unified metric for structural pruning and dy | arXiv: 2603.12354
amo sampler enhancing text rendering with overshooting | arXiv: 2411.19415
amr-transformer enabling efficient long-range interaction for complex neural flu
an end-to-end robust point cloud semantic segmentation network with single-step
an fpga implementation of displacement vector search for intra pattern copy in j | arXiv: 2603.10671
an image-like diffusion method for human-object interaction detection | arXiv: 2503.18134
analyzing the synthetic-to-real domain gap in 3d hand pose estimation | arXiv: 2503.19307
anatomical consistency and adaptive prior-informed transformation for multi-cont
anchor-aware similarity cohesion in target frames enables predicting temporal mo
anidoc animation creation made easier | arXiv: 2412.14173
anigrad anisotropic gradient-adaptive sampling for 3d reconstruction from monocu
anigs animatable gaussian avatar from a single image with inconsistent gaussian | arXiv: 2412.02684
animate and sound an image
animateanything consistent and controllable animation for video generation | arXiv: 2411.10836
animer animal pose and shape estimation using family aware transformer | arXiv: 2412.00837
animo species-aware model for text-driven animal motion generation
annexe unified analyzing answering and pixel grounding for egocentric interactio
annotation ambiguity aware semi-supervised medical image segmentation
anomalyncd towards novel anomaly class discovery in industrial scenarios | arXiv: 2410.14379
anomize better open vocabulary video anomaly detection | arXiv: 2503.18094
antidote a unified framework for mitigating lvlm hallucinations in counterfactua
any-resolution ai-generated image detection by spectral learning | arXiv: 2411.19417
any3dis class-agnostic 3d instance segmentation by 2d mask tracking | arXiv: 2411.16183
any6d model-free 6d pose estimation of novel objects | arXiv: 2503.18673
anyattack towards large-scale self-supervised adversarial attacks on vision-lang
anycam learning to recover camera poses and intrinsics from casual videos
anydressing customizable multi-garment virtual dressing via latent diffusion mod
anyedit mastering unified high-quality image editing for any idea
anymap learning a general camera model for structure-from-motion with unknown di
anymole any character motion in-betweening leveraging video diffusion models
anysat one earth observation model for many resolutions scales and modalities
aphq-vit post-training quantization with average perturbation hessian based reco
apollo an exploration of video understanding in large multimodal models
apply hierarchical-chain-of-generation to complex attributes text-to-3d generati
apt adaptive personalized training for diffusion models with limited data
ar-diffusion asynchronous video generation with auto-regressive diffusion
arbitrary-steps image super-resolution via diffusion inversion | arXiv: 2412.09013
arc2avatar generating expressive 3d avatars from a single image via id guidance
arche autoregressive residual compression with hyperprior and excitation | arXiv: 2603.10188
arcpro architectural programs for structured 3d abstraction of sparse points
are general-purpose vision models all we need for 2d medical image segmentation | arXiv: 2603.13044
are images indistinguishable to humans also indistinguishable to classifiers
are spatial-temporal graph convolution networks for human action recognition ove
argus a compact and versatile foundation model for vision
argus vision-centric reasoning with grounded chain-of-thought
arkit labelmaker a new scale for indoor 3d scene understanding
arm appearance reconstruction model for relightable 3d generation | arXiv: 2411.10825
around the world in 80 timesteps a generative approach to global visual geolocat
art anonymous region transformer for variable multi-layer transparent image gene
artformer controllable generation of diverse 3d articulated objects | arXiv: 2412.07237
articulated kinematics distillation from video diffusion models | arXiv: 2504.01204
articulatedgs self-supervised digital twin modeling of articulated objects using
artifade learning to generate high-quality subject from blemished images | arXiv: 2409.03745
artiscene language-driven artistic 3d scene generation through image intermediar
as language models scale low-order linear depth dynamics emerge | arXiv: 2603.12541
as-bridge a bidirectional generative framework bridging next-generation astronom | arXiv: 2603.11928
asap advancing semantic alignment promotes multi-modal manipulation detecting an | arXiv: 2412.12718
ashita automatic scene-grounded hierarchical task analysis | arXiv: 2504.06553
asign an anatomy-aware spatial imputation graphic network for 3d spatial transcr
assessing and learning alignment of unimodal vision and language models | arXiv: 2412.04616
association of radiologic ppfe change with mortality in lung cancer screening co | arXiv: 2603.09531
associative transformer | arXiv: 2309.12862
asynchronous collaborative graph representation for frames and events
ata adaptive transformation agent for text-guided subject-position variable back
atom aligning text-to-motion model at event-level with gpt-4vision reward
atp adaptive threshold pruning for efficient data encoding in quantum neural net
atp-llava adaptive token pruning for large vision language models
attend to not attended structure-then-detail token merging for post-training dit
attention distillation a unified approach to visual characteristics transfer
attention iou examining biases in celeba using attention maps
attraction diminishing and distributing for few-shot class-incremental learning
attribute-formed class-specific concept space endowing language bottleneck model
attribute-missing multi-view graph clustering
audcast audio-driven human video generation by cascaded diffusion transformers
audio-visual instance segmentation | arXiv: 2310.18709
audio-visual semantic graph network for audio-visual event localization
augmented deep contexts for spatially embedded video coding
augmenting multimodal llms with self-reflective tokens for knowledge-based visua
augmenting perceptual super-resolution via image quality predictors | arXiv: 2504.18524
aurafusion360 augmented unseen region alignment for reference-based 360° unbou
auto cherry-picker learning from high-quality generative data driven by language
auto-encoded supervision for perceptual image super-resolution
autolut lut-based image super-resolution with automatic sampling and adaptive re
automated detection of malignant lesions in the ovary using deep learning models | arXiv: 2603.11818
automated generation of challenging multiple-choice questions for vision languag
automated proof of polynomial inequalities via reinforcement learning
automatic joint structured pruning and quantization for efficient neural network | arXiv: 2502.16638
automatic spectral calibration of hyperspectral images method dataset and benchm
autopresent designing structured visuals from scratch | arXiv: 2501.00912
autoregressive distillation of diffusion transformers | arXiv: 2504.11295
autoregressive sequential pretraining for visual tracking
autossvh exploring automated frame sampling for efficient self-supervised video | arXiv: 2504.03587
autourdf unsupervised robot modeling from point cloud frames using cluster regis
avatarartist open-domain 4d avatarization | arXiv: 2503.19906
avf-mae scaling affective video facial masked autoencoders via efficient audio-v
avqacl a novel benchmark for audio-visual question answering continual learning
bacon improving clarity of image captions via bag-of-concept graphs | arXiv: 2407.03314
badgr bundle adjustment diffusion conditioned by gradients for wide-baseline flo
badtoken token-level backdoor attacks to multi-modal large language models
balanced direction from multifarious choices arithmetic meta-learning for domain
balanced rate-distortion optimization in learned image compression
balancing two classifiers via a simplex etf structure for model calibration
bard-gs blur-aware reconstruction of dynamic scenes via gaussian splatting
bases of steerable kernels for equivariant cnns from 2d rotations to the lorentz | arXiv: 2603.12459
basket a large-scale video dataset for fine-grained skill estimation
bayesian prompt flow learning for zero-shot anomaly detection
bayesian test-time adaptation for vision-language models
be more specific evaluating object-centric realism in synthetic images
behaviorvlm unified finetuning-free behavioral understanding with vision-languag | arXiv: 2603.12176
believing is seeing unobserved object detection using generative models
benchmarking large vision-language models via directed scene graph for comprehen
benchmarking object detectors under real-world distribution shifts in satellite
bendfm a taxonomy and synthetic cad dataset for manufacturability assessment in | arXiv: 2603.13102
beta-fft nonlinear interpolation and differentiated training strategies for semi
bevdiffuser plug-and-play diffusion model for bev denoising with ground-truth gu
beyond background shift rethinking instance replay in continual semantic segment
beyond clean training data a versatile and model-agnostic framework for out-of-d
beyond convolution a taxonomy of structured operators for learning-based image p | arXiv: 2603.12067
beyond final answers crystal benchmark for transparent multimodal reasoning eval | arXiv: 2603.13099
beyond generation a diffusion-based low-level feature extractor for detecting ai
beyond human perception understanding multi-object world from monocular view
beyond image classification a video benchmark and dual-branch hybrid discriminat
beyond local sharpness communication-efficient global sharpness-aware minimizati
beyond sight towards cognitive alignment in lvlm via enriched visual knowledge
beyond single-modal boundary cross-modal anomaly detection through visual protot
beyond single-sample reliable multi-sample distillation for video understanding | arXiv: 2603.11423
beyond words augmenting discriminative richness via diffusions in unsupervised p | arXiv: 2504.11930
bf-stvsr b-splines and fourier---best friends for high fidelity spatial-temporal | arXiv: 2501.11043
bfanet revisiting 3d semantic segmentation with boundary feature analysis | arXiv: 2503.12539
bg-triangle bezier gaussian triangle for 3d vectorization and rendering
bhvit binarized hybrid vision transformer | arXiv: 2503.02394
bias for action video implicit neural representations with bias modulation | arXiv: 2501.09277
biclip bidirectional and consistent language-image processing for robust medical | arXiv: 2603.00156
bigain unified token compression for joint generation and classification | arXiv: 2603.12240
bigs bimanual category-agnostic interaction reconstruction from monocular videos
bilora almost-orthogonal parameter spaces for continual learning
bim-vfi bidirectional motion field-guided frame interpolation for video with non | arXiv: 2412.11365
bimart a unified approach for the synthesis of 3d bimanual interaction with arti
bimba selective-scan compression for long-range video question answering | arXiv: 2503.09590
binarized mamba-transformer for lightweight quad bayer hybridevs demosaicing | arXiv: 2503.16134
binarized neural network for multi-spectral image fusion
binwang2hfnet geogran-aware hierarchical feature fusion network for salient obje | arXiv: 2603.12680
biomedcoop learning to prompt for biomedical vision-language models
biomedica an open biomedical image-caption archive dataset and vision-language m
biox-cpath biologically-driven explainable diagnostics for multistain ihc comput
bip3d bridging 2d images and 3d perception for embodied intelligence
birth and death of a rose
bizgen advancing article-level visual text rendering for infographics generation
black hole-driven identity absorbing in diffusion models
black swan abductive and defeasible video reasoning in unpredictable events
black-box forgery attacks on semantic watermarks for diffusion models
blade single-view body mesh estimation through accurate depth estimation | arXiv: 2412.08640
blendergym benchmarking foundational model systems for graphics editing
blind bitstream-corrupted video recovery via metadata-guided diffusion model
blobgen-vid compositional text-to-video generation with blob video representatio
blockdance reuse structurally similar spatio-temporal features to accelerate dif
blood flow speed estimation with optical coherence tomography angiography images
bluelm-v-3b algorithm and system co-design for multimodal large language models
blurred lidar for sharper 3d robust handheld 3d scanning with diffuse lidar and
blurry-edges photon-limited depth estimation from defocused boundaries | arXiv: 2503.23606
boe-vit boosting orientation estimation with equivariance in self-supervised 3d
bolt boost large vision-language model without training for long-form video unde
boltzmann attention sampling for image analysis with small objects | arXiv: 2503.02841
boost the inference with co-training a depth-guided mutual learning framework fo
boost your human image generation model via direct preference optimization | arXiv: 2405.20216
boosting adversarial transferability through augmentation in hypothesis space
boosting domain incremental learning selecting the optimal parameters is all you | arXiv: 2505.23744
boosting point-supervised temporal action localization through integrating query
boosting the dual-stream architecture in ultra-high resolution segmentation with
bootplace bootstrapped object placement with detection transformers | arXiv: 2503.21991
bootstrap your own views masked ego-exo modeling for fine-grained view-invariant | arXiv: 2503.19706
boow-vton boosting in-the-wild virtual try-on via mask-free pseudo data training | arXiv: 2408.06047
boss a best-of-strategies selector as an oracle for deep active learning | arXiv: 2603.13109
bounds on agreement between subjective and objective measurements | arXiv: 2603.13204
brain-inspired spiking neural networks for energy-efficient object detection
breaking the low-rank dilemma of linear attention | arXiv: 2411.07635
breaking the memory barrier of contrastive loss via tile-based strategy
breaking the tuning barrier zero-hyperparameters yield multi-corner analysis via | arXiv: 2603.13092
brepgiff lightweight generation of complex b-rep with 3d gat diffusion
bridge frame and event common spatiotemporal fusion for high-dynamic scene optic
bridge the gap from weak to full supervision for temporal action localization wi
bridging gait recognition and large language models sequence modeling
bridging modalities improving universal multimodal retrieval by multimodal large
bridging past and future end-to-end autonomous driving with historical predictio
bridging the gap between gaussian diffusion models and universal quantization fo
bridging the skill gap in clinical cbct interpretation with cbctrepd | arXiv: 2603.10933
bridging the vision-brain gap with an uncertainty-aware blur prior
bridging viewpoint gaps geometric reasoning boosts semantic correspondence
bringing clip to the clinic dynamic soft labels and negation-aware learning for
buffer anytime zero-shot video depth and normal from image priors
building a mind palace structuring environment-grounded semantic graphs for effe
building vision models upon heat conduction | arXiv: 2405.16555
bwformer building wireframe reconstruction from airborne lidar point cloud with
bytheway boost your text-to-video generation model to higher quality in a traini
cachequant comprehensively accelerated diffusion models | arXiv: 2503.01323
cad-llama leveraging large language models for computer-aided design parametric
cadcrafter generating computer-aided design models from unconstrained images | arXiv: 2504.04753
caddreamer cad object generation from single-view images | arXiv: 2502.20732
cadref robust out-of-distribution detection via class-aware decoupled relative f
calibrated multi-preference optimization for aligning diffusion models
calico part-focused semantic co-segmentation with large vision-language models | arXiv: 2412.19331
camera resection from known line pencils and a radially distorted scanline
camfreediff camera-free image to panorama generation with diffusion model | arXiv: 2407.07174
camouflage anything learning to hide using controlled out-painting and represent
campoint boosting point cloud segmentation with virtual camera
camuvid calibration-free multi-view detection
can generative video models help pose estimation | arXiv: 2412.16155
can large vision-language models correct semantic grounding errors by themselves | arXiv: 2404.06510
can machines understand composition dataset and benchmark for photographic image
can text-to-video generation help video-language alignment | arXiv: 2503.18507
can't slow me down learning robust and hardware-adaptive object detectors against
cap-net a unified network for 6d pose and size estimation of categorical articul
cap4d creating animatable 4d portrait avatars with morphable multi-view diffusio
care transformer mobile-friendly linear visual transformer via decoupled dual in
caricaturebooth data-free interactive caricature generation in a photo booth
carl a framework for equivariant image registration | arXiv: 2405.16738
carplanner consistent auto-regressive trajectory planning for large-scale reinfo
casagpt cuboid arrangement and scene assembly for interior design
casp compression of large multimodal models based on attention sparsity
casp consistency-aware audio-induced saliency prediction model for omnidirection
cat4d create anything in 4d with multi-view video diffusion models
catanet efficient content-aware token aggregation for lightweight image super-re
category-agnostic neural object rigging | arXiv: 2505.20283
causal composition diffusion model for closed-loop traffic generation
cav-mae sync improving contrastive audio-visual mask autoencoders via fine-grain
cawm-mamba a unified model for infrared-visible image fusion and compound advers | arXiv: 2603.02560
ccin compositional conflict identification and neutralization for composed image
cdi copyrighted data identification in diffusion models
certified human trajectory prediction | arXiv: 2403.13778
cgmatch a different perspective of semi-supervised learning
ch3depth efficient and flexible depth foundation model with flow matching
chain of attack on the robustness of vision-language models against transfer-bas
chain of semantics programming in 3d gaussian splatting representation for 3d vi
chainhoi joint-based kinematic chain modeling for human-object interaction gener
change3d revisiting change detection and captioning from a video modeling perspe
channel consistency prior and self-reconstruction strategy based unsupervised im
channel-wise noise scheduled diffusion for inverse rendering in indoor scenes | arXiv: 2503.09993
chapter-llama efficient chaptering in hour-long videos with llms | arXiv: 2504.00072
charm the missing piece in vit fine-tuning for image aesthetic assessment | arXiv: 2504.02522
chat-based person retrieval via dialogue-refined cross-modal alignment
chat2svg vector graphics generation with large language models and image diffusi
chatgarment garment estimation generation and editing via large language models | arXiv: 2412.17811
chatgen automatic text-to-image generation from freestyle chatting | arXiv: 2411.17176
chathuman chatting about 3d humans with tools | arXiv: 2405.04533
cheb-gr rethinking k-nearest neighbor search in re-ranking for person re-identif
chebyshev attention depth permutation texture network with latent texture attrib
checkmanual a new challenge and benchmark for manual-based appliance manipulatio
chexwhatsapp a dataset for exploring challenges in the diagnosis of chest x-rays
chexworld exploring image world modeling for radiograph representation learning | arXiv: 2504.13820
cholectrack20 a multi-perspective tracking dataset for surgical tools | arXiv: 2312.07352
circumventing shortcuts in audio-visual deepfake detection datasets with unsuper
citywalker learning embodied urban navigation from web-scale videos | arXiv: 2411.17820
cl-lora continual low-rank adaptation for rehearsal-free class-incremental learn | arXiv: 2505.24816
cl-moe enhancing multimodal large language model with dual momentum mixture-of-e
classic video denoising in a machine learning world robust fast and controllable | arXiv: 2504.03136
classifier-free guidance inside the attraction basin may cause memorization | arXiv: 2411.16738
classifier-guided clip distillation for unsupervised multi-label classification | arXiv: 2503.16873
classifier-to-bias toward unsupervised automatic bias detection for visual class
cleandift diffusion features without noise | arXiv: 2412.03439
clearsight visual signal enhancement for object hallucination mitigation in mult
climbingcap multi-modal dataset and method for rock climbing in world coordinate | arXiv: 2503.21268
clip is almost all you need towards parameter-efficient scene text retrieval wit
clip is strong enough to fight back test-time counterattacks towards zero-shot a
clip under the microscope a fine-grained analysis of multi-object representation | arXiv: 2502.19842
clip-driven coarse-to-fine semantic guidance for fine-grained open-set semi-supe
cloc contrastive learning for ordinal classification with multi-margin n-pair lo
cloe expert consistency learning for missing modality segmentation | arXiv: 2603.09316
closed-loop supervised fine-tuning of tokenized traffic models | arXiv: 2412.05334
closest neighbors are harmful for lightweight masked auto-encoders
cmmloc advancing text-to-pointcloud localization with cauchy-mixture-model based | arXiv: 2503.02593
co-op correspondence-based novel object pose estimation | arXiv: 2503.17731
co-speech gesture video generation with implicit motion-audio entanglement
co-spy combining semantic and pixel features to detect synthetic images by ai | arXiv: 2503.18286
coa towards real image dehazing via compression-and-adaptation | arXiv: 2504.05590
coap memory-efficient training with correlation-aware gradient projection | arXiv: 2412.00071
coarse correspondences boost spatial-temporal reasoning in multimodal language m | arXiv: 2408.00754
cob-gs clear object boundaries in 3dgs segmentation based on boundary-adaptive g | arXiv: 2503.19443
cobra combinatorial retrieval augmentation for few-shot adaptation | arXiv: 2412.17684
cocoer aligning multi-level feature by competition and coordination for emotion
cocogaussian leveraging circle of confusion for gaussian splatting from defocuse | arXiv: 2412.16028
code-as-monitor constraint-aware visual programming for reactive and proactive r
codepercept code-grounded visual stem perception for mllms | arXiv: 2603.10757
codrawagents a multi-agent dialogue framework for compositional image generation | arXiv: 2603.12829
coe chain-of-explanation via automatic visual concept circuit description and po
coeff-tuning a graph filter subspace view for tuning attention-based large model | arXiv: 2503.18337
coherent 3d portrait video reconstruction via triplane fusion | arXiv: 2405.00794
colabsfm collaborative structure-from-motion by point cloud registration | arXiv: 2503.17093
collaborative decoding makes visual auto-regressive modeling efficient | arXiv: 2411.17787
collaborative tree search for enhancing embodied multi-agent collaboration
collm a large language model for composed image retrieval | arXiv: 2503.19910
color alignment in diffusion | arXiv: 2503.06746
comapgs covisibility map-based gaussian splatting for sparse novel view synthesi | arXiv: 2503.20998
comatcher multi-view collaborative feature matching | arXiv: 2504.01872
combo conflict mitigation via branched optimization for class incremental segmen
comfybench benchmarking llm-based agents in comfyui for autonomously designing c
comm a coherent interleaved image-text dataset for multimodal understanding and | arXiv: 2406.10462
common3d self-supervised learning of 3d morphable models for common objects in n
commonsense video question answering through video-grounded entailment tree reas
community forensics using thousands of generators to train fake image detectors | arXiv: 2411.04125
comparative evaluation of traditional methods and deep learning for brain glioma | arXiv: 2603.04796
compass control multi object orientation control for text-to-image generation | arXiv: 2504.06752
competition-aware cpc forecasting with near-market coverage | arXiv: 2603.13059
compgs unleashing 2d compositionality for compositional text-to-3d via dynamical
complementary advantages exploiting cross-field frequency correlation for nir-as
completion as enhancement a degradation-aware selective image guided network for | arXiv: 2412.19225
complexity experts are task-discriminative learners for any image restoration | arXiv: 2411.18466
composing driving worlds through disentangled control for adversarial scenario g | arXiv: 2603.12864
composing parts for expressive object generation | arXiv: 2406.10197
compositional caching for training-free open-vocabulary attribute detection | arXiv: 2503.19145
compositional targeted multi-label universal perturbations
comprehensive information bottleneck for unveiling universal attribution to inte
comprehensive relighting generalizable and consistent monocular human relighting | arXiv: 2504.03011
comrope scalable and robust rotary position embedding parameterized by trainable
concept lancet image editing with compositional representation transplant | arXiv: 2504.02828
concept replacer replacing sensitive concepts in diffusion models via precision | arXiv: 2412.01244
conceptguard continual personalized text-to-image generation with forgetting and | arXiv: 2503.10358
condensing action segmentation datasets via generative network inversion | arXiv: 2503.14112
conditional balance improving multi-conditioning trade-offs in image generation | arXiv: 2412.19853
conformal prediction and mllm aided uncertainty quantification in scene graph ge
conformal prediction for zero-shot models | arXiv: 2505.24693
conical visual concentration for efficient large vision-language models
conmo controllable motion disentanglement and recomposition for zero-shot motion | arXiv: 2504.02451
consistency posterior sampling for diverse image synthesis
consistency-aware self-training for iterative-based stereo matching | arXiv: 2503.23747
consistent and controllable image animation with motion diffusion models | arXiv: 2407.15642
consistent normal orientation for 3d point clouds via least squares on delaunay
context-aware multimodal pretraining | arXiv: 2411.15099
context-cir learning from concepts in text for composed image retrieval | arXiv: 2505.20764
context-enhanced memory-refined transformer for online action detection | arXiv: 2503.18359
contextual ad narration with interleaved multimodal sequence | arXiv: 2403.12922
continual learning with vision-language models via semantic-geometry preservatio | arXiv: 2603.12055
continual sft matches multimodal rlhf with negative supervision | arXiv: 2411.14797
continuous 3d perception model with persistent state | arXiv: 2501.12387
continuous adverse weather removal via degradation-aware distillation
continuous locomotive crowd behavior generation | arXiv: 2504.04756
continuous space-time video resampling with invertible motion steganography
continuous subject-specific attribute control in t2i models by identifying seman
controlface harnessing facial parametric control for face rigging | arXiv: 2412.01160
controllable human image generation with personalized multi-garments | arXiv: 2411.16801
convex combination star shape prior for data-driven image semantic segmentation
convex relaxation for robust vanishing point estimation in manhattan world | arXiv: 2505.04788
core4d a 4d human-object-human interaction dataset for collaborative object rear
corrbev multi-view 3d object detection by correlation learning with multi-modal
correcting deviations from normality a reformulated diffusion model for multi-cl
correlative and discriminative label grouping for multi-label visual prompt tuni
cosdh communication-efficient collaborative perception via supply-demand awarene
coser towards consistent dense multiview text-to-image generator for 3d creation
cosmic clique-oriented semantic multi-space integration for robust clip test-tim
cosmos cross-modality self-distillation for vision language pre-training | arXiv: 2412.01814
cospace benchmarking continuous space perception ability for vision-language mod
cot-vla visual chain-of-thought reasoning for vision-language-action models | arXiv: 2503.22020
countllm towards generalizable repetitive action counting via large language mod
counts benchmarking object detectors and multimodal large language models under | arXiv: 2504.10158
cpath-omni a unified multimodal foundation model for patch and whole slide image
crab a unified audio-visual scene understanding model with explicit cooperation | arXiv: 2503.13068
craftsman3d high-fidelity mesh generation with 3d native diffusion and interacti
creating your editable 3d photorealistic avatar with tetrahedron-constrained gau
crisp object pose and shape estimation with test-time adaptation | arXiv: 2412.01052
critic-v vlm critics help catch vlm errors in multimodal reasoning | arXiv: 2411.18203
crocodl cross-device collaborative dataset for localization
cropper vision-language model for image cropping through in-context learning | arXiv: 2408.07790
cross-modal 3d representation with multi-view images and point clouds
cross-modal and uncertainty-aware agglomeration for open-vocabulary 3d scene und
cross-modal causal relation alignment for video question grounding | arXiv: 2503.07635
cross-modal distillation for 2d3d multi-object discovery from 2d motion
cross-modal information flow in multimodal large language models | arXiv: 2411.18620
cross-modal interactive perception network with mamba for lung tumor segmentatio
cross-rejective open-set sar image registration
cross-view completion models are zero-shot correspondence estimators | arXiv: 2412.09072
crossearth-sar a sar-centric and billion-scale geospatial foundation model for d | arXiv: 2603.12008
crossover 3d scene cross-modal alignment | arXiv: 2502.15011
crosssdf 3d reconstruction of thin structures from cross-sections | arXiv: 2412.04120
cryptoface end-to-end encrypted face recognition | arXiv: 2509.00332
csc-pa cross-image semantic correlation via prototype attentions for single-netw
ctrl-d controllable dynamic 3d scene editing with personalized 2d diffusion | arXiv: 2412.01792
ctrl-o language-controllable object-centric visual representation learning | arXiv: 2503.21747
cubify anything scaling indoor 3d object detection | arXiv: 2412.04458
curriculum coarse-to-fine selection for high-ipc dataset distillation | arXiv: 2503.18872
curriculum direct preference optimization for diffusion and consistency models | arXiv: 2405.13637
custany customizing anything from a single example | arXiv: 2406.11643
customized condition controllable generation for video soundtrack
customkd customizing large vision foundation for edge model improvement via know
cxpmrg-bench pre-training and benchmarking for x-ray medical report generation o
cycleulm a unified label-free deep learning framework for ultrasound localisatio | arXiv: 2603.09840
d2it dynamic diffusion transformer for accurate image generation
d2sp dynamic dual-stage purification framework for dual noise mitigation in visi
d3 scaling up deepfake detection by learning from discrepancy
d3-human dynamic disentangled digital human from monocular video | arXiv: 2501.01589
d3ctta domain-dependent decorrelation for continual test-time adaption of 3d lid
da-vpt semantic-guided visual prompt tuning for vision transformers | arXiv: 2505.23694
dacapo score distillation as stacked bridge for fast and high-quality 3d editing
dagsm disentangled avatar generation with gs-enhanced mesh | arXiv: 2411.15205
damm-diffusion learning divergence-aware multi-modal diffusion model for nanopar
darkir robust low-light image restoration | arXiv: 2412.13443
dart disease-aware image-text alignment and self-correcting re-alignment for tru
dashgaussian optimizing 3d gaussian splatting in 200 seconds | arXiv: 2503.18402
data distributional properties as inductive bias for systematic generalization | arXiv: 2502.20499
data synthesis with diverse styles for face recognition via 3dmm-guided diffusio
data-free group-wise fully quantized winograd convolution via learnable scales | arXiv: 2412.19867
data-free universal adversarial perturbation with pseudo-semantic prior | arXiv: 2502.21048
dataset distillation with neural characteristic function a minmax perspective | arXiv: 2502.20653
dcevo discriminative cross-dimensional evolutionary learning for infrared and vi
de2gaze deformable and decoupled representation learning for 3d gaze estimation
deal data-efficient adversarial learning for high-quality infrared imaging | arXiv: 2503.00905
debiasing multimodal large language models via noise-aware preference optimizati
decafnet delegate and conquer for efficient temporal grounding in long videos | arXiv: 2505.16376
decentralized diffusion models | arXiv: 2501.05450
decision spikeformer spike-driven transformer for decision making | arXiv: 2504.03800
declip decoupled learning for open-vocabulary dense perception | arXiv: 2505.04410
decloth decomposable 3d cloth and human body reconstruction from a single image | arXiv: 2503.19373
decoder gradient shield provable and high-fidelity prevention of gradient-based
decoding matters efficient mamba-based decoder with distribution-aware deep supe | arXiv: 2603.12547
decompositional neural scene reconstruction with generative diffusion prior | arXiv: 2503.14830
deconstructing the failure of ideal noise correction a three-pillar diagnosis | arXiv: 2603.12997
decouple distortion from perception region adaptive diffusion for extreme-low bi
decouple-then-merge finetune diffusion models as multi-task learning | arXiv: 2410.06664
decoupled distillation to erase a general unlearning method for any class-centri
decoupled motion expression video segmentation
decoupledgaussian object-scene decoupling for physics-based interaction | arXiv: 2503.05484
decoupling fine detail and global geometry for compressed depth map super-resolu
decoupling training-free guided diffusion by admm | arXiv: 2411.12773
dede detecting backdoor samples for ssl encoders via decoders | arXiv: 2411.16154
deep change monitoring a hyperbolic representative learning framework and a data
deep fair multi-view clustering with attention kan
deep learning based estimation of blood glucose levels from multidirectional scl | arXiv: 2603.12715
deep learning-based assessment of the relation between the third molar and mandi | arXiv: 2603.11850
deepcompress-vit rethinking model compression to enhance efficiency of vision tr
deepla-net very deep local aggregation networks for point cloud analysis
defectfill realistic defect generation with inpainting diffusion model for visua
defmamba deformable visual state space model | arXiv: 2504.05794
defom-stereo depth foundation model based stereo matching | arXiv: 2501.09466
deformable radial kernel splatting | arXiv: 2412.11752
deformcl learning deformable centerline representation for vessel extraction in
degradation-aware feature perturbation for all-in-one image restoration | arXiv: 2505.12630
deim detr with improved matching for fast convergence | arXiv: 2412.04234
dejavid encoder-agnostic learned temporal matching for video classification | arXiv: 2506.12585
delt a simple diversity-driven earlylate training for dataset distillation | arXiv: 2411.19946
denoising functional maps diffusion models for shape correspondence | arXiv: 2503.01845
dense dispersed structured light for hyperspectral 3d imaging of dynamic scenes | arXiv: 2412.01140
dense match summarization for faster two-view estimation | arXiv: 2506.02893
dense-sfm structure from motion with dense consistent matching | arXiv: 2501.14277
denver deformable neural vessel representations for unsupervised video vessel se
depth any camera zero-shot metric depth estimation from any camera | arXiv: 2501.02464
depth-guided bundle sampling for efficient generalizable neural radiance field r | arXiv: 2505.19793
depthcrafter generating consistent long depth sequences for open-world videos | arXiv: 2409.02095
depthcues evaluating monocular depth perception in large vision models | arXiv: 2411.17385
depthsplat connecting gaussian splatting and depth | arXiv: 2410.13862
derivative-free diffusion manifold-constrained gradient for unified xai | arXiv: 2411.15265
ders towards extremely efficient upcycled mixture-of-experts models | arXiv: 2503.01359
descriptor-in-pixel point-feature tracking for pixel processor arrays
design2garmentcode turning design concepts to tangible garments through program | arXiv: 2412.08603
designdiffusion high-quality text-to-design image generation with diffusion mode
desire-gs 4d street gaussians for static-dynamic decomposition and surface recon
desplat decomposed gaussian splatting for distractor-free rendering | arXiv: 2411.19756
detail-preserving latent diffusion for stable shadow removal | arXiv: 2412.17630
detect any mirrors boosting learning reliability on large-scale unlabeled data w
detect-and-guide self-regulation of diffusion models for safe text-to-image gene
detecting adversarial data using perturbation forgery | arXiv: 2405.16226
detecting backdoor attacks in federated learning via direction alignment inspect | arXiv: 2503.07978
detecting open world objects via partial attribute assignment
detecting out-of-distribution through the lens of neural collapse | arXiv: 2311.01479
detection-friendly nonuniformity correction a union framework for infrared uav t
deterministic certification of graph neural networks against graph poisoning att
deterministic image-to-image translation via denoising brownian bridge models wi
deterministic-to-stochastic diverse latent feature mapping for human motion synt
developing foundation models for universal segmentation from 3d whole-body posit | arXiv: 2603.11627
devil is in the detail towards injecting fine details of image prompt in image g
devils in middle layers of large vision-language models interpreting detecting a
dexgrasp anything towards universal robotic dexterous grasping with physics awar | arXiv: 2503.08257
dexhanddiff interaction-aware diffusion planning for adaptive dexterous manipula
dflmoe decentralized federated learning via mixture of experts for medical data | arXiv: 2503.10412
dfm differentiable feature matching for anomaly detection
dformerv2 geometry self-attention for rgbd semantic segmentation | arXiv: 2504.04701
dh-set improving vision-language alignment with diverse and hybrid set-embedding
di-pcg diffusion-based efficient inverse procedural content generation for high-
dic rethinking conv3x3 designs in diffusion models | arXiv: 2501.00603
diet-gs diffusion prior and event stream-assisted motion deblurring 3d gaussian | arXiv: 2503.24210
diff-palm realistic palmprint generation with polynomial creases and intra-class
diff2flow training flow matching models via diffusion model alignment | arXiv: 2506.02221
diffcam data-driven saliency maps by capturing feature differences
differ disentangling identity features via semantic cues for clothes-changing pe
difference inversion interpolate and isolate the difference with token consisten
differentiable inverse rendering with interpretable basis brdfs | arXiv: 2411.17994
difffno diffusion fourier neural operator | arXiv: 2411.09911
difflo semantic-aware lidar odometry with diffusion-based refinement
difflocks generating 3d hair from a single image using diffusion models | arXiv: 2505.06166
diffportrait360 consistent portrait diffusion for 360 view synthesis | arXiv: 2503.15667
diffsensei bridging multi-modal llms and diffusion models for customized manga g | arXiv: 2412.07589
diffusion bridge leveraging diffusion model to reduce the modality gap between t
diffusion model is effectively its own teacher
diffusion renderer neural inverse and forward rendering with video diffusion mod
diffusion self-distillation for zero-shot customized image generation | arXiv: 2411.18616
diffusion-4k ultra-high-resolution image synthesis with latent diffusion models | arXiv: 2503.18352
diffusion-based event generation for high-quality image deblurring
diffusion-based feature denoising and using nnmf for robust brain tumor classifi | arXiv: 2603.13182
diffusion-based realistic listening head generation via hybrid motion modeling
diffusiondrive truncated diffusion model for end-to-end autonomous driving | arXiv: 2411.15139
diffusionsfm predicting structure and motion via ray origin and endpoint diffusi
diffvsgg diffusion-driven online video scene graph generation | arXiv: 2503.13957
difiisr a diffusion model with gradient guidance for infrared image super-resolu
difix3d improving 3d reconstructions with single-step diffusion models | arXiv: 2503.01774
dig scalable and efficient diffusion models with gated linear attention | arXiv: 2405.18428
digit multi-dilated gated encoder and central-adjacent region integrated decoder
digital twin catalog a large-scale photorealistic 3d object digital twin dataset | arXiv: 2504.08541
din diffusion model for robust medical vqa with semantic noisy labels | arXiv: 2503.18536
dinomaly the less is more philosophy in multi-class unsupervised anomaly detecti
dinov2 meets text a unified framework for image- and pixel-level vision-language | arXiv: 2412.16334
dio decomposable implicit 4d occupancy-flow world model
directional label diffusion model for learning from noisy labels
directtrigs triplane-based gaussian splatting field representation for 3d genera
disciple learning interpretable programs for scientific visual discovery | arXiv: 2502.10060
disco4d disentangled 4d human generation and animation from a single image | arXiv: 2409.17280
discovering fine-grained visual-concept relations by disentangled optimal transp
discovla discrepancy reduction in vision language and alignment for parameter-ef
discrete to continuous generating smooth transition poses from sign language obs
disentangled pose and appearance guidance for multi-pose generation
disentangling safe and unsafe image corruptions via anisotropy and locality
diskvps vanishing point detector via hough transform in a disk region
dispider enabling video llms with active real-time interaction via disentangled
disrt-in-bed diffusion-based sim-to-real transfer framework for in-bed human mes
dissecting and mitigating diffusion bias via mechanistic interpretability | arXiv: 2503.20483
distilled prompt learning for incomplete multimodal survival prediction | arXiv: 2503.01653
distilling long-tailed datasets | arXiv: 2408.14506
distilling monocular foundation model for fine-grained depth completion | arXiv: 2503.16970
distilling multi-modal large language models for autonomous driving | arXiv: 2501.09757
distilling spatially-heterogeneous distortion perception for blind image quality
distilling spectral graph for object-context aware open-vocabulary semantic segm
distinctad distinctive audio description generation in contexts | arXiv: 2411.18180
distinguish then exploit source-free open set domain adaptation via weight barco
distraction is all you need for multimodal large language model jailbreaking | arXiv: 2502.10794
distribution prototype diffusion learning for open-set supervised anomaly detect | arXiv: 2502.20981
dit-ic aligned diffusion transformer for efficient image compression | arXiv: 2603.13162
ditask multi-task fine-tuning with diffeomorphic transformations | arXiv: 2502.06029
ditctrl exploring attention control in multi-modal diffusion transformer for tun
div-ff dynamic image-video feature fields for environment understanding in egoce
diverseflow sample-efficient diverse mode coverage in flows | arXiv: 2504.07894
divide and conquer heterogeneous noise integration for diffusion-based adversari | arXiv: 2503.01407
divot diffusion powers video tokenizer for comprehension and generation | arXiv: 2412.04432
divprune diversity-based visual token pruning for large multimodal models | arXiv: 2503.02175
dkc differentiated knowledge consolidation for cloth-hybrid lifelong person re-i
dkdm data-free knowledge distillation for diffusion models with any architecture | arXiv: 2409.03550
dl2g degradation-guided local-to-global restoration for eyeglass reflection remo
dnf unconditional 4d generation with dictionary-based neural fields | arXiv: 2412.05161
dnlut ultra-efficient color image denoising via channel-aware lookup tables | arXiv: 2503.15931
do computer vision foundation models learn the low-level characteristics of the
do imagenet-trained models learn shortcuts the impact of frequency shortcuts on | arXiv: 2503.03519
do visual imaginations improve vision-and-language navigation agents | arXiv: 2503.16394
do we always need the simplicity bias looking for optimal inductive biases in th
do we really need curated malicious data for safety alignment in multi-modal lar
do your best and get enough rest for continual learning | arXiv: 2503.18371
doclayllm an efficient multi-modal extension of large language models for text-r
docopilot improving multimodal models for document-level understanding | arXiv: 2507.14675
docsam unified document image segmentation via query decomposition and heterogen
document haystacks vision-language reasoning over piles of 1000 documents | arXiv: 2411.16740
docvlm make your vlm an efficient reader | arXiv: 2412.08746
dof-gaussian controllable depth-of-field for 3d gaussian splatting | arXiv: 2503.00746
dof-gs adjustable depth-of-field 3d gaussian splatting for post-capture refocusi
domain adaptive diabetic retinopathy grading with model absence and flowing data | arXiv: 2412.01203
domain generalization in clip via learning with diverse text prompts
dont shake the wheel momentum-aware planning in end-to-end autonomous driving
doppelgangers and adversarial vulnerability
doppelgangers improved visual disambiguation with geometric 3d features | arXiv: 2412.05826
dora sampling and benchmarking for 3d shape variational auto-encoders | arXiv: 2412.17808
doracycle domain-oriented adaptation of unified generative model in multimodal c | arXiv: 2503.03651
dornet a degradation oriented and regularized network for blind depth super-reso
dpc dual-prompt collaboration for tuning vision-language models | arXiv: 2503.13443
dpflow adaptive optical flow estimation with a dual-pyramid framework | arXiv: 2503.14880
dpseg dual-prompt cost volume learning for open-vocabulary semantic segmentation | arXiv: 2505.11676
dpu dynamic prototype updating for multimodal out-of-distribution detection | arXiv: 2411.08227
dr splat directly referring 3d gaussian splatting via direct language embedding | arXiv: 2502.16652
dragin3d image editing by dragging in 3d space
drawer digital reconstruction and articulation with environment realism | arXiv: 2504.15278
dreamcache finetuning-free lightweight personalized image generation via feature | arXiv: 2411.17786
dreamomni unified image generation and editing | arXiv: 2412.17098
dreamrelation bridging customization and relation generation | arXiv: 2410.23280
dreamtext high fidelity scene text synthesis | arXiv: 2405.14701
dreamtrack dreaming the future for multimodal visual object tracking
dreamvideo-omni omni-motion controlled multi-subject video customization with la | arXiv: 2603.12257
drive diffusion-based rigging empowers generation of versatile and expressive ch
drivedreamer4d world models are effective data machines for 4d driving scene rep
drivegen generalized and robust 3d detection in driving via controllable text-to
drivegpt4-v2 harnessing large language model capabilities for enhanced closed-lo
drivescape high-resolution driving video generation by multi-view feature fusion
driving by the rules a benchmark for integrating traffic sign regulations into v | arXiv: 2410.23780
drivingsphere building a high-fidelity 4d world for closed-loop simulation | arXiv: 2411.11252
dronesplat 3d gaussian splatting for robust 3d reconstruction from in-the-wild d | arXiv: 2503.16964
dropgaussian structural regularization for sparse-view gaussian splatting | arXiv: 2504.00773
dropoutgs dropping out gaussians for better sparse-view rendering | arXiv: 2504.09491
drvideo document retrieval based long video understanding | arXiv: 2406.12846
dspnet dual-vision scene perception for robust 3d question answering | arXiv: 2503.03190
dsv-lfs unifying llm-driven semantic cues with visual features for robust few-sh
dtgbrepgen a novel b-rep generative model through decoupling topology and geomet
dtos dynamic time object sensing with large multimodal model
dual consolidation for pre-trained model-based domain-incremental learning | arXiv: 2410.00911
dual diffusion for unified image generation and understanding | arXiv: 2501.00289
dual energy-based model with open-world uncertainty estimation for out-of-distri
dual exposure stereo for extended dynamic range 3d imaging | arXiv: 2412.02351
dual focus-attention transformer for robust point cloud registration
dual prompting image restoration with diffusion transformers | arXiv: 2504.17825
dual semantic guidance for open vocabulary semantic segmentation
dual-agent optimization framework for cross-domain few-shot segmentation
dual-granularity semantic guided sparse routing diffusion model for general pans
dual-interrelated diffusion model for few-shot anomaly image generation | arXiv: 2408.13509
dual-view x-ray detection can ai detect prohibited items from dual-view x-ray im
dualpm dual posed-canonical point maps for 3d shape and pose reconstruction | arXiv: 2412.04464
dualtalk dual-speaker interaction for 3d talking head conversations | arXiv: 2505.18096
dune distilling a universal encoder from heterogeneous 2d and 3d teachers | arXiv: 2503.14405
dv-matcher deformation-based non-rigid point cloud matching guided by pre-traine
dvhgnn multi-scale dilated vision hgnn for efficient vision recognition | arXiv: 2503.14867
dvin dynamic visual routing network for weakly supervised referring expression c
dycoke dynamic compression of tokens for fast video large language models | arXiv: 2411.15024
dycon dynamic uncertainty-aware consistency and contrastive learning for semi-su
dyfo a training-free dynamic focus visual search for enhancing lmms in fine-grai
dymo training-free diffusion model alignment with dynamic multi-objective schedu
dyn-hamr recovering 4d interacting hand motion from a dynamic camera | arXiv: 2412.12861
dynamic camera poses and where to find them | arXiv: 2504.17788
dynamic content prediction with motion-aware priors for blind face video restora
dynamic derivation and elimination audio visual segmentation with enhanced audio | arXiv: 2503.12840
dynamic group normalization spatio-temporal adaptation to evolving data statisti
dynamic integration of task-specific adapters for class incremental learning | arXiv: 2409.14983
dynamic motion blending for versatile motion editing | arXiv: 2503.20724
dynamic neural surfaces for elastic 4d shape representation and analysis | arXiv: 2503.03132
dynamic pseudo labeling via gradient cutting for high-low entropy exploration
dynamic stereotype theory induced micro-expression recognition with oriented def
dynamic updates for language adaptation in visual-language tracking | arXiv: 2503.06621
dynamicscaler seamless and scalable video generation for panoramic scenes | arXiv: 2412.11100
dynamode-nerf motion-aware deblurring neural radiance field for dynamic scenes
dynfocus dynamic cooperative network empowers llms with video understanding | arXiv: 2411.12355
dynpose largely improving the efficiency of human pose estimation by a simple dy
dynrefer delving into region-level multimodal tasks via dynamic resolution | arXiv: 2405.16071
dynscene scalable generation of dynamic robotic manipulation scenes for embodied
eap-gs efficient augmentation of pointcloud for 3d gaussian splatting in few-sho
early-bird diffusion investigating and leveraging timestep-aware early-bird tick
earthdial turning multi-sensory earth observations to interactive dialogues | arXiv: 2412.15190
easemvc efficient dual selection mechanism for deep multi-view clustering
easy-editable image vectorization with multi-layer multi-scale distributed visua
easycraft a robust and efficient framework for automatic avatar crafting | arXiv: 2503.01158
easyhoi unleashing the power of large models for reconstructing hand-object inte
ebs-ekf accurate and high frequency event-based star tracking | arXiv: 2503.20101
ecbench can multi-modal foundation models understand the egocentric world a holi
echomatch partial-to-partial shape matching via correspondence reflection
echomimicv2 towards striking simplified and semi-body human animation | arXiv: 2411.10061
echoone segmenting multiple echocardiography planes in one model | arXiv: 2412.02993
echotraffic enhancing traffic anomaly understanding with audio-visual insights
echoworld learning motion-aware world models for echocardiography probe guidance | arXiv: 2504.13065
ecvc exploiting non-local correlations in multiple frames for contextual video c | arXiv: 2410.09706
edcflow exploring temporally dense difference maps for event-based optical flow | arXiv: 2506.03512
eden enhanced diffusion for high-quality large-motion video frame interpolation | arXiv: 2503.15831
edge-sd-sr low latency and parameter efficient on-device super-resolution with s
edgediff edge-aware diffusion network for building reconstruction from point clo
edgemovingnet edge-preserving point cloud reconstruction via joint geometry feat
edgetam on-device track anything model | arXiv: 2501.07256
edit away and my face will not stay personal biometric defense against malicious
editar unified conditional generation with autoregressive models | arXiv: 2501.04699
editing away the evidence diffusion-based image manipulation and the failure mod | arXiv: 2603.12949
editsplat multi-view fusion and attention-guided optimization for view-consisten
edm equirectangular projection-oriented dense kernelized feature matching | arXiv: 2502.20685
eee-bench a comprehensive multimodal electrical and electronics engineering benc
effective cloud removal for remote sensing images by an improved mean-reverting
effective sam combination for open-vocabulary semantic segmentation | arXiv: 2411.14723
efficient ann-guided distillation aligning rate-based features of spiking neural
efficient data driven mixture-of-expert extraction from trained networks | arXiv: 2505.15414
efficient decoupled feature 3d gaussian splatting via hierarchical compression
efficient depth estimation for unstable stereo camera systems on ar glasses | arXiv: 2411.10013
efficient diffusion as low light enhancer | arXiv: 2410.12346
efficient dynamic scene editing via 4d gaussian-based static-dynamic separation | arXiv: 2502.02091
efficient event-based object detection a hybrid neural network with spatial and | arXiv: 2403.10173
efficient fine-tuning and concept suppression for pruned diffusion models | arXiv: 2412.15341
efficient long video tokenization via coordinate-based patch reconstruction | arXiv: 2411.14762
efficient motion-aware video mllm | arXiv: 2503.13016
efficient personalization of quantized diffusion model without backpropagation | arXiv: 2503.14868
efficient rgb-d scene understanding via multi-task adaptive learning and cross-d | arXiv: 2603.07570
efficient test-time adaptive object detection via sensitivity-guided pruning | arXiv: 2506.02462
efficient transfer learning for video-language foundation models | arXiv: 2411.11223
efficient video face enhancement with enhanced spatial-temporal consistency | arXiv: 2411.16468
efficient video super-resolution for real-time rendering with decoupled g-buffer
efficient visual state space model for image deblurring | arXiv: 2405.14343
efficientllava generalizable auto-pruning for large vision-language models
efficientvim efficient vision mamba with hidden state mixer based state space du | arXiv: 2411.15241
effidec3d an optimized decoder for high-performance and efficient 3d medical ima
effortless active labeling for long-term test-time adaptation | arXiv: 2503.14564
ego4o egocentric human motion capture and understanding from multi-modal input | arXiv: 2504.08449
egolife towards egocentric life assistant | arXiv: 2503.03803
egolm multi-modal language model of egocentric motions | arXiv: 2409.18127
egopressure a dataset for hand pressure and pose estimation in egocentric vision | arXiv: 2409.02224
egotextvqa towards egocentric scene-text aware video question answering | arXiv: 2502.07411
eidt-v exploiting intersections in diffusion trajectories for model-agnostic zer
eigengs representation from eigenspace to gaussian image space | arXiv: 2503.07446
electromyography-informed facial expression reconstruction for physiological-bas
embodied scene understanding for vision language models via metavqa | arXiv: 2501.09167
embracing collaboration over competition condensing multiple prompts for visual | arXiv: 2504.21263
emodubber towards high quality and emotion controllable movie dubbing | arXiv: 2412.08988
emoe modality-specific enhanced dynamic emotion experts
emoedit evoking emotions through image manipulation | arXiv: 2405.12661
emotivetalk expressive talking head generation through audio information decoupl
emova empowering language models to see hear and speak with vivid emotions | arXiv: 2409.18042
emphasizing discriminative features for dataset distillation in complex scenario | arXiv: 2410.17193
empowering large language models with 3d situation awareness | arXiv: 2503.23024
empowering llms to understand and generate complex vector graphics | arXiv: 2412.11102
empowering vector graphics with consistently arbitrary viewing and view-dependen
encapsulated composition of text-to-image and text-to-video models for high-qual
end-to-end hoi reconstruction transformer with graph-based encoding | arXiv: 2503.06012
end-to-end implicit neural representations for classification | arXiv: 2503.18123
enduring efficient and robust trajectory prediction attack in autonomous driving
energymogen compositional human motion generation with energy-based diffusion mo
enhanced contrastive learning with multi-view longitudinal data for chest x-ray | arXiv: 2502.20056
enhanced ood detection through cross-modal alignment of multi-modal representati
enhanced then progressive fusion with view graph for multi-view clustering
enhanced visual-semantic interaction with tailored prompts for pedestrian attrib
enhancing 3d gaze estimation in the wild using weak supervision with gaze follow | arXiv: 2502.20249
enhancing adversarial transferability with checkpoints of a single models traini
enhancing creative generation on stable diffusion-based models | arXiv: 2503.23538
enhancing dance-to-music generation via negative conditioning latent diffusion m | arXiv: 2503.22138
enhancing dataset distillation via non-critical region refinement | arXiv: 2503.18267
enhancing diversity for data-free quantization
enhancing facial privacy protection via weakening diffusion purification | arXiv: 2503.10350
enhancing few-shot class-incremental learning via training-free bi-level modalit
enhancing image aesthetics with dual-conditioned diffusion models guided by mult | arXiv: 2603.11556
enhancing online continual learning with plug-and-play state space model and cla
enhancing privacy-utility trade-offs to mitigate memorization in diffusion model | arXiv: 2504.18032
enhancing sam with efficient prompting and preference optimization for semi-supe
enhancing testing-time robustness for trusted multi-view classification in the w
enhancing video-llm reasoning via agent-of-thoughts distillation | arXiv: 2412.01694
enhancing virtual try-on with synthetic pairs and error-aware noise scheduling | arXiv: 2501.04666
enhancing vision-language compositional understanding with multimodal synthetic | arXiv: 2503.01167
enliveninggs active locomotion of 3dgs
entityerasure erasing entity cleanly via amodal entity segmentation and completi
entitysam segment everything in video
entropymark towards more harmless backdoor watermark via entropy-based constrain
envgs modeling view-dependent appearance with environment gaussian | arXiv: 2412.15215
envposer environment-aware realistic human motion estimation from sparse observa
equipose exploiting permutation equivariance for relative camera pose estimation
equivania a spectral method for rotation-equivariant anisotropic image analysis | arXiv: 2603.11294
erase diffusion empowering object removal through calibrating diffusion pathways | arXiv: 2503.07026
erasing undesirable influence in diffusion models | arXiv: 2401.05779
erupt efficient rendering with unposed patch transformer | arXiv: 2503.24374
esc erasing space concept for knowledge deletion | arXiv: 2504.02199
escape equivariant shape completion via anchor point encoding | arXiv: 2412.00952
escaping platos cave towards the alignment of 3d and text latent spaces | arXiv: 2503.05283
espire a diagnostic benchmark for embodied spatial reasoning of vision-language | arXiv: 2603.13033
estimating body and hand motion in an ego-sensed world | arXiv: 2410.03665
etap event-based tracking of any point | arXiv: 2412.00133
ev-3dod pushing the temporal boundaries of 3d object detection with event camera | arXiv: 2502.19630
eval3d interpretable and fine-grained evaluation for 3d generation | arXiv: 2504.18509
evaluating model perception of color illusions in photorealistic scenes | arXiv: 2412.06184
evaluating vision-language models as evaluators in path planning | arXiv: 2411.18711
evenhancer empowering effectiveness efficiency and generalizability for continuo
event ellipsometer event-based mueller-matrix video imaging | arXiv: 2411.17313
event fields capturing light fields at high speed resolution and dynamic range | arXiv: 2412.06191
event-based video super-resolution via state space models
event-equalized dense video captioning
eventfly event camera perception from ground to the sky | arXiv: 2503.19916
eventgpt event stream understanding with multimodal large language models | arXiv: 2412.00832
eventpsr surface normal and reflectance estimation from photometric stereo using
eventsplat 3d gaussian splatting from moving event cameras for real-time renderi
every sam drop counts embracing semantic priors for multi-modality image fusion | arXiv: 2503.01210
everything to the synthetic diffusion-driven test-time adaptation via synthetic- | arXiv: 2406.04295
evidential learning driven breast tumor segmentation with stage-divided vision-l | arXiv: 2603.11206
evocc accurate semantic occupancy for automated driving using evidence theory
evolsplat efficient volume-based gaussian splatting for urban view synthesis | arXiv: 2503.20168
evolving high-quality rendering and reconstruction in a unified framework with c | arXiv: 2503.00881
evos efficient implicit neural training via evolutionary selector | arXiv: 2412.10153
evotok a unified image tokenizer via residual latent evolution for visual unders | arXiv: 2603.12108
evpgs enhanced view prior guidance for splatting-based extrapolated view synthes
exact exploring space-time perceptive clues for weakly supervised satellite imag
expert pyramid tuning efficient parameter fine-tuning for expertise-driven task | arXiv: 2603.12577
expertaf expert actionable feedback from video | arXiv: 2408.00672
explainable saliency articulating reasoning with contextual prioritization
explaining domain shifts in language concept erasing for interpretable image cla
explaining in diffusion explaining a classifier with diffusion semantics
explicit depth-aware blurry video frame interpolation guided by differential cur
exploiting deblurring networks for radiance fields | arXiv: 2502.14454
exploiting temporal state space sharing for video semantic segmentation | arXiv: 2503.20824
exploration-driven generative interactive environments | arXiv: 2504.02515
exploring clips dense knowledge for weakly supervised semantic segmentation | arXiv: 2503.20826
exploring contextual attribute density in referring expression counting | arXiv: 2503.12460
exploring historical information for rgbe visual tracking with mamba
exploring intrinsic normal prototypes within a single image for universal anomal
exploring scene affinity for semi-supervised lidar semantic segmentation | arXiv: 2408.11280
exploring semantic feature discrimination for perceptual image super-resolution
exploring simple open-vocabulary semantic segmentation | arXiv: 2401.12217
exploring sparse moe in gans for text-conditioned image synthesis | arXiv: 2309.03904
exploring temporally-aware features for point tracking | arXiv: 2501.12218
exploring the deep fusion of large language models and diffusion transformers fo
exploring timeline control for facial motion generation | arXiv: 2505.20861
exploring visual vulnerabilities via multi-loss adversarial search for jailbreak
exposure-slot exposure-centric representations learning with slot-in-slot attent
extrapolating and decoupling image-to-video generation models motion modeling is
extreme rotation estimation in the wild | arXiv: 2411.07096
ezsr event-based zero-shot recognition | arXiv: 2407.21616
f-lmm grounding frozen large multimodal models | arXiv: 2406.05821
f3ocus - federated finetuning of vision-language foundation models with optimal
face forgery video detection via temporal forgery cue unraveling
facebench a multi-view multi-level facial attribute vqa dataset for benchmarking
factchexcker mitigating measurement hallucinations in chest x-ray report generat
factored-neus reconstructing surfaces illumination and materials of possibly glo
fada fast diffusion avatar synthesis with mixed-supervised multi-cfg distillatio
fade frequency-aware diffusion model factorization for video editing | arXiv: 2506.05934
faithdiff unleashing diffusion priors for faithful image super-resolution | arXiv: 2411.18824
falcon fairness learning via contrastive attention approach to continual semanti
fam diffusion frequency and attention modulation for high-resolution image gener
fancy123 one image to high-quality 3d mesh generation via plug-and-play deformat
fast and accurate gigapixel pathological image classification with hierarchical
fast3r towards 3d reconstruction of 1000 images in one forward pass | arXiv: 2501.13928
faster focal token acquiring-and-scaling transformer for long-term 3d objection | arXiv: 2503.01899
faster parameter-efficient tuning with token redundancy reduction | arXiv: 2503.20282
fastvlm efficient vision encoding for vision language models | arXiv: 2412.13303
fate full-head gaussian avatar with textural editing from monocular video | arXiv: 2411.15604
fc-track overlap-aware post-association correction for online multi-object track | arXiv: 2603.12758
fdeid-toolbox face de-identification toolbox | arXiv: 2603.13121
fds frequency-aware denoising score for text-guided latent diffusion image editi
feat2gs probing visual foundation models with gaussian splatting | arXiv: 2412.09606
feature information driven position gaussian distribution estimation for tiny ob
feature selection for latent factor models | arXiv: 2412.10128
feature spectrum learning for remote sensing change detection
feature-preserving mesh decimation for normal integration | arXiv: 2504.00867
feature4x bridging any monocular video to 4d agentic ai with versatile gaussian | arXiv: 2503.20776
fedawa adaptive optimization of aggregation weights in federated learning using | arXiv: 2503.15842
fedbip heterogeneous one-shot federated learning with personalized latent diffus
fedcalm conflict-aware layer-wise mitigation for selective aggregation in deeper
fedcs coreset selection for federated learning
federated learning with domain shift eraser | arXiv: 2503.13063
federated modality-specific encoders and partially personalized fusion decoder f | arXiv: 2603.04887
fedmia an effective membership inference attack exploiting all for one principle
fedspa generalizable federated graph learning under homophily heterogeneity
feededit text-based image editing with dynamic feedback regulation
ferret an efficient online continual learning framework under varying memory con
few-shot implicit function generation via equivariance | arXiv: 2501.01601
few-shot personalized scanpath prediction | arXiv: 2504.05499
few-shot recognition via stage-wise retrieval-augmented finetuning | arXiv: 2406.11148
ffacenerf few-shot face editing in neural radiance fields | arXiv: 2503.17095
ffr frequency feature rectification for weakly supervised semantic segmentation
fg2 fine-grained cross-view localization by fine-grained feature matching
fiction 4d future interaction prediction from video | arXiv: 2412.00932
fifa fine-grained inter-frame attention for drivers video gaze estimation
filmcomposer llm-driven music production for silent film clips | arXiv: 2503.08147
filter images first generate instructions later pre-instruction data selection f
fima-q post-training quantization for vision transformers by fisher information | arXiv: 2506.11543
finding local diffusion schrodinger bridge using kolmogorov-arnold network
fine-grained erasure in text-to-image diffusion-based foundation models | arXiv: 2503.19783
fine-grained image-text correspondence with cost aggregation for open-vocabulary | arXiv: 2501.09688
finecaption compositional image captioning focusing on wherever you want at any | arXiv: 2411.15411
finelip extending clips reach via fine-grained alignment with longer text inputs | arXiv: 2504.01916
finephys fine-grained human action generation by explicitly incorporating physic
finer-cam spotting the difference reveals finer details for visual explanation | arXiv: 2501.11309
finevq fine-grained user generated content video quality assessment | arXiv: 2412.19238
fingerprinting denoising diffusion probabilistic models
finite difference flow optimization for rl post-training of text-to-image models | arXiv: 2603.12893
finsler multi-dimensional scaling manifold learning for asymmetric dimensionalit
fire fixed-points of restoration priors for solving inverse problems | arXiv: 2411.18970
fire robust detection of diffusion-generated images via frequency-guided reconst
fireedit fine-grained instruction-based image editing via region-aware vision la
fireplace geometric refinements of llm common sense reasoning for 3d object plac
fish-vista a multi-purpose dataset for understanding identification of traits fr
fishertune fisher-guided robust tuning of vision foundation models for domain ge
fitted neural lossless image compression
flair vlm with fine-grained language-informed image representations | arXiv: 2412.03561
flame frozen large language models enable data-efficient language-image pre-trai
flare feed-forward geometry appearance and camera estimation from uncalibrated s | arXiv: 2502.12138
flash-split 2d reflection removal with flash cues and latent diffusion separatio
flash3d super-scaling point transformers through joint hardware-geometry localit
flashgs efficient 3d gaussian splatting for large-scale and high-resolution rend
flashmotion few-step controllable video generation with trajectory guidance | arXiv: 2603.12146
flashsloth lightning multimodal large language models via embedded visual compre
flavc learned video compression with feature level attention
flexdrive toward trajectory flexibility in driving scene gaussian splatting reco
flexgs train once deploy everywhere with many-in-one flexible 3d gaussian splatt
flexible frame selection for efficient video reasoning
flexible group count enables hassle-free structured pruning
flexidit your diffusion transformer can easily generate high-quality samples wit
flexuod the answer to real-world unsupervised image outlier detection
flipsketch flipping static drawings to text-guided sketch animations | arXiv: 2411.10818
floating no more object-ground reconstruction from a single image | arXiv: 2407.18914
florence-vl enhancing vision-language models with generative vision encoder and | arXiv: 2412.04424
flovd optical flow meets video diffusion model for enhanced camera-controlled vi
flow-nerf joint learning of geometry poses and dense flow within unified neural | arXiv: 2503.10464
flowing from words to pixels a noise-free framework for cross-modality evolution | arXiv: 2412.15213
flowram grounding flow matching policy with region-aware mamba framework for rob
floxels fast unsupervised voxel based scene flow estimation | arXiv: 2503.04718
fluidnexus 3d fluid reconstruction and prediction from a single video | arXiv: 2503.04720
fluxspace disentangled semantic editing in rectified flow models
focal split untethered snapshot depth from differential defocus | arXiv: 2504.11202
focus knowledge-enhanced adaptive visual compression for few-shot whole slide im
focus-n-fix region-aware fine-tuning for text-to-image generation | arXiv: 2501.06481
focusing on tracks for online multi-object tracking
foley-flow coordinated video-to-audio generation with masked audio-visual alignm
font-agent enhancing font understanding with large language models
forensic self-descriptions are all you need for zero-shot detection open-set sou
forensics adapter adapting clip for generalizable face forgery detection | arXiv: 2411.19715
forensics-bench a comprehensive forgery detection benchmark suite for large visi
forensiczip more tokens are better but not necessary in forensic vision-language | arXiv: 2603.12208
forestlpr lidar place recognition in forests attentioning multiple bev density i | arXiv: 2503.04475
forming auxiliary high-confident instance-level loss to promote learning from la
fortifying federated learning towards trustworthiness via auditable data valuati
foundations of the theory of performance-based ranking | arXiv: 2412.04227
foundationstereo zero-shot stereo matching | arXiv: 2501.09898
foundhand large-scale domain-specific learning for controllable hand image gener | arXiv: 2412.02690
foveated instance segmentation | arXiv: 2503.21854
fractal calibration for long-tailed object detection | arXiv: 2410.11774
fractals made practical denoising diffusion as partitioned iterated function sys | arXiv: 2603.13069
frame floor-aligned representation for avatar motion from egocentric video | arXiv: 2503.23094
frames-vqa benchmarking fine-tuning robustness across multi-modal shifts in visu
framevggt frame evidence rolling memory for streaming vggt | arXiv: 2603.07690
free lunch enhancements for multi-modal crowd counting
free on the fly enhancing flexibility in test-time adaptation with online em | arXiv: 2507.06973
free-viewpoint human animation with pose-correlated reference selection | arXiv: 2412.17290
free360 layered gaussian splatting for unbounded 360-degree view synthesis from
freecloth free-form generation enhances challenging clothed human modeling | arXiv: 2411.19942
freegave 3d physics learning from dynamic videos by gaussian velocity | arXiv: 2506.07865
freepca integrating consistency information across long-short frames in training
freescene mixed graph diffusion for 3d scene synthesis from free prompts | arXiv: 2506.02781
freesim toward free-viewpoint camera simulation in driving scenes | arXiv: 2412.03566
freetimegs free gaussian primitives at anytime anywhere for dynamic scene recons
freeuv ground-truth-free realistic facial uv texture recovery via cross-assembly | arXiv: 2503.17197
freqdebias towards generalizable deepfake detection via consistency-driven frequ
frequency dynamic convolution for dense image prediction | arXiv: 2503.18783
frequency-biased synergistic design for image compression and compensation
fresa feedforward reconstruction of personalized skinned avatars from few images | arXiv: 2503.19207
from alexnet to transformers measuring the non-linearity of deep neural networks
from elements to design a layered approach for automatic graphic design composit | arXiv: 2412.19712
from faces to voices learning hierarchical representations for high-quality vide
from head to tail efficient black-box model inversion attack via long-tailed lea
from head to tail towards balanced representation in large vision-language model
from laboratory to real world a new benchmark towards privacy-preserved visible-
from multimodal llms to generalist embodied agents methods and lessons | arXiv: 2412.08442
from poses to identity training-free person re-identification via feature centra
from prototypes to general distributions an efficient curriculum for masked imag | arXiv: 2411.10685
from slow bidirectional to fast autoregressive video diffusion models | arXiv: 2412.07772
from sparse signal to smooth motion real-time motion generation with rolling pre
from sparse to dense camera relocalization with scene-specific detector from fea
from words to structured visuals a benchmark and framework for text-to-diagram g | arXiv: 2411.11916
from zero to detail deconstructing ultra-high-definition image restoration from
frugalnerf fast convergence for extreme few-shot novel view synthesis without le
fruitninja 3d object interior texture generation with gaussian splatting | arXiv: 2411.12089
fsbench a figure skating benchmark for advancing artistic sports understanding | arXiv: 2504.19514
fsboard over 3 million characters of asl fingerspelling collected via smartphone | arXiv: 2407.15806
fsfm a generalizable face security foundation model via self-supervised facial r | arXiv: 2412.12032
fshnet fully sparse hybrid network for 3d object detection | arXiv: 2506.03714
full-dof egomotion estimation for event cameras using geometric solvers | arXiv: 2503.03307
functionality understanding and segmentation in 3d scenes | arXiv: 2411.16310
fuzzy multimodal learning for trusted cross-modal retrieval
g3d-lf generalizable 3d-language feature fields for embodied tasks | arXiv: 2411.17030
g3flow generative 3d semantic flow for pose-aware and generalizable object manip
ga3ce unconstrained 3d gaze estimation with gaze-aware 3d context encoding | arXiv: 2505.10671
gaf gaussian avatar reconstruction from monocular videos via multi-view diffusio
gain from neighbors boosting model robustness in the wild via adversarial pertur
galaxy walker geometry-aware vlms for galaxy-scale understanding | arXiv: 2503.18578
gapt-dar category-level garments pose tracking via integrated 2d deformation and
garmentpile point-level visual affordance guided retrieval and adaptation for cl
gasp gaussian avatars with synthetic priors | arXiv: 2412.07739
gaucho gaussian distributions with cholesky decomposition for oriented object de
gausshdr high dynamic range gaussian splatting via learning unified 3d and 2d lo | arXiv: 2503.10143
gaussian eigen models for human heads | arXiv: 2407.04545
gaussian splashing unified particles for versatile motion synthesis and renderin
gaussian splatting feature fields for privacy-preserving visual localization | arXiv: 2507.23569
gaussian splatting for efficient satellite image photogrammetry | arXiv: 2412.13047
gaussianformer-2 probabilistic gaussian superposition for efficient 3d occupancy | arXiv: 2412.04384
gaussianip identity-preserving realistic 3d human generation via human-centric d | arXiv: 2503.11143
gaussianspa an optimizing-sparsifying simplification framework for compact and h
gaussianudf inferring unsigned distance functions through 3d gaussian splatting | arXiv: 2503.19458
gaussianworld gaussian world model for streaming 3d occupancy prediction | arXiv: 2412.10373
gausstr foundation model-aligned gaussian transformer for self-supervised 3d spa
gaustar gaussian surface tracking and reconstruction | arXiv: 2501.10283
gaze-lle gaze target estimation via large-scale learned encoders | arXiv: 2412.09586
gazegene large-scale synthetic gaze dataset with 3d eyeball annotations
gazing at rewards eye movements as a lens into human and ai decision-making in h | arXiv: 2411.09176
gazing into missteps leveraging eye-gaze for unsupervised mistake detection in e
gbc-splat generalizable gaussian-based clothed human digitalization under sparse
gblobs explicit local structure via gaussian blobs for improved cross-domain lid
gcc generative color constancy via diffusing a color checker | arXiv: 2502.17435
gce-pose global context enhancement for category-level object pose estimation | arXiv: 2502.04293
geal generalizable 3d affordance learning with cross-modal consistency | arXiv: 2412.09511
gem a generalizable ego-vision multimodal world model for fine-grained ego-motio
gen3c 3d-informed world-consistent video generation with precise camera control | arXiv: 2503.03751
gen3deval using vllms for automatic evaluation of generated 3d objects | arXiv: 2504.08125
genassets generating in-the-wild 3d assets in latent space
gendeg diffusion-based degradation synthesis for generalizable all-in-one image | arXiv: 2411.17687
generalizable object keypoint localization from generative priors
generalized diffusion detector mining robust features from diffusion models for | arXiv: 2503.02101
generalized few-shot 3d point cloud segmentation with vision-language model | arXiv: 2503.16282
generalized gaussian entropy model for point cloud attribute compression with dy
generalized recorrupted-to-recorrupted self-supervised learning beyond gaussian | arXiv: 2412.04648
generalized zero-shot classification via semantics-free inter-class feature gene
generalizing deepfake video detection with plug-and-play video-level blending an
generating 3d-consistent videos from unposed internet photos | arXiv: 2411.13549
generating 6dof object manipulation trajectories from action description in egoc
generating multimodal driving scenes via next-scene prediction | arXiv: 2503.14945
generation of maximal snake polyominoes using a deep neural network | arXiv: 2603.12400
generative densification learning to densify gaussians for high-fidelity general
generative gaussian splatting for unbounded 3d city generation | arXiv: 2406.06526
generative hard example augmentation for semantic point cloud segmentation
generative image layer decomposition with visual effects | arXiv: 2411.17864
generative inbetweening through frame-wise conditions-driven video generation | arXiv: 2412.11755
generative map priors for collaborative bev semantic segmentation
generative modeling of class probability for multi-modal representation learning | arXiv: 2503.17417
generative multimodal pretraining with discrete diffusion timestep tokens | arXiv: 2504.14666
generative multiview relighting for 3d reconstruction under extreme illumination | arXiv: 2412.15211
generative omnimatte learning to decompose video into layers | arXiv: 2411.16683
generative photography scene-consistent camera control for realistic text-to-ima
generative photomontage | arXiv: 2408.07116
generative sparse-view gaussian splatting
generative video propagation | arXiv: 2412.19761
generative zero-shot composed image retrieval
genfusion closing the loop between reconstruction and generation via videos | arXiv: 2503.21219
genius a generative framework for universal multimodal search | arXiv: 2503.19868
genmanip llm-driven simulation for generalizable instruction-following manipulat
genpc zero-shot point cloud completion via 3d generative priors | arXiv: 2502.19896
genvdm generating vector displacement maps from a single image | arXiv: 2503.00605
geoavatar geometrically-consistent multi-person avatar reconstruction from spars
geochemad benchmarking unsupervised geochemical anomaly detection for mineral ex | arXiv: 2603.13068
geodepth from point-to-depth to plane-to-depth modeling for self-supervised mono
geometric knowledge-guided localized global distribution alignment for federated | arXiv: 2503.06457
geometry field splatting with gaussian surfels | arXiv: 2411.17067
geometry in style 3d stylization via surface normal deformation | arXiv: 2503.23241
geometry-guided camera motion understanding in videollms | arXiv: 2603.13119
geometry-guided online 3d video synthesis with multi-view temporal consistency | arXiv: 2505.18932
geomm on geodesic perspective for multi-modal learning | arXiv: 2505.11216
ges3vig incorporating pointing gestures into language-based 3d visual grounding
get unlocking the multi-modal potential of clip for generalized category discove
gflowvlm enhancing multi-step reasoning in vision-language models with generativ
gg-ssms graph-generating state space models | arXiv: 2412.12423
gif generative inspiration for face recognition at scale | arXiv: 2505.03012
gifstream 4d gaussian-based immersive video with feature stream | arXiv: 2505.07539
gigahands a massive annotated dataset of bimanual hand activities | arXiv: 2412.04244
giim graph-based learning of inter- and intra-view dependencies for multi-view m | arXiv: 2603.09446
givepose gradual intra-class variation elimination for rgb-based category-level
glane3d detecting lanes with graph of 3d keypoints | arXiv: 2503.23882
glass guided latent slot diffusion for object-centric learning | arXiv: 2407.17929
glianet adaptive neural network structure learning with glia-driven
global-local tree search in vlms for 3d indoor scene generation | arXiv: 2503.18476
glossy object reconstruction with cost-effective polarized acquisition | arXiv: 2504.07025
glus global-local reasoning unified into a single large language model for video | arXiv: 2504.07962
glyphmastero a glyph encoder for high-fidelity scene text editing | arXiv: 2505.04915
go-n3rdet geometry optimized nerf-enhanced 3d object detector | arXiv: 2503.15211
go-with-the-flow motion-controllable video diffusion models using real-time warp
goal global-local object alignment learning | arXiv: 2503.17782
goalflow goal-driven flow matching for multimodal trajectories generation in end
goku flow based video generative foundation models | arXiv: 2502.04896
golden cudgel network for real-time semantic segmentation | arXiv: 2503.03325
golf-nrt integrating global context and local geometry for few-shot view synthes
good cheap and fast overfitted image compression with wasserstein distortion | arXiv: 2412.00505
gpavatar high-fidelity head avatars by learning efficient gaussian projections
gps as a control signal for image generation | arXiv: 2501.12390
gpvk-vl geometry-preserving virtual keyframes for visual localization under larg
grade benchmarking discipline-informed reasoning in image editing | arXiv: 2603.12264
gradient inversion attacks on parameter-efficient fine-tuning | arXiv: 2506.04453
gradient-guided annealing for domain generalization | arXiv: 2502.20162
grae-3dmot geometry relation-aware encoder for online 3d multi-object tracking
graph neural network combining event stream and periodic aggregation for low-lat
graph-embedded structure-aware perceptual hashing for neural network protection
graphgpt-o synergistic multimodal comprehension and generation on graphs | arXiv: 2502.11925
graphi2p image-to-point cloud registration with exploring pattern of corresponde
graphmimic graph-to-graphs generative modeling from videos for policy learning
great geometry-intention collaborative inference for open-vocabulary 3d object a | arXiv: 2411.19626
gromov-wasserstein problem with cyclic symmetry
groomlight hybrid inverse rendering for relightable human hair appearance modeli
ground-v teaching vlms to ground complex instructions in pixels | arXiv: 2505.13788
grounding 3d object affordance with language instructions visual observations an | arXiv: 2504.04744
groundingface fine-grained face understanding via pixel grounding multimodal lar
groupmamba efficient group-based visual state space model | arXiv: 2407.13772
grove a generalized reward for learning open-vocabulary physical skill | arXiv: 2504.04191
gs-2dgs geometrically supervised 2dgs for reflective object reconstruction | arXiv: 2506.13110
gs-dit advancing video generation with dynamic 3d gaussian fields through effici
guardsplat efficient and robust watermarking for 3d gaussian splatting | arXiv: 2411.19895
gui-xplore empowering generalizable gui agents with one exploration | arXiv: 2503.17709
guiding human-object interactions with rich geometry and relations | arXiv: 2503.20172
gyro-based neural single image deblurring | arXiv: 2404.00916
h-edit effective and flexible diffusion-based editing via doobs h-transform | arXiv: 2503.02187
h-more learning human-centric motion representation for action analysis | arXiv: 2504.10676
h2st hierarchical two-sample tests for continual out-of-distribution detection | arXiv: 2503.14832
hallo3 highly dynamic and realistic portrait image animation with video diffusio
halloc token-level localization of hallucinations for vision language models | arXiv: 2506.10286
hand-held object reconstruction from rgb video with dynamic interaction
handling spatial-temporal data heterogeneity for federated continual learning vi
handos 3d hand reconstruction in one stage | arXiv: 2412.01537
hardware-rasterized ray-based gaussian splatting | arXiv: 2503.18682
harmonyset a comprehensive dataset for understanding video-music semantic alignm
harnessing frequency spectrum insights for image copyright protection against di
harnessing frozen unimodal encoders for flexible multimodal alignment | arXiv: 2409.19425
harnessing global-local collaborative adversarial perturbation for anti-customiz
hash3d training-free acceleration for 3d generation | arXiv: 2404.06091
hawor world-space hand motion reconstruction from egocentric videos | arXiv: 2501.02973
hazy low-quality satellite video restoration via learning optimal joint degradat
hd-epic a highly-detailed egocentric video dataset | arXiv: 2502.04144
hearing anywhere in any environment | arXiv: 2504.10746
hearing hands generating sounds from physical interactions in 3d scenes | arXiv: 2506.09989
heatformer a neural optimizer for multiview human mesh recovery | arXiv: 2412.04456
heie mllm-based hierarchical explainable aigc image implausibility evaluator | arXiv: 2411.17261
helvipad a real-world dataset for omnidirectional stereo depth estimation | arXiv: 2411.18335
hemora unsupervised heuristic consensus sampling for robust point cloud registra
hera hybrid explicit representation for ultra-realistic head avatars
heterogeneous skeleton-based action representation learning | arXiv: 2506.03481
hfp-sam hierarchical frequency prompted sam for efficient marine animal segmenta | arXiv: 2603.12708
hiap a multi-granular stochastic auto-pruning framework for vision transformers | arXiv: 2603.12222
hiding images in diffusion models by editing learned score functions | arXiv: 2503.18459
hierarchical adaptive filtering network for text image specular highlight remova
hierarchical compact clustering attention coca for unsupervised object-centric l | arXiv: 2505.02071
hierarchical dual-change collaborative learning for uav scene change captioning | arXiv: 2603.12832
hierarchical features matter a deep exploration of progressive parameterization
hierarchical flow diffusion for efficient frame interpolation | arXiv: 2504.00380
hierarchical gaussian mixture model splatting for efficient and part controllabl
hierarchical knowledge prompt tuning for multi-task test-time adaptation
hierarq task-aware hierarchical q-former for enhanced video understanding | arXiv: 2503.08585
hifi-portrait zero-shot identity-preserved portrait generation with high-fidelit
hificl high-fidelity in-context learning for multimodal tasks | arXiv: 2603.12760
high dynamic range video compression a large-scale benchmark dataset and a learn
high temporal consistency through semantic similarity propagation in semi-superv
high-fidelity 3d object generation from single image with rgbn-volume gaussian r | arXiv: 2504.01512
high-fidelity lightweight mesh reconstruction from point clouds
high-fidelity relightable monocular portrait animation with lighting-controllabl
high-quality point cloud oriented normal estimation via hybrid angular and eucli
higher-order ratio cycles for fast and globally optimal shape matching
hiif hierarchical encoding based implicit image function for continuous super-re
hilots high-low temporal sensitive representation learning for semi-supervised l
himor monocular deformable gaussian reconstruction with hierarchical motion repr
hipart hierarchical pose autoregressive transformer for occluded 3d human pose e | arXiv: 2503.23331
hires-llava restoring fragmentation input in high-resolution large vision-langua
histofs non-iid histopathologic whole slide image classification via federated s
hmar efficient hierarchical masked auto-regressive image generation | arXiv: 2506.04421
hogs unified near and far object reconstruction via homogeneous gaussian splatti
hoi3dgen generating high-quality human-object-interactions in 3d | arXiv: 2603.12126
hoigen-1m a large-scale dataset for human-object interaction video generation | arXiv: 2503.23715
hoigpt learning long-sequence hand-object interaction with language models
holmes-vau towards long-term video anomaly understanding at any granularity | arXiv: 2412.06171
homesafe-bench evaluating vision-language models on unsafe action detection for | arXiv: 2603.11975
homogen enhanced video inpainting via homography propagation and diffusion
homogeneous dynamics space for heterogeneous humans | arXiv: 2412.06146
hop heterogeneous topology-based multimodal entanglement for co-speech gesture g | arXiv: 2503.01175
horizon-gs unified 3d gaussian splatting for large-scale aerial-to-ground scenes | arXiv: 2412.01745
horp human-object relation priors guided hoi detection
hot hadamard-based optimized training | arXiv: 2503.21261
hot3d hand and object tracking in 3d from egocentric multi-view videos | arXiv: 2411.19167
hotformerloc hierarchical octree transformer for versatile lidar place recogniti
hotspot signed distance function optimization with an asymptotically sufficient | arXiv: 2411.14628
hovle unleashing the power of monolithic vision-language models with holistic vi
how do i do that synthesizing 3d hand motion and contacts for everyday interacti
how to merge your multimodal models over time | arXiv: 2412.06712
hravatar high-quality and relightable gaussian head avatar | arXiv: 2503.08224
hsemotion team at abaw-10 competition facial expression recognition valence-arou | arXiv: 2603.12693
hsi a holistic style injector for arbitrary style transfer | arXiv: 2502.04369
hsi-gpt a general-purpose large scene-motion-language model for human scene inte
human knowledge integrated multi-modal learning for single source domain general | arXiv: 2603.12369
human motion instruction tuning | arXiv: 2411.16805
human-centered interactive learning via mllms for text-to-image person re-identi
humandreamer generating controllable human-motion videos via decoupled generatio
humanmm global human motion recovery from multi-shot videos | arXiv: 2503.07597
humanrig learning automatic rigging for humanoid character in a large scale data
humocon concept discovery for human motion understanding | arXiv: 2505.20920
hunet homotopy unfolding network for image compressive sensing
hunyuanportrait implicit condition control for enhanced portrait animation | arXiv: 2503.18860
huperflow a comprehensive benchmark for human vs machine motion estimation compa
hush holistic panoramic 3d scene understanding using spherical harmonics
hvi a new color space for low-light image enhancement | arXiv: 2502.20272
hybrid concept bottleneck models
hybrid etfce-grf exact cluster-size retrieval with analytical p-values for voxel | arXiv: 2603.11344
hybrid global-local representation with augmented spatial guidance for zero-shot
hybrid reciprocal transformer with triplet feature alignment for scene graph gen
hybrid-level instruction injection for video token compression in multi-modal la
hybridgs decoupling transients and statics with 2d and 3d gaussian splatting | arXiv: 2412.03844
hybridmqa exploring geometry-texture interactions for colored mesh quality asses
hyperbolic category discovery | arXiv: 2504.06120
hyperbolic safety-aware vision-language models | arXiv: 2503.12127
hyperbolic uncertainty-aware few-shot incremental point cloud segmentation
hyperdimensional uncertainty quantification for multimodal uncertainty fusion in
hyperfree a channel-adaptive and tuning-free foundation model for hyperspectral
hyperglm hypergraph for video scene graph generation and anticipation | arXiv: 2411.18042
hypergraph vision transformers images are more than nodes more than edges | arXiv: 2504.08710
hypergs hyperspectral 3d gaussian splatting | arXiv: 2412.12849
hyperlora parameter-efficient adaptive generation for portrait synthesis | arXiv: 2503.16944
hypernet fields efficiently training hypernetworks without ground truth by learn
hypernvd accelerating neural video decomposition via hypernetworks | arXiv: 2503.17276
hyperpose hypernetwork-infused camera pose localization and an extended cambridg
hyperseg hybrid segmentation assistant with fine-grained visual perceiver
hyperspectral pansharpening via diffusion models with iteratively zero-shot guid
i2vguard safeguarding images against misuse in diffusion-based image-to-video mo
iaao interactive affordance learning for articulated objects in 3d environments | arXiv: 2504.06827
ice intrinsic concept extraction from a single image via diffusion models | arXiv: 2503.19902
icediff high resolution and high-quality arctic sea ice forecasting with generat
icp immediate compensation pruning for mid-to-high sparsity
ict image-object cross-level trusted intervention for mitigating object hallucin
id-patch robust id association for group photo personalization | arXiv: 2411.13632
idea inverted text with cooperative deformable aggregation for multi-modal objec
idea-bench how far are generative models from professional designing | arXiv: 2412.11767
identifying and mitigating position bias of multi-image vision-language models | arXiv: 2503.13792
identifying and mitigating spurious correlation in multi-task learning
identity-clothing similarity modeling for unsupervised clothing change person re
identity-preserving distillation sampling by fixed-point iterator | arXiv: 2502.19930
identity-preserving text-to-video generation by frequency decomposition | arXiv: 2411.17440
idol instant photorealistic 3d human creation from a single image | arXiv: 2412.14963
idprotector an adversarial noise encoder to protect against id-preserving image | arXiv: 2412.11638
ig-6dof model-free 6dof pose estimation for unseen object via iterative 3d gauss
ilias instance-level image retrieval at scale | arXiv: 2502.11748
illumination spectrum estimation for multispectral images via surface reflectanc
im-portrait learning 3d-aware video diffusion for photorealistic talking heads f
im-zero instance-level motion controllable video generation in a zero-shot manne
image generation diversity issues and how to tame them | arXiv: 2411.16171
image is all you need to empower large-scale diffusion models for in-domain gene
image over text transforming formula recognition evaluation with character detec
image quality assessment from human to machine preference | arXiv: 2503.10078
image quality assessment investigating causal perceptual effects with abductive | arXiv: 2412.16939
image reconstruction from readout-multiplexed single-photon detector arrays | arXiv: 2312.02971
image referenced sketch colorization based on animation creation workflow | arXiv: 2502.19937
imagine and seek improving composed image retrieval with an imagined proxy | arXiv: 2411.16752
imaginefsl self-supervised pretraining matters on imagined base set for vlm-base
imfine 3d inpainting via geometry-guided multi-view refinement | arXiv: 2503.04501
img-diff contrastive data synthesis for multimodal large language models | arXiv: 2408.04594
immune improving safety against jailbreaks in multi-modal llms via inference-tim
implicit bias injection attacks against text-to-image diffusion models | arXiv: 2504.01819
implicit correspondence learning for image-to-point cloud registration
improve representation for imbalanced regression through geometric constraints | arXiv: 2503.00876
improved monocular depth prediction using distance transform over pre-semantic c
improved video vae for latent video diffusion model | arXiv: 2411.06449
improving accuracy and calibration via differentiated deep mutual learning
improving adversarial transferability on vision transformers via forward propaga
improving autoregressive visual generation with cluster-oriented token predictio
improving diffusion inverse problem solving with decoupled noise annealing | arXiv: 2407.01521
improving editability in image generation with layer-wise memory | arXiv: 2505.01079
improving gaussian splatting with localized points management | arXiv: 2406.04251
improving personalized search with regularized low-rank parameter updates | arXiv: 2506.10182
improving semi-supervised semantic segmentation with sliced-wasserstein feature
improving sound source localization with joint slot attention on image and audio | arXiv: 2504.15118
improving the training of data-efficient gans via quality aware dynamic discrimi
improving the transferability of adversarial attacks on face recognition with di
improving transferable targeted attacks with feature tuning mixup | arXiv: 2411.15553
improving visual and downstream performance of low-light enhancer with vision fo
imputation-free and alignment-free incomplete multi-view clustering driven by co
imvid immersive volumetric videos for enhanced vr engagement | arXiv: 2503.14359
inceventgs pose-free gaussian splatting from a single event camera | arXiv: 2410.08107
incomplete multi-modal brain tumor segmentation via learnable sorting state spac
incomplete multi-view multi-label learning via disentangled representation and l
incorporating dense knowledge alignment into unified multimodal representation m
incremental object keypoint learning | arXiv: 2503.20248
indoorgs geometric cues guided gaussian splatting for indoor scene reconstructio
inference-scale complexity in ann-snn conversion for high-performance and low-po
infighting in the dark multi-label backdoor attack in federated learning | arXiv: 2409.19601
infinity scaling bitwise autoregressive modeling for high-resolution image synth
influence malleability in linearized attention dual implications of non-converge | arXiv: 2603.13085
infp audio-driven interactive head generation in dyadic conversations | arXiv: 2412.04037
inpo inversion preference optimization with reparametrized ddim for efficient di
insight-v exploring long-chain visual reasoning with multimodal large language m | arXiv: 2411.14432
insightedit towards better instruction following for image editing | arXiv: 2411.17323
insightful instance features for 3d instance segmentation
inst3d-lmm instance-aware 3d scene understanding with multi-modal instruction tu
instag learning personalized 3d talking head from few-second video | arXiv: 2502.20387
instance-wise supervision-level optimization in active learning | arXiv: 2503.06517
instancecap improving text-to-video generation via instance-aware structured cap
instancegaussian appearance-semantic joint gaussian representation for 3d instan
instant adversarial purification with adversarial consistency distillation | arXiv: 2408.17064
instant gaussian stream fast and generalizable streaming of dynamic scene recons
instant3dit multiview inpainting for fast editing of 3d objects | arXiv: 2412.00518
instanthdr single-forward gaussian splatting for high dynamic range 3d reconstru | arXiv: 2603.11298
instruct-clip improving instruction-guided image editing with automated data ref
instruction-based image manipulation by watching how things move | arXiv: 2412.12087
integral fast fourier color constancy | arXiv: 2502.03494
integration of deep generative anomaly detection algorithm in high-speed industr | arXiv: 2603.07577
interact advancing large-scale versatile 3d human-object interaction generation | arXiv: 2509.09555
interactanything zero-shot human object interaction synthesis via llm feedback a
interactionmap improving online vectorized hdmap construction with interaction | arXiv: 2503.21659
interactive medical image analysis with concept-based similarity reasoning | arXiv: 2503.06873
interactive medical image segmentation a benchmark dataset and baseline | arXiv: 2411.12814
interactvlm 3d interaction reasoning from 2d foundational models | arXiv: 2504.05303
interdyn controllable interactive dynamics with video diffusion models | arXiv: 2412.11785
interedit navigating text-guided multi-human 3d motion editing | arXiv: 2603.13082
interleaved-modal chain-of-thought | arXiv: 2411.19488
intermimic towards universal whole-body control for physics-based human-object i | arXiv: 2502.20390
interpretable generative models through post-hoc concept bottlenecks | arXiv: 2503.19377
interpretable image classification via non-parametric part prototype learning | arXiv: 2503.10247
interpreting object-level foundation models via visual precision search | arXiv: 2411.16198
inversion circle interpolation diffusion-based image augmentation for data-scarc
investigating the role of weight decay in enhancing nonconvex sgd
invisible backdoor attack against self-supervised learning | arXiv: 2405.14672
irgs inter-reflective gaussian splatting with 2d gaussian ray tracing | arXiv: 2412.15867
iris inverse rendering of indoor scenes from low dynamic range images | arXiv: 2401.12977
is right right enhancing object orientation understanding in multimodal large la
is this generated person existed in real-world fine-grained detecting and calibr
is your world simulator a good story presenter a consecutive events-based benchm
isegman interactive segment-and-manipulate 3d gaussians | arXiv: 2505.11934
ita-mdt image-timestep-adaptive masked diffusion transformer framework for image
iterative predictor-critic code decoding for real-world image dehazing | arXiv: 2503.13147
iteris iterative inference-solving alignment for lora merging | arXiv: 2411.15231
its a blind match towards vision-language correspondence without parallel data | arXiv: 2503.24129
jailbreaking the non-transferable barrier via test-time data disguising | arXiv: 2503.17198
jamma ultra-lightweight local feature matching with joint mamba | arXiv: 2503.03437
janus decoupling visual encoding for unified multimodal understanding and genera
janusflow harmonizing autoregression and rectified flow for unified multimodal u | arXiv: 2411.07975
jarvisir elevating autonomous driving perception with intelligent image restorat
jisam alleviate labeling burden and corner case problems in autonomous driving v
joint and streamwise distributed mimo satellite communications with multi-antenn | arXiv: 2603.12914
joint optimization of neural radiance fields and continuous camera motion from a | arXiv: 2504.19819
joint out-of-distribution filtering and data discovery active learning | arXiv: 2503.02491
joint scheduling of causal prompts and tasks for multi-task learning
joint vision-language social bias removal for clip | arXiv: 2411.12785
jopp-3d joint open vocabulary semantic segmentation on point clouds and panorama | arXiv: 2603.06168
jtd-uav mllm-enhanced joint tracking and description framework for anti-uav syst
just dance with pi a poly-modal inductor for weakly-supervised video anomaly det
k-lora unlocking training-free fusion of any subject and style loras | arXiv: 2502.18461
k-sort arena efficient and reliable benchmarking for generative models via k-wis
kac kolmogorov-arnold classifier for continual learning | arXiv: 2503.21076
keep the balance a parameter-efficient symmetrical framework for rgbx semantic s
keyface expressive audio-driven facial animation for long sequences via keyframe | arXiv: 2503.01715
keyframe-guided creative video inpainting
kiss3dgen repurposing image diffusion models for 3d asset generation | arXiv: 2503.01370
kmd koopman multi-modality decomposition for generalized brain tumor segmentatio
knowledge bridger towards training-free missing modality completion | arXiv: 2502.19834
knowledge memorization and rumination for pre-trained model-based class-incremen
knowledge-aligned counterfactual-enhancement diffusion perception for unsupervis
koala-36m a large-scale video dataset improving consistency between fine-grained
kvq boosting video quality assessment via saliency-guided local perception | arXiv: 2503.10259
l-swag layer-sample wise activation with gradients information for zero-shot nas
l2gtx from local to global time series explanations | arXiv: 2603.13065
label shift meets online learning ensuring consistent adaptation with universal
lal enhancing 3d human motion prediction with latency-aware auxiliary learning
lamra large multimodal model as your advanced retrieval assistant | arXiv: 2412.01720
language guided concept bottleneck models for interpretable continual learning | arXiv: 2503.23283
language-assisted debiasing and smoothing for foundation model-based semi-superv
language-grounded decoupled action representation for robotic manipulation | arXiv: 2603.12967
language-guided audio-visual learning for long-term sports assessment
language-guided image tokenization for generation | arXiv: 2412.05796
language-guided salient object ranking
large self-supervised models bridge the gap in domain adaptive object detection | arXiv: 2503.23220
large-scale multi-view tensor clustering with implicit linear kernels
large-scale text-to-image model with inpainting is a zero-shot subject-driven im
latent drifting in diffusion models for counterfactual medical image synthesis | arXiv: 2412.20651
latent space imaging | arXiv: 2407.07052
latent space super-resolution for higher-resolution image generation with diffus
latenthoi on the generalizable hand object motion generation with latent hand di
latexblend scaling multi-concept customized generation with latent textual blend | arXiv: 2503.06956
latte-mv learning to anticipate table tennis hits from monocular videos | arXiv: 2503.20936
lavin-dit large vision diffusion transformer | arXiv: 2411.11505
layer- and timestep-adaptive differentiable token compression ratios for efficie
layered image vectorization via semantic simplification | arXiv: 2406.05404
layered motion fusion lifting motion segmentation to 3d in egocentric videos | arXiv: 2506.05546
layoutvlm differentiable optimization of 3d layout via vision-language models | arXiv: 2412.02193
lc-mamba local and continuous mamba with shifted windows for frame interpolation
leangaussian breaking pixel or point cloud correspondence in modeling 3d gaussia
learnable infinite taylor gaussian for dynamic view rendering | arXiv: 2412.04282
learned binocular-encoding optics for rgbd imaging using joint stereo and focus
learned image compression with dictionary-based entropy model | arXiv: 2504.00496
learning 4d panoptic scene graph generation from rich 2d visual scene | arXiv: 2503.15019
learning affine correspondences by integrating geometric constraints | arXiv: 2504.04834
learning audio-guided video representation with gated attention for video-text r | arXiv: 2504.02397
learning bijective surface parameterization for inferring signed distance functi
learning class prototypes for unified sparse-supervised 3d object detection | arXiv: 2503.21099
learning compatible multi-prize subnetworks for asymmetric retrieval | arXiv: 2504.11879
learning conditional space-time prompt distributions for video class-incremental
learning dynamic collaborative network for semi-supervised 3d vessel segmentatio
learning endogenous attention for incremental object detection
learning extremely high density crowds as active matters | arXiv: 2503.12168
learning flow fields in attention for controllable person image generation | arXiv: 2412.08486
learning from neighbors category extrapolation for long-tail learning | arXiv: 2410.15980
learning from streaming video with orthogonal gradients | arXiv: 2504.01961
learning from synchronization self-supervised uncalibrated multi-view person ass
learning hazing to dehazing towards realistic haze generation for real-world ima
learning heterogeneous tissues with mixture of experts for gigapixel whole slide
learning occlusion-robust vision transformers for real-time uav tracking | arXiv: 2504.09228
learning on model weights using tree experts | arXiv: 2410.13569
learning partonomic 3d reconstruction from image collections
learning person-specific animatable face models from in-the-wild images via a sh
learning phase distortion with selective state space models for video turbulence | arXiv: 2504.02697
learning physics from video unsupervised physical parameter estimation for conti
learning physics-based full-body human reaching and grasping from brief walking | arXiv: 2503.07481
learning temporally consistent video depth from video diffusion priors | arXiv: 2406.01493
learning textual prompts for open-world semi-supervised learning
learning to detect objects from multi-agent lidar scans without manual labels | arXiv: 2503.08421
learning to filter outlier edges in global sfm
learning to highlight audio by watching movies | arXiv: 2505.12154
learning to normalize on the spd manifold under bures-wasserstein geometry | arXiv: 2504.00660
learning to sample effective and diverse prompts for text-to-image generation | arXiv: 2502.11477
learning visual composition through improved semantic guidance | arXiv: 2412.15396
learning visual generative priors without text | arXiv: 2412.07767
learning with noisy triplet correspondence for composed image retrieval
learning-enabled polynomial lyapunov function synthesis via high-accuracy counte
lediff latent exposure diffusion for hdr generation | arXiv: 2412.14456
lesionlocator zero-shot universal tumor segmentation and tracking in 3d whole-bo
less attention is more prompt transformer for generalized category discovery
less is more efficient image vectorization with adaptive parameterization
less is more efficient model merging with binary task switch | arXiv: 2412.00054
lessons and insights from a unifying study of parameter-efficient fine-tuning pe
let humanoids hike integrative skill development on complex trails | arXiv: 2505.06218
let samples speak mitigating spurious correlation by exploiting the clusterness | arXiv: 2512.22874
lets chorus partner-aware hybrid song-driven 3d head animation
lets verify and reinforce image generation step by step
leveraging 3d geometric priors in 2d rotation symmetry detection | arXiv: 2503.20235
leveraging perturbation robustness to enhance out-of-distribution detection | arXiv: 2503.18784
leveraging sd map to augment hd map-based trajectory prediction
leveraging temporal cues for semi-supervised multi-view 3d object detection
levitor 3d trajectory oriented image-to-video synthesis | arXiv: 2412.15214
libra-merging importance-redundancy and pruning-merging trade-off for accelerati
libragrad balancing gradient flow for universally better vision transformer attr
lidar-rt gaussian-based ray tracing for dynamic lidar re-simulation | arXiv: 2412.15199
lidargait learning local features and size awareness from lidar point clouds for
lifelong knowledge editing for vision language models with low-rank mixture-of-e
lift3d policy lifting 2d foundation models for robust 3d robotic manipulation | arXiv: 2411.18623
lifting motion to the 3d world via 2d diffusion | arXiv: 2411.18808
lifting the veil on visual information flow in mllms unlocking pathways to faste
light transport-aware diffusion posterior sampling for single-view reconstructio
light3r-sfm towards feed-forward structure-from-motion | arXiv: 2501.14914
lightloc learning outdoor lidar localization at light speed | arXiv: 2503.17814
lim large interpolator model for dynamic reconstruction | arXiv: 2503.22537
limoe mixture of lidar representation learners from automotive scenes | arXiv: 2501.04004
linear attention modeling for learned image compression | arXiv: 2502.05741
lineart a knowledge-guided training-free high-quality appearance transfer for de
lingen towards high-resolution minute-length text-to-video generation with linea
linguistics-aware masked image modeling for self-supervised scene text recogniti
link to the past temporal propagation for fast 3d human reconstruction from mono
link-based contrastive learning for one-shot unsupervised domain adaptation
lion-fs fast slow video-language thinker as online video assistant | arXiv: 2503.03663
lirm large inverse rendering model for progressive reconstruction of shape mater
lisu a dataset and method for lidar surface normal estimation | arXiv: 2503.08601
lita-gs illumination-agnostic novel view synthesis via reference-free 3d gaussia
livecc learning video llm with streaming speech transcription at scale | arXiv: 2504.16030
livos light video object segmentation with gated linear matching | arXiv: 2411.02818
llava-critic learning to evaluate multimodal models | arXiv: 2410.02712
llava-st a multimodal large language model for fine-grained spatial-temporal und
llavidal a large language vision model for daily activities of living | arXiv: 2406.09390
llm-driven multimodal and multi-identity listening head generation
llmdet learning strong open-vocabulary object detectors under the supervision of
lmo linear mamba operator for mri reconstruction
locality-aware zero-shot human-object interaction detection | arXiv: 2505.19503
localized concept erasure for text-to-image diffusion models using training-free
localizing events in videos with multimodal queries | arXiv: 2406.10079
locally orderless images for optimization in differentiable rendering | arXiv: 2503.21931
locore image re-ranking with long-context sequence modeling | arXiv: 2503.21772
lod-gs achieving levels of detail using scalable gaussian soup
logiczsl exploring logic-induced representation for compositional zero-shot lear
logits deconfusion with clip for few-shot learning | arXiv: 2504.12104
logosp local-global grouping of superpoints for unsupervised semantic segmentati
loki low-dimensional kan for efficient fine-tuning image models
long video diffusion generation with segmented cross-attention and content-rich | arXiv: 2412.01316
longdiff training-free long video generation in one go | arXiv: 2503.18150
longvale vision-audio-language-event benchmark towards time-aware omni-modal per
lookcloser frequency-aware radiance field for tiny-detail scene | arXiv: 2503.18513
lookingglass generative anamorphoses via laplacian pyramid warping | arXiv: 2504.08902
lora recycle unlocking tuning-free few-shot adaptability in visual foundation mo
lora subtraction for drift-resistant space in exemplar-free continual learning | arXiv: 2503.18985
loraclr contrastive adaptation for customization of diffusion models | arXiv: 2412.09622
lorasculpt sculpting lora for harmonizing general and specialized knowledge in m
lost in translation found in context sign language translation with contextual c | arXiv: 2501.09754
lotus large-scale machine unlearning with a taste of uncertainty | arXiv: 2503.18314
lotusfilter fast diverse nearest neighbor search via a learned cutoff table | arXiv: 2506.04790
low-biased general annotated dataset generation | arXiv: 2412.10831
low-rank adaptation in multilinear operator networks for security-preserving inc
lp-diff towards improved restoration of real-world degraded license plate
lposs label propagation over patches and pixels for open-vocabulary semantic seg
lr-sgs robust lidar-reflectance-guided salient gaussian splatting for self-drivi | arXiv: 2603.12647
lscenellm enhancing large 3d scene understanding using adaptive visual preferenc
lsnet see large focus small | arXiv: 2503.23135
lt3sd latent trees for 3d scene diffusion | arXiv: 2409.08215
lucas layered universal codec avatars | arXiv: 2502.19739
luminance-gs adapting 3d gaussian splatting to challenging lighting conditions w
luminet latent intrinsics meets diffusion models for indoor scene relighting | arXiv: 2412.00177
lux post facto learning portrait performance relighting with conditional video d
lyapunov stable graph neural flow | arXiv: 2603.12557
m-llm based video frame selection for efficient video understanding | arXiv: 2502.19680
m2-occ resilient 3d semantic occupancy prediction for autonomous driving with in | arXiv: 2603.09737
m3-vos multi-phase multi-transition and multi-scenery video object segmentation | arXiv: 2412.13803
m3amba memory mamba is all you need for whole slide image classification
m3gym a large-scale multimodal multi-view multi-person pose dataset for fitness
mac-ego3d multi-agent gaussian consensus for real-time collaborative ego-motion
mad memory-augmented detection of 3d objects
madcow marginal distortion correction for wide-angle photography with arbitrary
mage single image to material-aware 3d via the multi-view g-buffer estimation mo
magic-slam multi-agent gaussian globally consistent slam | arXiv: 2411.16785
magicarticulate make your 3d models articulation-ready | arXiv: 2502.12135
magicquill an intelligent interactive image editing system | arXiv: 2411.09703
magma a foundation model for multimodal ai agents | arXiv: 2502.13130
maintaining consistent inter-class topology in continual test-time adaptation
mair a locality- and continuity-preserving mamba for image restoration | arXiv: 2412.20066
make it count text-to-image generation with an accurate number of objects | arXiv: 2406.10210
make-it-animatable an efficient framework for authoring animation-ready 3d chara
making old film great again degradation-aware state space model for old film res
mamba as a bridge where vision foundation models meet vision language models for
mamba-adaptor state space model adaptor for visual recognition | arXiv: 2505.12685
mamba-reg vision mamba also needs registers
mamba4d efficient 4d point cloud video understanding with disentangled spatial-t
mambaic state space models for high-performance learned image compression | arXiv: 2503.12461
mambairv2 attentive state space restoration | arXiv: 2411.15269
mambaout do we really need mamba for vision | arXiv: 2405.07992
mambavision a hybrid mamba-transformer vision backbone | arXiv: 2407.08083
mambavlt time-evolving multimodal state space model for vision-language tracking | arXiv: 2411.15459
mambavo deep visual odometry based on sequential matching refinement and trainin
mammalps a multi-view video behavior monitoring dataset of wild mammals in the s | arXiv: 2503.18223
manganinja line art colorization with precise reference following | arXiv: 2501.08332
mani-gs gaussian splatting manipulation with triangular mesh | arXiv: 2405.17811
maniptrans efficient dexterous bimanual manipulation transfer via residual learn | arXiv: 2503.21860
manivideo generating hand-object manipulation video with dexterous and generaliz | arXiv: 2412.16212
manta a large-scale multi-view and visual-text anomaly detection dataset for tin
manta diffusion mamba for efficient and effective stochastic long-term dense act
map unleashing hybrid mamba-transformer vision backbones potential with masked a | arXiv: 2410.00871
mapgclr geospatial contrastive learning of representations for online vectorized | arXiv: 2603.10688
mar-3d progressive masked auto-regressor for high-resolution 3d generation | arXiv: 2503.20519
marble material recomposition and blending in clip-space | arXiv: 2506.05313
mari material retrieval integration across domains | arXiv: 2503.08111
markushgrapher joint visual and textual recognition of markush structures | arXiv: 2503.16096
marten visual question answering with mask generation for multi-modal document u | arXiv: 2503.14140
marvel-40m multi-level visual elaboration for high-fidelity text-to-3d content c | arXiv: 2411.17945
mash-vlm mitigating action-scene hallucination in video-llms through disentangle
mask-adapter the devil is in the masks for open-vocabulary segmentation | arXiv: 2412.04533
mask2dit dual mask-based diffusion transformer for multi-scene long video genera
masked point-entity contrast for open-vocabulary 3d scene understanding | arXiv: 2504.19500
masked scene modeling narrowing the gap between supervised and self-supervised l
maskgaussian adaptive 3d gaussian representation from probabilistic masks | arXiv: 2412.20522
maskgwm a generalizable driving world model with video mask reconstruction | arXiv: 2502.11663
masking meets supervision a strong learning alliance | arXiv: 2306.11339
mass13k a matting-level semantic segmentation benchmark | arXiv: 2503.18364
mast3r-slam real-time dense slam with 3d reconstruction priors | arXiv: 2412.12392
mastering negation boosting grounding models via grouped opposition-based learni | arXiv: 2603.12606
matanyone stable video matting with consistent memory propagation | arXiv: 2501.14677
matcha gaussians atlas of charts for high-quality geometry and photorealism from | arXiv: 2412.06767
matcha towards matching anything
material anything generating materials for any 3d object via diffusion | arXiv: 2411.15138
matrix-free shared intrinsics bundle adjustment
matrix3d large photogrammetry model all-in-one | arXiv: 2502.07685
mbq modality-balanced quantization for large vision-language models | arXiv: 2412.19509
mc2 multi-concept guidance for customized multi-concept generation
mccd multi-agent collaboration-based compositional diffusion for complex text-to
mdp multidimensional vision model pruning with latency constraint | arXiv: 2504.02168
meat multiview diffusion model for human generation on megapixels with mesh atte
medunifier unifying vision-and-language pre-training on medical data with vision
medusa a multi-scale high-order contrastive dual-diffusion approach for multi-vi
meet towards memory-efficient temporal sparse deep neural networks
mega hybrid mesh-gaussian head avatar for high-fidelity rendering and head editi
mega masked generative autoencoder for human mesh recovery | arXiv: 2405.18839
megasam accurate fast and robust structure and motion from casual dynamic videos | arXiv: 2412.04463
megasynth scaling up 3d scene reconstruction with synthesized data | arXiv: 2412.14166
memories of forgotten concepts | arXiv: 2412.00782
merge multi-faceted hierarchical graph-based gnn for gene expression prediction
mergevq a unified framework for visual generation and representation with disent
mesc-3d mining effective semantic cues for 3d reconstruction from a single image
mesh mamba a unified state space model for saliency prediction in non-textured a | arXiv: 2504.01466
meshart generating articulated meshes with structure-guided transformers | arXiv: 2412.11596
meshgen generating pbr textured mesh with render-enhanced auto-encoder and gener
met3r measuring multi-view consistency in generated images | arXiv: 2501.06336
meta-learning hyperparameters for parameter efficient fine-tuning | arXiv: 2603.01759
metascenes towards automated replica creation for real-world 3d scans | arXiv: 2505.02388
metashadow object-centered shadow detection removal and synthesis | arXiv: 2412.02635
metaspectra a compact broadband metasurface camera for snapshot hyperspectral im | arXiv: 2603.09116
metawriter personalized handwritten text recognition using meta-learned prompt t | arXiv: 2505.20513
metricgrids arbitrary nonlinear approximation with elementary metric grids based
mexd an expert-infused diffusion model for whole-slide image classification | arXiv: 2503.12401
mfoghub bridging multi-regional and multi-satellite data for global marine fog d | arXiv: 2505.10281
mg-motionllm a unified framework for motion comprehension and generation across | arXiv: 2504.02478
mi-detr an object detection model with multi-time inquiries mechanism | arXiv: 2503.01463
micas multi-grained in-context adaptive sampling for 3d point cloud processing | arXiv: 2411.16773
microvqa a multimodal reasoning benchmark for microscopy-based scientific resear
midi multi-instance diffusion for single image to 3d scene generation | arXiv: 2412.03558
mil-pf multiple instance learning on precomputed features for mammography classi | arXiv: 2603.09374
mimic in-context learning for multimodal tasks | arXiv: 2504.08851
mimir improving video diffusion models for precise text understanding | arXiv: 2412.03085
mimo a medical vision language model with visual referring multimodal input and | arXiv: 2510.10011
mimo controllable character video synthesis with spatial decomposed modeling | arXiv: 2409.16160
mind the gap confidence discrepancy can guide federated semi-supervised learning | arXiv: 2503.13227
mind the gap detecting black-box adversarial attacks in the making through query | arXiv: 2503.02986
mind the time temporally-controlled multi-event video generation | arXiv: 2412.05263
mind the trojan horse image prompt adapter enabling scalable and deceptive jailb
minding fuzzy regions a data-driven alternating learning paradigm for stable les
minima modality invariant image matching | arXiv: 2412.19412
minimal interaction separated tuning a new paradigm for visual adaptation
minimizing labeled maximizing unlabeled an image-driven approach for video insta
minority-focused text-to-image generation via prompt optimization | arXiv: 2410.07838
mire matched implicit neural representations
mirrorverse pushing diffusion models to realistically reflect the world | arXiv: 2504.15397
missing target-relevant information prediction with world model for accurate zer
mitigating ambiguities in 3d classification with gaussian splatting | arXiv: 2503.08352
mitigating hallucinations in large vision-language models via dpo on-policy data
mitigating memorization in text-to-image diffusion via region-aware prompt augme | arXiv: 2603.13070
mitigating object hallucinations in large vision-language models with assembly o
mitigating the human-robot domain discrepancy in visual pre-training for robotic | arXiv: 2406.14235
mitracker multi-view integration for visual object tracking | arXiv: 2502.20111
mixermdm learnable composition of human motion diffusion models | arXiv: 2504.01019
mixture of submodules for domain adaptive person search
mllm-as-a-judge for image safety without human labeling | arXiv: 2501.00192
mlvu benchmarking multi-task long video understanding | arXiv: 2406.04264
mm-condchain a programmatically verified benchmark for visually grounded deep co | arXiv: 2603.12266
mm-or a large multimodal operating room dataset for semantic understanding of hi
mmar towards lossless multi-modal auto-regressive probabilistic modeling | arXiv: 2410.10798
mmaudio taming multimodal joint training for high-quality video-to-audio synthes
mmrl multi-modal representation learning for vision-language models | arXiv: 2503.08497
mmtl-uniad a unified framework for multimodal and multi-task learning in assisti
mmvu measuring expert-level multi-discipline video understanding | arXiv: 2501.12380
mne-slam multi-agent neural slam for mobile robots
mobile-gs real-time gaussian splatting for mobile devices | arXiv: 2603.11531
mobileh2r learning generalizable human to mobile robot handover exclusively from
mobilemamba lightweight multi-receptive visual mamba network | arXiv: 2411.15941
mobileportrait real-time one-shot neural head avatars on mobile devices | arXiv: 2407.05712
moda motion-drift augmentation for inertial human motion analysis
modec-gs global-to-local motion decomposition and temporal interval adjustment f
model diagnosis and correction via linguistic and implicit attribute editing
model poisoning attacks to federated learning via multi-round consistency | arXiv: 2404.15611
modeling multiple normal action representations for error detection in procedura
modeling thousands of human annotators for generalizable text-to-image person re | arXiv: 2503.09962
modeseq taming sparse multimodal motion prediction with sequential mode modeling | arXiv: 2411.11911
modfinity unsupervised domain adaptation with multimodal information flow intert
moedit on learning quantity perception for multi-object image editing | arXiv: 2503.10112
moee mixture of emotion experts for audio-driven portrait animation | arXiv: 2501.01808
moflow one-step flow matching for human trajectory forecasting via implicit maxi
moge unlocking accurate monocular geometry estimation for open-domain images wit
mokus leveraging cross-modal knowledge transfer for knowledge-aware concept cust | arXiv: 2603.12743
molmo and pixmo open weights and open data for state-of-the-art vision-language | arXiv: 2409.17146
momanipvla transferring vision-language-action models for general mobile manipul | arXiv: 2503.13446
mono-internvl pushing the boundaries of monolithic multimodal large language mod
mono2stereo a benchmark and empirical study for stereo conversion | arXiv: 2503.22262
mono3dvlt monocular-video-based 3d visual language tracking
monocular and generalizable gaussian talking head animation | arXiv: 2504.00665
monodgp monocular 3d object detection with decoupled-query and geometry-error pr
monoinstance enhancing monocular priors via multi-view instance alignment for ne
monoplace3d learning 3d-aware object placement for 3d monocular detection | arXiv: 2504.06801
monosplat generalizable 3d gaussian splatting from monocular depth foundation mo
monotakd teaching assistant knowledge distillation for monocular 3d object detec
monster marry monodepth to stereo unleashes power
morpheus text-driven 3d gaussian splat shape and color stylization | arXiv: 2503.02009
mos modeling object-scene associations in generalized category discovery | arXiv: 2503.12035
mos-attack a scalable multi-objective adversarial attack framework | arXiv: 2501.07251
mosaic of modalities a comprehensive benchmark for multimodal graph learning | arXiv: 2406.16321
mosaic3d foundation dataset and model for open-vocabulary 3d segmentation | arXiv: 2502.02548
mosca dynamic gaussian fusion from casual videos via 4d motion scaffolds | arXiv: 2405.17421
most efficient monarch sparse tuning for 3d representation learning | arXiv: 2503.18368
motif making text count in image animation with motion focal loss | arXiv: 2412.16153
motion modes what could happen next | arXiv: 2412.00148
motion prompting controlling video generation with motion trajectories | arXiv: 2412.02700
motion-grounded video reasoning understanding and perceiving motion at pixel lev
motionanymesh physics-grounded articulation for simulation-ready digital twins | arXiv: 2603.12936
motionbench benchmarking and improving fine-grained video motion understanding f
motionmap representing multimodality in human pose forecasting | arXiv: 2412.18883
motionpro a precise motion controller for image-to-video generation | arXiv: 2505.20287
motionpro exploring the role of pressure in human mocap and beyond | arXiv: 2504.05046
motions as queries one-stage multi-person holistic human motion capture
motionstone decoupled motion intensity modulation with diffusion transformer for | arXiv: 2412.05848
move-in-2d 2d-conditioned human motion generation | arXiv: 2412.13185
move-kd knowledge distillation for vlms with mixture of visual encoders | arXiv: 2501.01709
movie weaver tuning-free multi-concept video personalization with anchored promp
moviebench a hierarchical movie level dataset for long video generation | arXiv: 2411.15262
movis enhancing multi-object novel view synthesis for indoor scenes | arXiv: 2412.11457
mp-gui modality perception with mllms for gui understanding | arXiv: 2503.14021
mp-sfm monocular surface priors for robust structure-from-motion | arXiv: 2504.20040
mpdrive improving spatial understanding with marker-based prompt learning for au
mr detr instructive multi-route training for detection transformers | arXiv: 2412.10028
mtadiffusion mask text alignment diffusion model for object inpainting | arXiv: 2506.23482
multi-focal conditioned latent diffusion for person image synthesis | arXiv: 2503.15686
multi-granularity class prototype topology distillation for class-incremental so
multi-group proportional representations for text-to-image models | arXiv: 2505.24023
multi-label prototype visual spatial search for weakly supervised semantic segme
multi-layer visual feature fusion in multimodal llms methods analysis and best p | arXiv: 2503.06063
multi-modal aerial-ground cross-view place recognition with neural odes
multi-modal contrastive learning with negative sampling calibration for phenotyp
multi-modal contrastive masked autoencoders a two-stage progressive pre-training | arXiv: 2408.02245
multi-modal knowledge distillation-based human trajectory forecasting | arXiv: 2503.22201
multi-modal medical diagnosis via large-small model collaboration
multi-modal synergistic implicit image enhancement for efficient optical flow es
multi-modal topology-embedded graph learning for spatially resolved genes predic
multi-modal vision pre-training for medical image analysis | arXiv: 2410.10604
multi-party collaborative attention control for image customization | arXiv: 2505.01428
multi-resolution pathology-language pre-training model with text-guided visual r | arXiv: 2504.18856
multi-scale neighborhood occupancy masked autoencoder for self-supervised learni
multi-sensor object anomaly detection unifying appearance geometry and internal | arXiv: 2412.14592
multi-subject open-set personalization in video generation | arXiv: 2501.06187
multi-view pose-agnostic change localization with zero labels | arXiv: 2412.03911
multi-view reconstruction via sfm-guided monocular depth estimation | arXiv: 2503.14483
multigo towards multi-level geometry learning for monocular 3d textured human re
multimodal autoregressive pre-training of large vision encoders | arXiv: 2411.14402
multimodal classification of radiation-induced contrast enhancements and tumor r | arXiv: 2603.11827
multimodal ocr parse anything from documents | arXiv: 2603.13032
multimodal protein language models for enzyme kinetic parameters from substrate | arXiv: 2603.12845
multimodalstudio a heterogeneous sensor dataset and framework for neural renderi
multimorph on-demand atlas construction | arXiv: 2504.00247
multiple object tracking as id prediction | arXiv: 2403.16848
multirate neural image compression with adaptive lattice vector quantization
multiscale structure-guided latent diffusion for multimodal mri translation | arXiv: 2603.12581
multitwine multi-object compositing with text and layout control | arXiv: 2502.05165
multivent 2.0 a massive multilingual benchmark for event-centric video retrieval
must the first dataset and unified framework for multispectral uav single object | arXiv: 2503.17699
must3r multi-view network for stereo 3d reconstruction | arXiv: 2503.01661
mutri multi-view tri-alignment for oct to octa 3d image translation | arXiv: 2504.01428
mv-dust3r single-stage scene reconstruction from sparse views in 2 seconds | arXiv: 2412.06974
mv-math evaluating multimodal math reasoning in multi-visual contexts | arXiv: 2502.20808
mv-ssm multi-view state space modeling for 3d human pose estimation | arXiv: 2509.00649
mvboost boost 3d reconstruction with multi-view refinement | arXiv: 2411.17772
mvdoppler-pose multi-modal multi-view mmwave sensing for long-distance self-occl
mvgenmaster scaling multi-view generation from any image via 3d priors enhanced | arXiv: 2411.16157
mvpaint synchronized multi-view diffusion for painting anything 3d | arXiv: 2411.02336
mvportrait text-guided motion and emotion control for multi-view vivid portrait | arXiv: 2503.19383
mvsanywhere zero-shot multi-view stereo | arXiv: 2503.22430
mxnorm reusing mxfp block scales for efficient tensor normalisation | arXiv: 2603.13180
nader neural architecture design via multi-agent collaboration | arXiv: 2412.19206
narrating the video boosting text-video retrieval via comprehensive utilization
navigating image restoration with vars distribution alignment prior | arXiv: 2412.21063
navigating the unseen zero-shot scene graph generation via capsule-based equivar
navigation world models | arXiv: 2412.03572
nbavatar neural billboards avatars with realistic hand-face interaction | arXiv: 2603.12063
nearly zero-cost protection against mimicry by personalized diffusion models | arXiv: 2412.11423
neighborretr balancing hub centrality in cross-modal retrieval | arXiv: 2503.10526
neisf neural incident stokes field for polarized inverse rendering of conductors | arXiv: 2411.10189
nerfprior learning neural radiance field as a prior for indoor scene reconstruct | arXiv: 2503.18361
nested diffusion models using hierarchical latent priors | arXiv: 2412.05984
neural gate mitigating privacy risks in lvlms via neuron-level gradient gating | arXiv: 2603.12598
neural hierarchical decomposition for single image plant modeling
neural inverse rendering from propagating light | arXiv: 2506.05347
neural lightrig unlocking accurate object normal and material estimation with mu
neural motion simulator pushing the limit of world models in reinforcement learn | arXiv: 2504.07095
neural video compression with context modulation | arXiv: 2505.14541
neuro-3d towards 3d visual decoding from eeg signals | arXiv: 2411.12248
neuro-symbolic evaluation of text-to-video models using formal verification | arXiv: 2411.16718
neuron learning context-aware evolving representations for zero-shot skeleton ac
nexusgs sparse view synthesis with epipolar depth priors in 3d gaussian splattin
nightadapter learning a frequency adapter for generalizable night-time scene seg
nitrofusion high-fidelity single-step diffusion through dynamic adversarial trai
nlprompt noise-label prompt learning for vision-language models | arXiv: 2412.01256
nn-former rethinking graph structure in neural architecture representation | arXiv: 2507.00880
nnwnet rethinking the use of transformers in biomedical image segmentation and c
no pains more gains recycling sub-salient patches for efficient high-resolution
no thing nothing highlighting safety-critical classes for robust lidar semantic
node-rf learning generalized continuous space-time scene dynamics with neural od | arXiv: 2603.12078
noir neural operator mapping for implicit representations | arXiv: 2603.13118
noise calibration and spatial-frequency interactive network for stem image enhan
noise diffusion for enhancing semantic faithfulness in text-to-image synthesis | arXiv: 2411.16503
noise modeling in one hour minimizing preparation efforts for self-supervised lo
noise-consistent siamese-diffusion for medical image synthesis and segmentation | arXiv: 2505.06068
noise-resistant video anomaly detection via rgb error-guided multiscale predicti
noisectrl a sampling-algorithm-agnostic conditional generation method for diffus
non-natural image understanding with advancing frequency-based vision encoders
nonisotropic gaussian diffusion for realistic 3d human motion prediction | arXiv: 2501.06035
nopain no-box point cloud attack via optimal transport singular boundary | arXiv: 2503.00063
not all parameters matter masking diffusion models for enhancing generation abil | arXiv: 2505.03097
not federated unlearning via weight negation | arXiv: 2503.05657
not just text uncovering vision modality typographic threats in image generation | arXiv: 2412.05538
not only text exploring compositionality of visual representations in vision-lan
notes-guided mllm reasoning enhancing mllm with knowledge and visual notes for v
novel architecture of rpa in oral cancer lesion detection | arXiv: 2603.10928
novel view synthesis with pixel-space diffusion models | arXiv: 2411.07765
nsd-imagery a benchmark dataset for extending fmri vision decoding methods to me
ntclick achieving precise interactive segmentation with noise-tolerant clicks
ntr-gaussian nighttime dynamic thermal reconstruction with 4d gaussian splatting
nullu mitigating object hallucinations in large vision-language models via hallu
number it temporal grounding videos like flipping manga | arXiv: 2411.10332
nvcomposer boosting generative novel view synthesis with multiple sparse and unp
nvila efficient frontier visual language models | arXiv: 2412.04468
nyxus a next generation image feature extraction library for the big data and ai | arXiv: 2603.12016
o-tpt orthogonality constraints for calibrating test-time prompt tuning in visio
o3n omnidirectional open-vocabulary occupancy prediction | arXiv: 2603.12144
object detection using event camera a moe heat conduction based detector and a n | arXiv: 2412.06647
object-aware sound source localization via audio-visual scene understanding | arXiv: 2506.18557
object-centric prompt-driven vision-language-action model for robotic manipulati
object-shot enhanced grounding network for egocentric video | arXiv: 2505.04270
objectmover generative object movement with video prior | arXiv: 2503.08037
occlusion-aware text-image-point cloud pretraining for open-world 3d object reco
occmamba semantic occupancy prediction with state space models | arXiv: 2408.09859
ocrt boosting foundation models in the open world with object-concept-relation t | arXiv: 2503.18695
octopus alleviating hallucination via dynamic contrastive decoding | arXiv: 2503.00361
oda-gan orthogonal decoupling alignment gan assisted by weakly-supervised learni
odd-one-out anomaly detection by comparing with neighbors | arXiv: 2406.20099
ode open-set evaluation of hallucinations in multimodal large language models | arXiv: 2409.09318
odhsr online dense 3d reconstruction of humans and scenes from monocular videos | arXiv: 2504.13167
ofer occluded face expression reconstruction | arXiv: 2410.21629
offsetopt explicit surface reconstruction without normals | arXiv: 2503.15763
olympus a universal task router for computer vision tasks | arXiv: 2412.09612
omni-id holistic identity representation designed for generative tasks | arXiv: 2412.09694
omni-rgpt unifying image and video region-level understanding via token marks | arXiv: 2501.08326
omni-scene omni-gaussian representation for ego-centric sparse-view scene recons
omnia de egotempo benchmarking temporal understanding of multi-modal llms in ego
omnidirectional multi-object tracking | arXiv: 2503.04565
omnidocbench benchmarking diverse pdf document parsing with comprehensive annota
omnidrive a holistic vision-language dataset for autonomous driving with counter
omniflow any-to-any generation with multi-modal rectified flows | arXiv: 2412.01169
omnigen unified image generation | arXiv: 2409.11340
omniguard hybrid manipulation localization via augmented versatile deep image wa
omnimanip towards general robotic manipulation via object-centric interaction pr
omnimmi a comprehensive multi-modal interaction benchmark in streaming video con
omnisplat taming feed-forward 3d gaussian splatting for omnidirectional images w
omnistereo real-time omnidirectional depth estimation with multiview fisheye ca
omnistyle filtering high quality style transfer data at scale | arXiv: 2505.14028
on denoising walking videos for gait recognition | arXiv: 2505.18582
on the consistency of video large language models in temporal comprehension | arXiv: 2411.12951
on the generalization of handwritten text recognition models | arXiv: 2411.17332
on the out-of-distribution generalization of large multimodal models | arXiv: 2402.06599
on the possible detectability of image-in-image steganography | arXiv: 2603.11876
on the zero-shot adversarial robustness of vision-language models a truly zero-s
on-device self-supervised learning of low-latency monocular depth from only even
once-tuning-multiple-variants tuning once and expanded as multiple vision-langua
onda-pose occlusion-aware neural domain adaptation for self-supervised 6d object
one diffusion to generate them all | arXiv: 2411.16318
one is plenty a polymorphic feature interpreter for immutable heterogeneous coll
one model for all low-level task interaction is a key to task-agnostic image fus
one model many budgets elastic latent interfaces for diffusion transformers | arXiv: 2603.12245
one token two fates a unified framework via vision token manipulation against ml | arXiv: 2603.10360
one-for-more continual diffusion model for anomaly detection | arXiv: 2502.19848
one-minute video generation with test-time training | arXiv: 2504.05298
one-shot 3d object canonicalization based on geometric and semantic consistency
one-step event-driven high-speed autofocus | arXiv: 2503.01214
one-way ticket time-independent unified encoder for distilling text-to-image dif
one2any one-reference 6d pose estimation for any object | arXiv: 2505.04109
online task-free continual learning via dynamic expansionable memory distributio
online video understanding ovbench and videochat-online | arXiv: 2501.00584
onlineanyseg online zero-shot 3d segmentation by visual foundation model guided
oodd test-time out-of-distribution detection with dynamic dictionary | arXiv: 2503.10468
open ad-hoc categorization with contextualized feature learning | arXiv: 2512.16202
open set label shift with test time out-of-distribution reference | arXiv: 2505.05868
open-canopy towards very high resolution forest monitoring | arXiv: 2407.09392
open-vocabulary functional 3d scene graphs for real-world indoor spaces | arXiv: 2503.19199
open-world amodal appearance completion | arXiv: 2411.13019
open-world objectness modeling unifies novel object detection
openhumanvid a large-scale high-quality dataset for enhancing human-centric vide
opening a comprehensive benchmark for judging open-ended interleaved image-text | arXiv: 2411.18499
openmibood open medical imaging benchmarks for out-of-distribution detection | arXiv: 2503.16247
opensdi spotting diffusion-generated images in the open world | arXiv: 2503.19653
opportunistic single-photon time of flight
optical leveraging optimal transport for contribution allocation in dataset dist
optical-flow guided prompt optimization for coherent video generation | arXiv: 2411.15540
opticalnet an optical imaging dataset and benchmark beyond the diffraction limit
optimal transport-guided source-free adaptation for face anti-spoofing | arXiv: 2503.22984
optimizing for the shortest path in denoising diffusion model | arXiv: 2503.03265
optimus-2 multimodal minecraft agent with goal-observation-action conditioned po
oralxrays-9 towards hospital-scale panoramic x-ray anomaly detection via persona
order-one rolling shutter cameras | arXiv: 2403.11295
order-robust class incremental learning graph-driven dynamic similarity grouping | arXiv: 2502.20032
orida object-centric real-world image composition dataset | arXiv: 2506.08964
osdface one-step diffusion model for face restoration | arXiv: 2411.17163
osloprompt bridging low-supervision challenges and open-set domain generalizatio
osmamba omnidirectional spectral mamba with dual-domain prior generator for expo
osv one step is enough for high-quality image to video generation | arXiv: 2409.11367
ouroboros3d image-to-3d generation via 3d-aware recursive diffusion | arXiv: 2406.03184
out of sight out of mind evaluating state evolution in video world models | arXiv: 2603.13215
overcoming shortcut problem in vlm for robust out-of-distribution detection
overcoming visual clutter in vision language action models via concept-gated vis | arXiv: 2603.10340
overlock an overview-first-look-closely-next convnet with context-mixing dynamic | arXiv: 2502.20087
ovo-bench how far is your video-llms from real-world online video understanding | arXiv: 2501.05510
ow-ovd unified open world and open vocabulary object detection
p-slcr unsupervised point cloud semantic segmentation via prototypes structure l | arXiv: 2603.06321
pact pruning and clustering-based token reduction for faster visual language mod
paint by inpaint learning to add image objects by removing them first | arXiv: 2404.18212
panda towards panoramic depth anything with unlabeled panoramas and mobius spati
pano360 perspective to panoramic vision with geometric consistency | arXiv: 2603.12013
panoaffordancenet towards holistic affordance grounding in 360 indoor environmen | arXiv: 2603.09760
panogs gaussian-based panoptic segmentation for 3d open vocabulary scene underst
panorama generation from nfov image done right | arXiv: 2503.18420
panoramic multimodal semantic occupancy prediction for quadruped robots | arXiv: 2603.13108
pansplat 4k panorama synthesis with feed-forward gaussian splatting | arXiv: 2412.12096
lov3d grounding cognitive prognosis reasoning in longitudinal 3d bra | arXiv: 2603.12071
parahome parameterizing everyday home activities towards 3d generative modeling
parallel sequence modeling via generalized spatial propagation network | arXiv: 2501.12381
parallelized autoregressive visual generation | arXiv: 2412.15119
parameter efficient mamba tuning via projector-targeted diagonal-centric linear | arXiv: 2411.15224
parameter-efficient fine-tuning in hyperspherical space for open-vocabulary sema
parameterized blur kernel prior learning for local motion deblurring
parametric point cloud completion for polygonal surface reconstruction | arXiv: 2503.08363
parc a quantitative framework uncovering the symmetries within vision language m | arXiv: 2506.14808
partgen part-level 3d generation and reconstruction with multi-view diffusion mo
partrm modeling part-level dynamics with large cross-state reconstruction model | arXiv: 2503.19913
passionsr post-training quantization with adaptive scale in one-step diffusion b
patch matters training-free fine-grained image caption enhancement via local per
patchdemux a certifiably robust framework for multi-label classifiers against ad
patchdpo patch-level dpo for finetuning-free personalized image generation | arXiv: 2412.03177
patchguard adversarially robust anomaly detection and localization through visio
patchvsr breaking video diffusion resolution limits with patch-wise video super- | arXiv: 2509.26025
pathways on the image manifold image editing via video generation | arXiv: 2411.16819
patient-level anatomy meets scanning-level physics personalized federated low-do
pattern analogies learning to perform programmatic image edits by analogy | arXiv: 2412.12463
pave patching and adapting video large language models | arXiv: 2503.19794
pay attention to the foreground in object-centric learning
pbr-nerf inverse rendering with physics-based neural fields | arXiv: 2412.09680
pcdreamer point cloud completion through multi-view diffusion priors | arXiv: 2411.19036
pcm picard consistency model for fast parallel sampling of diffusion models | arXiv: 2503.19731
pdfactor learning tri-perspective view policy diffusion field for multi-task rob
peace empowering geologic map holistic understanding with mllms | arXiv: 2501.06184
peer pressure model-to-model regularization for single source domain generalizat
perceive what matters relevance-driven scheduling for multimodal streaming perce | arXiv: 2603.13176
percept memory and imagine world feature simulating for open-domain unknown obje
perception tokens enhance visual reasoning in multimodal language models | arXiv: 2412.03548
perceptual inductive bias is what you need before contrastive learning | arXiv: 2506.01201
perceptual video compression with neural wrapping
perceptually accurate 3d talking head generation new definitions speech-mesh rep
period-llm extending the periodic capability of multimodal large language model | arXiv: 2505.24476
perla perceptive 3d language assistant | arXiv: 2411.19774
perse personalized 3d generative avatars from a single portrait | arXiv: 2412.21206
person de-reidentification a variation-guided identity shift modeling
personabooth personalized text-to-motion generation | arXiv: 2503.07390
personahoi effortlessly improving face personalization in human-object interacti
personalized preference fine-tuning of diffusion models | arXiv: 2501.06655
perturb-and-revise flexible 3d editing with generative trajectories | arXiv: 2412.05279
pfedmxf personalized federated class-incremental learning with mixture of freque
pgc physics-based gaussian cloth from a single pose | arXiv: 2503.20779
phd a chatgpt-prompted visual hallucination evaluation dataset | arXiv: 2403.11116
phgc procedural heterogeneous graph completion for natural language task verific
phoenix a motion-based self-reflection framework for fine-grained robotic action | arXiv: 2504.14588
phys-edit physics-aware semantic image editing with text description
physanimator physics-guided generative cartoon animation | arXiv: 2501.16550
physgen3d crafting a miniature interactive world from a single image | arXiv: 2503.20746
physical plausibility-aware trajectory prediction via locomotion embodiment | arXiv: 2503.17267
physicsgen can generative models learn from images to predict complex physical r | arXiv: 2503.05333
physmodpo physically-plausible humanoid motion with preference optimization | arXiv: 2603.13228
physvlm enabling visual language models to understand robotic physical reachabil
phyt2v llm-guided iterative self-refinement for physics-grounded text-to-video g | arXiv: 2412.00596
pi-hmr towards robust in-bed temporal human shape reconstruction with contact pr
piad pose and illumination agnostic anomaly detection
picd versatile perceptual image compression with diffusion rendering | arXiv: 2505.05853
pico reconstructing 3d people in contact with objects | arXiv: 2504.17695
picosam3 real-time in-sensor region-of-interest segmentation | arXiv: 2603.11917
pidloc cross-view pose optimization network inspired by pid controllers | arXiv: 2503.02388
pidsr complementary polarized image demosaicing and super-resolution | arXiv: 2504.07758
pillarhist a quantization-aware pillar feature encoder based on height-aware his
pioneering 4-bit fp quantization for diffusion models mixup-sign quantization an
pippo high-resolution multi-view humans from a single image | arXiv: 2502.07785
pixel-aligned rgb-nir stereo imaging and dataset for robot vision | arXiv: 2411.18025
pixel-level and semantic-level adjustable super-resolution a dual-lora approach | arXiv: 2412.03017
planarsplatting accurate planar surface reconstruction in 3 minutes | arXiv: 2412.03451
playing the fool jailbreaking llms and multimodal llms with out-of-distribution | arXiv: 2503.20823
pleas - merging models with permutations and least squares | arXiv: 2407.02447
plug-and-play interpretable responsible text-to-image generation via dual-space
plug-and-play ppo an adaptive point prompt optimizer making sam greater
plug-and-play versatile compressed video enhancement | arXiv: 2504.15380
pma towards parameter-efficient point cloud understanding via point mamba adapte | arXiv: 2505.20941
po3ad predicting point offsets toward better 3d point cloud anomaly detection | arXiv: 2412.12617
point cloud upsampling using conditional diffusion module with adaptive noise su
point clouds meets physics dynamic acoustic field fitting network for point clou
point-cache test-time dynamic and hierarchical cache for robust and generalizabl
point-to-region loss for semi-supervised point-based crowd counting | arXiv: 2505.21943
point2rbox-v2 rethinking point-supervised oriented object detection with spatial
pointlora low-rank adaptation with token selection for point cloud learning | arXiv: 2504.16023
pointsr self-regularized point supervision for drone-view object detection
polarfree polarization-based reflection-free imaging | arXiv: 2503.18055
polarized color screen matting
polarnext rethink instance segmentation with polar representation
polishing the sky wide-field and high-dynamic range interferometric image recons | arXiv: 2603.09162
poly-autoregressive prediction for modeling interactions | arXiv: 2502.08646
pomp physics-consistent motion generative model through phase manifolds
pop-gs next best view in 3d-gaussian splatting with p-optimality | arXiv: 2503.07819
popen preference-based optimization and ensemble for lvlm-based reasoning segmen
population normalization for federated learning
pos3r 6d pose estimation for unseen objects made easy
pose priors from language models | arXiv: 2405.03689
pose-guided temporal enhancement for robust low-resolution hand reconstruction
posebh prototypical multi-dataset training beyond human pose estimation | arXiv: 2505.17475
posetraj pose-aware trajectory control in video diffusion | arXiv: 2503.16068
positive2negative breaking the information-lossy barrier in self-supervised sing
post-pre-training for modality alignment in vision-language foundation models | arXiv: 2504.12717
posta a go-to framework for customized artistic poster generation | arXiv: 2503.14908
postermaker towards high-quality product poster generation with accurate text re
postero structuring layout trees to enable language models in generalized conten
pot prototypical optimal transport for weakly supervised semantic segmentation
potential field based deep metric learning | arXiv: 2405.18560
pow3r empowering unconstrained 3d reconstruction with camera and scene priors | arXiv: 2503.17316
pqpp a joint benchmark for text-to-image prompt and query performance prediction | arXiv: 2406.04746
practical solutions to the relative pose of three calibrated cameras | arXiv: 2303.16078
prada projective radial distortion averaging | arXiv: 2504.16499
precise event spotting in sports videos solving long-range dependency and class | arXiv: 2503.00147
precise fast and low-cost concept erasure in value space orthogonal complement m | arXiv: 2412.06143
precisecam precise camera control for text-to-image generation | arXiv: 2501.12910
preconditioners for the stochastic training of neural fields | arXiv: 2402.08784
preditor3d fast and precise 3d shape editing | arXiv: 2412.06592
preserve or modify context-aware evaluation for balancing preservation and modif
preserving clusters in prompt learning for unsupervised domain adaptation | arXiv: 2506.11493
prior does matter visual navigation via denoising diffusion bridge models | arXiv: 2504.10041
prior-free 3d object tracking
proapo progressively automatic prompt optimization for visual classification | arXiv: 2502.19844
probabilistic prompt distribution learning for animal pose estimation | arXiv: 2503.16120
probability density geodesics in image diffusion latent space | arXiv: 2504.06675
probesdf light field probes for neural surface reconstruction | arXiv: 2412.10084
probing the mid-level vision capabilities of self-supervised learning | arXiv: 2411.17474
probpose a probabilistic approach to 2d human pose estimation | arXiv: 2412.02254
prof robot differentiable robot rendering without static and self-collisions | arXiv: 2503.11269
progress-aware video frame captioning | arXiv: 2412.02071
progressive correspondence regenerator for robust 3d registration | arXiv: 2502.02163
progressive focused transformer for single image super-resolution | arXiv: 2503.20337
progressive rendering distillation adapting stable diffusion for instant text-to
prohoc probabilistic hierarchical out-of-distribution classification via multi-d
projattacker a configurable physical adversarial attack for face recognition via
project-probe-aggregate efficient fine-tuning for group robustness | arXiv: 2503.09487
proker a kernel perspective on few-shot adaptation of large vision-language mode
prometheus 3d-aware latent diffusion models for feed-forward text-to-3d scene ge
prompt-cam making vision transformers interpretable for fine-grained analysis | arXiv: 2501.09333
prompt-driven lightweight foundation model for instance segmentation-based fault | arXiv: 2603.12624
prompt2perturb p2p text-guided diffusion-based adversarial attack on breast ultr
prompthash affinity-prompted collaborative cross-modal learning for adaptive hash
prompthmr promptable human mesh recovery | arXiv: 2504.06397
prompting depth anything for 4k resolution accurate metric depth estimation | arXiv: 2412.14015
proreflow progressive reflow with decomposed velocity | arXiv: 2503.04824
prosody-enhanced acoustic pre-training and acoustic-disentangled prosody adaptin
protecting your video content disrupting automated video-based llm annotations | arXiv: 2503.21824
protodepth unsupervised continual depth completion with prototypes | arXiv: 2503.12745
ProtoOcc: 3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation | arXiv: 2503.15185
prototype-based image prompting for weakly supervised histopathological image se
prototype-based knowledge guidance for fine-grained structured radiology reporti | arXiv: 2603.11938
provoking multi-modal few-shot lvlm via exploration-exploitation in-context lear
proximal algorithm unrolling flexible and efficient reconstruction networks for | arXiv: 2505.23180
proxytransformation preshaping point cloud manifold with proxy attention for 3d | arXiv: 2502.19247
ps-diffusion photorealistic subject-driven image editing with disentangled contr
ps-eip robust photometric stereo based on event interval profile | arXiv: 2503.18341
psa-ssl pose and size-aware self-supervised learning on lidar point clouds | arXiv: 2503.13914
psbd prediction shift uncertainty unlocks backdoor detection | arXiv: 2406.05826
pseudo visible feature fine-grained fusion for thermal object detection
pshuman photorealistic single-image 3d human reconstruction using cross-scale mu
ptdiffusion free lunch for generating optical illusion hidden pictures with phas
pup 3d-gs principled uncertainty pruning for 3d gaussian splatting | arXiv: 2406.10219
pura parameter update-recovery test-time adaption for rgb-t tracking
pursuing temporal-consistent video virtual try-on via dynamic pose interaction | arXiv: 2505.16980
pvc progressive visual token compression for unified image and video processing
pytorchgeonodes enabling differentiable shape programs for 3d shape reconstructi
q-bench-video benchmark the video quality understanding of lmms | arXiv: 2409.20063
q-dit accurate post-training quantization for diffusion transformers | arXiv: 2406.17343
q-eval-100k evaluating visual quality and alignment level for text-to-vision con
q-part quasi-periodic adaptive regression with test-time training for pediatric
qmambabsr burst image super-resolution with query state space model | arXiv: 2408.08665
quad-pixel image defocus deblurring a new benchmark and model
quaffure real-time quasi-static neural hair simulation | arXiv: 2412.10061
quantization without tears | arXiv: 2411.13918
quartdepth post-training quantization for real-time depth estimation on the edge | arXiv: 2503.16709
qucoop a versatile framework for solving composite and binary-parametrised probl
query efficient black-box visual prompting with subspace learning
question-aware gaussian experts for audio-visual question answering | arXiv: 2503.04459
r-score revisiting scene coordinate regression for robust large-scale visual loc
r-tpt improving adversarial robustness of vision-language models through test-ti
r2c mapping room to chessboard to unlock llm as low-level action planner
racformer towards high-quality 3d object detection via query-based radar-camera | arXiv: 2412.12725
rad region-aware diffusion models for image inpainting | arXiv: 2412.09191
radio frequency ray tracing with neural object representation for enhanced rf mo
radiov2.5 improved baselines for agglomerative vision foundation models
raencoder a label-free reversible adversarial examples encoder for dataset intel
rainygs efficient rain synthesis with physically-based gaussian splatting | arXiv: 2503.21442
randar decoder-only autoregressive visual generation in random orders | arXiv: 2412.01827
random conditioning for diffusion model compression with distillation | arXiv: 2504.02011
range retrieval augmented neural fields for multi-resolution geo-embeddings | arXiv: 2502.19781
rap retrieval-augmented personalization for multimodal large language models | arXiv: 2410.13360
rashomon sets for prototypical-part networks editing interpretable models in rea
rasp revisiting 3d anamorphic art for shadow-guided packing of irregular objects | arXiv: 2504.02465
rass improving denoising diffusion samplers with reinforced active sampling sche
rate-in information-driven adaptive dropout rates for improved inference-time un
rayflow instance-aware diffusion acceleration via adaptive flow trajectories | arXiv: 2503.07699
rc-autocalib an end-to-end radar-camera automatic calibration network | arXiv: 2505.22427
rcp-bench benchmarking robustness for collaborative perception under diverse cor
rdd robust feature detector and descriptor using deformable transformer | arXiv: 2505.08013
rdnet region proportion-aware dynamic adaptive salient object detection network | arXiv: 2603.12215
re-hold video hand object interaction reenactment via adaptive layout-instructed | arXiv: 2503.16942
re-thinking temporal search for long-form video understanding | arXiv: 2504.02259
real-iad d3 a real-world 2d/pseudo-3d/3d dataset for industrial anomaly detection
real-time free-view human rendering from sparse-view rgb videos using double unp
real-time high-fidelity gaussian human avatars with position-based interpolation
realedit reddit edits as a large-scale empirical dataset for image transformatio
realistic test-time adaptation of vision-language models | arXiv: 2501.03729
reanimating images using neural representations of dynamic stimuli | arXiv: 2406.02659
reason-before-retrieve one-stage reflective chain-of-thoughts for training-free
reasongrounder lvlm-guided hierarchical feature splatting for open-vocabulary 3d
reasoning in visual navigation of end-to-end trained agents a dynamical systems | arXiv: 2503.08306
reasoning mamba hypergraph-guided region relation calculating for weakly supervi
reasoning over video evaluating how mllms extract integrate and reconstruct spat | arXiv: 2603.13091
reasoning to attend try to understand how seg token works | arXiv: 2412.17741
recap better gaussian relighting with cross-environment captures | arXiv: 2412.07534
recapture generative video camera controls for user-provided videos using masked | arXiv: 2411.05003
recognition-synergistic scene text editing | arXiv: 2503.08387
recon enhancing true correspondence discrimination through relation consistency
reconciling stochastic and deterministic strategies for zero-shot image restorat
recondreamer crafting world models for driving scene reconstruction via online r | arXiv: 2411.19548
reconstructing animals and the wild | arXiv: 2411.18807
reconstructing close human interaction with appearance and proxemics reasoning | arXiv: 2507.02565
reconstructing humans with a biomechanically accurate skeleton | arXiv: 2503.21751
reconstructing in-the-wild open-vocabulary human-object interactions | arXiv: 2503.15898
reconstructing people places and cameras | arXiv: 2412.17806
reconstruction vs generation taming optimization dilemma in latent diffusion mod
recover and match open-vocabulary multi-label recognition through knowledge-cons
recovering dynamic 3d sketches from videos | arXiv: 2503.20321
rectification-specific supervision and constrained estimator for online stereo r
rectified diffusion guidance for conditional generation | arXiv: 2410.18737
recurrence-enhanced vision-and-language transformers for robust multimodal docum
recurrent feature mining and keypoint mixup padding for category-agnostic pose e | arXiv: 2503.21140
redefining creative in dictionary towards an enhanced semantic understanding of | arXiv: 2410.24160
rediffdet rotation-equivariant diffusion model for oriented object detection
reducing class-wise confusion for incremental learning with disentangled manifol
ref-gs directional factorization for 2d gaussian splatting | arXiv: 2412.00905
reference-based 3d-aware image editing with triplanes | arXiv: 2404.03632
reference-free image quality assessment for virtual try-on via human feedback | arXiv: 2603.13057
refpose leveraging reference geometric correspondences for accurate 6d pose esti
regularizing inr with diffusion prior self-supervised 3d reconstruction of neutr | arXiv: 2603.10947
reinforcing the weakest links modernizing siena with targeted deep learning inte | arXiv: 2603.12951
relation-rich visual document generator for visual information extraction | arXiv: 2504.10659
relation3d enhancing relation modeling for point cloud instance segmentation | arXiv: 2506.17891
relationfield relate anything in radiance fields | arXiv: 2412.13652
relative pose estimation through affine corrections of monocular depth priors | arXiv: 2501.05446
reloc3r large-scale training of relative camera pose regression for generalizabl
relocate a simple training-free baseline for visual query localization using reg
remote photoplethysmography in real-world and extreme lighting scenarios | arXiv: 2503.11465
removing reflections from raw photos | arXiv: 2404.14414
reneg learning negative embedding with reward guidance | arXiv: 2412.19637
reno real-time neural compression for 3d lidar point clouds | arXiv: 2503.12382
reperformer immersive human-centric volumetric videos from playback to photoreal | arXiv: 2503.12242
representation learning for spatiotemporal physical systems | arXiv: 2603.13227
reproducible vision-language models meet concepts out of pre-training
repurposing pre-trained video diffusion models for event-based video interpolati
repurposing stable diffusion attention for training-free unsupervised interactiv
reraw rgb-to-raw image reconstruction via stratified sampling for efficient obje
resclip residual attention for training-free dense vision-language inference | arXiv: 2411.15851
residual sodap residual self-organizing domain-adaptive prompting with structura | arXiv: 2603.12816
resilient sensor fusion under adverse sensor failures via multi-modal expert fus
respec relevance and specificity grounded online filtering for learning on video
restorgs depth-aware gaussian splatting for efficient 3d scene restoration
retaining knowledge and enhancing long-text representations in clip through dual
rethinking correspondence-based category-level object pose estimation
rethinking decoder design improving biomarker segmentation using depth-to-space
rethinking diffusion for text-driven human motion generation redundant represent
rethinking end-to-end 2d to 3d scene segmentation in gaussian splatting | arXiv: 2503.14029
rethinking epistemic and aleatoric uncertainty for active open-set annotation an | arXiv: 2502.19691
rethinking few-shot adaptation of vision-language models in two stages | arXiv: 2503.11609
rethinking lanes and points in complex scenarios for monocular 3d lane detection | arXiv: 2503.06237
rethinking noisy video-text retrieval via relation-aware alignment
rethinking personalized aesthetics assessment employing physique aesthetics asse
rethinking query-based transformer for continual image segmentation | arXiv: 2507.07831
rethinking reconstruction and denoising in the dark new perspective general arch
rethinking spiking self-attention mechanism implementing a-xnor similarity calcu
rethinking temporal fusion with a unified gradient descent view for 3d semantic | arXiv: 2504.12959
rethinking the adversarial robustness of multi-exit neural networks in an attack
rethinking token reduction with parameter-efficient fine-tuning in vit for pixel
rethinking training for de-biasing text-to-image generation unlocking the potent
rethinking vision-language model in face forensics multi-modal interpretable for | arXiv: 2503.20188
rethinking vlms for image forgery detection and localization | arXiv: 2603.12930
retrieving semantics from the deep an rag solution for gesture synthesis | arXiv: 2412.06786
revealing key details to see differences a novel prototypical perspective for sk
reversible decoupling network for single image reflection removal | arXiv: 2410.08063
reversing flow for image restoration | arXiv: 2506.16961
revisionllm recursive vision-language model for temporal grounding in hour-long | arXiv: 2411.14901
revisiting audio-visual segmentation with vision-centric transformer | arXiv: 2506.23623
revisiting backdoor attacks against large vision-language models from domain shi
revisiting fairness in multitask learning a performance-driven approach for vari
revisiting generative replay for class incremental object detection
revisiting mae pre-training for 3d medical image segmentation | arXiv: 2410.23132
revisiting model stitching in the foundation model era | arXiv: 2603.12433
revisiting source-free domain adaptation insights into representativeness genera
reward fine-tuning two-step diffusion models via learning differentiable latent- | arXiv: 2411.15247
rewind real-time egocentric whole-body motion diffusion with exemplar-based iden
rewind understanding long videos with instructed learnable memory | arXiv: 2411.15556
rewis3d reconstruction improves weakly-supervised semantic segmentation | arXiv: 2603.06374
rgbavatar reduced gaussian blendshapes for online modeling of head avatars | arXiv: 2503.12886
riccardo radar hit prediction and convolution for camera-radar 3d object detecti
riggs rigging of 3d gaussians for modeling articulated objects in videos | arXiv: 2503.16822
ripvis rip currents video instance segmentation benchmark for beach monitoring a | arXiv: 2504.01128
rivuletmlp an mlp-based architecture for efficient compressed video quality enha
rl-rc-dot a block-level rl agent for task-aware video compression | arXiv: 2501.12216
rlaif-v open-source ai feedback leads to super gpt-4v trustworthiness | arXiv: 2405.17220
rng relightable neural gaussians | arXiv: 2409.19702
roadsocial a diverse videoqa dataset and benchmark for road event understanding
robobrain a unified brain model for robotic manipulation from abstract to concre
roboground robotic manipulation with grounded vision-language priors | arXiv: 2504.21530
robopepp vision-based robot pose and joint angle estimation through embedding pr
robosense large-scale dataset and benchmark for egocentric robot perception and
robospatial teaching spatial understanding to 2d and 3d vision-language models f | arXiv: 2411.16537
robotic visual instruction | arXiv: 2505.00693
robotwin dual-arm robot benchmark with generative digital twins | arXiv: 2504.13059
robsense a robust multi-modal foundation model for remote sensing with static te
robust 3d shape reconstruction in zero-shot from a single image in the wild | arXiv: 2403.14539
robust audio-visual segmentation via audio-guided visual convergent alignment | arXiv: 2503.12847
robust message embedding via attention flow-based steganography | arXiv: 2405.16414
robust multi-object 4d generation for in-the-wild videos
robust multimodal survival prediction with conditional latent differentiation va
robust-mvton learning cross-pose feature alignment and fusion for robust multi-v
rocket-1 mastering open-world interaction with visual-temporal context prompting | arXiv: 2410.17856
rod-mllm towards more reliable object detection in multimodal large language mod
rogsplat learning robust generalizable human gaussian splatting from sparse mult
roictrl boosting instance control for visual generation | arXiv: 2411.17949
roll robust noisy pseudo-label learning for multi-view clustering with noisy cor
rooftop wind field reconstruction using sparse sensors from deterministic to gen | arXiv: 2603.13077
roompainter view-integrated diffusion for consistent indoor scene texturing | arXiv: 2412.16778
roomtour3d geometry-aware video-instruction tuning for embodied navigation | arXiv: 2412.08591
rorem training a robust object remover with human-in-the-loop | arXiv: 2501.00740
ros-sam high-quality interactive segmentation for remote sensing moving object | arXiv: 2503.12006
rotation-equivariant self-supervised method in image denoising | arXiv: 2505.19618
rsar restricted state angle resolver and rotated sar benchmark | arXiv: 2501.04440
rsonet region-guided selective optimization network for rgb-t salient object det | arXiv: 2603.12685
rubik a structured benchmark for image matching across geometric challenges | arXiv: 2502.19955
s2d-lfe sparse-to-dense light field event generation
s2gaussian sparse-view super-resolution 3d gaussian splatting | arXiv: 2503.04314
s3-face sss-compliant facial reflectance estimation via diffusion priors
s4-driver scalable self-supervised driving multimodal large language model with
sacb-net spatial-awareness convolutions for medical image registration | arXiv: 2503.19592
saist segment any infrared small target model guided by contrastive language-ima
salad skeleton-aware latent diffusion for text-driven motion generation and edit | arXiv: 2503.13836
salient frequency-aware paired diffusion for controllable long-tail ct detection | arXiv: 2602.23447
saliuitl ensemble salience guided recovery of adversarial patches against cnns
salova segment-augmented long video assistant for targeted retrieval and routing
sam-i2v upgrading sam to support promptable video segmentation with less than 0.2
sam-ref introducing image-prompt synergy during interaction for detail enhanceme
sam2-love segment anything model 2 in language-aided audio-visual scenes | arXiv: 2506.01558
sam2object consolidating view consistency via sam2 for zero-shot 3d instance seg
samam style-aware state space model for arbitrary image style transfer | arXiv: 2503.15934
samba a unified mamba-based framework for general salient object detection
samble shape-specific point cloud sampling for an optimal trade-off between loca
sample- and parameter-efficient auto-regressive image models | arXiv: 2411.15648
sampling innovation-based adaptive compressive sensing | arXiv: 2503.13241
samwise infusing wisdom in sam2 for text-driven video segmentation | arXiv: 2411.17646
sap segment any 4k panorama | arXiv: 2603.12759
sapave towards active perception and manipulation in vision-language-action mode | arXiv: 2603.12193
sapiensid foundation for human recognition | arXiv: 2504.04708
sar3d autoregressive 3d object generation and understanding via multi-scale 3d v | arXiv: 2411.16856
sasep saliency-aware structured separation of geometry and feature for open set
sat-hmr real-time multi-person 3d mesh estimation via scale-adaptive tokens | arXiv: 2411.19824
sata spatial autocorrelation token analysis for enhancing the robustness of visi
satellite observations guided diffusion model for accurate meteorological states
satellite to groundscape - large-scale consistent ground view generation from sa
saw toward a surgical action world model via controllable and scalable video gen | arXiv: 2603.13024
scalable autoregressive monocular depth estimation | arXiv: 2411.11361
scalable video-to-dataset generation for cross-platform mobile agents | arXiv: 2505.12632
scale efficient training for large datasets | arXiv: 2503.13385
scalelsd scalable deep line segment detection streamlined | arXiv: 2506.09369
scaling down text encoders of text-to-image diffusion models | arXiv: 2503.19897
scaling inference time compute for diffusion models
scaling mesh generation via compressive tokenization | arXiv: 2411.07025
scaling properties of diffusion models for perceptual tasks | arXiv: 2411.08034
scaling up image segmentation across data and tasks
scaling vision pre-training to 4k resolution | arXiv: 2503.19903
scamo exploring the scaling law in autoregressive motion generation model | arXiv: 2412.14559
scap transductive test-time adaptation via supportive clique-based attribute pro
scenario dreamer vectorized latent diffusion for generating driving simulation e | arXiv: 2503.22496
scene map-based prompt tuning for navigation instruction generation
scene splatter momentum 3d scene generation from single image with video diffusi
scene-agnostic pose regression for visual localization | arXiv: 2503.19543
scene-centric unsupervised panoptic segmentation | arXiv: 2504.01955
scene4u hierarchical layered 3d scene reconstruction from single panoramic image
sceneassistant a visual feedback agent for open-vocabulary 3d scene generation | arXiv: 2603.12238
scenecrafter controllable multi-view driving scene editing | arXiv: 2506.19488
scenediffuser city-scale traffic simulation via a generative world model | arXiv: 2506.21976
scenefactor factored latent 3d diffusion for controllable 3d scene generation | arXiv: 2412.01801
scenetap scene-coherent typographic adversarial planner against vision-language
scflow2 plug-and-play object pose refiner with shape-constraint scene flow | arXiv: 2504.09160
schedule on the fly diffusion time prediction for faster and better image genera
science-t2i addressing scientific illusions in image synthesis | arXiv: 2504.13129
scope scene-contextualized incremental few-shot 3d segmentation | arXiv: 2603.06572
scope semantic coreset with orthogonal projection embeddings for federated learn | arXiv: 2603.12976
scribblelight single image indoor relighting with scribbles | arXiv: 2411.17696
scsa a plug-and-play semantic continuous-sparse attention for arbitrary semantic | arXiv: 2503.04119
scsegamba lightweight structure-aware vision mamba for crack segmentation in str
sdbf steep-decision-boundary fingerprinting for hard-label tampering detection o
sdf-net structure-aware disentangled feature learning for optical-sar ship re-i | arXiv: 2603.12588
sdgocc semantic and depth-guided birds-eye view transformation for 3d multimodal | arXiv: 2507.17083
sea-ing in low-light
seal semantic attention learning for long video representation | arXiv: 2412.01798
sealion semantic part-aware latent point diffusion models for 3d generation | arXiv: 2505.17721
search and detect training-free long tail object detection via web-image retriev | arXiv: 2409.18733
sec-prompt semantic complementary prompting for few-shot class-incremental learni
secap self-calibrating and adaptive prompts for cross-view person re-identificat
secret lies in color enhancing ai-generated images detection with color distribu
see further when clear curriculum consistency model | arXiv: 2412.06295
seedvr seeding infinity in diffusion transformer towards generic video restorati
seeground see and ground for zero-shot open-vocabulary 3d visual grounding | arXiv: 2412.04383
seeing a 3d world in a grain of sand | arXiv: 2503.00260
seeing far and clearly mitigating hallucinations in mllms with attention causal | arXiv: 2505.16652
seeing is not believing adversarial natural object optimization for hard-label 3
seeing more with less human-like representations in vision models
seeing speech and sound distinguishing and locating audio sources in visual scen
seeing the abstract translating the abstract language for vision language models | arXiv: 2505.03242
seeing what matters empowering clip with patch generation-to-selection | arXiv: 2503.17080
seek common ground while reserving differences semi-supervised image-text sentim
seeking consistent flat minima for better domain generalization via refining los
seen-da semantic entropy guided domain-aware attention for domain adaptive objec
segagent exploring pixel understanding capabilities in mllms by imitating human | arXiv: 2503.08625
segearth-ov towards training-free open-vocabulary segmentation for remote sensin
segman omni-scale context modeling with state space models and local attention f
segment any motion in videos | arXiv: 2503.22268
segment any-quality images with generative latent space enhancement | arXiv: 2503.12507
segment anything even occluded | arXiv: 2503.06261
segment this thing foveated tokenization for efficient point-prompted segmentati
segmenting maxillofacial structures in cbct volumes
self-cross diffusion guidance for text-to-image synthesis of similar subjects | arXiv: 2411.18936
self-evolving visual concept library using vision-language critics | arXiv: 2504.00185
self-expansion of pre-trained models with mixture of adapters for continual lear
self-learning hyperspectral and multispectral image fusion via adaptive residual
self-supervised controlnet with spatio-temporal mamba for real-world video super | arXiv: 2506.01037
self-supervised cross-view correspondence with predictive cycle consistency
self-supervised large scale point cloud completion for archaeological site resto
self-supervised learning for color spike camera reconstruction
self-supervised spatial correspondence across modalities | arXiv: 2506.03148
selfsplat pose-free and 3d prior-free generalizable 3d gaussian splatting | arXiv: 2411.17190
semalign3d semantic correspondence between rgb-images through aligning 3d object | arXiv: 2503.22462
semantic and expressive variations in image captions across languages | arXiv: 2310.14356
semantic and sequential alignment for referring video object segmentation
semantic class distribution learning for debiasing semi-supervised medical image | arXiv: 2603.05202
semantic library adaptation lora retrieval and fusion for open-vocabulary semant | arXiv: 2503.21780
semantic satellite communications for synchronized audiovisual reconstruction | arXiv: 2603.10791
semantic-guided cross-modal prompt learning for skeleton-based zero-shot action
semanticdraw towards real-time interactive content creation from image diffusion | arXiv: 2403.09055
semgeomo dynamic contextual human motion generation with semantic and geometric | arXiv: 2503.01291
semi-supervised state-space model with dynamic stacking filter for real-world vi
semidavil semi-supervised domain adaptation with vision-language guidance for se
semiets integrating spatial and content consistencies for semi-supervised end-to
semitooth a generalizable semi-supervised framework for multi-source tooth segme | arXiv: 2603.11616
sensitivity-aware efficient fine-tuning via compact dynamic-rank adaptation
separation of powers on segregating knowledge from observation in llm-enabled kn
seq2time sequential knowledge transfer for video llm temporal grounding | arXiv: 2411.16932
seqafford sequential 3d affordance reasoning via multimodal large language model | arXiv: 2412.01550
seqmvrl a sequential fusion framework for multi-view representation learning
serialgen personalized image generation by first standardization then personaliz
seriesbench a benchmark for narrative-driven drama series understanding | arXiv: 2504.21435
set spectral enhancement for tiny object detection
seurat from moving points to depth | arXiv: 2504.14687
sf2t self-supervised fragment finetuning of video-llms for fine-grained understa
sf3d stable fast 3d mesh reconstruction with uv-unwrapping and illumination dise
sfdm robust decomposition of geometry and reflectance for realistic face renderi
sfm-free 3d gaussian splatting via hierarchical training | arXiv: 2412.01553
sgc-net stratified granular comparison network for open-vocabulary hoi detection | arXiv: 2503.00414
sgcr spherical gaussians for efficient 3d curve reconstruction | arXiv: 2505.04668
sgformer satellite-ground fusion for 3d semantic scene completion | arXiv: 2503.16825
sgma semantic-guided modality-aware segmentation for remote sensing with incompl | arXiv: 2603.02505
sgmatch semantic-guided non-rigid shape matching with flow regularization | arXiv: 2603.12937
sgsst scaling gaussian splatting style transfer
shading meets motion self-supervised indoor 3d reconstruction via simultaneous s
shadow generation using diffusion model with geometry prior
shape abstraction via marching differentiable support functions
shape and texture what influences reliable optical flow estimation
shape my moves text-driven shape-aware synthesis of human motions | arXiv: 2504.03639
shapeshifter 3d variations using multiscale and sparse point-voxel diffusion | arXiv: 2502.02187
shapewords guiding text-to-image synthesis with 3d shape-aware prompts | arXiv: 2412.02912
sharp-it a multi-view to multi-view diffusion model for 3d synthesis and manipul | arXiv: 2412.02631
sharpdepth sharpening metric depth predictions using diffusion distillation | arXiv: 2411.18229
shift the lens environment-aware unsupervised camouflaged object detection
shiftwiseconv small convolutional kernel with large kernel effect | arXiv: 2401.12736
shining yourself high-fidelity ornaments virtual try-on with diffusion model | arXiv: 2503.16065
shotadapter text-to-multi-shot video generation with diffusion models | arXiv: 2505.07652
show and segment universal medical image segmentation via in-context learning | arXiv: 2503.19359
show and tell visually explainable deep neural nets via spatially-aware concept | arXiv: 2502.20134
show dont tell detecting novel objects by watching human videos | arXiv: 2603.12751
showhowto generating scene-conditioned step-by-step visual instructions | arXiv: 2412.01987
showmak3r compositional tv show reconstruction | arXiv: 2504.19584
showui one vision-language-action model for gui visual agent | arXiv: 2411.17465
shrec a spectral embedding-based approach for ab-initio reconstruction of helica | arXiv: 2603.12307
sida social media image deepfake detection localization and explanation with lar
silence is golden leveraging adversarial examples to nullify audio control in ld
silent branding attack trigger-free data poisoning attack on text-to-image diffu
silmm self-improving large multimodal models for compositional text-to-image gen
sim-to-real causal transfer a metric learning approach to causally-aware interac
simavatar simulation-ready avatars with layered hair and clothing | arXiv: 2412.09545
similarity-guided layer-adaptive vision transformer for uav tracking | arXiv: 2503.06625
simlingo vision-only closed-loop autonomous driving with language-action alignme
simltd simple supervised and semi-supervised long-tailed object detection | arXiv: 2412.20047
simmotionedit text-based human motion editing with motion similarity prediction | arXiv: 2503.18211
simpler diffusion 1.5 fid on imagenet512 with pixel-space diffusion
simplification is all you need against out-of-distribution overconfidence
simulator hc regression-based online simulation of starting problem-solution pai
simvs simulating world inconsistencies for robust view synthesis | arXiv: 2412.07696
single domain generalization for few-shot counting via universal representation | arXiv: 2505.16778
single pixel image classification using an ultrafast digital light projector | arXiv: 2603.12036
sings animatable single-image human gaussian splats with kinematic priors
sinr sparsity driven compressed implicit neural representations | arXiv: 2503.19576
sir-diff sparse image sets restoration with multi-view diffusion model | arXiv: 2503.14463
six-cd benchmarking concept removals for text-to-image diffusion models | arXiv: 2406.14855
skdream controllable multi-view and 3d generation with arbitrary skeletons
ske-layout spatial knowledge enhanced layout generation with llms
sketch down the flops towards efficient networks for human sketch | arXiv: 2505.23763
sketchagent language-driven sequential sketch generation | arXiv: 2411.17673
sketchfusion learning universal sketch features through fusing foundation models | arXiv: 2503.14129
sketchtopia a dataset and foundational agents for benchmarking asynchronous mult
sketchvideo sketch-based video generation and editing | arXiv: 2503.23284
sketchy bounding-box supervision for 3d instance segmentation | arXiv: 2505.16399
skillmimic learning basketball interaction skills from demonstrations | arXiv: 2408.15270
skip tuning pre-trained vision-language models are effective and efficient adapt | arXiv: 2412.11509
skysense-o towards open-world remote sensing interpretation with vision-centric
slade shielding against dual exploits in large vision-language models
slam3r real-time dense scene reconstruction from monocular rgb videos | arXiv: 2412.09401
sldprtnet a large-scale multimodal dataset for cad generation in language-driven | arXiv: 2603.13098
sleepermark towards robust watermark against fine-tuning text-to-image diffusion | arXiv: 2412.04852
slidechat a large vision-language assistant for whole-slide pathology image unde
slvr super-light visual reconstruction via blueprint controllable convolutions a
small target detection based on mask-enhanced attention fusion of visible and in | arXiv: 2603.06925
smartclip modular vision-language alignment with identification guarantees | arXiv: 2507.22264
smarteraser remove anything from images using masked-region guidance | arXiv: 2501.08279
smile infusing spatial and motion semantics in masked video learning | arXiv: 2504.00527
smtpd a new benchmark for temporal prediction of social media popularity | arXiv: 2503.04446
snapgen taming high-resolution text-to-image models for mobile devices with effi
snapgen-v generating a five-second video within five seconds on a mobile device | arXiv: 2412.10494
snowmaster comprehensive real-world image desnowing via mllm with multi-model fe
soap vision-centric 3d semantic scene completion with scene-adaptive decoder and
socialgesture delving into multi-person gesture understanding | arXiv: 2504.02244
socialmoif multi-order intention fusion for pedestrian trajectory prediction | arXiv: 2504.15616
soft self-labeling and potts relaxations for weakly-supervised segmentation | arXiv: 2507.01721
softshadow leveraging soft masks for penumbra-aware shadow removal | arXiv: 2409.07041
softvq-vae efficient 1-dimensional continuous tokenizer | arXiv: 2412.10958
sogs second-order anchor for advanced 3d gaussian splatting | arXiv: 2503.07476
solami social vision-language-action modeling for immersive interaction with 3d | arXiv: 2412.00174
solve synergy of language-vision and end-to-end networks for autonomous driving | arXiv: 2505.16805
solving instance detection from an open-world perspective | arXiv: 2503.00359
soma singular value decomposed minor components adaptation for domain generaliza
sonata self-supervised learning of reliable point representations | arXiv: 2503.16429
sonic shifting focus to global audio perception in portrait animation | arXiv: 2411.16331
sortscrews a dataset and baseline for real-time screw classification | arXiv: 2603.13027
sound bridge associating egocentric and exocentric videos via audio cues
soundvista novel-view ambient sound synthesis via visual-acoustic binding | arXiv: 2504.05576
sp3d boosting sparsely-supervised 3d object detection via accurate cross-modal s | arXiv: 2503.06467
spa-vl a comprehensive safety preference alignment dataset for vision language m | arXiv: 2406.12030
spar3d stable point-aware reconstruction of 3d objects from single images | arXiv: 2501.04689
sparc score prompting and adaptive fusion for zero-shot multi-label recognition
sparrow learning spatial precision and temporal referential consistency in pixel | arXiv: 2603.12382
spars3r semantic prior alignment and regularization for sparse 3d reconstruction | arXiv: 2411.12592
sparse point cloud patches rendering via splitting 2d gaussians | arXiv: 2505.09413
sparse voxels rasterization real-time high-fidelity radiance field rendering | arXiv: 2412.04459
sparse2dgs geometry-prioritized gaussian splatting for surface reconstruction fr
sparsealign a fully sparse framework for cooperative object detection | arXiv: 2503.12982
spatial reasoning is not a free lunch a controlled study on llava | arXiv: 2603.12545
spatial transport optimization by repositioning attention map for training-free | arXiv: 2503.22168
spatial-temporal graph diffusion policy with kinematic modeling for bimanual rob
spatial-ttt streaming visual-based spatial intelligence with test-time training | arXiv: 2603.12255
spatial457 a diagnostic benchmark for 6d spatial reasoning of large multimodal mo
spatialclip learning 3d-aware image representations from spatially discriminativ
spatialdreamer self-supervised stereo video synthesis from monocular input | arXiv: 2411.11934
spatialllm a compound 3d-informed design towards spatially-intelligent large mul
spatio-semantic expert routing architecture with mixture-of-experts for referrin | arXiv: 2603.12538
spatiotemporal decoupling for efficient vision-based occupancy forecasting | arXiv: 2411.14169
spatiotemporal skip guidance for enhanced video diffusion sampling | arXiv: 2411.18664
spc-gs gaussian splatting with semantic-prompt consistency for indoor open-world
spectral defense against resource-targeting attack in 3d gaussian splatting | arXiv: 2603.12796
spectral informed mamba for robust point cloud processing | arXiv: 2503.04953
spectral state space model for rotation-invariant visual representation learning | arXiv: 2503.06369
spectral-geometric neural fields for pose-free lidar view synthesis | arXiv: 2603.12903
spectre-gs modeling highly specular surfaces with reflected nearby objects by tr
spectromotion dynamic 3d reconstruction of specular scenes | arXiv: 2410.17249
speedy-splat fast 3d gaussian splatting with sparse pixels and sparse primitives | arXiv: 2412.00578
sphereuformer a u-shaped transformer for spherical 360 perception | arXiv: 2412.06968
spherical manifold guided diffusion model for panoramic image generation
spiking transformer introducing accurate addition-only spiking self-attention fo
spiking transformer with spatial-temporal attention | arXiv: 2409.19764
spiritsight agent advanced gui agent with one look | arXiv: 2503.03196
spk2srimgnet super-resolve dynamic scene from spike stream via motion aligned co
splatad real-time lidar and camera rendering with 3d gaussian splatting for auto
splatflow multi-view rectified flow model for 3d gaussian splatting synthesis | arXiv: 2411.16443
splatflow self-supervised dynamic gaussian splatting in neural motion flow field
splatter-360 generalizable 360 gaussian splatting for wide-baseline panoramic im
splinegs robust motion-adaptive spline for real-time dynamic 3d gaussians from m | arXiv: 2412.09982
split adaptation for pre-trained vision transformers | arXiv: 2503.00441
spmtrack spatio-temporal parameter-efficient fine-tuning with mixture of experts
spotting the unexpected stu a 3d lidar dataset for anomaly segmentation in auton
sshnet unsupervised cross-modal homography estimation via problem reformulation
staa-snn spatial-temporal attention aggregator for spiking neural networks | arXiv: 2503.02689
stabilizing and accelerating autofocus with expert trajectory regularized deep r
stable flow vital layers for training-free image editing | arXiv: 2411.14430
stable-score a stable registration-based framework for 3d shape correspondence | arXiv: 2503.21766
stableanimator high-quality identity-preserving human image animation | arXiv: 2411.17697
stacking brick by brick aligned feature isolation for incremental face forgery d | arXiv: 2411.11396
stagedesigner artistic stage generation for scenography via theater scripts | arXiv: 2503.02595
star with bilinear mapping
star-edge structure-aware local spherical curve representation for thin-walled e
stargen a spatiotemporal autoregression framework with video diffusion model for
starvector generating scalable vector graphics code from images and text | arXiv: 2312.11556
stcocc sparse spatial-temporal cascade renovation for 3d occupancy and scene flo
stdd spatio-temporal dual diffusion for video generation
stdgen semantic-decomposed 3d character generation from single images | arXiv: 2411.05738
steady progress beats stagnation mutual aid of foundation and conventional model
stealthy backdoor attack in self-supervised learning vision encoders for large v | arXiv: 2502.18290
steepest descent density control for compact 3d gaussian splatting | arXiv: 2505.05587
steering away from harm an adaptive approach to defending vision language model | arXiv: 2411.16721
step enhancing video-llms compositional reasoning by spatio-temporal graph-guide
steps sequential probability tensor estimation for text-to-image hard prompt sea
stereo a two-stage framework for adversarially robust concept erasing from text-
stereo anywhere robust zero-shot deep stereo matching even where either stereo o
stereo4d learning how things move in 3d from internet stereo videos | arXiv: 2412.09621
stickmotion generating 3d human motions by drawing a stickman | arXiv: 2503.04829
stil semi-supervised tabular-image learning for comprehensive task-relevant info
sting-bee towards vision-language model for real-world x-ray baggage security in | arXiv: 2504.02823
stinr deciphering spatial transcriptomics via implicit neural representation
stochastic human motion prediction with memory of action transition and action c | arXiv: 2507.04062
stop integrated spatial-temporal dynamic prompting for video understanding | arXiv: 2503.15973
stop learning it all to mitigate visual hallucination focus on the hallucination | arXiv: 2506.11417
stop walking in circles bailing out early in projected gradient descent | arXiv: 2503.19347
storygpt-v large language models as consistent story visualizers | arXiv: 2312.02252
stpro spatial and temporal progressive learning for weakly supervised spatio-tem
strap-vit segregated tokens with randomized -- transformations for defense again | arXiv: 2603.12688
streamingt2v consistent dynamic and extendable long video generation from text | arXiv: 2403.14773
streetcrafter street view synthesis with controllable video diffusion models | arXiv: 2412.13188
stretching each dollar diffusion training from scratch on a micro-budget | arXiv: 2407.15811
structure from collision | arXiv: 2505.21335
structure-aware correspondence learning for relative pose estimation | arXiv: 2503.18671
structure-from-motion with a non-parametric camera model
structured 3d latents for scalable and versatile 3d generation | arXiv: 2412.01506
style evolving along chain-of-thought for unknown-domain object detection | arXiv: 2503.09968
style quantization for data-efficient gan training | arXiv: 2503.24282
style-editor text-driven object-centric style editing | arXiv: 2408.08461
stylemaster stylize your video with artistic generation and translation | arXiv: 2412.07744
stylessp sampling startpoint enhancement for training-free diffusion-based metho
stylestudio text-driven style transfer with selective control of style elements | arXiv: 2412.08503
subnet-aware dynamic supernet training for neural architecture search | arXiv: 2503.10740
subspace constraint and contribution estimation for heterogeneous federated lear
sufficient invariant learning for distribution shift | arXiv: 2210.13533
sum parts benchmarking part-level semantic segmentation of urban meshes | arXiv: 2503.15300
superlightnet lightweight parameter aggregation network for multimodal brain tum
superpc a single diffusion model for point cloud completion upsampling denoising | arXiv: 2503.14558
supervising sound localization by in-the-wild egomotion
surg-r1 a hierarchical reasoning foundation model for scalable and interpretable | arXiv: 2603.12430
surgeon memory-adaptive fully test-time adaptation via dynamic activation sparsi
svdc consistent direct time-of-flight video depth completion with frequency sele
svfr a unified framework for generalized video face restoration | arXiv: 2501.01235
svg-ir spatially-varying gaussian splatting for inverse rendering | arXiv: 2504.06815
svlta benchmarking vision-language temporal alignment via synthetic video situat | arXiv: 2504.05925
swiftedit lightning fast text-guided image editing via one-step diffusion | arXiv: 2412.04301
symbolic representation for any-to-any generative tasks | arXiv: 2504.17261
symdpo boosting in-context learning of large multimodal models with symbol demon
symmetry strikes back from single-image symmetry detection to 3d generation | arXiv: 2411.17763
synchronized video-to-audio generation via mel quantization-continuum decomposit | arXiv: 2503.06984
syncsde a probabilistic framework for diffusion synchronization | arXiv: 2503.21555
syncvp joint diffusion for synchronous multi-modal video prediction | arXiv: 2503.18933
synergen-vl towards synergistic image understanding and generation with vision e
synergizing motion and appearance multi-scale compensatory codebooks for talking
syntab-llava enhancing multimodal table understanding with decoupled synthesis
synthetic data is an elegant gift for continual vision-language models | arXiv: 2503.04229
synthetic prior for few-shot drivable head avatar inversion | arXiv: 2501.06903
synthetic visual genome | arXiv: 2506.07643
synthetic-to-real self-supervised robust depth estimation via learning with moti
synthlight portrait relighting with diffusion model by learning to re-render syn
t-cil temperature scaling using adversarial perturbation for calibration in clas
t-fake synthesizing thermal images for facial landmarking | arXiv: 2408.15127
t2icount enhancing cross-modal understanding for zero-shot counting | arXiv: 2502.20625
t2isafety benchmark for assessing fairness toxicity and privacy in image generat
t2sg traffic topology scene graph for topology reasoning in autonomous driving | arXiv: 2411.18894
t2v-compbench a comprehensive benchmark for compositional text-to-video generati
tacodepth towards efficient radar-camera depth estimation with one-stage fusion | arXiv: 2504.11773
tadformer task-adaptive dynamic transformer for efficient multi-task learning | arXiv: 2501.04293
taet two-stage adversarial equalization training on long-tailed distributions | arXiv: 2503.01924
taga self-supervised learning for template-free animatable gaussian articulated
tailedcore few-shot sampling for unsupervised long-tail noisy anomaly detection | arXiv: 2504.02775
take the bull by the horns learning to segment hard samples
taming score-based denoisers in admm a convergent plug-and-play framework | arXiv: 2603.10281
taming teacher forcing for masked autoregressive video generation | arXiv: 2501.12389
taming video diffusion prior with scene-grounding guidance for 3d gaussian splat
tamt temporal-aware model tuning for cross-domain few-shot action recognition | arXiv: 2411.19041
tango training-free embodied ai agents for open-world tasks | arXiv: 2412.10402
taoavatar real-time lifelike full-body talking avatars for augmented reality via
tapt test-time adversarial prompt tuning for robust inference in vision-language | arXiv: 2411.13136
targeted forgetting of image subgroups in clip models | arXiv: 2506.03117
tarot towards essentially domain-invariant robustness with theoretical justifica
tartan imu a light foundation model for inertial positioning in robotics
task preference optimization improving multimodal large language models with vis
task singular vectors reducing task interference in model merging | arXiv: 2412.00081
task-agnostic guided feature expansion for class-incremental learning | arXiv: 2503.00823
task-aware clustering for prompting vision-language models
task-aware cross-modal feature refinement transformer with large language models
task-driven image fusion with learnable fusion loss | arXiv: 2412.03240
task-specific gradient adaptation for few-shot one-class classification
taste more taste better diverse data and strong model boost semi-supervised crow | arXiv: 2503.17984
taste-rob advancing video generation of task-oriented hand-object interaction fo
taxonomy-aware evaluation of vision-language models | arXiv: 2504.05457
tcfg tangential damping classifier-free guidance | arXiv: 2503.18137
teaching large language models to regress accurate image quality scores using sc | arXiv: 2501.11561
team leya in 10th abaw competition multimodal ambivalence/hesitancy recognition a | arXiv: 2603.12848
team ras in 10th abaw competition multimodal valence and arousal estimation appr | arXiv: 2603.13056
teller real-time streaming audio-driven portrait animation with autoregressive m | arXiv: 2503.18429
temporal action detection model compression by progressive block drop | arXiv: 2503.16916
temporal alignment-free video matching for few-shot action recognition | arXiv: 2504.05956
temporal score analysis for understanding and correcting diffusion artifacts | arXiv: 2503.16218
temporal separation with entropy regularization for knowledge distillation in sp
temporally consistent object-centric learning by contrasting slots | arXiv: 2412.14295
tensoflow tensorial flow-based sampler for inverse rendering | arXiv: 2503.18328
test-time attention purification for backdoored large vision language models | arXiv: 2603.12989
test-time augmentation improves efficiency in conformal prediction | arXiv: 2505.22764
test-time backdoor detection for object detection models | arXiv: 2503.15293
test-time domain generalization via universe learning a multi-graph matching app
test-time fine-tuning of image compression models for multi-task adaptability
test-time visual in-context tuning | arXiv: 2503.21777
texgarment consistent garment uv texture generation via efficient 3d structure-g
texgaussian generating high-quality pbr material via octree-based 3d gaussian sp
text augmented correlation transformer for few-shot classification segmentation
text embedding is not all you need attention control for text-to-image semantic
text-driven fashion image editing with compositional concept learning and counte
text-guided sparse voxel pruning for efficient 3d visual grounding | arXiv: 2502.10392
text-phase synergy network with dual priors for unsupervised cross-domain image | arXiv: 2603.12711
textured gaussians for enhanced 3d scene appearance modeling | arXiv: 2411.18625
tfcustom customized image generation with time-aware frequency feature guidance
the art of deception color visual illusions and diffusion models | arXiv: 2412.10122
the change you want to detect semantic change detection in earth observation wit
the devil is in low-level features for cross-domain few-shot segmentation | arXiv: 2503.21150
the devil is in temporal token high quality video reasoning segmentation | arXiv: 2501.08549
the devil is in the prompts retrieval-augmented prompt optimization for text-to- | arXiv: 2504.11739
the illusion of unlearning the unstable nature of machine unlearning in text-to-
the impact label noise and choice of threshold has on cross-entropy and soft-dic
the language of motion unifying verbal and non-verbal language of 3d human motio
the panaf-fgbg dataset understanding the impact of backgrounds in wildlife behav
the photographers eye teaching multimodal large language models to see and criti
the power of context how multimodality improves image super-resolution | arXiv: 2503.14503
the scene language representing scenes with programs words and embeddings | arXiv: 2410.16770
theoretical insights in model inversion robustness and conditional entropy maxim
theory-inspired deep multi-view multi-label learning with incomplete views and n
thin-shell-sft fine-grained monocular non-rigid 3d surface tracking with neural | arXiv: 2503.19976
think and answer me benchmarking and exploring multi-entity reasoning grounding | arXiv: 2603.12788
think small act big primitive prompt learning for lifelong robot manipulation | arXiv: 2504.00420
thinking in dynamics how multimodal large language models perceive track and rea | arXiv: 2603.12746
thinking in space how multimodal large language models see remember and recall s | arXiv: 2412.14171
thinking in streaming video | arXiv: 2603.12938
three cars approaching within 100m enhancing distant geometry by tri-axis voxel
three-view focal length recovery from homographies | arXiv: 2501.07499
through-the-mask mask-based motion trajectories for image-to-video generation | arXiv: 2501.03059
tide training locally interpretable domain generalization models enables test-ti
tightening robustness verification of maxpool-based neural networks via minimizi
tiled diffusion | arXiv: 2412.15185
time of the flight of the gaussians optimizing depth indirectly in dynamic radia
timestep embedding tells its time to cache for video diffusion model | arXiv: 2411.19108
timetracker event-based continuous point tracking for video frame interpolation
timotion temporal and interactive framework for efficient human-human motion gen
tinyfusion diffusion transformers learned shallow | arXiv: 2412.01199
tinynav end-to-end tinyml for real-time autonomous navigation on microcontroller | arXiv: 2603.11071
tkg-dm training-free chroma key content generation diffusion model | arXiv: 2411.15580
token cropr faster vits for quite a few tasks | arXiv: 2412.00965
tokenflow unified image tokenizer for multimodal understanding and generation | arXiv: 2412.03069
tokenhsi unified synthesis of physical human-scene interactions through task tok
tokenize image patches global context fusion for effective haze removal in large | arXiv: 2504.09621
tokenmotion decoupled motion control via token disentanglement for human-centric | arXiv: 2504.08181
topnet transformer-efficient occupancy prediction network for octree-structured
topo-r1 detecting topological anomalies via vision-language models | arXiv: 2603.13054
topocellgen generating histopathology cell topology with a diffusion model | arXiv: 2412.06011
topv compatible token pruning with inference time optimization for fast and low-
tora trajectory-oriented diffusion transformer for video generation | arXiv: 2407.21705
tornadonet real-time building damage detection with ordinal supervision | arXiv: 2603.11557
touch2shape touch-conditioned 3d diffusion for shape exploration and reconstruct | arXiv: 2505.13091
toward generalized image quality assessment relaxing the perfect reference quali
toward real-world bev perception depth uncertainty estimation via gaussian splat | arXiv: 2504.01957
toward robust neural reconstruction from sparse point sets | arXiv: 2412.16361
towards a universal synthetic video detector from face or background manipulatio
towards all-in-one medical image re-identification | arXiv: 2503.08173
towards autonomous micromobility through scalable urban simulation | arXiv: 2505.00690
towards better alignment training diffusion models with reinforcement learning a
towards consistent multi-task learning unlocking the potential of task-specific
towards continual universal segmentation
towards cost-effective learning a synergy of semi-supervised and active learning
towards effective and sparse adversarial attack on spiking neural networks via b
towards efficient foundation model for zero-shot amodal segmentation
towards enhanced image inpainting mitigating unwanted object insertion and prese
towards explainable and unprecedented accuracy in matching challenging finger cr
towards explicit geometry-reflectance collaboration for generalized lidar segmen
towards faithful multimodal concept bottleneck models | arXiv: 2603.13163
towards fine-grained interpretability counterfactual explanations for misclassif
towards general visual-linguistic face forgery detection | arXiv: 2307.16545
towards generalizable scene change detection | arXiv: 2409.06214
towards generalizable trajectory prediction using dual-level representation lear
towards high-fidelity 3d talking avatar with personalized dynamic texture | arXiv: 2503.00495
towards human-understandable multi-dimensional concept discovery | arXiv: 2503.18629
towards improved text-aligned codebook learning multi-hierarchical codebook-text
towards in-the-wild 3d plane reconstruction from a single image | arXiv: 2506.02493
towards long-horizon vision-language navigation platform benchmark and method | arXiv: 2412.09082
towards lossless implicit neural representation via bit plane decomposition | arXiv: 2502.21001
towards million-scale adversarial robustness evaluation with stronger individual | arXiv: 2411.15210
towards more general video-based deepfake detection through facial component gui
towards natural language-based document image retrieval new dataset and benchmar
towards open-vocabulary audio-visual event localization | arXiv: 2411.11278
towards optimizing large-scale multi-graph matching in bioimaging
towards practical real-time neural video compression | arXiv: 2502.20762
towards precise embodied dialogue localization via causality guided diffusion
towards precise scaling laws for video diffusion transformers | arXiv: 2411.17470
towards raw object detection in diverse conditions | arXiv: 2411.15678
towards realistic example-based modeling via 3d gaussian stitching | arXiv: 2408.15708
towards satellite image road graph extraction a global-scale dataset and a novel | arXiv: 2411.16733
towards scalable human-aligned benchmark for text-guided image editing | arXiv: 2505.00502
towards smart point-and-shoot photography | arXiv: 2505.03638
towards source-free machine unlearning | arXiv: 2508.15127
towards spatio-temporal world scene graph generation from monocular videos | arXiv: 2603.13185
towards stable and storage-efficient dataset distillation matching convexified t | arXiv: 2406.19827
towards training-free anomaly detection with vision and language foundation mode
towards transformer-based aligned generation with self-coherence guidance | arXiv: 2503.17675
towards unbiased and robust spatio-temporal scene graph generation and anticipat
towards understanding and quantifying uncertainty for text-to-image generation | arXiv: 2412.03178
towards understanding how knowledge evolves in large vision-language models | arXiv: 2504.02862
towards universal ai-generated image detection by variational information bottle
towards universal computational aberration correction in photographic cameras a | arXiv: 2603.12083
towards universal dataset distillation via task-driven diffusion
towards universal soccer video understanding | arXiv: 2412.01820
towards visual discrimination and reasoning of real-world physical dynamics phys
towards zero-shot anomaly detection and reasoning with multimodal large language | arXiv: 2502.07601
tra-moe learning trajectory prediction model from multiple domains for adaptive | arXiv: 2411.14519
track any anomalous object a granular video anomaly detection pipeline
track4gen teaching video diffusion models to track points improves video generat
tracktention leveraging point tracking to attend videos faster and better | arXiv: 2503.19904
traf-align trajectory-aware feature alignment for asynchronous multi-agent perce
training data provenance verification did your model use synthetic data from my | arXiv: 2503.09122
training-free dense-aligned diffusion guidance for modular conditional image syn
training-free neural architecture search through variance of knowledge of deep n | arXiv: 2502.04975
trajectory mamba efficient attention-mamba forecasting model based on selective | arXiv: 2503.10898
transfer your perspective controllable 3d generation from any viewpoint in a dri
transformer-based multi-region segmentation and radiomic analysis of hr-pqct ima | arXiv: 2603.09137
transformers without normalization | arXiv: 2503.10622
transpixeler advancing text-to-video generation with transparency | arXiv: 2501.03006
traversing distortion-perception tradeoff using a single score-based generative | arXiv: 2503.20297
treemeshgpt artistic mesh generation with autoregressive tree sequencing | arXiv: 2503.11629
tripartite weight-space ensemble for few-shot class-incremental learning | arXiv: 2506.15720
tritex learning texture from a single mesh via triplane semantic features | arXiv: 2503.16630
trust your critic robust reward modeling and reinforcement learning for faithful | arXiv: 2603.12247
tsam temporal sam augmented with multimodal prompts for referring audio-visual s
tsd-sr one-step diffusion with target score distillation for real-world image su
tsp-mamba the travelling salesman problem meets mamba for image super-resolution
tuning the frequencies robust training for sinusoidal neural networks | arXiv: 2407.21121
turbo3d ultra-fast text-to-3d generation | arXiv: 2412.04470
turbofill adapting few-step text-to-image model for fast image inpainting | arXiv: 2504.00996
twinner shining light on digital twins in a few snaps | arXiv: 2503.08382
two by two learning multi-task pairwise objects assembly for generalizable robot | arXiv: 2504.06961
two is better than one efficient ensemble defense for robust and compact models | arXiv: 2504.04747
u-know-diffpan an uncertainty-aware knowledge distillation diffusion framework w
ua-pose uncertainty-aware 6d object pose estimation and online object completion
ucm-veid v2 a richer dataset and a pre-training method for uav cross-modality ve
ucod-dpl unsupervised camouflaged object detection via dynamic pseudo-label lear
uhd-processer unified uhd image restoration with progressive frequency learning
uibdiffusion universal imperceptible backdoor attack for diffusion models | arXiv: 2412.11441
ultrafusion ultra high dynamic imaging using exposure fusion | arXiv: 2501.11515
ultrasoundagents hierarchical multi-agent evidence-chain reasoning for breast ul | arXiv: 2603.10852
umfn unified multi-domain face normalization for joint cross-domain prototype le
umotion uncertainty-driven human motion estimation from inertial and ultra-wideb
unbiased video scene graph generation via visual and semantic dual debiasing | arXiv: 2503.00548
unbiasing through textual descriptions mitigating representation bias in video b | arXiv: 2503.18637
unboxed geometrically and temporally consistent video outpainting
uncertain multimodal intention and emotion understanding in the wild
uncertainty meets diversity a comprehensive active learning framework for indoor
uncertainty weighted gradients for model calibration | arXiv: 2503.22725
uncertainty-aware concept and motion segmentation for semi-supervised angiograph | arXiv: 2603.00881
uncertainty-guided perturbation for image super-resolution diffusion model | arXiv: 2503.18512
uncertainty-instructed structure injection for generalizable hd map construction | arXiv: 2503.23109
uncommon objects in 3d | arXiv: 2501.07574
understanding fine-tuning clip for open-vocabulary semantic segmentation in hype
understanding multi-layered transmission matrices | arXiv: 2410.23864
understanding multi-task activities from single-task videos
unem unrolled generalized em for transductive few-shot learning | arXiv: 2412.16739
uni-renderer unifying rendering and inverse rendering via dual stream diffusion | arXiv: 2412.15050
uni4d unifying visual foundation models for 4d modeling from a single video | arXiv: 2503.21761
unialign scaling multimodal alignment within one unified model
uniap unifying inter- and intra-layer automatic parallelism by mixed integer qua
unic-adapter unified image-instruction adapter with multi-modal transformer for | arXiv: 2412.18928
unicl-sam uncertainty-driven in-context segmentation with part prototype discove
unicom unified multimodal modeling via compressed continuous semantic representa | arXiv: 2603.10702
unified dense prediction of video diffusion | arXiv: 2503.09344
unified medical lesion segmentation via self-referring indicator
unified reconstruction of static and dynamic scenes from events
unified uncertainty-aware diffusion for multi-agent trajectory modeling | arXiv: 2503.18589
unigoal towards universal zero-shot goal-oriented navigation | arXiv: 2503.10630
unigrasptransformer simplified policy distillation for scalable dexterous roboti
unihope a unified approach for hand-only and hand-object pose estimation | arXiv: 2503.13303
unik3d universal camera monocular 3d estimation | arXiv: 2503.16591
unimamba unified spatial-channel representation learning with group-efficient ma
uninet a contrastive learning-guided unified framework with feature selection fo
uniphy learning a unified constitutive model for inverse physics simulation | arXiv: 2505.16971
unipose a unified multimodal framework for human pose comprehension generation a | arXiv: 2411.16781
unipre3d unified pre-training of 3d point cloud models with cross-modal gaussian | arXiv: 2506.09952
unireal universal image generation and editing via learning real-world dynamics | arXiv: 2412.07774
unirestore unified perceptual and task-oriented image restoration model using di
uniscene unified occupancy-centric driving scene generation | arXiv: 2412.05435
unistainnet foundation-model-guided virtual staining of he to ihc | arXiv: 2603.12716
unistd towards unified spatio-temporal learning across diverse disciplines | arXiv: 2503.20748
unity in diversity video editing via gradient-latent purification
univad a training-free unified model for few-shot visual anomaly detection | arXiv: 2412.03342
universal actions for enhanced embodied foundation models | arXiv: 2501.10105
universal domain adaptation for semantic segmentation | arXiv: 2505.22458
universal scene graph generation | arXiv: 2503.15005
unlearning through knowledge overwriting reversible federated unlearning via sel
unleashing in-context learning of autoregressive models for few-shot image manip
unleashing the potential of consistency learning for detecting and grounding mul
unleashing the potential of multi-modal foundation models and video diffusion fo
unleashing video language models for fine-grained hrct report generation | arXiv: 2603.12469
unlocking generalization power in lidar point cloud registration | arXiv: 2503.10149
unlocking the potential of unlabeled data in semi-supervised domain generalizati
unmasking biases and reliability concerns in convolutional neural networks analy | arXiv: 2603.12445
unopose unseen object pose estimation with an unposed rgb-d reference image | arXiv: 2411.16106
unraveling normal anatomy via fluid-driven anomaly randomization | arXiv: 2501.13370
unseen visual anomaly generation | arXiv: 2406.01078
unsupervised continual domain shift learning with multi-prototype modeling
unsupervised discovery of facial landmarks and head pose
unsupervised foundation model-agnostic slide-level representation learning | arXiv: 2411.13623
unveil inversion and invariance in flow transformer for versatile image editing | arXiv: 2411.15843
unveiling differences in generative models a scalable differential clustering ap
unveiling the ignorance of mllms seeing clearly answering incorrectly | arXiv: 2406.10638
unveiling the mist over 3d vision-language understanding object-centric evaluati
unveiling visual perception in language models an attention head analysis approa
upme an unsupervised peer review framework for multimodal large language model e | arXiv: 2503.14941
urbancad towards highly controllable and photorealistic 3d vehicles for urban sc
urwkv unified rwkv model with multi-state perspective for low-light image restor | arXiv: 2505.23068
using diffusion priors for video amodal segmentation | arXiv: 2412.04623
using powerful prior knowledge of diffusion model in deep unfolding networks for | arXiv: 2503.08429
usp-gaussian unifying spike-based image reconstruction pose correction and gauss
uvgs reimagining unstructured 3d gaussian splatting using uv mapping | arXiv: 2502.01846
uwav uncertainty-weighted weakly-supervised audio-visual video parsing | arXiv: 2505.09615
v-bridge bridging video generative priors to versatile few-shot image restoratio | arXiv: 2603.13089
v-clr view-consistent learning for open-world instance segmentation | arXiv: 2504.01383
v-stylist video stylization via collaboration and reflection of mllm agents | arXiv: 2503.12077
v2dial unification of video and visual dialog via multimodal experts
v2v3d view-to-view denoised 3d reconstruction for light field microscopy
v2x-r cooperative lidar-4d radar fusion with denoising diffusion for 3d object d | arXiv: 2411.08402
variance-based membership inference attacks against large-scale image captioning
variational garrote for sparse inverse problems | arXiv: 2603.12562
varsplat uncertainty-aware 3d gaussian splatting for robust rgb-d slam | arXiv: 2603.09673
vasparse towards efficient visual hallucination mitigation via visual-aware toke
vastsd learning 3d vascular tree-state space diffusion model for angiography syn
vcbench a streaming counting benchmark for spatial-temporal state maintenance in | arXiv: 2603.12703
vdocrag retrieval-augmented generation over visually-rich documents | arXiv: 2504.09795
velociti benchmarking video-language compositional reasoning with strict entailm
vera explainable video anomaly detection via verbalized learning of vision-langu
verbdiff text-only diffusion models with enhanced interaction awareness | arXiv: 2503.16406
vesselfm a foundation model for universal 3d blood vessel segmentation | arXiv: 2411.17386
veu-bench towards comprehensive understanding of video editing | arXiv: 2504.17828
vggt visual geometry grounded transformer | arXiv: 2503.11651
vi3nr variance informed initialization for implicit neural representations | arXiv: 2504.19270
vicas a dataset for combining holistic and pixel-level video understanding using
vid2avatar-pro authentic avatar from videos in the wild via universal prior | arXiv: 2503.01610
vid2sim generalizable video-based reconstruction of appearance geometry and phys
vid2sim realistic and interactive simulation from video for urban navigation | arXiv: 2501.06693
vidbot learning generalizable 3d actions from in-the-wild 2d human videos for ze
vidcomposition can mllms analyze compositions in compiled videos | arXiv: 2411.10979
video depth anything consistent depth estimation for super-long videos | arXiv: 2501.12375
video depth without video models | arXiv: 2411.19189
video language model pretraining with spatio-temporal masking
video motion transfer with diffusion transformers | arXiv: 2412.07776
video streaming thinking videollms can watch and think simultaneously | arXiv: 2603.12262
video summarization with large language models | arXiv: 2504.11199
video-3d llm learning position-aware video representation for 3d scene understan
video-bench human-aligned video generation benchmark | arXiv: 2504.04907
video-colbert contextualized late interaction for text-to-video retrieval | arXiv: 2503.19009
video-guided foley sound generation with multimodal controls | arXiv: 2411.17698
video-mme the first-ever comprehensive evaluation benchmark of multi-modal llms
video-panda parameter-efficient alignment for encoder-free video-language models | arXiv: 2412.18609
video-xl extra-long vision language model for hour-scale video understanding | arXiv: 2409.14485
videoautoarena an automated arena for evaluating large multimodal models in vide
videocomp advancing fine-grained compositional and temporal alignment in video-t
videodirector precise video editing via text-to-video models | arXiv: 2411.17592
videodpo omni-preference alignment for video diffusion generation | arXiv: 2412.14167
videoespresso a large-scale chain-of-thought dataset for fine-grained video reas
videogem training-free action grounding in videos | arXiv: 2503.20348
videogigagan towards detail-rich video super-resolution | arXiv: 2404.12388
videoglamm a large multimodal model for pixel-level visual grounding in videos | arXiv: 2411.04923
videoguide improving video diffusion models without training through a teachers | arXiv: 2410.04364
videohandles editing 3d object compositions in videos using video generative pri
videoicl confidence-based iterative in-context learning for out-of-distribution
videomage multi-subject and motion customization of text-to-video diffusion mode
videorefer suite advancing spatial-temporal object understanding with video llm | arXiv: 2501.00599
videoscene distilling video diffusion model to generate 3d scenes in one step | arXiv: 2504.01956
videospats video spatiotemporal splines for disentangled occlusion appearance an
videotree adaptive tree-based video representation for llm reasoning on long vid
videoworld exploring knowledge learning from unlabeled videos | arXiv: 2501.09781
vidhalluc evaluating temporal hallucinations in multimodal large language models
vidmuse a simple video-to-music generation framework with long-short-term modeli
vidseg training-free video semantic segmentation based on diffusion models
vidtwin video vae with decoupled structure and dynamics | arXiv: 2412.17726
viewpoint rosetta stone unlocking unpaired ego-exo videos for view-invariant rep
viineus volumetric initialization for implicit neural surface reconstruction of
vikienet towards efficient 3d object detection with virtual key instance enhance
vila-m3 enhancing vision-language models with medical expert knowledge | arXiv: 2411.12915
vinabench benchmark for faithful and consistent visual narratives | arXiv: 2503.20871
vintage joint video and text conditioning for holistic audio generation | arXiv: 2412.10768
vird view-invariant representation through dual-axis transformation for cross-vi | arXiv: 2603.12918
vires video instance repainting via sketch and text guided generation | arXiv: 2411.16199
visco benchmarking fine-grained critique and correction towards self-improvement
vision-guided action enhancing 3d human motion prediction with gaze-informed aff
vision-language embodiment for monocular depth estimation | arXiv: 2503.16535
vision-language gradient descent-driven all-in-one deep unfolding networks | arXiv: 2503.16930
vision-language model ip protection via prompt-based learning | arXiv: 2503.02393
vision-language models do not understand negation | arXiv: 2501.09425
visionarena 230k real world user-vlm conversations with preference labels | arXiv: 2412.08687
visionpad a vision-centric pre-training paradigm for autonomous driving | arXiv: 2411.14716
visionzip longer is better but not necessary in vision language models | arXiv: 2412.04467
vista enhancing long-duration and high-resolution video understanding by video s | arXiv: 2412.00927
vista3d a unified segmentation foundation model for 3d medical imaging | arXiv: 2406.05285
vistream improving computation efficiency of visual streaming perception via law
visual agentic ai for spatial reasoning with a dynamic api | arXiv: 2502.06787
visual and semantic prompt collaboration for generalized zero-shot learning | arXiv: 2503.23030
visual consensus prompting for co-salient object detection | arXiv: 2504.14254
visual lexicon rich image features in language space | arXiv: 2412.06774
visual persona foundation model for full-body human customization | arXiv: 2503.15406
visual prompting for one-shot controllable video editing without inversion | arXiv: 2504.14335
visual representation learning through causal intervention for controllable imag
visual-erm reward modeling for visual equivalence | arXiv: 2603.13224
visual-instructed degradation diffusion for all-in-one image restoration | arXiv: 2506.16960
vited video temporal evidence distillation | arXiv: 2503.12855
viunit visual unit tests for more robust visual programming | arXiv: 2412.08859
vl-rewardbench a challenging benchmark for vision-language generative reward mod
vl2lite task-specific knowledge distillation from large vision-language models t
vladva discriminative fine-tuning of lvlms | arXiv: 2412.04378
vlms-guided representation distillation for efficient vision-based reinforcement
vlog video-language models by generative retrieval of narration vocabulary | arXiv: 2503.09402
vlogger multimodal diffusion for embodied avatar synthesis | arXiv: 2403.08764
vlsi verbalized layers-to-interactions from large to small vision language model | arXiv: 2412.01822
voco-llama towards vision compression with large language models | arXiv: 2406.12275
vodiff controlling object visibility order in text-to-image generation
volformer explore more comprehensive cube interaction for hyperspectral image re
volume tells dual cycle-consistent diffusion for 3d fluorescence microscopy de-n
volumetric surfaces representing fuzzy geometries with layered meshes | arXiv: 2409.02482
volumetrically consistent 3d gaussian rasterization | arXiv: 2412.03378
voteflow enforcing local rigidity in self-supervised scene flow | arXiv: 2503.22328
voxelsplat dynamic gaussian splatting as an effective loss for occupancy and flo
vsnet focusing on the linguistic characteristics of sign language
vton 360 high-fidelity virtual try-on from any viewing direction | arXiv: 2503.12165
vton-handfit virtual try-on for arbitrary hand pose guided by hand priors embedd
watermarking one for all a robust watermarking scheme against partial image thef
wav2sem plug-and-play audio semantic decoupling for 3d speech-driven facial anim | arXiv: 2505.23290
wave weight templates for adaptive initialization of variable-sized models | arXiv: 2406.17503
wavelet and prototype augmented query-based transformer for pixel-level surface
weakly supervised contrastive adversarial training for learning robust features
weakly supervised semantic segmentation via progressive confidence region expans
weakly supervised teacher-student framework with progressive pseudo-mask refinem | arXiv: 2603.08605
weakly supervised temporal action localization via dual-prior collaborative lear
weakmcn multi-task collaborative network for weakly supervised referring express
wear classification of abrasive flap wheels using a hierarchical deep learning a | arXiv: 2603.12852
weathergen a unified diverse weather generator for lidar point clouds via spider | arXiv: 2504.13561
wegen a unified model for interactive multimodal generation as we chat | arXiv: 2503.01115
wf-vae enhancing video vae by wavelet-driven energy flow for latent video diffus
what makes a good dataset for knowledge distillation | arXiv: 2411.12817
whats in the image a deep-dive into the vision of vision language models | arXiv: 2411.17491
when domain generalization meets generalized category discovery an adaptive task
when the future becomes the past taming temporal correspondence for self-supervi
when to lock attention training-free kv control in video diffusion | arXiv: 2603.09657
where the devil hides deepfake detectors can no longer be trusted | arXiv: 2505.08255
wheres the liability in the generative era recovery-based black-box detection of | arXiv: 2505.01008
which viewpoint shows it best language for weakly supervising view selection in | arXiv: 2411.08753
why does it look there structured explanations for image classification | arXiv: 2603.10234
wildavatar learning in-the-wild 3d avatars from the web | arXiv: 2407.02165
wildgs-slam monocular gaussian splatting slam in dynamic environments | arXiv: 2504.03886
wilor end-to-end 3d hand localization and reconstruction in-the-wild | arXiv: 2409.12259
wise a framework for gigapixel whole-slide-image lossless compression | arXiv: 2503.18074
wish weakly supervised instance segmentation using heterogeneous labels
wisnet pseudo label generation on unbalanced and patch annotated waste images
wonderland navigating 3d scenes from a single image | arXiv: 2412.12091
wonderworld interactive 3d scene generation from a single image | arXiv: 2406.09394
words or vision do vision-language models have blind faith in text | arXiv: 2503.02199
world-consistent video diffusion with explicit 3d modeling | arXiv: 2412.01821
world2act latent action post-training via skill-compositional world models | arXiv: 2603.10422
x-dyna expressive dynamic human image animation | arXiv: 2501.10021
xlrs-bench could your multimodal llms understand extremely large ultra-high-reso
yochameleon personalized vision and language generation | arXiv: 2504.20998
you see it you got it learning 3d creation on pose-free videos at scale | arXiv: 2412.06699
your large vision-language model only needs a few attention heads for visual gro | arXiv: 2503.06287
your scale factors are my weapon targeted bit-flip attacks on vision transformer
your vit is secretly an image segmentation model | arXiv: 2503.19108
z-magic zero-shot multiple attributes guided image creator | arXiv: 2503.12124
zero-1-to-a zero-shot one image to animatable head avatars using video diffusion | arXiv: 2503.15851
zero-shot 3d question answering via voxel-based dynamic token compression
zero-shot 4d lidar panoptic segmentation | arXiv: 2504.00848
zero-shot blind-spot image denoising via implicit neural sampling
zero-shot head swapping in real-world scenarios | arXiv: 2503.00861
zero-shot image restoration using few-step guidance of consistency models and be | arXiv: 2412.20596
zero-shot monocular scene flow estimation in the wild | arXiv: 2501.10357
zero-shot novel view and depth synthesis with multi-view geometric diffusion | arXiv: 2501.18804
zero-shot rgb-d point cloud registration with pre-trained large vision model
zero-shot styled text image generation but make it autoregressive | arXiv: 2503.17074
zerograsp zero-shot shape reconstruction enabled robotic grasping | arXiv: 2504.10857
zerovo visual odometry with minimal assumptions | arXiv: 2506.08005
zo-sam zero-order sharpness-aware minimization for efficient sparse training | arXiv: 2603.13115
zoomldm latent diffusion model for multi-scale image generation | arXiv: 2411.16969
dual exposure stereo extended dr 3d | arXiv: 2412.02351
dualpm dual point maps shape pose | arXiv: 2412.04464
dune universal encoder distillation | arXiv: 2503.14405
dyn hamr recovering 4d interacting hand motion from a dynamic camera
faster focal token acquiring-and-scaling transformer for long-term 3d objection | arXiv: 2503.01899
flare sparse view reconstruction | arXiv: 2502.12138
magic-slam multi-agent gaussian globally consistent slam | arXiv: 2411.16785
murre sfm guided depth reconstruction | arXiv: 2503.14483
mv 3dcd multiview change detection | arXiv: 2412.03911
climbingcap multi-modal dataset and method for rock climbing in world | arXiv: 2503.21268
gdfusion temporal fusion occupancy | arXiv: 2504.12959
codepercept code-grounded visual stem perception for mllms | arXiv: 2603.10757
motionrefit motion editing | arXiv: 2503.20724
dual diffusion unified generation understanding | arXiv: 2501.00289
dualanodiff few shot anomaly image generation | arXiv: 2408.13509
easycraft avatar crafting | arXiv: 2503.01158
fade fine grained erasure diffusion | arXiv: 2503.19783
filmcomposer llm music production | arXiv: 2503.08147
finelip clip long text fine grained | arXiv: 2504.01916
flipsketch sketch animation | arXiv: 2411.10818
mca ctrl attention control customization | arXiv: 2505.01428
dpir dual prompting restoration dit | arXiv: 2504.17825
advancing myopia to holism fully contrastive language-image pre-training | arXiv: 2412.00440
chathuman chatting about 3d humans with tools | arXiv: 2405.04533
cobra combinatorial retrieval augmentation for few-shot adaptation | arXiv: 2412.17684
docopilot improving multimodal models for document-level understanding | arXiv: 2507.14675
ezsr event-based zero-shot recognition | arXiv: 2407.21616
few-shot recognition via stage-wise retrieval-augmented finetuning | arXiv: 2406.11148
genius a generative framework for universal multimodal search | arXiv: 2503.19868
goal global-local object alignment learning | arXiv: 2503.17782
joint vision-language social bias removal for clip | arXiv: 2411.12785
lamra large multimodal model as your advanced retrieval assistant | arXiv: 2412.01720
lotusfilter fast diverse nearest neighbor search via a learned cutoff table | arXiv: 2506.04790
neighborretr balancing hub centrality in cross-modal retrieval | arXiv: 2503.10526
preserving clusters in prompt learning for unsupervised domain adaptation | arXiv: 2506.11493
range retrieval augmented neural fields for multi-resolution geo-embeddings | arXiv: 2502.19781
towards smart point-and-shoot photography | arXiv: 2505.03638
vdocrag retrieval-augmented generation over visually-rich documents | arXiv: 2504.09795
vladva discriminative fine-tuning of lvlms | arXiv: 2412.04378
albm attribute concept space | arXiv: 2503.20301
attribute-formed class-specific concept space endowing language bottleneck model
differentiable inverse rendering with interpretable basis brdfs | arXiv: 2411.17994
geometry-guided camera motion understanding in videollms | arXiv: 2603.13119
interpretable image classification via non-parametric part prototype learning | arXiv: 2503.10247
kvq boosting video quality assessment via saliency-guided local perception | arXiv: 2503.10259
l-swag layer-sample wise activation with gradients information for zero-shot nas
language guided concept bottleneck models for interpretable continual learning
learning on model weights using tree experts | arXiv: 2410.13569
learning visual composition through improved semantic guidance | arXiv: 2412.15396
lswag zero shot nas | arXiv: 2505.07300
on the possible detectability of image-in-image steganography | arXiv: 2603.11876
open ad-hoc categorization with contextualized feature learning | arXiv: 2512.16202
probing the mid-level vision capabilities of self-supervised learning | arXiv: 2411.17474
prompt-cam making vision transformers interpretable for fine-grained analysis | arXiv: 2501.09333
sample- and parameter-efficient auto-regressive image models | arXiv: 2411.15648
scaling vision pre-training to 4k resolution | arXiv: 2503.19903
tide domain generalization | arXiv: 2411.16788
tide training locally interpretable domain generalization models enables test time correction | arXiv: 2411.16788
towards faithful multimodal concept bottleneck models | arXiv: 2603.13163
towards human-understandable multi-dimensional concept discovery | arXiv: 2503.18629
why does it look there structured explanations for image classification | arXiv: 2603.10234
cad llama parametric | arXiv: 2505.04481
capo multi preference | arXiv: 2502.02588
inpo inversion preference optimization diffusion alignment | arXiv: 2503.18454
sam dpo semi supervised | arXiv: 2503.04639
sam dpo semi supervised medical segmentation | arXiv: 2503.04639
spo aesthetic post training | arXiv: 2406.04314
symdpo symbol icl | arXiv: 2411.11909
care transformer linear attention | arXiv: 2411.16170
moee mixture expert extraction | arXiv: 2505.15414
comfybench benchmarking llm-based agents in comfyui for autonomously designing c | arXiv: 2409.01392
context-cir learning from concepts in text for composed image retrieval | arXiv: 2505.20764
dense match summarization for faster two-view estimation
do imagenet-trained models learn shortcuts the impact of frequency shortcuts on | arXiv: 2503.03519
dora sampling and benchmarking for 3d shape variational auto-encoders | arXiv: 2412.17808
dual consolidation for pre-trained model-based domain-incremental learning | arXiv: 2410.00911
erase diffusion empowering object removal through calibrating diffusion pathways | arXiv: 2503.07026
event ellipsometer event-based mueller-matrix video imaging | arXiv: 2411.17313
exposure-slot exposure-centric representations learning with slot-in-slot attent
gradient-guided annealing for domain generalization | arXiv: 2502.20162
lotus large-scale machine unlearning with a taste of uncertainty | arXiv: 2503.18314
making old film great again degradation-aware state space model for old film res
on the generalization of handwritten text recognition models | arXiv: 2411.17332
oodd test-time out-of-distribution detection with dynamic dictionary | arXiv: 2503.10468
out of sight out of mind evaluating state evolution in video world models | arXiv: 2603.13215
polarfree polarization-based reflection-free imaging | arXiv: 2503.18055
postero structuring layout trees to enable language models in generalized conten | arXiv: 2505.07843
potential field based deep metric learning | arXiv: 2405.18560
practical solutions to the relative pose of three calibrated cameras | arXiv: 2303.16078
roadsocial a diverse videoqa dataset and benchmark for road event understanding
sata spatial autocorrelation token analysis for enhancing the robustness of visi
scene-agnostic pose regression for visual localization | arXiv: 2503.19543
sufficient invariant learning for distribution shift | arXiv: 2210.13533
traf-align trajectory-aware feature alignment for asynchronous multi-agent perce | arXiv: 2503.19391
uncertainty weighted gradients for model calibration | arXiv: 2503.22725
vinabench benchmark for faithful and consistent visual narratives | arXiv: 2503.20871
comrope rotary position | arXiv: 2506.03737
sec-prompt semantic complementary prompting for few-shot class-incremental learni
3d prior is all you need cross-task few-shot 2d gaze estimation | arXiv: 2502.04074
a unified framework for heterogeneous semi-supervised learning | arXiv: 2503.00286
bridging the vision-brain gap with an uncertainty-aware blur prior | arXiv: 2503.04207
dreamtext high fidelity scene text synthesis | arXiv: 2405.14701
hsemotion team at abaw-10 competition facial expression recognition valence-arou | arXiv: 2603.12693
improving autoregressive visual generation with cluster-oriented token predictio | arXiv: 2501.00880
lost in translation found in context sign language translation with contextual c | arXiv: 2501.09754
mxnorm reusing mxfp block scales for efficient tensor normalisation | arXiv: 2603.13180
precise event spotting in sports videos solving long-range dependency and class | arXiv: 2503.00147
robust message embedding via attention flow-based steganography
scamo exploring the scaling law in autoregressive motion generation model | arXiv: 2412.14559
softshadow leveraging soft masks for penumbra-aware shadow removal | arXiv: 2409.07041
the change you want to detect semantic change detection in earth observation wit
the scene language representing scenes with programs words and embeddings | arXiv: 2410.16770
vires video instance repainting via sketch and text guided generation
osrcir reflective cot | arXiv: 2412.11077
videoespresso cot reasoning | arXiv: 2411.14794
empowering llms to understand and generate complex vector graphics
order-robust class incremental learning graph-driven dynamic similarity grouping | arXiv: 2502.20032
mr plip multi resolution pathology | arXiv: 2504.18856
autossvh exploring automated frame sampling for efficient self-supervised video h | arXiv: 2504.03587
l swag zero shot nas vision transformers | arXiv: 2505.07300
harnessing frozen unimodal encoders for flexible multimodal alignment | arXiv: 2409.19425
semantic and expressive variations in image captions across languages | arXiv: 2310.14356
smtpd a new benchmark for temporal prediction of social media popularity | arXiv: 2503.04446
document haystacks vision-language reasoning over piles of 1000 documents | arXiv: 2411.16740
homesafe-bench evaluating vision-language models on unsafe action detection for | arXiv: 2603.11975
multi-modal contrastive masked autoencoders a two-stage progressive pre-training | arXiv: 2408.02245
on the out-of-distribution generalization of large multimodal models | arXiv: 2402.06599
videoglamm a large multimodal model for pixel-level visual grounding in videos | arXiv: 2411.04923
generative modeling of class probability for multi modal representation learning | arXiv: 2503.17417
mulsen ad multi sensor anomaly detection | arXiv: 2412.14592
sdf-net structure-aware disentangled feature learning for optical-sar ship re-i | arXiv: 2603.12588
strap-vit segregated tokens with randomized -- transformations for defense again | arXiv: 2603.12688
calf communication aware distributed rl | arXiv: 2603.12543
asap advancing semantic alignment promotes multi-modal manipulation de | arXiv: 2412.12718
coordinated manipulation hybrid deformable rigid objects | arXiv: 2603.12940
foundations of the theory of performance based ranking | arXiv: 2412.04227
lift3d policy lifting 2d foundation models for robust 3d robotic manipulation | arXiv: 2411.18623
assessing and learning alignment of unimodal vision and language model | arXiv: 2412.04616
autossvh exploring automated frame sampling for efficient self-supervised video
chexworld exploring image world modeling for radiograph representation learning
as language models scale low-order linear depth dynamics emerge | arXiv: 2603.12541
as language models scale low-order linear depth dynamics emerge v2 | arXiv: 2603.12541
classifier-guided clip distillation for unsupervised multi-label classification | arXiv: 2503.16873
classifier-to-bias toward unsupervised automatic bias detection for visual class | arXiv: 2504.20902
learning from neighbors category extrapolation for long-tail learning | arXiv: 2410.15980
let samples speak mitigating spurious correlation by exploiting the clusterness
4real-video learning generalizable photo-realistic 4d video diffusion | arXiv: 2412.04462
animateanything consistent and controllable animation for video generation | arXiv: 2411.10836
articulated kinematics distillation from video diffusion models | arXiv: 2504.01204
bf-stvsr b-splines and fourier---best friends for high fidelity spatial-temporal | arXiv: 2501.11043
can text-to-video generation help video-language alignment | arXiv: 2503.18507
conmo controllable motion disentanglement and recomposition for zero-shot motion | arXiv: 2504.02451
dynamic camera poses and where to find them | arXiv: 2504.17788
dynamicscaler panoramic video | arXiv: 2412.11100
dynamicscaler seamless and scalable video generation for panoramic scenes
exploring temporally-aware features for point tracking | arXiv: 2501.12218
fade frequency-aware diffusion model factorization for video editing | arXiv: 2506.05934
flashmotion few-step controllable video generation with trajectory guidance | arXiv: 2603.12146
from slow bidirectional to fast autoregressive video diffusion models | arXiv: 2412.07772
gen3c 3d-informed world-consistent video generation with precise camera control | arXiv: 2503.03751
generative inbetweening through frame-wise conditions-driven video generation | arXiv: 2412.11755
geometry-guided online 3d video synthesis with multi-view temporal consistency | arXiv: 2505.18932
hoigen-1m a large-scale dataset for human-object interaction video generation | arXiv: 2503.23715
hunyuanportrait implicit condition control for enhanced portrait animation | arXiv: 2503.18860
hypernvd accelerating neural video decomposition via hypernetworks | arXiv: 2503.17276
identity-preserving text-to-video generation by frequency decomposition | arXiv: 2411.17440
idol instant photorealistic 3d human creation from a single image | arXiv: 2412.14963
improved video vae for latent video diffusion model | arXiv: 2411.06449
interdyn controllable interactive dynamics with video diffusion models | arXiv: 2412.11785
learning from streaming video with orthogonal gradients | arXiv: 2504.01961
learning temporally consistent video depth from video diffusion priors | arXiv: 2406.01493
levitor 3d trajectory oriented image-to-video synthesis | arXiv: 2412.15214
long video diffusion generation with segmented cross-attention and content-rich | arXiv: 2412.01316
longdiff training-free long video generation in one go | arXiv: 2503.18150
mimir improving video diffusion models for precise text understanding | arXiv: 2412.03085
mimo controllable character video synthesis with spatial decomposed modeling | arXiv: 2409.16160
mind the time temporally-controlled multi-event video generation | arXiv: 2412.05263
motif making text count in image animation with motion focal loss | arXiv: 2412.16153
motion modes what could happen next | arXiv: 2412.00148
motion prompting controlling video generation with motion trajectories | arXiv: 2412.02700
motionpro a precise motion controller for image-to-video generation | arXiv: 2505.20287
motionstone decoupled motion intensity modulation with diffusion transformer for | arXiv: 2412.05848
moviebench a hierarchical movie level dataset for long video generation | arXiv: 2411.15262
multi-subject open-set personalization in video generation | arXiv: 2501.06187
navigation world models | arXiv: 2412.03572
neuro-symbolic evaluation of text-to-video models using formal verification | arXiv: 2411.16718
one-minute video generation with test-time training | arXiv: 2504.05298
optical-flow guided prompt optimization for coherent video generation | arXiv: 2411.15540
osv one step is enough for high-quality image to video generation | arXiv: 2409.11367
parallelized autoregressive visual generation | arXiv: 2412.15119
patchvsr breaking video diffusion resolution limits with patch-wise video super- | arXiv: 2509.26025
pathways on the image manifold image editing via video generation | arXiv: 2411.16819
phyt2v llm-guided iterative self-refinement for physics-grounded text-to-video g | arXiv: 2412.00596
posetraj pose-aware trajectory control in video diffusion | arXiv: 2503.16068
protecting your video content disrupting automated video-based llm annotations | arXiv: 2503.21824
saw toward a surgical action world model via controllable and scalable video gen | arXiv: 2603.13024
semantic satellite communications for synchronized audiovisual reconstruction | arXiv: 2603.10791
shotadapter text-to-multi-shot video generation with diffusion models | arXiv: 2505.07652
sketchvideo sketch-based video generation and editing | arXiv: 2503.23284
spatialdreamer self-supervised stereo video synthesis from monocular input | arXiv: 2411.11934
spatiotemporal skip guidance for enhanced video diffusion sampling | arXiv: 2411.18664
streamingt2v consistent dynamic and extendable long video generation from text | arXiv: 2403.14773
streetcrafter street view synthesis with controllable video diffusion models | arXiv: 2412.13188
taming teacher forcing for masked autoregressive video generation | arXiv: 2501.12389
teller real-time streaming audio-driven portrait animation with autoregressive m | arXiv: 2503.18429
the devil is in the prompts retrieval-augmented prompt optimization for text-to- | arXiv: 2504.11739
through-the-mask mask-based motion trajectories for image-to-video generation | arXiv: 2501.03059
timestep embedding tells its time to cache for video diffusion model | arXiv: 2411.19108
tokenmotion decoupled motion control via token disentanglement for human-centric | arXiv: 2504.08181
tora trajectory-oriented diffusion transformer for video generation | arXiv: 2407.21705
towards precise scaling laws for video diffusion transformers | arXiv: 2411.17470
tracktention leveraging point tracking to attend videos faster and better | arXiv: 2503.19904
transpixeler advancing text-to-video generation with transparency | arXiv: 2501.03006
unified dense prediction of video diffusion | arXiv: 2503.09344
veu-bench towards comprehensive understanding of video editing | arXiv: 2504.17828
video-bench human-aligned video generation benchmark | arXiv: 2504.04907
video-colbert contextualized late interaction for text-to-video retrieval | arXiv: 2503.19009
videodirector precise video editing via text-to-video models | arXiv: 2411.17592
videodpo omni-preference alignment for video diffusion generation | arXiv: 2412.14167
videogigagan towards detail-rich video super-resolution | arXiv: 2404.12388
videoguide improving video diffusion models without training through a teachers | arXiv: 2410.04364
videoscene distilling video diffusion model to generate 3d scenes in one step | arXiv: 2504.01956
vidtwin video vae with decoupled structure and dynamics | arXiv: 2412.17726
vires video instance repainting via sketch and text guided generation | arXiv: 2411.16199
visual prompting for one-shot controllable video editing without inversion | arXiv: 2504.14335
when to lock attention training-free kv control in video diffusion | arXiv: 2603.09657
world-consistent video diffusion with explicit 3d modeling | arXiv: 2412.01821
world2act latent action post-training via skill-compositional world models | arXiv: 2603.10422
zero-1-to-a zero-shot one image to animatable head avatars using video diffusion | arXiv: 2503.15851