CVPR2025 Paper Notes TODO
Total: 3299 papers | Completed: 2019 | Pending update: 1280
- 2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification | arXiv: 2412.00678
- 3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes | arXiv: 2411.14974
- 3D Dental Model Segmentation with Geometrical Boundary Preserving | arXiv: 2503.23702
- 3D Face Reconstruction From Radar Images | arXiv: 2412.02403
- 3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations | arXiv: 2504.14967
- 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency | arXiv: 2502.11801
- 3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation | arXiv: 2502.04074
- 3D Student Splatting and Scooping (SSS) | arXiv: 2503.10148
- 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation | arXiv: 2406.09126
- 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination | arXiv: 2406.05132
- 3D-GSW: 3D Gaussian Splatting for Robust Watermarking | arXiv: 2409.13222
- 3D-HGS: 3D Half-Gaussian Splatting | arXiv: 2406.02720
- 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer | arXiv: 2501.01163
- 3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning | arXiv: 2411.17735
- 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation | arXiv: 2406.18158
- 3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping
- 3denhancer consistent multi-view diffusion for 3d enhancement | arXiv: 2412.18565
- 3dgut enabling distorted cameras and secondary rays in gaussian splatting | arXiv: 2412.12507
- 3dtopia-xl scaling high-quality 3d asset generation via primitive diffusion | arXiv: 2409.12957
- 4d langsplat 4d language gaussian splatting via multimodal large language models | arXiv: 2503.10437
- 4d-fly fast 4d reconstruction from a single monocular video
- 4deform neural surface deformation for robust shape interpolation | arXiv: 2502.20208
- 4dequine disentangling motion and appearance for 4d equine reconstruction from m | arXiv: 2603.10125
- 4dgc rate-aware 4d gaussian compression for efficient streamable free-viewpoint | arXiv: 2503.18421
- 4dtam non-rigid tracking and mapping via dynamic surface gaussians | arXiv: 2505.22859
- 4real-video learning generalizable photo-realistic 4d video diffusion | arXiv: 2412.04462
- 5100 breaking performance shackles of full fine-tuning on visual recognition tas
- a bias-free training paradigm for more general ai-generated image detection | arXiv: 2412.17671
- a closed-form solution for debiasing vision-language models with utility guarant | arXiv: 2603.12998
- a closer look at time steps is worthy of triple speed-up for diffusion model tra
- a comprehensive study of decoder-only llms for text-to-image generation | arXiv: 2506.08210
- a data-centric revisit of pre-trained vision models for robot learning | arXiv: 2503.06960
- a dataset for semantic segmentation in the presence of unknowns | arXiv: 2503.22309
- a distractor-aware memory for visual object tracking with sam2 | arXiv: 2411.17576
- a flag decomposition for hierarchical datasets | arXiv: 2502.07782
- a focused human body model for accurate anthropometric measurements extraction
- a general adaptive dual-level weighting mechanism for remote sensing pansharpeni
- a hubness perspective on representation learning for graph-based multi-view clus
- a lightweight udf learning framework for 3d reconstruction based on local shape | arXiv: 2407.01330
- a neuro-symbolic framework combining inductive and deductive reasoning for auton | arXiv: 2603.12421
- a new statistical model of star speckles for learning to detect and characterize
- a physics-informed blur learning framework for imaging systems | arXiv: 2502.11382
- a polarization-aided transformer for image deblurring via motion vector decompos
- a prediction-as-perception framework for 3d object detection | arXiv: 2603.12599
- a regularization-guided equivariant approach for image restoration | arXiv: 2505.19799
- a selective re-learning mechanism for hyperspectral fusion imaging
- a semantic knowledge complementarity based decoupling framework for semi-supervi
- a semi-supervised framework for breast ultrasound segmentation with training-fre | arXiv: 2603.06167
- a simple data augmentation for feature distribution skewed federated learning | arXiv: 2306.09363
- a simple yet effective layout token in large language models for document unders
- a stitch in time saves nine small vlm is a precise guidance for accelerating lar
- a tale of two classes adapting supervised contrastive learning to binary imbalan
- a theory of learning unified model via knowledge integration from label space va
- a unified approach to interpreting self-supervised pre-training methods for 3d p
- a unified framework for heterogeneous semi-supervised learning | arXiv: 2503.00286
- a unified image-dense annotation generation model for underwater scenes | arXiv: 2503.21771
- a unified latent schrodinger bridge diffusion model for unsupervised anomaly det
- a unified model for compressed sensing mri across undersampling patterns
- a unified resilient and explainable adversarial patch detector
- a universal scale-adaptive deformable transformer for image restoration across d
- a2z-10m geometric deep learning with a-to-z brep annotations for ai-assisted cad | arXiv: 2603.12605
- a3 few-shot prompt learning of unlearnable examples with cross-modal adversarial
- a4a adapter for adapter transfer via all-for-all mapping for cross-architecture
- aa-clip enhancing zero-shot anomaly detection via anomaly-aware clip | arXiv: 2503.06661
- abbspo adaptive bounding box scaling and symmetric prior based orientation predi
- abc-former auxiliary bimodal cross-domain transformer with interactive channel a
- abra teleporting fine-tuned knowledge across domains for open-vocabulary object | arXiv: 2603.12409
- ac3d analyzing and improving 3d camera control in video diffusion transformers
- acattack adaptive cross attacking rgb-t tracker via multi-modal response decoupl
- acc3d accelerating single image to 3d diffusion models via edge consistency guid
- accelerating diffusion transformer via increment-calibrated caching with channel
- accelerating multimodal large language models by searching optimal vision token
- accelerating stroke mri with diffusion probabilistic models through large-scale | arXiv: 2603.13007
- accurate differential operators for hybrid neural fields
- accurate scene text recognition with efficient model scaling and cloze self-dist
- ace anti-editing concept erasure in text-to-image models
- acl activating capability of linear attention for image restoration
- acquire and then adapt squeezing out text-to-image model for image restoration
- action detail matters refining video recognition with local action queries
- activating sparse part concepts for 3d class incremental learning
- active data curation effectively distills large-scale multimodal models | arXiv: 2411.18674
- active event-based stereo vision
- active hyperspectral imaging using an event camera
- activegamer active gaussian mapping through efficient rendering | arXiv: 2501.06897
- adacm2 on understanding extremely long-term video with adaptive cross-modality m
- adadare-gamma balancing stability and plasticity in multi-modal llms through eff
- adamms model merging for heterogeneous multimodal large language models with uns
- adaptation of weakly supervised localization in histopathology by debiasing pred | arXiv: 2603.12468
- adaptcmvc robust adaption to incremental views in continual multi-view clusterin
- adapter merging with centroid prototype mapping for scalable class-incremental l | arXiv: 2412.18219
- adapting dense matching for homography estimation with grid-based acceleration
- adapting pre-trained 3d models for point cloud video understanding via cross-fra
- adapting text-to-image generation with feature difference instruction for generi
- adapting to observation length of trajectory prediction via contrastive learning
- adapting to the unknown training-free audio-visual event perception with dynamic
- adaptive dropout unleashing dropout across layers for generalizable image super-
- adaptive keyframe sampling for long video understanding
- adaptive markup language generation for contextually-grounded visual document un
- adaptive non-uniform timestep sampling for accelerating diffusion model training
- adaptive parameter selection for tuning vision-language models
- adaptive part learning for fine-grained generalized category discovery a plug-an
- adaptive rectangular convolution for remote sensing pansharpening
- adaptive unimodal regulation for balanced multimodal information acquisition
- add attribution-driven data augmentation framework for boosting image super-reso
- addressing data scarcity in 3d trauma detection through self-supervised and semi | arXiv: 2603.12514
- admit adaptive multi-source tuning in dynamic environments
- adu adaptive detection of unknown categories in black-box domain adaptation
- adv-cpg a customized portrait generation framework with facial adversarial attac
- advancing adversarial robustness in gnerfs the il2-nerf attack
- advancing generalizable tumor segmentation with anomaly-aware open-vocabulary at
- advancing manga analysis comprehensive segmentation annotations for the manga109
- advancing multiple instance learning with continual learning for whole slide ima
- advancing myopia to holism fully contrastive language-image pre-training | arXiv: 2412.00440
- advancing semantic future prediction through multimodal visual sequence transfor
- adventurer optimizing vision mamba architecture designs for efficiency | arXiv: 2410.07599
- adversarial diffusion compression for real-world image super-resolution | arXiv: 2411.13383
- adversarial domain prompt tuning and generation for single domain generalization
- aerialmegadepth learning aerial-ground reconstruction and view synthesis | arXiv: 2504.13157
- aerogen enhancing remote sensing object detection with diffusion-driven data gen
- aespa attention-guided self-supervised parallel imaging for mri reconstruction
- aesthetic post-training diffusion models from generic preferences with step-by-s
- aesthetiq enhancing graphic layout design via aesthetic-aware preference alignme
- afforddp generalizable diffusion policy with transferable affordance
- afl a single-round analytic approach for federated learning with pre-trained mod
- ag-vpreid a challenging large-scale benchmark for aerial-ground video-based pers
- ai-face a million-scale demographically annotated ai-generated face dataset and
- aigv-assessor benchmarking and evaluating the perceptual quality of text-to-vide
- aim-fair advancing algorithmic fairness via selectively fine-tuning biased model
- aipparel a multimodal foundation model for digital garments
- airroom objects matter in room reidentification
- akira augmentation kit on rays for optical video generation
- alias-free latent diffusion models improving fractional shift equivariance of di
- alien implicit neural representations for human motion prediction under arbitrar
- align-a-video deterministic reward tuning of image diffusion models for consiste
- align-kd distilling cross-modal alignment knowledge for mobile vision-language l
- align3r aligned monocular depth estimation for dynamic videos
- alignmamba enhancing multimodal mamba with local and global cross-modal alignmen
- alignment mining and fusion representation alignment with hard negative mining a
- all languages matter evaluating lmms on culturally diverse 100 languages
- all-day multi-camera multi-target tracking
- all-directional disparity estimation for real-world qpd images
- all-optical nonlinear diffractive deep network for ultrafast image denoising
- alternating gradient flow utility a unified metric for structural pruning and dy | arXiv: 2603.12354
- amo sampler enhancing text rendering with overshooting | arXiv: 2411.19415
- amr-transformer enabling efficient long-range interaction for complex neural flu
- an end-to-end robust point cloud semantic segmentation network with single-step
- an fpga implementation of displacement vector search for intra pattern copy in j | arXiv: 2603.10671
- an image-like diffusion method for human-object interaction detection | arXiv: 2503.18134
- analyzing the synthetic-to-real domain gap in 3d hand pose estimation | arXiv: 2503.19307
- anatomical consistency and adaptive prior-informed transformation for multi-cont
- anchor-aware similarity cohesion in target frames enables predicting temporal mo
- anidoc animation creation made easier | arXiv: 2412.14173
- anigrad anisotropic gradient-adaptive sampling for 3d reconstruction from monocu
- anigs animatable gaussian avatar from a single image with inconsistent gaussian | arXiv: 2412.02684
- animate and sound an image
- animateanything consistent and controllable animation for video generation | arXiv: 2411.10836
- animer animal pose and shape estimation using family aware transformer | arXiv: 2412.00837
- animo species-aware model for text-driven animal motion generation
- annexe unified analyzing answering and pixel grounding for egocentric interactio
- annotation ambiguity aware semi-supervised medical image segmentation
- anomalyncd towards novel anomaly class discovery in industrial scenarios | arXiv: 2410.14379
- anomize better open vocabulary video anomaly detection | arXiv: 2503.18094
- antidote a unified framework for mitigating lvlm hallucinations in counterfactua
- any-resolution ai-generated image detection by spectral learning | arXiv: 2411.19417
- any3dis class-agnostic 3d instance segmentation by 2d mask tracking | arXiv: 2411.16183
- any6d model-free 6d pose estimation of novel objects | arXiv: 2503.18673
- anyattack towards large-scale self-supervised adversarial attacks on vision-lang
- anycam learning to recover camera poses and intrinsics from casual videos
- anydressing customizable multi-garment virtual dressing via latent diffusion mod
- anyedit mastering unified high-quality image editing for any idea
- anymap learning a general camera model for structure-from-motion with unknown di
- anymole any character motion in-betweening leveraging video diffusion models
- anysat one earth observation model for many resolutions scales and modalities
- aphq-vit post-training quantization with average perturbation hessian based reco
- apollo an exploration of video understanding in large multimodal models
- apply hierarchical-chain-of-generation to complex attributes text-to-3d generati
- apt adaptive personalized training for diffusion models with limited data
- ar-diffusion asynchronous video generation with auto-regressive diffusion
- arbitrary-steps image super-resolution via diffusion inversion | arXiv: 2412.09013
- arc2avatar generating expressive 3d avatars from a single image via id guidance
- arche autoregressive residual compression with hyperprior and excitation | arXiv: 2603.10188
- arcpro architectural programs for structured 3d abstraction of sparse points
- are general-purpose vision models all we need for 2d medical image segmentation | arXiv: 2603.13044
- are images indistinguishable to humans also indistinguishable to classifiers
- are spatial-temporal graph convolution networks for human action recognition ove
- argus a compact and versatile foundation model for vision
- argus vision-centric reasoning with grounded chain-of-thought
- arkit labelmaker a new scale for indoor 3d scene understanding
- arm appearance reconstruction model for relightable 3d generation | arXiv: 2411.10825
- around the world in 80 timesteps a generative approach to global visual geolocat
- art anonymous region transformer for variable multi-layer transparent image gene
- artformer controllable generation of diverse 3d articulated objects | arXiv: 2412.07237
- articulated kinematics distillation from video diffusion models | arXiv: 2504.01204
- articulatedgs self-supervised digital twin modeling of articulated objects using
- artifade learning to generate high-quality subject from blemished images | arXiv: 2409.03745
- artiscene language-driven artistic 3d scene generation through image intermediar
- as language models scale low-order linear depth dynamics emerge | arXiv: 2603.12541
- as-bridge a bidirectional generative framework bridging next-generation astronom | arXiv: 2603.11928
- asap advancing semantic alignment promotes multi-modal manipulation detecting an | arXiv: 2412.12718
- ashita automatic scene-grounded hierarchical task analysis | arXiv: 2504.06553
- asign an anatomy-aware spatial imputation graphic network for 3d spatial transcr
- assessing and learning alignment of unimodal vision and language models | arXiv: 2412.04616
- association of radiologic ppfe change with mortality in lung cancer screening co | arXiv: 2603.09531
- associative transformer | arXiv: 2309.12862
- asynchronous collaborative graph representation for frames and events
- ata adaptive transformation agent for text-guided subject-position variable back
- atom aligning text-to-motion model at event-level with gpt-4vision reward
- atp adaptive threshold pruning for efficient data encoding in quantum neural net
- atp-llava adaptive token pruning for large vision language models
- attend to not attended structure-then-detail token merging for post-training dit
- attention distillation a unified approach to visual characteristics transfer
- attention iou examining biases in celeba using attention maps
- attraction diminishing and distributing for few-shot class-incremental learning
- attribute-formed class-specific concept space endowing language bottleneck model
- attribute-missing multi-view graph clustering
- audcast audio-driven human video generation by cascaded diffusion transformers
- audio-visual instance segmentation | arXiv: 2310.18709
- audio-visual semantic graph network for audio-visual event localization
- augmented deep contexts for spatially embedded video coding
- augmenting multimodal llms with self-reflective tokens for knowledge-based visua
- augmenting perceptual super-resolution via image quality predictors | arXiv: 2504.18524
- aurafusion360 augmented unseen region alignment for reference-based 360° unbou
- auto cherry-picker learning from high-quality generative data driven by language
- auto-encoded supervision for perceptual image super-resolution
- autolut lut-based image super-resolution with automatic sampling and adaptive re
- automated detection of malignant lesions in the ovary using deep learning models | arXiv: 2603.11818
- automated generation of challenging multiple-choice questions for vision languag
- automated proof of polynomial inequalities via reinforcement learning
- automatic joint structured pruning and quantization for efficient neural network | arXiv: 2502.16638
- automatic spectral calibration of hyperspectral images method dataset and benchm
- autopresent designing structured visuals from scratch | arXiv: 2501.00912
- autoregressive distillation of diffusion transformers | arXiv: 2504.11295
- autoregressive sequential pretraining for visual tracking
- autossvh exploring automated frame sampling for efficient self-supervised video | arXiv: 2504.03587
- autourdf unsupervised robot modeling from point cloud frames using cluster regis
- avatarartist open-domain 4d avatarization | arXiv: 2503.19906
- avf-mae scaling affective video facial masked autoencoders via efficient audio-v
- avqacl a novel benchmark for audio-visual question answering continual learning
- bacon improving clarity of image captions via bag-of-concept graphs | arXiv: 2407.03314
- badgr bundle adjustment diffusion conditioned by gradients for wide-baseline flo
- badtoken token-level backdoor attacks to multi-modal large language models
- balanced direction from multifarious choices arithmetic meta-learning for domain
- balanced rate-distortion optimization in learned image compression
- balancing two classifiers via a simplex etf structure for model calibration
- bard-gs blur-aware reconstruction of dynamic scenes via gaussian splatting
- bases of steerable kernels for equivariant cnns from 2d rotations to the lorentz | arXiv: 2603.12459
- basket a large-scale video dataset for fine-grained skill estimation
- bayesian prompt flow learning for zero-shot anomaly detection
- bayesian test-time adaptation for vision-language models
- be more specific evaluating object-centric realism in synthetic images
- behaviorvlm unified finetuning-free behavioral understanding with vision-languag | arXiv: 2603.12176
- believing is seeing unobserved object detection using generative models
- benchmarking large vision-language models via directed scene graph for comprehen
- benchmarking object detectors under real-world distribution shifts in satellite
- bendfm a taxonomy and synthetic cad dataset for manufacturability assessment in | arXiv: 2603.13102
- beta-fft nonlinear interpolation and differentiated training strategies for semi
- bevdiffuser plug-and-play diffusion model for bev denoising with ground-truth gu
- beyond background shift rethinking instance replay in continual semantic segment
- beyond clean training data a versatile and model-agnostic framework for out-of-d
- beyond convolution a taxonomy of structured operators for learning-based image p | arXiv: 2603.12067
- beyond final answers crystal benchmark for transparent multimodal reasoning eval | arXiv: 2603.13099
- beyond generation a diffusion-based low-level feature extractor for detecting ai
- beyond human perception understanding multi-object world from monocular view
- beyond image classification a video benchmark and dual-branch hybrid discriminat
- beyond local sharpness communication-efficient global sharpness-aware minimizati
- beyond sight towards cognitive alignment in lvlm via enriched visual knowledge
- beyond single-modal boundary cross-modal anomaly detection through visual protot
- beyond single-sample reliable multi-sample distillation for video understanding | arXiv: 2603.11423
- beyond words augmenting discriminative richness via diffusions in unsupervised p | arXiv: 2504.11930
- bf-stvsr b-splines and fourier - best friends for high fidelity spatial-temporal | arXiv: 2501.11043
- bfanet revisiting 3d semantic segmentation with boundary feature analysis | arXiv: 2503.12539
- bg-triangle bezier gaussian triangle for 3d vectorization and rendering
- bhvit binarized hybrid vision transformer | arXiv: 2503.02394
- bias for action video implicit neural representations with bias modulation | arXiv: 2501.09277
- biclip bidirectional and consistent language-image processing for robust medical | arXiv: 2603.00156
- bigain unified token compression for joint generation and classification | arXiv: 2603.12240
- bigs bimanual category-agnostic interaction reconstruction from monocular videos
- bilora almost-orthogonal parameter spaces for continual learning
- bim-vfi bidirectional motion field-guided frame interpolation for video with non | arXiv: 2412.11365
- bimart a unified approach for the synthesis of 3d bimanual interaction with arti
- bimba selective-scan compression for long-range video question answering | arXiv: 2503.09590
- binarized mamba-transformer for lightweight quad bayer hybridevs demosaicing | arXiv: 2503.16134
- binarized neural network for multi-spectral image fusion
- binwang2hfnet geogran-aware hierarchical feature fusion network for salient obje | arXiv: 2603.12680
- biomedcoop learning to prompt for biomedical vision-language models
- biomedica an open biomedical image-caption archive dataset and vision-language m
- biox-cpath biologically-driven explainable diagnostics for multistain ihc comput
- bip3d bridging 2d images and 3d perception for embodied intelligence
- birth and death of a rose
- bizgen advancing article-level visual text rendering for infographics generation
- black hole-driven identity absorbing in diffusion models
- black swan abductive and defeasible video reasoning in unpredictable events
- black-box forgery attacks on semantic watermarks for diffusion models
- blade single-view body mesh estimation through accurate depth estimation | arXiv: 2412.08640
- blendergym benchmarking foundational model systems for graphics editing
- blind bitstream-corrupted video recovery via metadata-guided diffusion model
- blobgen-vid compositional text-to-video generation with blob video representatio
- blockdance reuse structurally similar spatio-temporal features to accelerate dif
- blood flow speed estimation with optical coherence tomography angiography images
- bluelm-v-3b algorithm and system co-design for multimodal large language models
- blurred lidar for sharper 3d robust handheld 3d scanning with diffuse lidar and
- blurry-edges photon-limited depth estimation from defocused boundaries | arXiv: 2503.23606
- boe-vit boosting orientation estimation with equivariance in self-supervised 3d
- bolt boost large vision-language model without training for long-form video unde
- boltzmann attention sampling for image analysis with small objects | arXiv: 2503.02841
- boost the inference with co-training a depth-guided mutual learning framework fo
- boost your human image generation model via direct preference optimization | arXiv: 2405.20216
- boosting adversarial transferability through augmentation in hypothesis space
- boosting domain incremental learning selecting the optimal parameters is all you | arXiv: 2505.23744
- boosting point-supervised temporal action localization through integrating query
- boosting the dual-stream architecture in ultra-high resolution segmentation with
- bootplace bootstrapped object placement with detection transformers | arXiv: 2503.21991
- bootstrap your own views masked ego-exo modeling for fine-grained view-invariant | arXiv: 2503.19706
- boow-vton boosting in-the-wild virtual try-on via mask-free pseudo data training | arXiv: 2408.06047
- boss a best-of-strategies selector as an oracle for deep active learning | arXiv: 2603.13109
- bounds on agreement between subjective and objective measurements | arXiv: 2603.13204
- brain-inspired spiking neural networks for energy-efficient object detection
- breaking the low-rank dilemma of linear attention | arXiv: 2411.07635
- breaking the memory barrier of contrastive loss via tile-based strategy
- breaking the tuning barrier zero-hyperparameters yield multi-corner analysis via | arXiv: 2603.13092
- brepgiff lightweight generation of complex b-rep with 3d gat diffusion
- bridge frame and event common spatiotemporal fusion for high-dynamic scene optic
- bridge the gap from weak to full supervision for temporal action localization wi
- bridging gait recognition and large language models sequence modeling
- bridging modalities improving universal multimodal retrieval by multimodal large
- bridging past and future end-to-end autonomous driving with historical predictio
- bridging the gap between gaussian diffusion models and universal quantization fo
- bridging the skill gap in clinical cbct interpretation with cbctrepd | arXiv: 2603.10933
- bridging the vision-brain gap with an uncertainty-aware blur prior
- bridging viewpoint gaps geometric reasoning boosts semantic correspondence
- bringing clip to the clinic dynamic soft labels and negation-aware learning for
- buffer anytime zero-shot video depth and normal from image priors
- building a mind palace structuring environment-grounded semantic graphs for effe
- building vision models upon heat conduction | arXiv: 2405.16555
- bwformer building wireframe reconstruction from airborne lidar point cloud with
- bytheway boost your text-to-video generation model to higher quality in a traini
- cachequant comprehensively accelerated diffusion models | arXiv: 2503.01323
- cad-llama leveraging large language models for computer-aided design parametric
- cadcrafter generating computer-aided design models from unconstrained images | arXiv: 2504.04753
- caddreamer cad object generation from single-view images | arXiv: 2502.20732
- cadref robust out-of-distribution detection via class-aware decoupled relative f
- calibrated multi-preference optimization for aligning diffusion models
- calico part-focused semantic co-segmentation with large vision-language models | arXiv: 2412.19331
- camera resection from known line pencils and a radially distorted scanline
- camfreediff camera-free image to panorama generation with diffusion model | arXiv: 2407.07174
- camouflage anything learning to hide using controlled out-painting and represent
- campoint boosting point cloud segmentation with virtual camera
- camuvid calibration-free multi-view detection
- can generative video models help pose estimation | arXiv: 2412.16155
- can large vision-language models correct semantic grounding errors by themselves | arXiv: 2404.06510
- can machines understand composition dataset and benchmark for photographic image
- can text-to-video generation help video-language alignment | arXiv: 2503.18507
- cant slow me down learning robust and hardware-adaptive object detectors against
- cap-net a unified network for 6d pose and size estimation of categorical articul
- cap4d creating animatable 4d portrait avatars with morphable multi-view diffusio
- care transformer mobile-friendly linear visual transformer via decoupled dual in
- caricaturebooth data-free interactive caricature generation in a photo booth
- carl a framework for equivariant image registration | arXiv: 2405.16738
- carplanner consistent auto-regressive trajectory planning for large-scale reinfo
- casagpt cuboid arrangement and scene assembly for interior design
- casp compression of large multimodal models based on attention sparsity
- casp consistency-aware audio-induced saliency prediction model for omnidirection
- cat4d create anything in 4d with multi-view video diffusion models
- catanet efficient content-aware token aggregation for lightweight image super-re
- category-agnostic neural object rigging | arXiv: 2505.20283
- causal composition diffusion model for closed-loop traffic generation
- cav-mae sync improving contrastive audio-visual mask autoencoders via fine-grain
- cawm-mamba a unified model for infrared-visible image fusion and compound advers | arXiv: 2603.02560
- ccin compositional conflict identification and neutralization for composed image
- cdi copyrighted data identification in diffusion models
- certified human trajectory prediction | arXiv: 2403.13778
- cgmatch a different perspective of semi-supervised learning
- ch3depth efficient and flexible depth foundation model with flow matching
- chain of attack on the robustness of vision-language models against transfer-bas
- chain of semantics programming in 3d gaussian splatting representation for 3d vi
- chainhoi joint-based kinematic chain modeling for human-object interaction gener
- change3d revisiting change detection and captioning from a video modeling perspe
- channel consistency prior and self-reconstruction strategy based unsupervised im
- channel-wise noise scheduled diffusion for inverse rendering in indoor scenes | arXiv: 2503.09993
- chapter-llama efficient chaptering in hour-long videos with llms | arXiv: 2504.00072
- charm the missing piece in vit fine-tuning for image aesthetic assessment | arXiv: 2504.02522
- chat-based person retrieval via dialogue-refined cross-modal alignment
- chat2svg vector graphics generation with large language models and image diffusi
- chatgarment garment estimation generation and editing via large language models | arXiv: 2412.17811
- chatgen automatic text-to-image generation from freestyle chatting | arXiv: 2411.17176
- chathuman chatting about 3d humans with tools | arXiv: 2405.04533
- cheb-gr rethinking k-nearest neighbor search in re-ranking for person re-identif
- chebyshev attention depth permutation texture network with latent texture attrib
- checkmanual a new challenge and benchmark for manual-based appliance manipulatio
- chexwhatsapp a dataset for exploring challenges in the diagnosis of chest x-rays
- chexworld exploring image world modeling for radiograph representation learning | arXiv: 2504.13820
- cholectrack20 a multi-perspective tracking dataset for surgical tools | arXiv: 2312.07352
- circumventing shortcuts in audio-visual deepfake detection datasets with unsuper
- citywalker learning embodied urban navigation from web-scale videos | arXiv: 2411.17820
- cl-lora continual low-rank adaptation for rehearsal-free class-incremental learn | arXiv: 2505.24816
- cl-moe enhancing multimodal large language model with dual momentum mixture-of-e
- classic video denoising in a machine learning world robust fast and controllable | arXiv: 2504.03136
- classifier-free guidance inside the attraction basin may cause memorization | arXiv: 2411.16738
- classifier-guided clip distillation for unsupervised multi-label classification | arXiv: 2503.16873
- classifier-to-bias toward unsupervised automatic bias detection for visual class
- cleandift diffusion features without noise | arXiv: 2412.03439
- clearsight visual signal enhancement for object hallucination mitigation in mult
- climbingcap multi-modal dataset and method for rock climbing in world coordinate | arXiv: 2503.21268
- clip is almost all you need towards parameter-efficient scene text retrieval wit
- clip is strong enough to fight back test-time counterattacks towards zero-shot a
- clip under the microscope a fine-grained analysis of multi-object representation | arXiv: 2502.19842
- clip-driven coarse-to-fine semantic guidance for fine-grained open-set semi-supe
- cloc contrastive learning for ordinal classification with multi-margin n-pair lo
- cloe expert consistency learning for missing modality segmentation | arXiv: 2603.09316
- closed-loop supervised fine-tuning of tokenized traffic models | arXiv: 2412.05334
- closest neighbors are harmful for lightweight masked auto-encoders
- cmmloc advancing text-to-pointcloud localization with cauchy-mixture-model based | arXiv: 2503.02593
- co-op correspondence-based novel object pose estimation | arXiv: 2503.17731
- co-speech gesture video generation with implicit motion-audio entanglement
- co-spy combining semantic and pixel features to detect synthetic images by ai | arXiv: 2503.18286
- coa towards real image dehazing via compression-and-adaptation | arXiv: 2504.05590
- coap memory-efficient training with correlation-aware gradient projection | arXiv: 2412.00071
- coarse correspondences boost spatial-temporal reasoning in multimodal language m | arXiv: 2408.00754
- cob-gs clear object boundaries in 3dgs segmentation based on boundary-adaptive g | arXiv: 2503.19443
- cobra combinatorial retrieval augmentation for few-shot adaptation | arXiv: 2412.17684
- cocoer aligning multi-level feature by competition and coordination for emotion
- cocogaussian leveraging circle of confusion for gaussian splatting from defocuse | arXiv: 2412.16028
- code-as-monitor constraint-aware visual programming for reactive and proactive r
- codepercept code-grounded visual stem perception for mllms | arXiv: 2603.10757
- codrawagents a multi-agent dialogue framework for compositional image generation | arXiv: 2603.12829
- coe chain-of-explanation via automatic visual concept circuit description and po
- coeff-tuning a graph filter subspace view for tuning attention-based large model | arXiv: 2503.18337
- coherent 3d portrait video reconstruction via triplane fusion | arXiv: 2405.00794
- colabsfm collaborative structure-from-motion by point cloud registration | arXiv: 2503.17093
- collaborative decoding makes visual auto-regressive modeling efficient | arXiv: 2411.17787
- collaborative tree search for enhancing embodied multi-agent collaboration
- collm a large language model for composed image retrieval | arXiv: 2503.19910
- color alignment in diffusion | arXiv: 2503.06746
- comapgs covisibility map-based gaussian splatting for sparse novel view synthesi | arXiv: 2503.20998
- comatcher multi-view collaborative feature matching | arXiv: 2504.01872
- combo conflict mitigation via branched optimization for class incremental segmen
- comfybench benchmarking llm-based agents in comfyui for autonomously designing c
- comm a coherent interleaved image-text dataset for multimodal understanding and | arXiv: 2406.10462
- common3d self-supervised learning of 3d morphable models for common objects in n
- commonsense video question answering through video-grounded entailment tree reas
- community forensics using thousands of generators to train fake image detectors | arXiv: 2411.04125
- comparative evaluation of traditional methods and deep learning for brain glioma | arXiv: 2603.04796
- compass control multi object orientation control for text-to-image generation | arXiv: 2504.06752
- competition-aware cpc forecasting with near-market coverage | arXiv: 2603.13059
- compgs unleashing 2d compositionality for compositional text-to-3d via dynamical
- complementary advantages exploiting cross-field frequency correlation for nir-as
- completion as enhancement a degradation-aware selective image guided network for | arXiv: 2412.19225
- complexity experts are task-discriminative learners for any image restoration | arXiv: 2411.18466
- composing driving worlds through disentangled control for adversarial scenario g | arXiv: 2603.12864
- composing parts for expressive object generation | arXiv: 2406.10197
- compositional caching for training-free open-vocabulary attribute detection | arXiv: 2503.19145
- compositional targeted multi-label universal perturbations
- comprehensive information bottleneck for unveiling universal attribution to inte
- comprehensive relighting generalizable and consistent monocular human relighting | arXiv: 2504.03011
- comrope scalable and robust rotary position embedding parameterized by trainable
- concept lancet image editing with compositional representation transplant | arXiv: 2504.02828
- concept replacer replacing sensitive concepts in diffusion models via precision | arXiv: 2412.01244
- conceptguard continual personalized text-to-image generation with forgetting and | arXiv: 2503.10358
- condensing action segmentation datasets via generative network inversion | arXiv: 2503.14112
- conditional balance improving multi-conditioning trade-offs in image generation | arXiv: 2412.19853
- conformal prediction and mllm aided uncertainty quantification in scene graph ge
- conformal prediction for zero-shot models | arXiv: 2505.24693
- conical visual concentration for efficient large vision-language models
- conmo controllable motion disentanglement and recomposition for zero-shot motion | arXiv: 2504.02451
- consistency posterior sampling for diverse image synthesis
- consistency-aware self-training for iterative-based stereo matching | arXiv: 2503.23747
- consistent and controllable image animation with motion diffusion models | arXiv: 2407.15642
- consistent normal orientation for 3d point clouds via least squares on delaunay
- context-aware multimodal pretraining | arXiv: 2411.15099
- context-cir learning from concepts in text for composed image retrieval | arXiv: 2505.20764
- context-enhanced memory-refined transformer for online action detection | arXiv: 2503.18359
- contextual ad narration with interleaved multimodal sequence | arXiv: 2403.12922
- continual learning with vision-language models via semantic-geometry preservatio | arXiv: 2603.12055
- continual sft matches multimodal rlhf with negative supervision | arXiv: 2411.14797
- continuous 3d perception model with persistent state | arXiv: 2501.12387
- continuous adverse weather removal via degradation-aware distillation
- continuous locomotive crowd behavior generation | arXiv: 2504.04756
- continuous space-time video resampling with invertible motion steganography
- continuous subject-specific attribute control in t2i models by identifying seman
- controlface harnessing facial parametric control for face rigging | arXiv: 2412.01160
- controllable human image generation with personalized multi-garments | arXiv: 2411.16801
- convex combination star shape prior for data-driven image semantic segmentation
- convex relaxation for robust vanishing point estimation in manhattan world | arXiv: 2505.04788
- core4d a 4d human-object-human interaction dataset for collaborative object rear
- corrbev multi-view 3d object detection by correlation learning with multi-modal
- correcting deviations from normality a reformulated diffusion model for multi-cl
- correlative and discriminative label grouping for multi-label visual prompt tuni
- cosdh communication-efficient collaborative perception via supply-demand awarene
- coser towards consistent dense multiview text-to-image generator for 3d creation
- cosmic clique-oriented semantic multi-space integration for robust clip test-tim
- cosmos cross-modality self-distillation for vision language pre-training | arXiv: 2412.01814
- cospace benchmarking continuous space perception ability for vision-language mod
- cot-vla visual chain-of-thought reasoning for vision-language-action models | arXiv: 2503.22020
- countllm towards generalizable repetitive action counting via large language mod
- counts benchmarking object detectors and multimodal large language models under | arXiv: 2504.10158
- cpath-omni a unified multimodal foundation model for patch and whole slide image
- crab a unified audio-visual scene understanding model with explicit cooperation | arXiv: 2503.13068
- craftsman3d high-fidelity mesh generation with 3d native diffusion and interacti
- creating your editable 3d photorealistic avatar with tetrahedron-constrained gau
- crisp object pose and shape estimation with test-time adaptation | arXiv: 2412.01052
- critic-v vlm critics help catch vlm errors in multimodal reasoning | arXiv: 2411.18203
- crocodl cross-device collaborative dataset for localization
- cropper vision-language model for image cropping through in-context learning | arXiv: 2408.07790
- cross-modal 3d representation with multi-view images and point clouds
- cross-modal and uncertainty-aware agglomeration for open-vocabulary 3d scene und
- cross-modal causal relation alignment for video question grounding | arXiv: 2503.07635
- cross-modal distillation for 2d3d multi-object discovery from 2d motion
- cross-modal information flow in multimodal large language models | arXiv: 2411.18620
- cross-modal interactive perception network with mamba for lung tumor segmentatio
- cross-rejective open-set sar image registration
- cross-view completion models are zero-shot correspondence estimators | arXiv: 2412.09072
- crossearth-sar a sar-centric and billion-scale geospatial foundation model for d | arXiv: 2603.12008
- crossover 3d scene cross-modal alignment | arXiv: 2502.15011
- crosssdf 3d reconstruction of thin structures from cross-sections | arXiv: 2412.04120
- cryptoface end-to-end encrypted face recognition | arXiv: 2509.00332
- csc-pa cross-image semantic correlation via prototype attentions for single-netw
- ctrl-d controllable dynamic 3d scene editing with personalized 2d diffusion | arXiv: 2412.01792
- ctrl-o language-controllable object-centric visual representation learning | arXiv: 2503.21747
- cubify anything scaling indoor 3d object detection | arXiv: 2412.04458
- curriculum coarse-to-fine selection for high-ipc dataset distillation | arXiv: 2503.18872
- curriculum direct preference optimization for diffusion and consistency models | arXiv: 2405.13637
- custany customizing anything from a single example | arXiv: 2406.11643
- customized condition controllable generation for video soundtrack
- customkd customizing large vision foundation for edge model improvement via know
- cxpmrg-bench pre-training and benchmarking for x-ray medical report generation o
- cycleulm a unified label-free deep learning framework for ultrasound localisatio | arXiv: 2603.09840
- d2it dynamic diffusion transformer for accurate image generation
- d2sp dynamic dual-stage purification framework for dual noise mitigation in visi
- d3 scaling up deepfake detection by learning from discrepancy
- d3-human dynamic disentangled digital human from monocular video | arXiv: 2501.01589
- d3ctta domain-dependent decorrelation for continual test-time adaption of 3d lid
- da-vpt semantic-guided visual prompt tuning for vision transformers | arXiv: 2505.23694
- dacapo score distillation as stacked bridge for fast and high-quality 3d editing
- dagsm disentangled avatar generation with gs-enhanced mesh | arXiv: 2411.15205
- damm-diffusion learning divergence-aware multi-modal diffusion model for nanopar
- darkir robust low-light image restoration | arXiv: 2412.13443
- dart disease-aware image-text alignment and self-correcting re-alignment for tru
- dashgaussian optimizing 3d gaussian splatting in 200 seconds | arXiv: 2503.18402
- data distributional properties as inductive bias for systematic generalization | arXiv: 2502.20499
- data synthesis with diverse styles for face recognition via 3dmm-guided diffusio
- data-free group-wise fully quantized winograd convolution via learnable scales | arXiv: 2412.19867
- data-free universal adversarial perturbation with pseudo-semantic prior | arXiv: 2502.21048
- dataset distillation with neural characteristic function a minmax perspective | arXiv: 2502.20653
- dcevo discriminative cross-dimensional evolutionary learning for infrared and vi
- de2gaze deformable and decoupled representation learning for 3d gaze estimation
- deal data-efficient adversarial learning for high-quality infrared imaging | arXiv: 2503.00905
- debiasing multimodal large language models via noise-aware preference optimizati
- decafnet delegate and conquer for efficient temporal grounding in long videos | arXiv: 2505.16376
- decentralized diffusion models | arXiv: 2501.05450
- decision spikeformer spike-driven transformer for decision making | arXiv: 2504.03800
- declip decoupled learning for open-vocabulary dense perception | arXiv: 2505.04410
- decloth decomposable 3d cloth and human body reconstruction from a single image | arXiv: 2503.19373
- decoder gradient shield provable and high-fidelity prevention of gradient-based
- decoding matters efficient mamba-based decoder with distribution-aware deep supe | arXiv: 2603.12547
- decompositional neural scene reconstruction with generative diffusion prior | arXiv: 2503.14830
- deconstructing the failure of ideal noise correction a three-pillar diagnosis | arXiv: 2603.12997
- decouple distortion from perception region adaptive diffusion for extreme-low bi
- decouple-then-merge finetune diffusion models as multi-task learning | arXiv: 2410.06664
- decoupled distillation to erase a general unlearning method for any class-centri
- decoupled motion expression video segmentation
- decoupledgaussian object-scene decoupling for physics-based interaction | arXiv: 2503.05484
- decoupling fine detail and global geometry for compressed depth map super-resolu
- decoupling training-free guided diffusion by admm | arXiv: 2411.12773
- dede detecting backdoor samples for ssl encoders via decoders | arXiv: 2411.16154
- deep change monitoring a hyperbolic representative learning framework and a data
- deep fair multi-view clustering with attention kan
- deep learning based estimation of blood glucose levels from multidirectional scl | arXiv: 2603.12715
- deep learning-based assessment of the relation between the third molar and mandi | arXiv: 2603.11850
- deepcompress-vit rethinking model compression to enhance efficiency of vision tr
- deepla-net very deep local aggregation networks for point cloud analysis
- defectfill realistic defect generation with inpainting diffusion model for visua
- defmamba deformable visual state space model | arXiv: 2504.05794
- defom-stereo depth foundation model based stereo matching | arXiv: 2501.09466
- deformable radial kernel splatting | arXiv: 2412.11752
- deformcl learning deformable centerline representation for vessel extraction in
- degradation-aware feature perturbation for all-in-one image restoration | arXiv: 2505.12630
- deim detr with improved matching for fast convergence | arXiv: 2412.04234
- dejavid encoder-agnostic learned temporal matching for video classification | arXiv: 2506.12585
- delt a simple diversity-driven earlylate training for dataset distillation | arXiv: 2411.19946
- denoising functional maps diffusion models for shape correspondence | arXiv: 2503.01845
- dense dispersed structured light for hyperspectral 3d imaging of dynamic scenes | arXiv: 2412.01140
- dense match summarization for faster two-view estimation | arXiv: 2506.02893
- dense-sfm structure from motion with dense consistent matching | arXiv: 2501.14277
- denver deformable neural vessel representations for unsupervised video vessel se
- depth any camera zero-shot metric depth estimation from any camera | arXiv: 2501.02464
- depth-guided bundle sampling for efficient generalizable neural radiance field r | arXiv: 2505.19793
- depthcrafter generating consistent long depth sequences for open-world videos | arXiv: 2409.02095
- depthcues evaluating monocular depth perception in large vision models | arXiv: 2411.17385
- depthsplat connecting gaussian splatting and depth | arXiv: 2410.13862
- derivative-free diffusion manifold-constrained gradient for unified xai | arXiv: 2411.15265
- ders towards extremely efficient upcycled mixture-of-experts models | arXiv: 2503.01359
- descriptor-in-pixel point-feature tracking for pixel processor arrays
- design2garmentcode turning design concepts to tangible garments through program | arXiv: 2412.08603
- designdiffusion high-quality text-to-design image generation with diffusion mode
- desire-gs 4d street gaussians for static-dynamic decomposition and surface recon
- desplat decomposed gaussian splatting for distractor-free rendering | arXiv: 2411.19756
- detail-preserving latent diffusion for stable shadow removal | arXiv: 2412.17630
- detect any mirrors boosting learning reliability on large-scale unlabeled data w
- detect-and-guide self-regulation of diffusion models for safe text-to-image gene
- detecting adversarial data using perturbation forgery | arXiv: 2405.16226
- detecting backdoor attacks in federated learning via direction alignment inspect | arXiv: 2503.07978
- detecting open world objects via partial attribute assignment
- detecting out-of-distribution through the lens of neural collapse | arXiv: 2311.01479
- detection-friendly nonuniformity correction a union framework for infrared uav t
- deterministic certification of graph neural networks against graph poisoning att
- deterministic image-to-image translation via denoising brownian bridge models wi
- deterministic-to-stochastic diverse latent feature mapping for human motion synt
- developing foundation models for universal segmentation from 3d whole-body posit | arXiv: 2603.11627
- devil is in the detail towards injecting fine details of image prompt in image g
- devils in middle layers of large vision-language models interpreting detecting a
- dexgrasp anything towards universal robotic dexterous grasping with physics awar | arXiv: 2503.08257
- dexhanddiff interaction-aware diffusion planning for adaptive dexterous manipula
- dflmoe decentralized federated learning via mixture of experts for medical data | arXiv: 2503.10412
- dfm differentiable feature matching for anomaly detection
- dformerv2 geometry self-attention for rgbd semantic segmentation | arXiv: 2504.04701
- dh-set improving vision-language alignment with diverse and hybrid set-embedding
- di-pcg diffusion-based efficient inverse procedural content generation for high-
- dic rethinking conv3x3 designs in diffusion models | arXiv: 2501.00603
- diet-gs diffusion prior and event stream-assisted motion deblurring 3d gaussian | arXiv: 2503.24210
- diff-palm realistic palmprint generation with polynomial creases and intra-class
- diff2flow training flow matching models via diffusion model alignment | arXiv: 2506.02221
- diffcam data-driven saliency maps by capturing feature differences
- differ disentangling identity features via semantic cues for clothes-changing pe
- difference inversion interpolate and isolate the difference with token consisten
- differentiable inverse rendering with interpretable basis brdfs | arXiv: 2411.17994
- difffno diffusion fourier neural operator | arXiv: 2411.09911
- difflo semantic-aware lidar odometry with diffusion-based refinement
- difflocks generating 3d hair from a single image using diffusion models | arXiv: 2505.06166
- diffportrait360 consistent portrait diffusion for 360 view synthesis | arXiv: 2503.15667
- diffsensei bridging multi-modal llms and diffusion models for customized manga g | arXiv: 2412.07589
- diffusion bridge leveraging diffusion model to reduce the modality gap between t
- diffusion model is effectively its own teacher
- diffusion renderer neural inverse and forward rendering with video diffusion mod
- diffusion self-distillation for zero-shot customized image generation | arXiv: 2411.18616
- diffusion-4k ultra-high-resolution image synthesis with latent diffusion models | arXiv: 2503.18352
- diffusion-based event generation for high-quality image deblurring
- diffusion-based feature denoising and using nnmf for robust brain tumor classifi | arXiv: 2603.13182
- diffusion-based realistic listening head generation via hybrid motion modeling
- diffusiondrive truncated diffusion model for end-to-end autonomous driving | arXiv: 2411.15139
- diffusionsfm predicting structure and motion via ray origin and endpoint diffusi
- diffvsgg diffusion-driven online video scene graph generation | arXiv: 2503.13957
- difiisr a diffusion model with gradient guidance for infrared image super-resolu
- difix3d improving 3d reconstructions with single-step diffusion models | arXiv: 2503.01774
- dig scalable and efficient diffusion models with gated linear attention | arXiv: 2405.18428
- digit multi-dilated gated encoder and central-adjacent region integrated decoder
- digital twin catalog a large-scale photorealistic 3d object digital twin dataset | arXiv: 2504.08541
- din diffusion model for robust medical vqa with semantic noisy labels | arXiv: 2503.18536
- dinomaly the less is more philosophy in multi-class unsupervised anomaly detecti
- dinov2 meets text a unified framework for image- and pixel-level vision-language | arXiv: 2412.16334
- dio decomposable implicit 4d occupancy-flow world model
- directional label diffusion model for learning from noisy labels
- directtrigs triplane-based gaussian splatting field representation for 3d genera
- disciple learning interpretable programs for scientific visual discovery | arXiv: 2502.10060
- disco4d disentangled 4d human generation and animation from a single image | arXiv: 2409.17280
- discovering fine-grained visual-concept relations by disentangled optimal transp
- discovla discrepancy reduction in vision language and alignment for parameter-ef
- discrete to continuous generating smooth transition poses from sign language obs
- disentangled pose and appearance guidance for multi-pose generation
- disentangling safe and unsafe image corruptions via anisotropy and locality
- diskvps vanishing point detector via hough transform in a disk region
- dispider enabling video llms with active real-time interaction via disentangled
- disrt-in-bed diffusion-based sim-to-real transfer framework for in-bed human mes
- dissecting and mitigating diffusion bias via mechanistic interpretability | arXiv: 2503.20483
- distilled prompt learning for incomplete multimodal survival prediction | arXiv: 2503.01653
- distilling long-tailed datasets | arXiv: 2408.14506
- distilling monocular foundation model for fine-grained depth completion | arXiv: 2503.16970
- distilling multi-modal large language models for autonomous driving | arXiv: 2501.09757
- distilling spatially-heterogeneous distortion perception for blind image quality
- distilling spectral graph for object-context aware open-vocabulary semantic segm
- distinctad distinctive audio description generation in contexts | arXiv: 2411.18180
- distinguish then exploit source-free open set domain adaptation via weight barco
- distraction is all you need for multimodal large language model jailbreaking | arXiv: 2502.10794
- distribution prototype diffusion learning for open-set supervised anomaly detect | arXiv: 2502.20981
- dit-ic aligned diffusion transformer for efficient image compression | arXiv: 2603.13162
- ditask multi-task fine-tuning with diffeomorphic transformations | arXiv: 2502.06029
- ditctrl exploring attention control in multi-modal diffusion transformer for tun
- div-ff dynamic image-video feature fields for environment understanding in egoce
- diverseflow sample-efficient diverse mode coverage in flows | arXiv: 2504.07894
- divide and conquer heterogeneous noise integration for diffusion-based adversari | arXiv: 2503.01407
- divot diffusion powers video tokenizer for comprehension and generation | arXiv: 2412.04432
- divprune diversity-based visual token pruning for large multimodal models | arXiv: 2503.02175
- dkc differentiated knowledge consolidation for cloth-hybrid lifelong person re-i
- dkdm data-free knowledge distillation for diffusion models with any architecture | arXiv: 2409.03550
- dl2g degradation-guided local-to-global restoration for eyeglass reflection remo
- dnf unconditional 4d generation with dictionary-based neural fields | arXiv: 2412.05161
- dnlut ultra-efficient color image denoising via channel-aware lookup tables | arXiv: 2503.15931
- do computer vision foundation models learn the low-level characteristics of the
- do imagenet-trained models learn shortcuts the impact of frequency shortcuts on | arXiv: 2503.03519
- do visual imaginations improve vision-and-language navigation agents | arXiv: 2503.16394
- do we always need the simplicity bias looking for optimal inductive biases in th
- do we really need curated malicious data for safety alignment in multi-modal lar
- do your best and get enough rest for continual learning | arXiv: 2503.18371
- doclayllm an efficient multi-modal extension of large language models for text-r
- docopilot improving multimodal models for document-level understanding | arXiv: 2507.14675
- docsam unified document image segmentation via query decomposition and heterogen
- document haystacks vision-language reasoning over piles of 1000 documents | arXiv: 2411.16740
- docvlm make your vlm an efficient reader | arXiv: 2412.08746
- dof-gaussian controllable depth-of-field for 3d gaussian splatting | arXiv: 2503.00746
- dof-gs adjustable depth-of-field 3d gaussian splatting for post-capture refocusi
- domain adaptive diabetic retinopathy grading with model absence and flowing data | arXiv: 2412.01203
- domain generalization in clip via learning with diverse text prompts
- dont shake the wheel momentum-aware planning in end-to-end autonomous driving
- doppelgangers and adversarial vulnerability
- doppelgangers improved visual disambiguation with geometric 3d features | arXiv: 2412.05826
- dora sampling and benchmarking for 3d shape variational auto-encoders | arXiv: 2412.17808
- doracycle domain-oriented adaptation of unified generative model in multimodal c | arXiv: 2503.03651
- dornet a degradation oriented and regularized network for blind depth super-reso
- dpc dual-prompt collaboration for tuning vision-language models | arXiv: 2503.13443
- dpflow adaptive optical flow estimation with a dual-pyramid framework | arXiv: 2503.14880
- dpseg dual-prompt cost volume learning for open-vocabulary semantic segmentation | arXiv: 2505.11676
- dpu dynamic prototype updating for multimodal out-of-distribution detection | arXiv: 2411.08227
- dr splat directly referring 3d gaussian splatting via direct language embedding | arXiv: 2502.16652
- dragin3d image editing by dragging in 3d space
- drawer digital reconstruction and articulation with environment realism | arXiv: 2504.15278
- dreamcache finetuning-free lightweight personalized image generation via feature | arXiv: 2411.17786
- dreamomni unified image generation and editing | arXiv: 2412.17098
- dreamrelation bridging customization and relation generation | arXiv: 2410.23280
- dreamtext high fidelity scene text synthesis | arXiv: 2405.14701
- dreamtrack dreaming the future for multimodal visual object tracking
- dreamvideo-omni omni-motion controlled multi-subject video customization with la | arXiv: 2603.12257
- drive diffusion-based rigging empowers generation of versatile and expressive ch
- drivedreamer4d world models are effective data machines for 4d driving scene rep
- drivegen generalized and robust 3d detection in driving via controllable text-to
- drivegpt4-v2 harnessing large language model capabilities for enhanced closed-lo
- drivescape high-resolution driving video generation by multi-view feature fusion
- driving by the rules a benchmark for integrating traffic sign regulations into v | arXiv: 2410.23780
- drivingsphere building a high-fidelity 4d world for closed-loop simulation | arXiv: 2411.11252
- dronesplat 3d gaussian splatting for robust 3d reconstruction from in-the-wild d | arXiv: 2503.16964
- dropgaussian structural regularization for sparse-view gaussian splatting | arXiv: 2504.00773
- dropoutgs dropping out gaussians for better sparse-view rendering | arXiv: 2504.09491
- drvideo document retrieval based long video understanding | arXiv: 2406.12846
- dspnet dual-vision scene perception for robust 3d question answering | arXiv: 2503.03190
- dsv-lfs unifying llm-driven semantic cues with visual features for robust few-sh
- dtgbrepgen a novel b-rep generative model through decoupling topology and geomet
- dtos dynamic time object sensing with large multimodal model
- dual consolidation for pre-trained model-based domain-incremental learning | arXiv: 2410.00911
- dual diffusion for unified image generation and understanding | arXiv: 2501.00289
- dual energy-based model with open-world uncertainty estimation for out-of-distri
- dual exposure stereo for extended dynamic range 3d imaging | arXiv: 2412.02351
- dual focus-attention transformer for robust point cloud registration
- dual prompting image restoration with diffusion transformers | arXiv: 2504.17825
- dual semantic guidance for open vocabulary semantic segmentation
- dual-agent optimization framework for cross-domain few-shot segmentation
- dual-granularity semantic guided sparse routing diffusion model for general pans
- dual-interrelated diffusion model for few-shot anomaly image generation | arXiv: 2408.13509
- dual-view x-ray detection can ai detect prohibited items from dual-view x-ray im
- dualpm dual posed-canonical point maps for 3d shape and pose reconstruction | arXiv: 2412.04464
- dualtalk dual-speaker interaction for 3d talking head conversations | arXiv: 2505.18096
- dune distilling a universal encoder from heterogeneous 2d and 3d teachers | arXiv: 2503.14405
- dv-matcher deformation-based non-rigid point cloud matching guided by pre-traine
- dvhgnn multi-scale dilated vision hgnn for efficient vision recognition | arXiv: 2503.14867
- dvin dynamic visual routing network for weakly supervised referring expression c
- dycoke dynamic compression of tokens for fast video large language models | arXiv: 2411.15024
- dycon dynamic uncertainty-aware consistency and contrastive learning for semi-su
- dyfo a training-free dynamic focus visual search for enhancing lmms in fine-grai
- dymo training-free diffusion model alignment with dynamic multi-objective schedu
- dyn-hamr recovering 4d interacting hand motion from a dynamic camera | arXiv: 2412.12861
- dynamic camera poses and where to find them | arXiv: 2504.17788
- dynamic content prediction with motion-aware priors for blind face video restora
- dynamic derivation and elimination audio visual segmentation with enhanced audio | arXiv: 2503.12840
- dynamic group normalization spatio-temporal adaptation to evolving data statisti
- dynamic integration of task-specific adapters for class incremental learning | arXiv: 2409.14983
- dynamic motion blending for versatile motion editing | arXiv: 2503.20724
- dynamic neural surfaces for elastic 4d shape representation and analysis | arXiv: 2503.03132
- dynamic pseudo labeling via gradient cutting for high-low entropy exploration
- dynamic stereotype theory induced micro-expression recognition with oriented def
- dynamic updates for language adaptation in visual-language tracking | arXiv: 2503.06621
- dynamicscaler seamless and scalable video generation for panoramic scenes | arXiv: 2412.11100
- dynamode-nerf motion-aware deblurring neural radiance field for dynamic scenes
- dynfocus dynamic cooperative network empowers llms with video understanding | arXiv: 2411.12355
- dynpose largely improving the efficiency of human pose estimation by a simple dy
- dynrefer delving into region-level multimodal tasks via dynamic resolution | arXiv: 2405.16071
- dynscene scalable generation of dynamic robotic manipulation scenes for embodied
- eap-gs efficient augmentation of pointcloud for 3d gaussian splatting in few-sho
- early-bird diffusion investigating and leveraging timestep-aware early-bird tick
- earthdial turning multi-sensory earth observations to interactive dialogues | arXiv: 2412.15190
- easemvc efficient dual selection mechanism for deep multi-view clustering
- easy-editable image vectorization with multi-layer multi-scale distributed visua
- easycraft a robust and efficient framework for automatic avatar crafting | arXiv: 2503.01158
- easyhoi unleashing the power of large models for reconstructing hand-object inte
- ebs-ekf accurate and high frequency event-based star tracking | arXiv: 2503.20101
- ecbench can multi-modal foundation models understand the egocentric world a holi
- echomatch partial-to-partial shape matching via correspondence reflection
- echomimicv2 towards striking simplified and semi-body human animation | arXiv: 2411.10061
- echoone segmenting multiple echocardiography planes in one model | arXiv: 2412.02993
- echotraffic enhancing traffic anomaly understanding with audio-visual insights
- echoworld learning motion-aware world models for echocardiography probe guidance | arXiv: 2504.13065
- ecvc exploiting non-local correlations in multiple frames for contextual video c | arXiv: 2410.09706
- edcflow exploring temporally dense difference maps for event-based optical flow | arXiv: 2506.03512
- eden enhanced diffusion for high-quality large-motion video frame interpolation | arXiv: 2503.15831
- edge-sd-sr low latency and parameter efficient on-device super-resolution with s
- edgediff edge-aware diffusion network for building reconstruction from point clo
- edgemovingnet edge-preserving point cloud reconstruction via joint geometry feat
- edgetam on-device track anything model | arXiv: 2501.07256
- edit away and my face will not stay personal biometric defense against malicious
- editar unified conditional generation with autoregressive models | arXiv: 2501.04699
- editing away the evidence diffusion-based image manipulation and the failure mod | arXiv: 2603.12949
- editsplat multi-view fusion and attention-guided optimization for view-consisten
- edm equirectangular projection-oriented dense kernelized feature matching | arXiv: 2502.20685
- eee-bench a comprehensive multimodal electrical and electronics engineering benc
- effective cloud removal for remote sensing images by an improved mean-reverting
- effective sam combination for open-vocabulary semantic segmentation | arXiv: 2411.14723
- efficient ann-guided distillation aligning rate-based features of spiking neural
- efficient data driven mixture-of-expert extraction from trained networks | arXiv: 2505.15414
- efficient decoupled feature 3d gaussian splatting via hierarchical compression
- efficient depth estimation for unstable stereo camera systems on ar glasses | arXiv: 2411.10013
- efficient diffusion as low light enhancer | arXiv: 2410.12346
- efficient dynamic scene editing via 4d gaussian-based static-dynamic separation | arXiv: 2502.02091
- efficient event-based object detection a hybrid neural network with spatial and | arXiv: 2403.10173
- efficient fine-tuning and concept suppression for pruned diffusion models | arXiv: 2412.15341
- efficient long video tokenization via coordinate-based patch reconstruction | arXiv: 2411.14762
- efficient motion-aware video mllm | arXiv: 2503.13016
- efficient personalization of quantized diffusion model without backpropagation | arXiv: 2503.14868
- efficient rgb-d scene understanding via multi-task adaptive learning and cross-d | arXiv: 2603.07570
- efficient test-time adaptive object detection via sensitivity-guided pruning | arXiv: 2506.02462
- efficient transfer learning for video-language foundation models | arXiv: 2411.11223
- efficient video face enhancement with enhanced spatial-temporal consistency | arXiv: 2411.16468
- efficient video super-resolution for real-time rendering with decoupled g-buffer
- efficient visual state space model for image deblurring | arXiv: 2405.14343
- efficientllava generalizable auto-pruning for large vision-language models
- efficientvim efficient vision mamba with hidden state mixer based state space du | arXiv: 2411.15241
- effidec3d an optimized decoder for high-performance and efficient 3d medical ima
- effortless active labeling for long-term test-time adaptation | arXiv: 2503.14564
- ego4o egocentric human motion capture and understanding from multi-modal input | arXiv: 2504.08449
- egolife towards egocentric life assistant | arXiv: 2503.03803
- egolm multi-modal language model of egocentric motions | arXiv: 2409.18127
- egopressure a dataset for hand pressure and pose estimation in egocentric vision | arXiv: 2409.02224
- egotextvqa towards egocentric scene-text aware video question answering | arXiv: 2502.07411
- eidt-v exploiting intersections in diffusion trajectories for model-agnostic zer
- eigengs representation from eigenspace to gaussian image space | arXiv: 2503.07446
- electromyography-informed facial expression reconstruction for physiological-bas
- embodied scene understanding for vision language models via metavqa | arXiv: 2501.09167
- embracing collaboration over competition condensing multiple prompts for visual | arXiv: 2504.21263
- emodubber towards high quality and emotion controllable movie dubbing | arXiv: 2412.08988
- emoe modality-specific enhanced dynamic emotion experts
- emoedit evoking emotions through image manipulation | arXiv: 2405.12661
- emotivetalk expressive talking head generation through audio information decoupl
- emova empowering language models to see hear and speak with vivid emotions | arXiv: 2409.18042
- emphasizing discriminative features for dataset distillation in complex scenario | arXiv: 2410.17193
- empowering large language models with 3d situation awareness | arXiv: 2503.23024
- empowering llms to understand and generate complex vector graphics | arXiv: 2412.11102
- empowering vector graphics with consistently arbitrary viewing and view-dependen
- encapsulated composition of text-to-image and text-to-video models for high-qual
- end-to-end hoi reconstruction transformer with graph-based encoding | arXiv: 2503.06012
- end-to-end implicit neural representations for classification | arXiv: 2503.18123
- enduring efficient and robust trajectory prediction attack in autonomous driving
- energymogen compositional human motion generation with energy-based diffusion mo
- enhanced contrastive learning with multi-view longitudinal data for chest x-ray | arXiv: 2502.20056
- enhanced ood detection through cross-modal alignment of multi-modal representati
- enhanced then progressive fusion with view graph for multi-view clustering
- enhanced visual-semantic interaction with tailored prompts for pedestrian attrib
- enhancing 3d gaze estimation in the wild using weak supervision with gaze follow | arXiv: 2502.20249
- enhancing adversarial transferability with checkpoints of a single models traini
- enhancing creative generation on stable diffusion-based models | arXiv: 2503.23538
- enhancing dance-to-music generation via negative conditioning latent diffusion m | arXiv: 2503.22138
- enhancing dataset distillation via non-critical region refinement | arXiv: 2503.18267
- enhancing diversity for data-free quantization
- enhancing facial privacy protection via weakening diffusion purification | arXiv: 2503.10350
- enhancing few-shot class-incremental learning via training-free bi-level modalit
- enhancing image aesthetics with dual-conditioned diffusion models guided by mult | arXiv: 2603.11556
- enhancing online continual learning with plug-and-play state space model and cla
- enhancing privacy-utility trade-offs to mitigate memorization in diffusion model | arXiv: 2504.18032
- enhancing sam with efficient prompting and preference optimization for semi-supe
- enhancing testing-time robustness for trusted multi-view classification in the w
- enhancing video-llm reasoning via agent-of-thoughts distillation | arXiv: 2412.01694
- enhancing virtual try-on with synthetic pairs and error-aware noise scheduling | arXiv: 2501.04666
- enhancing vision-language compositional understanding with multimodal synthetic | arXiv: 2503.01167
- enliveninggs active locomotion of 3dgs
- entityerasure erasing entity cleanly via amodal entity segmentation and completi
- entitysam segment everything in video
- entropymark towards more harmless backdoor watermark via entropy-based constrain
- envgs modeling view-dependent appearance with environment gaussian | arXiv: 2412.15215
- envposer environment-aware realistic human motion estimation from sparse observa
- equipose exploiting permutation equivariance for relative camera pose estimation
- equivania a spectral method for rotation-equivariant anisotropic image analysis | arXiv: 2603.11294
- erase diffusion empowering object removal through calibrating diffusion pathways | arXiv: 2503.07026
- erasing undesirable influence in diffusion models | arXiv: 2401.05779
- erupt efficient rendering with unposed patch transformer | arXiv: 2503.24374
- esc erasing space concept for knowledge deletion | arXiv: 2504.02199
- escape equivariant shape completion via anchor point encoding | arXiv: 2412.00952
- escaping platos cave towards the alignment of 3d and text latent spaces | arXiv: 2503.05283
- espire a diagnostic benchmark for embodied spatial reasoning of vision-language | arXiv: 2603.13033
- estimating body and hand motion in an ego-sensed world | arXiv: 2410.03665
- etap event-based tracking of any point | arXiv: 2412.00133
- ev-3dod pushing the temporal boundaries of 3d object detection with event camera | arXiv: 2502.19630
- eval3d interpretable and fine-grained evaluation for 3d generation | arXiv: 2504.18509
- evaluating model perception of color illusions in photorealistic scenes | arXiv: 2412.06184
- evaluating vision-language models as evaluators in path planning | arXiv: 2411.18711
- evenhancer empowering effectiveness efficiency and generalizability for continuo
- event ellipsometer event-based mueller-matrix video imaging | arXiv: 2411.17313
- event fields capturing light fields at high speed resolution and dynamic range | arXiv: 2412.06191
- event-based video super-resolution via state space models
- event-equalized dense video captioning
- eventfly event camera perception from ground to the sky | arXiv: 2503.19916
- eventgpt event stream understanding with multimodal large language models | arXiv: 2412.00832
- eventpsr surface normal and reflectance estimation from photometric stereo using
- eventsplat 3d gaussian splatting from moving event cameras for real-time renderi
- every sam drop counts embracing semantic priors for multi-modality image fusion | arXiv: 2503.01210
- everything to the synthetic diffusion-driven test-time adaptation via synthetic- | arXiv: 2406.04295
- evidential learning driven breast tumor segmentation with stage-divided vision-l | arXiv: 2603.11206
- evocc accurate semantic occupancy for automated driving using evidence theory
- evolsplat efficient volume-based gaussian splatting for urban view synthesis | arXiv: 2503.20168
- evolving high-quality rendering and reconstruction in a unified framework with c | arXiv: 2503.00881
- evos efficient implicit neural training via evolutionary selector | arXiv: 2412.10153
- evotok a unified image tokenizer via residual latent evolution for visual unders | arXiv: 2603.12108
- evpgs enhanced view prior guidance for splatting-based extrapolated view synthes
- exact exploring space-time perceptive clues for weakly supervised satellite imag
- expert pyramid tuning efficient parameter fine-tuning for expertise-driven task | arXiv: 2603.12577
- expertaf expert actionable feedback from video | arXiv: 2408.00672
- explainable saliency articulating reasoning with contextual prioritization
- explaining domain shifts in language concept erasing for interpretable image cla
- explaining in diffusion explaining a classifier with diffusion semantics
- explicit depth-aware blurry video frame interpolation guided by differential cur
- exploiting deblurring networks for radiance fields | arXiv: 2502.14454
- exploiting temporal state space sharing for video semantic segmentation | arXiv: 2503.20824
- exploration-driven generative interactive environments | arXiv: 2504.02515
- exploring clips dense knowledge for weakly supervised semantic segmentation | arXiv: 2503.20826
- exploring contextual attribute density in referring expression counting | arXiv: 2503.12460
- exploring historical information for rgbe visual tracking with mamba
- exploring intrinsic normal prototypes within a single image for universal anomal
- exploring scene affinity for semi-supervised lidar semantic segmentation | arXiv: 2408.11280
- exploring semantic feature discrimination for perceptual image super-resolution
- exploring simple open-vocabulary semantic segmentation | arXiv: 2401.12217
- exploring sparse moe in gans for text-conditioned image synthesis | arXiv: 2309.03904
- exploring temporally-aware features for point tracking | arXiv: 2501.12218
- exploring the deep fusion of large language models and diffusion transformers fo
- exploring timeline control for facial motion generation | arXiv: 2505.20861
- exploring visual vulnerabilities via multi-loss adversarial search for jailbreak
- exposure-slot exposure-centric representations learning with slot-in-slot attent
- extrapolating and decoupling image-to-video generation models motion modeling is
- extreme rotation estimation in the wild | arXiv: 2411.07096
- ezsr event-based zero-shot recognition | arXiv: 2407.21616
- f-lmm grounding frozen large multimodal models | arXiv: 2406.05821
- f3ocus - federated finetuning of vision-language foundation models with optimal
- face forgery video detection via temporal forgery cue unraveling
- facebench a multi-view multi-level facial attribute vqa dataset for benchmarking
- factchexcker mitigating measurement hallucinations in chest x-ray report generat
- factored-neus reconstructing surfaces illumination and materials of possibly glo
- fada fast diffusion avatar synthesis with mixed-supervised multi-cfg distillatio
- fade frequency-aware diffusion model factorization for video editing | arXiv: 2506.05934
- faithdiff unleashing diffusion priors for faithful image super-resolution | arXiv: 2411.18824
- falcon fairness learning via contrastive attention approach to continual semanti
- fam diffusion frequency and attention modulation for high-resolution image gener
- fancy123 one image to high-quality 3d mesh generation via plug-and-play deformat
- fast and accurate gigapixel pathological image classification with hierarchical
- fast3r towards 3d reconstruction of 1000 images in one forward pass | arXiv: 2501.13928
- faster focal token acquiring-and-scaling transformer for long-term 3d objection | arXiv: 2503.01899
- faster parameter-efficient tuning with token redundancy reduction | arXiv: 2503.20282
- fastvlm efficient vision encoding for vision language models | arXiv: 2412.13303
- fate full-head gaussian avatar with textural editing from monocular video | arXiv: 2411.15604
- fc-track overlap-aware post-association correction for online multi-object track | arXiv: 2603.12758
- fdeid-toolbox face de-identification toolbox | arXiv: 2603.13121
- fds frequency-aware denoising score for text-guided latent diffusion image editi
- feat2gs probing visual foundation models with gaussian splatting | arXiv: 2412.09606
- feature information driven position gaussian distribution estimation for tiny ob
- feature selection for latent factor models | arXiv: 2412.10128
- feature spectrum learning for remote sensing change detection
- feature-preserving mesh decimation for normal integration | arXiv: 2504.00867
- feature4x bridging any monocular video to 4d agentic ai with versatile gaussian | arXiv: 2503.20776
- fedawa adaptive optimization of aggregation weights in federated learning using | arXiv: 2503.15842
- fedbip heterogeneous one-shot federated learning with personalized latent diffus
- fedcalm conflict-aware layer-wise mitigation for selective aggregation in deeper
- fedcs coreset selection for federated learning
- federated learning with domain shift eraser | arXiv: 2503.13063
- federated modality-specific encoders and partially personalized fusion decoder f | arXiv: 2603.04887
- fedmia an effective membership inference attack exploiting all for one principle
- fedspa generalizable federated graph learning under homophily heterogeneity
- feededit text-based image editing with dynamic feedback regulation
- ferret an efficient online continual learning framework under varying memory con
- few-shot implicit function generation via equivariance | arXiv: 2501.01601
- few-shot personalized scanpath prediction | arXiv: 2504.05499
- few-shot recognition via stage-wise retrieval-augmented finetuning | arXiv: 2406.11148
- ffacenerf few-shot face editing in neural radiance fields | arXiv: 2503.17095
- ffr frequency feature rectification for weakly supervised semantic segmentation
- fg2 fine-grained cross-view localization by fine-grained feature matching
- fiction 4d future interaction prediction from video | arXiv: 2412.00932
- fifa fine-grained inter-frame attention for drivers video gaze estimation
- filmcomposer llm-driven music production for silent film clips | arXiv: 2503.08147
- filter images first generate instructions later pre-instruction data selection f
- fima-q post-training quantization for vision transformers by fisher information | arXiv: 2506.11543
- finding local diffusion schrodinger bridge using kolmogorov-arnold network
- fine-grained erasure in text-to-image diffusion-based foundation models | arXiv: 2503.19783
- fine-grained image-text correspondence with cost aggregation for open-vocabulary | arXiv: 2501.09688
- finecaption compositional image captioning focusing on wherever you want at any | arXiv: 2411.15411
- finelip extending clips reach via fine-grained alignment with longer text inputs | arXiv: 2504.01916
- finephys fine-grained human action generation by explicitly incorporating physic
- finer-cam spotting the difference reveals finer details for visual explanation | arXiv: 2501.11309
- finevq fine-grained user generated content video quality assessment | arXiv: 2412.19238
- fingerprinting denoising diffusion probabilistic models
- finite difference flow optimization for rl post-training of text-to-image models | arXiv: 2603.12893
- finsler multi-dimensional scaling manifold learning for asymmetric dimensionalit
- fire fixed-points of restoration priors for solving inverse problems | arXiv: 2411.18970
- fire robust detection of diffusion-generated images via frequency-guided reconst
- fireedit fine-grained instruction-based image editing via region-aware vision la
- fireplace geometric refinements of llm common sense reasoning for 3d object plac
- fish-vista a multi-purpose dataset for understanding identification of traits fr
- fishertune fisher-guided robust tuning of vision foundation models for domain ge
- fitted neural lossless image compression
- flair vlm with fine-grained language-informed image representations | arXiv: 2412.03561
- flame frozen large language models enable data-efficient language-image pre-trai
- flare feed-forward geometry appearance and camera estimation from uncalibrated s | arXiv: 2502.12138
- flash-split 2d reflection removal with flash cues and latent diffusion separatio
- flash3d super-scaling point transformers through joint hardware-geometry localit
- flashgs efficient 3d gaussian splatting for large-scale and high-resolution rend
- flashmotion few-step controllable video generation with trajectory guidance | arXiv: 2603.12146
- flashsloth lightning multimodal large language models via embedded visual compre
- flavc learned video compression with feature level attention
- flexdrive toward trajectory flexibility in driving scene gaussian splatting reco
- flexgs train once deploy everywhere with many-in-one flexible 3d gaussian splatt
- flexible frame selection for efficient video reasoning
- flexible group count enables hassle-free structured pruning
- flexidit your diffusion transformer can easily generate high-quality samples wit
- flexuod the answer to real-world unsupervised image outlier detection
- flipsketch flipping static drawings to text-guided sketch animations | arXiv: 2411.10818
- floating no more object-ground reconstruction from a single image | arXiv: 2407.18914
- florence-vl enhancing vision-language models with generative vision encoder and | arXiv: 2412.04424
- flovd optical flow meets video diffusion model for enhanced camera-controlled vi
- flow-nerf joint learning of geometry poses and dense flow within unified neural | arXiv: 2503.10464
- flowing from words to pixels a noise-free framework for cross-modality evolution | arXiv: 2412.15213
- flowram grounding flow matching policy with region-aware mamba framework for rob
- floxels fast unsupervised voxel based scene flow estimation | arXiv: 2503.04718
- fluidnexus 3d fluid reconstruction and prediction from a single video | arXiv: 2503.04720
- fluxspace disentangled semantic editing in rectified flow models
- focal split untethered snapshot depth from differential defocus | arXiv: 2504.11202
- focus knowledge-enhanced adaptive visual compression for few-shot whole slide im
- focus-n-fix region-aware fine-tuning for text-to-image generation | arXiv: 2501.06481
- focusing on tracks for online multi-object tracking
- foley-flow coordinated video-to-audio generation with masked audio-visual alignm
- font-agent enhancing font understanding with large language models
- forensic self-descriptions are all you need for zero-shot detection open-set sou
- forensics adapter adapting clip for generalizable face forgery detection | arXiv: 2411.19715
- forensics-bench a comprehensive forgery detection benchmark suite for large visi
- forensiczip more tokens are better but not necessary in forensic vision-language | arXiv: 2603.12208
- forestlpr lidar place recognition in forests attentioning multiple bev density i | arXiv: 2503.04475
- forming auxiliary high-confident instance-level loss to promote learning from la
- fortifying federated learning towards trustworthiness via auditable data valuati
- foundations of the theory of performance-based ranking | arXiv: 2412.04227
- foundationstereo zero-shot stereo matching | arXiv: 2501.09898
- foundhand large-scale domain-specific learning for controllable hand image gener | arXiv: 2412.02690
- foveated instance segmentation | arXiv: 2503.21854
- fractal calibration for long-tailed object detection | arXiv: 2410.11774
- fractals made practical denoising diffusion as partitioned iterated function sys | arXiv: 2603.13069
- frame floor-aligned representation for avatar motion from egocentric video | arXiv: 2503.23094
- frames-vqa benchmarking fine-tuning robustness across multi-modal shifts in visu
- framevggt frame evidence rolling memory for streaming vggt | arXiv: 2603.07690
- free lunch enhancements for multi-modal crowd counting
- free on the fly enhancing flexibility in test-time adaptation with online em | arXiv: 2507.06973
- free-viewpoint human animation with pose-correlated reference selection | arXiv: 2412.17290
- free360 layered gaussian splatting for unbounded 360-degree view synthesis from
- freecloth free-form generation enhances challenging clothed human modeling | arXiv: 2411.19942
- freegave 3d physics learning from dynamic videos by gaussian velocity | arXiv: 2506.07865
- freepca integrating consistency information across long-short frames in training
- freescene mixed graph diffusion for 3d scene synthesis from free prompts | arXiv: 2506.02781
- freesim toward free-viewpoint camera simulation in driving scenes | arXiv: 2412.03566
- freetimegs free gaussian primitives at anytime anywhere for dynamic scene recons
- freeuv ground-truth-free realistic facial uv texture recovery via cross-assembly | arXiv: 2503.17197
- freqdebias towards generalizable deepfake detection via consistency-driven frequ
- frequency dynamic convolution for dense image prediction | arXiv: 2503.18783
- frequency-biased synergistic design for image compression and compensation
- fresa feedforward reconstruction of personalized skinned avatars from few images | arXiv: 2503.19207
- from alexnet to transformers measuring the non-linearity of deep neural networks
- from elements to design a layered approach for automatic graphic design composit | arXiv: 2412.19712
- from faces to voices learning hierarchical representations for high-quality vide
- from head to tail efficient black-box model inversion attack via long-tailed lea
- from head to tail towards balanced representation in large vision-language model
- from laboratory to real world a new benchmark towards privacy-preserved visible-
- from multimodal llms to generalist embodied agents methods and lessons | arXiv: 2412.08442
- from poses to identity training-free person re-identification via feature centra
- from prototypes to general distributions an efficient curriculum for masked imag | arXiv: 2411.10685
- from slow bidirectional to fast autoregressive video diffusion models | arXiv: 2412.07772
- from sparse signal to smooth motion real-time motion generation with rolling pre
- from sparse to dense camera relocalization with scene-specific detector from fea
- from words to structured visuals a benchmark and framework for text-to-diagram g | arXiv: 2411.11916
- from zero to detail deconstructing ultra-high-definition image restoration from
- frugalnerf fast convergence for extreme few-shot novel view synthesis without le
- fruitninja 3d object interior texture generation with gaussian splatting | arXiv: 2411.12089
- fsbench a figure skating benchmark for advancing artistic sports understanding | arXiv: 2504.19514
- fsboard over 3 million characters of asl fingerspelling collected via smartphone | arXiv: 2407.15806
- fsfm a generalizable face security foundation model via self-supervised facial r | arXiv: 2412.12032
- fshnet fully sparse hybrid network for 3d object detection | arXiv: 2506.03714
- full-dof egomotion estimation for event cameras using geometric solvers | arXiv: 2503.03307
- functionality understanding and segmentation in 3d scenes | arXiv: 2411.16310
- fuzzy multimodal learning for trusted cross-modal retrieval
- g3d-lf generalizable 3d-language feature fields for embodied tasks | arXiv: 2411.17030
- g3flow generative 3d semantic flow for pose-aware and generalizable object manip
- ga3ce unconstrained 3d gaze estimation with gaze-aware 3d context encoding | arXiv: 2505.10671
- gaf gaussian avatar reconstruction from monocular videos via multi-view diffusio
- gain from neighbors boosting model robustness in the wild via adversarial pertur
- galaxy walker geometry-aware vlms for galaxy-scale understanding | arXiv: 2503.18578
- gapt-dar category-level garments pose tracking via integrated 2d deformation and
- garmentpile point-level visual affordance guided retrieval and adaptation for cl
- gasp gaussian avatars with synthetic priors | arXiv: 2412.07739
- gaucho gaussian distributions with cholesky decomposition for oriented object de
- gausshdr high dynamic range gaussian splatting via learning unified 3d and 2d lo | arXiv: 2503.10143
- gaussian eigen models for human heads | arXiv: 2407.04545
- gaussian splashing unified particles for versatile motion synthesis and renderin
- gaussian splatting feature fields for privacy-preserving visual localization | arXiv: 2507.23569
- gaussian splatting for efficient satellite image photogrammetry | arXiv: 2412.13047
- gaussianformer-2 probabilistic gaussian superposition for efficient 3d occupancy | arXiv: 2412.04384
- gaussianip identity-preserving realistic 3d human generation via human-centric d | arXiv: 2503.11143
- gaussianspa an optimizing-sparsifying simplification framework for compact and h
- gaussianudf inferring unsigned distance functions through 3d gaussian splatting | arXiv: 2503.19458
- gaussianworld gaussian world model for streaming 3d occupancy prediction | arXiv: 2412.10373
- gausstr foundation model-aligned gaussian transformer for self-supervised 3d spa
- gaustar gaussian surface tracking and reconstruction | arXiv: 2501.10283
- gaze-lle gaze target estimation via large-scale learned encoders | arXiv: 2412.09586
- gazegene large-scale synthetic gaze dataset with 3d eyeball annotations
- gazing at rewards eye movements as a lens into human and ai decision-making in h | arXiv: 2411.09176
- gazing into missteps leveraging eye-gaze for unsupervised mistake detection in e
- gbc-splat generalizable gaussian-based clothed human digitalization under sparse
- gblobs explicit local structure via gaussian blobs for improved cross-domain lid
- gcc generative color constancy via diffusing a color checker | arXiv: 2502.17435
- gce-pose global context enhancement for category-level object pose estimation | arXiv: 2502.04293
- geal generalizable 3d affordance learning with cross-modal consistency | arXiv: 2412.09511
- gem a generalizable ego-vision multimodal world model for fine-grained ego-motio
- gen3c 3d-informed world-consistent video generation with precise camera control | arXiv: 2503.03751
- gen3deval using vllms for automatic evaluation of generated 3d objects | arXiv: 2504.08125
- genassets generating in-the-wild 3d assets in latent space
- gendeg diffusion-based degradation synthesis for generalizable all-in-one image | arXiv: 2411.17687
- generalizable object keypoint localization from generative priors
- generalized diffusion detector mining robust features from diffusion models for | arXiv: 2503.02101
- generalized few-shot 3d point cloud segmentation with vision-language model | arXiv: 2503.16282
- generalized gaussian entropy model for point cloud attribute compression with dy
- generalized recorrupted-to-recorrupted self-supervised learning beyond gaussian | arXiv: 2412.04648
- generalized zero-shot classification via semantics-free inter-class feature gene
- generalizing deepfake video detection with plug-and-play video-level blending an
- generating 3d-consistent videos from unposed internet photos | arXiv: 2411.13549
- generating 6dof object manipulation trajectories from action description in egoc
- generating multimodal driving scenes via next-scene prediction | arXiv: 2503.14945
- generation of maximal snake polyominoes using a deep neural network | arXiv: 2603.12400
- generative densification learning to densify gaussians for high-fidelity general
- generative gaussian splatting for unbounded 3d city generation | arXiv: 2406.06526
- generative hard example augmentation for semantic point cloud segmentation
- generative image layer decomposition with visual effects | arXiv: 2411.17864
- generative inbetweening through frame-wise conditions-driven video generation | arXiv: 2412.11755
- generative map priors for collaborative bev semantic segmentation
- generative modeling of class probability for multi-modal representation learning | arXiv: 2503.17417
- generative multimodal pretraining with discrete diffusion timestep tokens | arXiv: 2504.14666
- generative multiview relighting for 3d reconstruction under extreme illumination | arXiv: 2412.15211
- generative omnimatte learning to decompose video into layers | arXiv: 2411.16683
- generative photography scene-consistent camera control for realistic text-to-ima
- generative photomontage | arXiv: 2408.07116
- generative sparse-view gaussian splatting
- generative video propagation | arXiv: 2412.19761
- generative zero-shot composed image retrieval
- genfusion closing the loop between reconstruction and generation via videos | arXiv: 2503.21219
- genius a generative framework for universal multimodal search | arXiv: 2503.19868
- genmanip llm-driven simulation for generalizable instruction-following manipulat
- genpc zero-shot point cloud completion via 3d generative priors | arXiv: 2502.19896
- genvdm generating vector displacement maps from a single image | arXiv: 2503.00605
- geoavatar geometrically-consistent multi-person avatar reconstruction from spars
- geochemad benchmarking unsupervised geochemical anomaly detection for mineral ex | arXiv: 2603.13068
- geodepth from point-to-depth to plane-to-depth modeling for self-supervised mono
- geometric knowledge-guided localized global distribution alignment for federated | arXiv: 2503.06457
- geometry field splatting with gaussian surfels | arXiv: 2411.17067
- geometry in style 3d stylization via surface normal deformation | arXiv: 2503.23241
- geometry-guided camera motion understanding in videollms | arXiv: 2603.13119
- geometry-guided online 3d video synthesis with multi-view temporal consistency | arXiv: 2505.18932
- geomm on geodesic perspective for multi-modal learning | arXiv: 2505.11216
- ges3vig incorporating pointing gestures into language-based 3d visual grounding
- get unlocking the multi-modal potential of clip for generalized category discove
- gflowvlm enhancing multi-step reasoning in vision-language models with generativ
- gg-ssms graph-generating state space models | arXiv: 2412.12423
- gif generative inspiration for face recognition at scale | arXiv: 2505.03012
- gifstream 4d gaussian-based immersive video with feature stream | arXiv: 2505.07539
- gigahands a massive annotated dataset of bimanual hand activities | arXiv: 2412.04244
- giim graph-based learning of inter- and intra-view dependencies for multi-view m | arXiv: 2603.09446
- givepose gradual intra-class variation elimination for rgb-based category-level
- glane3d detecting lanes with graph of 3d keypoints | arXiv: 2503.23882
- glass guided latent slot diffusion for object-centric learning | arXiv: 2407.17929
- glianet adaptive neural network structure learning with glia-driven
- global-local tree search in vlms for 3d indoor scene generation | arXiv: 2503.18476
- glossy object reconstruction with cost-effective polarized acquisition | arXiv: 2504.07025
- glus global-local reasoning unified into a single large language model for video | arXiv: 2504.07962
- glyphmastero a glyph encoder for high-fidelity scene text editing | arXiv: 2505.04915
- go-n3rdet geometry optimized nerf-enhanced 3d object detector | arXiv: 2503.15211
- go-with-the-flow motion-controllable video diffusion models using real-time warp
- goal global-local object alignment learning | arXiv: 2503.17782
- goalflow goal-driven flow matching for multimodal trajectories generation in end
- goku flow based video generative foundation models | arXiv: 2502.04896
- golden cudgel network for real-time semantic segmentation | arXiv: 2503.03325
- golf-nrt integrating global context and local geometry for few-shot view synthes
- good cheap and fast overfitted image compression with wasserstein distortion | arXiv: 2412.00505
- gpavatar high-fidelity head avatars by learning efficient gaussian projections
- gps as a control signal for image generation | arXiv: 2501.12390
- gpvk-vl geometry-preserving virtual keyframes for visual localization under larg
- grade benchmarking discipline-informed reasoning in image editing | arXiv: 2603.12264
- gradient inversion attacks on parameter-efficient fine-tuning | arXiv: 2506.04453
- gradient-guided annealing for domain generalization | arXiv: 2502.20162
- grae-3dmot geometry relation-aware encoder for online 3d multi-object tracking
- graph neural network combining event stream and periodic aggregation for low-lat
- graph-embedded structure-aware perceptual hashing for neural network protection
- graphgpt-o synergistic multimodal comprehension and generation on graphs | arXiv: 2502.11925
- graphi2p image-to-point cloud registration with exploring pattern of corresponde
- graphmimic graph-to-graphs generative modeling from videos for policy learning
- great geometry-intention collaborative inference for open-vocabulary 3d object a | arXiv: 2411.19626
- gromov-wasserstein problem with cyclic symmetry
- groomlight hybrid inverse rendering for relightable human hair appearance modeli
- ground-v teaching vlms to ground complex instructions in pixels | arXiv: 2505.13788
- grounding 3d object affordance with language instructions visual observations an | arXiv: 2504.04744
- groundingface fine-grained face understanding via pixel grounding multimodal lar
- groupmamba efficient group-based visual state space model | arXiv: 2407.13772
- grove a generalized reward for learning open-vocabulary physical skill | arXiv: 2504.04191
- gs-2dgs geometrically supervised 2dgs for reflective object reconstruction | arXiv: 2506.13110
- gs-dit advancing video generation with dynamic 3d gaussian fields through effici
- guardsplat efficient and robust watermarking for 3d gaussian splatting | arXiv: 2411.19895
- gui-xplore empowering generalizable gui agents with one exploration | arXiv: 2503.17709
- guiding human-object interactions with rich geometry and relations | arXiv: 2503.20172
- gyro-based neural single image deblurring | arXiv: 2404.00916
- h-edit effective and flexible diffusion-based editing via doob's h-transform | arXiv: 2503.02187
- h-more learning human-centric motion representation for action analysis | arXiv: 2504.10676
- h2st hierarchical two-sample tests for continual out-of-distribution detection | arXiv: 2503.14832
- hallo3 highly dynamic and realistic portrait image animation with video diffusio
- halloc token-level localization of hallucinations for vision language models | arXiv: 2506.10286
- hand-held object reconstruction from rgb video with dynamic interaction
- handling spatial-temporal data heterogeneity for federated continual learning vi
- handos 3d hand reconstruction in one stage | arXiv: 2412.01537
- hardware-rasterized ray-based gaussian splatting | arXiv: 2503.18682
- harmonyset a comprehensive dataset for understanding video-music semantic alignm
- harnessing frequency spectrum insights for image copyright protection against di
- harnessing frozen unimodal encoders for flexible multimodal alignment | arXiv: 2409.19425
- harnessing global-local collaborative adversarial perturbation for anti-customiz
- hash3d training-free acceleration for 3d generation | arXiv: 2404.06091
- hawor world-space hand motion reconstruction from egocentric videos | arXiv: 2501.02973
- hazy low-quality satellite video restoration via learning optimal joint degradat
- hd-epic a highly-detailed egocentric video dataset | arXiv: 2502.04144
- hearing anywhere in any environment | arXiv: 2504.10746
- hearing hands generating sounds from physical interactions in 3d scenes | arXiv: 2506.09989
- heatformer a neural optimizer for multiview human mesh recovery | arXiv: 2412.04456
- heie mllm-based hierarchical explainable aigc image implausibility evaluator | arXiv: 2411.17261
- helvipad a real-world dataset for omnidirectional stereo depth estimation | arXiv: 2411.18335
- hemora unsupervised heuristic consensus sampling for robust point cloud registra
- hera hybrid explicit representation for ultra-realistic head avatars
- heterogeneous skeleton-based action representation learning | arXiv: 2506.03481
- hfp-sam hierarchical frequency prompted sam for efficient marine animal segmenta | arXiv: 2603.12708
- hiap a multi-granular stochastic auto-pruning framework for vision transformers | arXiv: 2603.12222
- hiding images in diffusion models by editing learned score functions | arXiv: 2503.18459
- hierarchical adaptive filtering network for text image specular highlight remova
- hierarchical compact clustering attention coca for unsupervised object-centric l | arXiv: 2505.02071
- hierarchical dual-change collaborative learning for uav scene change captioning | arXiv: 2603.12832
- hierarchical features matter a deep exploration of progressive parameterization
- hierarchical flow diffusion for efficient frame interpolation | arXiv: 2504.00380
- hierarchical gaussian mixture model splatting for efficient and part controllabl
- hierarchical knowledge prompt tuning for multi-task test-time adaptation
- hierarq task-aware hierarchical q-former for enhanced video understanding | arXiv: 2503.08585
- hifi-portrait zero-shot identity-preserved portrait generation with high-fidelit
- hificl high-fidelity in-context learning for multimodal tasks | arXiv: 2603.12760
- high dynamic range video compression a large-scale benchmark dataset and a learn
- high temporal consistency through semantic similarity propagation in semi-superv
- high-fidelity 3d object generation from single image with rgbn-volume gaussian r | arXiv: 2504.01512
- high-fidelity lightweight mesh reconstruction from point clouds
- high-fidelity relightable monocular portrait animation with lighting-controllabl
- high-quality point cloud oriented normal estimation via hybrid angular and eucli
- higher-order ratio cycles for fast and globally optimal shape matching
- hiif hierarchical encoding based implicit image function for continuous super-re
- hilots high-low temporal sensitive representation learning for semi-supervised l
- himor monocular deformable gaussian reconstruction with hierarchical motion repr
- hipart hierarchical pose autoregressive transformer for occluded 3d human pose e | arXiv: 2503.23331
- hires-llava restoring fragmentation input in high-resolution large vision-langua
- histofs non-iid histopathologic whole slide image classification via federated s
- hmar efficient hierarchical masked auto-regressive image generation | arXiv: 2506.04421
- hogs unified near and far object reconstruction via homogeneous gaussian splatti
- hoi3dgen generating high-quality human-object-interactions in 3d | arXiv: 2603.12126
- hoigen-1m a large-scale dataset for human-object interaction video generation | arXiv: 2503.23715
- hoigpt learning long-sequence hand-object interaction with language models
- holmes-vau towards long-term video anomaly understanding at any granularity | arXiv: 2412.06171
- homesafe-bench evaluating vision-language models on unsafe action detection for | arXiv: 2603.11975
- homogen enhanced video inpainting via homography propagation and diffusion
- homogeneous dynamics space for heterogeneous humans | arXiv: 2412.06146
- hop heterogeneous topology-based multimodal entanglement for co-speech gesture g | arXiv: 2503.01175
- horizon-gs unified 3d gaussian splatting for large-scale aerial-to-ground scenes | arXiv: 2412.01745
- horp human-object relation priors guided hoi detection
- hot hadamard-based optimized training | arXiv: 2503.21261
- hot3d hand and object tracking in 3d from egocentric multi-view videos | arXiv: 2411.19167
- hotformerloc hierarchical octree transformer for versatile lidar place recogniti
- hotspot signed distance function optimization with an asymptotically sufficient | arXiv: 2411.14628
- hovle unleashing the power of monolithic vision-language models with holistic vi
- how do i do that synthesizing 3d hand motion and contacts for everyday interacti
- how to merge your multimodal models over time | arXiv: 2412.06712
- hravatar high-quality and relightable gaussian head avatar | arXiv: 2503.08224
- hsemotion team at abaw-10 competition facial expression recognition valence-arou | arXiv: 2603.12693
- hsi a holistic style injector for arbitrary style transfer | arXiv: 2502.04369
- hsi-gpt a general-purpose large scene-motion-language model for human scene inte
- human knowledge integrated multi-modal learning for single source domain general | arXiv: 2603.12369
- human motion instruction tuning | arXiv: 2411.16805
- human-centered interactive learning via mllms for text-to-image person re-identi
- humandreamer generating controllable human-motion videos via decoupled generatio
- humanmm global human motion recovery from multi-shot videos | arXiv: 2503.07597
- humanrig learning automatic rigging for humanoid character in a large scale data
- humocon concept discovery for human motion understanding | arXiv: 2505.20920
- hunet homotopy unfolding network for image compressive sensing
- hunyuanportrait implicit condition control for enhanced portrait animation | arXiv: 2503.18860
- huperflow a comprehensive benchmark for human vs machine motion estimation compa
- hush holistic panoramic 3d scene understanding using spherical harmonics
- hvi a new color space for low-light image enhancement | arXiv: 2502.20272
- hybrid concept bottleneck models
- hybrid etfce-grf exact cluster-size retrieval with analytical p-values for voxel | arXiv: 2603.11344
- hybrid global-local representation with augmented spatial guidance for zero-shot
- hybrid reciprocal transformer with triplet feature alignment for scene graph gen
- hybrid-level instruction injection for video token compression in multi-modal la
- hybridgs decoupling transients and statics with 2d and 3d gaussian splatting | arXiv: 2412.03844
- hybridmqa exploring geometry-texture interactions for colored mesh quality asses
- hyperbolic category discovery | arXiv: 2504.06120
- hyperbolic safety-aware vision-language models | arXiv: 2503.12127
- hyperbolic uncertainty-aware few-shot incremental point cloud segmentation
- hyperdimensional uncertainty quantification for multimodal uncertainty fusion in
- hyperfree a channel-adaptive and tuning-free foundation model for hyperspectral
- hyperglm hypergraph for video scene graph generation and anticipation | arXiv: 2411.18042
- hypergraph vision transformers images are more than nodes more than edges | arXiv: 2504.08710
- hypergs hyperspectral 3d gaussian splatting | arXiv: 2412.12849
- hyperlora parameter-efficient adaptive generation for portrait synthesis | arXiv: 2503.16944
- hypernet fields efficiently training hypernetworks without ground truth by learn
- hypernvd accelerating neural video decomposition via hypernetworks | arXiv: 2503.17276
- hyperpose hypernetwork-infused camera pose localization and an extended cambridg
- hyperseg hybrid segmentation assistant with fine-grained visual perceiver
- hyperspectral pansharpening via diffusion models with iteratively zero-shot guid
- i2vguard safeguarding images against misuse in diffusion-based image-to-video mo
- iaao interactive affordance learning for articulated objects in 3d environments | arXiv: 2504.06827
- ice intrinsic concept extraction from a single image via diffusion models | arXiv: 2503.19902
- icediff high resolution and high-quality arctic sea ice forecasting with generat
- icp immediate compensation pruning for mid-to-high sparsity
- ict image-object cross-level trusted intervention for mitigating object hallucin
- id-patch robust id association for group photo personalization | arXiv: 2411.13632
- idea inverted text with cooperative deformable aggregation for multi-modal objec
- idea-bench how far are generative models from professional designing | arXiv: 2412.11767
- identifying and mitigating position bias of multi-image vision-language models | arXiv: 2503.13792
- identifying and mitigating spurious correlation in multi-task learning
- identity-clothing similarity modeling for unsupervised clothing change person re
- identity-preserving distillation sampling by fixed-point iterator | arXiv: 2502.19930
- identity-preserving text-to-video generation by frequency decomposition | arXiv: 2411.17440
- idol instant photorealistic 3d human creation from a single image | arXiv: 2412.14963
- idprotector an adversarial noise encoder to protect against id-preserving image | arXiv: 2412.11638
- ig-6dof model-free 6dof pose estimation for unseen object via iterative 3d gauss
- ilias instance-level image retrieval at scale | arXiv: 2502.11748
- illumination spectrum estimation for multispectral images via surface reflectanc
- im-portrait learning 3d-aware video diffusion for photorealistic talking heads f
- im-zero instance-level motion controllable video generation in a zero-shot manne
- image generation diversity issues and how to tame them | arXiv: 2411.16171
- image is all you need to empower large-scale diffusion models for in-domain gene
- image over text transforming formula recognition evaluation with character detec
- image quality assessment from human to machine preference | arXiv: 2503.10078
- image quality assessment investigating causal perceptual effects with abductive | arXiv: 2412.16939
- image reconstruction from readout-multiplexed single-photon detector arrays | arXiv: 2312.02971
- image referenced sketch colorization based on animation creation workflow | arXiv: 2502.19937
- imagine and seek improving composed image retrieval with an imagined proxy | arXiv: 2411.16752
- imaginefsl self-supervised pretraining matters on imagined base set for vlm-base
- imfine 3d inpainting via geometry-guided multi-view refinement | arXiv: 2503.04501
- img-diff contrastive data synthesis for multimodal large language models | arXiv: 2408.04594
- immune improving safety against jailbreaks in multi-modal llms via inference-tim
- implicit bias injection attacks against text-to-image diffusion models | arXiv: 2504.01819
- implicit correspondence learning for image-to-point cloud registration
- improve representation for imbalanced regression through geometric constraints | arXiv: 2503.00876
- improved monocular depth prediction using distance transform over pre-semantic c
- improved video vae for latent video diffusion model | arXiv: 2411.06449
- improving accuracy and calibration via differentiated deep mutual learning
- improving adversarial transferability on vision transformers via forward propaga
- improving autoregressive visual generation with cluster-oriented token predictio
- improving diffusion inverse problem solving with decoupled noise annealing | arXiv: 2407.01521
- improving editability in image generation with layer-wise memory | arXiv: 2505.01079
- improving gaussian splatting with localized points management | arXiv: 2406.04251
- improving personalized search with regularized low-rank parameter updates | arXiv: 2506.10182
- improving semi-supervised semantic segmentation with sliced-wasserstein feature
- improving sound source localization with joint slot attention on image and audio | arXiv: 2504.15118
- improving the training of data-efficient gans via quality aware dynamic discrimi
- improving the transferability of adversarial attacks on face recognition with di
- improving transferable targeted attacks with feature tuning mixup | arXiv: 2411.15553
- improving visual and downstream performance of low-light enhancer with vision fo
- imputation-free and alignment-free incomplete multi-view clustering driven by co
- imvid immersive volumetric videos for enhanced vr engagement | arXiv: 2503.14359
- inceventgs pose-free gaussian splatting from a single event camera | arXiv: 2410.08107
- incomplete multi-modal brain tumor segmentation via learnable sorting state spac
- incomplete multi-view multi-label learning via disentangled representation and l
- incorporating dense knowledge alignment into unified multimodal representation m
- incremental object keypoint learning | arXiv: 2503.20248
- indoorgs geometric cues guided gaussian splatting for indoor scene reconstructio
- inference-scale complexity in ann-snn conversion for high-performance and low-po
- infighting in the dark multi-label backdoor attack in federated learning | arXiv: 2409.19601
- infinity scaling bitwise autoregressive modeling for high-resolution image synth
- influence malleability in linearized attention dual implications of non-converge | arXiv: 2603.13085
- infp audio-driven interactive head generation in dyadic conversations | arXiv: 2412.04037
- inpo inversion preference optimization with reparametrized ddim for efficient di
- insight-v exploring long-chain visual reasoning with multimodal large language m | arXiv: 2411.14432
- insightedit towards better instruction following for image editing | arXiv: 2411.17323
- insightful instance features for 3d instance segmentation
- inst3d-lmm instance-aware 3d scene understanding with multi-modal instruction tu
- instag learning personalized 3d talking head from few-second video | arXiv: 2502.20387
- instance-wise supervision-level optimization in active learning | arXiv: 2503.06517
- instancecap improving text-to-video generation via instance-aware structured cap
- instancegaussian appearance-semantic joint gaussian representation for 3d instan
- instant adversarial purification with adversarial consistency distillation | arXiv: 2408.17064
- instant gaussian stream fast and generalizable streaming of dynamic scene recons
- instant3dit multiview inpainting for fast editing of 3d objects | arXiv: 2412.00518
- instanthdr single-forward gaussian splatting for high dynamic range 3d reconstru | arXiv: 2603.11298
- instruct-clip improving instruction-guided image editing with automated data ref
- instruction-based image manipulation by watching how things move | arXiv: 2412.12087
- integral fast fourier color constancy | arXiv: 2502.03494
- integration of deep generative anomaly detection algorithm in high-speed industr | arXiv: 2603.07577
- interact advancing large-scale versatile 3d human-object interaction generation | arXiv: 2509.09555
- interactanything zero-shot human object interaction synthesis via llm feedback a
- interactionmap improving online vectorized hdmap construction with interaction | arXiv: 2503.21659
- interactive medical image analysis with concept-based similarity reasoning | arXiv: 2503.06873
- interactive medical image segmentation a benchmark dataset and baseline | arXiv: 2411.12814
- interactvlm 3d interaction reasoning from 2d foundational models | arXiv: 2504.05303
- interdyn controllable interactive dynamics with video diffusion models | arXiv: 2412.11785
- interedit navigating text-guided multi-human 3d motion editing | arXiv: 2603.13082
- interleaved-modal chain-of-thought | arXiv: 2411.19488
- intermimic towards universal whole-body control for physics-based human-object i | arXiv: 2502.20390
- interpretable generative models through post-hoc concept bottlenecks | arXiv: 2503.19377
- interpretable image classification via non-parametric part prototype learning | arXiv: 2503.10247
- interpreting object-level foundation models via visual precision search | arXiv: 2411.16198
- inversion circle interpolation diffusion-based image augmentation for data-scarc
- investigating the role of weight decay in enhancing nonconvex sgd
- invisible backdoor attack against self-supervised learning | arXiv: 2405.14672
- irgs inter-reflective gaussian splatting with 2d gaussian ray tracing | arXiv: 2412.15867
- iris inverse rendering of indoor scenes from low dynamic range images | arXiv: 2401.12977
- is right right enhancing object orientation understanding in multimodal large la
- is this generated person existed in real-world fine-grained detecting and calibr
- is your world simulator a good story presenter a consecutive events-based benchm
- isegman interactive segment-and-manipulate 3d gaussians | arXiv: 2505.11934
- ita-mdt image-timestep-adaptive masked diffusion transformer framework for image
- iterative predictor-critic code decoding for real-world image dehazing | arXiv: 2503.13147
- iteris iterative inference-solving alignment for lora merging | arXiv: 2411.15231
- it's a blind match towards vision-language correspondence without parallel data | arXiv: 2503.24129
- jailbreaking the non-transferable barrier via test-time data disguising | arXiv: 2503.17198
- jamma ultra-lightweight local feature matching with joint mamba | arXiv: 2503.03437
- janus decoupling visual encoding for unified multimodal understanding and genera
- janusflow harmonizing autoregression and rectified flow for unified multimodal u | arXiv: 2411.07975
- jarvisir elevating autonomous driving perception with intelligent image restorat
- jisam alleviate labeling burden and corner case problems in autonomous driving v
- joint and streamwise distributed mimo satellite communications with multi-antenn | arXiv: 2603.12914
- joint optimization of neural radiance fields and continuous camera motion from a | arXiv: 2504.19819
- joint out-of-distribution filtering and data discovery active learning | arXiv: 2503.02491
- joint scheduling of causal prompts and tasks for multi-task learning
- joint vision-language social bias removal for clip | arXiv: 2411.12785
- jopp-3d joint open vocabulary semantic segmentation on point clouds and panorama | arXiv: 2603.06168
- jtd-uav mllm-enhanced joint tracking and description framework for anti-uav syst
- just dance with pi a poly-modal inductor for weakly-supervised video anomaly det
- k-lora unlocking training-free fusion of any subject and style loras | arXiv: 2502.18461
- k-sort arena efficient and reliable benchmarking for generative models via k-wis
- kac kolmogorov-arnold classifier for continual learning | arXiv: 2503.21076
- keep the balance a parameter-efficient symmetrical framework for rgbx semantic s
- keyface expressive audio-driven facial animation for long sequences via keyframe | arXiv: 2503.01715
- keyframe-guided creative video inpainting
- kiss3dgen repurposing image diffusion models for 3d asset generation | arXiv: 2503.01370
- kmd koopman multi-modality decomposition for generalized brain tumor segmentatio
- knowledge bridger towards training-free missing modality completion | arXiv: 2502.19834
- knowledge memorization and rumination for pre-trained model-based class-incremen
- knowledge-aligned counterfactual-enhancement diffusion perception for unsupervis
- koala-36m a large-scale video dataset improving consistency between fine-grained
- kvq boosting video quality assessment via saliency-guided local perception | arXiv: 2503.10259
- l-swag layer-sample wise activation with gradients information for zero-shot nas
- l2gtx from local to global time series explanations | arXiv: 2603.13065
- label shift meets online learning ensuring consistent adaptation with universal
- lal enhancing 3d human motion prediction with latency-aware auxiliary learning
- lamra large multimodal model as your advanced retrieval assistant | arXiv: 2412.01720
- language guided concept bottleneck models for interpretable continual learning | arXiv: 2503.23283
- language-assisted debiasing and smoothing for foundation model-based semi-superv
- language-grounded decoupled action representation for robotic manipulation | arXiv: 2603.12967
- language-guided audio-visual learning for long-term sports assessment
- language-guided image tokenization for generation | arXiv: 2412.05796
- language-guided salient object ranking
- large self-supervised models bridge the gap in domain adaptive object detection | arXiv: 2503.23220
- large-scale multi-view tensor clustering with implicit linear kernels
- large-scale text-to-image model with inpainting is a zero-shot subject-driven im
- latent drifting in diffusion models for counterfactual medical image synthesis | arXiv: 2412.20651
- latent space imaging | arXiv: 2407.07052
- latent space super-resolution for higher-resolution image generation with diffus
- latenthoi on the generalizable hand object motion generation with latent hand di
- latexblend scaling multi-concept customized generation with latent textual blend | arXiv: 2503.06956
- latte-mv learning to anticipate table tennis hits from monocular videos | arXiv: 2503.20936
- lavin-dit large vision diffusion transformer | arXiv: 2411.11505
- layer- and timestep-adaptive differentiable token compression ratios for efficie
- layered image vectorization via semantic simplification | arXiv: 2406.05404
- layered motion fusion lifting motion segmentation to 3d in egocentric videos | arXiv: 2506.05546
- layoutvlm differentiable optimization of 3d layout via vision-language models | arXiv: 2412.02193
- lc-mamba local and continuous mamba with shifted windows for frame interpolation
- leangaussian breaking pixel or point cloud correspondence in modeling 3d gaussia
- learnable infinite taylor gaussian for dynamic view rendering | arXiv: 2412.04282
- learned binocular-encoding optics for rgbd imaging using joint stereo and focus
- learned image compression with dictionary-based entropy model | arXiv: 2504.00496
- learning 4d panoptic scene graph generation from rich 2d visual scene | arXiv: 2503.15019
- learning affine correspondences by integrating geometric constraints | arXiv: 2504.04834
- learning audio-guided video representation with gated attention for video-text r | arXiv: 2504.02397
- learning bijective surface parameterization for inferring signed distance functi
- learning class prototypes for unified sparse-supervised 3d object detection | arXiv: 2503.21099
- learning compatible multi-prize subnetworks for asymmetric retrieval | arXiv: 2504.11879
- learning conditional space-time prompt distributions for video class-incremental
- learning dynamic collaborative network for semi-supervised 3d vessel segmentatio
- learning endogenous attention for incremental object detection
- learning extremely high density crowds as active matters | arXiv: 2503.12168
- learning flow fields in attention for controllable person image generation | arXiv: 2412.08486
- learning from neighbors category extrapolation for long-tail learning | arXiv: 2410.15980
- learning from streaming video with orthogonal gradients | arXiv: 2504.01961
- learning from synchronization self-supervised uncalibrated multi-view person ass
- learning hazing to dehazing towards realistic haze generation for real-world ima
- learning heterogeneous tissues with mixture of experts for gigapixel whole slide
- learning occlusion-robust vision transformers for real-time uav tracking | arXiv: 2504.09228
- learning on model weights using tree experts | arXiv: 2410.13569
- learning partonomic 3d reconstruction from image collections
- learning person-specific animatable face models from in-the-wild images via a sh
- learning phase distortion with selective state space models for video turbulence | arXiv: 2504.02697
- learning physics from video unsupervised physical parameter estimation for conti
- learning physics-based full-body human reaching and grasping from brief walking | arXiv: 2503.07481
- learning temporally consistent video depth from video diffusion priors | arXiv: 2406.01493
- learning textual prompts for open-world semi-supervised learning
- learning to detect objects from multi-agent lidar scans without manual labels | arXiv: 2503.08421
- learning to filter outlier edges in global sfm
- learning to highlight audio by watching movies | arXiv: 2505.12154
- learning to normalize on the spd manifold under bures-wasserstein geometry | arXiv: 2504.00660
- learning to sample effective and diverse prompts for text-to-image generation | arXiv: 2502.11477
- learning visual composition through improved semantic guidance | arXiv: 2412.15396
- learning visual generative priors without text | arXiv: 2412.07767
- learning with noisy triplet correspondence for composed image retrieval
- learning-enabled polynomial lyapunov function synthesis via high-accuracy counte
- lediff latent exposure diffusion for hdr generation | arXiv: 2412.14456
- lesionlocator zero-shot universal tumor segmentation and tracking in 3d whole-bo
- less attention is more prompt transformer for generalized category discovery
- less is more efficient image vectorization with adaptive parameterization
- less is more efficient model merging with binary task switch | arXiv: 2412.00054
- lessons and insights from a unifying study of parameter-efficient fine-tuning pe
- let humanoids hike integrative skill development on complex trails | arXiv: 2505.06218
- let samples speak mitigating spurious correlation by exploiting the clusterness | arXiv: 2512.22874
- lets chorus partner-aware hybrid song-driven 3d head animation
- lets verify and reinforce image generation step by step
- leveraging 3d geometric priors in 2d rotation symmetry detection | arXiv: 2503.20235
- leveraging perturbation robustness to enhance out-of-distribution detection | arXiv: 2503.18784
- leveraging sd map to augment hd map-based trajectory prediction
- leveraging temporal cues for semi-supervised multi-view 3d object detection
- levitor 3d trajectory oriented image-to-video synthesis | arXiv: 2412.15214
- libra-merging importance-redundancy and pruning-merging trade-off for accelerati
- libragrad balancing gradient flow for universally better vision transformer attr
- lidar-rt gaussian-based ray tracing for dynamic lidar re-simulation | arXiv: 2412.15199
- lidargait learning local features and size awareness from lidar point clouds for
- lifelong knowledge editing for vision language models with low-rank mixture-of-e
- lift3d policy lifting 2d foundation models for robust 3d robotic manipulation | arXiv: 2411.18623
- lifting motion to the 3d world via 2d diffusion | arXiv: 2411.18808
- lifting the veil on visual information flow in mllms unlocking pathways to faste
- light transport-aware diffusion posterior sampling for single-view reconstructio
- light3r-sfm towards feed-forward structure-from-motion | arXiv: 2501.14914
- lightloc learning outdoor lidar localization at light speed | arXiv: 2503.17814
- lim large interpolator model for dynamic reconstruction | arXiv: 2503.22537
- limoe mixture of lidar representation learners from automotive scenes | arXiv: 2501.04004
- linear attention modeling for learned image compression | arXiv: 2502.05741
- lineart a knowledge-guided training-free high-quality appearance transfer for de
- lingen towards high-resolution minute-length text-to-video generation with linea
- linguistics-aware masked image modeling for self-supervised scene text recogniti
- link to the past temporal propagation for fast 3d human reconstruction from mono
- link-based contrastive learning for one-shot unsupervised domain adaptation
- lion-fs fast slow video-language thinker as online video assistant | arXiv: 2503.03663
- lirm large inverse rendering model for progressive reconstruction of shape mater
- lisu a dataset and method for lidar surface normal estimation | arXiv: 2503.08601
- lita-gs illumination-agnostic novel view synthesis via reference-free 3d gaussia
- livecc learning video llm with streaming speech transcription at scale | arXiv: 2504.16030
- livos light video object segmentation with gated linear matching | arXiv: 2411.02818
- llava-critic learning to evaluate multimodal models | arXiv: 2410.02712
- llava-st a multimodal large language model for fine-grained spatial-temporal und
- llavidal a large language vision model for daily activities of living | arXiv: 2406.09390
- llm-driven multimodal and multi-identity listening head generation
- llmdet learning strong open-vocabulary object detectors under the supervision of
- lmo linear mamba operator for mri reconstruction
- locality-aware zero-shot human-object interaction detection | arXiv: 2505.19503
- localized concept erasure for text-to-image diffusion models using training-free
- localizing events in videos with multimodal queries | arXiv: 2406.10079
- locally orderless images for optimization in differentiable rendering | arXiv: 2503.21931
- locore image re-ranking with long-context sequence modeling | arXiv: 2503.21772
- lod-gs achieving levels of detail using scalable gaussian soup
- logiczsl exploring logic-induced representation for compositional zero-shot lear
- logits deconfusion with clip for few-shot learning | arXiv: 2504.12104
- logosp local-global grouping of superpoints for unsupervised semantic segmentati
- loki low-dimensional kan for efficient fine-tuning image models
- long video diffusion generation with segmented cross-attention and content-rich | arXiv: 2412.01316
- longdiff training-free long video generation in one go | arXiv: 2503.18150
- longvale vision-audio-language-event benchmark towards time-aware omni-modal per
- lookcloser frequency-aware radiance field for tiny-detail scene | arXiv: 2503.18513
- lookingglass generative anamorphoses via laplacian pyramid warping | arXiv: 2504.08902
- lora recycle unlocking tuning-free few-shot adaptability in visual foundation mo
- lora subtraction for drift-resistant space in exemplar-free continual learning | arXiv: 2503.18985
- loraclr contrastive adaptation for customization of diffusion models | arXiv: 2412.09622
- lorasculpt sculpting lora for harmonizing general and specialized knowledge in m
- lost in translation found in context sign language translation with contextual c | arXiv: 2501.09754
- lotus large-scale machine unlearning with a taste of uncertainty | arXiv: 2503.18314
- lotusfilter fast diverse nearest neighbor search via a learned cutoff table | arXiv: 2506.04790
- low-biased general annotated dataset generation | arXiv: 2412.10831
- low-rank adaptation in multilinear operator networks for security-preserving inc
- lp-diff towards improved restoration of real-world degraded license plate
- lposs label propagation over patches and pixels for open-vocabulary semantic seg
- lr-sgs robust lidar-reflectance-guided salient gaussian splatting for self-drivi | arXiv: 2603.12647
- lscenellm enhancing large 3d scene understanding using adaptive visual preferenc
- lsnet see large focus small | arXiv: 2503.23135
- lt3sd latent trees for 3d scene diffusion | arXiv: 2409.08215
- lucas layered universal codec avatars | arXiv: 2502.19739
- luminance-gs adapting 3d gaussian splatting to challenging lighting conditions w
- luminet latent intrinsics meets diffusion models for indoor scene relighting | arXiv: 2412.00177
- lux post facto learning portrait performance relighting with conditional video d
- lyapunov stable graph neural flow | arXiv: 2603.12557
- m-llm based video frame selection for efficient video understanding | arXiv: 2502.19680
- m2-occ resilient 3d semantic occupancy prediction for autonomous driving with in | arXiv: 2603.09737
- m3-vos multi-phase multi-transition and multi-scenery video object segmentation | arXiv: 2412.13803
- m3amba memory mamba is all you need for whole slide image classification
- m3gym a large-scale multimodal multi-view multi-person pose dataset for fitness
- mac-ego3d multi-agent gaussian consensus for real-time collaborative ego-motion
- mad memory-augmented detection of 3d objects
- madcow marginal distortion correction for wide-angle photography with arbitrary
- mage single image to material-aware 3d via the multi-view g-buffer estimation mo
- magic-slam multi-agent gaussian globally consistent slam | arXiv: 2411.16785
- magicarticulate make your 3d models articulation-ready | arXiv: 2502.12135
- magicquill an intelligent interactive image editing system | arXiv: 2411.09703
- magma a foundation model for multimodal ai agents | arXiv: 2502.13130
- maintaining consistent inter-class topology in continual test-time adaptation
- mair a locality- and continuity-preserving mamba for image restoration | arXiv: 2412.20066
- make it count text-to-image generation with an accurate number of objects | arXiv: 2406.10210
- make-it-animatable an efficient framework for authoring animation-ready 3d chara
- making old film great again degradation-aware state space model for old film res
- mamba as a bridge where vision foundation models meet vision language models for
- mamba-adaptor state space model adaptor for visual recognition | arXiv: 2505.12685
- mamba-reg vision mamba also needs registers
- mamba4d efficient 4d point cloud video understanding with disentangled spatial-t
- mambaic state space models for high-performance learned image compression | arXiv: 2503.12461
- mambairv2 attentive state space restoration | arXiv: 2411.15269
- mambaout do we really need mamba for vision | arXiv: 2405.07992
- mambavision a hybrid mamba-transformer vision backbone | arXiv: 2407.08083
- mambavlt time-evolving multimodal state space model for vision-language tracking | arXiv: 2411.15459
- mambavo deep visual odometry based on sequential matching refinement and trainin
- mammalps a multi-view video behavior monitoring dataset of wild mammals in the s | arXiv: 2503.18223
- manganinja line art colorization with precise reference following | arXiv: 2501.08332
- mani-gs gaussian splatting manipulation with triangular mesh | arXiv: 2405.17811
- maniptrans efficient dexterous bimanual manipulation transfer via residual learn | arXiv: 2503.21860
- manivideo generating hand-object manipulation video with dexterous and generaliz | arXiv: 2412.16212
- manta a large-scale multi-view and visual-text anomaly detection dataset for tin
- manta diffusion mamba for efficient and effective stochastic long-term dense act
- map unleashing hybrid mamba-transformer vision backbones potential with masked a | arXiv: 2410.00871
- mapgclr geospatial contrastive learning of representations for online vectorized | arXiv: 2603.10688
- mar-3d progressive masked auto-regressor for high-resolution 3d generation | arXiv: 2503.20519
- marble material recomposition and blending in clip-space | arXiv: 2506.05313
- mari material retrieval integration across domains | arXiv: 2503.08111
- markushgrapher joint visual and textual recognition of markush structures | arXiv: 2503.16096
- marten visual question answering with mask generation for multi-modal document u | arXiv: 2503.14140
- marvel-40m multi-level visual elaboration for high-fidelity text-to-3d content c | arXiv: 2411.17945
- mash-vlm mitigating action-scene hallucination in video-llms through disentangle
- mask-adapter the devil is in the masks for open-vocabulary segmentation | arXiv: 2412.04533
- mask2dit dual mask-based diffusion transformer for multi-scene long video genera
- masked point-entity contrast for open-vocabulary 3d scene understanding | arXiv: 2504.19500
- masked scene modeling narrowing the gap between supervised and self-supervised l
- maskgaussian adaptive 3d gaussian representation from probabilistic masks | arXiv: 2412.20522
- maskgwm a generalizable driving world model with video mask reconstruction | arXiv: 2502.11663
- masking meets supervision a strong learning alliance | arXiv: 2306.11339
- mass13k a matting-level semantic segmentation benchmark | arXiv: 2503.18364
- mast3r-slam real-time dense slam with 3d reconstruction priors | arXiv: 2412.12392
- mastering negation boosting grounding models via grouped opposition-based learni | arXiv: 2603.12606
- matanyone stable video matting with consistent memory propagation | arXiv: 2501.14677
- matcha gaussians atlas of charts for high-quality geometry and photorealism from | arXiv: 2412.06767
- matcha towards matching anything
- material anything generating materials for any 3d object via diffusion | arXiv: 2411.15138
- matrix-free shared intrinsics bundle adjustment
- matrix3d large photogrammetry model all-in-one | arXiv: 2502.07685
- mbq modality-balanced quantization for large vision-language models | arXiv: 2412.19509
- mc2 multi-concept guidance for customized multi-concept generation
- mccd multi-agent collaboration-based compositional diffusion for complex text-to
- mdp multidimensional vision model pruning with latency constraint | arXiv: 2504.02168
- meat multiview diffusion model for human generation on megapixels with mesh atte
- medunifier unifying vision-and-language pre-training on medical data with vision
- medusa a multi-scale high-order contrastive dual-diffusion approach for multi-vi
- meet towards memory-efficient temporal sparse deep neural networks
- mega hybrid mesh-gaussian head avatar for high-fidelity rendering and head editi
- mega masked generative autoencoder for human mesh recovery | arXiv: 2405.18839
- megasam accurate fast and robust structure and motion from casual dynamic videos | arXiv: 2412.04463
- megasynth scaling up 3d scene reconstruction with synthesized data | arXiv: 2412.14166
- memories of forgotten concepts | arXiv: 2412.00782
- merge multi-faceted hierarchical graph-based gnn for gene expression prediction
- mergevq a unified framework for visual generation and representation with disent
- mesc-3d mining effective semantic cues for 3d reconstruction from a single image
- mesh mamba a unified state space model for saliency prediction in non-textured a | arXiv: 2504.01466
- meshart generating articulated meshes with structure-guided transformers | arXiv: 2412.11596
- meshgen generating pbr textured mesh with render-enhanced auto-encoder and gener
- met3r measuring multi-view consistency in generated images | arXiv: 2501.06336
- meta-learning hyperparameters for parameter efficient fine-tuning | arXiv: 2603.01759
- metascenes towards automated replica creation for real-world 3d scans | arXiv: 2505.02388
- metashadow object-centered shadow detection removal and synthesis | arXiv: 2412.02635
- metaspectra a compact broadband metasurface camera for snapshot hyperspectral im | arXiv: 2603.09116
- metawriter personalized handwritten text recognition using meta-learned prompt t | arXiv: 2505.20513
- metricgrids arbitrary nonlinear approximation with elementary metric grids based
- mexd an expert-infused diffusion model for whole-slide image classification | arXiv: 2503.12401
- mfoghub bridging multi-regional and multi-satellite data for global marine fog d | arXiv: 2505.10281
- mg-motionllm a unified framework for motion comprehension and generation across | arXiv: 2504.02478
- mi-detr an object detection model with multi-time inquiries mechanism | arXiv: 2503.01463
- micas multi-grained in-context adaptive sampling for 3d point cloud processing | arXiv: 2411.16773
- microvqa a multimodal reasoning benchmark for microscopy-based scientific resear
- midi multi-instance diffusion for single image to 3d scene generation | arXiv: 2412.03558
- mil-pf multiple instance learning on precomputed features for mammography classi | arXiv: 2603.09374
- mimic in-context learning for multimodal tasks | arXiv: 2504.08851
- mimir improving video diffusion models for precise text understanding | arXiv: 2412.03085
- mimo a medical vision language model with visual referring multimodal input and | arXiv: 2510.10011
- mimo controllable character video synthesis with spatial decomposed modeling | arXiv: 2409.16160
- mind the gap confidence discrepancy can guide federated semi-supervised learning | arXiv: 2503.13227
- mind the gap detecting black-box adversarial attacks in the making through query | arXiv: 2503.02986
- mind the time temporally-controlled multi-event video generation | arXiv: 2412.05263
- mind the trojan horse image prompt adapter enabling scalable and deceptive jailb
- minding fuzzy regions a data-driven alternating learning paradigm for stable les
- minima modality invariant image matching | arXiv: 2412.19412
- minimal interaction separated tuning a new paradigm for visual adaptation
- minimizing labeled maximizing unlabeled an image-driven approach for video insta
- minority-focused text-to-image generation via prompt optimization | arXiv: 2410.07838
- mire matched implicit neural representations
- mirrorverse pushing diffusion models to realistically reflect the world | arXiv: 2504.15397
- missing target-relevant information prediction with world model for accurate zer
- mitigating ambiguities in 3d classification with gaussian splatting | arXiv: 2503.08352
- mitigating hallucinations in large vision-language models via dpo on-policy data
- mitigating memorization in text-to-image diffusion via region-aware prompt augme | arXiv: 2603.13070
- mitigating object hallucinations in large vision-language models with assembly o
- mitigating the human-robot domain discrepancy in visual pre-training for robotic | arXiv: 2406.14235
- mitracker multi-view integration for visual object tracking | arXiv: 2502.20111
- mixermdm learnable composition of human motion diffusion models | arXiv: 2504.01019
- mixture of submodules for domain adaptive person search
- mllm-as-a-judge for image safety without human labeling | arXiv: 2501.00192
- mlvu benchmarking multi-task long video understanding | arXiv: 2406.04264
- mm-condchain a programmatically verified benchmark for visually grounded deep co | arXiv: 2603.12266
- mm-or a large multimodal operating room dataset for semantic understanding of hi
- mmar towards lossless multi-modal auto-regressive probabilistic modeling | arXiv: 2410.10798
- mmaudio taming multimodal joint training for high-quality video-to-audio synthes
- mmrl multi-modal representation learning for vision-language models | arXiv: 2503.08497
- mmtl-uniad a unified framework for multimodal and multi-task learning in assisti
- mmvu measuring expert-level multi-discipline video understanding | arXiv: 2501.12380
- mne-slam multi-agent neural slam for mobile robots
- mobile-gs real-time gaussian splatting for mobile devices | arXiv: 2603.11531
- mobileh2r learning generalizable human to mobile robot handover exclusively from
- mobilemamba lightweight multi-receptive visual mamba network | arXiv: 2411.15941
- mobileportrait real-time one-shot neural head avatars on mobile devices | arXiv: 2407.05712
- moda motion-drift augmentation for inertial human motion analysis
- modec-gs global-to-local motion decomposition and temporal interval adjustment f
- model diagnosis and correction via linguistic and implicit attribute editing
- model poisoning attacks to federated learning via multi-round consistency | arXiv: 2404.15611
- modeling multiple normal action representations for error detection in procedura
- modeling thousands of human annotators for generalizable text-to-image person re | arXiv: 2503.09962
- modeseq taming sparse multimodal motion prediction with sequential mode modeling | arXiv: 2411.11911
- modfinity unsupervised domain adaptation with multimodal information flow intert
- moedit on learning quantity perception for multi-object image editing | arXiv: 2503.10112
- moee mixture of emotion experts for audio-driven portrait animation | arXiv: 2501.01808
- moflow one-step flow matching for human trajectory forecasting via implicit maxi
- moge unlocking accurate monocular geometry estimation for open-domain images wit
- mokus leveraging cross-modal knowledge transfer for knowledge-aware concept cust | arXiv: 2603.12743
- molmo and pixmo open weights and open data for state-of-the-art vision-language | arXiv: 2409.17146
- momanipvla transferring vision-language-action models for general mobile manipul | arXiv: 2503.13446
- mono-internvl pushing the boundaries of monolithic multimodal large language mod
- mono2stereo a benchmark and empirical study for stereo conversion | arXiv: 2503.22262
- mono3dvlt monocular-video-based 3d visual language tracking
- monocular and generalizable gaussian talking head animation | arXiv: 2504.00665
- monodgp monocular 3d object detection with decoupled-query and geometry-error pr
- monoinstance enhancing monocular priors via multi-view instance alignment for ne
- monoplace3d learning 3d-aware object placement for 3d monocular detection | arXiv: 2504.06801
- monosplat generalizable 3d gaussian splatting from monocular depth foundation mo
- monotakd teaching assistant knowledge distillation for monocular 3d object detec
- monster marry monodepth to stereo unleashes power
- morpheus text-driven 3d gaussian splat shape and color stylization | arXiv: 2503.02009
- mos modeling object-scene associations in generalized category discovery | arXiv: 2503.12035
- mos-attack a scalable multi-objective adversarial attack framework | arXiv: 2501.07251
- mosaic of modalities a comprehensive benchmark for multimodal graph learning | arXiv: 2406.16321
- mosaic3d foundation dataset and model for open-vocabulary 3d segmentation | arXiv: 2502.02548
- mosca dynamic gaussian fusion from casual videos via 4d motion scaffolds | arXiv: 2405.17421
- most efficient monarch sparse tuning for 3d representation learning | arXiv: 2503.18368
- motif making text count in image animation with motion focal loss | arXiv: 2412.16153
- motion modes what could happen next | arXiv: 2412.00148
- motion prompting controlling video generation with motion trajectories | arXiv: 2412.02700
- motion-grounded video reasoning understanding and perceiving motion at pixel lev
- motionanymesh physics-grounded articulation for simulation-ready digital twins | arXiv: 2603.12936
- motionbench benchmarking and improving fine-grained video motion understanding f
- motionmap representing multimodality in human pose forecasting | arXiv: 2412.18883
- motionpro a precise motion controller for image-to-video generation | arXiv: 2505.20287
- motionpro exploring the role of pressure in human mocap and beyond | arXiv: 2504.05046
- motions as queries one-stage multi-person holistic human motion capture
- motionstone decoupled motion intensity modulation with diffusion transformer for | arXiv: 2412.05848
- move-in-2d 2d-conditioned human motion generation | arXiv: 2412.13185
- move-kd knowledge distillation for vlms with mixture of visual encoders | arXiv: 2501.01709
- movie weaver tuning-free multi-concept video personalization with anchored promp
- moviebench a hierarchical movie level dataset for long video generation | arXiv: 2411.15262
- movis enhancing multi-object novel view synthesis for indoor scenes | arXiv: 2412.11457
- mp-gui modality perception with mllms for gui understanding | arXiv: 2503.14021
- mp-sfm monocular surface priors for robust structure-from-motion | arXiv: 2504.20040
- mpdrive improving spatial understanding with marker-based prompt learning for au
- mr detr instructive multi-route training for detection transformers | arXiv: 2412.10028
- mtadiffusion mask text alignment diffusion model for object inpainting | arXiv: 2506.23482
- multi-focal conditioned latent diffusion for person image synthesis | arXiv: 2503.15686
- multi-granularity class prototype topology distillation for class-incremental so
- multi-group proportional representations for text-to-image models | arXiv: 2505.24023
- multi-label prototype visual spatial search for weakly supervised semantic segme
- multi-layer visual feature fusion in multimodal llms methods analysis and best p | arXiv: 2503.06063
- multi-modal aerial-ground cross-view place recognition with neural odes
- multi-modal contrastive learning with negative sampling calibration for phenotyp
- multi-modal contrastive masked autoencoders a two-stage progressive pre-training | arXiv: 2408.02245
- multi-modal knowledge distillation-based human trajectory forecasting | arXiv: 2503.22201
- multi-modal medical diagnosis via large-small model collaboration
- multi-modal synergistic implicit image enhancement for efficient optical flow es
- multi-modal topology-embedded graph learning for spatially resolved genes predic
- multi-modal vision pre-training for medical image analysis | arXiv: 2410.10604
- multi-party collaborative attention control for image customization | arXiv: 2505.01428
- multi-resolution pathology-language pre-training model with text-guided visual r | arXiv: 2504.18856
- multi-scale neighborhood occupancy masked autoencoder for self-supervised learni
- multi-sensor object anomaly detection unifying appearance geometry and internal | arXiv: 2412.14592
- multi-subject open-set personalization in video generation | arXiv: 2501.06187
- multi-view pose-agnostic change localization with zero labels | arXiv: 2412.03911
- multi-view reconstruction via sfm-guided monocular depth estimation | arXiv: 2503.14483
- multigo towards multi-level geometry learning for monocular 3d textured human re
- multimodal autoregressive pre-training of large vision encoders | arXiv: 2411.14402
- multimodal classification of radiation-induced contrast enhancements and tumor r | arXiv: 2603.11827
- multimodal ocr parse anything from documents | arXiv: 2603.13032
- multimodal protein language models for enzyme kinetic parameters from substrate | arXiv: 2603.12845
- multimodalstudio a heterogeneous sensor dataset and framework for neural renderi
- multimorph on-demand atlas construction | arXiv: 2504.00247
- multiple object tracking as id prediction | arXiv: 2403.16848
- multirate neural image compression with adaptive lattice vector quantization
- multiscale structure-guided latent diffusion for multimodal mri translation | arXiv: 2603.12581
- multitwine multi-object compositing with text and layout control | arXiv: 2502.05165
- multivent 20 a massive multilingual benchmark for event-centric video retrieval
- must the first dataset and unified framework for multispectral uav single object | arXiv: 2503.17699
- must3r multi-view network for stereo 3d reconstruction | arXiv: 2503.01661
- mutri multi-view tri-alignment for oct to octa 3d image translation | arXiv: 2504.01428
- mv-dust3r single-stage scene reconstruction from sparse views in 2 seconds | arXiv: 2412.06974
- mv-math evaluating multimodal math reasoning in multi-visual contexts | arXiv: 2502.20808
- mv-ssm multi-view state space modeling for 3d human pose estimation | arXiv: 2509.00649
- mvboost boost 3d reconstruction with multi-view refinement | arXiv: 2411.17772
- mvdoppler-pose multi-modal multi-view mmwave sensing for long-distance self-occl
- mvgenmaster scaling multi-view generation from any image via 3d priors enhanced | arXiv: 2411.16157
- mvpaint synchronized multi-view diffusion for painting anything 3d | arXiv: 2411.02336
- mvportrait text-guided motion and emotion control for multi-view vivid portrait | arXiv: 2503.19383
- mvsanywhere zero-shot multi-view stereo | arXiv: 2503.22430
- mxnorm reusing mxfp block scales for efficient tensor normalisation | arXiv: 2603.13180
- nader neural architecture design via multi-agent collaboration | arXiv: 2412.19206
- narrating the video boosting text-video retrieval via comprehensive utilization
- navigating image restoration with vars distribution alignment prior | arXiv: 2412.21063
- navigating the unseen zero-shot scene graph generation via capsule-based equivar
- navigation world models | arXiv: 2412.03572
- nbavatar neural billboards avatars with realistic hand-face interaction | arXiv: 2603.12063
- nearly zero-cost protection against mimicry by personalized diffusion models | arXiv: 2412.11423
- neighborretr balancing hub centrality in cross-modal retrieval | arXiv: 2503.10526
- neisf neural incident stokes field for polarized inverse rendering of conductors | arXiv: 2411.10189
- nerfprior learning neural radiance field as a prior for indoor scene reconstruct | arXiv: 2503.18361
- nested diffusion models using hierarchical latent priors | arXiv: 2412.05984
- neural gate mitigating privacy risks in lvlms via neuron-level gradient gating | arXiv: 2603.12598
- neural hierarchical decomposition for single image plant modeling
- neural inverse rendering from propagating light | arXiv: 2506.05347
- neural lightrig unlocking accurate object normal and material estimation with mu
- neural motion simulator pushing the limit of world models in reinforcement learn | arXiv: 2504.07095
- neural video compression with context modulation | arXiv: 2505.14541
- neuro-3d towards 3d visual decoding from eeg signals | arXiv: 2411.12248
- neuro-symbolic evaluation of text-to-video models using formal verification | arXiv: 2411.16718
- neuron learning context-aware evolving representations for zero-shot skeleton ac
- nexusgs sparse view synthesis with epipolar depth priors in 3d gaussian splattin
- nightadapter learning a frequency adapter for generalizable night-time scene seg
- nitrofusion high-fidelity single-step diffusion through dynamic adversarial trai
- nlprompt noise-label prompt learning for vision-language models | arXiv: 2412.01256
- nn-former rethinking graph structure in neural architecture representation | arXiv: 2507.00880
- nnwnet rethinking the use of transformers in biomedical image segmentation and c
- no pains more gains recycling sub-salient patches for efficient high-resolution
- no thing nothing highlighting safety-critical classes for robust lidar semantic
- node-rf learning generalized continuous space-time scene dynamics with neural od | arXiv: 2603.12078
- noir neural operator mapping for implicit representations | arXiv: 2603.13118
- noise calibration and spatial-frequency interactive network for stem image enhan
- noise diffusion for enhancing semantic faithfulness in text-to-image synthesis | arXiv: 2411.16503
- noise modeling in one hour minimizing preparation efforts for self-supervised lo
- noise-consistent siamese-diffusion for medical image synthesis and segmentation | arXiv: 2505.06068
- noise-resistant video anomaly detection via rgb error-guided multiscale predicti
- noisectrl a sampling-algorithm-agnostic conditional generation method for diffus
- non-natural image understanding with advancing frequency-based vision encoders
- nonisotropic gaussian diffusion for realistic 3d human motion prediction | arXiv: 2501.06035
- nopain no-box point cloud attack via optimal transport singular boundary | arXiv: 2503.00063
- not all parameters matter masking diffusion models for enhancing generation abil | arXiv: 2505.03097
- not federated unlearning via weight negation | arXiv: 2503.05657
- not just text uncovering vision modality typographic threats in image generation | arXiv: 2412.05538
- not only text exploring compositionality of visual representations in vision-lan
- notes-guided mllm reasoning enhancing mllm with knowledge and visual notes for v
- novel architecture of rpa in oral cancer lesion detection | arXiv: 2603.10928
- novel view synthesis with pixel-space diffusion models | arXiv: 2411.07765
- nsd-imagery a benchmark dataset for extending fmri vision decoding methods to me
- ntclick achieving precise interactive segmentation with noise-tolerant clicks
- ntr-gaussian nighttime dynamic thermal reconstruction with 4d gaussian splatting
- nullu mitigating object hallucinations in large vision-language models via hallu
- number it temporal grounding videos like flipping manga | arXiv: 2411.10332
- nvcomposer boosting generative novel view synthesis with multiple sparse and unp
- nvila efficient frontier visual language models | arXiv: 2412.04468
- nyxus a next generation image feature extraction library for the big data and ai | arXiv: 2603.12016
- o-tpt orthogonality constraints for calibrating test-time prompt tuning in visio
- o3n omnidirectional open-vocabulary occupancy prediction | arXiv: 2603.12144
- object detection using event camera a moe heat conduction based detector and a n | arXiv: 2412.06647
- object-aware sound source localization via audio-visual scene understanding | arXiv: 2506.18557
- object-centric prompt-driven vision-language-action model for robotic manipulati
- object-shot enhanced grounding network for egocentric video | arXiv: 2505.04270
- objectmover generative object movement with video prior | arXiv: 2503.08037
- occlusion-aware text-image-point cloud pretraining for open-world 3d object reco
- occmamba semantic occupancy prediction with state space models | arXiv: 2408.09859
- ocrt boosting foundation models in the open world with object-concept-relation t | arXiv: 2503.18695
- octopus alleviating hallucination via dynamic contrastive decoding | arXiv: 2503.00361
- oda-gan orthogonal decoupling alignment gan assisted by weakly-supervised learni
- odd-one-out anomaly detection by comparing with neighbors | arXiv: 2406.20099
- ode open-set evaluation of hallucinations in multimodal large language models | arXiv: 2409.09318
- odhsr online dense 3d reconstruction of humans and scenes from monocular videos | arXiv: 2504.13167
- ofer occluded face expression reconstruction | arXiv: 2410.21629
- offsetopt explicit surface reconstruction without normals | arXiv: 2503.15763
- olympus a universal task router for computer vision tasks | arXiv: 2412.09612
- omni-id holistic identity representation designed for generative tasks | arXiv: 2412.09694
- omni-rgpt unifying image and video region-level understanding via token marks | arXiv: 2501.08326
- omni-scene omni-gaussian representation for ego-centric sparse-view scene recons
- omnia de egotempo benchmarking temporal understanding of multi-modal llms in ego
- omnidirectional multi-object tracking | arXiv: 2503.04565
- omnidocbench benchmarking diverse pdf document parsing with comprehensive annota
- omnidrive a holistic vision-language dataset for autonomous driving with counter
- omniflow any-to-any generation with multi-modal rectified flows | arXiv: 2412.01169
- omnigen unified image generation | arXiv: 2409.11340
- omniguard hybrid manipulation localization via augmented versatile deep image wa
- omnimanip towards general robotic manipulation via object-centric interaction pr
- omnimmi a comprehensive multi-modal interaction benchmark in streaming video con
- omnisplat taming feed-forward 3d gaussian splatting for omnidirectional images w
- omnistereo real-time omnidirectional depth estimation with multiview fisheye ca
- omnistyle filtering high quality style transfer data at scale | arXiv: 2505.14028
- on denoising walking videos for gait recognition | arXiv: 2505.18582
- on the consistency of video large language models in temporal comprehension | arXiv: 2411.12951
- on the generalization of handwritten text recognition models | arXiv: 2411.17332
- on the out-of-distribution generalization of large multimodal models | arXiv: 2402.06599
- on the possible detectability of image-in-image steganography | arXiv: 2603.11876
- on the zero-shot adversarial robustness of vision-language models a truly zero-s
- on-device self-supervised learning of low-latency monocular depth from only even
- once-tuning-multiple-variants tuning once and expanded as multiple vision-langua
- onda-pose occlusion-aware neural domain adaptation for self-supervised 6d object
- one diffusion to generate them all | arXiv: 2411.16318
- one is plenty a polymorphic feature interpreter for immutable heterogeneous coll
- one model for all low-level task interaction is a key to task-agnostic image fus
- one model many budgets elastic latent interfaces for diffusion transformers | arXiv: 2603.12245
- one token two fates a unified framework via vision token manipulation against ml | arXiv: 2603.10360
- one-for-more continual diffusion model for anomaly detection | arXiv: 2502.19848
- one-minute video generation with test-time training | arXiv: 2504.05298
- one-shot 3d object canonicalization based on geometric and semantic consistency
- one-step event-driven high-speed autofocus | arXiv: 2503.01214
- one-way ticket time-independent unified encoder for distilling text-to-image dif
- one2any one-reference 6d pose estimation for any object | arXiv: 2505.04109
- online task-free continual learning via dynamic expansionable memory distributio
- online video understanding ovbench and videochat-online | arXiv: 2501.00584
- onlineanyseg online zero-shot 3d segmentation by visual foundation model guided
- oodd test-time out-of-distribution detection with dynamic dictionary | arXiv: 2503.10468
- open ad-hoc categorization with contextualized feature learning | arXiv: 2512.16202
- open set label shift with test time out-of-distribution reference | arXiv: 2505.05868
- open-canopy towards very high resolution forest monitoring | arXiv: 2407.09392
- open-vocabulary functional 3d scene graphs for real-world indoor spaces | arXiv: 2503.19199
- open-world amodal appearance completion | arXiv: 2411.13019
- open-world objectness modeling unifies novel object detection
- openhumanvid a large-scale high-quality dataset for enhancing human-centric vide
- opening a comprehensive benchmark for judging open-ended interleaved image-text | arXiv: 2411.18499
- openmibood open medical imaging benchmarks for out-of-distribution detection | arXiv: 2503.16247
- opensdi spotting diffusion-generated images in the open world | arXiv: 2503.19653
- opportunistic single-photon time of flight
- optical leveraging optimal transport for contribution allocation in dataset dist
- optical-flow guided prompt optimization for coherent video generation | arXiv: 2411.15540
- opticalnet an optical imaging dataset and benchmark beyond the diffraction limit
- optimal transport-guided source-free adaptation for face anti-spoofing | arXiv: 2503.22984
- optimizing for the shortest path in denoising diffusion model | arXiv: 2503.03265
- optimus-2 multimodal minecraft agent with goal-observation-action conditioned po
- oralxrays-9 towards hospital-scale panoramic x-ray anomaly detection via persona
- order-one rolling shutter cameras | arXiv: 2403.11295
- order-robust class incremental learning graph-driven dynamic similarity grouping | arXiv: 2502.20032
- orida object-centric real-world image composition dataset | arXiv: 2506.08964
- osdface one-step diffusion model for face restoration | arXiv: 2411.17163
- osloprompt bridging low-supervision challenges and open-set domain generalizatio
- osmamba omnidirectional spectral mamba with dual-domain prior generator for expo
- osv one step is enough for high-quality image to video generation | arXiv: 2409.11367
- ouroboros3d image-to-3d generation via 3d-aware recursive diffusion | arXiv: 2406.03184
- out of sight out of mind evaluating state evolution in video world models | arXiv: 2603.13215
- overcoming shortcut problem in vlm for robust out-of-distribution detection
- overcoming visual clutter in vision language action models via concept-gated vis | arXiv: 2603.10340
- overlock an overview-first-look-closely-next convnet with context-mixing dynamic | arXiv: 2502.20087
- ovo-bench how far is your video-llms from real-world online video understanding | arXiv: 2501.05510
- ow-ovd unified open world and open vocabulary object detection
- p-slcr unsupervised point cloud semantic segmentation via prototypes structure l | arXiv: 2603.06321
- pact pruning and clustering-based token reduction for faster visual language mod
- paint by inpaint learning to add image objects by removing them first | arXiv: 2404.18212
- panda towards panoramic depth anything with unlabeled panoramas and mobius spati
- pano360 perspective to panoramic vision with geometric consistency | arXiv: 2603.12013
- panoaffordancenet towards holistic affordance grounding in 360 indoor environmen | arXiv: 2603.09760
- panogs gaussian-based panoptic segmentation for 3d open vocabulary scene underst
- panorama generation from nfov image done right | arXiv: 2503.18420
- panoramic multimodal semantic occupancy prediction for quadruped robots | arXiv: 2603.13108
- pansplat 4k panorama synthesis with feed-forward gaussian splatting | arXiv: 2412.12096
- lov3d grounding cognitive prognosis reasoning in longitudinal 3d bra | arXiv: 2603.12071
- parahome parameterizing everyday home activities towards 3d generative modeling
- parallel sequence modeling via generalized spatial propagation network | arXiv: 2501.12381
- parallelized autoregressive visual generation | arXiv: 2412.15119
- parameter efficient mamba tuning via projector-targeted diagonal-centric linear | arXiv: 2411.15224
- parameter-efficient fine-tuning in hyperspherical space for open-vocabulary sema
- parameterized blur kernel prior learning for local motion deblurring
- parametric point cloud completion for polygonal surface reconstruction | arXiv: 2503.08363
- parc a quantitative framework uncovering the symmetries within vision language m | arXiv: 2506.14808
- partgen part-level 3d generation and reconstruction with multi-view diffusion mo
- partrm modeling part-level dynamics with large cross-state reconstruction model | arXiv: 2503.19913
- passionsr post-training quantization with adaptive scale in one-step diffusion b
- patch matters training-free fine-grained image caption enhancement via local per
- patchdemux a certifiably robust framework for multi-label classifiers against ad
- patchdpo patch-level dpo for finetuning-free personalized image generation | arXiv: 2412.03177
- patchguard adversarially robust anomaly detection and localization through visio
- patchvsr breaking video diffusion resolution limits with patch-wise video super- | arXiv: 2509.26025
- pathways on the image manifold image editing via video generation | arXiv: 2411.16819
- patient-level anatomy meets scanning-level physics personalized federated low-do
- pattern analogies learning to perform programmatic image edits by analogy | arXiv: 2412.12463
- pave patching and adapting video large language models | arXiv: 2503.19794
- pay attention to the foreground in object-centric learning
- pbr-nerf inverse rendering with physics-based neural fields | arXiv: 2412.09680
- pcdreamer point cloud completion through multi-view diffusion priors | arXiv: 2411.19036
- pcm picard consistency model for fast parallel sampling of diffusion models | arXiv: 2503.19731
- pdfactor learning tri-perspective view policy diffusion field for multi-task rob
- peace empowering geologic map holistic understanding with mllms | arXiv: 2501.06184
- peer pressure model-to-model regularization for single source domain generalizat
- perceive what matters relevance-driven scheduling for multimodal streaming perce | arXiv: 2603.13176
- percept memory and imagine world feature simulating for open-domain unknown obje
- perception tokens enhance visual reasoning in multimodal language models | arXiv: 2412.03548
- perceptual inductive bias is what you need before contrastive learning | arXiv: 2506.01201
- perceptual video compression with neural wrapping
- perceptually accurate 3d talking head generation new definitions speech-mesh rep
- period-llm extending the periodic capability of multimodal large language model | arXiv: 2505.24476
- perla perceptive 3d language assistant | arXiv: 2411.19774
- perse personalized 3d generative avatars from a single portrait | arXiv: 2412.21206
- person de-reidentification a variation-guided identity shift modeling
- personabooth personalized text-to-motion generation | arXiv: 2503.07390
- personahoi effortlessly improving face personalization in human-object interacti
- personalized preference fine-tuning of diffusion models | arXiv: 2501.06655
- perturb-and-revise flexible 3d editing with generative trajectories | arXiv: 2412.05279
- pfedmxf personalized federated class-incremental learning with mixture of freque
- pgc physics-based gaussian cloth from a single pose | arXiv: 2503.20779
- phd a chatgpt-prompted visual hallucination evaluation dataset | arXiv: 2403.11116
- phgc procedural heterogeneous graph completion for natural language task verific
- phoenix a motion-based self-reflection framework for fine-grained robotic action | arXiv: 2504.14588
- phys-edit physics-aware semantic image editing with text description
- physanimator physics-guided generative cartoon animation | arXiv: 2501.16550
- physgen3d crafting a miniature interactive world from a single image | arXiv: 2503.20746
- physical plausibility-aware trajectory prediction via locomotion embodiment | arXiv: 2503.17267
- physicsgen can generative models learn from images to predict complex physical r | arXiv: 2503.05333
- physmodpo physically-plausible humanoid motion with preference optimization | arXiv: 2603.13228
- physvlm enabling visual language models to understand robotic physical reachabil
- phyt2v llm-guided iterative self-refinement for physics-grounded text-to-video g | arXiv: 2412.00596
- pi-hmr towards robust in-bed temporal human shape reconstruction with contact pr
- piad pose and illumination agnostic anomaly detection
- picd versatile perceptual image compression with diffusion rendering | arXiv: 2505.05853
- pico reconstructing 3d people in contact with objects | arXiv: 2504.17695
- picosam3 real-time in-sensor region-of-interest segmentation | arXiv: 2603.11917
- pidloc cross-view pose optimization network inspired by pid controllers | arXiv: 2503.02388
- pidsr complementary polarized image demosaicing and super-resolution | arXiv: 2504.07758
- pillarhist a quantization-aware pillar feature encoder based on height-aware his
- pioneering 4-bit fp quantization for diffusion models mixup-sign quantization an
- pippo high-resolution multi-view humans from a single image | arXiv: 2502.07785
- pixel-aligned rgb-nir stereo imaging and dataset for robot vision | arXiv: 2411.18025
- pixel-level and semantic-level adjustable super-resolution a dual-lora approach | arXiv: 2412.03017
- planarsplatting accurate planar surface reconstruction in 3 minutes | arXiv: 2412.03451
- playing the fool jailbreaking llms and multimodal llms with out-of-distribution | arXiv: 2503.20823
- pleas - merging models with permutations and least squares | arXiv: 2407.02447
- plug-and-play interpretable responsible text-to-image generation via dual-space
- plug-and-play ppo an adaptive point prompt optimizer making sam greater
- plug-and-play versatile compressed video enhancement | arXiv: 2504.15380
- pma towards parameter-efficient point cloud understanding via point mamba adapte | arXiv: 2505.20941
- po3ad predicting point offsets toward better 3d point cloud anomaly detection | arXiv: 2412.12617
- point cloud upsampling using conditional diffusion module with adaptive noise su
- point clouds meets physics dynamic acoustic field fitting network for point clou
- point-cache test-time dynamic and hierarchical cache for robust and generalizabl
- point-to-region loss for semi-supervised point-based crowd counting | arXiv: 2505.21943
- point2rbox-v2 rethinking point-supervised oriented object detection with spatial
- pointlora low-rank adaptation with token selection for point cloud learning | arXiv: 2504.16023
- pointsr self-regularized point supervision for drone-view object detection
- polarfree polarization-based reflection-free imaging | arXiv: 2503.18055
- polarized color screen matting
- polarnext rethink instance segmentation with polar representation
- polishing the sky wide-field and high-dynamic range interferometric image recons | arXiv: 2603.09162
- poly-autoregressive prediction for modeling interactions | arXiv: 2502.08646
- pomp physics-consistent motion generative model through phase manifolds
- pop-gs next best view in 3d-gaussian splatting with p-optimality | arXiv: 2503.07819
- popen preference-based optimization and ensemble for lvlm-based reasoning segmen
- population normalization for federated learning
- pos3r 6d pose estimation for unseen objects made easy
- pose priors from language models | arXiv: 2405.03689
- pose-guided temporal enhancement for robust low-resolution hand reconstruction
- posebh prototypical multi-dataset training beyond human pose estimation | arXiv: 2505.17475
- posetraj pose-aware trajectory control in video diffusion | arXiv: 2503.16068
- positive2negative breaking the information-lossy barrier in self-supervised sing
- post-pre-training for modality alignment in vision-language foundation models | arXiv: 2504.12717
- posta a go-to framework for customized artistic poster generation | arXiv: 2503.14908
- postermaker towards high-quality product poster generation with accurate text re
- postero structuring layout trees to enable language models in generalized conten
- pot prototypical optimal transport for weakly supervised semantic segmentation
- potential field based deep metric learning | arXiv: 2405.18560
- pow3r empowering unconstrained 3d reconstruction with camera and scene priors | arXiv: 2503.17316
- pqpp a joint benchmark for text-to-image prompt and query performance prediction | arXiv: 2406.04746
- practical solutions to the relative pose of three calibrated cameras | arXiv: 2303.16078
- prada projective radial distortion averaging | arXiv: 2504.16499
- precise event spotting in sports videos solving long-range dependency and class | arXiv: 2503.00147
- precise fast and low-cost concept erasure in value space orthogonal complement m | arXiv: 2412.06143
- precisecam precise camera control for text-to-image generation | arXiv: 2501.12910
- preconditioners for the stochastic training of neural fields | arXiv: 2402.08784
- preditor3d fast and precise 3d shape editing | arXiv: 2412.06592
- preserve or modify context-aware evaluation for balancing preservation and modif
- preserving clusters in prompt learning for unsupervised domain adaptation | arXiv: 2506.11493
- prior does matter visual navigation via denoising diffusion bridge models | arXiv: 2504.10041
- prior-free 3d object tracking
- proapo progressively automatic prompt optimization for visual classification | arXiv: 2502.19844
- probabilistic prompt distribution learning for animal pose estimation | arXiv: 2503.16120
- probability density geodesics in image diffusion latent space | arXiv: 2504.06675
- probesdf light field probes for neural surface reconstruction | arXiv: 2412.10084
- probing the mid-level vision capabilities of self-supervised learning | arXiv: 2411.17474
- probpose a probabilistic approach to 2d human pose estimation | arXiv: 2412.02254
- prof robot differentiable robot rendering without static and self-collisions | arXiv: 2503.11269
- progress-aware video frame captioning | arXiv: 2412.02071
- progressive correspondence regenerator for robust 3d registration | arXiv: 2502.02163
- progressive focused transformer for single image super-resolution | arXiv: 2503.20337
- progressive rendering distillation adapting stable diffusion for instant text-to
- prohoc probabilistic hierarchical out-of-distribution classification via multi-d
- projattacker a configurable physical adversarial attack for face recognition via
- project-probe-aggregate efficient fine-tuning for group robustness | arXiv: 2503.09487
- proker a kernel perspective on few-shot adaptation of large vision-language mode
- prometheus 3d-aware latent diffusion models for feed-forward text-to-3d scene ge
- prompt-cam making vision transformers interpretable for fine-grained analysis | arXiv: 2501.09333
- prompt-driven lightweight foundation model for instance segmentation-based fault | arXiv: 2603.12624
- prompt2perturb p2p text-guided diffusion-based adversarial attack on breast ultr
- prompthash affinity-prompted collaborative cross-modal learning for adaptive hash
- prompthmr promptable human mesh recovery | arXiv: 2504.06397
- prompting depth anything for 4k resolution accurate metric depth estimation | arXiv: 2412.14015
- proreflow progressive reflow with decomposed velocity | arXiv: 2503.04824
- prosody-enhanced acoustic pre-training and acoustic-disentangled prosody adaptin
- protecting your video content disrupting automated video-based llm annotations | arXiv: 2503.21824
- protodepth unsupervised continual depth completion with prototypes | arXiv: 2503.12745
- ProtoOcc: 3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation | arXiv: 2503.15185
- prototype-based image prompting for weakly supervised histopathological image se
- prototype-based knowledge guidance for fine-grained structured radiology reporti | arXiv: 2603.11938
- provoking multi-modal few-shot lvlm via exploration-exploitation in-context lear
- proximal algorithm unrolling flexible and efficient reconstruction networks for | arXiv: 2505.23180
- proxytransformation preshaping point cloud manifold with proxy attention for 3d | arXiv: 2502.19247
- ps-diffusion photorealistic subject-driven image editing with disentangled contr
- ps-eip robust photometric stereo based on event interval profile | arXiv: 2503.18341
- psa-ssl pose and size-aware self-supervised learning on lidar point clouds | arXiv: 2503.13914
- psbd prediction shift uncertainty unlocks backdoor detection | arXiv: 2406.05826
- pseudo visible feature fine-grained fusion for thermal object detection
- pshuman photorealistic single-image 3d human reconstruction using cross-scale mu
- ptdiffusion free lunch for generating optical illusion hidden pictures with phas
- pup 3d-gs principled uncertainty pruning for 3d gaussian splatting | arXiv: 2406.10219
- pura parameter update-recovery test-time adaption for rgb-t tracking
- pursuing temporal-consistent video virtual try-on via dynamic pose interaction | arXiv: 2505.16980
- pvc progressive visual token compression for unified image and video processing
- pytorchgeonodes enabling differentiable shape programs for 3d shape reconstructi
- q-bench-video benchmark the video quality understanding of lmms | arXiv: 2409.20063
- q-dit accurate post-training quantization for diffusion transformers | arXiv: 2406.17343
- q-eval-100k evaluating visual quality and alignment level for text-to-vision con
- q-part quasi-periodic adaptive regression with test-time training for pediatric
- qmambabsr burst image super-resolution with query state space model | arXiv: 2408.08665
- quad-pixel image defocus deblurring a new benchmark and model
- quaffure real-time quasi-static neural hair simulation | arXiv: 2412.10061
- quantization without tears | arXiv: 2411.13918
- quartdepth post-training quantization for real-time depth estimation on the edge | arXiv: 2503.16709
- qucoop a versatile framework for solving composite and binary-parametrised probl
- query efficient black-box visual prompting with subspace learning
- question-aware gaussian experts for audio-visual question answering | arXiv: 2503.04459
- r-score revisiting scene coordinate regression for robust large-scale visual loc
- r-tpt improving adversarial robustness of vision-language models through test-ti
- r2c mapping room to chessboard to unlock llm as low-level action planner
- racformer towards high-quality 3d object detection via query-based radar-camera | arXiv: 2412.12725
- rad region-aware diffusion models for image inpainting | arXiv: 2412.09191
- radio frequency ray tracing with neural object representation for enhanced rf mo
- radiov2.5 improved baselines for agglomerative vision foundation models
- raencoder a label-free reversible adversarial examples encoder for dataset intel
- rainygs efficient rain synthesis with physically-based gaussian splatting | arXiv: 2503.21442
- randar decoder-only autoregressive visual generation in random orders | arXiv: 2412.01827
- random conditioning for diffusion model compression with distillation | arXiv: 2504.02011
- range retrieval augmented neural fields for multi-resolution geo-embeddings | arXiv: 2502.19781
- rap retrieval-augmented personalization for multimodal large language models | arXiv: 2410.13360
- rashomon sets for prototypical-part networks editing interpretable models in rea
- rasp revisiting 3d anamorphic art for shadow-guided packing of irregular objects | arXiv: 2504.02465
- rass improving denoising diffusion samplers with reinforced active sampling sche
- rate-in information-driven adaptive dropout rates for improved inference-time un
- rayflow instance-aware diffusion acceleration via adaptive flow trajectories | arXiv: 2503.07699
- rc-autocalib an end-to-end radar-camera automatic calibration network | arXiv: 2505.22427
- rcp-bench benchmarking robustness for collaborative perception under diverse cor
- rdd robust feature detector and descriptor using deformable transformer | arXiv: 2505.08013
- rdnet region proportion-aware dynamic adaptive salient object detection network | arXiv: 2603.12215
- re-hold video hand object interaction reenactment via adaptive layout-instructed | arXiv: 2503.16942
- re-thinking temporal search for long-form video understanding | arXiv: 2504.02259
- real-iad d3 a real-world 2d/pseudo-3d/3d dataset for industrial anomaly detection
- real-time free-view human rendering from sparse-view rgb videos using double unp
- real-time high-fidelity gaussian human avatars with position-based interpolation
- realedit reddit edits as a large-scale empirical dataset for image transformatio
- realistic test-time adaptation of vision-language models | arXiv: 2501.03729
- reanimating images using neural representations of dynamic stimuli | arXiv: 2406.02659
- reason-before-retrieve one-stage reflective chain-of-thoughts for training-free
- reasongrounder lvlm-guided hierarchical feature splatting for open-vocabulary 3d
- reasoning in visual navigation of end-to-end trained agents a dynamical systems | arXiv: 2503.08306
- reasoning mamba hypergraph-guided region relation calculating for weakly supervi
- reasoning over video evaluating how mllms extract integrate and reconstruct spat | arXiv: 2603.13091
- reasoning to attend try to understand how seg token works | arXiv: 2412.17741
- recap better gaussian relighting with cross-environment captures | arXiv: 2412.07534
- recapture generative video camera controls for user-provided videos using masked | arXiv: 2411.05003
- recognition-synergistic scene text editing | arXiv: 2503.08387
- recon enhancing true correspondence discrimination through relation consistency
- reconciling stochastic and deterministic strategies for zero-shot image restorat
- recondreamer crafting world models for driving scene reconstruction via online r | arXiv: 2411.19548
- reconstructing animals and the wild | arXiv: 2411.18807
- reconstructing close human interaction with appearance and proxemics reasoning | arXiv: 2507.02565
- reconstructing humans with a biomechanically accurate skeleton | arXiv: 2503.21751
- reconstructing in-the-wild open-vocabulary human-object interactions | arXiv: 2503.15898
- reconstructing people places and cameras | arXiv: 2412.17806
- reconstruction vs generation taming optimization dilemma in latent diffusion mod
- recover and match open-vocabulary multi-label recognition through knowledge-cons
- recovering dynamic 3d sketches from videos | arXiv: 2503.20321
- rectification-specific supervision and constrained estimator for online stereo r
- rectified diffusion guidance for conditional generation | arXiv: 2410.18737
- recurrence-enhanced vision-and-language transformers for robust multimodal docum
- recurrent feature mining and keypoint mixup padding for category-agnostic pose e | arXiv: 2503.21140
- redefining creative in dictionary towards an enhanced semantic understanding of | arXiv: 2410.24160
- rediffdet rotation-equivariant diffusion model for oriented object detection
- reducing class-wise confusion for incremental learning with disentangled manifol
- ref-gs directional factorization for 2d gaussian splatting | arXiv: 2412.00905
- reference-based 3d-aware image editing with triplanes | arXiv: 2404.03632
- reference-free image quality assessment for virtual try-on via human feedback | arXiv: 2603.13057
- refpose leveraging reference geometric correspondences for accurate 6d pose esti
- regularizing inr with diffusion prior self-supervised 3d reconstruction of neutr | arXiv: 2603.10947
- reinforcing the weakest links modernizing siena with targeted deep learning inte | arXiv: 2603.12951
- relation-rich visual document generator for visual information extraction | arXiv: 2504.10659
- relation3d enhancing relation modeling for point cloud instance segmentation | arXiv: 2506.17891
- relationfield relate anything in radiance fields | arXiv: 2412.13652
- relative pose estimation through affine corrections of monocular depth priors | arXiv: 2501.05446
- reloc3r large-scale training of relative camera pose regression for generalizabl
- relocate a simple training-free baseline for visual query localization using reg
- remote photoplethysmography in real-world and extreme lighting scenarios | arXiv: 2503.11465
- removing reflections from raw photos | arXiv: 2404.14414
- reneg learning negative embedding with reward guidance | arXiv: 2412.19637
- reno real-time neural compression for 3d lidar point clouds | arXiv: 2503.12382
- reperformer immersive human-centric volumetric videos from playback to photoreal | arXiv: 2503.12242
- representation learning for spatiotemporal physical systems | arXiv: 2603.13227
- reproducible vision-language models meet concepts out of pre-training
- repurposing pre-trained video diffusion models for event-based video interpolati
- repurposing stable diffusion attention for training-free unsupervised interactiv
- reraw rgb-to-raw image reconstruction via stratified sampling for efficient obje
- resclip residual attention for training-free dense vision-language inference | arXiv: 2411.15851
- residual sodap residual self-organizing domain-adaptive prompting with structura | arXiv: 2603.12816
- resilient sensor fusion under adverse sensor failures via multi-modal expert fus
- respec relevance and specificity grounded online filtering for learning on video
- restorgs depth-aware gaussian splatting for efficient 3d scene restoration
- retaining knowledge and enhancing long-text representations in clip through dual
- rethinking correspondence-based category-level object pose estimation
- rethinking decoder design improving biomarker segmentation using depth-to-space
- rethinking diffusion for text-driven human motion generation redundant represent
- rethinking end-to-end 2d to 3d scene segmentation in gaussian splatting | arXiv: 2503.14029
- rethinking epistemic and aleatoric uncertainty for active open-set annotation an | arXiv: 2502.19691
- rethinking few-shot adaptation of vision-language models in two stages | arXiv: 2503.11609
- rethinking lanes and points in complex scenarios for monocular 3d lane detection | arXiv: 2503.06237
- rethinking noisy video-text retrieval via relation-aware alignment
- rethinking personalized aesthetics assessment employing physique aesthetics asse
- rethinking query-based transformer for continual image segmentation | arXiv: 2507.07831
- rethinking reconstruction and denoising in the dark new perspective general arch
- rethinking spiking self-attention mechanism implementing a-xnor similarity calcu
- rethinking temporal fusion with a unified gradient descent view for 3d semantic | arXiv: 2504.12959
- rethinking the adversarial robustness of multi-exit neural networks in an attack
- rethinking token reduction with parameter-efficient fine-tuning in vit for pixel
- rethinking training for de-biasing text-to-image generation unlocking the potent
- rethinking vision-language model in face forensics multi-modal interpretable for | arXiv: 2503.20188
- rethinking vlms for image forgery detection and localization | arXiv: 2603.12930
- retrieving semantics from the deep an rag solution for gesture synthesis | arXiv: 2412.06786
- revealing key details to see differences a novel prototypical perspective for sk
- reversible decoupling network for single image reflection removal | arXiv: 2410.08063
- reversing flow for image restoration | arXiv: 2506.16961
- revisionllm recursive vision-language model for temporal grounding in hour-long | arXiv: 2411.14901
- revisiting audio-visual segmentation with vision-centric transformer | arXiv: 2506.23623
- revisiting backdoor attacks against large vision-language models from domain shi
- revisiting fairness in multitask learning a performance-driven approach for vari
- revisiting generative replay for class incremental object detection
- revisiting mae pre-training for 3d medical image segmentation | arXiv: 2410.23132
- revisiting model stitching in the foundation model era | arXiv: 2603.12433
- revisiting source-free domain adaptation insights into representativeness genera
- reward fine-tuning two-step diffusion models via learning differentiable latent- | arXiv: 2411.15247
- rewind real-time egocentric whole-body motion diffusion with exemplar-based iden
- rewind understanding long videos with instructed learnable memory | arXiv: 2411.15556
- rewis3d reconstruction improves weakly-supervised semantic segmentation | arXiv: 2603.06374
- rgbavatar reduced gaussian blendshapes for online modeling of head avatars | arXiv: 2503.12886
- riccardo radar hit prediction and convolution for camera-radar 3d object detecti
- riggs rigging of 3d gaussians for modeling articulated objects in videos | arXiv: 2503.16822
- ripvis rip currents video instance segmentation benchmark for beach monitoring a | arXiv: 2504.01128
- rivuletmlp an mlp-based architecture for efficient compressed video quality enha
- rl-rc-dot a block-level rl agent for task-aware video compression | arXiv: 2501.12216
- rlaif-v open-source ai feedback leads to super gpt-4v trustworthiness | arXiv: 2405.17220
- rng relightable neural gaussians | arXiv: 2409.19702
- roadsocial a diverse videoqa dataset and benchmark for road event understanding
- robobrain a unified brain model for robotic manipulation from abstract to concre
- roboground robotic manipulation with grounded vision-language priors | arXiv: 2504.21530
- robopepp vision-based robot pose and joint angle estimation through embedding pr
- robosense large-scale dataset and benchmark for egocentric robot perception and
- robospatial teaching spatial understanding to 2d and 3d vision-language models f | arXiv: 2411.16537
- robotic visual instruction | arXiv: 2505.00693
- robotwin dual-arm robot benchmark with generative digital twins | arXiv: 2504.13059
- robsense a robust multi-modal foundation model for remote sensing with static te
- robust 3d shape reconstruction in zero-shot from a single image in the wild | arXiv: 2403.14539
- robust audio-visual segmentation via audio-guided visual convergent alignment | arXiv: 2503.12847
- robust message embedding via attention flow-based steganography | arXiv: 2405.16414
- robust multi-object 4d generation for in-the-wild videos
- robust multimodal survival prediction with conditional latent differentiation va
- robust-mvton learning cross-pose feature alignment and fusion for robust multi-v
- rocket-1 mastering open-world interaction with visual-temporal context prompting | arXiv: 2410.17856
- rod-mllm towards more reliable object detection in multimodal large language mod
- rogsplat learning robust generalizable human gaussian splatting from sparse mult
- roictrl boosting instance control for visual generation | arXiv: 2411.17949
- roll robust noisy pseudo-label learning for multi-view clustering with noisy cor
- rooftop wind field reconstruction using sparse sensors from deterministic to gen | arXiv: 2603.13077
- roompainter view-integrated diffusion for consistent indoor scene texturing | arXiv: 2412.16778
- roomtour3d geometry-aware video-instruction tuning for embodied navigation | arXiv: 2412.08591
- rorem training a robust object remover with human-in-the-loop | arXiv: 2501.00740
- ros-sam high-quality interactive segmentation for remote sensing moving object | arXiv: 2503.12006
- rotation-equivariant self-supervised method in image denoising | arXiv: 2505.19618
- rsar restricted state angle resolver and rotated sar benchmark | arXiv: 2501.04440
- rsonet region-guided selective optimization network for rgb-t salient object det | arXiv: 2603.12685
- rubik a structured benchmark for image matching across geometric challenges | arXiv: 2502.19955
- s2d-lfe sparse-to-dense light field event generation
- s2gaussian sparse-view super-resolution 3d gaussian splatting | arXiv: 2503.04314
- s3-face sss-compliant facial reflectance estimation via diffusion priors
- s4-driver scalable self-supervised driving multimodal large language model with
- sacb-net spatial-awareness convolutions for medical image registration | arXiv: 2503.19592
- saist segment any infrared small target model guided by contrastive language-ima
- salad skeleton-aware latent diffusion for text-driven motion generation and edit | arXiv: 2503.13836
- salient frequency-aware paired diffusion for controllable long-tail ct detection | arXiv: 2602.23447
- saliuitl ensemble salience guided recovery of adversarial patches against cnns
- salova segment-augmented long video assistant for targeted retrieval and routing
- sam-i2v upgrading sam to support promptable video segmentation with less than 0.2
- sam-ref introducing image-prompt synergy during interaction for detail enhanceme
- sam2-love segment anything model 2 in language-aided audio-visual scenes | arXiv: 2506.01558
- sam2object consolidating view consistency via sam2 for zero-shot 3d instance seg
- samam style-aware state space model for arbitrary image style transfer | arXiv: 2503.15934
- samba a unified mamba-based framework for general salient object detection
- samble shape-specific point cloud sampling for an optimal trade-off between loca
- sample- and parameter-efficient auto-regressive image models | arXiv: 2411.15648
- sampling innovation-based adaptive compressive sensing | arXiv: 2503.13241
- samwise infusing wisdom in sam2 for text-driven video segmentation | arXiv: 2411.17646
- sap segment any 4k panorama | arXiv: 2603.12759
- sapave towards active perception and manipulation in vision-language-action mode | arXiv: 2603.12193
- sapiensid foundation for human recognition | arXiv: 2504.04708
- sar3d autoregressive 3d object generation and understanding via multi-scale 3d v | arXiv: 2411.16856
- sasep saliency-aware structured separation of geometry and feature for open set
- sat-hmr real-time multi-person 3d mesh estimation via scale-adaptive tokens | arXiv: 2411.19824
- sata spatial autocorrelation token analysis for enhancing the robustness of visi
- satellite observations guided diffusion model for accurate meteorological states
- satellite to groundscape - large-scale consistent ground view generation from sa
- saw toward a surgical action world model via controllable and scalable video gen | arXiv: 2603.13024
- scalable autoregressive monocular depth estimation | arXiv: 2411.11361
- scalable video-to-dataset generation for cross-platform mobile agents | arXiv: 2505.12632
- scale efficient training for large datasets | arXiv: 2503.13385
- scalelsd scalable deep line segment detection streamlined | arXiv: 2506.09369
- scaling down text encoders of text-to-image diffusion models | arXiv: 2503.19897
- scaling inference time compute for diffusion models
- scaling mesh generation via compressive tokenization | arXiv: 2411.07025
- scaling properties of diffusion models for perceptual tasks | arXiv: 2411.08034
- scaling up image segmentation across data and tasks
- scaling vision pre-training to 4k resolution | arXiv: 2503.19903
- scamo exploring the scaling law in autoregressive motion generation model | arXiv: 2412.14559
- scap transductive test-time adaptation via supportive clique-based attribute pro
- scenario dreamer vectorized latent diffusion for generating driving simulation e | arXiv: 2503.22496
- scene map-based prompt tuning for navigation instruction generation
- scene splatter momentum 3d scene generation from single image with video diffusi
- scene-agnostic pose regression for visual localization | arXiv: 2503.19543
- scene-centric unsupervised panoptic segmentation | arXiv: 2504.01955
- scene4u hierarchical layered 3d scene reconstruction from single panoramic image
- sceneassistant a visual feedback agent for open-vocabulary 3d scene generation | arXiv: 2603.12238
- scenecrafter controllable multi-view driving scene editing | arXiv: 2506.19488
- scenediffuser city-scale traffic simulation via a generative world model | arXiv: 2506.21976
- scenefactor factored latent 3d diffusion for controllable 3d scene generation | arXiv: 2412.01801
- scenetap scene-coherent typographic adversarial planner against vision-language
- scflow2 plug-and-play object pose refiner with shape-constraint scene flow | arXiv: 2504.09160
- schedule on the fly diffusion time prediction for faster and better image genera
- science-t2i addressing scientific illusions in image synthesis | arXiv: 2504.13129
- scope scene-contextualized incremental few-shot 3d segmentation | arXiv: 2603.06572
- scope semantic coreset with orthogonal projection embeddings for federated learn | arXiv: 2603.12976
- scribblelight single image indoor relighting with scribbles | arXiv: 2411.17696
- scsa a plug-and-play semantic continuous-sparse attention for arbitrary semantic | arXiv: 2503.04119
- scsegamba lightweight structure-aware vision mamba for crack segmentation in str
- sdbf steep-decision-boundary fingerprinting for hard-label tampering detection o
- sdf-net structure-aware disentangled feature learning for optical-sar ship re-i | arXiv: 2603.12588
- sdgocc semantic and depth-guided birds-eye view transformation for 3d multimodal | arXiv: 2507.17083
- sea-ing in low-light
- seal semantic attention learning for long video representation | arXiv: 2412.01798
- sealion semantic part-aware latent point diffusion models for 3d generation | arXiv: 2505.17721
- search and detect training-free long tail object detection via web-image retriev | arXiv: 2409.18733
- sec-prompt semantic complementary prompting for few-shot class-incremental learni
- secap self-calibrating and adaptive prompts for cross-view person re-identificat
- secret lies in color enhancing ai-generated images detection with color distribu
- see further when clear curriculum consistency model | arXiv: 2412.06295
- seedvr seeding infinity in diffusion transformer towards generic video restorati
- seeground see and ground for zero-shot open-vocabulary 3d visual grounding | arXiv: 2412.04383
- seeing a 3d world in a grain of sand | arXiv: 2503.00260
- seeing far and clearly mitigating hallucinations in mllms with attention causal | arXiv: 2505.16652
- seeing is not believing adversarial natural object optimization for hard-label 3
- seeing more with less human-like representations in vision models
- seeing speech and sound distinguishing and locating audio sources in visual scen
- seeing the abstract translating the abstract language for vision language models | arXiv: 2505.03242
- seeing what matters empowering clip with patch generation-to-selection | arXiv: 2503.17080
- seek common ground while reserving differences semi-supervised image-text sentim
- seeking consistent flat minima for better domain generalization via refining los
- seen-da semantic entropy guided domain-aware attention for domain adaptive objec
- segagent exploring pixel understanding capabilities in mllms by imitating human | arXiv: 2503.08625
- segearth-ov towards training-free open-vocabulary segmentation for remote sensin
- segman omni-scale context modeling with state space models and local attention f
- segment any motion in videos | arXiv: 2503.22268
- segment any-quality images with generative latent space enhancement | arXiv: 2503.12507
- segment anything even occluded | arXiv: 2503.06261
- segment this thing foveated tokenization for efficient point-prompted segmentati
- segmenting maxillofacial structures in cbct volumes
- self-cross diffusion guidance for text-to-image synthesis of similar subjects | arXiv: 2411.18936
- self-evolving visual concept library using vision-language critics | arXiv: 2504.00185
- self-expansion of pre-trained models with mixture of adapters for continual lear
- self-learning hyperspectral and multispectral image fusion via adaptive residual
- self-supervised controlnet with spatio-temporal mamba for real-world video super | arXiv: 2506.01037
- self-supervised cross-view correspondence with predictive cycle consistency
- self-supervised large scale point cloud completion for archaeological site resto
- self-supervised learning for color spike camera reconstruction
- self-supervised spatial correspondence across modalities | arXiv: 2506.03148
- selfsplat pose-free and 3d prior-free generalizable 3d gaussian splatting | arXiv: 2411.17190
- semalign3d semantic correspondence between rgb-images through aligning 3d object | arXiv: 2503.22462
- semantic and expressive variations in image captions across languages | arXiv: 2310.14356
- semantic and sequential alignment for referring video object segmentation
- semantic class distribution learning for debiasing semi-supervised medical image | arXiv: 2603.05202
- semantic library adaptation lora retrieval and fusion for open-vocabulary semant | arXiv: 2503.21780
- semantic satellite communications for synchronized audiovisual reconstruction | arXiv: 2603.10791
- semantic-guided cross-modal prompt learning for skeleton-based zero-shot action
- semanticdraw towards real-time interactive content creation from image diffusion | arXiv: 2403.09055
- semgeomo dynamic contextual human motion generation with semantic and geometric | arXiv: 2503.01291
- semi-supervised state-space model with dynamic stacking filter for real-world vi
- semidavil semi-supervised domain adaptation with vision-language guidance for se
- semiets integrating spatial and content consistencies for semi-supervised end-to
- semitooth a generalizable semi-supervised framework for multi-source tooth segme | arXiv: 2603.11616
- sensitivity-aware efficient fine-tuning via compact dynamic-rank adaptation
- separation of powers on segregating knowledge from observation in llm-enabled kn
- seq2time sequential knowledge transfer for video llm temporal grounding | arXiv: 2411.16932
- seqafford sequential 3d affordance reasoning via multimodal large language model | arXiv: 2412.01550
- seqmvrl a sequential fusion framework for multi-view representation learning
- serialgen personalized image generation by first standardization then personaliz
- seriesbench a benchmark for narrative-driven drama series understanding | arXiv: 2504.21435
- set spectral enhancement for tiny object detection
- seurat from moving points to depth | arXiv: 2504.14687
- sf2t self-supervised fragment finetuning of video-llms for fine-grained understa
- sf3d stable fast 3d mesh reconstruction with uv-unwrapping and illumination dise
- sfdm robust decomposition of geometry and reflectance for realistic face renderi
- sfm-free 3d gaussian splatting via hierarchical training | arXiv: 2412.01553
- sgc-net stratified granular comparison network for open-vocabulary hoi detection | arXiv: 2503.00414
- sgcr spherical gaussians for efficient 3d curve reconstruction | arXiv: 2505.04668
- sgformer satellite-ground fusion for 3d semantic scene completion | arXiv: 2503.16825
- sgma semantic-guided modality-aware segmentation for remote sensing with incompl | arXiv: 2603.02505
- sgmatch semantic-guided non-rigid shape matching with flow regularization | arXiv: 2603.12937
- sgsst scaling gaussian splatting style transfer
- shading meets motion self-supervised indoor 3d reconstruction via simultaneous s
- shadow generation using diffusion model with geometry prior
- shape abstraction via marching differentiable support functions
- shape and texture what influences reliable optical flow estimation
- shape my moves text-driven shape-aware synthesis of human motions | arXiv: 2504.03639
- shapeshifter 3d variations using multiscale and sparse point-voxel diffusion | arXiv: 2502.02187
- shapewords guiding text-to-image synthesis with 3d shape-aware prompts | arXiv: 2412.02912
- sharp-it a multi-view to multi-view diffusion model for 3d synthesis and manipul | arXiv: 2412.02631
- sharpdepth sharpening metric depth predictions using diffusion distillation | arXiv: 2411.18229
- shift the lens environment-aware unsupervised camouflaged object detection
- shiftwiseconv small convolutional kernel with large kernel effect | arXiv: 2401.12736
- shining yourself high-fidelity ornaments virtual try-on with diffusion model | arXiv: 2503.16065
- shotadapter text-to-multi-shot video generation with diffusion models | arXiv: 2505.07652
- show and segment universal medical image segmentation via in-context learning | arXiv: 2503.19359
- show and tell visually explainable deep neural nets via spatially-aware concept | arXiv: 2502.20134
- show dont tell detecting novel objects by watching human videos | arXiv: 2603.12751
- showhowto generating scene-conditioned step-by-step visual instructions | arXiv: 2412.01987
- showmak3r compositional tv show reconstruction | arXiv: 2504.19584
- showui one vision-language-action model for gui visual agent | arXiv: 2411.17465
- shrec a spectral embedding-based approach for ab-initio reconstruction of helica | arXiv: 2603.12307
- sida social media image deepfake detection localization and explanation with lar
- silence is golden leveraging adversarial examples to nullify audio control in ld
- silent branding attack trigger-free data poisoning attack on text-to-image diffu
- silmm self-improving large multimodal models for compositional text-to-image gen
- sim-to-real causal transfer a metric learning approach to causally-aware interac
- simavatar simulation-ready avatars with layered hair and clothing | arXiv: 2412.09545
- similarity-guided layer-adaptive vision transformer for uav tracking | arXiv: 2503.06625
- simlingo vision-only closed-loop autonomous driving with language-action alignme
- simltd simple supervised and semi-supervised long-tailed object detection | arXiv: 2412.20047
- simmotionedit text-based human motion editing with motion similarity prediction | arXiv: 2503.18211
- simpler diffusion 1.5 fid on imagenet512 with pixel-space diffusion
- simplification is all you need against out-of-distribution overconfidence
- simulator hc regression-based online simulation of starting problem-solution pai
- simvs simulating world inconsistencies for robust view synthesis | arXiv: 2412.07696
- single domain generalization for few-shot counting via universal representation | arXiv: 2505.16778
- single pixel image classification using an ultrafast digital light projector | arXiv: 2603.12036
- sings animatable single-image human gaussian splats with kinematic priors
- sinr sparsity driven compressed implicit neural representations | arXiv: 2503.19576
- sir-diff sparse image sets restoration with multi-view diffusion model | arXiv: 2503.14463
- six-cd benchmarking concept removals for text-to-image diffusion models | arXiv: 2406.14855
- skdream controllable multi-view and 3d generation with arbitrary skeletons
- ske-layout spatial knowledge enhanced layout generation with llms
- sketch down the flops towards efficient networks for human sketch | arXiv: 2505.23763
- sketchagent language-driven sequential sketch generation | arXiv: 2411.17673
- sketchfusion learning universal sketch features through fusing foundation models | arXiv: 2503.14129
- sketchtopia a dataset and foundational agents for benchmarking asynchronous mult
- sketchvideo sketch-based video generation and editing | arXiv: 2503.23284
- sketchy bounding-box supervision for 3d instance segmentation | arXiv: 2505.16399
- skillmimic learning basketball interaction skills from demonstrations | arXiv: 2408.15270
- skip tuning pre-trained vision-language models are effective and efficient adapt | arXiv: 2412.11509
- skysense-o towards open-world remote sensing interpretation with vision-centric
- slade shielding against dual exploits in large vision-language models
- slam3r real-time dense scene reconstruction from monocular rgb videos | arXiv: 2412.09401
- sldprtnet a large-scale multimodal dataset for cad generation in language-driven | arXiv: 2603.13098
- sleepermark towards robust watermark against fine-tuning text-to-image diffusion | arXiv: 2412.04852
- slidechat a large vision-language assistant for whole-slide pathology image unde
- slvr super-light visual reconstruction via blueprint controllable convolutions a
- small target detection based on mask-enhanced attention fusion of visible and in | arXiv: 2603.06925
- smartclip modular vision-language alignment with identification guarantees | arXiv: 2507.22264
- smarteraser remove anything from images using masked-region guidance | arXiv: 2501.08279
- smile infusing spatial and motion semantics in masked video learning | arXiv: 2504.00527
- smtpd a new benchmark for temporal prediction of social media popularity | arXiv: 2503.04446
- snapgen taming high-resolution text-to-image models for mobile devices with effi
- snapgen-v generating a five-second video within five seconds on a mobile device | arXiv: 2412.10494
- snowmaster comprehensive real-world image desnowing via mllm with multi-model fe
- soap vision-centric 3d semantic scene completion with scene-adaptive decoder and
- socialgesture delving into multi-person gesture understanding | arXiv: 2504.02244
- socialmoif multi-order intention fusion for pedestrian trajectory prediction | arXiv: 2504.15616
- soft self-labeling and potts relaxations for weakly-supervised segmentation | arXiv: 2507.01721
- softshadow leveraging soft masks for penumbra-aware shadow removal | arXiv: 2409.07041
- softvq-vae efficient 1-dimensional continuous tokenizer | arXiv: 2412.10958
- sogs second-order anchor for advanced 3d gaussian splatting | arXiv: 2503.07476
- solami social vision-language-action modeling for immersive interaction with 3d | arXiv: 2412.00174
- solve synergy of language-vision and end-to-end networks for autonomous driving | arXiv: 2505.16805
- solving instance detection from an open-world perspective | arXiv: 2503.00359
- soma singular value decomposed minor components adaptation for domain generaliza
- sonata self-supervised learning of reliable point representations | arXiv: 2503.16429
- sonic shifting focus to global audio perception in portrait animation | arXiv: 2411.16331
- sortscrews a dataset and baseline for real-time screw classification | arXiv: 2603.13027
- sound bridge associating egocentric and exocentric videos via audio cues
- soundvista novel-view ambient sound synthesis via visual-acoustic binding | arXiv: 2504.05576
- sp3d boosting sparsely-supervised 3d object detection via accurate cross-modal s | arXiv: 2503.06467
- spa-vl a comprehensive safety preference alignment dataset for vision language m | arXiv: 2406.12030
- spar3d stable point-aware reconstruction of 3d objects from single images | arXiv: 2501.04689
- sparc score prompting and adaptive fusion for zero-shot multi-label recognition
- sparrow learning spatial precision and temporal referential consistency in pixel | arXiv: 2603.12382
- spars3r semantic prior alignment and regularization for sparse 3d reconstruction | arXiv: 2411.12592
- sparse point cloud patches rendering via splitting 2d gaussians | arXiv: 2505.09413
- sparse voxels rasterization real-time high-fidelity radiance field rendering | arXiv: 2412.04459
- sparse2dgs geometry-prioritized gaussian splatting for surface reconstruction fr
- sparsealign a fully sparse framework for cooperative object detection | arXiv: 2503.12982
- spatial reasoning is not a free lunch a controlled study on llava | arXiv: 2603.12545
- spatial transport optimization by repositioning attention map for training-free | arXiv: 2503.22168
- spatial-temporal graph diffusion policy with kinematic modeling for bimanual rob
- spatial-ttt streaming visual-based spatial intelligence with test-time training | arXiv: 2603.12255
- spatial457 a diagnostic benchmark for 6d spatial reasoning of large multimodal mo
- spatialclip learning 3d-aware image representations from spatially discriminativ
- spatialdreamer self-supervised stereo video synthesis from monocular input | arXiv: 2411.11934
- spatialllm a compound 3d-informed design towards spatially-intelligent large mul
- spatio-semantic expert routing architecture with mixture-of-experts for referrin | arXiv: 2603.12538
- spatiotemporal decoupling for efficient vision-based occupancy forecasting | arXiv: 2411.14169
- spatiotemporal skip guidance for enhanced video diffusion sampling | arXiv: 2411.18664
- spc-gs gaussian splatting with semantic-prompt consistency for indoor open-world
- spectral defense against resource-targeting attack in 3d gaussian splatting | arXiv: 2603.12796
- spectral informed mamba for robust point cloud processing | arXiv: 2503.04953
- spectral state space model for rotation-invariant visual representation learning | arXiv: 2503.06369
- spectral-geometric neural fields for pose-free lidar view synthesis | arXiv: 2603.12903
- spectre-gs modeling highly specular surfaces with reflected nearby objects by tr
- spectromotion dynamic 3d reconstruction of specular scenes | arXiv: 2410.17249
- speedy-splat fast 3d gaussian splatting with sparse pixels and sparse primitives | arXiv: 2412.00578
- sphereuformer a u-shaped transformer for spherical 360 perception | arXiv: 2412.06968
- spherical manifold guided diffusion model for panoramic image generation
- spiking transformer introducing accurate addition-only spiking self-attention fo
- spiking transformer with spatial-temporal attention | arXiv: 2409.19764
- spiritsight agent advanced gui agent with one look | arXiv: 2503.03196
- spk2srimgnet super-resolve dynamic scene from spike stream via motion aligned co
- splatad real-time lidar and camera rendering with 3d gaussian splatting for auto
- splatflow multi-view rectified flow model for 3d gaussian splatting synthesis | arXiv: 2411.16443
- splatflow self-supervised dynamic gaussian splatting in neural motion flow field
- splatter-360 generalizable 360 gaussian splatting for wide-baseline panoramic im
- splinegs robust motion-adaptive spline for real-time dynamic 3d gaussians from m | arXiv: 2412.09982
- split adaptation for pre-trained vision transformers | arXiv: 2503.00441
- spmtrack spatio-temporal parameter-efficient fine-tuning with mixture of experts
- spotting the unexpected stu a 3d lidar dataset for anomaly segmentation in auton
- sshnet unsupervised cross-modal homography estimation via problem reformulation
- staa-snn spatial-temporal attention aggregator for spiking neural networks | arXiv: 2503.02689
- stabilizing and accelerating autofocus with expert trajectory regularized deep r
- stable flow vital layers for training-free image editing | arXiv: 2411.14430
- stable-score a stable registration-based framework for 3d shape correspondence | arXiv: 2503.21766
- stableanimator high-quality identity-preserving human image animation | arXiv: 2411.17697
- stacking brick by brick aligned feature isolation for incremental face forgery d | arXiv: 2411.11396
- stagedesigner artistic stage generation for scenography via theater scripts | arXiv: 2503.02595
- star with bilinear mapping
- star-edge structure-aware local spherical curve representation for thin-walled e
- stargen a spatiotemporal autoregression framework with video diffusion model for
- starvector generating scalable vector graphics code from images and text | arXiv: 2312.11556
- stcocc sparse spatial-temporal cascade renovation for 3d occupancy and scene flo
- stdd spatio-temporal dual diffusion for video generation
- stdgen semantic-decomposed 3d character generation from single images | arXiv: 2411.05738
- steady progress beats stagnation mutual aid of foundation and conventional model
- stealthy backdoor attack in self-supervised learning vision encoders for large v | arXiv: 2502.18290
- steepest descent density control for compact 3d gaussian splatting | arXiv: 2505.05587
- steering away from harm an adaptive approach to defending vision language model | arXiv: 2411.16721
- step enhancing video-llms compositional reasoning by spatio-temporal graph-guide
- steps sequential probability tensor estimation for text-to-image hard prompt sea
- stereo a two-stage framework for adversarially robust concept erasing from text-
- stereo anywhere robust zero-shot deep stereo matching even where either stereo o
- stereo4d learning how things move in 3d from internet stereo videos | arXiv: 2412.09621
- stickmotion generating 3d human motions by drawing a stickman | arXiv: 2503.04829
- stil semi-supervised tabular-image learning for comprehensive task-relevant info
- sting-bee towards vision-language model for real-world x-ray baggage security in | arXiv: 2504.02823
- stinr deciphering spatial transcriptomics via implicit neural representation
- stochastic human motion prediction with memory of action transition and action c | arXiv: 2507.04062
- stop integrated spatial-temporal dynamic prompting for video understanding | arXiv: 2503.15973
- stop learning it all to mitigate visual hallucination focus on the hallucination | arXiv: 2506.11417
- stop walking in circles bailing out early in projected gradient descent | arXiv: 2503.19347
- storygpt-v large language models as consistent story visualizers | arXiv: 2312.02252
- stpro spatial and temporal progressive learning for weakly supervised spatio-tem
- strap-vit segregated tokens with randomized -- transformations for defense again | arXiv: 2603.12688
- streamingt2v consistent dynamic and extendable long video generation from text | arXiv: 2403.14773
- streetcrafter street view synthesis with controllable video diffusion models | arXiv: 2412.13188
- stretching each dollar diffusion training from scratch on a micro-budget | arXiv: 2407.15811
- structure from collision | arXiv: 2505.21335
- structure-aware correspondence learning for relative pose estimation | arXiv: 2503.18671
- structure-from-motion with a non-parametric camera model
- structured 3d latents for scalable and versatile 3d generation | arXiv: 2412.01506
- style evolving along chain-of-thought for unknown-domain object detection | arXiv: 2503.09968
- style quantization for data-efficient gan training | arXiv: 2503.24282
- style-editor text-driven object-centric style editing | arXiv: 2408.08461
- stylemaster stylize your video with artistic generation and translation | arXiv: 2412.07744
- stylessp sampling startpoint enhancement for training-free diffusion-based metho
- stylestudio text-driven style transfer with selective control of style elements | arXiv: 2412.08503
- subnet-aware dynamic supernet training for neural architecture search | arXiv: 2503.10740
- subspace constraint and contribution estimation for heterogeneous federated lear
- sufficient invariant learning for distribution shift | arXiv: 2210.13533
- sum parts benchmarking part-level semantic segmentation of urban meshes | arXiv: 2503.15300
- superlightnet lightweight parameter aggregation network for multimodal brain tum
- superpc a single diffusion model for point cloud completion upsampling denoising | arXiv: 2503.14558
- supervising sound localization by in-the-wild egomotion
- surg-r1 a hierarchical reasoning foundation model for scalable and interpretable | arXiv: 2603.12430
- surgeon memory-adaptive fully test-time adaptation via dynamic activation sparsi
- svdc consistent direct time-of-flight video depth completion with frequency sele
- svfr a unified framework for generalized video face restoration | arXiv: 2501.01235
- svg-ir spatially-varying gaussian splatting for inverse rendering | arXiv: 2504.06815
- svlta benchmarking vision-language temporal alignment via synthetic video situat | arXiv: 2504.05925
- swiftedit lightning fast text-guided image editing via one-step diffusion | arXiv: 2412.04301
- symbolic representation for any-to-any generative tasks | arXiv: 2504.17261
- symdpo boosting in-context learning of large multimodal models with symbol demon
- symmetry strikes back from single-image symmetry detection to 3d generation | arXiv: 2411.17763
- synchronized video-to-audio generation via mel quantization-continuum decomposit | arXiv: 2503.06984
- syncsde a probabilistic framework for diffusion synchronization | arXiv: 2503.21555
- syncvp joint diffusion for synchronous multi-modal video prediction | arXiv: 2503.18933
- synergen-vl towards synergistic image understanding and generation with vision e
- synergizing motion and appearance multi-scale compensatory codebooks for talking
- syntab-llava enhancing multimodal table understanding with decoupled synthesis
- synthetic data is an elegant gift for continual vision-language models | arXiv: 2503.04229
- synthetic prior for few-shot drivable head avatar inversion | arXiv: 2501.06903
- synthetic visual genome | arXiv: 2506.07643
- synthetic-to-real self-supervised robust depth estimation via learning with moti
- synthlight portrait relighting with diffusion model by learning to re-render syn
- t-cil temperature scaling using adversarial perturbation for calibration in clas
- t-fake synthesizing thermal images for facial landmarking | arXiv: 2408.15127
- t2icount enhancing cross-modal understanding for zero-shot counting | arXiv: 2502.20625
- t2isafety benchmark for assessing fairness toxicity and privacy in image generat
- t2sg traffic topology scene graph for topology reasoning in autonomous driving | arXiv: 2411.18894
- t2v-compbench a comprehensive benchmark for compositional text-to-video generati
- tacodepth towards efficient radar-camera depth estimation with one-stage fusion | arXiv: 2504.11773
- tadformer task-adaptive dynamic transformer for efficient multi-task learning | arXiv: 2501.04293
- taet two-stage adversarial equalization training on long-tailed distributions | arXiv: 2503.01924
- taga self-supervised learning for template-free animatable gaussian articulated
- tailedcore few-shot sampling for unsupervised long-tail noisy anomaly detection | arXiv: 2504.02775
- take the bull by the horns learning to segment hard samples
- taming score-based denoisers in admm a convergent plug-and-play framework | arXiv: 2603.10281
- taming teacher forcing for masked autoregressive video generation | arXiv: 2501.12389
- taming video diffusion prior with scene-grounding guidance for 3d gaussian splat
- tamt temporal-aware model tuning for cross-domain few-shot action recognition | arXiv: 2411.19041
- tango training-free embodied ai agents for open-world tasks | arXiv: 2412.10402
- taoavatar real-time lifelike full-body talking avatars for augmented reality via
- tapt test-time adversarial prompt tuning for robust inference in vision-language | arXiv: 2411.13136
- targeted forgetting of image subgroups in clip models | arXiv: 2506.03117
- tarot towards essentially domain-invariant robustness with theoretical justifica
- tartan imu a light foundation model for inertial positioning in robotics
- task preference optimization improving multimodal large language models with vis
- task singular vectors reducing task interference in model merging | arXiv: 2412.00081
- task-agnostic guided feature expansion for class-incremental learning | arXiv: 2503.00823
- task-aware clustering for prompting vision-language models
- task-aware cross-modal feature refinement transformer with large language models
- task-driven image fusion with learnable fusion loss | arXiv: 2412.03240
- task-specific gradient adaptation for few-shot one-class classification
- taste more taste better diverse data and strong model boost semi-supervised crow | arXiv: 2503.17984
- taste-rob advancing video generation of task-oriented hand-object interaction fo
- taxonomy-aware evaluation of vision-language models | arXiv: 2504.05457
- tcfg tangential damping classifier-free guidance | arXiv: 2503.18137
- teaching large language models to regress accurate image quality scores using sc | arXiv: 2501.11561
- team leya in 10th abaw competition multimodal ambivalence/hesitancy recognition a | arXiv: 2603.12848
- team ras in 10th abaw competition multimodal valence and arousal estimation appr | arXiv: 2603.13056
- teller real-time streaming audio-driven portrait animation with autoregressive m | arXiv: 2503.18429
- temporal action detection model compression by progressive block drop | arXiv: 2503.16916
- temporal alignment-free video matching for few-shot action recognition | arXiv: 2504.05956
- temporal score analysis for understanding and correcting diffusion artifacts | arXiv: 2503.16218
- temporal separation with entropy regularization for knowledge distillation in sp
- temporally consistent object-centric learning by contrasting slots | arXiv: 2412.14295
- tensoflow tensorial flow-based sampler for inverse rendering | arXiv: 2503.18328
- test-time attention purification for backdoored large vision language models | arXiv: 2603.12989
- test-time augmentation improves efficiency in conformal prediction | arXiv: 2505.22764
- test-time backdoor detection for object detection models | arXiv: 2503.15293
- test-time domain generalization via universe learning a multi-graph matching app
- test-time fine-tuning of image compression models for multi-task adaptability
- test-time visual in-context tuning | arXiv: 2503.21777
- texgarment consistent garment uv texture generation via efficient 3d structure-g
- texgaussian generating high-quality pbr material via octree-based 3d gaussian sp
- text augmented correlation transformer for few-shot classification segmentation
- text embedding is not all you need attention control for text-to-image semantic
- text-driven fashion image editing with compositional concept learning and counte
- text-guided sparse voxel pruning for efficient 3d visual grounding | arXiv: 2502.10392
- text-phase synergy network with dual priors for unsupervised cross-domain image | arXiv: 2603.12711
- textured gaussians for enhanced 3d scene appearance modeling | arXiv: 2411.18625
- tfcustom customized image generation with time-aware frequency feature guidance
- the art of deception color visual illusions and diffusion models | arXiv: 2412.10122
- the change you want to detect semantic change detection in earth observation wit
- the devil is in low-level features for cross-domain few-shot segmentation | arXiv: 2503.21150
- the devil is in temporal token high quality video reasoning segmentation | arXiv: 2501.08549
- the devil is in the prompts retrieval-augmented prompt optimization for text-to- | arXiv: 2504.11739
- the illusion of unlearning the unstable nature of machine unlearning in text-to-
- the impact label noise and choice of threshold has on cross-entropy and soft-dic
- the language of motion unifying verbal and non-verbal language of 3d human motio
- the panaf-fgbg dataset understanding the impact of backgrounds in wildlife behav
- the photographers eye teaching multimodal large language models to see and criti
- the power of context how multimodality improves image super-resolution | arXiv: 2503.14503
- the scene language representing scenes with programs words and embeddings | arXiv: 2410.16770
- theoretical insights in model inversion robustness and conditional entropy maxim
- theory-inspired deep multi-view multi-label learning with incomplete views and n
- thin-shell-sft fine-grained monocular non-rigid 3d surface tracking with neural | arXiv: 2503.19976
- think and answer me benchmarking and exploring multi-entity reasoning grounding | arXiv: 2603.12788
- think small act big primitive prompt learning for lifelong robot manipulation | arXiv: 2504.00420
- thinking in dynamics how multimodal large language models perceive track and rea | arXiv: 2603.12746
- thinking in space how multimodal large language models see remember and recall s | arXiv: 2412.14171
- thinking in streaming video | arXiv: 2603.12938
- three cars approaching within 100m enhancing distant geometry by tri-axis voxel
- three-view focal length recovery from homographies | arXiv: 2501.07499
- through-the-mask mask-based motion trajectories for image-to-video generation | arXiv: 2501.03059
- tide training locally interpretable domain generalization models enables test-ti
- tightening robustness verification of maxpool-based neural networks via minimizi
- tiled diffusion | arXiv: 2412.15185
- time of the flight of the gaussians optimizing depth indirectly in dynamic radia
- timestep embedding tells its time to cache for video diffusion model | arXiv: 2411.19108
- timetracker event-based continuous point tracking for video frame interpolation
- timotion temporal and interactive framework for efficient human-human motion gen
- tinyfusion diffusion transformers learned shallow | arXiv: 2412.01199
- tinynav end-to-end tinyml for real-time autonomous navigation on microcontroller | arXiv: 2603.11071
- tkg-dm training-free chroma key content generation diffusion model | arXiv: 2411.15580
- token cropr faster vits for quite a few tasks | arXiv: 2412.00965
- tokenflow unified image tokenizer for multimodal understanding and generation | arXiv: 2412.03069
- tokenhsi unified synthesis of physical human-scene interactions through task tok
- tokenize image patches global context fusion for effective haze removal in large | arXiv: 2504.09621
- tokenmotion decoupled motion control via token disentanglement for human-centric | arXiv: 2504.08181
- topnet transformer-efficient occupancy prediction network for octree-structured
- topo-r1 detecting topological anomalies via vision-language models | arXiv: 2603.13054
- topocellgen generating histopathology cell topology with a diffusion model | arXiv: 2412.06011
- topv compatible token pruning with inference time optimization for fast and low-
- tora trajectory-oriented diffusion transformer for video generation | arXiv: 2407.21705
- tornadonet real-time building damage detection with ordinal supervision | arXiv: 2603.11557
- touch2shape touch-conditioned 3d diffusion for shape exploration and reconstruct | arXiv: 2505.13091
- toward generalized image quality assessment relaxing the perfect reference quali
- toward real-world bev perception depth uncertainty estimation via gaussian splat | arXiv: 2504.01957
- toward robust neural reconstruction from sparse point sets | arXiv: 2412.16361
- towards a universal synthetic video detector from face or background manipulatio
- towards all-in-one medical image re-identification | arXiv: 2503.08173
- towards autonomous micromobility through scalable urban simulation | arXiv: 2505.00690
- towards better alignment training diffusion models with reinforcement learning a
- towards consistent multi-task learning unlocking the potential of task-specific
- towards continual universal segmentation
- towards cost-effective learning a synergy of semi-supervised and active learning
- towards effective and sparse adversarial attack on spiking neural networks via b
- towards efficient foundation model for zero-shot amodal segmentation
- towards enhanced image inpainting mitigating unwanted object insertion and prese
- towards explainable and unprecedented accuracy in matching challenging finger cr
- towards explicit geometry-reflectance collaboration for generalized lidar segmen
- towards faithful multimodal concept bottleneck models | arXiv: 2603.13163
- towards fine-grained interpretability counterfactual explanations for misclassif
- towards general visual-linguistic face forgery detection | arXiv: 2307.16545
- towards generalizable scene change detection | arXiv: 2409.06214
- towards generalizable trajectory prediction using dual-level representation lear
- towards high-fidelity 3d talking avatar with personalized dynamic texture | arXiv: 2503.00495
- towards human-understandable multi-dimensional concept discovery | arXiv: 2503.18629
- towards improved text-aligned codebook learning multi-hierarchical codebook-text
- towards in-the-wild 3d plane reconstruction from a single image | arXiv: 2506.02493
- towards long-horizon vision-language navigation platform benchmark and method | arXiv: 2412.09082
- towards lossless implicit neural representation via bit plane decomposition | arXiv: 2502.21001
- towards million-scale adversarial robustness evaluation with stronger individual | arXiv: 2411.15210
- towards more general video-based deepfake detection through facial component gui
- towards natural language-based document image retrieval new dataset and benchmar
- towards open-vocabulary audio-visual event localization | arXiv: 2411.11278
- towards optimizing large-scale multi-graph matching in bioimaging
- towards practical real-time neural video compression | arXiv: 2502.20762
- towards precise embodied dialogue localization via causality guided diffusion
- towards precise scaling laws for video diffusion transformers | arXiv: 2411.17470
- towards raw object detection in diverse conditions | arXiv: 2411.15678
- towards realistic example-based modeling via 3d gaussian stitching | arXiv: 2408.15708
- towards satellite image road graph extraction a global-scale dataset and a novel | arXiv: 2411.16733
- towards scalable human-aligned benchmark for text-guided image editing | arXiv: 2505.00502
- towards smart point-and-shoot photography | arXiv: 2505.03638
- towards source-free machine unlearning | arXiv: 2508.15127
- towards spatio-temporal world scene graph generation from monocular videos | arXiv: 2603.13185
- towards stable and storage-efficient dataset distillation matching convexified t | arXiv: 2406.19827
- towards training-free anomaly detection with vision and language foundation mode
- towards transformer-based aligned generation with self-coherence guidance | arXiv: 2503.17675
- towards unbiased and robust spatio-temporal scene graph generation and anticipat
- towards understanding and quantifying uncertainty for text-to-image generation | arXiv: 2412.03178
- towards understanding how knowledge evolves in large vision-language models | arXiv: 2504.02862
- towards universal ai-generated image detection by variational information bottle
- towards universal computational aberration correction in photographic cameras a | arXiv: 2603.12083
- towards universal dataset distillation via task-driven diffusion
- towards universal soccer video understanding | arXiv: 2412.01820
- towards visual discrimination and reasoning of real-world physical dynamics phys
- towards zero-shot anomaly detection and reasoning with multimodal large language | arXiv: 2502.07601
- tra-moe learning trajectory prediction model from multiple domains for adaptive | arXiv: 2411.14519
- track any anomalous object a granular video anomaly detection pipeline
- track4gen teaching video diffusion models to track points improves video generat
- tracktention leveraging point tracking to attend videos faster and better | arXiv: 2503.19904
- traf-align trajectory-aware feature alignment for asynchronous multi-agent perce
- training data provenance verification did your model use synthetic data from my | arXiv: 2503.09122
- training-free dense-aligned diffusion guidance for modular conditional image syn
- training-free neural architecture search through variance of knowledge of deep n | arXiv: 2502.04975
- trajectory mamba efficient attention-mamba forecasting model based on selective | arXiv: 2503.10898
- transfer your perspective controllable 3d generation from any viewpoint in a dri
- transformer-based multi-region segmentation and radiomic analysis of hr-pqct ima | arXiv: 2603.09137
- transformers without normalization | arXiv: 2503.10622
- transpixeler advancing text-to-video generation with transparency | arXiv: 2501.03006
- traversing distortion-perception tradeoff using a single score-based generative | arXiv: 2503.20297
- treemeshgpt artistic mesh generation with autoregressive tree sequencing | arXiv: 2503.11629
- tripartite weight-space ensemble for few-shot class-incremental learning | arXiv: 2506.15720
- tritex learning texture from a single mesh via triplane semantic features | arXiv: 2503.16630
- trust your critic robust reward modeling and reinforcement learning for faithful | arXiv: 2603.12247
- tsam temporal sam augmented with multimodal prompts for referring audio-visual s
- tsd-sr one-step diffusion with target score distillation for real-world image su
- tsp-mamba the travelling salesman problem meets mamba for image super-resolution
- tuning the frequencies robust training for sinusoidal neural networks | arXiv: 2407.21121
- turbo3d ultra-fast text-to-3d generation | arXiv: 2412.04470
- turbofill adapting few-step text-to-image model for fast image inpainting | arXiv: 2504.00996
- twinner shining light on digital twins in a few snaps | arXiv: 2503.08382
- two by two learning multi-task pairwise objects assembly for generalizable robot | arXiv: 2504.06961
- two is better than one efficient ensemble defense for robust and compact models | arXiv: 2504.04747
- u-know-diffpan an uncertainty-aware knowledge distillation diffusion framework w
- ua-pose uncertainty-aware 6d object pose estimation and online object completion
- ucm-veid v2 a richer dataset and a pre-training method for uav cross-modality ve
- ucod-dpl unsupervised camouflaged object detection via dynamic pseudo-label lear
- uhd-processer unified uhd image restoration with progressive frequency learning
- uibdiffusion universal imperceptible backdoor attack for diffusion models | arXiv: 2412.11441
- ultrafusion ultra high dynamic imaging using exposure fusion | arXiv: 2501.11515
- ultrasoundagents hierarchical multi-agent evidence-chain reasoning for breast ul | arXiv: 2603.10852
- umfn unified multi-domain face normalization for joint cross-domain prototype le
- umotion uncertainty-driven human motion estimation from inertial and ultra-wideb
- unbiased video scene graph generation via visual and semantic dual debiasing | arXiv: 2503.00548
- unbiasing through textual descriptions mitigating representation bias in video b | arXiv: 2503.18637
- unboxed geometrically and temporally consistent video outpainting
- uncertain multimodal intention and emotion understanding in the wild
- uncertainty meets diversity a comprehensive active learning framework for indoor
- uncertainty weighted gradients for model calibration | arXiv: 2503.22725
- uncertainty-aware concept and motion segmentation for semi-supervised angiograph | arXiv: 2603.00881
- uncertainty-guided perturbation for image super-resolution diffusion model | arXiv: 2503.18512
- uncertainty-instructed structure injection for generalizable hd map construction | arXiv: 2503.23109
- uncommon objects in 3d | arXiv: 2501.07574
- understanding fine-tuning clip for open-vocabulary semantic segmentation in hype
- understanding multi-layered transmission matrices | arXiv: 2410.23864
- understanding multi-task activities from single-task videos
- unem unrolled generalized em for transductive few-shot learning | arXiv: 2412.16739
- uni-renderer unifying rendering and inverse rendering via dual stream diffusion | arXiv: 2412.15050
- uni4d unifying visual foundation models for 4d modeling from a single video | arXiv: 2503.21761
- unialign scaling multimodal alignment within one unified model
- uniap unifying inter- and intra-layer automatic parallelism by mixed integer qua
- unic-adapter unified image-instruction adapter with multi-modal transformer for | arXiv: 2412.18928
- unicl-sam uncertainty-driven in-context segmentation with part prototype discove
- unicom unified multimodal modeling via compressed continuous semantic representa | arXiv: 2603.10702
- unified dense prediction of video diffusion | arXiv: 2503.09344
- unified medical lesion segmentation via self-referring indicator
- unified reconstruction of static and dynamic scenes from events
- unified uncertainty-aware diffusion for multi-agent trajectory modeling | arXiv: 2503.18589
- unigoal towards universal zero-shot goal-oriented navigation | arXiv: 2503.10630
- unigrasptransformer simplified policy distillation for scalable dexterous roboti
- unihope a unified approach for hand-only and hand-object pose estimation | arXiv: 2503.13303
- unik3d universal camera monocular 3d estimation | arXiv: 2503.16591
- unimamba unified spatial-channel representation learning with group-efficient ma
- uninet a contrastive learning-guided unified framework with feature selection fo
- uniphy learning a unified constitutive model for inverse physics simulation | arXiv: 2505.16971
- unipose a unified multimodal framework for human pose comprehension generation a | arXiv: 2411.16781
- unipre3d unified pre-training of 3d point cloud models with cross-modal gaussian | arXiv: 2506.09952
- unireal universal image generation and editing via learning real-world dynamics | arXiv: 2412.07774
- unirestore unified perceptual and task-oriented image restoration model using di
- uniscene unified occupancy-centric driving scene generation | arXiv: 2412.05435
- unistainnet foundation-model-guided virtual staining of h&e to ihc | arXiv: 2603.12716
- unistd towards unified spatio-temporal learning across diverse disciplines | arXiv: 2503.20748
- unity in diversity video editing via gradient-latent purification
- univad a training-free unified model for few-shot visual anomaly detection | arXiv: 2412.03342
- universal actions for enhanced embodied foundation models | arXiv: 2501.10105
- universal domain adaptation for semantic segmentation | arXiv: 2505.22458
- universal scene graph generation | arXiv: 2503.15005
- unlearning through knowledge overwriting reversible federated unlearning via sel
- unleashing in-context learning of autoregressive models for few-shot image manip
- unleashing the potential of consistency learning for detecting and grounding mul
- unleashing the potential of multi-modal foundation models and video diffusion fo
- unleashing video language models for fine-grained hrct report generation | arXiv: 2603.12469
- unlocking generalization power in lidar point cloud registration | arXiv: 2503.10149
- unlocking the potential of unlabeled data in semi-supervised domain generalizati
- unmasking biases and reliability concerns in convolutional neural networks analy | arXiv: 2603.12445
- unopose unseen object pose estimation with an unposed rgb-d reference image | arXiv: 2411.16106
- unraveling normal anatomy via fluid-driven anomaly randomization | arXiv: 2501.13370
- unseen visual anomaly generation | arXiv: 2406.01078
- unsupervised continual domain shift learning with multi-prototype modeling
- unsupervised discovery of facial landmarks and head pose
- unsupervised foundation model-agnostic slide-level representation learning | arXiv: 2411.13623
- unveil inversion and invariance in flow transformer for versatile image editing | arXiv: 2411.15843
- unveiling differences in generative models a scalable differential clustering ap
- unveiling the ignorance of mllms seeing clearly answering incorrectly | arXiv: 2406.10638
- unveiling the mist over 3d vision-language understanding object-centric evaluati
- unveiling visual perception in language models an attention head analysis approa
- upme an unsupervised peer review framework for multimodal large language model e | arXiv: 2503.14941
- urbancad towards highly controllable and photorealistic 3d vehicles for urban sc
- urwkv unified rwkv model with multi-state perspective for low-light image restor | arXiv: 2505.23068
- using diffusion priors for video amodal segmentation | arXiv: 2412.04623
- using powerful prior knowledge of diffusion model in deep unfolding networks for | arXiv: 2503.08429
- usp-gaussian unifying spike-based image reconstruction pose correction and gauss
- uvgs reimagining unstructured 3d gaussian splatting using uv mapping | arXiv: 2502.01846
- uwav uncertainty-weighted weakly-supervised audio-visual video parsing | arXiv: 2505.09615
- v-bridge bridging video generative priors to versatile few-shot image restoratio | arXiv: 2603.13089
- v-clr view-consistent learning for open-world instance segmentation | arXiv: 2504.01383
- v-stylist video stylization via collaboration and reflection of mllm agents | arXiv: 2503.12077
- v2dial unification of video and visual dialog via multimodal experts
- v2v3d view-to-view denoised 3d reconstruction for light field microscopy
- v2x-r cooperative lidar-4d radar fusion with denoising diffusion for 3d object d | arXiv: 2411.08402
- variance-based membership inference attacks against large-scale image captioning
- variational garrote for sparse inverse problems | arXiv: 2603.12562
- varsplat uncertainty-aware 3d gaussian splatting for robust rgb-d slam | arXiv: 2603.09673
- vasparse towards efficient visual hallucination mitigation via visual-aware toke
- vastsd learning 3d vascular tree-state space diffusion model for angiography syn
- vcbench a streaming counting benchmark for spatial-temporal state maintenance in | arXiv: 2603.12703
- vdocrag retrieval-augmented generation over visually-rich documents | arXiv: 2504.09795
- velociti benchmarking video-language compositional reasoning with strict entailm
- vera explainable video anomaly detection via verbalized learning of vision-langu
- verbdiff text-only diffusion models with enhanced interaction awareness | arXiv: 2503.16406
- vesselfm a foundation model for universal 3d blood vessel segmentation | arXiv: 2411.17386
- veu-bench towards comprehensive understanding of video editing | arXiv: 2504.17828
- vggt visual geometry grounded transformer | arXiv: 2503.11651
- vi3nr variance informed initialization for implicit neural representations | arXiv: 2504.19270
- vicas a dataset for combining holistic and pixel-level video understanding using
- vid2avatar-pro authentic avatar from videos in the wild via universal prior | arXiv: 2503.01610
- vid2sim generalizable video-based reconstruction of appearance geometry and phys
- vid2sim realistic and interactive simulation from video for urban navigation | arXiv: 2501.06693
- vidbot learning generalizable 3d actions from in-the-wild 2d human videos for ze
- vidcomposition can mllms analyze compositions in compiled videos | arXiv: 2411.10979
- video depth anything consistent depth estimation for super-long videos | arXiv: 2501.12375
- video depth without video models | arXiv: 2411.19189
- video language model pretraining with spatio-temporal masking
- video motion transfer with diffusion transformers | arXiv: 2412.07776
- video streaming thinking videollms can watch and think simultaneously | arXiv: 2603.12262
- video summarization with large language models | arXiv: 2504.11199
- video-3d llm learning position-aware video representation for 3d scene understan
- video-bench human-aligned video generation benchmark | arXiv: 2504.04907
- video-colbert contextualized late interaction for text-to-video retrieval | arXiv: 2503.19009
- video-guided foley sound generation with multimodal controls | arXiv: 2411.17698
- video-mme the first-ever comprehensive evaluation benchmark of multi-modal llms
- video-panda parameter-efficient alignment for encoder-free video-language models | arXiv: 2412.18609
- video-xl extra-long vision language model for hour-scale video understanding | arXiv: 2409.14485
- videoautoarena an automated arena for evaluating large multimodal models in vide
- videocomp advancing fine-grained compositional and temporal alignment in video-t
- videodirector precise video editing via text-to-video models | arXiv: 2411.17592
- videodpo omni-preference alignment for video diffusion generation | arXiv: 2412.14167
- videoespresso a large-scale chain-of-thought dataset for fine-grained video reas
- videogem training-free action grounding in videos | arXiv: 2503.20348
- videogigagan towards detail-rich video super-resolution | arXiv: 2404.12388
- videoglamm a large multimodal model for pixel-level visual grounding in videos | arXiv: 2411.04923
- videoguide improving video diffusion models without training through a teachers | arXiv: 2410.04364
- videohandles editing 3d object compositions in videos using video generative pri
- videoicl confidence-based iterative in-context learning for out-of-distribution
- videomage multi-subject and motion customization of text-to-video diffusion mode
- videorefer suite advancing spatial-temporal object understanding with video llm | arXiv: 2501.00599
- videoscene distilling video diffusion model to generate 3d scenes in one step | arXiv: 2504.01956
- videospats video spatiotemporal splines for disentangled occlusion appearance an
- videotree adaptive tree-based video representation for llm reasoning on long vid
- videoworld exploring knowledge learning from unlabeled videos | arXiv: 2501.09781
- vidhalluc evaluating temporal hallucinations in multimodal large language models
- vidmuse a simple video-to-music generation framework with long-short-term modeli
- vidseg training-free video semantic segmentation based on diffusion models
- vidtwin video vae with decoupled structure and dynamics | arXiv: 2412.17726
- viewpoint rosetta stone unlocking unpaired ego-exo videos for view-invariant rep
- viineus volumetric initialization for implicit neural surface reconstruction of
- vikienet towards efficient 3d object detection with virtual key instance enhance
- vila-m3 enhancing vision-language models with medical expert knowledge | arXiv: 2411.12915
- vinabench benchmark for faithful and consistent visual narratives | arXiv: 2503.20871
- vintage joint video and text conditioning for holistic audio generation | arXiv: 2412.10768
- vird view-invariant representation through dual-axis transformation for cross-vi | arXiv: 2603.12918
- vires video instance repainting via sketch and text guided generation | arXiv: 2411.16199
- visco benchmarking fine-grained critique and correction towards self-improvement
- vision-guided action enhancing 3d human motion prediction with gaze-informed aff
- vision-language embodiment for monocular depth estimation | arXiv: 2503.16535
- vision-language gradient descent-driven all-in-one deep unfolding networks | arXiv: 2503.16930
- vision-language model ip protection via prompt-based learning | arXiv: 2503.02393
- vision-language models do not understand negation | arXiv: 2501.09425
- visionarena 230k real world user-vlm conversations with preference labels | arXiv: 2412.08687
- visionpad a vision-centric pre-training paradigm for autonomous driving | arXiv: 2411.14716
- visionzip longer is better but not necessary in vision language models | arXiv: 2412.04467
- vista enhancing long-duration and high-resolution video understanding by video s | arXiv: 2412.00927
- vista3d a unified segmentation foundation model for 3d medical imaging | arXiv: 2406.05285
- vistream improving computation efficiency of visual streaming perception via law
- visual agentic ai for spatial reasoning with a dynamic api | arXiv: 2502.06787
- visual and semantic prompt collaboration for generalized zero-shot learning | arXiv: 2503.23030
- visual consensus prompting for co-salient object detection | arXiv: 2504.14254
- visual lexicon rich image features in language space | arXiv: 2412.06774
- visual persona foundation model for full-body human customization | arXiv: 2503.15406
- visual prompting for one-shot controllable video editing without inversion | arXiv: 2504.14335
- visual representation learning through causal intervention for controllable imag
- visual-erm reward modeling for visual equivalence | arXiv: 2603.13224
- visual-instructed degradation diffusion for all-in-one image restoration | arXiv: 2506.16960
- vited video temporal evidence distillation | arXiv: 2503.12855
- viunit visual unit tests for more robust visual programming | arXiv: 2412.08859
- vl-rewardbench a challenging benchmark for vision-language generative reward mod
- vl2lite task-specific knowledge distillation from large vision-language models t
- vladva discriminative fine-tuning of lvlms | arXiv: 2412.04378
- vlms-guided representation distillation for efficient vision-based reinforcement
- vlog video-language models by generative retrieval of narration vocabulary | arXiv: 2503.09402
- vlogger multimodal diffusion for embodied avatar synthesis | arXiv: 2403.08764
- vlsi verbalized layers-to-interactions from large to small vision language model | arXiv: 2412.01822
- voco-llama towards vision compression with large language models | arXiv: 2406.12275
- vodiff controlling object visibility order in text-to-image generation
- volformer explore more comprehensive cube interaction for hyperspectral image re
- volume tells dual cycle-consistent diffusion for 3d fluorescence microscopy de-n
- volumetric surfaces representing fuzzy geometries with layered meshes | arXiv: 2409.02482
- volumetrically consistent 3d gaussian rasterization | arXiv: 2412.03378
- voteflow enforcing local rigidity in self-supervised scene flow | arXiv: 2503.22328
- voxelsplat dynamic gaussian splatting as an effective loss for occupancy and flo
- vsnet focusing on the linguistic characteristics of sign language
- vton 360 high-fidelity virtual try-on from any viewing direction | arXiv: 2503.12165
- vton-handfit virtual try-on for arbitrary hand pose guided by hand priors embedd
- watermarking one for all a robust watermarking scheme against partial image thef
- wav2sem plug-and-play audio semantic decoupling for 3d speech-driven facial anim | arXiv: 2505.23290
- wave weight templates for adaptive initialization of variable-sized models | arXiv: 2406.17503
- wavelet and prototype augmented query-based transformer for pixel-level surface
- weakly supervised contrastive adversarial training for learning robust features
- weakly supervised semantic segmentation via progressive confidence region expans
- weakly supervised teacher-student framework with progressive pseudo-mask refinem | arXiv: 2603.08605
- weakly supervised temporal action localization via dual-prior collaborative lear
- weakmcn multi-task collaborative network for weakly supervised referring express
- wear classification of abrasive flap wheels using a hierarchical deep learning a | arXiv: 2603.12852
- weathergen a unified diverse weather generator for lidar point clouds via spider | arXiv: 2504.13561
- wegen a unified model for interactive multimodal generation as we chat | arXiv: 2503.01115
- wf-vae enhancing video vae by wavelet-driven energy flow for latent video diffus
- what makes a good dataset for knowledge distillation | arXiv: 2411.12817
- whats in the image a deep-dive into the vision of vision language models | arXiv: 2411.17491
- when domain generalization meets generalized category discovery an adaptive task
- when the future becomes the past taming temporal correspondence for self-supervi
- when to lock attention training-free kv control in video diffusion | arXiv: 2603.09657
- where the devil hides deepfake detectors can no longer be trusted | arXiv: 2505.08255
- wheres the liability in the generative era recovery-based black-box detection of | arXiv: 2505.01008
- which viewpoint shows it best language for weakly supervising view selection in | arXiv: 2411.08753
- why does it look there structured explanations for image classification | arXiv: 2603.10234
- wildavatar learning in-the-wild 3d avatars from the web | arXiv: 2407.02165
- wildgs-slam monocular gaussian splatting slam in dynamic environments | arXiv: 2504.03886
- wilor end-to-end 3d hand localization and reconstruction in-the-wild | arXiv: 2409.12259
- wise a framework for gigapixel whole-slide-image lossless compression | arXiv: 2503.18074
- wish weakly supervised instance segmentation using heterogeneous labels
- wisnet pseudo label generation on unbalanced and patch annotated waste images
- wonderland navigating 3d scenes from a single image | arXiv: 2412.12091
- wonderworld interactive 3d scene generation from a single image | arXiv: 2406.09394
- words or vision do vision-language models have blind faith in text | arXiv: 2503.02199
- world-consistent video diffusion with explicit 3d modeling | arXiv: 2412.01821
- world2act latent action post-training via skill-compositional world models | arXiv: 2603.10422
- x-dyna expressive dynamic human image animation | arXiv: 2501.10021
- xlrs-bench could your multimodal llms understand extremely large ultra-high-reso
- yochameleon personalized vision and language generation | arXiv: 2504.20998
- you see it you got it learning 3d creation on pose-free videos at scale | arXiv: 2412.06699
- your large vision-language model only needs a few attention heads for visual gro | arXiv: 2503.06287
- your scale factors are my weapon targeted bit-flip attacks on vision transformer
- your vit is secretly an image segmentation model | arXiv: 2503.19108
- z-magic zero-shot multiple attributes guided image creator | arXiv: 2503.12124
- zero-1-to-a zero-shot one image to animatable head avatars using video diffusion | arXiv: 2503.15851
- zero-shot 3d question answering via voxel-based dynamic token compression
- zero-shot 4d lidar panoptic segmentation | arXiv: 2504.00848
- zero-shot blind-spot image denoising via implicit neural sampling
- zero-shot head swapping in real-world scenarios | arXiv: 2503.00861
- zero-shot image restoration using few-step guidance of consistency models and be | arXiv: 2412.20596
- zero-shot monocular scene flow estimation in the wild | arXiv: 2501.10357
- zero-shot novel view and depth synthesis with multi-view geometric diffusion | arXiv: 2501.18804
- zero-shot rgb-d point cloud registration with pre-trained large vision model
- zero-shot styled text image generation but make it autoregressive | arXiv: 2503.17074
- zerograsp zero-shot shape reconstruction enabled robotic grasping | arXiv: 2504.10857
- zerovo visual odometry with minimal assumptions | arXiv: 2506.08005
- zo-sam zero-order sharpness-aware minimization for efficient sparse training | arXiv: 2603.13115
- zoomldm latent diffusion model for multi-scale image generation | arXiv: 2411.16969
- dual exposure stereo extended dr 3d | arXiv: 2412.02351
- dualpm dual point maps shape pose | arXiv: 2412.04464
- dune universal encoder distillation | arXiv: 2503.14405
- dyn hamr recovering 4d interacting hand motion from a dynamic camera
- faster focal token acquiring-and-scaling transformer for long-term 3d objection | arXiv: 2503.01899
- flare sparse view reconstruction | arXiv: 2502.12138
- magic-slam multi-agent gaussian globally consistent slam | arXiv: 2411.16785
- murre sfm guided depth reconstruction | arXiv: 2503.14483
- mv 3dcd multiview change detection | arXiv: 2412.03911
- climbingcap multi-modal dataset and method for rock climbing in world | arXiv: 2503.21268
- gdfusion temporal fusion occupancy | arXiv: 2504.12959
- codepercept code-grounded visual stem perception for mllms | arXiv: 2603.10757
- motionrefit motion editing | arXiv: 2503.20724
- dual diffusion unified generation understanding | arXiv: 2501.00289
- dualanodiff few shot anomaly image generation | arXiv: 2408.13509
- easycraft avatar crafting | arXiv: 2503.01158
- fade fine grained erasure diffusion | arXiv: 2503.19783
- filmcomposer llm music production | arXiv: 2503.08147
- finelip clip long text fine grained | arXiv: 2504.01916
- flipsketch sketch animation | arXiv: 2411.10818
- mca ctrl attention control customization | arXiv: 2505.01428
- dpir dual prompting restoration dit | arXiv: 2504.17825
- advancing myopia to holism fully contrastive language-image pre-training | arXiv: 2412.00440
- chathuman chatting about 3d humans with tools | arXiv: 2405.04533
- cobra combinatorial retrieval augmentation for few-shot adaptation | arXiv: 2412.17684
- docopilot improving multimodal models for document-level understanding | arXiv: 2507.14675
- ezsr event-based zero-shot recognition | arXiv: 2407.21616
- few-shot recognition via stage-wise retrieval-augmented finetuning | arXiv: 2406.11148
- genius a generative framework for universal multimodal search | arXiv: 2503.19868
- goal global-local object alignment learning | arXiv: 2503.17782
- joint vision-language social bias removal for clip | arXiv: 2411.12785
- lamra large multimodal model as your advanced retrieval assistant | arXiv: 2412.01720
- lotusfilter fast diverse nearest neighbor search via a learned cutoff table | arXiv: 2506.04790
- neighborretr balancing hub centrality in cross-modal retrieval | arXiv: 2503.10526
- preserving clusters in prompt learning for unsupervised domain adaptation | arXiv: 2506.11493
- range retrieval augmented neural fields for multi-resolution geo-embeddings | arXiv: 2502.19781
- towards smart point-and-shoot photography | arXiv: 2505.03638
- vdocrag retrieval-augmented generation over visually-rich documents | arXiv: 2504.09795
- albm attribute concept space | arXiv: 2503.20301
- attribute-formed class-specific concept space endowing language bottleneck model
- differentiable inverse rendering with interpretable basis brdfs | arXiv: 2411.17994
- geometry-guided camera motion understanding in videollms | arXiv: 2603.13119
- interpretable image classification via non-parametric part prototype learning | arXiv: 2503.10247
- kvq boosting video quality assessment via saliency-guided local perception | arXiv: 2503.10259
- l-swag layer-sample wise activation with gradients information for zero-shot nas
- language guided concept bottleneck models for interpretable continual learning
- learning on model weights using tree experts | arXiv: 2410.13569
- learning visual composition through improved semantic guidance | arXiv: 2412.15396
- lswag zero shot nas | arXiv: 2505.07300
- on the possible detectability of image-in-image steganography | arXiv: 2603.11876
- open ad-hoc categorization with contextualized feature learning | arXiv: 2512.16202
- probing the mid-level vision capabilities of self-supervised learning | arXiv: 2411.17474
- prompt-cam making vision transformers interpretable for fine-grained analysis | arXiv: 2501.09333
- sample- and parameter-efficient auto-regressive image models | arXiv: 2411.15648
- scaling vision pre-training to 4k resolution | arXiv: 2503.19903
- tide domain generalization | arXiv: 2411.16788
- tide training locally interpretable domain generalization models enables test time correction | arXiv: 2411.16788
- towards faithful multimodal concept bottleneck models | arXiv: 2603.13163
- towards human-understandable multi-dimensional concept discovery | arXiv: 2503.18629
- cad llama parametric | arXiv: 2505.04481
- capo multi preference | arXiv: 2502.02588
- inpo inversion preference optimization diffusion alignment | arXiv: 2503.18454
- sam dpo semi supervised | arXiv: 2503.04639
- sam dpo semi supervised medical segmentation | arXiv: 2503.04639
- spo aesthetic post training | arXiv: 2406.04314
- symdpo symbol icl | arXiv: 2411.11909
- care transformer linear attention | arXiv: 2411.16170
- moee mixture expert extraction | arXiv: 2505.15414
- comfybench benchmarking llm-based agents in comfyui for autonomously designing c | arXiv: 2409.01392
- context-cir learning from concepts in text for composed image retrieval | arXiv: 2505.20764
- dense match summarization for faster two-view estimation
- do imagenet-trained models learn shortcuts the impact of frequency shortcuts on | arXiv: 2503.03519
- dora sampling and benchmarking for 3d shape variational auto-encoders | arXiv: 2412.17808
- dual consolidation for pre-trained model-based domain-incremental learning | arXiv: 2410.00911
- erase diffusion empowering object removal through calibrating diffusion pathways | arXiv: 2503.07026
- event ellipsometer event-based mueller-matrix video imaging | arXiv: 2411.17313
- exposure-slot exposure-centric representations learning with slot-in-slot attent
- gradient-guided annealing for domain generalization | arXiv: 2502.20162
- lotus large-scale machine unlearning with a taste of uncertainty | arXiv: 2503.18314
- making old film great again degradation-aware state space model for old film res
- on the generalization of handwritten text recognition models | arXiv: 2411.17332
- oodd test-time out-of-distribution detection with dynamic dictionary | arXiv: 2503.10468
- out of sight out of mind evaluating state evolution in video world models | arXiv: 2603.13215
- polarfree polarization-based reflection-free imaging | arXiv: 2503.18055
- postero structuring layout trees to enable language models in generalized conten | arXiv: 2505.07843
- potential field based deep metric learning | arXiv: 2405.18560
- practical solutions to the relative pose of three calibrated cameras | arXiv: 2303.16078
- roadsocial a diverse videoqa dataset and benchmark for road event understanding
- sata spatial autocorrelation token analysis for enhancing the robustness of visi
- scene-agnostic pose regression for visual localization | arXiv: 2503.19543
- sufficient invariant learning for distribution shift | arXiv: 2210.13533
- traf-align trajectory-aware feature alignment for asynchronous multi-agent perce | arXiv: 2503.19391
- uncertainty weighted gradients for model calibration | arXiv: 2503.22725
- vinabench benchmark for faithful and consistent visual narratives | arXiv: 2503.20871
- comrope rotary position | arXiv: 2506.03737
- sec-prompt semantic complementary prompting for few-shot class-incremental learni
- 3d prior is all you need cross-task few-shot 2d gaze estimation | arXiv: 2502.04074
- a unified framework for heterogeneous semi-supervised learning | arXiv: 2503.00286
- bridging the vision-brain gap with an uncertainty-aware blur prior | arXiv: 2503.04207
- dreamtext high fidelity scene text synthesis | arXiv: 2405.14701
- hsemotion team at abaw-10 competition facial expression recognition valence-arou | arXiv: 2603.12693
- improving autoregressive visual generation with cluster-oriented token predictio | arXiv: 2501.00880
- lost in translation found in context sign language translation with contextual c | arXiv: 2501.09754
- mxnorm reusing mxfp block scales for efficient tensor normalisation | arXiv: 2603.13180
- precise event spotting in sports videos solving long-range dependency and class | arXiv: 2503.00147
- robust message embedding via attention flow-based steganography
- scamo exploring the scaling law in autoregressive motion generation model | arXiv: 2412.14559
- softshadow leveraging soft masks for penumbra-aware shadow removal | arXiv: 2409.07041
- the change you want to detect semantic change detection in earth observation wit
- the scene language representing scenes with programs words and embeddings | arXiv: 2410.16770
- vires video instance repainting via sketch and text guided generation
- osrcir reflective cot | arXiv: 2412.11077
- videoespresso cot reasoning | arXiv: 2411.14794
- empowering llms to understand and generate complex vector graphics
- order-robust class incremental learning graph-driven dynamic similarity grouping | arXiv: 2502.20032
- mr plip multi resolution pathology | arXiv: 2504.18856
- autossvh exploring automated frame sampling for efficient self-supervised video h | arXiv: 2504.03587
- l swag zero shot nas vision transformers | arXiv: 2505.07300
- harnessing frozen unimodal encoders for flexible multimodal alignment | arXiv: 2409.19425
- semantic and expressive variations in image captions across languages | arXiv: 2310.14356
- smtpd a new benchmark for temporal prediction of social media popularity | arXiv: 2503.04446
- document haystacks vision-language reasoning over piles of 1000 documents | arXiv: 2411.16740
- homesafe-bench evaluating vision-language models on unsafe action detection for | arXiv: 2603.11975
- multi-modal contrastive masked autoencoders a two-stage progressive pre-training | arXiv: 2408.02245
- on the out-of-distribution generalization of large multimodal models | arXiv: 2402.06599
- videoglamm a large multimodal model for pixel-level visual grounding in videos | arXiv: 2411.04923
- generative modeling of class probability for multi modal representation learning | arXiv: 2503.17417
- mulsen ad multi sensor anomaly detection | arXiv: 2412.14592
- sdf-net structure-aware disentangled feature learning for optical-sar ship re-i | arXiv: 2603.12588
- strap-vit segregated tokens with randomized -- transformations for defense again | arXiv: 2603.12688
- calf communication aware distributed rl | arXiv: 2603.12543
- asap advancing semantic alignment promotes multi-modal manipulation de | arXiv: 2412.12718
- coordinated manipulation hybrid deformable rigid objects | arXiv: 2603.12940
- foundations of the theory of performance based ranking | arXiv: 2412.04227
- lift3d policy lifting 2d foundation models for robust 3d robotic manipulation | arXiv: 2411.18623
- assessing and learning alignment of unimodal vision and language model | arXiv: 2412.04616
- autossvh exploring automated frame sampling for efficient self-supervised video
- chexworld exploring image world modeling for radiograph representation learning
- as language models scale low-order linear depth dynamics emerge | arXiv: 2603.12541
- as language models scale low-order linear depth dynamics emerge v2 | arXiv: 2603.12541
- classifier-guided clip distillation for unsupervised multi-label classification | arXiv: 2503.16873
- classifier-to-bias toward unsupervised automatic bias detection for visual class | arXiv: 2504.20902
- learning from neighbors category extrapolation for long-tail learning | arXiv: 2410.15980
- let samples speak mitigating spurious correlation by exploiting the clusterness
- 4real-video learning generalizable photo-realistic 4d video diffusion | arXiv: 2412.04462
- animateanything consistent and controllable animation for video generation | arXiv: 2411.10836
- articulated kinematics distillation from video diffusion models | arXiv: 2504.01204
- bf-stvsr b-splines and fourier---best friends for high fidelity spatial-temporal | arXiv: 2501.11043
- can text-to-video generation help video-language alignment | arXiv: 2503.18507
- conmo controllable motion disentanglement and recomposition for zero-shot motion | arXiv: 2504.02451
- dynamic camera poses and where to find them | arXiv: 2504.17788
- dynamicscaler panoramic video | arXiv: 2412.11100
- dynamicscaler seamless and scalable video generation for panoramic scenes
- exploring temporally-aware features for point tracking | arXiv: 2501.12218
- fade frequency-aware diffusion model factorization for video editing | arXiv: 2506.05934
- flashmotion few-step controllable video generation with trajectory guidance | arXiv: 2603.12146
- from slow bidirectional to fast autoregressive video diffusion models | arXiv: 2412.07772
- gen3c 3d-informed world-consistent video generation with precise camera control | arXiv: 2503.03751
- generative inbetweening through frame-wise conditions-driven video generation | arXiv: 2412.11755
- geometry-guided online 3d video synthesis with multi-view temporal consistency | arXiv: 2505.18932
- hoigen-1m a large-scale dataset for human-object interaction video generation | arXiv: 2503.23715
- hunyuanportrait implicit condition control for enhanced portrait animation | arXiv: 2503.18860
- hypernvd accelerating neural video decomposition via hypernetworks | arXiv: 2503.17276
- identity-preserving text-to-video generation by frequency decomposition | arXiv: 2411.17440
- idol instant photorealistic 3d human creation from a single image | arXiv: 2412.14963
- improved video vae for latent video diffusion model | arXiv: 2411.06449
- interdyn controllable interactive dynamics with video diffusion models | arXiv: 2412.11785
- learning from streaming video with orthogonal gradients | arXiv: 2504.01961
- learning temporally consistent video depth from video diffusion priors | arXiv: 2406.01493
- levitor 3d trajectory oriented image-to-video synthesis | arXiv: 2412.15214
- long video diffusion generation with segmented cross-attention and content-rich | arXiv: 2412.01316
- longdiff training-free long video generation in one go | arXiv: 2503.18150
- mimir improving video diffusion models for precise text understanding | arXiv: 2412.03085
- mimo controllable character video synthesis with spatial decomposed modeling | arXiv: 2409.16160
- mind the time temporally-controlled multi-event video generation | arXiv: 2412.05263
- motif making text count in image animation with motion focal loss | arXiv: 2412.16153
- motion modes what could happen next | arXiv: 2412.00148
- motion prompting controlling video generation with motion trajectories | arXiv: 2412.02700
- motionpro a precise motion controller for image-to-video generation | arXiv: 2505.20287
- motionstone decoupled motion intensity modulation with diffusion transformer for | arXiv: 2412.05848
- moviebench a hierarchical movie level dataset for long video generation | arXiv: 2411.15262
- multi-subject open-set personalization in video generation | arXiv: 2501.06187
- navigation world models | arXiv: 2412.03572
- neuro-symbolic evaluation of text-to-video models using formal verification | arXiv: 2411.16718
- one-minute video generation with test-time training | arXiv: 2504.05298
- optical-flow guided prompt optimization for coherent video generation | arXiv: 2411.15540
- osv one step is enough for high-quality image to video generation | arXiv: 2409.11367
- parallelized autoregressive visual generation | arXiv: 2412.15119
- patchvsr breaking video diffusion resolution limits with patch-wise video super- | arXiv: 2509.26025
- pathways on the image manifold image editing via video generation | arXiv: 2411.16819
- phyt2v llm-guided iterative self-refinement for physics-grounded text-to-video g | arXiv: 2412.00596
- posetraj pose-aware trajectory control in video diffusion | arXiv: 2503.16068
- protecting your video content disrupting automated video-based llm annotations | arXiv: 2503.21824
- saw toward a surgical action world model via controllable and scalable video gen | arXiv: 2603.13024
- semantic satellite communications for synchronized audiovisual reconstruction | arXiv: 2603.10791
- shotadapter text-to-multi-shot video generation with diffusion models | arXiv: 2505.07652
- sketchvideo sketch-based video generation and editing | arXiv: 2503.23284
- spatialdreamer self-supervised stereo video synthesis from monocular input | arXiv: 2411.11934
- spatiotemporal skip guidance for enhanced video diffusion sampling | arXiv: 2411.18664
- streamingt2v consistent dynamic and extendable long video generation from text | arXiv: 2403.14773
- streetcrafter street view synthesis with controllable video diffusion models | arXiv: 2412.13188
- taming teacher forcing for masked autoregressive video generation | arXiv: 2501.12389
- teller real-time streaming audio-driven portrait animation with autoregressive m | arXiv: 2503.18429
- the devil is in the prompts retrieval-augmented prompt optimization for text-to- | arXiv: 2504.11739
- through-the-mask mask-based motion trajectories for image-to-video generation | arXiv: 2501.03059
- timestep embedding tells its time to cache for video diffusion model | arXiv: 2411.19108
- tokenmotion decoupled motion control via token disentanglement for human-centric | arXiv: 2504.08181
- tora trajectory-oriented diffusion transformer for video generation | arXiv: 2407.21705
- towards precise scaling laws for video diffusion transformers | arXiv: 2411.17470
- tracktention leveraging point tracking to attend videos faster and better | arXiv: 2503.19904
- transpixeler advancing text-to-video generation with transparency | arXiv: 2501.03006
- unified dense prediction of video diffusion | arXiv: 2503.09344
- veu-bench towards comprehensive understanding of video editing | arXiv: 2504.17828
- video-bench human-aligned video generation benchmark | arXiv: 2504.04907
- video-colbert contextualized late interaction for text-to-video retrieval | arXiv: 2503.19009
- videodirector precise video editing via text-to-video models | arXiv: 2411.17592
- videodpo omni-preference alignment for video diffusion generation | arXiv: 2412.14167
- videogigagan towards detail-rich video super-resolution | arXiv: 2404.12388
- videoguide improving video diffusion models without training through a teachers | arXiv: 2410.04364
- videoscene distilling video diffusion model to generate 3d scenes in one step | arXiv: 2504.01956
- vidtwin video vae with decoupled structure and dynamics | arXiv: 2412.17726
- vires video instance repainting via sketch and text guided generation | arXiv: 2411.16199