ECCV2024 Paper Notes TODO

Total: 1041 papers | Completed: 1041 | Pending update: 0

2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction | arXiv: 2409.09969
3D Congealing: 3D-Aware Image Alignment in the Wild | arXiv: 2404.02125
3D Hand Pose Estimation in Everyday Egocentric Images | arXiv: 2312.06583
3D Reconstruction of Objects in Hands without Real World 3D Supervision | arXiv: 2305.03036
3D Single-Object Tracking in Point Clouds with High Temporal Variation | arXiv: 2408.02049
3DEgo: 3D Editing on the Go! | arXiv: 2407.10102
3dfg-pifu 3d feature grids for human digitization from sparse views
3DGazeNet: Generalizing 3D Gaze Estimation with Weak-Supervision from Synthetic Views | arXiv: 2212.02997
3dsa multi-view 3d human pose estimation with 3d space attention mechanisms
3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting | arXiv: 2408.03753
3×2: 3D Object Part Segmentation by 2D Semantic Correspondences | arXiv: 2407.09648
4D Contrastive Superflows are Dense 3D Representation Learners | arXiv: 2407.06190
4diff 3d-aware diffusion model for third-to-first viewpoint translation
6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model | arXiv: 2407.15484
a cephalometric landmark regression method based on dual-encoder for high-resolu
A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks | arXiv: 2407.13863
A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis | arXiv: 2311.12897
A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control | arXiv: 2407.15631
a direct approach to viewing graph solvability
A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation | arXiv: 2406.07320
A High-Quality Robust Diffusion Framework for Corrupted Dataset | arXiv: 2311.17101
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis | arXiv: 2503.06973
A New Dataset and Framework for Real-World Blurred Images Super-Resolution | arXiv: 2407.14880
A Probability-guided Sampler for Neural Implicit Surface Rendering | arXiv: 2506.08619
a rotation-invariant texture vit for fine-grained recognition of esophageal canc
A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties | arXiv: 2312.13764
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars | arXiv: 2401.04730
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting | arXiv: 2401.10227
A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging | arXiv: 2407.21517
ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-Agnostic Counting | arXiv: 2309.04820
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation | arXiv: 2407.10738
Accelerating Image Super-Resolution Networks with Pixel-Level Classification | arXiv: 2407.21448
Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention | arXiv: 2407.06683
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos | arXiv: 2406.09272
ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos | arXiv: 2407.12987
ActionVOS: Actions as Prompts for Video Object Segmentation | arXiv: 2407.07402
Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images | arXiv: 2303.11530
Active Generation for Image Classification | arXiv: 2403.06517
AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection | arXiv: 2407.15795
AdaDiffSR: Adaptive Region-Aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution | arXiv: 2410.17752
AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition | arXiv: 2407.01332
AdaGen: Learning Adaptive Policy for Image Synthesis | arXiv: 2603.06993
AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale | arXiv: 2404.03482
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer | arXiv: 2407.12951
AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation | arXiv: 2409.00342
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts | arXiv: 2407.14872
Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth | arXiv: 2406.00474
Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction | arXiv: 2403.07263
Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling | arXiv: 2407.08256
Adaptive Correspondence Scoring for Unsupervised Medical Image Registration | arXiv: 2312.00837
Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification | arXiv: 2410.06977
Adaptive Human Trajectory Prediction via Latent Corridors | arXiv: 2312.06653
Adaptive Multi-head Contrastive Learning | arXiv: 2310.05615
adaptive multi-task learning for few-shot object detection
Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing | arXiv: 2409.11738
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting | arXiv: 2403.09513
addme zero-shot group-photo synthesis by inserting people into scenes
AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization | arXiv: 2407.08156
ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation | arXiv: 2408.09042
admap anti-disturbance framework for vectorized hd map construction
adversarially robust distillation by reducing the student-teacher variance gap
aednet adaptive embedding and multiview-aware disentanglement for point cloud co
aff-ttention affordances and attention models for short-term object interaction | arXiv: 2406.01194
afreeca annotation-free counting for all | arXiv: 2403.04943
agent3d-zero an agent for zero-shot 3d understanding | arXiv: 2403.11835
aid-appeal automatic image dataset and algorithm for content appeal enhancement | arXiv: 2407.05546
align before collaborate mitigating feature misalignment for robust multi-agent
alignist cad-informed orientation distribution estimation by fusing shape and co | arXiv: 2409.06683
alternate diverse teaching for semi-supervised medical image segmentation | arXiv: 2311.17325
amego active memory from long egocentric videos | arXiv: 2409.10917
an economic framework for 6-dof grasp detection
an incremental unified framework for small defect inspection
analysis-by-synthesis transformer for single-view 3d reconstruction
analytic-splatting anti-aliased 3d gaussian splatting via analytic integration | arXiv: 2403.11056
animatabledreamer text-guided non-rigid 3d model generation and reconstruction w | arXiv: 2312.03795
any target can be offense adversarial example generation via generalized latent | arXiv: 2407.12292
anycontrol create your artwork with versatile control on text-to-image generatio | arXiv: 2406.18958
anytime continual learning for open vocabulary classification | arXiv: 2409.08518
apl anchor-based prompt learning for one-stage weakly supervised referring expre
approaching outside scaling unsupervised 3d object detection from 2d scene | arXiv: 2407.08569
architecture-agnostic untrained network priors for image reconstruction with fre | arXiv: 2312.09988
artvlm attribute recognition through vision-based prefix language modeling | arXiv: 2408.04102
asymmetric mask scheme for self-supervised real image denoising | arXiv: 2407.06514
attention decomposition for cross-domain semantic segmentation
attention prompting on image for large vision-language models | arXiv: 2409.17143
attnzero efficient attention discovery for vision transformers
audio-driven talking face generation with stabilized synchronization loss | arXiv: 2307.09368
augdetr improving multi-scale learning for detection transformer
auto-das automated proxy discovery for training-free distillation-aware architec
auto-gas automated proxy discovery for training-free generative architecture sea
avatar fingerprinting for authorized use of synthetic talking-head videos | arXiv: 2305.03713
bad students make great teachers active learning accelerates large-scale visual
bad-gaussians bundle adjusted deblur gaussian splatting | arXiv: 2403.11831
bam-detr boundary-aligned moment detection transformer for temporal sentence gro | arXiv: 2312.00083
bamm bidirectional autoregressive motion model | arXiv: 2403.19435
basic bayesnet structure learning for computational scalable neural image compre
bayesian evidential deep learning for online action detection
be yourself bounded attention for multi-subject text-to-image generation | arXiv: 2403.16990
beaf observing before-after changes to evaluate hallucination in vision-language | arXiv: 2407.13442
beat-it beat-synchronized multi-condition 3d dance generation | arXiv: 2407.07554
benchmarks and challenges in pose estimation for egocentric hand interactions wi | arXiv: 2403.16428
benerf neural radiance fields from a single blurry image and event stream | arXiv: 2407.02174
beta-tuned timestep diffusion model
bi-directional contextual attention for 3d dense captioning | arXiv: 2408.06662
bi-mdrg bridging image history in multimodal dialogue response generation | arXiv: 2408.05926
bi-tta bidirectional test-time adapter for remote physiological measurement | arXiv: 2409.17316
bidirectional stereo image compression with cross-dimensional entropy model | arXiv: 2407.10632
bidirectional uncertainty-based active learning for open-set annotation | arXiv: 2402.15198
binomial self-compensation for motion error in dynamic 3d scanning | arXiv: 2404.06693
blazebvd make scale-time equalization great again for blind video deflickering | arXiv: 2403.06243
blind image deblurring with noise-robust kernel estimation
blink multimodal large language models can see but not perceive | arXiv: 2404.12390
boosting 3d single object tracking with 2d matching distillation and 3d pre-trai
brain netflix scaling data to reconstruct videos from brain signals
brain-id learning contrast-agnostic anatomical representations for brain imaging | arXiv: 2311.16914
brave broadening the visual encoding of vision-language models | arXiv: 2404.07204
bridge past and future overcoming information asymmetry in incremental object de | arXiv: 2407.11499
bridging the gap between human motion and action semantics via kinematic phrases
bridging the gap studio-like avatar creation from a monocular phone capture | arXiv: 2407.19593
brushnet a plug-and-play image inpainting model with decomposed dual-branch diff | arXiv: 2403.06976
byteedit boost comply and accelerate generative image editing | arXiv: 2404.04860
caesarnerf calibrated semantic representation for few-shot generalizable neural | arXiv: 2311.15510
camera height doesn't change unsupervised training for metric monocular road- | arXiv: 2312.04530
can ood object detectors learn from foundation models | arXiv: 2409.05162
canonicalfusion generating drivable 3d human avatars from multiple images | arXiv: 2407.04345
cardiacnet learning to reconstruct abnormalities for cardiac disease assessment | arXiv: 2410.20769
carformer self-driving with learned object-centric representations | arXiv: 2407.15843
cat enhancing multimodal large language model to answer questions in dynamic aud
category adaptation meets projected distillation in generalized continual catego | arXiv: 2308.12112
cg-slam efficient dense rgb-d slam in a consistent uncertainty-aware 3d gaussian | arXiv: 2403.16095
challenging forgets unveiling the worst-case forget sets in machine unlearning | arXiv: 2403.07362
chameleon a data-efficient generalist for dense visual prediction in the wild | arXiv: 2404.18459
chex interactive localization and region description in chest x-rays | arXiv: 2404.15770
citygaussian real-time high-quality large-scale scene rendering with gaussians | arXiv: 2404.01133
clap isolating content from style through contrastive learning with augmented pr | arXiv: 2311.16445
classification matters improving video action detection with class-specific atte | arXiv: 2407.19698
click-gaussian interactive segmentation to any 3d gaussians | arXiv: 2407.11793
clip-guided generative networks for transferable targeted adversarial attacks | arXiv: 2407.10179
cloudfixer test-time adaptation for 3d point clouds via diffusion-guided geometr
clr-gan improving gans stability and quality via consistent latent representatio
co-synthesis of histopathology nuclei image-label pairs using a context-conditio | arXiv: 2407.14434
coherentgs sparse novel view synthesis with coherent 3d gaussians | arXiv: 2403.19495
coho context-sensitive city-scale hierarchical urban layout generation | arXiv: 2407.11294
coin control-inpainting diffusion prior for human and camera motion estimation | arXiv: 2408.16426
coin-matting confounder intervention for image matting
cola conditional dropout and language-driven robust dual-modal salient object de | arXiv: 2407.06780
coleaf a contrastive-collaborative learning framework for weakly supervised audi | arXiv: 2405.10690
collaborative control for geometry-conditioned pbr image generation | arXiv: 2402.05919
colormae exploring data-independent masking strategies in masked autoencoders | arXiv: 2407.13036
colormnet a memory-based deep spatial-temporal feature propagation network for v | arXiv: 2404.06251
colorpeel color prompt learning with diffusion models via color and shape disent | arXiv: 2407.07197
combining generative and geometry priors for wide-angle portrait correction | arXiv: 2410.09911
comboverse compositional 3d assets creation using spatially-aware diffusion guid | arXiv: 2403.12409
como controllable motion generation through language guided pose code editing | arXiv: 2403.13900
compress3d a compressed latent space for 3d generation from a single image | arXiv: 2403.13524
confidence self-calibration for multi-label class-incremental learning | arXiv: 2403.12559
congeo robust cross-view geo-localization across ground view variations | arXiv: 2403.13965
contourlet residual for prompt learning enhanced infrared image super-resolution
controllable navigation instruction generation with chain of thought prompting | arXiv: 2407.07433
controlling the world by sleight of hand | arXiv: 2408.07147
controlllm augment language models with tools by searching on graphs | arXiv: 2310.17796
controlnet++ improving conditional controls with efficient consistency feedback | arXiv: 2404.07987
cor-gs sparse-view 3d gaussian splatting via co-regularization | arXiv: 2405.12110
cores orchestrating the dance of reasoning and segmentation | arXiv: 2404.05673
cpm class-conditional prompting machine for audio-visual segmentation | arXiv: 2407.05358
crm single image to 3d textured mesh with convolutional reconstruction model | arXiv: 2403.05034
cross-domain learning for video anomaly detection with limited supervision | arXiv: 2408.05191
cross-platform video person reid a new benchmark dataset and adaptation approach | arXiv: 2408.07500
crossglg llm guides one-shot skeleton-based 3d action recognition in a cross-lev | arXiv: 2403.10082
crossscore towards multi-view image evaluation and scoring | arXiv: 2404.14409
cs2k class-specific and class-shared knowledge guidance for incremental semantic | arXiv: 2407.09047
csot cross-scan object transfer for semi-supervised lidar object detection
cut out the middleman revisiting pose-based gait recognition
d-sco dual-stream conditional diffusion for monocular hand-held object reconstru | arXiv: 2311.14189
damsdet dynamic adaptive multispectral detection transformer with competitive qu | arXiv: 2403.00326
data collection-free masked video modeling | arXiv: 2409.06665
dataset enhancement with instance-level augmentations | arXiv: 2406.08249
dataset growth | arXiv: 2405.18347
datenerf depth-aware text-based editing of nerfs | arXiv: 2404.04526
dc-solver improving predictor-corrector diffusion sampler via dynamic compensati | arXiv: 2409.03755
dcdm diffusion-conditioned-diffusion model for scene text image super-resolution
de-confounded gaze estimation
deblur e-nerf nerf from motion-blurred events under high-speed or low-light cond | arXiv: 2409.17988
deceptive-nerf3dgs diffusion-generated pseudo-observations for high-quality spar | arXiv: 2305.15171
decomposed vector-quantized variational autoencoder for human grasp generation | arXiv: 2407.14062
decoupling common and unique representations for multimodal self-supervised lear | arXiv: 2309.05300
deep cost ray fusion for sparse depth video completion | arXiv: 2409.14935
deep nets with subsampling layers unwittingly discard useful activations at test | arXiv: 2410.01083
deep patch visual slam | arXiv: 2408.01654
defect spectrum a granular look of large-scale defect datasets with rich semanti
denoisplit a method for joint microscopy image splitting and unsupervised denois | arXiv: 2403.11854
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | arXiv: 2403.19588
detailsemnet elevating signature verification through detail-semantic integratio | arXiv: 2511.16364
detecting as labeling rethinking lidar-camera fusion in 3d object detection | arXiv: 2311.07152
dg-pic domain generalized point-in-context learning for point cloud understandin | arXiv: 2407.08801
diff-tracker text-to-image diffusion models are unsupervised trackers | arXiv: 2407.08394
differentiable convex polyhedra optimization from multi-view images | arXiv: 2407.15686
diffit diffusion vision transformers for image generation | arXiv: 2312.02139
diffusion model is a good pose estimator from 3d rf-vision | arXiv: 2403.16198
diffusion models for monocular depth estimation overcoming challenging condition | arXiv: 2407.16698
diffusion models for open-vocabulary segmentation | arXiv: 2306.09316
diffusion-based image-to-image translation by noise correction via prompt interp | arXiv: 2409.08077
diffusion-driven data replay a novel approach to combat forgetting in federated | arXiv: 2409.01128
diffusiondepth diffusion denoising approach for monocular depth estimation | arXiv: 2303.05021
dino-tracker taming dino for self-supervised point tracking in a single video | arXiv: 2403.14548
disco embodied navigation and interaction via differentiable scene semantics and | arXiv: 2407.14758
distill gold from massive ores bi-level data pruning towards efficient dataset d | arXiv: 2305.18381
distilling diffusion models into conditional gans | arXiv: 2405.05967
distribution alignment for fully test-time adaptation with dynamic online data s | arXiv: 2407.12128
distribution-aware robust learning from long-tailed data with noisy labels | arXiv: 2407.16802
divide and fuse body part mesh recovery from partially visible human images | arXiv: 2407.09694
domain reduction strategy for non-line-of-sight imaging | arXiv: 2308.10269
domain-adaptive video deblurring via test-time blurring | arXiv: 2407.09059
domesticating sam for breast ultrasound image segmentation via spatial-frequency
draganything motion control for anything using entity representation | arXiv: 2403.07420
dragapart learning a part-level motion prior for articulated objects | arXiv: 2403.15382
dreamdiffusion high-quality eeg-to-image generation with temporal masked signal
dreamdissector learning disentangled text-to-3d generation from 2d diffusion pri | arXiv: 2407.16260
dreamdrone text-to-image diffusion models are zero-shot perpetual view generator | arXiv: 2312.08746
dreamlip language-image pre-training with long captions | arXiv: 2403.17007
dreammotion space-time self-similar score distillation for zero-shot video editi | arXiv: 2403.12002
dreammover leveraging the prior of diffusion models for image interpolation with | arXiv: 2409.09605
dreamscene360 unconstrained text-to-3d scene generation with panoramic gaussian | arXiv: 2404.06903
dreamstruct understanding slides and user interfaces via synthetic data generati | arXiv: 2410.00201
dreamview injecting view-specific text guidance into text-to-3d generation | arXiv: 2404.06119
dropout mixture low-rank adaptation for visual parameters-efficient fine-tuning
dspdet3d 3d small object detection with dynamic spatial pruning | arXiv: 2305.03716
dual-level adaptive self-labeling for novel class discovery in point cloud segme | arXiv: 2407.12489
dvlo deep visual-lidar odometry with local-to-global feature fusion and bi-direc | arXiv: 2403.18274
dynamic neural radiance field from defocused monocular video | arXiv: 2407.05586
dyset a dynamic masked self-distillation approach for robust trajectory predicti
eaformer scene text segmentation with edge-aware transformers | arXiv: 2407.17020
early preparation pays off new classifier pre-tuning for class incremental seman | arXiv: 2407.14142
ebdm exemplar-guided image translation with brownian-bridge diffusion models | arXiv: 2410.09802
echoscene indoor scene generation via information echo over scene graph diffusio | arXiv: 2405.00915
edformer transformer-based event denoising across varied noise levels
editable image elements for controllable synthesis | arXiv: 2404.16029
edtalk efficient disentanglement for emotional talking head synthesis | arXiv: 2404.01647
efficient and versatile robust fine-tuning of zero-shot models | arXiv: 2408.05749
efficient cascaded multiscale adaptive network for image restoration
efficient depth-guided urban view synthesis | arXiv: 2407.12395
efficient diffusion transformer with step-wise dynamic attention mediators | arXiv: 2408.05710
efficient few-shot action recognition via multi-level post-reasoning
efficient image pre-training with siamese cropped masked autoencoders | arXiv: 2403.17823
efficient inference of vision instruction-following models with elastic cache | arXiv: 2407.18121
egoexo-fitness towards egocentric and exocentric full-body action understanding | arXiv: 2406.08877
egoposer robust real-time egocentric pose estimation from sparse and intermitten | arXiv: 2308.06493
elegantly written disentangling writer and character styles for enhancing online
elevating all zero-shot sketch-based image retrieval through multimodal prompt l | arXiv: 2407.04207
eliminating feature ambiguity for few-shot segmentation | arXiv: 2407.09842
eliminating warping shakes for unsupervised online video stitching | arXiv: 2403.06378
else efficient deep neural network inference through line-based sparsity explora
elysium exploring object-level perception in videos via mllm | arXiv: 2403.16558
emdm efficient motion diffusion model for fast and high-quality motion generatio | arXiv: 2312.02256
energy-induced explicit quantification for multi-modality mri fusion
enhancing diffusion models with text-encoder reinforcement learning | arXiv: 2311.15657
enhancing optimization robustness in 1-bit neural networks through stochastic si
enhancing perceptual quality in video super-resolution through temporally-consis | arXiv: 2311.15908
enhancing vectorized map perception with historical rasterized maps | arXiv: 2409.00620
equi-gspr equivariant se3 graph network model for sparse point cloud registratio | arXiv: 2410.05729
equivariant spatio-temporal self-supervision for lidar object detection | arXiv: 2404.11737
et the exceptional trajectories text-to-camera-trajectory generation with charac | arXiv: 2407.01516
eta inversion designing an optimal eta function for diffusion-based real image e | arXiv: 2403.09468
evaluating text-to-visual generation with image-to-text generation | arXiv: 2404.01291
event trojan asynchronous event-based backdoor attacks | arXiv: 2407.06838
event-based head pose estimation benchmark and method
event-based mosaicing bundle adjustment | arXiv: 2409.07365
evsign sign language recognition and translation with streaming events | arXiv: 2407.12593
exemplar-free continual representation learning via learnable drift compensation | arXiv: 2407.08536
explicitly guided information interaction network for cross-modal point cloud co | arXiv: 2407.02887
exploiting dual-correlation for multi-frame time-of-flight denoising
exploring guided sampling of conditional gans
exploring pre-trained text-to-video diffusion models for referring video object | arXiv: 2403.12042
exploring the feature extraction and relation modeling for light-weight transfor
external knowledge enhanced 3d scene generation from sketch | arXiv: 2403.14121
eyes closed safety on protecting multimodal llms via image-to-text transformatio | arXiv: 2403.09572
facial affective behavior analysis with instruction tuning | arXiv: 2404.05052
falip visual prompt as foveal attention boosts clip zero-shot performance | arXiv: 2407.05578
fastcad real-time cad retrieval and alignment from scans and videos | arXiv: 2403.15161
fine-grained scene graph generation via sample-level bias prediction | arXiv: 2407.19259
finematch aspect-based fine-grained image and text mismatch detection and correc | arXiv: 2404.14715
finepseudo improving pseudo-labelling through temporal-alignablity for semi-supe | arXiv: 2409.01448
fisher calibration for backdoor-robust heterogeneous federated learning
fisherrf active view selection and mapping with radiance fields using fisher inf
flash cache reducing bias in radiance cache based inverse rendering | arXiv: 2409.05867
flashsplat 2d to 3d gaussian splatting segmentation solved optimally | arXiv: 2409.08270
flashtex fast relightable mesh texturing with lightcontrolnet | arXiv: 2402.13251
flat flux-aware imperceptible adversarial attacks on 3d point clouds
flexattention for efficient high-resolution vision-language models | arXiv: 2407.20228
flowcon out-of-distribution detection using flow-based contrastive learning | arXiv: 2407.03489
flying with photons rendering novel views of propagating light | arXiv: 2404.06493
forest2seq revitalizing order prior for sequential indoor scene synthesis | arXiv: 2407.05388
formula-supervised visual-geometric pre-training | arXiv: 2409.13535
foster adaptivity and balance in learning with noisy labels | arXiv: 2407.02778
foundpose unseen object pose estimation with foundation features | arXiv: 2311.18809
fouriscale a frequency perspective on training-free high-resolution image synthe | arXiv: 2403.12963
free-viewpoint video of outdoor sports using a flying camera
freeaugment data augmentation search across all degrees of freedom | arXiv: 2409.04820
freecompose generic zero-shot image composition with diffusion prior | arXiv: 2407.04947
freediff progressive frequency truncation for image editing with diffusion model | arXiv: 2404.11895
freeinit bridging initialization gap in video diffusion models | arXiv: 2312.07537
freemotion a unified framework for number-free text-to-motion synthesis | arXiv: 2405.15763
freemotion mocap-free human motion synthesis with multimodal large language mode | arXiv: 2406.10740
freestyleret retrieving images from style-diversified queries | arXiv: 2312.02428
frequency-spatial entanglement learning for camouflaged object detection | arXiv: 2409.01686
frest feature restoration for semantic segmentation under multiple adverse condi | arXiv: 2407.13437
fsd-bev foreground self-distillation for multi-view 3d object detection | arXiv: 2407.10135
fully sparse 3d occupancy prediction | arXiv: 2312.17118
functional transform-based low-rank tensor factorization for multi-dimensional d
funqa towards surprising video comprehension | arXiv: 2306.14899
futuredepth learning to predict the future improves video depth estimation | arXiv: 2403.12953
g2fr frequency regularization in grid-based feature encoding neural radiance fie
g3r gradient guided generalizable reconstruction | arXiv: 2409.19405
garmentaligner text-to-garment generation via retrieval-augmented multi-level co | arXiv: 2408.12352
gaura generalizable approach for unified restoration and rendering of arbitrary | arXiv: 2407.08221
gaussctrl multi-view consistent text-driven 3d gaussian splatting editing | arXiv: 2403.08733
gaussian grouping segment and edit anything in 3d scenes | arXiv: 2312.00732
gaussianformer scene as gaussians for vision-based 3d semantic occupancy predict | arXiv: 2405.17429
gaussianimage 1000 fps image representation and compression by 2d gaussian splat | arXiv: 2403.08551
gaussreg fast 3d registration with gaussian splatting | arXiv: 2407.05254
gaze target detection based on head-local-global coordination
gazexplain learning to predict natural language explanations of visual scanpaths
general and task-oriented video segmentation | arXiv: 2407.06540
generalizable facial expression recognition | arXiv: 2408.10614
generating 3d house wireframes with semantics | arXiv: 2407.12267
generating human interaction motions in scenes with text control | arXiv: 2404.10685
generative camera dolly extreme monocular dynamic novel view synthesis | arXiv: 2405.14868
genixer empowering multimodal large language model as a powerful data generator | arXiv: 2312.06731
genq quantization in low data regimes with generative synthetic data | arXiv: 2312.05272
geometrysticker enabling ownership claim of recolorized neural radiance fields | arXiv: 2407.13390
geowizard unleashing the diffusion priors for 3d geometry estimation from a sing | arXiv: 2403.12013
getting it right improving spatial consistency in text-to-image models | arXiv: 2404.01197
git towards generalist vision transformer through universal language interface | arXiv: 2403.09394
gkgnet group k-nearest neighbor based graph convolutional network for multi-labe | arXiv: 2308.14378
global-to-pixel regression for human mesh recovery
goldfish vision-language understanding of arbitrarily long videos | arXiv: 2407.12679
gpsformer a global perception and local structure fitting-based transformer for | arXiv: 2407.13519
gra detecting oriented objects through group-wise rotating and attention | arXiv: 2403.11127
grace graph-based contextual debiasing for fair visual question answering
gradient-regularized out-of-distribution detection | arXiv: 2404.12368
graphbev towards robust bev feature alignment for multi-modal 3d object detectio | arXiv: 2403.11848
graspxl generating grasping motions for diverse objects at scale | arXiv: 2403.19649
grm large gaussian reconstruction model for efficient 3d reconstruction and gene | arXiv: 2403.14621
groma localized visual tokenization for grounding multimodal large language mode | arXiv: 2404.13013
grounding language models for visual entity recognition | arXiv: 2402.18695
gs-lrm large reconstruction model for 3d gaussian splatting | arXiv: 2404.19702
gs-pose category-level object pose estimation via geometric and semantic corresp | arXiv: 2311.13777
gtp-4o modality-prompted heterogeneous graph learning for omni-modal biomedical | arXiv: 2407.05540
gvgen text-to-3d generation with volumetric representation | arXiv: 2403.12957
h-v2x a large scale highway dataset for bev perception
hac hash-grid assisted context for 3d gaussian splatting compression | arXiv: 2403.14530
handling the non-smooth challenge in tensor svd a multi-objective tensor recover | arXiv: 2311.13958
harnessing text-to-image diffusion models for category-agnostic pose estimation
hat history-augmented anchor transformer for online temporal action localization | arXiv: 2408.06437
headgas real-time animatable head avatars via 3d gaussian splatting | arXiv: 2312.02902
heterogeneous graph learning for scene graph prediction in 3d point clouds
hiding imperceptible noise in curvature-aware patches for 3d point cloud attack
hiei a universal framework for generating high-quality emerging images from natu
hierarchical temporal context learning for camera-based semantic scene completio | arXiv: 2407.02077
hierarchically structured neural bones for reconstructing animatable objects fro | arXiv: 2408.00351
high-fidelity 3d textured shapes generation by sparse encoding and adversarial d
high-precision self-supervised monocular depth estimation with rich-resource pri | arXiv: 2408.00361
high-resolution and few-shot view synthesis from asymmetric dual-lens inputs
himo a new benchmark for full-body human interacting with multiple objects | arXiv: 2407.12371
how video meetings change your expression | arXiv: 2406.00955
hpe-li wifi-enabled lightweight dual selective kernel convolution for human pose
hpff hierarchical locally supervised learning with patch feature fusion | arXiv: 2407.05638
human hair reconstruction with strand-aligned 3d gaussians
human motion forecasting in dynamic domain shifts a homeostatic continual test-t
humos human motion model conditioned on body shape | arXiv: 2409.03944
hybridbooth hybrid prompt inversion for efficient subject-driven generation | arXiv: 2410.08192
hydra a hyper agent for dynamic compositional visual reasoning | arXiv: 2403.12884
hyperion - a fast versatile symbolic gaussian belief propagation framework for c | arXiv: 2407.07074
i can't believe it's not scene flow
i-medsam implicit medical image segmentation with segment anything | arXiv: 2311.17081
i2-slam inverting imaging process for robust photorealistic dense slam
iam-vfi interpolate any motion for video frame interpolation with motion complex
idempotent unsupervised representation learning for skeleton-based action recogn | arXiv: 2410.20349
idol unified dual-modal latent diffusion for human-centric joint video-depth gen | arXiv: 2407.10937
image demoiréing in raw and srgb domains
image-feature weak-to-strong consistency an enhanced paradigm for semi-supervise | arXiv: 2408.12614
imaging interiors an implicit solution to electromagnetic inverse scattering pro | arXiv: 2407.09352
implicit concept removal of diffusion models | arXiv: 2310.05873
implicit filtering for learning neural signed distance functions from 3d point c | arXiv: 2407.13342
implicit style-content separation using b-lora | arXiv: 2403.14572
improving 2d feature representations by 3d-aware fine-tuning | arXiv: 2407.20229
improving agent behaviors with rl fine-tuning for autonomous driving | arXiv: 2409.18343
improving domain generalization in self-supervised monocular depth estimation vi | arXiv: 2411.02149
improving intervention efficacy via concept realignment in concept bottleneck mo | arXiv: 2405.01531
improving knowledge distillation via regularizing feature direction and norm
improving medical multi-modal contrastive learning with expert annotations | arXiv: 2403.10153
improving point-based crowd counting and localization based on auxiliary point g | arXiv: 2405.10589
improving zero-shot generalization for clip with variational adapter
infinite-id identity-preserved personalization via id-semantics decoupling parad | arXiv: 2403.11781
infmae a foundation model in the infrared modality | arXiv: 2402.00407
instance-dependent noisy-label learning with graphical model based noise-rate es | arXiv: 2305.19486
integrating markov blanket discovery into causal representation learning for dom
interactive 3d object detection with prompts
interleaving one-class and weakly-supervised models with adaptive thresholding f | arXiv: 2401.13551
intrinsic single-image hdr reconstruction | arXiv: 2409.13803
invertible neural warp for nerf | arXiv: 2407.12354
irgen generative modeling for image retrieval | arXiv: 2303.10126
is retain set all you need in machine unlearning restoring performance of unlear | arXiv: 2404.12922
is user feedback always informative retrieval latent defending for semi-supervis | arXiv: 2407.15383
isomorphic pruning for vision models | arXiv: 2407.04616
ittakestwo leveraging peer representations for semi-supervised lidar semantic se | arXiv: 2407.07171
ivtp instruction-guided visual token pruning for large vision-language models
joint rgb-spectral decomposition model guided image enhancement in mobile photog | arXiv: 2407.17996
jointdreamer ensuring geometry consistency and text congruence in text-to-3d gen | arXiv: 2407.12291
kalman-inspired feature propagation for video face super-resolution | arXiv: 2408.05205
l-differ single image reflection removal with language-based diffusion model
label-anticipated event disentanglement for audio-visual video parsing | arXiv: 2407.08126
lagrangian hashing for compressed neural field representations | arXiv: 2409.05334
lami-detr open-vocabulary detection with language model instruction | arXiv: 2407.11335
language-driven 6-dof grasp detection using negative prompt guidance | arXiv: 2407.13842
lapose laplacian mixture shape modeling for rgb-based category-level object pose | arXiv: 2409.15727
lara efficient large-baseline radiance fields | arXiv: 2407.04699
large motion model for unified multi-modal motion generation | arXiv: 2404.01284
lass3d language-assisted semi-supervised 3d semantic segmentation with progressi
latent guard a safety framework for text-to-image generation | arXiv: 2404.08031
latent-inr a flexible framework for implicit representations of videos with disc | arXiv: 2408.02672
layeredflow a real-world benchmark for non-lambertian multi-layer optical flow | arXiv: 2409.05688
layoutdetr detection transformer is a good multimodal layout designer | arXiv: 2212.09877
lazy diffusion transformer for interactive image editing | arXiv: 2404.12382
lcm-lookahead for encoder-based text-to-image personalization | arXiv: 2404.03620
learn from the learnt source-free active domain adaptation via contrastive sampl | arXiv: 2407.18899
learning 3d geometry and feature consistent gaussian splatting for object remova | arXiv: 2404.13679
learning 3d-aware gans from unposed images with template feature field | arXiv: 2404.05705
learning anomalies with normality prior for unsupervised video anomaly detection
learning camouflaged object detection from noisy pseudo label | arXiv: 2407.13157
learning chain of counterfactual thought for bias-robust vision-language reasoni
learning cross-hand policies of high-dof reaching and grasping | arXiv: 2404.09150
learning differentially private diffusion models via stochastic adversarial dist | arXiv: 2408.14738
learning exhaustive correlation for spectral super-resolution where spatial-spec | arXiv: 2312.12833
learning from the web language drives weakly-supervised incremental learning for | arXiv: 2407.13363
learning representations of satellite images from metadata supervision
learning semantic latent directions for accurate and controllable human motion p | arXiv: 2407.11494
learning to generate conditional tri-plane for 3d-aware expression controllable | arXiv: 2404.00636
learning to obstruct few-shot image classification over restricted classes | arXiv: 2409.19210
learning to robustly reconstruct dynamic scenes from low-light spike streams
learning trimodal relation for audio-visual question answering with missing moda | arXiv: 2407.16171
lego learning egocentric action frame generation via visual instruction tuning | arXiv: 2312.03849
lego learning to disentangle and invert personalized concepts beyond object appe | arXiv: 2311.13833
leia latent view-invariant embeddings for implicit 3d articulation | arXiv: 2409.06703
leveraging hierarchical feature sharing for efficient dataset condensation | arXiv: 2310.07506
leveraging temporal contextualization for video action recognition | arXiv: 2404.09490
lgm large multi-view gaussian model for high-resolution 3d content creation | arXiv: 2402.05054
lidar-event stereo fusion with hallucinations | arXiv: 2408.04633
lift a surprisingly simple lightweight feature transform for dense vit descripto | arXiv: 2403.14625
linearly controllable gan unsupervised feature categorization and decomposition
listen to look into the future audio-visual egocentric gaze anticipation | arXiv: 2305.03907
livehps robust and coherent motion capture in dynamic free environment | arXiv: 2407.09833
livephoto real image animation with text-guided motion control | arXiv: 2312.02928
llm as copilot for coarse-grained vision-and-language navigation
ln3diff scalable latent neural fields diffusion for speedy 3d generation | arXiv: 2403.12019
loa-trans enhancing visual grounding by location-aware transformers
local action-guided motion diffusion model for text-to-motion generation | arXiv: 2407.10528
local all-pair correspondence for point tracking
long-tail temporal action segmentation with group-wise temporal logit adjustment | arXiv: 2408.09919
m&m's a benchmark to evaluate tool-use for multi-step multi-modal tasks
m2d2m multi-motion generation from text with discrete diffusion models | arXiv: 2407.14502
macdiff unified skeleton modeling with masked conditional diffusion | arXiv: 2409.10473
magdiff multi-alignment diffusion for high-fidelity video generation and editing | arXiv: 2311.17338
magiceraser erasing any objects via semantics-aware control | arXiv: 2410.10207
magr manifold-aligned graph regularization for continual action quality assessme | arXiv: 2403.04398
mahalanobis distance-based multi-view optimal transport for multi-view crowd loc | arXiv: 2409.01726
mambair a simple baseline for image restoration with state-space model | arXiv: 2402.15648
manikin biomechanically accurate neural inverse kinematics for human motion esti
mapdistill boosting efficient camera-based hd map construction via camera-lidar | arXiv: 2407.11682
maptracker tracking with strided memory fusion for consistent vector hd mapping | arXiv: 2403.15951
marineinst a foundation model for marine image analysis with instance visual des
mariner enhancing novel views by matching rendered images with nearby references | arXiv: 2407.13745
marvelovd marrying object recognition and vision-language models for robust open | arXiv: 2407.21465
masked angle-aware autoencoder for remote sensing images | arXiv: 2408.01946
masked video and body-worn imu autoencoder for egocentric action recognition | arXiv: 2407.06628
mathverse does your multi-modal llm truly see the diagrams in visual math proble | arXiv: 2403.14624
megascenes scene-level view synthesis at scale | arXiv: 2406.11819
membn robust test-time adaptation via batch norm with statistics memory
memory-efficient fine-tuning for quantized diffusion model | arXiv: 2401.04339
merlin empowering multimodal llms with foresight minds
merlin single-shot material estimation and relighting for photometric stereo | arXiv: 2409.00674
mesh2nerf direct mesh supervision for neural radiance field representation and g | arXiv: 2403.19319
meshfeat multi-resolution features for neural fields on meshes | arXiv: 2407.13592
meta-prompting for automating zero-shot visual recognition with llms | arXiv: 2403.11755
metaaug meta-data augmentation for post-training quantization | arXiv: 2407.14726
migs multi-identity gaussian splatting via tensor decomposition | arXiv: 2407.07284
milliflow scene flow estimation on mmwave radar point cloud for human motion sen | arXiv: 2306.17010
mixdq memory-efficient few-step text-to-image diffusion models with metric-decou | arXiv: 2405.17873
mm1 methods analysis and insights from multimodal llm pre-training
mmbench is your multi-modal model an all-around player | arXiv: 2307.06281
modeling and driving human body soundfields through acoustic primitives | arXiv: 2407.13083
moe-diffir task-customized diffusion priors for universal compressed image resto | arXiv: 2407.10833
mofa-video controllable image animation via generative motion field adaptions in | arXiv: 2405.20222
momentum auxiliary network for supervised local learning | arXiv: 2407.05623
monocular occupancy prediction for scalable indoor scenes | arXiv: 2407.11730
monowad weather-adaptive diffusion model for robust monocular 3d object detectio | arXiv: 2407.16448
motion mamba efficient and long sequence motion generation | arXiv: 2403.07487
motion-prior contrast maximization for dense continuous-time motion estimation | arXiv: 2407.10802
motionchain conversational motion controllers via multimodal prompts | arXiv: 2404.01700
motionlcm real-time controllable motion generation via latent consistency model | arXiv: 2404.19759
multi-hmr multi-person whole-body human mesh recovery in a single shot | arXiv: 2402.14654
multi-label cluster discrimination for visual representation learning | arXiv: 2407.17331
multi-memory matching for unsupervised visible-infrared person re-identification | arXiv: 2401.06825
multi-person pose forecasting with individual interaction perceptron and prior l
multigen zero-shot image generation from multi-modal prompts
mutdet mutually optimizing pre-training for remote sensing object detection | arXiv: 2407.09920
mutual learning for acoustic matching and dereverberation via visual scene-drive | arXiv: 2407.10373
mvdd multi-view depth diffusion models | arXiv: 2312.04875
mvdiffusion a dense high-resolution multi-view diffusion model for single or spa | arXiv: 2402.12712
MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo | arXiv: 2405.12218
mvsplat efficient 3d gaussian splatting from sparse multi-view images | arXiv: 2403.14627
myvlm personalizing vlms for user-specific queries | arXiv: 2403.14599
navgpt-2 unleashing navigational reasoning capability for large vision-language | arXiv: 2407.12366
navigation instruction generation with bev perception and large language models | arXiv: 2407.15087
NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration | arXiv: 2309.07322
neural volumetric world models for autonomous driving
neuroncap photorealistic closed-loop safety testing for autonomous driving | arXiv: 2404.07762
neusdfusion a spatial-aware generative model for 3d shape completion reconstruct | arXiv: 2403.18241
ngp-rt fusing multi-level hash features with lightweight attention for real-time | arXiv: 2407.10482
nl2contact natural language guided 3d hand-object contact modeling with diffusio | arXiv: 2407.12727
noise-assisted prompt learning for image forgery detection and localization
non-parametric sensor noise modeling and synthesis
nonverbal interaction detection | arXiv: 2407.08133
novum neural object volumes for robust object classification | arXiv: 2305.14668
nucraft crafting high resolution 3d semantic occupancy for unified 3d scene unde
nymeria a massive collection of multimodal egocentric daily motion in the wild | arXiv: 2406.09905
oapt offset-aware partition transformer for double jpeg artifacts removal | arXiv: 2408.11480
object-aware nir-to-visible translation
occgen generative multi-modal 3d occupancy prediction for autonomous driving | arXiv: 2404.15014
occluded gait recognition with mixture of experts an action detection perspectiv
occlusion handling in 3d human pose estimation with perturbed positional encodin | arXiv: 2405.17397
occlusion-aware seamless segmentation | arXiv: 2407.02182
occworld learning a 3d occupancy world model for autonomous driving | arXiv: 2311.16038
octopus embodied vision-language programmer from environmental feedback | arXiv: 2310.08588
ogni-dc robust depth completion with optimization-guided neural iterations | arXiv: 2406.11711
olaf a plug-and-play framework for enhanced multi-object multi-part scene parsin | arXiv: 2411.02858
omg occlusion-friendly personalized multi-concept generation in diffusion models | arXiv: 2403.10983
omni-recon harnessing image-based rendering for general-purpose neural radiance | arXiv: 2403.11131
omni6d large-vocabulary 3d object dataset for category-level 6d object pose esti | arXiv: 2409.18261
omnisat self-supervised modality fusion for earth observation | arXiv: 2404.08351
omnissr zero-shot omnidirectional image super-resolution using stable diffusion | arXiv: 2404.10312
omniview-tuning boosting viewpoint invariance of vision-language pre-training mo | arXiv: 2404.12139
on calibration of object detectors pitfalls evaluation and baselines | arXiv: 2405.20459
on the error analysis of 3d gaussian splatting and an optimal projection strateg | arXiv: 2402.00752
on the utility of 3d hand poses for action recognition | arXiv: 2403.09805
one-stage prompt-based continual learning | arXiv: 2402.16189
onerestore a universal restoration framework for composite degradation | arXiv: 2407.04621
onetrack demystifying the conflict between detection and tracking in end-to-end
online temporal action localization with memory-augmented transformer | arXiv: 2408.02957
open object-wise position embedding for multi-view 3d object detection | arXiv: 2407.10753
open vocabulary 3d scene understanding via geometry guided self-distillation | arXiv: 2407.13362
open-vocabulary 3d semantic segmentation with text-to-image diffusion models | arXiv: 2407.13642
openkd opening prompt diversity for zero- and few-shot keypoint detection | arXiv: 2409.19899
openpsg open-set panoptic scene graph generation via large multimodal models | arXiv: 2407.11213
operational open-set recognition and postmax refinement
ophnet a large-scale video benchmark for ophthalmic surgical workflow understand | arXiv: 2406.07471
optimizing diffusion models for joint trajectory prediction and controllable gen | arXiv: 2408.00766
optimizing factorized encoder models time and memory reduction for scalable and
optimizing illuminant estimation in dual-exposure hdr imaging
overcoming distribution mismatch in quantizing image super-resolution networks | arXiv: 2307.13337
p2p-bridge diffusion bridges for 3d point cloud denoising | arXiv: 2408.16325
pairwise distance distillation for unsupervised real-world image super-resolutio | arXiv: 2407.07302
panofree tuning-free holistic multi-view image generation with cross-view self-g | arXiv: 2408.02157
panovos bridging non-panoramic and panoramic views with transformer for video se | arXiv: 2309.12303
papr training-free one-step patch pruning with lightweight convnets for faster i | arXiv: 2403.16020
part2object hierarchical unsupervised 3d instance segmentation | arXiv: 2407.10084
partcraft crafting creative objects by parts | arXiv: 2407.04604
partstad 2d-to-3d part segmentation task adaptation | arXiv: 2401.05906
pathology-knowledge enhanced multi-instance prompt learning for few-shot whole s | arXiv: 2407.10814
pcf-lift panoptic lifting by probabilistic contrastive fusion | arXiv: 2410.10659
per-gaussian embedding-based deformation for deformable 3d gaussian splatting | arXiv: 2404.03613
petface a large-scale dataset and benchmark for animal identification | arXiv: 2407.13555
physdreamer physics-based interaction with 3d objects via video generation | arXiv: 2404.13026
pisr polarimetric neural implicit surface reconstruction for textureless and spe | arXiv: 2409.14331
pite pixel-temporal alignment for large video-language model | arXiv: 2409.07239
pixel-aware stable diffusion for realistic image super-resolution and personaliz | arXiv: 2308.14469
pixel-gs density control with pixel-aware gradient for 3d gaussian splatting | arXiv: 2403.15530
plain-det a plain multi-dataset object detector | arXiv: 2407.10083
plan posture and go towards open-vocabulary text-to-motion generation
plot text-based person search with part slot attention for corresponding part di | arXiv: 2409.13475
poa pre-training once for models of all sizes | arXiv: 2408.01031
point-supervised panoptic segmentation via estimating pseudo labels from learnab
pointllm empowering large language models to understand point clouds | arXiv: 2308.16911
ponymation learning articulated 3d animal motions from unlabeled online videos | arXiv: 2312.13604
portrait4d-v2 pseudo multi-view data creates better 4d head synthesizer | arXiv: 2403.13570
pose-aware self-supervised learning with viewpoint trajectory regularization | arXiv: 2403.14973
posesor human pose can guide our attention
posformer recognizing complex handwritten mathematical expression with position | arXiv: 2407.07764
power variable projection for initialization-free large-scale bundle adjustment | arXiv: 2405.05079
powerful and flexible personalized text-to-image generation via reinforcement le | arXiv: 2407.06642
pq-sam post-training quantization for segment anything model
prelar world model pre-training with learnable action representation
preventing catastrophic overfitting in fast adversarial training a bi-level opti | arXiv: 2407.12443
prioritized semantic learning for zero-shot instance navigation | arXiv: 2403.11650
probabilistic weather forecasting with deterministic guidance-based diffusion mo
prodepth boosting self-supervised multi-frame monocular depth with probabilistic | arXiv: 2407.09303
progressive classifier and feature extractor adaptation for unsupervised domain | arXiv: 2311.16474
progressive pretext task learning for human trajectory prediction | arXiv: 2407.11588
projecting points to axes oriented object detection via point-axis representatio | arXiv: 2407.08489
promerge prompt and merge for unsupervised instance segmentation | arXiv: 2409.18961
promptccd learning gaussian mixture prompt pool for continual category discovery | arXiv: 2407.19001
prompting future driven diffusion model for hand motion prediction
prompting language-informed distribution for compositional zero-shot learning | arXiv: 2305.14428
promptiqa boosting the performance and generalization for no-reference image qua | arXiv: 2403.04993
propose assess search harnessing llms for goal-oriented planning in instructiona | arXiv: 2409.20557
protecting nerfs' copyright via plug-and-play watermarking base model
pyra parallel yielding re-activation for training-inference efficient task adapt | arXiv: 2403.09192
quantized prompt for efficient generalization of vision-language models | arXiv: 2407.10704
quar-vla vision-language-action model for quadruped robots | arXiv: 2312.14457
querycdr query-based controllable distortion rectification network for fisheye i | arXiv: 2412.13496
r2-bench benchmarking the robustness of referring perception models under pertur
radedit stress-testing biomedical vision models via diffusion image editing | arXiv: 2312.12865
radiative gaussian splatting for efficient x-ray novel view synthesis | arXiv: 2403.04116
raindrop clarity a dual-focused dataset for day and night raindrop removal | arXiv: 2407.16957
random walk on pixel manifolds for anomaly segmentation of complex driving scene | arXiv: 2404.17961
rapid-seg range-aware pointwise distance distribution networks for 3d lidar segm | arXiv: 2407.10159
raw-adapter adapting pre-trained visual model to camera raw images | arXiv: 2408.14802
ray-distance volume rendering for neural scene reconstruction | arXiv: 2408.15524
real-data-driven 2000 fps color video from mosaicked chromatic spikes
realfred an embodied instruction following benchmark in photo-realistic environm | arXiv: 2407.18550
realistic human motion generation with cross-diffusion models | arXiv: 2312.10993
realviformer investigating attention for real-world video super-resolution | arXiv: 2407.13987
reason2drive towards interpretable and chain-based reasoning for autonomous driv | arXiv: 2312.03661
rebalancing using estimated class distribution for imbalanced semi-supervised le
reconstruction and simulation of elastic objects with spring-mass 3d gaussians | arXiv: 2403.09434
rectify the regression bias in long-tailed object detection | arXiv: 2401.15885
referring atomic video action recognition | arXiv: 2407.01872
regiondrag fast region-based image editing with diffusion models | arXiv: 2407.18247
reground improving textual and spatial grounding at no cost | arXiv: 2403.13589
rejection sampling imle designing priors for better few-shot image synthesis | arXiv: 2409.17439
reliability in semantic segmentation can we use synthetic data | arXiv: 2312.09231
reliable spatial-temporal voxels for multi-modal test-time adaptation
reloo reconstructing humans dressed in loose garments from monocular video in th | arXiv: 2409.15269
remamber referring image segmentation with mamba twister
removing distributional discrepancies in captions improves image-text alignment | arXiv: 2410.00905
renoise real image inversion through iterative noising | arXiv: 2403.14602
repaint123 fast and high-quality one image to 3d generation with progressive con
repose 3d human pose estimation via spatio-temporal depth relational consistency
representing topological self-similarity using fractal feature maps for accurate | arXiv: 2407.14754
reprojection errors as prompts for efficient scene coordinate regression | arXiv: 2409.04178
resilience of entropy model in distributed neural networks | arXiv: 2403.00942
responsible visual editing | arXiv: 2404.05580
restoring images in adverse weather conditions via histogram transformer | arXiv: 2407.10172
rethinking data augmentation for robust lidar semantic segmentation in adverse w | arXiv: 2407.02286
rethinking data bias dataset copyright protection via embedding class-wise hidde
rethinking image super-resolution from training data perspectives | arXiv: 2409.00768
rethinking lidar domain generalization single source as multiple density domains | arXiv: 2312.12098
rethinking unsupervised outlier detection via multiple thresholding | arXiv: 2407.05382
rethinking video-text understanding retrieval from counterfactually augmented da | arXiv: 2407.13094
revision rendering tools enable spatial fidelity in vision-language models | arXiv: 2408.02231
revisiting supervision for continual representation learning | arXiv: 2311.13321
rgnet a unified clip retrieval and grounding network for long videos | arXiv: 2312.06729
ringid rethinking tree-ring watermarking for enhanced multi-key identification | arXiv: 2404.14055
risk-aware self-consistent imitation learning for trajectory planning in autonom
risurconv rotation invariant surface attention-augmented convolutions for 3d poi | arXiv: 2408.06110
roadpainter points are ideal navigators for topology transformer | arXiv: 2407.15349
robust calibration of large vision-language adapters | arXiv: 2407.13588
robust fitting on a gate quantum computer | arXiv: 2409.02006
robust-wide robust watermarking against instruction-driven image editing | arXiv: 2402.12688
rodinhd high-fidelity 3d avatar generation with diffusion models | arXiv: 2407.06938
roguenerf a robust geometry-consistent universal enhancer for nerf | arXiv: 2403.11909
roofdiffusion constructing roofs from severely corrupted point data via diffusio | arXiv: 2404.09290
rotary position embedding for vision transformer | arXiv: 2403.13298
rpbg towards robust neural point-based graphics in the wild | arXiv: 2405.05663
R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | arXiv: 2404.00801
s3d-nerf single-shot speech-driven neural radiance field for high fidelity talki
sa-dvae improving zero-shot skeleton-based action recognition by disentangled va | arXiv: 2407.13460
safe-sim safety-critical closed-loop traffic simulation with diffusion-controlla | arXiv: 2401.00391
safnet selective alignment fusion network for efficient hdr imaging | arXiv: 2407.16308
sags structure-aware 3d gaussian splatting | arXiv: 2404.19149
sair learning semantic-aware implicit representation | arXiv: 2310.09285
sapiens foundation for human vision models | arXiv: 2408.12569
sc4d sparse-controlled video-to-4d generation and motion transfer | arXiv: 2404.03736
scalable group choreography via variational phase manifold learning | arXiv: 2407.18839
scaledreamer scalable text-to-3d synthesis with asynchronous score distillation | arXiv: 2407.02040
scaling backwards minimal synthetic pre-training | arXiv: 2408.00677
scanreason empowering 3d visual grounding with reasoning capabilities | arXiv: 2407.01525
scantalk 3d talking heads from unregistered scans | arXiv: 2403.10942
scape a simple and strong category-agnostic pose estimator | arXiv: 2407.13483
scatterformer efficient voxel transformer with scattered linear attention | arXiv: 2401.00912
scenegraphloc cross-modal coarse visual localization on 3d scene graphs | arXiv: 2404.00469
sceneverse scaling 3d vision-language learning for grounded scene understanding | arXiv: 2401.09340
sclip rethinking self-attention for dense vision-language inference | arXiv: 2312.01597
scpnet unsupervised cross-modal homography estimation via intra-modal self-super
sea-raft simple efficient accurate raft for optical flow | arXiv: 2405.14793
sediff structure extraction for domain adaptive depth estimation via denoising d
see and think embodied agent in virtual environment | arXiv: 2311.15209
seed a simple and effective 3d detr in point clouds | arXiv: 2407.10749
seeing the unseen a frequency prompt guided transformer for image restoration | arXiv: 2404.00288
seflow a self-supervised scene flow method in autonomous driving | arXiv: 2407.01702
seggen supercharging segmentation models with text2mask and mask2img synthesis | arXiv: 2311.03355
segmentation-guided layer-wise image vectorization with gradient fills | arXiv: 2408.15741
segpoint segment any point cloud via large language model | arXiv: 2407.13761
seit masked token modeling improves storage-efficient training | arXiv: 2312.10105
select and distill selective dual-teacher knowledge transfer for continual learn | arXiv: 2403.09296
self-adapting large visual-language models to edge devices across visual modalit | arXiv: 2403.04908
self-supervised any-point tracking by contrastive random walks | arXiv: 2409.16288
self-supervised co-salient object detection via feature correspondences at multi | arXiv: 2403.11107
self-supervised feature adaptation for 3d industrial anomaly detection | arXiv: 2401.03145
self-supervised video copy localization with regional token representation
semantically guided representation learning for action anticipation | arXiv: 2407.02309
semantichuman-hd high-resolution semantic disentangled 3d human generation | arXiv: 2403.10166
semgrasp semantic grasp generation via language aligned discretization | arXiv: 2404.03590
semi-supervised video desnowing network via temporal decoupling experts and dist | arXiv: 2410.07901
semtrack a large-scale dataset for semantic tracking in the wild
senc handling self-collision in neural cloth simulation | arXiv: 2407.12479
sfpnet sparse focal point network for semantic segmentation on general lidar poi | arXiv: 2407.11569
sgs-slam semantic gaussian splatting for neural dense slam | arXiv: 2402.03246
shape-guided configuration-aware learning for endoscopic-image-based pose estima
shapefusion a 3d diffusion model for localized shape editing | arXiv: 2403.19773
sharegpt4v improving large multi-modal models with better captions | arXiv: 2311.12793
shedding more light on robust classifiers under the lens of energy-based models | arXiv: 2407.06315
shifted autoencoders for point annotation restoration in object counting | arXiv: 2312.07190
shine saliency-aware hierarchical negative ranking for compositional temporal gr | arXiv: 2407.05118
siamese vision transformers are scalable audio-visual learners | arXiv: 2403.19638
sigma sinkhorn-guided masked video modeling | arXiv: 2407.15447
signavatars a large-scale 3d sign language holistic motion dataset and benchmark | arXiv: 2310.20436
silc improving vision language pretraining with self-distillation | arXiv: 2310.13355
simpb a single model for 2d and 3d object detection from multiple cameras | arXiv: 2403.10353
simple unsupervised knowledge distillation with space similarity | arXiv: 2409.13939
sinder repairing the singular defects of dinov2 | arXiv: 2407.16826
skymask attack-agnostic robust federated learning with fine-grained learnable ma | arXiv: 2312.12484
slack semantic location and appearance aware open-vocabulary tracking | arXiv: 2409.11235
sledge synthesizing driving environments with generative models and rule-based t | arXiv: 2403.17933
slotlifter slot-guided feature lifting for learning object-centric radiance fiel | arXiv: 2408.06697
smoodi stylized motion diffusion model | arXiv: 2407.12783
soft prompt generation for domain generalization | arXiv: 2404.19286
sos segment object system for open-world instance segmentation with object prior | arXiv: 2409.14627
source prompt disentangled inversion for boosting image editability with diffusi | arXiv: 2403.11105
spacejam a lightweight and regularization-free method for fast joint alignment o | arXiv: 2407.11850
spamming labels efficient annotations for the trackers of tomorrow | arXiv: 2404.11426
sparsessp 3d subcellular structure prediction from sparse-view transmitted light | arXiv: 2407.02159
spatialformer towards generalizable vision transformers with explicit spatial un
spatially-variant degradation model for dataset-free super-resolution | arXiv: 2407.08252
spatio-temporal proximity-aware dual-path model for panoramic activity recogniti | arXiv: 2403.14113
spectral subsurface scattering for material classification
spectram-ps spectrally multiplexed photometric stereo under unknown spectral com
spherical linear interpolation and text-anchoring for zero-shot composed image r | arXiv: 2405.00571
spherical world-locking for audio-visual localization in egocentric videos | arXiv: 2408.05364
spin hierarchical segmentation with subpart granularity in natural images | arXiv: 2407.09686
splatfields neural gaussian splats for sparse 3d and 4d reconstruction | arXiv: 2409.11211
sq-llava self-questioning for large vision-language assistant | arXiv: 2403.11299
stable preference redefining training paradigm of human preference model for tex
stepwise multi-grained boundary detector for point-supervised temporal action lo
stream query denoising for vectorized hd-map construction | arXiv: 2401.09112
stripe observation guided inference cost-free attention mechanism
stsp spatial-temporal subspace projection for video class-incremental learning
styletokenizer defining image style by a single instance for controlling diffusi | arXiv: 2409.02543
supergaussian repurposing video models for 3d super resolution | arXiv: 2406.00609
superpixel-informed implicit neural representation for multi-dimensional data | arXiv: 2411.11356
surface reconstruction from 3d gaussian splatting via local structural hints
sv3d novel multi-view synthesis and 3d generation from a single image using late | arXiv: 2403.12008
sync from the sea retrieving alignable videos from large-scale datasets | arXiv: 2409.01445
synchronous diffusion for unsupervised smooth non-rigid 3d shape matching
synergy of sight and semantics visual intention understanding with clip
t-mae temporal masked autoencoders for point cloud representation learning | arXiv: 2312.10217
talkinggaussian structure-persistent 3d talking head synthesis via gaussian spla | arXiv: 2404.15264
taming latent diffusion model for neural radiance field inpainting | arXiv: 2404.09995
taptr tracking any point with transformers as detection | arXiv: 2403.13042
tcc-det temporarily consistent cues for weakly-supervised 3d detection
teaching tailored to talent adverse weather restoration via prompt pool and dept | arXiv: 2409.15739
tela text to layer-wise 3d clothed human generation | arXiv: 2404.16748
temporally consistent stereo matching | arXiv: 2407.11950
tensorial template matching for fast cross-correlation with rotations and its ap | arXiv: 2408.02398
text-guided video masked autoencoder | arXiv: 2408.00759
text2place affordance-aware text guided human placement | arXiv: 2407.15446
textdiffuser-2 unleashing the power of language models for text rendering | arXiv: 2311.16465
textual-visual logic challenge understanding and reasoning in text-to-image gene
texture-gs disentangling the geometry and texture for 3d gaussian splatting edit | arXiv: 2403.10050
tf-fas twofold-element fine-grained semantic guidance for generalizable face ant
the fabrication of reality and fantasy scene generation with llm-assisted prompt | arXiv: 2407.12579
the hard positive truth about vision-language compositionality | arXiv: 2409.17958
the nerfect match exploring nerf features for visual localization | arXiv: 2403.09577
thermal3d-gs physics-induced 3d gaussians for thermal infrared novel-view synthe | arXiv: 2409.08042
timecraft navigate weakly-supervised temporal grounded video question answering
tip tabular-image pre-training for multimodal classification with incomplete dat | arXiv: 2407.07582
tod3cap towards 3d dense captioning in outdoor scenes | arXiv: 2403.19589
token compensator altering inference cost of vision transformer without re-tunin | arXiv: 2408.06798
topology-preserving downsampling of binary images | arXiv: 2407.17786
toward tiny and high-quality facial makeup with data amplify learning | arXiv: 2403.15033
towards model-agnostic dataset condensation by heterogeneous models | arXiv: 2409.14538
towards multi-modal transformers in federated learning | arXiv: 2404.12467
towards natural language-guided drones geotext-1652 benchmark with spatial relat | arXiv: 2311.12751
towards open-ended visual quality comparison | arXiv: 2402.16641
towards open-ended visual recognition with large language models | arXiv: 2311.08400
towards real-world adverse weather image restoration enhancing clearness and sem | arXiv: 2409.02101
towards real-world event-guided low-light video enhancement and deblurring | arXiv: 2408.14916
towards reliable advertising image generation using human feedback | arXiv: 2408.00418
towards unified representation of invariant-specific features in missing modalit
tpa3d triplane attention for fast text-to-3d generation | arXiv: 2312.02647
track everything everywhere fast and robustly | arXiv: 2403.17931
tracking meets lora faster training larger model stronger performance | arXiv: 2403.05231
tracknerf bundle adjusting nerf from sparse and noisy views via feature tracks | arXiv: 2408.10739
train till you drop towards stable and robust source-free unsupervised 3d domain | arXiv: 2409.04409
tram global trajectory and motion of 3d humans from in-the-wild videos | arXiv: 2403.17346
transferable 3d adversarial shape completion using diffusion models | arXiv: 2407.10077
ttt-mim test-time training with masked image modeling for denoising distribution
u-cope taking a further step to universal 9d category-level object pose estimati
udifftext a unified framework for high-quality text synthesis in arbitrary image | arXiv: 2312.04884
umbrae unified multimodal brain decoding | arXiv: 2404.07202
un-evimo unsupervised event-based independent motion segmentation | arXiv: 2312.00114
uncertainty-driven spectral compressive imaging with spatial-frequency transform
understanding physical dynamics with counterfactual world modeling | arXiv: 2312.06721
uni3dl a unified model for 3d vision-language understanding
unic universal classification models via multi-teacher distillation | arXiv: 2408.05088
unicode learning a unified codebook for multimodal large language models | arXiv: 2403.09072
unidream unifying diffusion priors for relightable text-to-3d generation | arXiv: 2312.08754
unifs universal few-shot instance perception with point representations | arXiv: 2404.19401
uniinr event-guided unified rolling shutter correction deblurring and interpolat | arXiv: 2305.15078
unim2ae multi-modal masked autoencoders with unified 3d representation for 3d pe
unitraj a unified framework for scalable vehicle trajectory prediction | arXiv: 2403.15098
unleashing the power of prompt-driven nucleus instance segmentation | arXiv: 2311.15939
unrolled decomposed unpaired learning for controllable low-light video enhanceme | arXiv: 2408.12316
unsupervised exposure correction | arXiv: 2507.17252
unsupervised moving object segmentation with atmospheric turbulence
unsupervised multi-modal medical image registration via invertible translation
unveiling advanced frequency disentanglement paradigm for low-light image enhanc | arXiv: 2409.01641
unveiling privacy risks in stochastic neural networks training effective image r
upose3d uncertainty-aware 3d human pose estimation with cross-view and temporal | arXiv: 2404.14634
upper-body hierarchical graph for skeleton based emotion recognition in assistiv
vamos versatile action models for video understanding | arXiv: 2311.13627
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | arXiv: 2312.06109
vcd-texture variance alignment based 3d-2d co-denoising for text-guided texturin | arXiv: 2407.04461
versatile incremental learning towards class and domain-agnostic incremental lea | arXiv: 2409.10956
versatilegaussian real-time neural rendering for versatile tasks using gaussian
vfusion3d learning scalable 3d generative models from video diffusion models | arXiv: 2403.12034
vic-mae self-supervised representation learning from images and video with contr | arXiv: 2303.12001
videoagent a memory-augmented multimodal agent for video understanding | arXiv: 2403.11481
videoclusternet self-supervised and adaptive face clustering for videos | arXiv: 2407.12214
videomamba spatio-temporal selective state space model | arXiv: 2407.08476
videomamba state space model for efficient video understanding | arXiv: 2403.06977
videoshop localized semantic video editing with noise-extrapolated diffusion inv | arXiv: 2403.14617
view selection for 3d captioning via diffusion ranking | arXiv: 2404.07984
visa reasoning video object segmentation via large language models | arXiv: 2407.11325
visage video instance segmentation with appearance-guided enhancement | arXiv: 2312.04885
visfocus prompt-guided vision encoders for ocr-free dense document understanding | arXiv: 2407.12594
visible and clear finding tiny objects in difference map | arXiv: 2405.11276
visiontrap vision-augmented trajectory prediction guided by textual descriptions | arXiv: 2407.12345
vista3d unravel the 3d darkside of a single image | arXiv: 2409.12193
visual grounding for object-level generalization in reinforcement learning | arXiv: 2408.01942
vp-sam taming segment anything model for video polyp segmentation via disentangl
walker self-supervised multiple object tracking by walking on temporal appearanc | arXiv: 2409.17221
wast-3d wasserstein-2 distance for scene-to-scene stylization on 3d gaussians | arXiv: 2409.17917
wavelength-embedding-guided filter-array transformer for spectral demosaicing
weak-to-strong compositional learning from generative models for language-based | arXiv: 2407.15296
weakly supervised 3d object detection via multi-level visual guidance | arXiv: 2312.07530
weakly-supervised camera localization by ground-to-satellite image registration | arXiv: 2409.06471
wear-any-way manipulable virtual try-on via sparse correspondence alignment | arXiv: 2403.12965
webrpg automatic web rendering parameters generation for visual presentation | arXiv: 2407.15502
wecromcl weakly supervised cross-modality contrastive learning for transcription | arXiv: 2407.19507
when do we not need larger vision models | arXiv: 2403.13043
wildvidfit video virtual try-on in the wild via image-based controlled diffusion | arXiv: 2407.10625
wordrobe text-guided generation of textured 3d garments | arXiv: 2403.17541
worldpose a world cup dataset for global 3d human pose estimation | arXiv: 2501.02771
x-former unifying contrastive and reconstruction learning for mllms | arXiv: 2407.13851
xpsr cross-modal priors for diffusion-based image super-resolution | arXiv: 2403.05049
yolov9 learning what you want to learn using programmable gradient information | arXiv: 2402.13616
you only learn one query learning unified human query for single-stage multi-per | arXiv: 2312.05525
you only need one step fast super-resolution with stable diffusion via scale dis | arXiv: 2401.17258
zero-shot detection of ai-generated images | arXiv: 2409.15875
zero-shot multi-object scene completion | arXiv: 2403.14628
zero-shot object counting with good exemplars | arXiv: 2407.04948
zest zero-shot material transfer from a single image | arXiv: 2404.06425
zigma a dit-style zigzag mamba diffusion model | arXiv: 2403.13802
ziplora any subject in any style by effectively merging loras | arXiv: 2311.13600
∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions | arXiv: 2407.14709
3dego 3d editing on the go | arXiv: 2407.10102
crossscore towards multi-view image evaluation and scori | arXiv: 2404.14409
dgpic domain generalized point-in-context learning for po | arXiv: 2407.08801
dreamdrone text-to-image diffusion models are zero-shot perpetu | arXiv: 2312.08746
dreamview injecting view-specific text guidance into text-to-3d | arXiv: 2404.06119
falip visual prompt as foveal attention boosts clip zer | arXiv: 2407.05578
jointdreamer ensuring geometry consistency and text congruen | arXiv: 2407.12291
scenegraphloc cross-modal coarse visual localization on 3d sc | arXiv: 2404.00469
sceneverse scaling 3d vision-language learning for grounded s | arXiv: 2401.09340
t-mae temporal masked autoencoders for point cloud representation learning | arXiv: 2312.10217
towards multi-modal transformers in federated learning | arXiv: 2404.12467
action2sound ambient-aware generation of action sounds from e | arXiv: 2406.09272
controlllm augment language models with tools | arXiv: 2310.17796
4d contrastive superflows are dense 3d representation learners | arXiv: 2407.06190
dvlo deep visual-lidar odometry with local-to-global featu | arXiv: 2403.18274
lidar-event stereo fusion with hallucinations | arXiv: 2408.04633
navigation instruction generation with bev | arXiv: 2407.15087
occgen generative multimodal 3d occupancy prediction for aut | arXiv: 2404.15014
reason2drive towards interpretable and chain-based reasoning | arXiv: 2312.03661
safe-sim safety-critical closed-loop traffic simulation with diffusion-cont | arXiv: 2401.00391
visiontrap vision-augmented trajectory prediction guided | arXiv: 2407.12345
dreamstruct understanding slides and user interfaces via synthetic data generati | arXiv: 2410.00201
bi-mdrg bridging image history in multimodal dialogue response generation | arXiv: 2408.05926
synchronous diffusion for unsupervised smooth non-rigid 3d shape matching | arXiv: 2407.08244
3dgazenet generalizing 3d gaze estimation with weak-supervision from synthetic v | arXiv: 2212.02997
large motion model for unified multimodal motion generation | arXiv: 2404.01284
quar-vla vision-language-action model for quadruped robots | arXiv: 2312.14457
self-supervised feature adaptation for 3d industrial ano | arXiv: 2401.03145
wordrobe text-guided generation of textured 3d garments | arXiv: 2403.17541
a closer look at gan priors exploiting intermediate features | arXiv: 2407.13863
a high-quality robust diffusion framework for corrupted datas | arXiv: 2311.17101
anycontrol create your artwork with versatile control on tex | arXiv: 2406.18958
colorpeel color prompt learning with diffusion models v | arXiv: 2407.07197
difftracker text-to-image diffusion models are unsupervised tr | arXiv: 2407.08394
emdm efficient motion diffusion model for fast and high | arXiv: 2312.02256
finematch aspect-based fine-grained image and text mismat | arXiv: 2404.14715
freediff progressive frequency truncation for image edi | arXiv: 2404.11895
getting it right improving spatial consistency in text-to-imag | arXiv: 2404.01197
hybridbooth hybrid prompt inversion for efficient subje | arXiv: 2410.08192
infiniteid identity-preserved personalization via id-sema | arXiv: 2403.11781
latent guard a safety framework for text-to-image generation | arXiv: 2404.08031
lcm-lookahead for encoder-based text-to-image personalization | arXiv: 2404.03620
learning trimodal relation for audio-visual question answerin | arXiv: 2407.16171
lego learning egocentric action frame generation via vi | arXiv: 2312.03849
mixdq memory-efficient few-step text-to-image diffusion models w | arXiv: 2405.17873
motionchain conversational motion controllers via multimodal | arXiv: 2404.01700
pixel-aware stable diffusion for realistic image super-re | arXiv: 2308.14469
ponymation learning articulated 3d animal motions from | arXiv: 2312.13604
powerful and flexible personalized text-to-image generation vi | arXiv: 2407.06642
removing distributional discrepancies in captions improves i | arXiv: 2410.00905
scaledreamer scalable text-to-3d synthesis with asynchronous s | arXiv: 2407.02040
text2place affordance-aware text guided human placement | arXiv: 2407.15446
textdiffuser-2 unleashing the power of language models f | arXiv: 2311.16465
towards reliable advertising image generation using human fe | arXiv: 2408.00418
xpsr cross-modal priors for diffusion-based image super-resolut | arXiv: 2403.05049
artvlm attribute recognition through vision-based prefix language modeling | arXiv: 2408.04102
grounding language models for visual entity recognition | arXiv: 2402.18695
multi-label cluster discrimination for visual representation learning | arXiv: 2407.17331
onerestore a universal restoration framework for composite degradation | arXiv: 2407.04621
towards open-ended visual recognition with large language models | arXiv: 2311.08400
detailsemnet elevating signature verification through detail-semantic integratio | arXiv: 2511.16364
improving intervention efficacy via concept realignment in concept bottleneck mo | arXiv: 2405.01531
plot text-based person search with part slot attention for corresponding part di | arXiv: 2409.13475
poa pre-training once for models of all sizes | arXiv: 2408.01031
colormnet a memory-based deep spatial-temporal feature propagation network for v | arXiv: 2404.06251
deep cost ray fusion for sparse depth video completion | arXiv: 2409.14935
distribution alignment for fully test-time adaptation with dynamic online data s | arXiv: 2407.12128
eliminating warping shakes for unsupervised online video stitching | arXiv: 2403.06378
gradient-regularized out-of-distribution detection | arXiv: 2404.12368
image-feature weak-to-strong consistency an enhanced paradigm for semi-supervise | arXiv: 2408.12614
imaging interiors an implicit solution to electromagnetic inverse scattering pro | arXiv: 2407.09352
instance-dependent noisy-label learning with graphical model based noise-rate es | arXiv: 2305.19486
ogni-dc robust depth completion with optimization-guided neural iterations | arXiv: 2406.11711
r2-bench benchmarking the robustness of referring perception models under pertur
sigma sinkhorn-guided masked video modeling | arXiv: 2407.15447
sync from the sea retrieving alignable videos from large-scale datasets | arXiv: 2409.01445
versatile incremental learning towards class and domain-agnostic incremental lea | arXiv: 2409.10956
visfocus prompt-guided vision encoders for ocr-free dense document understanding | arXiv: 2407.12594
cultural value differences llms | arXiv: 2407.16891
funqa towards surprising video comprehension | arXiv: 2306.14899
zero-shot object counting with good exemplars | arXiv: 2407.04948
cross-domain learning for video anomaly detection with limited supervision | arXiv: 2408.05191
dragapart learning a part-level motion prior for articulated objects | arXiv: 2403.15382
learning to obstruct few-shot image classification over restricted classes | arXiv: 2409.19210
plan posture and go towards open-vocabulary text-to-motion generation | arXiv: 2312.14828
prelar world model pre-training with learnable action representation
prompting language-informed distribution for compositional zero-shot learning | arXiv: 2305.14428
scaling backwards minimal synthetic pre-training | arXiv: 2408.00677
scantalk 3d talking heads from unregistered scans | arXiv: 2403.10942
controllable navigation instruction generation | arXiv: 2407.07433
magr manifold-aligned graph regularization for continual action quality assessme | arXiv: 2403.04398
gtp4o modality-prompted heterogeneous graph learning for | arXiv: 2407.05540
improving medical multimodal contrastive learning with exper | arXiv: 2403.10153
pathology-knowledge enhanced multi-instance prompt learni | arXiv: 2407.10814
tip tabular-image pre-training for multimodal classification w | arXiv: 2407.07582
genq quantization in low data regimes with generative synthetic data | arXiv: 2312.05272
attention prompting on image for large vision-language models | arXiv: 2409.17143
beaf observing before-after changes to evaluate hallucination | arXiv: 2407.13442
brave broadening the visual encoding of vision-language model | arXiv: 2404.07204
cat audio visual qa | arXiv: 2403.04640
clap isolating content from style through contrastive learni | arXiv: 2311.16445
classact active learning | arXiv: 2312.05328
decoupling common and unique representations for multimodal | arXiv: 2309.05300
elevating all zero-shot sketch-based image retrieval through m | arXiv: 2407.04207
eyes closed safety on protecting multimodal llms via image-to | arXiv: 2403.09572
flexattention for efficient high-resolution vision-language mo | arXiv: 2407.20228
freemotion mocap-free human motion synthesis with multimodal | arXiv: 2406.10740
genixer empowering multimodal large language model as a powe | arXiv: 2312.06731
groma localized visual tokenization for grounding multimodal | arXiv: 2404.13013
marvelovd marrying object recognition and vision-language mod | arXiv: 2407.21465
mathverse does your multimodal llm truly see the diagrams in | arXiv: 2403.14624
meta-prompting for automating zero-shot visual recognitio | arXiv: 2403.11755
mmbench is your multimodal model an all-around player | arXiv: 2307.06281
myvlm personalizing vlms for user-specific queries | arXiv: 2403.14599
navgpt2 unleashing navigational reasoning capability | arXiv: 2407.12366
nymeria a massive collection of multimodal egocentric daily motion in the wild | arXiv: 2406.09905
omniview-tuning boosting viewpoint invariance of vision-langua | arXiv: 2404.12139
quantized prompt for efficient generalization of vision-langu | arXiv: 2407.10704
robust calibration of large vision-language adapters | arXiv: 2407.13588
sharegpt4v improving large multi-modal models with better cap | arXiv: 2311.12793
sq-llava self-questioning for large vision-language assistant | arXiv: 2403.11299
the hard positive truth about vision-language compositionalit | arXiv: 2409.17958
towards open-ended visual quality comparison | arXiv: 2402.16641
towards real-world adverse weather image restoration enhancin | arXiv: 2409.02101
unicode learning a unified codebook for multimodal large lan | arXiv: 2403.09072
x-former unifying contrastive and reconstruction learning for | arXiv: 2407.13851
slimer zero shot ner | arXiv: 2407.01272
a new dataset and framework for real-world blurred images super-resolution | arXiv: 2407.14880
afreeca annotation-free counting for all | arXiv: 2403.04943
be yourself bounded attention for multi-subject text-to-image g | arXiv: 2403.16990
i can't believe it's not scene flow | arXiv: 2403.04739
layoutdetr detection transformer is a good multimodal layout | arXiv: 2212.09877
towards natural language-guided drones geotext-1652 bench | arXiv: 2311.12751
tracking meets lora faster training larger model strong | arXiv: 2403.05231
docling pdf document conversion | arXiv: 2408.09869
teaching tailored to talent adverse weather restoration | arXiv: 2409.15739
adaglimpse active visual exploration with arbitrary glimpse position and scale | arXiv: 2404.03482
octopus embodied vision-language programmer from environmental feedback | arXiv: 2310.08588
adapting fine-grained cross-view localization to areas without fine ground truth | arXiv: 2406.00474
disco embodied navigation and interaction | arXiv: 2407.14758
prioritized semantic learning for zero-shot instance navigation | arXiv: 2403.11650
semgrasp semantic grasp generation via language aligned | arXiv: 2404.03590
adalog post-training quantization for vision transformers with adaptive logarith | arXiv: 2407.12951
controlnet++ improving conditional controls with efficien | arXiv: 2404.07987
densenets reloaded paradigm shift beyond resnets and vits | arXiv: 2403.19588
openpsg open-set panoptic scene graph generation via large mu | arXiv: 2407.11213
sclip rethinking self-attention for dense vision-language infe | arXiv: 2312.01597
distribution-aware robust learning from long-tailed data with noisy labels | arXiv: 2407.16802
grace graph-based contextual debiasing for fair visual question answering
blazebvd make scale-time equalization great again for blind video deflickering | arXiv: 2403.06243
draganything motion control for anything using entity representation | arXiv: 2403.07420
dreammotion space-time self-similar score distillation for zero-shot video editi | arXiv: 2403.12002
evaluating text-to-visual generation with image-to-text generation | arXiv: 2404.01291
exploring pre-trained text-to-video diffusion models for referring video object | arXiv: 2403.12042
freeinit bridging initialization gap in video diffusion models | arXiv: 2312.07537
kalman-inspired feature propagation for video face super-resolution | arXiv: 2408.05205
magdiff multi-alignment diffusion for high-fidelity video generation and editing | arXiv: 2311.17338
mofa-video controllable image animation via generative motion field adaptions in | arXiv: 2405.20222
physdreamer physics-based interaction with 3d objects via video generation | arXiv: 2404.13026
realviformer investigating attention for real-world video super-resolution | arXiv: 2407.13987
sv3d novel multi-view synthesis and 3d generation from a single image using late | arXiv: 2403.12008
vfusion3d learning scalable 3d generative models from video diffusion models | arXiv: 2403.12034
videoshop localized semantic video editing with noise-extrapolated diffusion inv | arXiv: 2403.14617
actionswitch class-agnostic detection of simultaneous actions in streaming video | arXiv: 2407.12987
elysium exploring object-level perception in videos via mllm | arXiv: 2403.16558
finepseudo improving pseudo-labelling through temporal-alignablity for semi-supe | arXiv: 2409.01448
nymeria a massive collection of multimodal egocentric daily | arXiv: 2406.09905
pite pixel-temporal alignment for large video-language mo | arXiv: 2409.07239