ECCV2024 Paper Notes TODO
Total: 1041 papers | Completed: 1041 | Pending update: 0
- 2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction | arXiv: 2409.09969
- 3D Congealing: 3D-Aware Image Alignment in the Wild | arXiv: 2404.02125
- 3D Hand Pose Estimation in Everyday Egocentric Images | arXiv: 2312.06583
- 3D Reconstruction of Objects in Hands without Real World 3D Supervision | arXiv: 2305.03036
- 3D Single-Object Tracking in Point Clouds with High Temporal Variation | arXiv: 2408.02049
- 3DEgo: 3D Editing on the Go! | arXiv: 2407.10102
- 3dfg-pifu 3d feature grids for human digitization from sparse views
- 3DGazeNet: Generalizing 3D Gaze Estimation with Weak-Supervision from Synthetic Views | arXiv: 2212.02997
- 3dsa multi-view 3d human pose estimation with 3d space attention mechanisms
- 3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting | arXiv: 2408.03753
- 3×2: 3D Object Part Segmentation by 2D Semantic Correspondences | arXiv: 2407.09648
- 4D Contrastive Superflows are Dense 3D Representation Learners | arXiv: 2407.06190
- 4diff 3d-aware diffusion model for third-to-first viewpoint translation
- 6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model | arXiv: 2407.15484
- a cephalometric landmark regression method based on dual-encoder for high-resolu
- A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks | arXiv: 2407.13863
- A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis | arXiv: 2311.12897
- A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control | arXiv: 2407.15631
- a direct approach to viewing graph solvability
- A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation | arXiv: 2406.07320
- A High-Quality Robust Diffusion Framework for Corrupted Dataset | arXiv: 2311.17101
- A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis | arXiv: 2503.06973
- A New Dataset and Framework for Real-World Blurred Images Super-Resolution | arXiv: 2407.14880
- A Probability-guided Sampler for Neural Implicit Surface Rendering | arXiv: 2506.08619
- a rotation-invariant texture vit for fine-grained recognition of esophageal canc
- A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties | arXiv: 2312.13764
- A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars | arXiv: 2401.04730
- A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting | arXiv: 2401.10227
- A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging | arXiv: 2407.21517
- ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-Agnostic Counting | arXiv: 2309.04820
- AccDiffusion: An Accurate Method for Higher-Resolution Image Generation | arXiv: 2407.10738
- Accelerating Image Super-Resolution Networks with Pixel-Level Classification | arXiv: 2407.21448
- Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention | arXiv: 2407.06683
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos | arXiv: 2406.09272
- ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos | arXiv: 2407.12987
- ActionVOS: Actions as Prompts for Video Object Segmentation | arXiv: 2407.07402
- Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images | arXiv: 2303.11530
- Active Generation for Image Classification | arXiv: 2403.06517
- AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection | arXiv: 2407.15795
- AdaDiffSR: Adaptive Region-Aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution | arXiv: 2410.17752
- AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition | arXiv: 2407.01332
- AdaGen: Learning Adaptive Policy for Image Synthesis | arXiv: 2603.06993
- AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale | arXiv: 2404.03482
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer | arXiv: 2407.12951
- AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation | arXiv: 2409.00342
- Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts | arXiv: 2407.14872
- Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth | arXiv: 2406.00474
- Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction | arXiv: 2403.07263
- Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling | arXiv: 2407.08256
- Adaptive Correspondence Scoring for Unsupervised Medical Image Registration | arXiv: 2312.00837
- Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification | arXiv: 2410.06977
- Adaptive Human Trajectory Prediction via Latent Corridors | arXiv: 2312.06653
- Adaptive Multi-head Contrastive Learning | arXiv: 2310.05615
- adaptive multi-task learning for few-shot object detection
- Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing | arXiv: 2409.11738
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting | arXiv: 2403.09513
- addme zero-shot group-photo synthesis by inserting people into scenes
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization | arXiv: 2407.08156
- ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation | arXiv: 2408.09042
- admap anti-disturbance framework for vectorized hd map construction
- adversarially robust distillation by reducing the student-teacher variance gap
- aednet adaptive embedding and multiview-aware disentanglement for point cloud co
- aff-ttention affordances and attention models for short-term object interaction | arXiv: 2406.01194
- afreeca annotation-free counting for all | arXiv: 2403.04943
- agent3d-zero an agent for zero-shot 3d understanding | arXiv: 2403.11835
- aid-appeal automatic image dataset and algorithm for content appeal enhancement | arXiv: 2407.05546
- align before collaborate mitigating feature misalignment for robust multi-agent
- alignist cad-informed orientation distribution estimation by fusing shape and co | arXiv: 2409.06683
- alternate diverse teaching for semi-supervised medical image segmentation | arXiv: 2311.17325
- amego active memory from long egocentric videos | arXiv: 2409.10917
- an economic framework for 6-dof grasp detection
- an incremental unified framework for small defect inspection
- analysis-by-synthesis transformer for single-view 3d reconstruction
- analytic-splatting anti-aliased 3d gaussian splatting via analytic integration | arXiv: 2403.11056
- animatabledreamer text-guided non-rigid 3d model generation and reconstruction w | arXiv: 2312.03795
- any target can be offense adversarial example generation via generalized latent | arXiv: 2407.12292
- anycontrol create your artwork with versatile control on text-to-image generatio | arXiv: 2406.18958
- anytime continual learning for open vocabulary classification | arXiv: 2409.08518
- apl anchor-based prompt learning for one-stage weakly supervised referring expre
- approaching outside scaling unsupervised 3d object detection from 2d scene | arXiv: 2407.08569
- architecture-agnostic untrained network priors for image reconstruction with fre | arXiv: 2312.09988
- artvlm attribute recognition through vision-based prefix language modeling | arXiv: 2408.04102
- asymmetric mask scheme for self-supervised real image denoising | arXiv: 2407.06514
- attention decomposition for cross-domain semantic segmentation
- attention prompting on image for large vision-language models | arXiv: 2409.17143
- attnzero efficient attention discovery for vision transformers
- audio-driven talking face generation with stabilized synchronization loss | arXiv: 2307.09368
- augdetr improving multi-scale learning for detection transformer
- auto-das automated proxy discovery for training-free distillation-aware architec
- auto-gas automated proxy discovery for training-free generative architecture sea
- avatar fingerprinting for authorized use of synthetic talking-head videos | arXiv: 2305.03713
- bad students make great teachers active learning accelerates large-scale visual
- bad-gaussians bundle adjusted deblur gaussian splatting | arXiv: 2403.11831
- bam-detr boundary-aligned moment detection transformer for temporal sentence gro | arXiv: 2312.00083
- bamm bidirectional autoregressive motion model | arXiv: 2403.19435
- basic bayesnet structure learning for computational scalable neural image compre
- bayesian evidential deep learning for online action detection
- be yourself bounded attention for multi-subject text-to-image generation | arXiv: 2403.16990
- beaf observing before-after changes to evaluate hallucination in vision-language | arXiv: 2407.13442
- beat-it beat-synchronized multi-condition 3d dance generation | arXiv: 2407.07554
- benchmarks and challenges in pose estimation for egocentric hand interactions wi | arXiv: 2403.16428
- benerf neural radiance fields from a single blurry image and event stream | arXiv: 2407.02174
- beta-tuned timestep diffusion model
- bi-directional contextual attention for 3d dense captioning | arXiv: 2408.06662
- bi-mdrg bridging image history in multimodal dialogue response generation | arXiv: 2408.05926
- bi-tta bidirectional test-time adapter for remote physiological measurement | arXiv: 2409.17316
- bidirectional stereo image compression with cross-dimensional entropy model | arXiv: 2407.10632
- bidirectional uncertainty-based active learning for open-set annotation | arXiv: 2402.15198
- binomial self-compensation for motion error in dynamic 3d scanning | arXiv: 2404.06693
- blazebvd make scale-time equalization great again for blind video deflickering | arXiv: 2403.06243
- blind image deblurring with noise-robust kernel estimation
- blink multimodal large language models can see but not perceive | arXiv: 2404.12390
- boosting 3d single object tracking with 2d matching distillation and 3d pre-trai
- brain netflix scaling data to reconstruct videos from brain signals
- brain-id learning contrast-agnostic anatomical representations for brain imaging | arXiv: 2311.16914
- brave broadening the visual encoding of vision-language models | arXiv: 2404.07204
- bridge past and future overcoming information asymmetry in incremental object de | arXiv: 2407.11499
- bridging the gap between human motion and action semantics via kinematic phrases
- bridging the gap studio-like avatar creation from a monocular phone capture | arXiv: 2407.19593
- brushnet a plug-and-play image inpainting model with decomposed dual-branch diff | arXiv: 2403.06976
- byteedit boost comply and accelerate generative image editing | arXiv: 2404.04860
- caesarnerf calibrated semantic representation for few-shot generalizable neural | arXiv: 2311.15510
- camera height doesn't change unsupervised training for metric monocular road- | arXiv: 2312.04530
- can ood object detectors learn from foundation models | arXiv: 2409.05162
- canonicalfusion generating drivable 3d human avatars from multiple images | arXiv: 2407.04345
- cardiacnet learning to reconstruct abnormalities for cardiac disease assessment | arXiv: 2410.20769
- carformer self-driving with learned object-centric representations | arXiv: 2407.15843
- cat enhancing multimodal large language model to answer questions in dynamic aud
- category adaptation meets projected distillation in generalized continual catego | arXiv: 2308.12112
- cg-slam efficient dense rgb-d slam in a consistent uncertainty-aware 3d gaussian | arXiv: 2403.16095
- challenging forgets unveiling the worst-case forget sets in machine unlearning | arXiv: 2403.07362
- chameleon a data-efficient generalist for dense visual prediction in the wild | arXiv: 2404.18459
- chex interactive localization and region description in chest x-rays | arXiv: 2404.15770
- citygaussian real-time high-quality large-scale scene rendering with gaussians | arXiv: 2404.01133
- clap isolating content from style through contrastive learning with augmented pr | arXiv: 2311.16445
- classification matters improving video action detection with class-specific atte | arXiv: 2407.19698
- click-gaussian interactive segmentation to any 3d gaussians | arXiv: 2407.11793
- clip-guided generative networks for transferable targeted adversarial attacks | arXiv: 2407.10179
- cloudfixer test-time adaptation for 3d point clouds via diffusion-guided geometr
- clr-gan improving gans stability and quality via consistent latent representatio
- co-synthesis of histopathology nuclei image-label pairs using a context-conditio | arXiv: 2407.14434
- coherentgs sparse novel view synthesis with coherent 3d gaussians | arXiv: 2403.19495
- coho context-sensitive city-scale hierarchical urban layout generation | arXiv: 2407.11294
- coin control-inpainting diffusion prior for human and camera motion estimation | arXiv: 2408.16426
- coin-matting confounder intervention for image matting
- cola conditional dropout and language-driven robust dual-modal salient object de | arXiv: 2407.06780
- coleaf a contrastive-collaborative learning framework for weakly supervised audi | arXiv: 2405.10690
- collaborative control for geometry-conditioned pbr image generation | arXiv: 2402.05919
- colormae exploring data-independent masking strategies in masked autoencoders | arXiv: 2407.13036
- colormnet a memory-based deep spatial-temporal feature propagation network for v | arXiv: 2404.06251
- colorpeel color prompt learning with diffusion models via color and shape disent | arXiv: 2407.07197
- combining generative and geometry priors for wide-angle portrait correction | arXiv: 2410.09911
- comboverse compositional 3d assets creation using spatially-aware diffusion guid | arXiv: 2403.12409
- como controllable motion generation through language guided pose code editing | arXiv: 2403.13900
- compress3d a compressed latent space for 3d generation from a single image | arXiv: 2403.13524
- confidence self-calibration for multi-label class-incremental learning | arXiv: 2403.12559
- congeo robust cross-view geo-localization across ground view variations | arXiv: 2403.13965
- contourlet residual for prompt learning enhanced infrared image super-resolution
- controllable navigation instruction generation with chain of thought prompting | arXiv: 2407.07433
- controlling the world by sleight of hand | arXiv: 2408.07147
- controlllm augment language models with tools by searching on graphs | arXiv: 2310.17796
- controlnet++ improving conditional controls with efficient consistency feedback | arXiv: 2404.07987
- cor-gs sparse-view 3d gaussian splatting via co-regularization | arXiv: 2405.12110
- cores orchestrating the dance of reasoning and segmentation | arXiv: 2404.05673
- cpm class-conditional prompting machine for audio-visual segmentation | arXiv: 2407.05358
- crm single image to 3d textured mesh with convolutional reconstruction model | arXiv: 2403.05034
- cross-domain learning for video anomaly detection with limited supervision | arXiv: 2408.05191
- cross-platform video person reid a new benchmark dataset and adaptation approach | arXiv: 2408.07500
- crossglg llm guides one-shot skeleton-based 3d action recognition in a cross-lev | arXiv: 2403.10082
- crossscore towards multi-view image evaluation and scoring | arXiv: 2404.14409
- cs2k class-specific and class-shared knowledge guidance for incremental semantic | arXiv: 2407.09047
- csot cross-scan object transfer for semi-supervised lidar object detection
- cut out the middleman revisiting pose-based gait recognition
- d-sco dual-stream conditional diffusion for monocular hand-held object reconstru | arXiv: 2311.14189
- damsdet dynamic adaptive multispectral detection transformer with competitive qu | arXiv: 2403.00326
- data collection-free masked video modeling | arXiv: 2409.06665
- dataset enhancement with instance-level augmentations | arXiv: 2406.08249
- dataset growth | arXiv: 2405.18347
- datenerf depth-aware text-based editing of nerfs | arXiv: 2404.04526
- dc-solver improving predictor-corrector diffusion sampler via dynamic compensati | arXiv: 2409.03755
- dcdm diffusion-conditioned-diffusion model for scene text image super-resolution
- de-confounded gaze estimation
- deblur e-nerf nerf from motion-blurred events under high-speed or low-light cond | arXiv: 2409.17988
- deceptive-nerf3dgs diffusion-generated pseudo-observations for high-quality spar | arXiv: 2305.15171
- decomposed vector-quantized variational autoencoder for human grasp generation | arXiv: 2407.14062
- decoupling common and unique representations for multimodal self-supervised lear | arXiv: 2309.05300
- deep cost ray fusion for sparse depth video completion | arXiv: 2409.14935
- deep nets with subsampling layers unwittingly discard useful activations at test | arXiv: 2410.01083
- deep patch visual slam | arXiv: 2408.01654
- defect spectrum a granular look of large-scale defect datasets with rich semanti
- denoisplit a method for joint microscopy image splitting and unsupervised denois | arXiv: 2403.11854
- DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | arXiv: 2403.19588
- detailsemnet elevating signature verification through detail-semantic integratio | arXiv: 2511.16364
- detecting as labeling rethinking lidar-camera fusion in 3d object detection | arXiv: 2311.07152
- dg-pic domain generalized point-in-context learning for point cloud understandin | arXiv: 2407.08801
- diff-tracker text-to-image diffusion models are unsupervised trackers | arXiv: 2407.08394
- differentiable convex polyhedra optimization from multi-view images | arXiv: 2407.15686
- diffit diffusion vision transformers for image generation | arXiv: 2312.02139
- diffusion model is a good pose estimator from 3d rf-vision | arXiv: 2403.16198
- diffusion models for monocular depth estimation overcoming challenging condition | arXiv: 2407.16698
- diffusion models for open-vocabulary segmentation | arXiv: 2306.09316
- diffusion-based image-to-image translation by noise correction via prompt interp | arXiv: 2409.08077
- diffusion-driven data replay a novel approach to combat forgetting in federated | arXiv: 2409.01128
- diffusiondepth diffusion denoising approach for monocular depth estimation | arXiv: 2303.05021
- dino-tracker taming dino for self-supervised point tracking in a single video | arXiv: 2403.14548
- disco embodied navigation and interaction via differentiable scene semantics and | arXiv: 2407.14758
- distill gold from massive ores bi-level data pruning towards efficient dataset d | arXiv: 2305.18381
- distilling diffusion models into conditional gans | arXiv: 2405.05967
- distribution alignment for fully test-time adaptation with dynamic online data s | arXiv: 2407.12128
- distribution-aware robust learning from long-tailed data with noisy labels | arXiv: 2407.16802
- divide and fuse body part mesh recovery from partially visible human images | arXiv: 2407.09694
- domain reduction strategy for non-line-of-sight imaging | arXiv: 2308.10269
- domain-adaptive video deblurring via test-time blurring | arXiv: 2407.09059
- domesticating sam for breast ultrasound image segmentation via spatial-frequency
- draganything motion control for anything using entity representation | arXiv: 2403.07420
- dragapart learning a part-level motion prior for articulated objects | arXiv: 2403.15382
- dreamdiffusion high-quality eeg-to-image generation with temporal masked signal
- dreamdissector learning disentangled text-to-3d generation from 2d diffusion pri | arXiv: 2407.16260
- dreamdrone text-to-image diffusion models are zero-shot perpetual view generator | arXiv: 2312.08746
- dreamlip language-image pre-training with long captions | arXiv: 2403.17007
- dreammotion space-time self-similar score distillation for zero-shot video editi | arXiv: 2403.12002
- dreammover leveraging the prior of diffusion models for image interpolation with | arXiv: 2409.09605
- dreamscene360 unconstrained text-to-3d scene generation with panoramic gaussian | arXiv: 2404.06903
- dreamstruct understanding slides and user interfaces via synthetic data generati | arXiv: 2410.00201
- dreamview injecting view-specific text guidance into text-to-3d generation | arXiv: 2404.06119
- dropout mixture low-rank adaptation for visual parameters-efficient fine-tuning
- dspdet3d 3d small object detection with dynamic spatial pruning | arXiv: 2305.03716
- dual-level adaptive self-labeling for novel class discovery in point cloud segme | arXiv: 2407.12489
- dvlo deep visual-lidar odometry with local-to-global feature fusion and bi-direc | arXiv: 2403.18274
- dynamic neural radiance field from defocused monocular video | arXiv: 2407.05586
- dyset a dynamic masked self-distillation approach for robust trajectory predicti
- eaformer scene text segmentation with edge-aware transformers | arXiv: 2407.17020
- early preparation pays off new classifier pre-tuning for class incremental seman | arXiv: 2407.14142
- ebdm exemplar-guided image translation with brownian-bridge diffusion models | arXiv: 2410.09802
- echoscene indoor scene generation via information echo over scene graph diffusio | arXiv: 2405.00915
- edformer transformer-based event denoising across varied noise levels
- editable image elements for controllable synthesis | arXiv: 2404.16029
- edtalk efficient disentanglement for emotional talking head synthesis | arXiv: 2404.01647
- efficient and versatile robust fine-tuning of zero-shot models | arXiv: 2408.05749
- efficient cascaded multiscale adaptive network for image restoration
- efficient depth-guided urban view synthesis | arXiv: 2407.12395
- efficient diffusion transformer with step-wise dynamic attention mediators | arXiv: 2408.05710
- efficient few-shot action recognition via multi-level post-reasoning
- efficient image pre-training with siamese cropped masked autoencoders | arXiv: 2403.17823
- efficient inference of vision instruction-following models with elastic cache | arXiv: 2407.18121
- egoexo-fitness towards egocentric and exocentric full-body action understanding | arXiv: 2406.08877
- egoposer robust real-time egocentric pose estimation from sparse and intermitten | arXiv: 2308.06493
- elegantly written disentangling writer and character styles for enhancing online
- elevating all zero-shot sketch-based image retrieval through multimodal prompt l | arXiv: 2407.04207
- eliminating feature ambiguity for few-shot segmentation | arXiv: 2407.09842
- eliminating warping shakes for unsupervised online video stitching | arXiv: 2403.06378
- else efficient deep neural network inference through line-based sparsity explora
- elysium exploring object-level perception in videos via mllm | arXiv: 2403.16558
- emdm efficient motion diffusion model for fast and high-quality motion generatio | arXiv: 2312.02256
- energy-induced explicit quantification for multi-modality mri fusion
- enhancing diffusion models with text-encoder reinforcement learning | arXiv: 2311.15657
- enhancing optimization robustness in 1-bit neural networks through stochastic si
- enhancing perceptual quality in video super-resolution through temporally-consis | arXiv: 2311.15908
- enhancing vectorized map perception with historical rasterized maps | arXiv: 2409.00620
- equi-gspr equivariant se3 graph network model for sparse point cloud registratio | arXiv: 2410.05729
- equivariant spatio-temporal self-supervision for lidar object detection | arXiv: 2404.11737
- et the exceptional trajectories text-to-camera-trajectory generation with charac | arXiv: 2407.01516
- eta inversion designing an optimal eta function for diffusion-based real image e | arXiv: 2403.09468
- evaluating text-to-visual generation with image-to-text generation | arXiv: 2404.01291
- event trojan asynchronous event-based backdoor attacks | arXiv: 2407.06838
- event-based head pose estimation benchmark and method
- event-based mosaicing bundle adjustment | arXiv: 2409.07365
- evsign sign language recognition and translation with streaming events | arXiv: 2407.12593
- exemplar-free continual representation learning via learnable drift compensation | arXiv: 2407.08536
- explicitly guided information interaction network for cross-modal point cloud co | arXiv: 2407.02887
- exploiting dual-correlation for multi-frame time-of-flight denoising
- exploring guided sampling of conditional gans
- exploring pre-trained text-to-video diffusion models for referring video object | arXiv: 2403.12042
- exploring the feature extraction and relation modeling for light-weight transfor
- external knowledge enhanced 3d scene generation from sketch | arXiv: 2403.14121
- eyes closed safety on protecting multimodal llms via image-to-text transformatio | arXiv: 2403.09572
- facial affective behavior analysis with instruction tuning | arXiv: 2404.05052
- falip visual prompt as foveal attention boosts clip zero-shot performance | arXiv: 2407.05578
- fastcad real-time cad retrieval and alignment from scans and videos | arXiv: 2403.15161
- fine-grained scene graph generation via sample-level bias prediction | arXiv: 2407.19259
- finematch aspect-based fine-grained image and text mismatch detection and correc | arXiv: 2404.14715
- finepseudo improving pseudo-labelling through temporal-alignablity for semi-supe | arXiv: 2409.01448
- fisher calibration for backdoor-robust heterogeneous federated learning
- fisherrf active view selection and mapping with radiance fields using fisher inf
- flash cache reducing bias in radiance cache based inverse rendering | arXiv: 2409.05867
- flashsplat 2d to 3d gaussian splatting segmentation solved optimally | arXiv: 2409.08270
- flashtex fast relightable mesh texturing with lightcontrolnet | arXiv: 2402.13251
- flat flux-aware imperceptible adversarial attacks on 3d point clouds
- flexattention for efficient high-resolution vision-language models | arXiv: 2407.20228
- flowcon out-of-distribution detection using flow-based contrastive learning | arXiv: 2407.03489
- flying with photons rendering novel views of propagating light | arXiv: 2404.06493
- forest2seq revitalizing order prior for sequential indoor scene synthesis | arXiv: 2407.05388
- formula-supervised visual-geometric pre-training | arXiv: 2409.13535
- foster adaptivity and balance in learning with noisy labels | arXiv: 2407.02778
- foundpose unseen object pose estimation with foundation features | arXiv: 2311.18809
- fouriscale a frequency perspective on training-free high-resolution image synthe | arXiv: 2403.12963
- free-viewpoint video of outdoor sports using a flying camera
- freeaugment data augmentation search across all degrees of freedom | arXiv: 2409.04820
- freecompose generic zero-shot image composition with diffusion prior | arXiv: 2407.04947
- freediff progressive frequency truncation for image editing with diffusion model | arXiv: 2404.11895
- freeinit bridging initialization gap in video diffusion models | arXiv: 2312.07537
- freemotion a unified framework for number-free text-to-motion synthesis | arXiv: 2405.15763
- freemotion mocap-free human motion synthesis with multimodal large language mode | arXiv: 2406.10740
- freestyleret retrieving images from style-diversified queries | arXiv: 2312.02428
- frequency-spatial entanglement learning for camouflaged object detection | arXiv: 2409.01686
- frest feature restoration for semantic segmentation under multiple adverse condi | arXiv: 2407.13437
- fsd-bev foreground self-distillation for multi-view 3d object detection | arXiv: 2407.10135
- fully sparse 3d occupancy prediction | arXiv: 2312.17118
- functional transform-based low-rank tensor factorization for multi-dimensional d
- funqa towards surprising video comprehension | arXiv: 2306.14899
- futuredepth learning to predict the future improves video depth estimation | arXiv: 2403.12953
- g2fr frequency regularization in grid-based feature encoding neural radiance fie
- g3r gradient guided generalizable reconstruction | arXiv: 2409.19405
- garmentaligner text-to-garment generation via retrieval-augmented multi-level co | arXiv: 2408.12352
- gaura generalizable approach for unified restoration and rendering of arbitrary | arXiv: 2407.08221
- gaussctrl multi-view consistent text-driven 3d gaussian splatting editing | arXiv: 2403.08733
- gaussian grouping segment and edit anything in 3d scenes | arXiv: 2312.00732
- gaussianformer scene as gaussians for vision-based 3d semantic occupancy predict | arXiv: 2405.17429
- gaussianimage 1000 fps image representation and compression by 2d gaussian splat | arXiv: 2403.08551
- gaussreg fast 3d registration with gaussian splatting | arXiv: 2407.05254
- gaze target detection based on head-local-global coordination
- gazexplain learning to predict natural language explanations of visual scanpaths
- general and task-oriented video segmentation | arXiv: 2407.06540
- generalizable facial expression recognition | arXiv: 2408.10614
- generating 3d house wireframes with semantics | arXiv: 2407.12267
- generating human interaction motions in scenes with text control | arXiv: 2404.10685
- generative camera dolly extreme monocular dynamic novel view synthesis | arXiv: 2405.14868
- genixer empowering multimodal large language model as a powerful data generator | arXiv: 2312.06731
- genq quantization in low data regimes with generative synthetic data | arXiv: 2312.05272
- geometrysticker enabling ownership claim of recolorized neural radiance fields | arXiv: 2407.13390
- geowizard unleashing the diffusion priors for 3d geometry estimation from a sing | arXiv: 2403.12013
- getting it right improving spatial consistency in text-to-image models | arXiv: 2404.01197
- git towards generalist vision transformer through universal language interface | arXiv: 2403.09394
- gkgnet group k-nearest neighbor based graph convolutional network for multi-labe | arXiv: 2308.14378
- global-to-pixel regression for human mesh recovery
- goldfish vision-language understanding of arbitrarily long videos | arXiv: 2407.12679
- gpsformer a global perception and local structure fitting-based transformer for | arXiv: 2407.13519
- gra detecting oriented objects through group-wise rotating and attention | arXiv: 2403.11127
- grace graph-based contextual debiasing for fair visual question answering
- gradient-regularized out-of-distribution detection | arXiv: 2404.12368
- graphbev towards robust bev feature alignment for multi-modal 3d object detectio | arXiv: 2403.11848
- graspxl generating grasping motions for diverse objects at scale | arXiv: 2403.19649
- grm large gaussian reconstruction model for efficient 3d reconstruction and gene | arXiv: 2403.14621
- groma localized visual tokenization for grounding multimodal large language mode | arXiv: 2404.13013
- grounding language models for visual entity recognition | arXiv: 2402.18695
- gs-lrm large reconstruction model for 3d gaussian splatting | arXiv: 2404.19702
- gs-pose category-level object pose estimation via geometric and semantic corresp | arXiv: 2311.13777
- gtp-4o modality-prompted heterogeneous graph learning for omni-modal biomedical | arXiv: 2407.05540
- gvgen text-to-3d generation with volumetric representation | arXiv: 2403.12957
- h-v2x a large scale highway dataset for bev perception
- hac hash-grid assisted context for 3d gaussian splatting compression | arXiv: 2403.14530
- handling the non-smooth challenge in tensor svd a multi-objective tensor recover | arXiv: 2311.13958
- harnessing text-to-image diffusion models for category-agnostic pose estimation
- hat history-augmented anchor transformer for online temporal action localization | arXiv: 2408.06437
- headgas real-time animatable head avatars via 3d gaussian splatting | arXiv: 2312.02902
- heterogeneous graph learning for scene graph prediction in 3d point clouds
- hiding imperceptible noise in curvature-aware patches for 3d point cloud attack
- hiei a universal framework for generating high-quality emerging images from natu
- hierarchical temporal context learning for camera-based semantic scene completio | arXiv: 2407.02077
- hierarchically structured neural bones for reconstructing animatable objects fro | arXiv: 2408.00351
- high-fidelity 3d textured shapes generation by sparse encoding and adversarial d
- high-precision self-supervised monocular depth estimation with rich-resource pri | arXiv: 2408.00361
- high-resolution and few-shot view synthesis from asymmetric dual-lens inputs
- himo a new benchmark for full-body human interacting with multiple objects | arXiv: 2407.12371
- how video meetings change your expression | arXiv: 2406.00955
- hpe-li wifi-enabled lightweight dual selective kernel convolution for human pose
- hpff hierarchical locally supervised learning with patch feature fusion | arXiv: 2407.05638
- human hair reconstruction with strand-aligned 3d gaussians
- human motion forecasting in dynamic domain shifts a homeostatic continual test-t
- humos human motion model conditioned on body shape | arXiv: 2409.03944
- hybridbooth hybrid prompt inversion for efficient subject-driven generation | arXiv: 2410.08192
- hydra a hyper agent for dynamic compositional visual reasoning | arXiv: 2403.12884
- hyperion - a fast versatile symbolic gaussian belief propagation framework for c | arXiv: 2407.07074
- i can't believe it's not scene flow
- i-medsam implicit medical image segmentation with segment anything | arXiv: 2311.17081
- i2-slam inverting imaging process for robust photorealistic dense slam
- iam-vfi interpolate any motion for video frame interpolation with motion complex
- idempotent unsupervised representation learning for skeleton-based action recogn | arXiv: 2410.20349
- idol unified dual-modal latent diffusion for human-centric joint video-depth gen | arXiv: 2407.10937
- image demoiréing in raw and srgb domains
- image-feature weak-to-strong consistency an enhanced paradigm for semi-supervise | arXiv: 2408.12614
- imaging interiors an implicit solution to electromagnetic inverse scattering pro | arXiv: 2407.09352
- implicit concept removal of diffusion models | arXiv: 2310.05873
- implicit filtering for learning neural signed distance functions from 3d point c | arXiv: 2407.13342
- implicit style-content separation using b-lora | arXiv: 2403.14572
- improving 2d feature representations by 3d-aware fine-tuning | arXiv: 2407.20229
- improving agent behaviors with rl fine-tuning for autonomous driving | arXiv: 2409.18343
- improving domain generalization in self-supervised monocular depth estimation vi | arXiv: 2411.02149
- improving intervention efficacy via concept realignment in concept bottleneck mo | arXiv: 2405.01531
- improving knowledge distillation via regularizing feature direction and norm
- improving medical multi-modal contrastive learning with expert annotations | arXiv: 2403.10153
- improving point-based crowd counting and localization based on auxiliary point g | arXiv: 2405.10589
- improving zero-shot generalization for clip with variational adapter
- infinite-id identity-preserved personalization via id-semantics decoupling parad | arXiv: 2403.11781
- infmae a foundation model in the infrared modality | arXiv: 2402.00407
- instance-dependent noisy-label learning with graphical model based noise-rate es | arXiv: 2305.19486
- integrating markov blanket discovery into causal representation learning for dom
- interactive 3d object detection with prompts
- interleaving one-class and weakly-supervised models with adaptive thresholding f | arXiv: 2401.13551
- intrinsic single-image hdr reconstruction | arXiv: 2409.13803
- invertible neural warp for nerf | arXiv: 2407.12354
- irgen generative modeling for image retrieval | arXiv: 2303.10126
- is retain set all you need in machine unlearning restoring performance of unlear | arXiv: 2404.12922
- is user feedback always informative retrieval latent defending for semi-supervis | arXiv: 2407.15383
- isomorphic pruning for vision models | arXiv: 2407.04616
- ittakestwo leveraging peer representations for semi-supervised lidar semantic se | arXiv: 2407.07171
- ivtp instruction-guided visual token pruning for large vision-language models
- joint rgb-spectral decomposition model guided image enhancement in mobile photog | arXiv: 2407.17996
- jointdreamer ensuring geometry consistency and text congruence in text-to-3d gen | arXiv: 2407.12291
- kalman-inspired feature propagation for video face super-resolution | arXiv: 2408.05205
- l-differ single image reflection removal with language-based diffusion model
- label-anticipated event disentanglement for audio-visual video parsing | arXiv: 2407.08126
- lagrangian hashing for compressed neural field representations | arXiv: 2409.05334
- lami-detr open-vocabulary detection with language model instruction | arXiv: 2407.11335
- language-driven 6-dof grasp detection using negative prompt guidance | arXiv: 2407.13842
- lapose laplacian mixture shape modeling for rgb-based category-level object pose | arXiv: 2409.15727
- lara efficient large-baseline radiance fields | arXiv: 2407.04699
- large motion model for unified multi-modal motion generation | arXiv: 2404.01284
- lass3d language-assisted semi-supervised 3d semantic segmentation with progressi
- latent guard a safety framework for text-to-image generation | arXiv: 2404.08031
- latent-inr a flexible framework for implicit representations of videos with disc | arXiv: 2408.02672
- layeredflow a real-world benchmark for non-lambertian multi-layer optical flow | arXiv: 2409.05688
- layoutdetr detection transformer is a good multimodal layout designer | arXiv: 2212.09877
- lazy diffusion transformer for interactive image editing | arXiv: 2404.12382
- lcm-lookahead for encoder-based text-to-image personalization | arXiv: 2404.03620
- learn from the learnt source-free active domain adaptation via contrastive sampl | arXiv: 2407.18899
- learning 3d geometry and feature consistent gaussian splatting for object remova | arXiv: 2404.13679
- learning 3d-aware gans from unposed images with template feature field | arXiv: 2404.05705
- learning anomalies with normality prior for unsupervised video anomaly detection
- learning camouflaged object detection from noisy pseudo label | arXiv: 2407.13157
- learning chain of counterfactual thought for bias-robust vision-language reasoni
- learning cross-hand policies of high-dof reaching and grasping | arXiv: 2404.09150
- learning differentially private diffusion models via stochastic adversarial dist | arXiv: 2408.14738
- learning exhaustive correlation for spectral super-resolution where spatial-spec | arXiv: 2312.12833
- learning from the web language drives weakly-supervised incremental learning for | arXiv: 2407.13363
- learning representations of satellite images from metadata supervision
- learning semantic latent directions for accurate and controllable human motion p | arXiv: 2407.11494
- learning to generate conditional tri-plane for 3d-aware expression controllable | arXiv: 2404.00636
- learning to obstruct few-shot image classification over restricted classes | arXiv: 2409.19210
- learning to robustly reconstruct dynamic scenes from low-light spike streams
- learning trimodal relation for audio-visual question answering with missing moda | arXiv: 2407.16171
- lego learning egocentric action frame generation via visual instruction tuning | arXiv: 2312.03849
- lego learning to disentangle and invert personalized concepts beyond object appe | arXiv: 2311.13833
- leia latent view-invariant embeddings for implicit 3d articulation | arXiv: 2409.06703
- leveraging hierarchical feature sharing for efficient dataset condensation | arXiv: 2310.07506
- leveraging temporal contextualization for video action recognition | arXiv: 2404.09490
- lgm large multi-view gaussian model for high-resolution 3d content creation | arXiv: 2402.05054
- lidar-event stereo fusion with hallucinations | arXiv: 2408.04633
- lift a surprisingly simple lightweight feature transform for dense vit descripto | arXiv: 2403.14625
- linearly controllable gan unsupervised feature categorization and decomposition
- listen to look into the future audio-visual egocentric gaze anticipation | arXiv: 2305.03907
- livehps robust and coherent motion capture in dynamic free environment | arXiv: 2407.09833
- livephoto real image animation with text-guided motion control | arXiv: 2312.02928
- llm as copilot for coarse-grained vision-and-language navigation
- ln3diff scalable latent neural fields diffusion for speedy 3d generation | arXiv: 2403.12019
- loa-trans enhancing visual grounding by location-aware transformers
- local action-guided motion diffusion model for text-to-motion generation | arXiv: 2407.10528
- local all-pair correspondence for point tracking
- long-tail temporal action segmentation with group-wise temporal logit adjustment | arXiv: 2408.09919
- m&m's a benchmark to evaluate tool-use for multi-step multi-modal tasks
- m2d2m multi-motion generation from text with discrete diffusion models | arXiv: 2407.14502
- macdiff unified skeleton modeling with masked conditional diffusion | arXiv: 2409.10473
- magdiff multi-alignment diffusion for high-fidelity video generation and editing | arXiv: 2311.17338
- magiceraser erasing any objects via semantics-aware control | arXiv: 2410.10207
- magr manifold-aligned graph regularization for continual action quality assessme | arXiv: 2403.04398
- mahalanobis distance-based multi-view optimal transport for multi-view crowd loc | arXiv: 2409.01726
- mambair a simple baseline for image restoration with state-space model | arXiv: 2402.15648
- manikin biomechanically accurate neural inverse kinematics for human motion esti
- mapdistill boosting efficient camera-based hd map construction via camera-lidar | arXiv: 2407.11682
- maptracker tracking with strided memory fusion for consistent vector hd mapping | arXiv: 2403.15951
- marineinst a foundation model for marine image analysis with instance visual des
- mariner enhancing novel views by matching rendered images with nearby references | arXiv: 2407.13745
- marvelovd marrying object recognition and vision-language models for robust open | arXiv: 2407.21465
- masked angle-aware autoencoder for remote sensing images | arXiv: 2408.01946
- masked video and body-worn imu autoencoder for egocentric action recognition | arXiv: 2407.06628
- mathverse does your multi-modal llm truly see the diagrams in visual math proble | arXiv: 2403.14624
- megascenes scene-level view synthesis at scale | arXiv: 2406.11819
- membn robust test-time adaptation via batch norm with statistics memory
- memory-efficient fine-tuning for quantized diffusion model | arXiv: 2401.04339
- merlin empowering multimodal llms with foresight minds
- merlin single-shot material estimation and relighting for photometric stereo | arXiv: 2409.00674
- mesh2nerf direct mesh supervision for neural radiance field representation and g | arXiv: 2403.19319
- meshfeat multi-resolution features for neural fields on meshes | arXiv: 2407.13592
- meta-prompting for automating zero-shot visual recognition with llms | arXiv: 2403.11755
- metaaug meta-data augmentation for post-training quantization | arXiv: 2407.14726
- migs multi-identity gaussian splatting via tensor decomposition | arXiv: 2407.07284
- milliflow scene flow estimation on mmwave radar point cloud for human motion sen | arXiv: 2306.17010
- mixdq memory-efficient few-step text-to-image diffusion models with metric-decou | arXiv: 2405.17873
- mm1 methods analysis and insights from multimodal llm pre-training
- mmbench is your multi-modal model an all-around player | arXiv: 2307.06281
- modeling and driving human body soundfields through acoustic primitives | arXiv: 2407.13083
- moe-diffir task-customized diffusion priors for universal compressed image resto | arXiv: 2407.10833
- mofa-video controllable image animation via generative motion field adaptions in | arXiv: 2405.20222
- momentum auxiliary network for supervised local learning | arXiv: 2407.05623
- monocular occupancy prediction for scalable indoor scenes | arXiv: 2407.11730
- monowad weather-adaptive diffusion model for robust monocular 3d object detectio | arXiv: 2407.16448
- motion mamba efficient and long sequence motion generation | arXiv: 2403.07487
- motion-prior contrast maximization for dense continuous-time motion estimation | arXiv: 2407.10802
- motionchain conversational motion controllers via multimodal prompts | arXiv: 2404.01700
- motionlcm real-time controllable motion generation via latent consistency model | arXiv: 2404.19759
- multi-hmr multi-person whole-body human mesh recovery in a single shot | arXiv: 2402.14654
- multi-label cluster discrimination for visual representation learning | arXiv: 2407.17331
- multi-memory matching for unsupervised visible-infrared person re-identification | arXiv: 2401.06825
- multi-person pose forecasting with individual interaction perceptron and prior l
- multigen zero-shot image generation from multi-modal prompts
- mutdet mutually optimizing pre-training for remote sensing object detection | arXiv: 2407.09920
- mutual learning for acoustic matching and dereverberation via visual scene-drive | arXiv: 2407.10373
- mvdd multi-view depth diffusion models | arXiv: 2312.04875
- mvdiffusion a dense high-resolution multi-view diffusion model for single or spa | arXiv: 2402.12712
- MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo | arXiv: 2405.12218
- mvsplat efficient 3d gaussian splatting from sparse multi-view images | arXiv: 2403.14627
- myvlm personalizing vlms for user-specific queries | arXiv: 2403.14599
- navgpt-2 unleashing navigational reasoning capability for large vision-language | arXiv: 2407.12366
- navigation instruction generation with bev perception and large language models | arXiv: 2407.15087
- NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration | arXiv: 2309.07322
- neural volumetric world models for autonomous driving
- neuroncap photorealistic closed-loop safety testing for autonomous driving | arXiv: 2404.07762
- neusdfusion a spatial-aware generative model for 3d shape completion reconstruct | arXiv: 2403.18241
- ngp-rt fusing multi-level hash features with lightweight attention for real-time | arXiv: 2407.10482
- nl2contact natural language guided 3d hand-object contact modeling with diffusio | arXiv: 2407.12727
- noise-assisted prompt learning for image forgery detection and localization
- non-parametric sensor noise modeling and synthesis
- nonverbal interaction detection | arXiv: 2407.08133
- novum neural object volumes for robust object classification | arXiv: 2305.14668
- nucraft crafting high resolution 3d semantic occupancy for unified 3d scene unde
- nymeria a massive collection of multimodal egocentric daily motion in the wild | arXiv: 2406.09905
- oapt offset-aware partition transformer for double jpeg artifacts removal | arXiv: 2408.11480
- object-aware nir-to-visible translation
- occgen generative multi-modal 3d occupancy prediction for autonomous driving | arXiv: 2404.15014
- occluded gait recognition with mixture of experts an action detection perspectiv
- occlusion handling in 3d human pose estimation with perturbed positional encodin | arXiv: 2405.17397
- occlusion-aware seamless segmentation | arXiv: 2407.02182
- occworld learning a 3d occupancy world model for autonomous driving | arXiv: 2311.16038
- octopus embodied vision-language programmer from environmental feedback | arXiv: 2310.08588
- ogni-dc robust depth completion with optimization-guided neural iterations | arXiv: 2406.11711
- olaf a plug-and-play framework for enhanced multi-object multi-part scene parsin | arXiv: 2411.02858
- omg occlusion-friendly personalized multi-concept generation in diffusion models | arXiv: 2403.10983
- omni-recon harnessing image-based rendering for general-purpose neural radiance | arXiv: 2403.11131
- omni6d large-vocabulary 3d object dataset for category-level 6d object pose esti | arXiv: 2409.18261
- omnisat self-supervised modality fusion for earth observation | arXiv: 2404.08351
- omnissr zero-shot omnidirectional image super-resolution using stable diffusion | arXiv: 2404.10312
- omniview-tuning boosting viewpoint invariance of vision-language pre-training mo | arXiv: 2404.12139
- on calibration of object detectors pitfalls evaluation and baselines | arXiv: 2405.20459
- on the error analysis of 3d gaussian splatting and an optimal projection strateg | arXiv: 2402.00752
- on the utility of 3d hand poses for action recognition | arXiv: 2403.09805
- one-stage prompt-based continual learning | arXiv: 2402.16189
- onerestore a universal restoration framework for composite degradation | arXiv: 2407.04621
- onetrack demystifying the conflict between detection and tracking in end-to-end
- online temporal action localization with memory-augmented transformer | arXiv: 2408.02957
- open object-wise position embedding for multi-view 3d object detection | arXiv: 2407.10753
- open vocabulary 3d scene understanding via geometry guided self-distillation | arXiv: 2407.13362
- open-vocabulary 3d semantic segmentation with text-to-image diffusion models | arXiv: 2407.13642
- openkd opening prompt diversity for zero- and few-shot keypoint detection | arXiv: 2409.19899
- openpsg open-set panoptic scene graph generation via large multimodal models | arXiv: 2407.11213
- operational open-set recognition and postmax refinement
- ophnet a large-scale video benchmark for ophthalmic surgical workflow understand | arXiv: 2406.07471
- optimizing diffusion models for joint trajectory prediction and controllable gen | arXiv: 2408.00766
- optimizing factorized encoder models time and memory reduction for scalable and
- optimizing illuminant estimation in dual-exposure hdr imaging
- overcoming distribution mismatch in quantizing image super-resolution networks | arXiv: 2307.13337
- p2p-bridge diffusion bridges for 3d point cloud denoising | arXiv: 2408.16325
- pairwise distance distillation for unsupervised real-world image super-resolutio | arXiv: 2407.07302
- panofree tuning-free holistic multi-view image generation with cross-view self-g | arXiv: 2408.02157
- panovos bridging non-panoramic and panoramic views with transformer for video se | arXiv: 2309.12303
- papr training-free one-step patch pruning with lightweight convnets for faster i | arXiv: 2403.16020
- part2object hierarchical unsupervised 3d instance segmentation | arXiv: 2407.10084
- partcraft crafting creative objects by parts | arXiv: 2407.04604
- partstad 2d-to-3d part segmentation task adaptation | arXiv: 2401.05906
- pathology-knowledge enhanced multi-instance prompt learning for few-shot whole s | arXiv: 2407.10814
- pcf-lift panoptic lifting by probabilistic contrastive fusion | arXiv: 2410.10659
- per-gaussian embedding-based deformation for deformable 3d gaussian splatting | arXiv: 2404.03613
- petface a large-scale dataset and benchmark for animal identification | arXiv: 2407.13555
- physdreamer physics-based interaction with 3d objects via video generation | arXiv: 2404.13026
- pisr polarimetric neural implicit surface reconstruction for textureless and spe | arXiv: 2409.14331
- pite pixel-temporal alignment for large video-language model | arXiv: 2409.07239
- pixel-aware stable diffusion for realistic image super-resolution and personaliz | arXiv: 2308.14469
- pixel-gs density control with pixel-aware gradient for 3d gaussian splatting | arXiv: 2403.15530
- plain-det a plain multi-dataset object detector | arXiv: 2407.10083
- plan posture and go towards open-vocabulary text-to-motion generation
- plot text-based person search with part slot attention for corresponding part di | arXiv: 2409.13475
- poa pre-training once for models of all sizes | arXiv: 2408.01031
- point-supervised panoptic segmentation via estimating pseudo labels from learnab
- pointllm empowering large language models to understand point clouds | arXiv: 2308.16911
- ponymation learning articulated 3d animal motions from unlabeled online videos | arXiv: 2312.13604
- portrait4d-v2 pseudo multi-view data creates better 4d head synthesizer | arXiv: 2403.13570
- pose-aware self-supervised learning with viewpoint trajectory regularization | arXiv: 2403.14973
- posesor human pose can guide our attention
- posformer recognizing complex handwritten mathematical expression with position | arXiv: 2407.07764
- power variable projection for initialization-free large-scale bundle adjustment | arXiv: 2405.05079
- powerful and flexible personalized text-to-image generation via reinforcement le | arXiv: 2407.06642
- pq-sam post-training quantization for segment anything model
- prelar world model pre-training with learnable action representation
- preventing catastrophic overfitting in fast adversarial training a bi-level opti | arXiv: 2407.12443
- prioritized semantic learning for zero-shot instance navigation | arXiv: 2403.11650
- probabilistic weather forecasting with deterministic guidance-based diffusion mo
- prodepth boosting self-supervised multi-frame monocular depth with probabilistic | arXiv: 2407.09303
- progressive classifier and feature extractor adaptation for unsupervised domain | arXiv: 2311.16474
- progressive pretext task learning for human trajectory prediction | arXiv: 2407.11588
- projecting points to axes oriented object detection via point-axis representatio | arXiv: 2407.08489
- promerge prompt and merge for unsupervised instance segmentation | arXiv: 2409.18961
- promptccd learning gaussian mixture prompt pool for continual category discovery | arXiv: 2407.19001
- prompting future driven diffusion model for hand motion prediction
- prompting language-informed distribution for compositional zero-shot learning | arXiv: 2305.14428
- promptiqa boosting the performance and generalization for no-reference image qua | arXiv: 2403.04993
- propose assess search harnessing llms for goal-oriented planning in instructiona | arXiv: 2409.20557
- protecting nerfs' copyright via plug-and-play watermarking base model
- pyra parallel yielding re-activation for training-inference efficient task adapt | arXiv: 2403.09192
- quantized prompt for efficient generalization of vision-language models | arXiv: 2407.10704
- quar-vla vision-language-action model for quadruped robots | arXiv: 2312.14457
- querycdr query-based controllable distortion rectification network for fisheye i | arXiv: 2412.13496
- r2-bench benchmarking the robustness of referring perception models under pertur
- radedit stress-testing biomedical vision models via diffusion image editing | arXiv: 2312.12865
- radiative gaussian splatting for efficient x-ray novel view synthesis | arXiv: 2403.04116
- raindrop clarity a dual-focused dataset for day and night raindrop removal | arXiv: 2407.16957
- random walk on pixel manifolds for anomaly segmentation of complex driving scene | arXiv: 2404.17961
- rapid-seg range-aware pointwise distance distribution networks for 3d lidar segm | arXiv: 2407.10159
- raw-adapter adapting pre-trained visual model to camera raw images | arXiv: 2408.14802
- ray-distance volume rendering for neural scene reconstruction | arXiv: 2408.15524
- real-data-driven 2000 fps color video from mosaicked chromatic spikes
- realfred an embodied instruction following benchmark in photo-realistic environm | arXiv: 2407.18550
- realistic human motion generation with cross-diffusion models | arXiv: 2312.10993
- realviformer investigating attention for real-world video super-resolution | arXiv: 2407.13987
- reason2drive towards interpretable and chain-based reasoning for autonomous driv | arXiv: 2312.03661
- rebalancing using estimated class distribution for imbalanced semi-supervised le
- reconstruction and simulation of elastic objects with spring-mass 3d gaussians | arXiv: 2403.09434
- rectify the regression bias in long-tailed object detection | arXiv: 2401.15885
- referring atomic video action recognition | arXiv: 2407.01872
- regiondrag fast region-based image editing with diffusion models | arXiv: 2407.18247
- reground improving textual and spatial grounding at no cost | arXiv: 2403.13589
- rejection sampling imle designing priors for better few-shot image synthesis | arXiv: 2409.17439
- reliability in semantic segmentation can we use synthetic data | arXiv: 2312.09231
- reliable spatial-temporal voxels for multi-modal test-time adaptation
- reloo reconstructing humans dressed in loose garments from monocular video in th | arXiv: 2409.15269
- remamber referring image segmentation with mamba twister
- removing distributional discrepancies in captions improves image-text alignment | arXiv: 2410.00905
- renoise real image inversion through iterative noising | arXiv: 2403.14602
- repaint123 fast and high-quality one image to 3d generation with progressive con
- repose 3d human pose estimation via spatio-temporal depth relational consistency
- representing topological self-similarity using fractal feature maps for accurate | arXiv: 2407.14754
- reprojection errors as prompts for efficient scene coordinate regression | arXiv: 2409.04178
- resilience of entropy model in distributed neural networks | arXiv: 2403.00942
- responsible visual editing | arXiv: 2404.05580
- restoring images in adverse weather conditions via histogram transformer | arXiv: 2407.10172
- rethinking data augmentation for robust lidar semantic segmentation in adverse w | arXiv: 2407.02286
- rethinking data bias dataset copyright protection via embedding class-wise hidde
- rethinking image super-resolution from training data perspectives | arXiv: 2409.00768
- rethinking lidar domain generalization single source as multiple density domains | arXiv: 2312.12098
- rethinking unsupervised outlier detection via multiple thresholding | arXiv: 2407.05382
- rethinking video-text understanding retrieval from counterfactually augmented da | arXiv: 2407.13094
- revision rendering tools enable spatial fidelity in vision-language models | arXiv: 2408.02231
- revisiting supervision for continual representation learning | arXiv: 2311.13321
- rgnet a unified clip retrieval and grounding network for long videos | arXiv: 2312.06729
- ringid rethinking tree-ring watermarking for enhanced multi-key identification | arXiv: 2404.14055
- risk-aware self-consistent imitation learning for trajectory planning in autonom
- risurconv rotation invariant surface attention-augmented convolutions for 3d poi | arXiv: 2408.06110
- roadpainter points are ideal navigators for topology transformer | arXiv: 2407.15349
- robust calibration of large vision-language adapters | arXiv: 2407.13588
- robust fitting on a gate quantum computer | arXiv: 2409.02006
- robust-wide robust watermarking against instruction-driven image editing | arXiv: 2402.12688
- rodinhd high-fidelity 3d avatar generation with diffusion models | arXiv: 2407.06938
- roguenerf a robust geometry-consistent universal enhancer for nerf | arXiv: 2403.11909
- roofdiffusion constructing roofs from severely corrupted point data via diffusio | arXiv: 2404.09290
- rotary position embedding for vision transformer | arXiv: 2403.13298
- rpbg towards robust neural point-based graphics in the wild | arXiv: 2405.05663
- R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | arXiv: 2404.00801
- s3d-nerf single-shot speech-driven neural radiance field for high fidelity talki
- sa-dvae improving zero-shot skeleton-based action recognition by disentangled va | arXiv: 2407.13460
- safe-sim safety-critical closed-loop traffic simulation with diffusion-controlla | arXiv: 2401.00391
- safnet selective alignment fusion network for efficient hdr imaging | arXiv: 2407.16308
- sags structure-aware 3d gaussian splatting | arXiv: 2404.19149
- sair learning semantic-aware implicit representation | arXiv: 2310.09285
- sapiens foundation for human vision models | arXiv: 2408.12569
- sc4d sparse-controlled video-to-4d generation and motion transfer | arXiv: 2404.03736
- scalable group choreography via variational phase manifold learning | arXiv: 2407.18839
- scaledreamer scalable text-to-3d synthesis with asynchronous score distillation | arXiv: 2407.02040
- scaling backwards minimal synthetic pre-training | arXiv: 2408.00677
- scanreason empowering 3d visual grounding with reasoning capabilities | arXiv: 2407.01525
- scantalk 3d talking heads from unregistered scans | arXiv: 2403.10942
- scape a simple and strong category-agnostic pose estimator | arXiv: 2407.13483
- scatterformer efficient voxel transformer with scattered linear attention | arXiv: 2401.00912
- scenegraphloc cross-modal coarse visual localization on 3d scene graphs | arXiv: 2404.00469
- sceneverse scaling 3d vision-language learning for grounded scene understanding | arXiv: 2401.09340
- sclip rethinking self-attention for dense vision-language inference | arXiv: 2312.01597
- scpnet unsupervised cross-modal homography estimation via intra-modal self-super
- sea-raft simple efficient accurate raft for optical flow | arXiv: 2405.14793
- sediff structure extraction for domain adaptive depth estimation via denoising d
- see and think embodied agent in virtual environment | arXiv: 2311.15209
- seed a simple and effective 3d detr in point clouds | arXiv: 2407.10749
- seeing the unseen a frequency prompt guided transformer for image restoration | arXiv: 2404.00288
- seflow a self-supervised scene flow method in autonomous driving | arXiv: 2407.01702
- seggen supercharging segmentation models with text2mask and mask2img synthesis | arXiv: 2311.03355
- segmentation-guided layer-wise image vectorization with gradient fills | arXiv: 2408.15741
- segpoint segment any point cloud via large language model | arXiv: 2407.13761
- seit masked token modeling improves storage-efficient training | arXiv: 2312.10105
- select and distill selective dual-teacher knowledge transfer for continual learn | arXiv: 2403.09296
- self-adapting large visual-language models to edge devices across visual modalit | arXiv: 2403.04908
- self-supervised any-point tracking by contrastive random walks | arXiv: 2409.16288
- self-supervised co-salient object detection via feature correspondences at multi | arXiv: 2403.11107
- self-supervised feature adaptation for 3d industrial anomaly detection | arXiv: 2401.03145
- self-supervised video copy localization with regional token representation
- semantically guided representation learning for action anticipation | arXiv: 2407.02309
- semantichuman-hd high-resolution semantic disentangled 3d human generation | arXiv: 2403.10166
- semgrasp semantic grasp generation via language aligned discretization | arXiv: 2404.03590
- semi-supervised video desnowing network via temporal decoupling experts and dist | arXiv: 2410.07901
- semtrack a large-scale dataset for semantic tracking in the wild
- senc handling self-collision in neural cloth simulation | arXiv: 2407.12479
- sfpnet sparse focal point network for semantic segmentation on general lidar poi | arXiv: 2407.11569
- sgs-slam semantic gaussian splatting for neural dense slam | arXiv: 2402.03246
- shape-guided configuration-aware learning for endoscopic-image-based pose estima
- shapefusion a 3d diffusion model for localized shape editing | arXiv: 2403.19773
- sharegpt4v improving large multi-modal models with better captions | arXiv: 2311.12793
- shedding more light on robust classifiers under the lens of energy-based models | arXiv: 2407.06315
- shifted autoencoders for point annotation restoration in object counting | arXiv: 2312.07190
- shine saliency-aware hierarchical negative ranking for compositional temporal gr | arXiv: 2407.05118
- siamese vision transformers are scalable audio-visual learners | arXiv: 2403.19638
- sigma sinkhorn-guided masked video modeling | arXiv: 2407.15447
- signavatars a large-scale 3d sign language holistic motion dataset and benchmark | arXiv: 2310.20436
- silc improving vision language pretraining with self-distillation | arXiv: 2310.13355
- simpb a single model for 2d and 3d object detection from multiple cameras | arXiv: 2403.10353
- simple unsupervised knowledge distillation with space similarity | arXiv: 2409.13939
- sinder repairing the singular defects of dinov2 | arXiv: 2407.16826
- skymask attack-agnostic robust federated learning with fine-grained learnable ma | arXiv: 2312.12484
- slack semantic location and appearance aware open-vocabulary tracking | arXiv: 2409.11235
- sledge synthesizing driving environments with generative models and rule-based t | arXiv: 2403.17933
- slotlifter slot-guided feature lifting for learning object-centric radiance fiel | arXiv: 2408.06697
- smoodi stylized motion diffusion model | arXiv: 2407.12783
- soft prompt generation for domain generalization | arXiv: 2404.19286
- sos segment object system for open-world instance segmentation with object prior | arXiv: 2409.14627
- source prompt disentangled inversion for boosting image editability with diffusi | arXiv: 2403.11105
- spacejam a lightweight and regularization-free method for fast joint alignment o | arXiv: 2407.11850
- spamming labels efficient annotations for the trackers of tomorrow | arXiv: 2404.11426
- sparsessp 3d subcellular structure prediction from sparse-view transmitted light | arXiv: 2407.02159
- spatialformer towards generalizable vision transformers with explicit spatial un
- spatially-variant degradation model for dataset-free super-resolution | arXiv: 2407.08252
- spatio-temporal proximity-aware dual-path model for panoramic activity recogniti | arXiv: 2403.14113
- spectral subsurface scattering for material classification
- spectram-ps spectrally multiplexed photometric stereo under unknown spectral com
- spherical linear interpolation and text-anchoring for zero-shot composed image r | arXiv: 2405.00571
- spherical world-locking for audio-visual localization in egocentric videos | arXiv: 2408.05364
- spin hierarchical segmentation with subpart granularity in natural images | arXiv: 2407.09686
- splatfields neural gaussian splats for sparse 3d and 4d reconstruction | arXiv: 2409.11211
- sq-llava self-questioning for large vision-language assistant | arXiv: 2403.11299
- stable preference redefining training paradigm of human preference model for tex
- stepwise multi-grained boundary detector for point-supervised temporal action lo
- stream query denoising for vectorized hd-map construction | arXiv: 2401.09112
- stripe observation guided inference cost-free attention mechanism
- stsp spatial-temporal subspace projection for video class-incremental learning
- styletokenizer defining image style by a single instance for controlling diffusi | arXiv: 2409.02543
- supergaussian repurposing video models for 3d super resolution | arXiv: 2406.00609
- superpixel-informed implicit neural representation for multi-dimensional data | arXiv: 2411.11356
- surface reconstruction from 3d gaussian splatting via local structural hints
- sv3d novel multi-view synthesis and 3d generation from a single image using late | arXiv: 2403.12008
- sync from the sea retrieving alignable videos from large-scale datasets | arXiv: 2409.01445
- synchronous diffusion for unsupervised smooth non-rigid 3d shape matching
- synergy of sight and semantics visual intention understanding with clip
- t-mae temporal masked autoencoders for point cloud representation learning | arXiv: 2312.10217
- talkinggaussian structure-persistent 3d talking head synthesis via gaussian spla | arXiv: 2404.15264
- taming latent diffusion model for neural radiance field inpainting | arXiv: 2404.09995
- taptr tracking any point with transformers as detection | arXiv: 2403.13042
- tcc-det temporarily consistent cues for weakly-supervised 3d detection
- teaching tailored to talent adverse weather restoration via prompt pool and dept | arXiv: 2409.15739
- tela text to layer-wise 3d clothed human generation | arXiv: 2404.16748
- temporally consistent stereo matching | arXiv: 2407.11950
- tensorial template matching for fast cross-correlation with rotations and its ap | arXiv: 2408.02398
- text-guided video masked autoencoder | arXiv: 2408.00759
- text2place affordance-aware text guided human placement | arXiv: 2407.15446
- textdiffuser-2 unleashing the power of language models for text rendering | arXiv: 2311.16465
- textual-visual logic challenge understanding and reasoning in text-to-image gene
- texture-gs disentangling the geometry and texture for 3d gaussian splatting edit | arXiv: 2403.10050
- tf-fas twofold-element fine-grained semantic guidance for generalizable face ant
- the fabrication of reality and fantasy scene generation with llm-assisted prompt | arXiv: 2407.12579
- the hard positive truth about vision-language compositionality | arXiv: 2409.17958
- the nerfect match exploring nerf features for visual localization | arXiv: 2403.09577
- thermal3d-gs physics-induced 3d gaussians for thermal infrared novel-view synthe | arXiv: 2409.08042
- timecraft navigate weakly-supervised temporal grounded video question answering
- tip tabular-image pre-training for multimodal classification with incomplete dat | arXiv: 2407.07582
- tod3cap towards 3d dense captioning in outdoor scenes | arXiv: 2403.19589
- token compensator altering inference cost of vision transformer without re-tunin | arXiv: 2408.06798
- topology-preserving downsampling of binary images | arXiv: 2407.17786
- toward tiny and high-quality facial makeup with data amplify learning | arXiv: 2403.15033
- towards model-agnostic dataset condensation by heterogeneous models | arXiv: 2409.14538
- towards multi-modal transformers in federated learning | arXiv: 2404.12467
- towards natural language-guided drones geotext-1652 benchmark with spatial relat | arXiv: 2311.12751
- towards open-ended visual quality comparison | arXiv: 2402.16641
- towards open-ended visual recognition with large language models | arXiv: 2311.08400
- towards real-world adverse weather image restoration enhancing clearness and sem | arXiv: 2409.02101
- towards real-world event-guided low-light video enhancement and deblurring | arXiv: 2408.14916
- towards reliable advertising image generation using human feedback | arXiv: 2408.00418
- towards unified representation of invariant-specific features in missing modalit
- tpa3d triplane attention for fast text-to-3d generation | arXiv: 2312.02647
- track everything everywhere fast and robustly | arXiv: 2403.17931
- tracking meets lora faster training larger model stronger performance | arXiv: 2403.05231
- tracknerf bundle adjusting nerf from sparse and noisy views via feature tracks | arXiv: 2408.10739
- train till you drop towards stable and robust source-free unsupervised 3d domain | arXiv: 2409.04409
- tram global trajectory and motion of 3d humans from in-the-wild videos | arXiv: 2403.17346
- transferable 3d adversarial shape completion using diffusion models | arXiv: 2407.10077
- ttt-mim test-time training with masked image modeling for denoising distribution
- u-cope taking a further step to universal 9d category-level object pose estimati
- udifftext a unified framework for high-quality text synthesis in arbitrary image | arXiv: 2312.04884
- umbrae unified multimodal brain decoding | arXiv: 2404.07202
- un-evimo unsupervised event-based independent motion segmentation | arXiv: 2312.00114
- uncertainty-driven spectral compressive imaging with spatial-frequency transform
- understanding physical dynamics with counterfactual world modeling | arXiv: 2312.06721
- uni3dl a unified model for 3d vision-language understanding
- unic universal classification models via multi-teacher distillation | arXiv: 2408.05088
- unicode learning a unified codebook for multimodal large language models | arXiv: 2403.09072
- unidream unifying diffusion priors for relightable text-to-3d generation | arXiv: 2312.08754
- unifs universal few-shot instance perception with point representations | arXiv: 2404.19401
- uniinr event-guided unified rolling shutter correction deblurring and interpolat | arXiv: 2305.15078
- unim2ae multi-modal masked autoencoders with unified 3d representation for 3d pe
- unitraj a unified framework for scalable vehicle trajectory prediction | arXiv: 2403.15098
- unleashing the power of prompt-driven nucleus instance segmentation | arXiv: 2311.15939
- unrolled decomposed unpaired learning for controllable low-light video enhanceme | arXiv: 2408.12316
- unsupervised exposure correction | arXiv: 2507.17252
- unsupervised moving object segmentation with atmospheric turbulence
- unsupervised multi-modal medical image registration via invertible translation
- unveiling advanced frequency disentanglement paradigm for low-light image enhanc | arXiv: 2409.01641
- unveiling privacy risks in stochastic neural networks training effective image r
- upose3d uncertainty-aware 3d human pose estimation with cross-view and temporal | arXiv: 2404.14634
- upper-body hierarchical graph for skeleton based emotion recognition in assistiv
- vamos versatile action models for video understanding | arXiv: 2311.13627
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | arXiv: 2312.06109
- vcd-texture variance alignment based 3d-2d co-denoising for text-guided texturin | arXiv: 2407.04461
- versatile incremental learning towards class and domain-agnostic incremental lea | arXiv: 2409.10956
- versatilegaussian real-time neural rendering for versatile tasks using gaussian
- vfusion3d learning scalable 3d generative models from video diffusion models | arXiv: 2403.12034
- vic-mae self-supervised representation learning from images and video with contr | arXiv: 2303.12001
- videoagent a memory-augmented multimodal agent for video understanding | arXiv: 2403.11481
- videoclusternet self-supervised and adaptive face clustering for videos | arXiv: 2407.12214
- videomamba spatio-temporal selective state space model | arXiv: 2407.08476
- videomamba state space model for efficient video understanding | arXiv: 2403.06977
- videoshop localized semantic video editing with noise-extrapolated diffusion inv | arXiv: 2403.14617
- view selection for 3d captioning via diffusion ranking | arXiv: 2404.07984
- visa reasoning video object segmentation via large language models | arXiv: 2407.11325
- visage video instance segmentation with appearance-guided enhancement | arXiv: 2312.04885
- visfocus prompt-guided vision encoders for ocr-free dense document understanding | arXiv: 2407.12594
- visible and clear finding tiny objects in difference map | arXiv: 2405.11276
- visiontrap vision-augmented trajectory prediction guided by textual descriptions | arXiv: 2407.12345
- vista3d unravel the 3d darkside of a single image | arXiv: 2409.12193
- visual grounding for object-level generalization in reinforcement learning | arXiv: 2408.01942
- vp-sam taming segment anything model for video polyp segmentation via disentangl
- walker self-supervised multiple object tracking by walking on temporal appearanc | arXiv: 2409.17221
- wast-3d wasserstein-2 distance for scene-to-scene stylization on 3d gaussians | arXiv: 2409.17917
- wavelength-embedding-guided filter-array transformer for spectral demosaicing
- weak-to-strong compositional learning from generative models for language-based | arXiv: 2407.15296
- weakly supervised 3d object detection via multi-level visual guidance | arXiv: 2312.07530
- weakly-supervised camera localization by ground-to-satellite image registration | arXiv: 2409.06471
- wear-any-way manipulable virtual try-on via sparse correspondence alignment | arXiv: 2403.12965
- webrpg automatic web rendering parameters generation for visual presentation | arXiv: 2407.15502
- wecromcl weakly supervised cross-modality contrastive learning for transcription | arXiv: 2407.19507
- when do we not need larger vision models | arXiv: 2403.13043
- wildvidfit video virtual try-on in the wild via image-based controlled diffusion | arXiv: 2407.10625
- wordrobe text-guided generation of textured 3d garments | arXiv: 2403.17541
- worldpose a world cup dataset for global 3d human pose estimation | arXiv: 2501.02771
- x-former unifying contrastive and reconstruction learning for mllms | arXiv: 2407.13851
- xpsr cross-modal priors for diffusion-based image super-resolution | arXiv: 2403.05049
- yolov9 learning what you want to learn using programmable gradient information | arXiv: 2402.13616
- you only learn one query learning unified human query for single-stage multi-per | arXiv: 2312.05525
- you only need one step fast super-resolution with stable diffusion via scale dis | arXiv: 2401.17258
- zero-shot detection of ai-generated images | arXiv: 2409.15875
- zero-shot multi-object scene completion | arXiv: 2403.14628
- zero-shot object counting with good exemplars | arXiv: 2407.04948
- zest zero-shot material transfer from a single image | arXiv: 2404.06425
- zigma a dit-style zigzag mamba diffusion model | arXiv: 2403.13802
- ziplora any subject in any style by effectively merging loras | arXiv: 2311.13600
- ∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions | arXiv: 2407.14709
- 3dego 3d editing on the go | arXiv: 2407.10102
- crossscore towards multiview image evaluation and scori | arXiv: 2404.14409
- dgpic domain generalized pointincontext learning for po | arXiv: 2407.08801
- dreamdrone texttoimage diffusion models are zeroshot perpetu | arXiv: 2312.08746
- dreamview injecting viewspecific text guidance into textto3d | arXiv: 2404.06119
- falip visual prompt as foveal attention boosts clip zer | arXiv: 2407.05578
- jointdreamer ensuring geometry consistency and text congruen | arXiv: 2407.12291
- scenegraphloc crossmodal coarse visual localization on 3d sc | arXiv: 2404.00469
- sceneverse scaling 3d visionlanguage learning for grounded s | arXiv: 2401.09340
- towards multimodal transformers in federated learning | arXiv: 2404.12467
- action2sound ambientaware generation of action sounds from e | arXiv: 2406.09272
- controlllm augment language models with tools | arXiv: 2310.17796
- 4d contrastive superflows are dense 3d representation learners | arXiv: 2407.06190
- dvlo deep visuallidar odometry with localtoglobal featu | arXiv: 2403.18274
- lidarevent stereo fusion with hallucinations | arXiv: 2408.04633
- navigation instruction generation with bev | arXiv: 2407.15087
- occgen generative multimodal 3d occupancy prediction for aut | arXiv: 2404.15014
- reason2drive towards interpretable and chainbased reasoning | arXiv: 2312.03661
- safe-sim safety-critical closed-loop traffic simulation with diffusion-cont | arXiv: 2401.00391
- visiontrap visionaugmented trajectory prediction guided | arXiv: 2407.12345
- dreamstruct understanding slides and user interfaces via synthetic data generati | arXiv: 2410.00201
- bi-mdrg bridging image history in multimodal dialogue response generation | arXiv: 2408.05926
- synchronous diffusion for unsupervised smooth non-rigid 3d shape matching | arXiv: 2407.08244
- 3dgazenet generalizing 3d gaze estimation with weak-supervision from synthetic v | arXiv: 2212.02997
- large motion model for unified multimodal motion generation | arXiv: 2404.01284
- quarvla visionlanguageaction model for quadruped robots | arXiv: 2312.14457
- selfsupervised feature adaptation for 3d industrial ano | arXiv: 2401.03145
- wordrobe textguided generation of textured 3d garments | arXiv: 2403.17541
- a closer look at gan priors exploiting intermediate features | arXiv: 2407.13863
- a highquality robust diffusion framework for corrupted datas | arXiv: 2311.17101
- anycontrol create your artwork with versatile control on tex | arXiv: 2406.18958
- colorpeel color prompt learning with diffusion models v | arXiv: 2407.07197
- difftracker texttoimage diffusion models are unsupervised tr | arXiv: 2407.08394
- emdm efficient motion diffusion model for fast and high | arXiv: 2312.02256
- finematch aspectbased finegrained image and text mismat | arXiv: 2404.14715
- freediff progressive frequency truncation for image edi | arXiv: 2404.11895
- getting it right improving spatial consistency in texttoimag | arXiv: 2404.01197
- hybridbooth hybrid prompt inversion for efficient subje | arXiv: 2410.08192
- infiniteid identitypreserved personalization via idsema | arXiv: 2403.11781
- latent guard a safety framework for texttoimage generation | arXiv: 2404.08031
- lcmlookahead for encoderbased texttoimage personalization | arXiv: 2404.03620
- learning trimodal relation for audiovisual question answerin | arXiv: 2407.16171
- lego learning egocentric action frame generation via vi | arXiv: 2312.03849
- mixdq memoryefficient fewstep texttoimage diffusion models w | arXiv: 2405.17873
- motionchain conversational motion controllers via multimodal | arXiv: 2404.01700
- pixelaware stable diffusion for realistic image superre | arXiv: 2308.14469
- ponymation learning articulated 3d animal motions from | arXiv: 2312.13604
- powerful and flexible personalized texttoimage generation vi | arXiv: 2407.06642
- removing distributional discrepancies in captions improves i | arXiv: 2410.00905
- scaledreamer scalable textto3d synthesis with asynchronous s | arXiv: 2407.02040
- text2place affordanceaware text guided human placement | arXiv: 2407.15446
- textdiffuser2 unleashing the power of language models f | arXiv: 2311.16465
- towards reliable advertising image generation using human fe | arXiv: 2408.00418
- xpsr crossmodal priors for diffusionbased image superresolut | arXiv: 2403.05049
- artvlm attribute recognition through vision-based prefix language modeling | arXiv: 2408.04102
- grounding language models for visual entity recognition | arXiv: 2402.18695
- multi-label cluster discrimination for visual representation learning | arXiv: 2407.17331
- onerestore a universal restoration framework for composite degradation | arXiv: 2407.04621
- detailsemnet elevating signature verification through detail-semantic integratio | arXiv: 2511.16364
- improving intervention efficacy via concept realignment in concept bottleneck mo | arXiv: 2405.01531
- plot text-based person search with part slot attention for corresponding part di | arXiv: 2409.13475
- poa pre-training once for models of all sizes | arXiv: 2408.01031
- colormnet a memory-based deep spatial-temporal feature propagation network for v | arXiv: 2404.06251
- deep cost ray fusion for sparse depth video completion | arXiv: 2409.14935
- distribution alignment for fully test-time adaptation with dynamic online data s | arXiv: 2407.12128
- eliminating warping shakes for unsupervised online video stitching | arXiv: 2403.06378
- gradient-regularized out-of-distribution detection | arXiv: 2404.12368
- image-feature weak-to-strong consistency an enhanced paradigm for semi-supervise | arXiv: 2408.12614
- imaging interiors an implicit solution to electromagnetic inverse scattering pro | arXiv: 2407.09352
- instance-dependent noisy-label learning with graphical model based noise-rate es | arXiv: 2305.19486
- ogni-dc robust depth completion with optimization-guided neural iterations | arXiv: 2406.11711
- r2-bench benchmarking the robustness of referring perception models under pertur
- cultural value differences llms | arXiv: 2407.16891
- funqa towards surprising video comprehension | arXiv: 2306.14899
- zeroshot object counting with good exemplars | arXiv: 2407.04948
- cross-domain learning for video anomaly detection with limited supervision | arXiv: 2408.05191
- dragapart learning a part-level motion prior for articulated objects | arXiv: 2403.15382
- learning to obstruct few-shot image classification over restricted classes | arXiv: 2409.19210
- plan posture and go towards open-vocabulary text-to-motion generation | arXiv: 2312.14828
- prelar world model pre-training with learnable action representation
- prompting language-informed distribution for compositional zero-shot learning | arXiv: 2305.14428
- scaling backwards minimal synthetic pre-training | arXiv: 2408.00677
- scantalk 3d talking heads from unregistered scans | arXiv: 2403.10942
- controllable navigation instruction generation | arXiv: 2407.07433
- magr manifold-aligned graph regularization for continual action quality assessme | arXiv: 2403.04398
- gtp4o modalityprompted heterogeneous graph learning for | arXiv: 2407.05540
- improving medical multimodal contrastive learning with exper | arXiv: 2403.10153
- pathologyknowledge enhanced multiinstance prompt learni | arXiv: 2407.10814
- tip tabularimage pretraining for multimodal classification w | arXiv: 2407.07582
- genq quantization in low data regimes with generative synthetic data | arXiv: 2312.05272
- attention prompting on image for large visionlanguage models | arXiv: 2409.17143
- beaf observing beforeafter changes to evaluate hallucination | arXiv: 2407.13442
- brave broadening the visual encoding of visionlanguage model | arXiv: 2404.07204
- cat audio visual qa | arXiv: 2403.04640
- clap isolating content from style through contrastive learni | arXiv: 2311.16445
- classact active learning | arXiv: 2312.05328
- decoupling common and unique representations for multimodal | arXiv: 2309.05300
- elevating all zeroshot sketchbased image retrieval through m | arXiv: 2407.04207
- eyes closed safety on protecting multimodal llms via imageto | arXiv: 2403.09572
- flexattention for efficient highresolution visionlanguage mo | arXiv: 2407.20228
- freemotion mocapfree human motion synthesis with multimodal | arXiv: 2406.10740
- genixer empowering multimodal large language model as a powe | arXiv: 2312.06731
- groma localized visual tokenization for grounding multimodal | arXiv: 2404.13013
- marvelovd marrying object recognition and visionlanguage mod | arXiv: 2407.21465
- mathverse does your multimodal llm truly see the diagrams in | arXiv: 2403.14624
- metaprompting for automating zeroshot visual recognitio | arXiv: 2403.11755
- mmbench is your multimodal model an allaround player | arXiv: 2307.06281
- myvlm personalizing vlms for userspecific queries | arXiv: 2403.14599
- navgpt2 unleashing navigational reasoning capability | arXiv: 2407.12366
- nymeria a massive collection of multimodal egocentric daily motion in the wild | arXiv: 2406.09905
- omniviewtuning boosting viewpoint invariance of visionlangua | arXiv: 2404.12139
- quantized prompt for efficient generalization of visionlangu | arXiv: 2407.10704
- robust calibration of large visionlanguage adapters | arXiv: 2407.13588
- sharegpt4v improving large multimodal models with better cap | arXiv: 2311.12793
- sqllava selfquestioning for large visionlanguage assistant | arXiv: 2403.11299
- the hard positive truth about visionlanguage compositionalit | arXiv: 2409.17958
- towards openended visual quality comparison | arXiv: 2402.16641
- towards realworld adverse weather image restoration enhancin | arXiv: 2409.02101
- unicode learning a unified codebook for multimodal large lan | arXiv: 2403.09072
- xformer unifying contrastive and reconstruction learning for | arXiv: 2407.13851
- slimer zero shot ner | arXiv: 2407.01272
- a new dataset and framework for real-world blurred images super-resolution | arXiv: 2407.14880
- afreeca annotationfree counting for all | arXiv: 2403.04943
- be yourself bounded attention for multisubject texttoimage g | arXiv: 2403.16990
- i can't believe it's not scene flow | arXiv: 2403.04739
- layoutdetr detection transformer is a good multimodal layout | arXiv: 2212.09877
- towards natural languageguided drones geotext1652 bench | arXiv: 2311.12751
- tracking meets lora faster training larger model strong | arXiv: 2403.05231
- docling pdf document conversion | arXiv: 2408.09869
- teaching tailored to talent adverse weather restoration | arXiv: 2409.15739
- adaglimpse active visual exploration with arbitrary glimpse position and scale | arXiv: 2404.03482
- octopus embodied visionlanguage programmer from environmental feedback | arXiv: 2310.08588
- adapting fine-grained cross-view localization to areas without fine ground truth | arXiv: 2406.00474
- disco embodied navigation and interaction | arXiv: 2407.14758
- prioritized semantic learning for zeroshot instance navigation | arXiv: 2403.11650
- semgrasp semantic grasp generation via language aligned | arXiv: 2404.03590
- adalog post-training quantization for vision transformers with adaptive logarith | arXiv: 2407.12951
- controlnet++ improving conditional controls with efficien | arXiv: 2404.07987
- densenets reloaded paradigm shift beyond resnets and vits | arXiv: 2403.19588
- openpsg openset panoptic scene graph generation via large mu | arXiv: 2407.11213
- sclip rethinking selfattention for dense visionlanguage infe | arXiv: 2312.01597
- distribution-aware robust learning from long-tailed data with noisy labels | arXiv: 2407.16802
- grace graph-based contextual debiasing for fair visual question answering
- blazebvd make scale-time equalization great again for blind video deflickering | arXiv: 2403.06243
- draganything motion control for anything using entity representation | arXiv: 2403.07420
- dreammotion space-time self-similar score distillation for zero-shot video editi | arXiv: 2403.12002
- evaluating text-to-visual generation with image-to-text generation | arXiv: 2404.01291
- exploring pre-trained text-to-video diffusion models for referring video object | arXiv: 2403.12042
- freeinit bridging initialization gap in video diffusion models | arXiv: 2312.07537
- kalman-inspired feature propagation for video face super-resolution | arXiv: 2408.05205
- magdiff multi-alignment diffusion for high-fidelity video generation and editing | arXiv: 2311.17338
- mofa-video controllable image animation via generative motion field adaptions in | arXiv: 2405.20222
- physdreamer physics-based interaction with 3d objects via video generation | arXiv: 2404.13026
- realviformer investigating attention for real-world video super-resolution | arXiv: 2407.13987
- actionswitch class-agnostic detection of simultaneous actions in streaming video | arXiv: 2407.12987
- elysium exploring objectlevel perception in videos via mllm | arXiv: 2403.16558
- finepseudo improving pseudo-labelling through temporal-alignablity for semi-supe | arXiv: 2409.01448
- nymeria a massive collection of multimodal egocentric daily | arXiv: 2406.09905
- pite pixeltemporal alignment for large videolanguage mo | arXiv: 2409.07239