Language-Image Models with 3D Understanding
Author(s): Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone This paper presents Cube-LLM, a novel Multi-modal Large Language Model (MLLM) that extends the perceptual capabilities of MLLMs to understand and reason about images in three-dimensional space. Unlike traditional models that primarily focus on 2D vision and language tasks, Cube-LLM leverages a large-scale pre-training dataset [...]
An Empty Room is All We Want: Automatic Defurnishing of Indoor Panoramas
Author(s): Mira Slavcheva, Dave Gausebeck, Kevin Chen, David Buchhofer, Azwad Sabik, Chen Ma, Sachal Dhillon, Olaf Brandt, Alan Dolhasz This paper introduces a novel pipeline designed to enhance inpainting outcomes in the specific task of defurnishing, which involves the removal of furniture items from indoor panorama images. The proposed method capitalizes on Stable Diffusion, a technique that significantly improves the quality of inpainting by incorporating increased context, domain[...]
Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
Author(s): Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan The paper introduces the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a new benchmark designed to rigorously evaluate the performance of Video Large Multi-modal Models (Video-LMMs) across various real-world video contexts. Recent advancements have enabled these models to support diverse applications, including robotics, AI as[...]
What matters when building vision-language models?
Author(s) : Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh The paper, which introduces Idefics2, an efficient foundational vision-language model, discusses the burgeoning interest in vision-language models (VLMs), propelled by advancements in large language models and vision transformers. Despite the wealth of research in this area, the paper notes that crucial decisions in VLM design often lack justification. This lack of substantiation hinders progress in the field by ob[...]
On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?
Author(s) : Maxime Zanella, Ismail Ben Ayed The paper introduces MeanShift for Test-time Augmentation (MTA), a robust method that outperforms prompt-based techniques without the need for intensive training. This method is ideal for both standalone and API-based applications. Unlike previous test-time augmentation techniques that rely on ad hoc rules, such as a confidence threshold, to filter the augmented views, MTA incorporates a quality assessment variable for each vie[...]
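The mode-seeking idea behind MTA can be sketched as a weighted MeanShift over the embeddings of augmented views. This is a simplified stand-in: the quality weights here are fixed inputs, whereas the paper optimizes an inlierness variable per view jointly with the mode; all names and parameter values are illustrative.

```python
import numpy as np

def mean_shift_mode(embeddings, weights, bandwidth=0.5, iters=20):
    """Seek the mode of a set of augmented-view embeddings with a Gaussian
    kernel, so outlying (low-quality) views are progressively down-weighted."""
    mode = np.average(embeddings, axis=0, weights=weights)
    for _ in range(iters):
        d2 = np.sum((embeddings - mode) ** 2, axis=1)
        k = weights * np.exp(-d2 / (2 * bandwidth ** 2))
        mode = np.average(embeddings, axis=0, weights=k)
    return mode

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 0.1, size=(60, 8))   # consistent augmented views
outliers = rng.normal(3.0, 0.1, size=(4, 8))   # degenerate crops far away
views = np.vstack([inliers, outliers])
w = np.ones(len(views))                        # uniform quality scores
mode = mean_shift_mode(views, w)
print(np.linalg.norm(mode) < 1.0)              # mode locks onto the inlier cluster
```

Unlike a plain average, which the outliers would drag away, the kernel reweighting converges to the dense inlier cluster.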
DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos
Author(s) : Wen-Hsuan Chu, Lei Ke, Katerina Fragkiadaki The paper titled “DreamScene4D” introduces a novel approach to generate three-dimensional dynamic scenes of multiple objects from monocular in-the-wild videos. This is achieved by leveraging existing Video Language Models (VLMs) that can track 2D video objects and current generative models that provide powerful visual priors for synthesizing novel views for the highly under-constrained 2D-to-3D object lifting. The key[...]
Training-Free Consistent Text-to-Image Generation
Author(s) : Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon ConsiStory is a groundbreaking training-free approach that addresses the challenge of consistently portraying the same subject across diverse prompts in text-to-image models. While these models offer unprecedented creative flexibility by allowing users to guide the image generation process through natural language, maintaining subject consistency has been a significant hurdle. [...]
No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
Author(s) : Skander Moalla, Andrea Miele, Razvan Pascanu, Caglar Gulcehre Proximal Policy Optimization (PPO), a popular on-policy reinforcement learning (RL) method, is not immune to the challenges posed by non-stationarity in RL environments. Despite the common belief that on-policy methods can train indefinitely, this study reveals that PPO agents are also susceptible to feature rank deterioration and loss of plasticity, which can lead to a collapse in performance. The [...]
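Feature-rank deterioration of the kind this study tracks can be monitored with a simple effective-rank estimate over a batch of features. The sketch below uses the generic threshold form of effective rank (smallest k capturing most of the singular-value mass), not necessarily the paper's exact estimator.

```python
import numpy as np

def effective_rank(features, delta=0.01):
    """Smallest k whose top-k singular values capture (1 - delta) of the
    spectrum mass. A steady drop over training signals representation collapse."""
    s = np.linalg.svd(features, compute_uv=False)
    ratios = np.cumsum(s) / s.sum()
    return int(np.searchsorted(ratios, 1.0 - delta) + 1)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 64))                               # full-rank batch
collapsed = rng.normal(size=(512, 2)) @ rng.normal(size=(2, 64))   # rank-2 batch
print(effective_rank(healthy), effective_rank(collapsed))
```

Logging this quantity on the policy network's penultimate-layer activations during PPO training is one way to catch collapse before the return curve does.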
Spectrally Pruned Gaussian Fields with Neural Compensation
Author(s) : Runyi Yang, Zhenxin Zhu, Zhou Jiang, Baijun Ye, Xiaoxue Chen, Yifei Zhang, Yuantao Chen, Jian Zhao, Hao Zhao SUNDAE, a memory-efficient Gaussian field, addresses the high memory consumption issue associated with 3D Gaussian Splatting, a novel 3D representation known for its fast rendering speed and high rendering quality. The high memory footprint of well-trained Gaussian fields, which can utilize millions of Gaussian primitives and hundreds of megabytes of memo[...]
CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models
Author(s) : Qinghe Wang, Baolu Li, Xiaomin Li, Bing Cao, Liqian Ma, Huchuan Lu, Xu Jia CharacterFactory is a groundbreaking framework that enables the sampling of new characters with consistent identities in the latent space of Generative Adversarial Networks (GANs) for diffusion models. This innovative approach addresses the limitations of current text-to-image models, which cannot directly generate images with consistent, newly coined identities. The framework considers[...]
ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving
Author(s) : Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang ConsistentID is a groundbreaking method designed for diverse identity-preserving portrait generation using fine-grained multimodal facial prompts and a single reference image. This innovative approach addresses the limitations of existing diffusion-based technologies, which struggle to achieve high-fidelity and detailed id[...]
PuLID: Pure and Lightning ID Customization via Contrastive Alignment
Author(s) : Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Qian He The paper introduces Pure and Lightning ID customization (PuLID), an innovative tuning-free method for customizing identities in text-to-image generation models. PuLID combines a Lightning T2I branch with a standard diffusion branch, enabling the incorporation of both contrastive alignment loss and accurate ID loss. This approach minimizes disruption to the original model while ensuring high fidelity in the gene[...]
Hallucination of Multimodal Large Language Models: A Survey
Author(s) : Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, Mike Zheng Shou Multimodal Large Language Models (MLLMs), also known as Large Vision-Language Models (LVLMs), have shown significant advancements and remarkable capabilities in multimodal tasks. Despite these promising developments, MLLMs often produce outputs that are inconsistent with the visual content. This inconsistency, known as hallucination, poses considerable challenges to their prac[...]
Stylus: Automatic Adapter Selection for Diffusion Models
Author(s) : Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica When it comes to making high-resolution, customized images, fine-tuned adapters have become a cheaper alternative to scaling the base models with more data or parameters. The open-source community's embrace of adapters has led to the creation of a large database with over 100,000 adapters, many of which are highly customize[...]
DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing
Author(s) : Minghao Chen, Iro Laina, Andrea Vedaldi The task of editing 3D objects and scenes based on open-ended language instructions presents a unique set of challenges. The conventional approach to address this problem involves using a 2D image generator or editor to guide the 3D editing process. However, this method often proves to be time-consuming due to the need to update computationally intensive 3D representations such as a neural radiance field. Moreover, it relies on [...]
Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos
Author(s) : Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, Changxin Gao This paper tackles the challenge of video try-on, an area where previous research has yielded limited success. The core difficulty lies in simultaneously preserving intricate clothing details and generating realistic, coherent motions throughout the video. To address these challenges, the authors propose "Tunnel Try-on," a novel diffusion-based f[...]
MaPa: Text-driven Photorealistic Material Painting for 3D Shapes
Author(s) : Shangzhan Zhang, Sida Peng, Tao Xu, Yuanbo Yang, Tianrun Chen, Nan Xue, Yujun Shen, Hujun Bao, Ruizhen Hu, Xiaowei Zhou The generation of materials for 3D meshes from text descriptions is an innovative approach presented in this research paper. Unlike traditional methods that focus on texture map synthesis, the proposed method introduces the generation of segment-wise procedural material graphs, offering high-quality rendering and substantial flexibility in edi[...]
Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
Author(s) : Yuhang Huang, Zihan Wu, Chongyang Gao, Jiawei Peng, Xu Yang This paper investigates the ability of Large Vision-Language Models (LVLMs) to generate detailed and accurate descriptions of visual content. While LVLMs have become increasingly sophisticated in their ability to process and integrate visual and textual data, a less explored area is their potential to create fine-grained descriptions. This research addresses this gap in knowledge by examining how effectiv[...]
TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
Author(s) : Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu Radiance fields have demonstrated impressive capabilities in synthesizing lifelike 3D talking heads. However, the prevailing paradigm, which presents facial motions by directly modifying point appearance, may lead to distortions in dynamic regions due to the difficulty in fitting steep appearance changes. To address this challenge, the researchers introduce TalkingGaussian, a deformation-bas[...]
From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation
Author(s) : Zehuan Huang, Hongxing Fan, Lipeng Wang, Lu Sheng Recent advancements in controllable human image generation have enabled zero-shot generation using structural signals, such as pose or depth information, or facial appearance. However, generating human images conditioned on multiple parts of human appearance remains a significant challenge in the field. To address this challenge, the researchers introduce Parts to Whole, a novel framework designed for generating cu[...]
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
Author(s) : Bin Wang, Zhuangcheng Gu, Chao Xu, Bo Zhang, Botian Shi, Conghui He This paper introduces UniMER, a groundbreaking dataset that provides the first comprehensive study on Mathematical Expression Recognition (MER) in complex real-world scenarios. The UniMER dataset consists of two distinct components: a large-scale training set, UniMER-1M, and a meticulously designed test set, UniMER-Test. UniMER-1M offers an unprecedented scale and diversity, comprising one million[...]
SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation
Author(s) : Xiangyu Xu, Lijuan Liu, Shuicheng Yan Existing Transformer models for monocular 3D human shape and pose estimation often face computational and memory limitations due to their quadratic complexity with respect to feature length. This constraint hinders the effective utilization of fine-grained information present in high-resolution features, which is crucial for accurate 3D reconstruction. To address this challenge, the researchers propose SMPLer, an innovative SMP[...]
ID-Animator: Zero-Shot Identity-Preserving Human Video Generation
Author(s) : Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Man Zhou, Jie Zhang Generating high-fidelity human videos with specified identities has been a significant challenge in the content generation community. Existing techniques often struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case fine-tuning or failing to accurately capture the identity details in the video generation[...]
CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios
Author(s) : Jingyang Lin, Yingda Xia, Jianpeng Zhang, Ke Yan, Le Lu, Jiebo Luo, Ling Zhang Medical Vision-Language Pretraining (Med-VLP) aims to bridge the gap between visual content from medical images and their corresponding textual descriptions. While existing Med-VLP methods have primarily focused on 2D images depicting single body parts, such as chest X-rays, this paper extends the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios by uti[...]
Hyp-OC: Hyperbolic One Class Classification for Face Anti-Spoofing
Author(s) : Kartik Narayan, Vishal M. Patel Face recognition technology has become an integral part of modern security systems and user authentication processes. However, these systems are vulnerable to spoofing attacks, where malicious actors attempt to circumvent the security measures by presenting fake or manipulated facial data. Most prior research in face anti-spoofing (FAS) approaches this challenge as a two-class classification task, where models are trained on real samples[...]
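Hyperbolic approaches like Hyp-OC embed features in the Poincaré ball, where geodesic distances grow rapidly near the boundary, giving compact "real" clusters lots of room to push spoofs outward. A minimal sketch of that distance (the standard Poincaré-ball formula, not the paper's full loss):

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between two points in the Poincare ball (norms < 1)."""
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return float(np.arccosh(1 + 2 * sq / (denom + eps)))

origin = np.zeros(2)
near_center = np.array([0.1, 0.0])
near_boundary = np.array([0.95, 0.0])
print(poincare_distance(origin, near_center) <
      poincare_distance(origin, near_boundary))
```

The same Euclidean step taken near the boundary costs far more hyperbolic distance than one near the center, which is what makes the geometry attractive for one-class separation.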
Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses
Author(s) : Inhee Lee, Byungjun Kim, Hanbyul Joo This paper introduces an innovative approach to reconstruct the 3D world and multiple dynamic humans from a single monocular video input. The authors leverage the recently developed 3D Gaussian Splatting (3D-GS) representation, which enables efficient composition and rendering of both the environment and human subjects. One of the key challenges addressed in this work is the scenario of limited and sparse 3D observations, a common[...]
AutoAD III: The Prequel — Back to the Pixels
Author(s) : Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman Generating high-quality audio descriptions (AD) for movies is a complex task that demands intricate visual comprehension and an awareness of characters and their identities. Current visual language models designed for AD generation face limitations due to a scarcity of suitable training data and a lack of specialized evaluation measures tailored to the AD domain. This paper presents a [...]
Data Alignment for Zero-Shot Concept Generation in Dermatology AI
Author(s) : Soham Gadgil, Mahtab Bigverdi The field of dermatology AI is rapidly advancing, but the scarcity of data with ground-truth concept-level labels, which are semantically meaningful meta-labels for humans, remains a significant limitation in training trustworthy classifiers. Foundation models like CLIP (Contrastive Language-Image Pre-training) offer a potential solution by leveraging their zero-shot capabilities and vast amounts of image-caption pairs available on the int[...]
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Author(s) : Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu The visual encoder plays a crucial role in determining the performance of multimodal large language models (MLLMs) in understanding diverse image content. While large-scale pretrained vision encoders, such as those in CLIP and DINOv2, have shown promising results, no single vision encoder consistently excels across various image content types. For example, the CLIP [...]
Unified Scene Representation and Reconstruction for 3D Large Language Models
Author(s) : Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qing Yu, Jiaqi Wang Integrating Large Language Models (LLMs) with three-dimensional environments presents significant challenges. Traditional methods rely on extracting point clouds from either accurate ground truth geometry or reconstructed 3D scenes using auxiliary models. These methods then elevate text-image aligned 2D features from models like CLIP to these point clouds, which are used as inputs for LLMs. However,[...]
Moving Object Segmentation: All You Need Is SAM (and Flow)
Author(s) : Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman Motion segmentation, the task of discovering and segmenting moving objects in a video, has been a widely studied area with various approaches and training schemes. This paper investigates the potential of the Segment Anything Model (SAM) in contributing to this task. The authors propose two models that combine SAM with optical flow to leverage SAM's segmentation capabilities and flow's ability to identify and gro[...]
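Feeding optical flow to an image-trained SAM requires packing the two-channel flow field into a three-channel image first. Below is a simplified stand-in encoding (the channel mapping is illustrative; flow is typically rendered with the standard optical-flow color wheel rather than raw channel normalization):

```python
import numpy as np

def flow_to_rgb(flow):
    """Map a (H, W, 2) flow field to a uint8 RGB image that an image
    segmenter could ingest: x-displacement, y-displacement, magnitude -> R, G, B."""
    u, v = flow[..., 0], flow[..., 1]
    mag = np.sqrt(u ** 2 + v ** 2)
    def norm(c):  # rescale each channel to [0, 255]
        span = c.max() - c.min()
        return ((c - c.min()) / (span + 1e-9) * 255).astype(np.uint8)
    return np.stack([norm(u), norm(v), norm(mag)], axis=-1)

# toy field: static background, one block moving right
flow = np.zeros((32, 32, 2), dtype=np.float32)
flow[8:16, 8:16, 0] = 5.0
rgb = flow_to_rgb(flow)
print(rgb.shape, rgb.dtype)  # (32, 32, 3) uint8
```

In such an encoding, a moving object becomes a bright, sharply bounded region against a flat background, which is exactly the kind of input a promptable segmenter handles well.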
On the Content Bias in Fréchet Video Distance
Author(s) : Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang The research paper "On the Content Bias in Fréchet Video Distance" delves into the intricacies of the Fréchet Video Distance (FVD), a widely used metric for evaluating video generation models. While FVD has gained prominence in the field, it has been observed to occasionally conflict with human perception. This paper aims to investigate the extent of FVD's bias toward per-frame quality ove[...]
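FVD is the Fréchet distance between Gaussian fits of real and generated video features (classically taken from an I3D network, which is one source of the content bias the paper studies). A minimal sketch of the distance itself, with random vectors standing in for network features:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):          # drop tiny imaginary round-off
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2 * np.trace(covmean))

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 8))         # stand-in for per-video features
mu, sigma = feats.mean(0), np.cov(feats, rowvar=False)
print(frechet_distance(mu, sigma, mu, sigma) < 1e-6)  # identical stats -> ~0
```

Because the score depends entirely on what the feature extractor encodes, a per-frame-appearance-heavy extractor yields the temporal blind spots the paper documents.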
G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis
Author(s) : Yufei Ye, Abhinav Gupta, Kris Kitani, Shubham Tulsiani The research paper "G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis" introduces a groundbreaking approach to modeling hand-object interactions using a denoising diffusion-based generative prior. This innovative model, called G-HOP, enables the joint modeling of both the 3D object and a human hand, conditioned on the object category. To capture the joint distribution of [...]
VG4D: Vision-Language Model Goes 4D Video Recognition
Author(s) : Zhichao Deng, Xiangtai Li, Xia Li, Yunhai Tong, Shen Zhao, Mengyuan Liu The research paper "VG4D: Vision-Language Model Goes 4D Video Recognition" introduces a groundbreaking framework that addresses the limitations of current methods for 4D point cloud recognition. Understanding the real world through point cloud video is essential for robotics and autonomous driving systems, but prevailing methods often struggle with a lack of detailed information due to sens[...]
Dynamic Typography: Bringing Words to Life
Author(s) : Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, Huamin Qu The research paper "Dynamic Typography: Bringing Words to Life" introduces a revolutionary automated text animation scheme that combines the challenging tasks of deforming letters to convey semantic meaning and infusing them with vibrant movements based on user prompts. This innovative approach, termed "Dynamic Typography," aims to transform static communication into dynamic experie[...]
Factorized Diffusion: Perceptual Illusions by Noise Decomposition
Author(s) : Daniel Geng, Inbum Park, Andrew Owens The research paper "Factorized Diffusion: Perceptual Illusions by Noise Decomposition" introduces a revolutionary zero-shot method for controlling individual components of an image during the diffusion model sampling process. This innovative approach allows for the creation of hybrid images that change appearance based on various factors such as viewing distance, lighting conditions, or motion blurring. The method works by deco[...]
in2IN: Leveraging individual Information to Generate Human Interactions
Author(s) : Pablo Ruiz Ponce, German Barquero, Cristina Palmero, Sergio Escalera, Jose Garcia-Rodriguez The generation of human-human motion interactions conditioned on textual descriptions is a highly useful application in various fields, including robotics, gaming, animation, and the metaverse. However, modeling the highly dimensional interpersonal dynamics and capturing the intra-personal diversity of interactions pose significant challenges. Current methods for generating [...]
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Author(s) : Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, Cihang Xie This paper introduces HQ-Edit, a new high-quality dataset for instruction-based image editing containing approximately 200,000 edits. Unlike previous approaches that relied on attribute gu[...]
EgoPet: Egomotion and Interaction Data from an Animal’s Perspective
Author(s) : Amir Bar, Arya Bakhtiar, Danny Tran, Yifei Ming, Antonio Loquercio, Jathushan Rajasegaran, Yann LeCun, Amir Globerson, Trevor Darrell The remarkable capabilities of animals in perceiving and interacting with their surroundings remain unmatched by even the most advanced AI systems. To bridge this gap and enhance our understanding of AI, a unique dataset called "EgoPet" has been introduced. EgoPet provides a window into the world of animal movement and multi-agent in[...]
MMInA: Benchmarking Multihop Multimodal Internet Agents
Author(s) : Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu Autonomous embodied agents exist in a world of multimedia websites. The question arises - can they navigate through multimodal websites to complete complex user tasks? Current benchmarks fall short in assessing them in a realistic, evolving environment for their embodiment across websites. To address this, MMInA, a multihop and multimodal benchmark, has been introduced to evaluate the embodied agents for compositi[...]
No More Ambiguity in 360° Room Layout via Bi-Layout Estimation
Author(s) : Yu-Ju Tsai, Jin-Cheng Jhang, Wei Wang, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo, Ming-Hsuan Yang The task of 360° room layout estimation presents a unique challenge due to the inherent ambiguity in layout annotations. To address this issue, researchers have proposed an innovative model named Bi-Layout. This model takes a unique approach by predicting two distinct layout types, each serving a specific purpose. The first layout type stops at ambiguous regio[...]
Taming Latent Diffusion Model for Neural Radiance Field Inpainting
Author(s) : Chieh Hubert Lin, Changil Kim, Jia-Bin Huang, Qinbo Li, Chih-Yao Ma, Johannes Kopf, Ming-Hsuan Yang, Hung-Yu Tseng The Neural Radiance Field, or NeRF, has emerged as a powerful tool for 3D reconstruction from multiple images. While recent advancements have shown promising results in editing reconstructed NeRFs using diffusion priors, there are still challenges to overcome, especially in synthesizing coherent geometry in uncovered areas. One significant ch[...]
EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams
Author(s) : Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Christian Theobalt, Vladislav Golyanik In a remarkable study, a team of researchers has taken on the complex problem of capturing 3D human motion from a single, egocentric viewpoint. The paper, titled “EventEgo3D: Egocentric 3D Human Motion Capture with an Event Camera,” presents a new approach that uses the special features of event cameras to overcome the limitations of current methods. Tra[...]
COCONut: Modernizing COCO Segmentation
Author(s) : Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, Liang-Chieh Chen In the rapidly evolving field of computer vision, the research community has witnessed remarkable progress in visual recognition tasks, largely driven by advancements in dataset benchmarks like COCO. However, despite its significant contributions, the COCO segmentation benchmark has experienced relatively slow improvement over the past decade. Originally, the COCO dataset was equipped with co[...]
Connecting NeRFs, Images, and Text
Author(s) : Francesco Ballerini, Pierluigi Zama Ramirez, Roberto Mirabella, Samuele Salti, Luigi Di Stefano Neural Radiance Fields, or NeRFs, have revolutionized the way we represent 3D scenes and objects, introducing a unique data type for information exchange and storage. In parallel, significant advancements have been made in multimodal representation learning, particularly for text and image data. This paper explores an exciting new research direction that aims to bridge t[...]
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
Author(s) : Moreno D'Incà, Elia Peruzzo, Massimiliano Mancini, Dejia Xu, Vidit Goel, Xingqian Xu, Zhangyang Wang, Humphrey Shi, Nicu Sebe As text-to-image generative models gain popularity and widespread accessibility, it is crucial to thoroughly examine their safety and fairness to prevent the dissemination and perpetuation of biases. While existing research focuses on detecting predefined sets of biases, limiting studies to well-known concepts, a new approach called Op[...]
GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh
Author(s) : Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G. Schwing, Shenlong Wang GoMAvatar, a groundbreaking approach to animatable human modeling, has been introduced, offering real-time performance, memory efficiency, and high-quality results. This innovative method requires only a single monocular video to generate a digital avatar that can be re-articulated in new poses and rendered from novel viewpoints in real-time, seamlessly integrating with rasterization-based [...]
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Author(s) : Yiwen Tang, Jiaming Liu, Dong Wang, Zhigang Wang, Shanghang Zhang, Bin Zhao, Xuelong Li Large foundation models have recently gained significant attention due to their superior performance across a wide range of scenarios. However, the scarcity of 3D data has led researchers to adapt pre-trained transformers from vision to 3D domains. While these 2D-to-3D approaches have shown promise, they are limited by the potential loss of spatial geometries and high computat[...]
BRAVE: Broadening the visual encoding of vision-language models
Author(s) : Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari Vision-language models (VLMs) have made significant strides in recent years, combining vision encoders like CLIP with language models to tackle various downstream tasks. However, these models still face challenges due to the limitations of their vision encoders, such as inability to detect certain image features and tendency to hallucinate visual elements. To o[...]
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models
Author(s) : Zewei Zhang, Huan Liu, Jun Chen, Xiangyu Xu This paper studies good practices for drag editing with diffusion models. The authors introduce GoodDrag, which features an alternating drag-and-denoising (AlDD) framework that interleaves drag operations with denoising steps for more stable, higher-fidelity edits, together with an information-preserving motion supervision design that reduces distortion around the dragged points. The work also contributes Drag100, a new benchmark dataset, along with dedicated quality-assessment metrics for drag-based editing.
UMBRAE: Unified Multimodal Decoding of Brain Signals
Author(s) : Weihao Xia, Raoul de Charette, Cengiz Öztireli, Jing-Hao Xue Research on decoding brain signals has faced significant challenges, including difficulty in accurately recovering spatial information and the need for subject-specific models. To tackle these issues, a team of researchers has proposed UMBRAE, a unified multimodal decoding approach for brain signals. UMBRAE introduces an efficient universal brain encoder that aligns multimodal brain data, enabling the extraction of instance-level conceptu[...]
Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?
Author(s) : Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna The paper explores a novel approach to enhance the semantic grounding abilities of Vision-Language Models (VLMs) without relying on domain-specific training data, fine-tuning, or modifications to the network architectures. The authors propose a feedback mechanism composed of a binary signal, which, when prompted appropriately, allows VLMs to utilize feedback both in a single step and iteratively. This approach [...]
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Author(s) : Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid MoReVQA is a groundbreaking framework for video question answering (videoQA) that enhances interpretability and performance. Unlike traditional single-stage planning methods, MoReVQA employs a multi-stage, modular reasoning approach. It consists of three key stages: an event parser, a grounding stage, and a final reasoning stage, all integrated with an external memory. What sets MoReVQA apart is [...]
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Author(s) : Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim MA-LMM is a Memory-Augmented Large Multimodal Model designed to revolutionize long-term video understanding. Unlike existing LLM-based multimodal models that are limited to processing only a small number of frames from short videos, MA-LMM tackles the challenge of understanding extended video content. It achieves this by processing videos in an on[...]
Finding Visual Task Vectors
Author(s) : Alberto Hojel, Yutong Bai, Amir Globerson, Amir Bar Visual Prompting, a technique that enables models to learn and perform visual tasks through in-context examples without requiring additional training, has gained attention recently. In this paper, the authors build upon this concept and make a significant leap forward. By analyzing the activations of MAE-VQGAN, a state-of-the-art Visual Prompting model, they uncover task vectors: unique activations that encode task-specific [...]
A Large-Scale Exploration of μ-Transfer
Author(s) : Lucas Lingle Large neural network models have revolutionized natural language processing and computer vision, but the process of setting their initialization and learning rates often relies on heuristic methods, leading to inconsistencies across different models and research papers. The μ-Parameterization (μP) approach offers a promising solution to these challenges, providing scaling rules for model initialization and learning rates. It also enables zero-shot hyperparam[...]
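The core of μP can be summarized as a tiny scaling table applied relative to a tuned base model. The sketch below uses the simplified rules for Adam (hidden-layer learning rate shrinks like 1/width-multiplier, output logits are scaled by 1/width-multiplier, embedding learning rate stays fixed); `base_width` and `base_lr` are illustrative placeholders, and the full parameterization has more entries than shown here.

```python
def mup_hyperparams(width, base_width=128, base_lr=3e-4):
    """Simplified muP scaling for Adam, relative to a model of base_width whose
    hyperparameters were tuned directly. Returns per-group settings for `width`."""
    m = width / base_width        # width multiplier
    return {
        "embedding_lr": base_lr,  # input/embedding LR is width-independent
        "hidden_lr": base_lr / m, # hidden-layer LR shrinks with width
        "output_mult": 1.0 / m,   # output logits are down-scaled
    }

print(mup_hyperparams(1024))      # 8x wider than base: hidden LR divided by 8
```

The zero-shot transfer claim is exactly this: tune `base_lr` once at `base_width`, then derive the wide model's settings instead of re-sweeping.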
Watermark-based Detection and Attribution of AI-Generated Content
Author(s) : Zhengyuan Jiang, Moyang Guo, Yuepeng Hu, Neil Zhenqiang Gong With the increasing sophistication of AI-generated content, the need for effective detection and attribution methods has become crucial. Many prominent companies, such as Google, Microsoft, and OpenAI, have recognized this and implemented watermarking techniques as a proactive measure to identify synthetic content. However, the current focus of most research in this field primarily centers on general de[...]
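Detection-then-attribution over bitstring watermarks can be sketched as nearest-watermark matching under a bitwise-accuracy threshold. The threshold value and all names below are illustrative; the paper derives its thresholds to bound false-positive rates rather than picking them by hand.

```python
import numpy as np

def attribute(decoded, user_watermarks, threshold=0.9):
    """Attribute a decoded bitstring to the registered user whose watermark
    matches it best, if the bitwise accuracy clears `threshold`; else None
    (i.e., the content is not attributed to anyone)."""
    accs = [(w == decoded).mean() for w in user_watermarks]
    best = int(np.argmax(accs))
    return best if accs[best] >= threshold else None

rng = np.random.default_rng(0)
users = rng.integers(0, 2, size=(5, 64))   # five registered 64-bit watermarks
noisy = users[2].copy()
noisy[:3] ^= 1                             # decoder flips 3 of 64 bits
print(attribute(noisy, users))             # attributed to user 2
```

With 64-bit watermarks, an unrelated decoded string matches any given user on about half the bits, so a 0.9 threshold cleanly separates true owners from chance matches.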
Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models
Author(s) : Sangwon Jang, Jaehyeong Jo, Kimin Lee, Sung Ju Hwang Text-to-image diffusion models have shown impressive results in generating personalized images of a single subject using just a few reference images. However, these models often struggle when trying to generate images with multiple subjects, leading to mixed identities and combined attributes from different individuals. To address this issue, the authors introduce MuDI, a new framework that enables the personalization[...]
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
Author(s) : Michael Saxon, Fatima Jahara, Mahsa Khoshnoodi, Yujie Lu, Aditya Sharma, William Yang Wang As text-to-image (T2I) models have improved, there is a growing interest in evaluating their prompt faithfulness, which refers to the semantic coherence between the generated images and the prompts they were based on. While various T2I faithfulness metrics have been proposed using cross-modal embeddings and vision-language models (VLMs), these metrics have not been thoroughly[...]
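One ordering check in the TS2 spirit can be sketched with rank correlation: along a chain of images that share a prompt but accumulate known errors, a faithful metric's scores should decrease monotonically. The numbers below are made up for illustration, and Spearman's ρ is just one of several ways to score such an ordering.

```python
import numpy as np
from scipy import stats

# Hypothetical data: five images of one prompt, a ground-truth count of
# errors in each, and the score some T2I faithfulness metric assigned.
error_counts = np.array([0, 1, 2, 3, 4])
metric_scores = np.array([0.95, 0.90, 0.70, 0.40, 0.42])

# A good metric decreases as errors accumulate, so its Spearman correlation
# with the error count should approach -1.
rho, _ = stats.spearmanr(error_counts, metric_scores)
print(round(rho, 2))  # -0.9: one inversion (0.40 vs 0.42) costs correlation
```

Scoring metrics this way makes the meta-evaluation objective: it needs only error annotations, not human preference judgments about the metric itself.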
Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation
Author(s) : Zifu Wan, Yuhao Wang, Silong Yong, Pingping Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie "Sigma: Siamese Mamba Network for Multi-modal Semantic Segmentation" presents a novel approach for multi-modal semantic segmentation using a Siamese Mamba network. The authors propose the use of additional modalities, such as thermal and depth (X-modality), alongside traditional RGB to enhance AI agents' perception and scene understanding, particularly in challenging cond[...]