AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking
Author(s): Wenxuan Li, Chongyu Qu, Xiaoxi Chen, Pedro R. A. S. Bassi, Yijia Shi, Yuxiang Lai, Qian Yu, Huimin Xue, Yixiong Chen, Xiaorui Lin, Yutong Tang, Yining Cao, Haoqi Han, Zheyuan Zhang, Jiawei Liu, Tiezheng Zhang, Yujiu Ma, Jincheng Wang, Guang Zhang, Alan Yuille, Zongwei Zhou "AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking" introduces AbdomenAtlas, a comprehensive dataset designed to adv[...]
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Author(s): Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, Xiang Bai "PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects" introduces PartGLEE, a comprehensive framework designed to enhance object recognition and parsing across various contexts and categories. This research addresses the limitations of existing models, which often struggle with recognizing diverse and complex objects in varied environments. PartGLEE is constructed as a foundation model aimed at improvi[...]
Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions
Author(s): Fabio Tosi, Pierluigi Zama Ramirez, Matteo Poggi The paper titled "Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions" introduces an innovative approach to estimating depth from single images using diffusion models. This research addresses the significant challenges associated with monocular depth estimation, particularly in scenarios where traditional methods often fail, such as images with low texture, occlusions, or varying lighting conditio[...]
WayEx: Waypoint Exploration using a Single Demonstration
Author(s): Mara Levy, Nirat Saini, Abhinav Shrivastava The paper titled "WayEx: Waypoint Exploration using a Single Demonstration" introduces an innovative approach to robotic exploration that allows robots to learn navigation tasks from a single human demonstration. This research addresses the challenge of training robots to explore and understand environments efficiently, leveraging minimal input while maximizing learning outcomes. WayEx's core innovation lies in its ability to gen[...]
BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes
Author(s): Chih-Hai Su, Chih-Yao Hu, Shr-Ruei Tsai, Jie-Ying Lee, Chin-Yang Lin, Yu-Lun Liu "BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes" introduces BoostMVSNeRFs, an advanced framework designed to enhance the performance of Multi-View Stereo (MVS) based Neural Radiance Fields (NeRFs) for view synthesis tasks in expansive environments. Traditional NeRFs often require a dense set of input views to produce high-quality renderings, which ca[...]
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Author(s): Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman The paper titled "AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description" introduces AutoAD-Zero, an innovative approach designed to generate audio descriptions from visual content without requiring extensive training. This research addresses the critical need for accessibility solutions that provide automated audio narration for images and videos, particularly benefiting[...]
ViLLa: Video Reasoning Segmentation with Large Language Model
Author(s): Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao The paper titled "ViLLa: Video Reasoning Segmentation with Large Language Model" introduces ViLLa, a novel framework that enhances video perception models by integrating reasoning capabilities through large language models (LLMs). This research addresses the challenge of enabling models to comprehend and reason about user intentions via textual input, which is essential for advanced video segmentation [...]
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
Author(s): Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu The paper titled "T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation" introduces T2V-CompBench, a novel benchmark specifically designed to evaluate the capabilities of text-to-video (T2V) generation models in handling compositional tasks. This research addresses the significant gap in existing benchmarks, which often overlook the ability of T2V models to compose diffe[...]
Internal Consistency and Self-Feedback in Large Language Models: A Survey
Author(s): Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li, Feiyu Xiong, Zhiyu Li "Internal Consistency and Self-Feedback in Large Language Models: A Survey" provides a thorough examination of the mechanisms that ensure reliable and coherent outputs in large language models (LLMs). This survey focuses on two critical aspects: internal consistency and self-feedback, both of which are essential for enhancing the performance and reliability of LLMs in [...]
GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model
Author(s): Abdelrahman Shaker, Syed Talal Wasim, Salman Khan, Juergen Gall, Fahad Shahbaz Khan "GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model" introduces GroupMamba, a novel approach designed to enhance the efficiency and accuracy of visual state space models (VSSMs) in handling group-based visual tasks. This research addresses the challenge of developing models that can efficiently process and analyze visual data in group settings, which is crucial for a[...]
Training-Free Model Merging for Multi-target Domain Adaptation
Author(s): Wenyi Li, Huan-ang Gao, Mingju Gao, Beiwen Tian, Rong Zhi, Hao Zhao "Training-Free Model Merging for Multi-target Domain Adaptation" introduces a novel approach to domain adaptation that enables the merging of multiple pre-trained models without the need for additional training. This research addresses the challenge of adapting models to new target domains efficiently, which is crucial for applications in machine learning and artificial intelligence where models must generali[...]
Visual Haystacks: Answering Harder Questions About Sets of Images
Author(s): Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan "Visual Haystacks: Answering Harder Questions About Sets of Images" introduces a novel framework designed to enhance the ability of vision-language models (VLMs) to handle complex queries about large sets of images. This research addresses the challenge of extracting relevant information from extensive visual contexts, which is crucial for applications in multimedia co[...]
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Author(s): Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu "LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models" presents a critical examination of the current evaluation practices for large multimodal models (LMMs). This research addresses the growing concern that existing evaluation methodologies may not adequately capture the true capabilities and limitations of LMMs, wh[...]
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
Author(s): Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov The paper titled "VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control" introduces an innovative approach to enhancing the capabilities of video diffusion models for 3D camera control. This research addresses the challenge of effectively managing and controlling 3D[...]
SMooDi: Stylized Motion Diffusion Model
Author(s): Lei Zhong, Yiming Xie, Varun Jampani, Deqing Sun, Huaizu Jiang "SMooDi: Stylized Motion Diffusion Model" introduces an innovative approach to generating stylized human motion using diffusion models. This research addresses the challenge of creating realistic and expressive human motion sequences that incorporate specific stylistic elements, which is crucial for applications in animation, virtual reality, and interactive media. SMooDi leverages the power of diffusion models[...]
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
Author(s): Mo Li, Songyang Zhang, Yunxin Liu, Kai Chen "NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?" introduces NeedleBench, a novel framework designed to evaluate the capabilities of large language models (LLMs) in handling extensive context windows up to one million tokens. This research addresses the challenge of determining whether LLMs can effectively perform retrieval and reasoning tasks when provided with exceptionally long contexts, which is cri[...]
Efficient Training with Denoised Neural Weights
Author(s): Yifan Gong, Zheng Zhan, Yanyu Li, Yerlan Idelbayev, Andrey Zharkov, Kfir Aberman, Sergey Tulyakov, Yanzhi Wang, Jian Ren The paper titled "Efficient Training with Denoised Neural Weights" introduces a novel approach aimed at enhancing the efficiency of training deep neural networks by utilizing denoised neural weights. This research addresses the challenge of improving the performance and convergence speed of neural networks, which is crucial for a wide range of applications [...]
Does Refusal Training in LLMs Generalize to the Past Tense?
Author(s): Maksym Andriushchenko, Nicolas Flammarion "Does Refusal Training in LLMs Generalize to the Past Tense?" explores an intriguing aspect of large language models (LLMs): their ability to generalize refusal behaviors across different grammatical tenses. Refusal training is a technique used to teach LLMs to decline generating content that might be harmful or inappropriate. This study specifically investigates whether LLMs trained to refuse certain prompts in the present tense can [...]
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Author(s): Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu "Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes" introduces a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which focuses on segmenting objects within visual scenes based on audio cues and textual references. This research addresses the challenge of integrating audio-visual information with natural language processing to enhance object segmentation, a critical task for ap[...]
No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations
Author(s): Walter Simoncini, Spyros Gidaris, Andrei Bursuc, Yuki M. Asano "No Train, All Gain: Self-Supervised Gradients Improve Deep Frozen Representations" introduces a novel approach to enhancing the performance of deep neural networks by leveraging self-supervised gradients without the need for additional training. This research addresses the challenge of improving pre-trained models, which are often used in various applications but may not always perform optimally out-of-the-box[...]
VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
Author(s): Bocheng Zou, Mu Cai, Jianrui Zhang, Yong Jae Lee The paper titled "VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation" introduces VGBench, a comprehensive benchmark designed to assess the capabilities of large language models (LLMs) in understanding and generating vector graphics. This research addresses the challenge of evaluating LLMs in the context of vector graphics, which are crucial for applications in digital art, graphic design[...]
ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts
Author(s): Amelia F. Hardy, Houjun Liu, Bernard Lange, Mykel J. Kochenderfer "ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts" introduces ASTPrompter, a novel framework designed to enhance the process of identifying toxic prompts in large language models (LLMs) through automated red-teaming. This research addresses the challenge of ensuring the safety and reliability of LLMs by systematically discovering prompts that could trigger[...]
Benchmarking Large Neighborhood Search for Multi-Agent Path Finding
Author(s): Jiaqi Tan, Yudong Luo, Jiaoyang Li, Hang Ma "Benchmarking Large Neighborhood Search for Multi-Agent Path Finding" presents a comprehensive evaluation of Large Neighborhood Search (LNS) algorithms applied to the Multi-Agent Path Finding (MAPF) problem. This research addresses the challenge of finding collision-free paths for multiple agents, which is crucial for applications in robotics, autonomous vehicles, and traffic management. MAPF involves planning paths for multipl[...]
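The LNS template this benchmark evaluates follows a simple loop: destroy part of the incumbent solution (e.g., remove some agents' paths), repair it (replan them), and accept the result only if the cost improves. The sketch below illustrates that generic loop on a toy problem; the function names and the toy cost are illustrative assumptions, not the paper's MAPF-specific operators.

```python
import random

def large_neighborhood_search(solution, cost, destroy, repair, iters=100, seed=0):
    """Generic LNS loop: repeatedly destroy part of the incumbent
    solution, repair it, and keep the candidate if cost improves.
    (A sketch of the LNS template, not any specific MAPF variant.)"""
    rng = random.Random(seed)
    best = solution
    best_cost = cost(best)
    for _ in range(iters):
        partial = destroy(best, rng)       # e.g., remove some agents' paths
        candidate = repair(partial, rng)   # e.g., replan the removed agents
        c = cost(candidate)
        if c < best_cost:                  # accept only improvements
            best, best_cost = candidate, c
    return best, best_cost

# Toy use: minimize the sum of a vector by re-randomizing a subset of entries.
cost = sum
destroy = lambda s, rng: [x if rng.random() < 0.5 else None for x in s]
repair = lambda p, rng: [x if x is not None else rng.randint(0, 9) for x in p]
best, c = large_neighborhood_search([9] * 5, cost, destroy, repair, iters=200)
print(c <= 45)  # True: acceptance only on improvement, so never worse than the start
```

In MAPF terms, the destroy step would select a conflict-prone subset of agents and the repair step would replan them with a single-agent solver while treating the remaining paths as fixed obstacles.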
StyleSplat: 3D Object Style Transfer with Gaussian Splatting
Author(s): Sahil Jain, Avik Kuthiala, Prabhdeep Singh Sethi, Prakanshul Saxena "StyleSplat: 3D Object Style Transfer with Gaussian Splatting" introduces StyleSplat, an innovative method designed to achieve efficient and high-quality style transfer for 3D objects using Gaussian splatting. This research addresses the challenge of stylizing 3D objects in a way that is both computationally efficient and visually compelling, which is crucial for applications in digital art, gaming, and vir[...]
Real-Time Anomaly Detection and Reactive Planning with Large Language Models
Author(s): Rohan Sinha, Amine Elhafsi, Christopher Agia, Matthew Foutter, Edward Schmerling, Marco Pavone "Real-Time Anomaly Detection and Reactive Planning with Large Language Models" introduces a novel framework that leverages the capabilities of large language models (LLMs) for real-time anomaly detection and reactive planning in dynamic environments. This research addresses the critical need for systems that can not only detect anomalies as they occur but also react promptly and eff[...]
Video Diffusion Alignment via Reward Gradients
Author(s): Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, Deepak Pathak "Video Diffusion Alignment via Reward Gradients" introduces a novel approach to enhancing video diffusion models by aligning them with specific downstream tasks using reward gradients. This research addresses the challenge of adapting pre-trained video diffusion models to perform well on particular tasks, leveraging the dense gradient information provided by vision discriminative models. [...]
MAVIS: Mathematical Visual Instruction Tuning
Author(s): Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li "MAVIS: Mathematical Visual Instruction Tuning" introduces MAVIS, a novel framework designed to enhance the capabilities of multimodal large language models (MLLMs) in understanding and solving mathematical problems that involve visual elements. This research addresses the challenge of integrating visual mathematical conten[...]
Learning In-Hand Translation Using Tactile Skin With Shear and Normal Force Sensing
Author(s): Jessica Yin, Haozhi Qi, Jitendra Malik, James Pikul, Mark Yim, Tess Hellebrekers "Learning In-Hand Translation Using Tactile Skin With Shear and Normal Force Sensing" introduces an innovative approach to robotic manipulation that leverages advanced tactile sensing technology. This research addresses the challenge of enabling robots to perform in-hand translation tasks, which involve manipulating objects within the hand, using tactile feedback to achieve precise control. [...]
AdaptiGraph: Material-Adaptive Graph-Based Neural Dynamics for Robotic Manipulation
Author(s): Kaifeng Zhang, Baoyu Li, Kris Hauser, Yunzhu Li "AdaptiGraph: Material-Adaptive Graph-Based Neural Dynamics for Robotic Manipulation" introduces AdaptiGraph, an innovative framework designed to enhance robotic manipulation by adapting to different material properties. This research addresses the challenge of enabling robots to handle a wide variety of objects with varying material characteristics, which is crucial for applications in manufacturing, logistics, and service robo[...]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Author(s): Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li "LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models" introduces LLaVA-NeXT-Interleave, an advanced framework designed to enhance the capabilities of large multimodal models (LMMs) by integrating multi-image, video, and 3D data. This research addresses the growing need for models that can handle diverse and complex data types, which is crucial for applicatio[...]
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Author(s): Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs" introduces the Situational Awareness Dataset (SAD), a benchmark designed to evaluate the situational awareness capabilities of large language models (LLMs). This research addresses the growing need to understand how LLMs perceive and interpret their own operational c[...]
RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation
Author(s): Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, Yue Wang "RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation" introduces a novel framework designed to enhance the generalizability of robotic manipulation in zero-shot scenarios. This framework, named RAM (Retrieval-Based Affordance Transfer), addresses the challenge of enabling robots to perform manipulation tasks on objects and in environments [...]
CountGD: Multi-Modal Open-World Counting
Author(s): Niki Amini-Naieni, Tengda Han, Andrew Zisserman "CountGD: Multi-Modal Open-World Counting" introduces a novel approach to object counting in diverse and dynamic environments using multi-modal data inputs. This research addresses the challenge of accurately counting objects in real-world scenarios where the variety and complexity of data can significantly hinder performance. CountGD leverages multiple data mod[...]
A Unified Framework for 3D Scene Understanding
Author(s): Wei Xu, Chunsheng Shi, Sifan Tu, Xin Zhou, Dingkang Liang, Xiang Bai The paper titled "A Unified Framework for 3D Scene Understanding" introduces UniSeg3D, a comprehensive framework designed to enhance the understanding of 3D scenes. This framework aims to address the diverse and complex requirements of 3D scene segmentation, providing a unified solution that integrates multiple segmentation tasks into a single model. UniSeg3D is built to handle a wide range of segmentatio[...]
DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents
Author(s): Yilun Xu, Gabriele Corso, Tommi Jaakkola, Arash Vahdat, Karsten Kreis The paper titled "DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents" introduces Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff), a novel approach designed to improve the performance and efficiency of diffusion models in generative learning tasks. This research addresses the challenge of balancing the complexity and computational demands of continuous diffusion model[...]
Neurocache: Efficient Vector Retrieval for Long-range Language Modeling
Author(s): Ali Safaya, Deniz Yuret "Neurocache: Efficient Vector Retrieval for Long-range Language Modeling" introduces Neurocache, a novel approach designed to extend the effective context size of large language models (LLMs). This method addresses the challenge of maintaining long-range dependencies in language models, which is crucial for tasks that require understanding and generating coherent text over extended sequences. Neurocache leverages an external vector memory to store p[...]
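The external vector memory described here follows a familiar pattern: store past state vectors, then retrieve the most similar ones for the current query. The sketch below is a minimal, generic illustration of that pattern (the class and method names are mine), not the Neurocache implementation.

```python
import numpy as np

class VectorCache:
    """Toy external vector memory: stores past state vectors and
    retrieves the k most similar ones by cosine similarity.
    (Illustrative sketch only -- not the Neurocache implementation.)"""

    def __init__(self, dim):
        self.dim = dim
        self.store = np.empty((0, dim))

    def add(self, vectors):
        # Append new state vectors to the cache.
        self.store = np.vstack([self.store, vectors])

    def retrieve(self, query, k=2):
        # Normalize, score by cosine similarity, return the top-k entries.
        if len(self.store) == 0:
            return np.empty((0, self.dim))
        q = query / np.linalg.norm(query)
        s = self.store / np.linalg.norm(self.store, axis=1, keepdims=True)
        scores = s @ q
        top = np.argsort(scores)[::-1][:k]
        return self.store[top]

cache = VectorCache(dim=4)
cache.add(np.eye(4))  # four orthogonal "past states"
hits = cache.retrieve(np.array([1.0, 0.1, 0.0, 0.0]), k=2)
print(hits.shape)  # (2, 4)
```

In a long-range language model, the stored vectors would be hidden states from earlier segments, and the retrieved entries would be fed back into attention so the model can condition on context far beyond its native window.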
Value-Penalized Auxiliary Control from Examples for Learning without Rewards or Demonstrations
Author(s): Trevor Ablett, Bryan Chan, Jayce Haoran Wang, Jonathan Kelly "Value-Penalized Auxiliary Control from Examples for Learning without Rewards or Demonstrations" introduces a novel approach to reinforcement learning that does not rely on traditional reward signals or expert demonstrations. This method addresses the challenge of enabling agents to learn effective policies in environments where explicit rewards are unavailable or impractical to define. The core idea behind this [...]
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Author(s): Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang The paper titled "InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output" introduces InternLM-XComposer-[...]
Magic Insert: Style-Aware Drag-and-Drop
Author(s): Nataniel Ruiz, Yuanzhen Li, Neal Wadhwa, Yael Pritch, Michael Rubinstein, David E. Jacobs, Shlomi Fruchter The paper titled "Magic Insert: Style-Aware Drag-and-Drop" introduces an innovative method for seamlessly integrating subjects from one image into a target image of a different style. This research addresses the challenge of maintaining both physical plausibility and style consistency when transferring elements between images, which is crucial for applications in digital[...]
E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness
Author(s): Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, Vicky Kalogeiton The paper titled "E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness" introduces a novel approach to generating camera trajectories based on textual descriptions, with a specific focus on character awareness. This research addresses the challenge of creating dynamic and contextually appropriate camera movements in response to narrative cues, which is essenti[...]
Open-TeleVision: Teleoperation with Immersive Active Visual Feedback
Author(s): Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, Xiaolong Wang "Open-TeleVision: Teleoperation with Immersive Active Visual Feedback" introduces Open-TeleVision, a cutting-edge teleoperation system designed to enhance the collection of on-robot data for robot learning from demonstrations. This system aims to improve the intuitiveness and ease of use of teleoperation, which are crucial for ensuring high-quality, diverse, and scalable data collection. Open-TeleVision leverages i[...]
Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision
Author(s): Hao Dong, Eleni Chatzi, Olga Fink "Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision" introduces a novel framework aimed at enhancing the ability of models to generalize and adapt to new, unseen domains in a multimodal context. This research addresses the challenge of recognizing novel classes within unseen domains, a task known as open-set domain generalization (OSDG), which is particularly complex when dealing with multiple data mod[...]
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Author(s): Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu "KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches" explores the trade-offs involved in compressing key-value (KV) caches for large language models (LLMs) to handle long-context tasks efficiently. This research addresses the significant challenge of mana[...]
Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning
Author(s): Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, Masayoshi Tomizuka "Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning" introduces an innovative approach to robot learning that leverages sparse diffusion models to enhance efficiency and flexibility. This research addresses the challenges of developing robust and adaptable robot policies that can efficiently learn from[...]
Empowering 3D Visual Grounding with Reasoning Capabilities
Author(s): Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu "Empowering 3D Visual Grounding with Reasoning Capabilities" introduces a novel approach to enhance 3D visual grounding by integrating advanced reasoning capabilities. This research addresses the challenge of accurately identifying and localizing objects within 3D scenes based on textual descriptions, a task that is crucial for applications in robotics, augmented reality, and autonomous systems. The proposed method [...]
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Author(s): Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo The paper titled "LLaRA: Supercharging Robot Learning Data for Vision-Language Policy" introduces LLaRA (Large Language and Robotics Assistant), a novel framework designed to enhance robot learning by integrating vision and language data. This research addresses the challenge of developing robots that can understand[...]
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Author(s): Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen "Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs" introduces Web2Code, a comprehensive dataset and evaluation framework designed to advance the capabilities of multimodal large language models [...]
Odd-One-Out: Anomaly Detection by Comparing with Neighbors
Author(s): Ankan Bhunia, Changjian Li, Hakan Bilen "Odd-One-Out: Anomaly Detection by Comparing with Neighbors" introduces a novel approach to anomaly detection that leverages the concept of comparing data points with their neighbors to identify anomalies. This method addresses the challenge of detecting anomalies in datasets, where traditional methods may struggle due to the subtlety or complexity of the anomalies. The core idea behind this approach is to identify anomalies by exam[...]
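The neighbor-comparison idea described here has a simple classical analogue: score each data point by how far it sits from its nearest neighbors, so the "odd one out" stands out. The sketch below illustrates that generic idea on 2D points; it is an assumption-laden toy, not the paper's method (which operates on objects in 3D scenes).

```python
import numpy as np

def knn_anomaly_scores(points, k=3):
    """Score each point by its mean distance to its k nearest
    neighbors: points far from all their neighbors score high.
    (Generic neighbor-comparison idea, not the paper's 3D method.)"""
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distance
    nearest = np.sort(d, axis=1)[:, :k]  # k smallest distances per point
    return nearest.mean(axis=1)

# A tight cluster plus one outlier: the outlier gets the top score.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
scores = knn_anomaly_scores(pts, k=3)
print(int(np.argmax(scores)))  # 4
```

In the paper's setting, the comparison is between sibling objects in the same scene rather than raw coordinates, but the scoring principle is the same: an item is anomalous when it disagrees with its neighbors.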
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild
Author(s): Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, Artem Babenko The paper titled "TabReD: A Benchmark of Tabular Machine Learning in-the-Wild" introduces TabReD, a comprehensive benchmark designed to evaluate the performance of machine learning models on real-world tabular data. This benchmark addresses the need for robust evaluation frameworks that reflect the complexities and challenges encountered in practical applications of machine learning. TabReD is built to asses[...]
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
Author(s): Ali Khaleghi Rahimian, Manish Kumar Govind, Subhajit Maity, Dominick Reilly, Christian Kümmerle, Srijan Das, Aritra Dutta "Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads" introduces a novel approach to visual representation learning by leveraging diverse attention mechanisms across multiple heads. This method aims to enhance the learning of visual features by incorporating a variety of attention patterns, which allows for a more [...]
Looking 3D: Anomaly Detection with 2D-3D Alignment
Author(s): Ankan Bhunia, Changjian Li, Hakan Bilen "Looking 3D: Anomaly Detection with 2D-3D Alignment" introduces a novel approach to anomaly detection by leveraging the alignment of 2D and 3D data. This method addresses the limitations of traditional 2D anomaly detection techniques, which often struggle to differentiate between subtle surface defects and normal textures due to the lack of depth information. The proposed approach integrates 2D images with 3D point cloud data to [...]
MatchTime: Towards Automatic Soccer Game Commentary Generation
Author(s): Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, Weidi Xie "MatchTime: Towards Automatic Soccer Game Commentary Generation" introduces an innovative approach to generating real-time commentary for soccer games using advanced machine learning techniques. This research addresses the challenge of creating dynamic and contextually relevant commentary that enhances the viewing experience for soccer fans. MatchTime leverages a combination of computer vision and natural languag[...]
Symbolic Learning Enables Self-Evolving Agents
Author(s): Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang "Symbolic Learning Enables Self-Evolving Agents" introduces a novel framework that leverages symbolic learning to create self-evolving agents capable of solving complex real-world tasks. This research addresses the challenge of developing agents that can adapt and improve over time without extensive human interven[...]
On Scaling Up 3D Gaussian Splatting Training
Author(s): Hexu Zhao, Haoyang Weng, Daohan Lu, Ang Li, Jinyang Li, Aurojit Panda, Saining Xie "On Scaling Up 3D Gaussian Splatting Training" explores the potential of training high-parameter 3D Gaussian Splatting (3DGS) models on large-scale, high-resolution datasets. This research addresses the challenges associated with scaling up 3DGS models to handle more complex scenes with higher spatial resolution and larger datasets, which are essential for achieving high-quality 3D scene reco[...]
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Author(s): Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang "MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning" introduces MG-LLaVA, an advanced multi-modal large language model (MLLM) designed to enhance visual processing capabilities by incorporating multi-granularity vision inputs. This innovative approach addresses the limitations of existing models that primarily process low-resolution images, which restricts their effectiveness in ta[...]
Fast and Uncertainty-Aware SVBRDF Recovery from Multi-View Capture using Frequency Domain Analysis
Author(s): Ruben Wiersma, Julien Philip, Miloš Hašan, Krishna Mullia, Fujun Luan, Elmar Eisemann, Valentin Deschaintre "Fast and Uncertainty-Aware SVBRDF Recovery from Multi-View Capture using Frequency Domain Analysis" presents a novel approach to recovering spatially-varying bidirectional reflectance distribution functions (SVBRDFs) from multi-view image captures. This method addresses the challenges of accurately and efficiently capturing the complex reflectance properties of surf[...]
Text-Animator: Controllable Visual Text Video Generation
Author(s): Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian The paper titled "Text-Animator: Controllable Visual Text Video Generation" presents an innovative approach to generating videos from textual descriptions, offering fine-grained control over both visual and motion aspects of the generated content. This research addresses the challenge of creating dynamic and visually coherent videos based solely on text inputs, which has significant[...]
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models
Author(s): Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, Ziwei Liu "FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models" introduces an innovative approach to controlling object trajectories in video generation without the need for extensive tuning or retraining. This method addresses the challenge of achieving precise and flexible control over the motion of objects in generated videos, which is crucial for applications in animation, virtual reality, and[...]
StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal
Author(s): Chongjie Ye, Lingteng Qiu, Xiaodong Gu, Qi Zuo, Yushuang Wu, Zilong Dong, Liefeng Bo, Yuliang Xiu, Xiaoguang Han The paper titled "StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal" introduces a novel approach to improving the stability and sharpness of normal estimates in diffusion models. Diffusion models are widely used in various applications, including image synthesis and denoising, but they often suffer from high variance during the inference proce[...]
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Author(s): Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, Hongyang Zhang "Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models" addresses the evolving landscape of Referring Expression Comprehension (REC) in light of advancements in large multimodal models (LMMs). REC is a task that involves identifying and localizing objects in images based on natural language descriptions. Traditional REC methods[...]