Columns: Paper ID | How Paper is being presented | How will paper be presented? | Track Theme | Track Theme for Program | Title | Authors Names | Who registered Paper? | Presenter Name | Abstract | Room Location | Session | Whova Session Titles | Sub-session (ex. ML 1, ML 2, etc.) | Session Date | Session time | Talk order | Session Chair | Start Time (UAE) | End Time (UAE) | Gather Time Added to Whova
Paper ID: 1 | Poster | Virtual
Dialogue and Conversational Interaction:Ethical Reviewing
Dialogue and Conversational Interaction
PreAct: Prediction Enhances Agent's Planning Ability
Dayuan Fu, Jianzhao Huang, Siyuan Lu, Guanting Dong, Yejie Wang, Keqing He and Weiran Xu
Dayuan Fu | Dayuan Fu
Addressing the disparity between predictions and actual results can enable individuals to expand their thought processes and stimulate self-reflection, thus promoting accurate planning.
In this research, we present **PreAct**, an agent framework that integrates **pre**diction, **rea**soning, and **act**ion. By utilizing the information derived from predictions, the large language model (LLM) agent can provide a wider range and more strategically focused reasoning. This leads to more efficient actions that aid the agent in accomplishing intricate tasks. Our experimental results show that PreAct surpasses the ReAct method in completing complex tasks and that PreAct's performance can be further improved when paired with other memory or selection strategy techniques. We presented the model with varying quantities of historical predictions and discovered that these predictions consistently enhance LLM planning.
The variances in single-step reasoning between PreAct and ReAct indicate that PreAct indeed has benefits in terms of diversity and strategic orientation over ReAct.
Room: Gather | Session: Gather Session 2 | Whova: Gather | Tue, Jan 28 | 07:00-08:30 | UAE: 7:00-8:30
Paper ID: 7 | Poster | In-person
Language Resources and Evaluation:Ethical Reviewing
The PRECOM-SM Corpus: Gambling in Spanish Social Media
Pablo Álvarez-Ojeda, María Victoria Cantero-Romero, Anastasia Semikozova and Arturo Montejo-Raez
Arturo Montejo-Ráez
Arturo Montejo-Ráez
Gambling addiction is a "silent problem" in society, especially among young people in recent years due to the easy access to betting and gambling sites on the Internet through smartphones and personal computers. As online communities in messaging apps, forums and other "teenagers gathering" sites keep growing day by day, more textual information is available for its study. This work focuses on collecting text from online Spanish-speaking communities and analysing it in order to find patterns in written language from frequent and infrequent users on the collected platforms so that an emerging gambling addiction problem can be detected. In this paper, a newly built corpus is introduced, as well as an extensive description of how it has been made. In addition, some baseline experiments on the data have been carried out, employing the generated features after the analysis of the text with different machine learning approaches like the bag of words model or deep neural network encodings.
Room: Atrium | Session: Session 4: Oral/Poster C | Whova: Poster | Tue, Jan 21 | 16:00-17:30 | UAE: 16:00-17:30
Paper ID: 12 | Poster | Virtual
Ethical Reviewing:Language Modeling
Language Modeling
How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities
Jerry Huang | Jerry Huang | Jerry Huang
Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous downstream use cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of models that are purported to support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.
Room: Gather | Session: Gather Session 3 | Whova: Gather | Tue, Jan 28 | 13:00-14:30 | UAE: 13:00-14:30 | Added to Whova: Yes
Paper ID: 16 | Poster | Virtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis
Kaiwei Sun and Mi Tian
Mi Tian | Mi Tian
Multimodal Sentiment Analysis (MSA) aims to identify human attitudes from diverse modalities such as visual, audio and text modalities. Recent studies suggest that the text modality tends to be the most effective, which has encouraged models to consider text as its core modality. However, previous methods primarily concentrate on projecting modalities other than text into a space close to the text modality and learning an identical representation, which does not fully make use of the auxiliary information provided by audio and visual modalities. In this paper, we propose a framework, Sequential Fusion of Text-close and Text-far Representations (SFTTR), aiming to refine multimodal representations from multimodal data which should contain both representations close to and far from the text modality. Specifically, we employ contrastive learning to sufficiently explore the information similarities and differences between text and audio/visual modalities. Moreover, to fuse the extracted representations more effectively, we design a sequential cross-modal encoder to sequentially fuse representations that are close to and far from the text modality.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 22 | Poster | Virtual
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Natural Language Generation, Summarization and Simplification
PoemBERT: A Dynamic Masking Content and Ratio Based Semantic Language Model For Chinese Poem Generation
Chihan Huang and Xiaobo Shen
Chihan Huang
Chihan Huang
Ancient Chinese poetry stands as a crucial treasure in Chinese culture. To address the absence of pre-trained models for ancient poetry, we introduced PoemBERT, a BERT-based model utilizing a corpus of classical Chinese poetry. Recognizing the unique emotional depth and linguistic precision of poetry, we incorporated sentiment and pinyin embeddings into the model, enhancing its sensitivity to emotional information and addressing challenges posed by the phenomenon of multiple pronunciations for the same Chinese character. Additionally, we proposed Character Importance-based masking and dynamic masking strategies, significantly augmenting the model's capability to extract imagery-related features and handle poetry-specific information. Fine-tuning our PoemBERT model on various downstream tasks, including poem generation and sentiment classification, resulted in state-of-the-art performance in both automatic and manual evaluations. We provided explanations for the selection of the dynamic masking rate strategy and proposed a solution to the issue of a small dataset size.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 29 | Poster | In-person
Sentiment Analysis, Opinion and Argument Mining:Ethical Reviewing
CDA^2: Counterfactual Diffusion Augmentation for Cross-Domain Adaptation in Low-Resource Sentiment Analysis
Dancheng Xin, Kaiqi Zhao, Jingyun Sun and Yang Li
Yang Li
Yang Li
Domain adaptation is widely employed in cross-domain sentiment analysis, enabling the transfer of models from label-rich source domains to a target domain with fewer or no labels. However, concerns have been raised regarding their robustness and sensitivity to data distribution shift, particularly when encountering significant disparities in data distribution between the different domains. To tackle this problem, we introduce CDA^2, a framework for cross-domain adaptation in low-resource sentiment analysis, which utilizes counterfactual diffusion augmentation. Specifically, it employs samples derived from domain-relevant word substitutions in source domain samples to guide the diffusion model for generating high-quality counterfactual target domain samples. We adopt a soft absorbing state and MMD loss during the training stage, and use advanced ODE solvers to expedite the sampling process. Our experiments demonstrate that CDA^2 generates high-quality target samples and achieves state-of-the-art performance in cross-domain sentiment analysis.
Room: Atrium | Session: Session 6: Oral/Poster E | Whova: Poster | Wed, Jan 22 | 11:00-12:30 | UAE: 11:00-12:30
Paper ID: 35 | Poster | In-person
Ethical Reviewing:Language Resources and Evaluation
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li and Jing Ma
Ziyang Luo | Ziyang Luo
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our benchmark is available at https://github.com/CodeLLM-Research/CodeJudge-Eval .
Room: Atrium | Session: Session 4: Oral/Poster C | Whova: Poster | Tue, Jan 21 | 16:00-17:30 | UAE: 16:00-17:30
Paper ID: 39 | Oral | In-person
Information Extraction:Ethical Reviewing
Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching
Tianshu Wang, Xiaoyang Chen, Hongyu Lin, Xuanang Chen, Xianpei Han, Le Sun, Hao Wang and Zhenyu Zeng
Tianshu Wang
Tianshu Wang
Entity matching (EM) is a critical step in entity resolution (ER). Recently, entity matching based on large language models (LLMs) has shown great promise. However, current LLM-based entity matching approaches typically follow a binary matching paradigm that ignores the global consistency among record relationships. In this paper, we investigate various methodologies for LLM-based entity matching that incorporate record interactions from different perspectives. Specifically, we comprehensively compare three representative strategies: matching, comparing, and selecting, and analyze their respective advantages and challenges in diverse scenarios. Based on our findings, we further design a compound entity matching framework (ComEM) that leverages the composition of multiple strategies and LLMs. ComEM benefits from the advantages of different sides and achieves improvements in both effectiveness and efficiency. Experimental results on 8 ER datasets and 10 LLMs verify the superiority of incorporating record interactions through the selecting strategy, as well as the further cost-effectiveness brought by ComEM.
Room: Hall B-B | Session: Session 4: Oral/Poster C | Sub-session: Information extraction and retrieval 1 | Tue, Jan 21 | 16:00-17:30 | Chair: Ge Shi | UAE: 16:15-16:30
Paper ID: 40 | Poster | Virtual
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Natural Language Generation, Summarization and Simplification
InstructGEC: Enhancing Unsupervised Grammatical Error Correction with Instruction Tuning
Jiayi Deng, Chen Chen, Chunyan Hou and Xiaojie Yuan
Jiayi Deng
Jiayi Deng
Recent works have proposed methods of generating synthetic data automatically for unsupervised Grammatical Error Correction (GEC). Although a large amount of synthetic data is generated at a low cost, it is unrealistic and of poor quality. The copying phenomenon of synthetic data prevents GEC models from learning the semantic knowledge of contextual language. In this paper, we design an instruction format and use the masking strategy in both an erroneous sentence and the corresponding instruction consistently to alleviate the impact of the copy phenomenon. We also propose a novel approach, InstructGEC, which integrates the knowledge of grammatical detection into GEC models with instruction tuning to address the low-quality issue. Experiments are conducted on English and Chinese GEC datasets and results demonstrate that our method outperforms state-of-the-art unsupervised GEC methods.
Room: Gather | Session: Gather Session 2 | Whova: Gather | Tue, Jan 28 | 07:00-08:30 | UAE: 7:00-8:30 | Added to Whova: Yes
Paper ID: 41 | Oral | In-person
Ethical Reviewing:Dialogue and Conversational Interaction
Sibyl: Empowering Empathetic Dialogue Generation in Large Language Models via Sensible and Visionary Commonsense Inference
Lanrui Wang, Jiangnan Li, Chenxu Yang, Zheng Lin, Hongyin Tang, Huan Liu, Yanan Cao, Jingang Wang and Weiping Wang
Lanrui Wang | Lanrui Wang
Recently, there has been a heightened interest in building chatbots based on Large Language Models (LLMs) to emulate human-like qualities in multi-turn conversations. Despite having access to commonsense knowledge to better understand the psychological aspects and causality of dialogue context, even these powerful LLMs struggle to achieve the goals of empathy and emotional support. Current commonsense knowledge derived from dialogue contexts is inherently limited and often fails to adequately anticipate the future course of a dialogue. This lack of foresight can mislead LLMs and hinder their ability to provide effective support. In response to this challenge, we present an innovative framework named Sensible and Visionary Commonsense Knowledge (Sibyl). Designed to concentrate on the immediately succeeding dialogue, this paradigm equips LLMs with the capability to uncover the implicit requirements of the conversation, aiming to elicit more empathetic responses. Experimental results demonstrate that incorporating our paradigm for acquiring commonsense knowledge into LLMs comprehensively enhances the quality of their responses.
Room: Hall B-C | Session: Session 4: Oral/Poster C | Sub-session: Dialogue and Conversational Interaction 1 | Tue, Jan 21 | 16:00-17:30 | Chair: Li Zhang | UAE: 16:30-16:45
Paper ID: 42 | Poster | Virtual
Multimodal and Grounded Language Acquisition, HRI:Ethical Reviewing
Multimodal and Grounded Language Acquisition, HRI
Noise-powered Multi-modal Knowledge Graph Representation Framework
Zhuo Chen, Yin Fang, Yichi Zhang, Lingbing Guo, Jiaoyan Chen, Jeff Z. Pan, Huajun Chen and Wen Zhang
Zhuo Chen | Zhuo Chen
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph (MMKG) representation learning framework. Such a framework is essential for embedding structured knowledge into multi-modal Large Language Models effectively, alleviating issues like knowledge misconceptions and multi-modal hallucinations. In this work, we explore the efficacy of models in accurately embedding entities within MMKGs through two pivotal tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking to robustly integrate multi-modal entity features in KGs. By incorporating specific training objectives for both MKGC and MMEA, our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility. Moreover, SNAG can not only function as a standalone model but also enhance other existing methods, providing stable performance improvements. Code and data are available at https://github.com/zjukg/SNAG.
Room: Gather | Session: Gather Session 3 | Whova: Gather | Tue, Jan 28 | 13:00-14:30 | UAE: 13:00-14:30 | Added to Whova: Yes
Paper ID: 49 | Poster | Virtual
Ethical Reviewing:Language Resources and Evaluation
Language Resources and Evaluation
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
Junjie Ye, Guanyu Li, SongYang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui and Xuanjing Huang
Junjie Ye | Junjie Ye
Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 52 | Poster | Virtual
Ethical Reviewing:Information Extraction
Information Extraction
Federated Incremental Named Entity Recognition
Zesheng Liu, Qiannan Zhu, Cuiping Li and Hong Chen
Zesheng Liu
Zesheng Liu
Federated learning-based Named Entity Recognition (FNER) has attracted widespread attention through decentralized training on local clients. However, most FNER models assume that entity types are pre-fixed, so in practical applications, local clients constantly receive new entity types without enough storage to access old entity types, resulting in severe forgetting of previously learned knowledge. In addition, new clients collecting only new entity types may join the global training of FNER irregularly, further exacerbating catastrophic forgetting. To overcome the above challenges, we propose a Forgetting-Subdued Learning (FSL) model which addresses the forgetting problem on old entity types from both the intra-client and inter-client aspects. Specifically, for the intra-client aspect, we propose a prototype-guided adaptive pseudo labeling and a prototypical relation distillation loss to surmount catastrophic forgetting of old entity types with semantic shift. Furthermore, for the inter-client aspect, we propose a task transfer detector. It can identify the arrival of new entity types that are protected by privacy and store the latest old global model for relation distillation. Qualitative experiments have shown that our model has made significant improvements compared to several baseline methods.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 54 | Poster | In-person
Natural Language Generation, Summarization and Simplification:Ethical Reviewing
Large Language Models are Good Annotators for Type-aware Data Augmentation in Grammatical Error Correction
Xinyuan Li and Yunshi Lan
Xinyuan Li | Xinyuan Li
Large Language Models (LLMs) have achieved outstanding performance across various NLP tasks.
Grammatical Error Correction (GEC) is a task aiming at automatically correcting grammatical errors in text, but it encounters a severe shortage of annotated data. Researchers have tried to make full use of the generalization capabilities of LLMs and prompt them to correct erroneous sentences, which however results in unexpected over-correction issues. In this paper, we rethink the role of LLMs in GEC tasks and propose a method, namely TypeDA, considering LLMs as the annotators for type-aware data augmentation in GEC tasks. Different from the existing data augmentation methods, our method prevents in-distribution corruption and is able to generate sentences with multi-granularity error types. Our experiments verify that our method can generally improve the GEC performance of different backbone models with only a small amount of augmented data. Further analyses verify the high consistency and diversity of the pseudo data generated via our method.
Room: Atrium | Session: Session 3: Oral/Poster B | Whova: Poster | Tue, Jan 21 | 14:00-15:30 | UAE: 14:00-15:30
Paper ID: 55 | Oral | In-person
Ethical Reviewing:Speech Recognition and Synthesis, and Spoken Language Understanding
Looks can be Deceptive: Distinguishing Repetition Disfluency from Reduplication
Arif A. Ahmad, Khyathi Gayathri Mothika and Pushpak Bhattacharyya
Arif Ahmad | Arif Ahmad
Reduplication and repetition, though similar in form, serve distinct linguistic purposes. Reduplication is a deliberate morphological process used to express grammatical, semantic, or pragmatic nuances, while repetition is often unintentional and indicative of disfluency. This paper presents the first large-scale study of reduplication and repetition in speech using computational linguistics. We introduce IndicRedRep, a new publicly available dataset containing Hindi, Telugu, and Marathi text annotated with reduplication and repetition at the word level. We evaluate transformer-based models for multi-class reduplication and repetition token classification, utilizing the Reparandum-Interregnum-Repair structure to distinguish between the two phenomena. Our models achieve macro F1 scores of up to 85.62% in Hindi, 83.95% in Telugu, and 84.82% in Marathi for reduplication-repetition classification.
Room: Hall B-A | Session: Session 10: Oral/Poster H | Sub-session: Multimodal NLP 2 | Thu, Jan 23 | 11:00-12:30 | Chair: Artem Shelmanov | UAE: 11:15-11:30
Paper ID: 56 | Oral | In-person
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Learning to Verify Summary Facts with Fine-Grained LLM Feedback
Jihwan Oh, Jeonghwan Choi, Nicole Hee-Yoen Kim, Taewon Yun and Hwanjun Song
Hwanjun Song
Jihwan Oh
email: jh.oh@kaist.ac.kr
Training automatic summary fact verifiers often faces the challenge of a lack of human-labeled data. In this paper, we explore an alternative way of leveraging Large Language Model (LLM)-generated feedback to address the inherent limitation of using human-labeled data. We introduce FineSumFact, a large-scale dataset containing fine-grained factual feedback on summaries. We employ 10 distinct LLMs for diverse summary generation and Llama-3-70B-Instruct for feedback. We utilize this dataset to fine-tune the lightweight open-source model Llama-3-8B-Instruct, optimizing resource efficiency while maintaining high performance. Our experimental results reveal that the model trained on extensive LLM-generated datasets surpasses that trained on smaller human-annotated datasets when evaluated using human-generated test sets. Fine-tuning fact verification models with LLM feedback can be more effective and cost-efficient than using human feedback. The dataset is available at https://github.com/DISL-Lab/FineSumFact.
Room: Hall B-B | Session: Session 12: Oral/Poster I | Sub-session: Natural Language Generation and Summarization 2 | Fri, Jan 24 | 10:30-12:00 | Chair: Barbara Di Eugenio | UAE: 10:30-10:45
Paper ID: 59 | Poster | Virtual
Language Modeling:Ethical Reviewing
Language Modeling
FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models
Tao Fan, Guoqiang Ma, Yan Kang, Hanlin Gu, Yuanfeng Song, Lixin Fan, Kai Chen and Qiang Yang
Tao Fan | Tao Fan
Recent research in federated large language models (LLMs) has primarily focused on enabling clients to fine-tune their locally deployed homogeneous LLMs collaboratively or on transferring knowledge from server-based LLMs to small language models (SLMs) at downstream clients. However, a significant gap remains in the simultaneous mutual enhancement of both the server's LLM and clients' SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient federated mutual knowledge transfer framework for large and small language models. This framework is designed to adaptively transfer knowledge from the server's LLM to clients' SLMs while concurrently enhancing the LLM with clients' unique domain insights. We facilitate token alignment using minimum edit distance (MinED) and then selective mutual knowledge transfer between client-side SLMs and a server-side LLM, aiming to collectively enhance their performance. Through extensive experiments across three distinct scenarios, we evaluate the effectiveness of FedMKT by utilizing diverse public LLMs and SLMs on a variety of NLP text generation tasks. Empirical results demonstrate that FedMKT simultaneously boosts the performance of both LLMs and SLMs. Our code has been contributed to the FATE open-source project and is now publicly accessible at https://github.com/FederatedAI/FATE-LLM/tree/main/python/fate_llm/algo/fedmkt.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 65 | Poster | Virtual
Sentiment Analysis, Opinion and Argument Mining:Ethical Reviewing
Sentiment Analysis, Opinion and Argument Mining
Dynamic Graph Neural ODE Network for Multi-modal Emotion Recognition in Conversation
Yuntao Shou, tao meng, wei ai and KEQIN LI
Yuntao Shou | Yuntao Shou
Multimodal emotion recognition in conversation (MERC) refers to identifying and classifying human emotional states by combining data from multiple different modalities (e.g., audio, images, text, video, etc.). Specifically, human emotional expressions are often complex and diverse, and these complex emotional expressions can be captured and understood more comprehensively through the fusion of multimodal information. Most existing graph-based multimodal emotion recognition methods can only use shallow GCNs to extract emotion features and fail to capture the temporal dependencies caused by dynamic changes in emotions. To address the above problems, we propose a Dynamic Graph Neural Ordinary Differential Equation Network (DGODE) for multimodal emotion recognition in conversation, which combines the dynamic changes of emotions to capture the temporal dependency of speakers' emotions. Technically, the key idea of DGODE is to use the graph ODE evolution network to characterize the continuous dynamics of node representations over time and capture temporal dependencies. Extensive experiments on two publicly available multimodal emotion recognition datasets demonstrate that the proposed DGODE model has superior performance compared to various baselines. Furthermore, the proposed DGODE can also alleviate the over-smoothing problem, thereby enabling the construction of a deep GCN network.
Room: Gather | Session: Gather TBD | Whova: Gather | Date: TBD | Time: TBD | UAE: TBD
Paper ID: 68 | Poster | Virtual
Ethical Reviewing:Multimodal and Grounded Language Acquisition, HRI
Multimodal and Grounded Language Acquisition, HRI
HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding
Peng Xia, Xingtong Yu, Ming Hu, Lie Ju, Zhiyong Wang, Peibo Duan and Zongyuan Ge
Lie Ju | Lie Ju
Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, revealing limitations in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting the hierarchical relationships. These efforts are constrained by their inability to perform effectively across varied granularity of categories. To tackle this issue, we propose a novel framework (**HGCLIP**) that effectively combines **CLIP** with a deeper exploitation of the **H**ierarchical class structure via **G**raph representation learning. We explore constructing the class hierarchy into a graph, with its nodes representing the textual or image features of each category. After passing through a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through the attention mechanism. Our approach demonstrates significant improvements on 11 diverse visual recognition benchmarks. Our codes are fully available at https://github.com/richard-peng-xia/HGCLIP.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30
Paper ID: 69 | Poster | Virtual
Ethical Reviewing:Information Retrieval and Text Mining
Information Retrieval and Text Mining
Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement
Chenkai Sun, Ke Yang, Revanth Gangi Reddy, Yi Fung, Hou Pong Chan, Kevin Small, ChengXiang Zhai and Heng Ji
Chenkai Sun | Chenkai Sun
The increasing demand for personalized interactions with large language models (LLMs) calls for methodologies capable of accurately and efficiently identifying user opinions and preferences. Retrieval augmentation emerges as an effective strategy, as it can accommodate a vast number of users without the costs from fine-tuning. Existing research, however, has largely focused on enhancing the retrieval stage and devoted limited exploration toward optimizing the representation of the database, a crucial aspect for tasks such as personalization. In this work, we examine the problem from a novel angle, focusing on how data can be better represented for more data-efficient retrieval in the context of LLM customization. To tackle this challenge, we introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts and collaborative refinement to effectively bridge knowledge gaps among users. In the evaluation of response prediction, Persona-DB demonstrates superior context efficiency in maintaining accuracy with a significantly reduced retrieval size, a critical advantage in scenarios with extensive histories or limited context windows. Our experiments also indicate a marked improvement of over 10% under cold-start scenarios, when users have extremely sparse data. Furthermore, our analysis reveals the increasing importance of collaborative knowledge as the retrieval capacity expands.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 72 | Oral | In-person
Language Resources and Evaluation:Ethical Reviewing
Style Over Substance: Evaluation Biases for Large Language Models
Minghao Wu and Alham Fikri Aji
Alham Fikri Aji
Alham Fikri Aji
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Ranking the relative performance of LLMs based on Elo ratings, according to human or LLM judgment, is gaining more popularity. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed, machine-generated answers. Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System (MERS). Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced evaluations, indicating the need for further investigation.
Room: Suite 7 | Session: Session 3: Oral/Poster B | Sub-session: Language Resources and Evaluation 1 | Tue, Jan 21 | 14:00-15:30 | Chair: Firoj Alam | UAE: 14:30-14:45
Paper ID: 86 | Poster | Virtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
Multimodal Aspect-Based Sentiment Analysis under Conditional Relation
Xinjing Liu, Ruifan Li, Shuqin Ye, guangwei zhang and Xiaojie WANG
Xinjing Liu | Xinjing Liu
Ruifan Li
Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect terms from text-image pairs and identify their sentiments. Previous methods are based on the premise that the image contains the objects referred by the aspects within the text. However, this condition cannot always be met, resulting in a suboptimal performance. In this paper, we propose the COnditional Relation based Sentiment Analysis framework (CORSA). Specifically, we design a conditional relation detector (CRD) to mitigate the impact of the unmet conditional image. Moreover, we design a visual object localizer (VOL) to locate the exact condition-related visual regions associated with the aspects. With CRD and VOL, our CORSA framework takes a multi-task form. In addition, to effectively learn CORSA we conduct two types of annotations. One is the conditional relation using a pretrained referring expression comprehension model; the other is the bounding boxes of visual objects by a pretrained object detection model. Experiments on our built C-MABSA dataset show that CORSA consistently outperforms existing methods. The code and data are available at https://github.com/Liuxj-Anya/CORSA.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30
Paper ID: 89 | Poster | In-person
Ethical Reviewing:Lexical Semantics
Semantic Role Labeling of NomBank Partitives
Adam Meyers, Advait Pravin Savant and John E. Ortega
John Ortega | John Ortega
This article is about Semantic Role Labeling for English partitive nouns (5%/REL of the price/ARG1; The price/ARG1 rose 5 percent/REL) in the NomBank annotated corpus. Several systems are described using traditional and transformer-based machine learning, as well as ensembling. Our highest scoring system achieves an F1 of 91.74% using "gold" parses from the Penn Treebank and 91.12% when using the Berkeley Neural parser. This research includes both classroom and experimental settings for system development.
Room: Atrium | Session: Session 3: Oral/Poster B | Whova: Poster | Tue, Jan 21 | 14:00-15:30 | UAE: 14:00-15:30
Paper ID: 99 | Oral | In-person
Ethical Reviewing:NLP and LLM Applications
MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation
Dongjun Lee, Choongwon Park, Jaehyuk Kim and Heesoo Park
Dongjun Lee
Dongjun Lee
Recent advancements in large language models (LLMs) have enabled in-context learning (ICL)-based methods that significantly outperform fine-tuning approaches for text-to-SQL tasks. However, their performance is still considerably lower than that of human experts on benchmarks that include complex schemas and queries, such as BIRD. This study considers the sensitivity of LLMs to the prompts and introduces a novel approach that leverages multiple prompts to explore a broader search space for possible answers and effectively aggregate them. Specifically, we robustly refine the database schema through schema linking using multiple prompts. Thereafter, we generate various candidate SQL queries based on the refined schema and diverse prompts. Finally, the candidate queries are filtered based on their confidence scores, and the optimal query is obtained through a multiple-choice selection that is presented to the LLM. When evaluated on the BIRD and Spider benchmarks, the proposed method achieved execution accuracies of 65.5\% and 89.6\%, respectively, significantly outperforming previous ICL-based methods.
Room: Suite 7 | Session: Session 2: Oral/Poster A | Sub-session: NLP Applications 1 | Tue, Jan 21 | 11:00-12:30 | Chair: Lingzi Hong | UAE: 11:00-11:15
Paper ID: 101 | Poster | Virtual
Ethical Reviewing:NLP and LLM Applications
NLP and LLM Applications
InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery
He Cao, Zijing Liu, Xingyu Lu, Yuan Yao and Yu Li
He CAO | He CAO
The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialists, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 102 | Poster | Virtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
Ambiguity-aware Multi-level Incongruity Fusion Network for Multi-Modal Sarcasm Detection
Kuntao Li, Yifan Chen, Qiaofeng Wu, Weixing Mai, Fenghuan Li and Yun Xue
Kuntao Li | Kuntao Li
Multi-modal sarcasm detection aims to identify whether a given image-text pair is sarcastic. The pivotal factor of the task lies in accurately capturing incongruities from different modalities. Although existing studies have achieved impressive success, they primarily committed to fusing the textual and visual information to establish cross-modal correlations, overlooking the significance of original unimodal incongruity information at the text-level and image-level. Furthermore, the utilized fusion strategies of cross-modal information neglected the effect of inherent ambiguity within text and image modalities on multimodal fusion. To overcome these limitations, we propose a novel Ambiguity-aware Multi-level Incongruity Fusion Network (AMIF) for multi-modal sarcasm detection. Our method involves a multi-level incongruity learning module to capture the incongruity information simultaneously at the text-level, image-level and cross-modal-level. Additionally, an ambiguity-based fusion module is developed to dynamically learn reasonable weights and interpretably aggregate incongruity features from different levels. Comprehensive experiments conducted on a publicly available dataset demonstrate the superiority of our proposed model over state-of-the-art methods.
Room: Gather | Session: Gather Session 2 | Whova: Gather | Tue, Jan 28 | 07:00-08:30 | UAE: 7:00-8:30 | Added to Whova: Yes
Paper ID: 103 | Poster | In-person
Language Resources and Evaluation:Ethical Reviewing
AdminSet and AdminBERT: a Dataset and a Pre-trained Language Model to Explore the Unstructured Maze of French Administrative Documents
Thomas Sebbag, Solen Quiniou, Nicolas Stucky and Emmanuel Morin
Thomas Sebbag
Thomas Sebbag
In recent years, Pre-trained Language Models (PLMs) have been widely used to analyze various documents, playing a crucial role in Natural Language Processing (NLP). However, administrative texts have rarely been used in information extraction tasks, even though this resource is available as open data in many countries. Most of these texts contain many specific domain terms. Moreover, especially in France, they are unstructured because many administrations produce them without a standardized framework. Due to this fact, current language models do not process these documents correctly. In this paper, we propose AdminBERT, the first French pre-trained language model for the administrative domain. Since interesting information in such texts corresponds to named entities and the relations between them, we compare this PLM with general domain language models, fine-tuned on the Named Entity Recognition (NER) task applied to administrative texts, as well as to a Large Language Model (LLM) and to a language model with an architecture different from the BERT one. We show that taking advantage of a PLM for French administrative data increases performance on these texts in both the administrative and general domains. We also release AdminBERT as well as AdminSet, the pre-training corpus of administrative texts in French, and the subset AdminSet-NER, the first NER dataset consisting exclusively of administrative texts in French.
Room: Atrium | Session: Session 4: Oral/Poster C | Whova: Poster | Tue, Jan 21 | 16:00-17:30 | UAE: 16:00-17:30
Paper ID: 105 | Poster | In-person
Language Resources and Evaluation:Ethical Reviewing
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Thibaut Thonet, Laurent Besacier and Jos Rozen
laurent besacier
laurent besacier
Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending the models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, we propose a new benchmark for long-context LLMs focused on a practical meeting assistant scenario in which the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, ELITR-Bench, augments the existing ELITR corpus by adding 271 manually crafted questions with their ground-truth answers, as well as noisy versions of meeting transcripts altered to target different Word Error Rate levels. Our experiments with 12 long-context LLMs on ELITR-Bench confirm the progress made across successive generations of both proprietary and open models, and point out their discrepancies in terms of robustness to transcript noise. We also provide a thorough analysis of our GPT-4-based evaluation, including insights from a crowdsourcing study. Our findings indicate that while GPT-4's scores align with human judges, its ability to distinguish beyond three score levels may be limited.
Room: Atrium | Session: Session 6: Oral/Poster E | Whova: Poster | Wed, Jan 22 | 11:00-12:30 | UAE: 11:00-12:30
Paper ID: 106 | Oral | In-person
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Positive Text Reframing under Multi-strategy Optimization
Shutong Jia, Biwei Cao, Qingqing Gao, Jiuxin Cao and Bo Liu
Biwei Cao | Biwei Cao
Differing from sentiment transfer, positive reframing seeks to substitute negative perspectives with positive expressions while preserving the original meaning. With the emergence of pre-trained language models (PLMs), it is possible to achieve acceptable results by fine-tuning PLMs. Nevertheless, generating fluent, diverse and task-constrained reframing text remains a significant challenge. To tackle this issue, a **m**ulti-**s**trategy **o**ptimization **f**ramework (MSOF) is proposed in this paper. Starting from the objective of positive reframing, we first design positive sentiment reward and content preservation reward to encourage the model to transform the negative expressions of the original text while ensuring the integrity and consistency of the semantics. Then, different decoding optimization approaches are introduced to improve the quality of text generation. Finally, based on the modeling formula of positive reframing, we propose a multi-dimensional re-ranking method that further selects candidate sentences from three dimensions: strategy consistency, text similarity and fluency. Extensive experiments on two Seq2Seq PLMs, BART and T5, demonstrate our framework achieves significant improvements on unconstrained and controlled positive reframing tasks.
Room: Hall B-C | Session: Session 3: Oral/Poster B | Sub-session: Interpretability and Explainability | Tue, Jan 21 | 14:00-15:30 | Chair: Jordan Kodner (jordan.kodner@stonybrook.edu) | UAE: 14:45-15:00
Paper ID: 114 | Poster | In-person
NLP and LLM Applications:Ethical Reviewing
RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration
Haoyu Huang, Tong Niu, Rui Yang and Luping Shi
Haoyu Huang
Haoyu Huang
Recently, many studies have focused on incorporating large language models (LLMs) into educational dialogues. In particular, within liberal arts dialogues, educators must balance Humanized communication, Teaching expertise, and Safety-ethics (HTS), besides the subject knowledge itself. However, because collecting massive amounts of HTS-compliant teaching dialogues from the real world as a training corpus is expensive, the outputs of existing LLMs in teaching dialogues fall short of human standards. To address this, we design a Retrieval-augmented Multi-role Multi-expert Collaboration (RAM2C) framework to automatically generate such dialogue data. Specifically, we first establish HTS-guided knowledge bases, encompassing domain knowledge in three areas: teaching skills, psychology, and safety ethics. Then, RAM2C organizes LLMs, which are retrieval-augmented by the above different knowledge bases, into multi-expert groups with distinct roles to generate the HTS-compliant educational dialogue dataset. We then fine-tune the LLMs using this dataset. Empirical evaluations indicate that RAM2C-empowered LLMs excel in Chinese reading teaching, offering more personalized and ethically safe teaching responses, demonstrating RAM2C's practicality and high quality. We release the experiments at https://github.com/ram2c/ram2c.
Room: Atrium | Session: Session 2: Oral/Poster A | Whova: Poster | Tue, Jan 21 | 11:00-12:30 | UAE: 11:00-12:30
Paper ID: 119 | Poster | In-person
Information Extraction:Ethical Reviewing
SURE: Mutually Visible Objects and Self-generated Candidate Labels For Relation Extraction
Yuxuan Feng, Qian Chen, Qianyou Wu, Xin GUO and Suge Wang
Yuxuan Feng
Yuxuan Feng
Joint relation extraction models effectively mitigate the error propagation problem inherently present in pipeline models. Nevertheless, joint models face challenges including high computational complexity, complex network architectures, difficult parameter tuning, and notably, limited interpretability. In contrast, recent advances in pipeline relation extraction models (PURE, PL-Marker) have attracted considerable attention due to their lightweight design and high extraction accuracy. A key advancement is the introduction of a marker mechanism, which enhances relation extraction (RE) process by highlighting entities. However, these models primarily focus on generating correct labels. In doing so, they neglect the label selection process. Moreover, they fail to adequately capture the intricate interactions between entity pairs. To overcome these limitations, we develop a Candidate Label Markers (CLMs) mechanism that prioritizes strategic label selection over simple label generation. Furthermore, we facilitate interactions among diverse relation pairs, enabling the identification of more intricate relational patterns. Experimental results show that we achieve a new SOTA performance. Specifically, based on the same Named Entity Recognition (NER) results as theirs, we improve the SOTA methods by 2.5%, 1.9%, 1.2% in terms of strict F1 scores on SciERC, ACE05 and ACE04.
Room: Atrium | Session: Session 2: Oral/Poster A | Whova: Poster | Tue, Jan 21 | 11:00-12:30 | UAE: 11:00-12:30
Paper ID: 121 | Poster | In-person
Ethical Reviewing:Multilinguality and Machine Translation
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
Yihong Liu, Chunlan Ma, Haotian Ye and Hinrich Schütze
Yihong Liu | Yihong Liu
Transliterating related languages that use different scripts into a common script is effective for improving crosslingual transfer in downstream tasks.
However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs).
This is undesirable because it requires a large computation budget.
A more promising way is to make full use of available mPLMs.
To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI).
TransMI can create strong baselines for data that is transliterated into a common script by exploiting an existing mPLM and its tokenizer without any training.
TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords.
We apply TransMI to three strong recent mPLMs.
Our experiments demonstrate that TransMI not only preserves the mPLM's ability to handle non-transliterated data, but also enables it to effectively process transliterated data, thereby facilitating crosslingual transfer across scripts.
The results show consistent improvements of 3% to 34% for different mPLMs and tasks.
We make our code and models publicly available at https://github.com/cisnlp/TransMI.
Room: Atrium | Session: Session 5: Oral/Poster D | Whova: Poster | Wed, Jan 22 | 9:00-10:30 | UAE: 9:00-10:30
Paper ID: 123 | Oral | In-person
Dialogue and Conversational Interaction:Ethical Reviewing
Two-stage Incomplete Utterance Rewriting on Editing Operation
Zhiyu Cao, Peifeng Li, Qiaoming Zhu and Yaxin Fan
Zhiyu Cao
Zhiyu Cao
Previous work on Incomplete Utterance Rewriting (IUR) has primarily focused on generating rewritten utterances based solely on dialogue context, ignoring the widespread phenomenon of coreference and ellipsis in dialogues. To address this issue, we propose a novel framework called TEO (Two-stage approach on Editing Operation) for IUR, in which the first stage generates editing operations and the second stage rewrites incomplete utterances utilizing the generated editing operations and the dialogue context. Furthermore, an adversarial perturbation strategy is proposed to mitigate cascading errors and exposure bias caused by the inconsistency between training and inference in the second stage. Experimental results on three IUR datasets show that our TEO outperforms the SOTA models significantly.
Room: Hall B-C | Session: Session 12: Oral/Poster I | Sub-session: Dialogue and Conversational Interaction 2 | Fri, Jan 24 | 10:30-12:00 | Chair: Frederic Bechet | UAE: 10:30-10:45
Paper ID: 135 | Poster | Virtual
Ethical Reviewing:Information Retrieval and Text Mining
Information Retrieval and Text Mining
QuickLLaMA: Query-aware Inference Acceleration for Large Language Models
Jingyao Li, Han Shi, Sitong Wu, Chuanyang Zheng, Zhenguo Li, Xin Jiang, Hong Xu and Jiaya Jia
Jingyao Li | Jingyao Li
The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn't require extra training and can be seamlessly integrated with any LLMs. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the $\infty$-bench. In the Needle-in-a-Haystack and BABILong task, Q-LLM improved upon the current SOTA by 7.0% and 6.1%. Our code is available at https://github.com/dvlab-research/Q-LLM.
Room: Gather | Session: Gather TBD | Whova: Gather | Date: TBD | Time: TBD | UAE: TBD
Paper ID: 139 | Poster | Virtual
Ethical Reviewing:Information Retrieval and Text Mining
Information Retrieval and Text Mining
SVD-GCL: A Noise-Augmented Hybrid Graph Contrastive Learning Framework for Recommendation
Liping Wang, Shichao Li, Hui Wang, Yuyan Gao and Mingyao Wei
Shichao Li | Shichao Li
Recently, deep graph neural networks (GNNs) have emerged as the predominant architecture for recommender systems based on collaborative filtering. Nevertheless, numerous GNN-based approaches confront challenges such as complex computations and skewed feature distributions, especially with high-dimensional, sparse, and noisy data, making it difficult to accurately capture user preferences. To tackle these issues, we introduce SVD-GCL, a streamlined graph contrastive learning recommendation model based on noise augmentation that integrates truncated singular value decomposition in the feature engineering stage. This hybrid optimization approach reduces the dimensionality and denoises the original data. Through extracting self-supervised signals and gradually adding noise to embeddings in the training phase to enrich data samples, the data sparsity is effectively alleviated. Experimental outcomes on three large public benchmark datasets illustrate that SVD-GCL effectively manages high-dimensional sparse data, remains stable in the presence of noise, and provides significant advantages in computational efficiency, recommendation performance, and robustness.
Room: Gather | Session: Gather Session 2 | Whova: Gather | Tue, Jan 28 | 07:00-08:30 | UAE: 7:00-8:30 | Added to Whova: Yes
Paper ID: 147 | Poster | Virtual
Ethical Reviewing:NLP and LLM Applications
NLP and LLM Applications
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL
Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, LinZheng Chai, Zhao Yan, Qian-Wen Zhang, di yin, Xing Sun and Zhoujun Li
Bing Wang | Bing Wang
Recent LLM-based Text-to-SQL methods usually suffer from significant performance degradation on "huge" databases and complex user questions that require multi-step reasoning. Moreover, most existing methods neglect the crucial significance of LLMs utilizing external tools and model collaboration. To address these challenges, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework. Our framework comprises a core decomposer agent for Text-to-SQL generation with few-shot chain-of-thought reasoning, accompanied by two auxiliary agents that utilize external tools or models to acquire smaller sub-databases and refine erroneous SQL queries. The decomposer agent collaborates with auxiliary agents, which are activated as needed and can be expanded to accommodate new features or tools for effective Text-to-SQL parsing. In our framework, we initially leverage GPT-4 as the strong backbone LLM for all agent tasks to determine the upper bound of our framework. We then fine-tune an open-source instruction-following model, SQL-Llama, by leveraging Code Llama 7B, to accomplish all tasks as GPT-4 does. Experiments show that SQL-Llama achieves a comparable execution accuracy of 43.94, compared to the baseline accuracy of 46.35 for vanilla GPT-4. At the time of writing, MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 149 | Poster | Virtual
Ethical Reviewing:Interpretability and Explainability
Interpretability and Explainability
Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?
Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du and Yongfeng Zhang
Mingyu Jin | Mingyu Jin
Large language models (LLMs) have shown remarkable performances across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of ``Concept Depth'' to suggest that more complex concepts are typically acquired in deeper layers. Specifically, we categorize concepts based on their level of abstraction, defining them in the order of increasing complexity within factual, emotional, and inferential tasks. We conduct extensive probing experiments using layer-wise representations across various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the three domains of tasks. Our findings reveal that models could efficiently conduct probing for simpler tasks in shallow layers, and more complex tasks typically necessitate deeper layers for accurate understanding.
Additionally, we examine how external factors, such as adding noise to the input and quantizing the model weights, might affect layer-wise representations. Our findings suggest that these factors can impede the development of a conceptual understanding of LLMs until deeper layers are explored. We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our codes are available at https://github.com/Luckfort/CD.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 155 | Poster | In-person
Ethical Reviewing:Machine Learning for CL/NLP
Knowledge Graph Entity Typing with Curriculum Contrastive Learning
Hao Wang, Minghua Nuo and Shan Jiang
Hao WangHao Wang
The Knowledge Graph Entity Typing (KGET) task aims to predict missing type annotations for entities in knowledge graphs. Most recent studies only focus on the structural information from an entity's neighborhood or on semantic information from textual representations of entities or relations. In this paper, inspired by curriculum learning and contrastive learning, we propose the CCLET model, which uses a Curriculum Contrastive Learning strategy for KGET and fuses entity-related semantic information and the structural information of the Knowledge Graph (KG) with a Pre-trained Language Model (PLM) and a graph model, respectively. Our CCLET model consists of two main parts. In the Knowledge Fusion part, we design an Enhanced-MLP architecture to fuse the text of the entity's description, related triplets, and tuples. In the Curriculum Contrastive Learning part, we define the difficulty of the curriculum by controlling the level of added noise, aiming to learn accurately, from easy to difficult, with the curriculum contrastive learning strategy. Our extensive experiments demonstrate that the CCLET model outperforms recent state-of-the-art models, verifying its effectiveness on the KGET task.
AtriumSession 4: Oral/Poster CPoster Tue, Jan 21
16:00-17:30
16:0017:30
40
160OralIn-person
Language Modeling:Ethical Reviewing
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
Zihui Wu, Haichang Gao, Jianping He and Ping Wang
Haichang Gao
Haichang Gao
Large language models (LLMs) have demonstrated remarkable capabilities, but their power comes with significant security considerations. While extensive research has been conducted on the safety of LLMs in chat mode, the security implications of their function calling feature have been largely overlooked. This paper uncovers a critical vulnerability in the function calling process of LLMs, introducing a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our empirical study, conducted on six state-of-the-art LLMs including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-pro, reveals an alarming average success rate of over 90\% for this attack. We provide a comprehensive analysis of why function calls are susceptible to such attacks and propose defensive strategies, including the use of defensive prompts. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs, contributing to the field of AI safety by identifying a previously unexplored risk, designing an effective attack method, and suggesting practical defensive measures.
Hall B-BSession 7: Oral/Poster F
Language Modeling 1
Wed, Jan 22
14:00-15:30
0:00
Djamé Seddah djame.seddah@inria.fr
14:3014:45
41
162OralIn-person
Ethical Reviewing:Language Modeling
Adapters Selector: Cross-domains and Multi-tasks LoRA Modules Integration Usage Method
Yimin Tian, Bolin Zhang, Zhiying Tu and Dianhui Chu
Bolin ZhangBolin Zhang
Parameter-Efficient Fine-Tuning (PEFT) adapts large language models (LLMs) to specific domains by updating only a small portion of the parameters. Although fine-tuning on a single task within a specific domain has demonstrated promising results, there remains limited exploration of how to effectively integrate these adapters for optimal performance. In this paper, we propose Adapters Selector (AS): a novel framework for better integrating the use of multiple adapters by training a middleman adapter to select the appropriate adapter for inference. Our approach utilizes PEFT to train a selector that determines which input content corresponds to which task in which domain, and subsequently selects the corresponding adapter. In this way, AS can execute cross-domain, multi-task inference effectively by combining a compact model with multiple LoRA modules. Our code is publicly available.
Hall B-BSession 7: Oral/Poster F
Language Modeling 1
Wed, Jan 22
14:00-15:30
0:00
Djamé Seddah djame.seddah@inria.fr
14:0014:15
42
163PosterIn-person
Ethical Reviewing:Information Extraction
XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser
Xianfu Cheng, Hang Zhang, Jian Yang, Xiang Li, Weixiao Zhou, Fei Liu, Kui Wu, Xiangyuan Guan, Tao Sun, Xianjie Wu, Tongliang Li and Zhoujun Li
Xiang LiXiang Li
In the domain of Document AI, parsing semi-structured image forms is a crucial Key Information Extraction (KIE) task. The advent of pre-trained multimodal models significantly empowers Document AI frameworks to extract key information from form documents in different formats such as PDF, Word, and images. Nonetheless, form parsing is still encumbered by notable challenges, such as subpar capabilities in multilingual parsing and diminished recall in industrial contexts rich in text and visuals. In this work, we introduce a simple but effective Multimodal and Multilingual semi-structured FORM PARSER (XFormParser), which is anchored on a comprehensive Transformer-based pre-trained language model and innovatively amalgamates semantic entity recognition (SER) and relation extraction (RE) into a unified framework. Combined with Bi-LSTM, the performance of multilingual parsing is significantly improved. Furthermore, we develop InDFormSFT, a pioneering supervised fine-tuning (SFT) industrial dataset that specifically addresses the parsing needs of forms in a variety of industrial contexts. Through rigorous testing on established benchmarks, XFormParser has demonstrated its unparalleled effectiveness and robustness. Compared to existing state-of-the-art (SOTA) models, XFormParser notably achieves up to a 1.79% F1 score improvement on RE tasks in language-specific settings. It also exhibits exceptional improvements in cross-task performance in both multilingual and zero-shot settings.
AtriumSession 2: Oral/Poster APoster Tue, Jan 21
11:00-12:30
11:0012:30
43
165PosterVirtual
Ethical Reviewing:Ethics, Bias, and Fairness
Ethics, Bias, and Fairness
Debiasing by obfuscating with 007-classifiers promotes fairness in multi-community settings
Ingroj Shrestha and Padmini Srinivasan
Ingroj Shrestha
Ingroj Shrestha
While there has been a considerable amount of research on bias mitigation algorithms, two properties, a multi-community perspective and fairness to *all* communities, have not been given sufficient attention. Focusing on these, we propose an obfuscation-based data augmentation debiasing approach. In it, we add to the training data *obfuscated* versions of *all* false positive instances, irrespective of source community. We test our approach by debiasing toxicity classifiers built using 5 neural models (a multi-layer perceptron model and masked language models) and 3 datasets in a 4-community setting. We also explore 4 different obfuscators for debiasing. Results demonstrate the merits of our approach: bias is reduced for almost all of our runs without sacrificing false positive rates or F1 scores for minority or majority communities. In contrast, the 4 state-of-the-art baselines typically make performance sacrifices (often large) while reducing bias. Crucially, we demonstrate that it is possible to debias while maintaining standards for both minority and majority communities.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
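A minimal sketch of the obfuscation-based augmentation idea described in the abstract above: obfuscated copies of false-positive training instances are added back to the training set. The character-swap obfuscator, function names, and predictor interface are illustrative assumptions, not the authors' implementation (the paper explores several different obfuscators).

```python
# Toy obfuscation-based data augmentation sketch (assumed interfaces, not the paper's code).
import random

def obfuscate(text: str, rate: float = 0.15) -> str:
    """Randomly swap adjacent characters inside words as a toy obfuscator."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def augment_with_false_positives(train_set, model_predict):
    """Append an obfuscated copy of every false-positive training instance (label 0 predicted as 1)."""
    augmented = list(train_set)
    for text, label in train_set:
        if label == 0 and model_predict(text) == 1:  # predicted toxic, actually benign
            augmented.append((obfuscate(text), 0))
    return augmented

# Toy usage with a dummy predictor that flags any text containing "damn".
train = [("that was a damn good game", 0), ("you are awful", 1)]
augmented = augment_with_false_positives(train, lambda t: 1 if "damn" in t else 0)
```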
44
176PosterVirtual
Machine Learning for CL/NLP:Ethical Reviewing
Machine Learning for CL/NLP
Graph Representation Learning in Hyperbolic Space via Dual-Masked
Rui Gong, Zuyun Jiang and Daren Zha
Rui GongRui Gong
Graph representation learning (GRL) in hyperbolic space has gradually emerged as a promising approach. Meanwhile, masking and reconstruction-based (MR-based) methods lead to state-of-the-art self-supervised graph representation. However, existing MR-based methods do not fully consider deep node and structural information. Inspired by the recent active and emerging field of self-supervised learning, we propose a novel node and edge dual-masked self-supervised graph representation learning framework in hyperbolic space, named HDM-GAE. We design a graph dual-masked module and a hyperbolic structural self-attention encoder module to mask nodes or edges and to perform node aggregation within hyperbolic space, respectively. Comprehensive experiments and ablation studies on real-world multi-category datasets demonstrate the superiority of our method in downstream tasks such as node classification and link prediction.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
45
177PosterVirtual
Information Retrieval and Text Mining:Ethical Reviewing
Information Retrieval and Text Mining
Perturbation-driven Dual Auxiliary Contrastive Learning for Collaborative Filtering Recommendation
Caihong Mu, Keyang Zhang, Jialiang Zhou and Yi Liu
Jialiang Zhou
Jialiang Zhou
Graph collaborative filtering has made great progress in recommender systems, but these methods often struggle with the data sparsity issue in real-world recommendation scenarios. To mitigate the effect of data sparsity, graph collaborative filtering incorporates contrastive learning as an auxiliary task to improve model performance. However, existing contrastive learning-based methods generally use a single data augmentation graph to construct the auxiliary contrastive learning task, which suffers from problems such as loss of key information and low robustness. To address these problems, this paper proposes Perturbation-driven Dual Auxiliary Contrastive Learning for Collaborative Filtering Recommendation (PDACL). PDACL designs structure perturbation and weight perturbation to construct two data augmentation graphs. The Structure Perturbation Augmentation (SPA) graph perturbs the topology of the user-item interaction graph, while the Weight Perturbation Augmentation (WPA) graph reconstructs the implicit feedback unweighted graph into a weighted graph similar to explicit feedback. These two data augmentation graphs are combined with the user-item interaction graph to construct the dual auxiliary contrastive learning task, which extracts self-supervised signals without losing key information and is jointly optimized together with the supervised recommendation task, alleviating the data sparsity problem and improving performance. Experimental results on multiple public datasets show that PDACL outperforms numerous benchmark models, demonstrating that the dual-perturbation data augmentation graphs in PDACL can overcome the shortcomings of a single data augmentation graph, leading to superior recommendation results. The implementation of our work can be found at https://github.com/zky77/PDACL.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
46
194OralIn-person
Ethical Reviewing:Information Retrieval and Text Mining
Enhancing Reranking for Recommendation with LLMs through User Preference Retrieval
Haobo Zhang, Qiannan Zhu and Zhicheng Dou
Haobo Zhang
Haobo Zhang
Recently, large language models (LLMs) have shown the potential to enhance recommendations due to their rich knowledge and remarkable summarization ability. However, existing LLM-powered recommendation may create redundant output, generating irrelevant information about the user's preferences on candidate items from user behavior sequences. To address this issue, we propose UR4Rec, a framework that enhances reranking for recommendation with large language models through user preference retrieval. Specifically, UR4Rec develops a small transformer-based user preference retriever towards candidate items to build a bridge between LLMs and recommendation, focusing on producing the essential knowledge from user behavior sequences through LLMs to enhance reranking for recommendation. Our experimental results on three real-world public datasets demonstrate the superiority of UR4Rec over existing baseline models.
Hall B-BSession 4: Oral/Poster C
Information extraction and retrieval 1
Tue, Jan 21
16:00-17:30
0:00Ge Shi16:4517:00
47
196PosterVirtual
Ethical Reviewing:NLP and LLM Applications
NLP and LLM Applications
SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task
Zijie Zhong, Linqing Zhong, Zhaoze Sun, Qingyun Jin, Zengchang Qin and Xiaofan Zhang
Zijie ZhongZijie Zhong
Integrating Large Language Models (LLMs) with existing Knowledge Graph (KG) databases presents a promising avenue for enhancing LLMs' efficacy and mitigating their "hallucinations". Given that most KGs reside in graph databases accessible solely through specialized query languages (e.g., Cypher), it is critical to connect LLMs with KG databases by automating the translation of natural language into Cypher queries (termed as "Text2Cypher" task). Prior efforts tried to bolster LLMs' proficiency in Cypher generation through Supervised Fine-Tuning (SFT). However, these explorations are hindered by the lack of annotated datasets of Query-Cypher pairs, resulting from the labor-intensive and domain-specific nature of such annotation. In this study, we propose SyntheT2C, a methodology for constructing a synthetic Query-Cypher pair dataset, comprising two distinct pipelines: (1) LLM-based prompting and (2) template-filling. SyntheT2C is applied to two medical KG databases, culminating in the creation of a synthetic dataset, MedT2C. Comprehensive experiments demonstrate that the MedT2C dataset effectively enhances the performance of backbone LLMs on Text2Cypher task via SFT. Both the SyntheT2C codebase and the MedT2C dataset will be released.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
48
199PosterIn-person
Interpretability and Explainability:Ethical Reviewing
Language Models Encode the Value of Numbers Linearly
Fangwei Zhu, Damai Dai and Zhifang Sui
Fangwei Zhu
Fangwei Zhu
Large language models (LLMs) have exhibited impressive competence in various tasks, but their internal mechanisms on mathematical problems are still under-explored.
In this paper, we study a fundamental question: how language models encode the value of numbers, a basic element in math.
To study the question, we construct a synthetic dataset comprising addition problems and utilize linear probes to read out input numbers from the hidden states.
Experimental results support the existence of encoded number values in LLMs on different layers, and these values can be extracted via linear probes.
Further experiments show that LLMs store their calculation results in a similar manner, and we can intervene in the output via simple vector additions, proving the causal connection between encoded numbers and language model outputs.
Our research provides evidence that LLMs encode the value of numbers linearly, offering insights for better exploring, designing, and utilizing numeric information in LLMs.
AtriumSession 4: Oral/Poster CPoster Tue, Jan 21
16:00-17:30
16:0017:30
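A minimal linear-probe sketch of the kind described in the abstract above: hidden states at one layer are regressed onto the value of the input number. GPT-2, the prompt template, and the chosen layer are stand-in assumptions, not the models or setup used by the authors.

```python
# Linear probe on hidden states for number values (illustrative setup only).
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

feats, targets = [], []
with torch.no_grad():
    for n in range(10, 200):
        ids = tok(f"{n} + 7 =", return_tensors="pt")
        hs = model(**ids).hidden_states[6]   # pick one intermediate layer
        feats.append(hs[0, -1].numpy())       # last-token representation
        targets.append(float(n))

X_tr, X_te, y_tr, y_te = train_test_split(feats, targets, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("probe R^2 on held-out numbers:", probe.score(X_te, y_te))
```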
49
201PosterVirtual
Language Resources and Evaluation:Ethical Reviewing
Language Resources and Evaluation
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models
Shu Liu, Shangqing Zhao, Chenghao Jia, Xinlin Zhuang, Zhaoguang Long, Jie Zhou, Aimin Zhou, Man Lan and Yang Chong
Shu Liu
Shu Liu
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly focusing on data-driven thinking, remain uncertain. To bridge this gap, we introduce FinDABench, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs within this context. The benchmark comprises 15,200 training instances and 8,900 test instances, all meticulously crafted by human experts. FinDABench assesses LLMs across three dimensions: 1) Core Ability, evaluating the models' ability to perform financial indicator calculation and corporate sentiment risk assessment; 2) Analytical Ability, determining the models' ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) Technical Ability, examining the models' use of technical knowledge to address real-world data analysis challenges involving analysis generation and chart visualization from multiple perspectives. We will release FinDABench and the evaluation scripts at https://github.com/xxx. FinDABench aims to provide a measure for in-depth analysis of LLM abilities and to foster the advancement of LLMs in the field of financial data analysis.
GatherGather Session 1Gather Mon, Jan 27
19:00-20:30
19:0020:30Yes
50
213PosterVirtual
Ethical Reviewing:Language Resources and Evaluation
Language Resources and Evaluation
Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding
Nguyen Binh Nguyen and Yang He
Binh-Nguyen Nguyen
Binh-Nguyen Nguyen
Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on cross-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain examples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets demonstrate the effectiveness of our method, spanning various tasks and scales while significantly reducing computational resources.
GatherGather Session 3GatherTue, Jan 28
13:00-14:30
13:0014:30Yes
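A rough sketch of the scoring step described in the SCDP abstract above, assuming TF-IDF embeddings and a Weiszfeld-style geometric median; the retention rule and function names are illustrative, not the authors' exact procedure.

```python
# TF-IDF + geometric-median distance scoring for dataset pruning (illustrative sketch).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def geometric_median(X, n_iter=100, eps=1e-8):
    """Weiszfeld's algorithm: the point minimizing the sum of Euclidean distances."""
    median = X.mean(axis=0)
    for _ in range(n_iter):
        dists = np.clip(np.linalg.norm(X - median, axis=1), eps, None)
        weights = 1.0 / dists
        new_median = (weights[:, None] * X).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < eps:
            break
        median = new_median
    return median

def score_samples(texts):
    """Score each text by its distance to the geometric median of TF-IDF embeddings."""
    X = TfidfVectorizer(max_features=5000).fit_transform(texts).toarray()
    return np.linalg.norm(X - geometric_median(X), axis=1)

# Usage: for a small dataset, keep the most distant (most diverse) half.
texts = ["cheap flights to paris", "book a hotel room", "weather in tokyo", "translate hello"]
scores = score_samples(texts)
keep = np.argsort(scores)[-len(texts) // 2:]
```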
51
214PosterIn-person
Language Resources and Evaluation:Ethical Reviewing
SLARD: A Chinese Superior Legal Article Retrieval Dataset
Zhe Chen, Pengjie Ren, Fuhui Sun, Xiaoyan Wang, Yujun Li, Siwen Zhao and Tengyi Yang
Zhe ChenZhe Chen
Retrieving superior legal articles involves identifying relevant legal articles that hold higher legal effectiveness. This process is crucial in legislative work because superior legal articles form the legal basis for drafting new laws. However, most existing legal information retrieval research focuses on retrieving legal documents, with limited research on retrieving superior legal articles. This gap restricts the digitization of legislative work. To advance research in this area, we propose SLARD: A Chinese Superior Legal Article Retrieval Dataset, which filters 2,627 queries and 9,184 candidates from over 4.3 million effective Chinese regulations, covering 32 categories such as environment, agriculture, and water resources. Each query is manually annotated, and the candidates include superior articles at both the provincial and national levels. We conducted detailed experiments and analyses on the dataset and found that existing retrieval methods struggle to achieve ideal results. The best method achieved an R@1 of only 0.4719. Additionally, we found that existing large language models (LLMs) lack prior knowledge of the content of superior legal articles. This indicates the necessity for further exploration and research in this field.
AtriumSession 6: Oral/Poster EPoster Wed, Jan 22
11:00-12:30
11:0012:30
52
223OralIn-person
Dialogue and Conversational Interaction:Ethical Reviewing
Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations
Nuo Chen, Hongguang Li, Jianhui Chang, Juhua Huang, Baoyuan Wang and Jia Li
hongguang li
hongguang li
Existing retrieval-based methods have made significant strides in maintaining long-term conversations. However, these approaches face challenges in memory database management and accurate memory retrieval, hindering their efficacy in dynamic, real-world interactions. This study introduces a novel framework, COmpressive Memory-Enhanced Dialogue sYstems (COMEDY), which eschews traditional retrieval modules and memory databases. Instead, COMEDY adopts a ``One-for-All'' approach, utilizing a single language model to manage memory generation, compression, and response generation. Central to this framework is the concept of compressive memory, which integrates session-specific summaries, user-bot dynamics, and past events into a concise memory format. To support COMEDY, we collect the largest Chinese long-term conversation dataset, Dolphin, derived from real user-chatbot interactions. Comparative evaluations demonstrate COMEDY's superiority over traditional retrieval-based methods in producing more nuanced and human-like conversational experiences.
Hall B-CSession 4: Oral/Poster C
Dialogue and Conversational Interaction 1
Tue, Jan 21
16:00-17:30
0:00Li Zhang17:1517:30
53
227PosterIn-person
Ethical Reviewing:Language Resources and Evaluation
Refined Evaluation for End-to-End Grammatical Error Correction Using an Alignment-Based Approach
Junrui Wang, Mengyang Qiu, Yang Gu, Zihao Huang and Jungyeul Park
Mengyang Qiu
Mengyang Qiu
We propose a refined alignment-based method to assess end-to-end grammatical error correction (GEC) systems, aiming to reproduce and improve results from existing evaluation tools, such as \texttt{errant}, even when applied to raw text input—reflecting real-world language learners' writing scenarios. Our approach addresses challenges arising from sentence boundary detection deviations in text preprocessing, a factor overlooked by current GEC evaluation metrics. We demonstrate its effectiveness by replicating results through a re-implementation of \texttt{errant}, utilizing \texttt{stanza} for error annotation and simulating end-to-end evaluation from raw text. Additionally, we propose a potential multilingual \texttt{errant}, presenting Chinese and Korean GEC results. Previously, Chinese and Korean \texttt{errant} were implemented independently for each language, with different annotation formats. Our approach generates consistent error annotations across languages, establishing a basis for standardized grammatical error annotation and evaluation in multilingual GEC contexts.
AtriumSession 6: Oral/Poster EPoster Wed, Jan 22
11:00-12:30
11:0012:30
54
228PosterVirtual
Ethical Reviewing:Dialogue and Conversational Interaction
Dialogue and Conversational Interaction
LLMs on interactive feature collections with implicit dynamic decision strategy
Juyeon Heo, Vihari Piratla, Kyunghyun Lee, Hyonkeun Joh and Adrian Weller
Juyeon HeoJuyeon Heo
In real-world contexts such as medical diagnosis and business consulting, effective problem-solving often requires gathering relevant information through interactions and targeted questioning to pinpoint the root cause of a problem.
However, Large Language Models (LLMs) often struggle to efficiently narrow down the search space, leading to either missing key information or asking redundant questions when guided by implicit methods like Chain-of-Thought (CoT). Some approaches employ external engineered systems to guide reasoning paths, but these methods may not fully utilize the inherent problem-solving capabilities of LLMs and often require multiple expensive API calls.
This study explores how we can implicitly guide LLMs to enhance their interactive feature collection abilities within a single prompt. Instead of employing explicit search algorithms or step-by-step external guidance, we provide high-level guidelines that allow LLMs to dynamically adjust their strategies and iteratively refine their decision-making processes independently. Evaluations on synthetic 20-Questions games and real-world scenarios, including business and medical diagnosis cases, demonstrate that LLMs guided by these strategies perform more effective interactive feature collection, asking fewer and more strategic questions and achieving better problem-solving efficiency.
GatherGather TBDGather
Gather TBD
TBDTBDTBD
55
232PosterVirtual
Information Retrieval and Text Mining:Ethical Reviewing
Information Retrieval and Text Mining
Pre-trained Semantic Interaction based Inductive Graph Neural Networks for Text Classification
Shiyu Wang, Gang Zhou, Jicang Lu, Jing Chen and Ningbo Huang
Shiyu WangShiyu Wang
Research on Text Classification (TC) based on graph neural networks (GNNs) is on the rise. Both inductive methods and transductive methods have made significant progress. For transductive methods, the semantic interaction between texts plays a crucial role in learning effective text representations. However, it is difficult to perform inductive learning while modeling interactions between texts on the graph. To give a universal solution, we propose a graph neural network based on pre-trained semantic interaction called PaSIG. Firstly, we construct a text-word heterogeneous graph and design an asymmetric structure to ensure one-way message passing from words to the test texts. Meanwhile, we use the context representation capability of the pre-trained language model to construct node features that contain classification semantic information. Afterward, we explore adaptive aggregation methods with a gated fusion mechanism. Extensive experiments on five datasets have shown the effectiveness of PaSIG, with accuracy exceeding the baselines by 2.7% on average. While achieving state-of-the-art performance, we have also taken measures of subgraph sampling and intermediate state preservation to achieve fast inference.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
56
233PosterIn-person
Natural Language Generation, Summarization and Simplification:Ethical Reviewing
From Superficial to Deep: Integrating External Knowledge for Follow-up Question Generation Using Knowledge Graph and LLM
Jianyu Liu, Yi Huang, Sheng Bi, Junlan Feng and Guilin Qi
hongguang li
Jianyu Liu
In a conversational system, dynamically generating follow-up questions based on context can help users explore information and provide a better user experience. Humans are usually able to ask questions that involve general life knowledge and demonstrate higher-order cognitive skills. However, the questions generated by existing methods are often limited to shallow contextual questions that are uninspiring and fall far short of the human level. In this paper, we propose a three-stage external knowledge-enhanced follow-up question generation method, which generates questions by identifying contextual topics, constructing a knowledge graph (KG) online, and finally combining these with a large language model to generate the final question. The model generates information-rich and exploratory follow-up questions by introducing external common-sense knowledge and performing a knowledge fusion operation. Experiments show that, compared to baseline models, our method generates questions that are more informative and closer to human questioning levels while maintaining contextual relevance.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
57
236PosterVirtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
AGCL: Aspect Graph Construction and Learning for Aspect-level Sentiment Classification
Zhongquan Jian, Daihang Wu, Shaopan Wang, Yancheng Wang, Junfeng Yao, Meihong Wang and Qingqiang Wu
Zhongquan Jian
Zhongquan Jian
Prior studies on Aspect-level Sentiment Classification (ALSC) emphasize modeling interrelationships among aspects and contexts but overlook the crucial role of aspects themselves as essential domain knowledge. To this end, we propose AGCL, a novel Aspect Graph Construction and Learning method, aimed at furnishing the model with finely tuned aspect information to bolster its task-understanding ability. AGCL's pivotal innovations reside in Aspect Graph Construction (AGC) and Aspect Graph Learning (AGL), where AGC harnesses intrinsic aspect connections to construct the domain aspect graph, and AGL iteratively updates the introduced aspect graph to enhance its domain expertise, making it more suitable for the ALSC task. Hence, this domain aspect graph can serve as a bridge connecting unseen aspects with seen aspects, thereby enhancing the model's generalization capability. Experimental results on three widely used datasets demonstrate the significance of aspect information for ALSC and highlight AGL's superiority in aspect learning, greatly surpassing state-of-the-art baselines. Code is available at https://github.com/jian-projects/agcl.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
58
237PosterVirtual
Language Modeling:Ethical Reviewing
Language Modeling
TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution
Jiuding Yang, Shengyao Lu, Weidong Guo, Xiangyang Li, Kaitong Yang, Yu Xu and Di Niu
Jiuding Yang
Jiuding Yang
The fine-tuning of Large Language Models (LLMs) specialized in code generation has seen notable advancements through the use of open-domain coding queries. Despite the successes, existing methodologies like \textit{Evol-Instruct} encounter performance limitations, impeding further enhancements in code generation tasks. This paper examines the constraints of existing prompt evolution techniques and introduces a novel approach, Instruction Fusion (IF). IF innovatively combines two distinct prompts through a hybridization process, thereby enhancing the evolution of training prompts for code LLMs. Our experimental results reveal that the proposed novel method effectively addresses the shortcomings of prior methods, significantly improving the performance of Code LLMs across five code generation benchmarks, namely HumanEval, HumanEval+, MBPP, MBPP+ and MultiPL-E, which underscore the effectiveness of Instruction Fusion in advancing the capabilities of LLMs in code generation.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30TBD
59
242WithdrawnWithdrawn
Ethical Reviewing:Machine Learning for CL/NLP
withdrawnwithdrawnwithdrawnTBAWithdrawn
60
243PosterVirtual
NLP and LLM Applications:Ethical Reviewing
NLP and LLM Applications
LLaMA-E: Empowering E-commerce Authoring with Object-Interleaved Instruction Following
Kaize Shi, Xueyao Sun, Dingxian Wang, Yinlin Fu, Guandong Xu and Qing Li
Kaize ShiKaize Shi
E-commerce authoring entails creating engaging, diverse, and targeted content to enhance preference elicitation and the retrieval experience. While Large Language Models (LLMs) have revolutionized content generation, they often fall short in e-commerce applications due to their limited memorization of domain-specific features. This paper proposes LLaMA-E, a set of unified e-commerce authoring models that address the contextual preferences of customers, sellers, and platforms, the essential objects in e-commerce operation. We design an instruction set derived from the tasks of ad generation, query-enhanced product title rewriting, product classification, purchase intent speculation, and general e-commerce Q&A. The instruction formulation ensures interleaved coverage of the presented and required object features, allowing the alignment of base models to parameterize e-commerce knowledge comprehensively. The proposed LLaMA-E models achieve state-of-the-art evaluation performance and exhibit advantages in zero-shot practical applications. To our knowledge, this is the first LLM tailored to empower authoring applications with comprehensive scenario understanding by integrating features focused on the participating objects.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30
61
249OralIn-person
Information Retrieval and Text Mining:Ethical Reviewing
LLMTreeRec: Unleashing the Power of Large Language Models for Cold-Start Recommendations
Wenlin Zhang, Chuhan Wu, Xiangyang Li, Yuhao Wang, Kuicai Dong, Yichao Wang, Xinyi Dai, Xiangyu Zhao, Huifeng Guo and Ruiming Tang
Wenlin Zhang
Wenlin Zhang
The lack of training data gives rise to the system cold-start problem in recommendation systems, making them struggle to provide effective recommendations. To address this problem, Large Language Models (LLMs) can model recommendation tasks as language analysis tasks and provide zero-shot results based on their vast open-world knowledge. However, the large scale of the item corpus poses a challenge to LLMs, leading to substantial token consumption that makes it impractical to deploy them in real-world recommendation systems. To tackle this challenge, we introduce a tree-based LLM recommendation framework, LLMTreeRec, which structures all items into an item tree to improve the efficiency of the LLM's item retrieval. LLMTreeRec achieves state-of-the-art performance under the system cold-start setting on two widely used datasets, and it is even competitive with conventional deep recommendation systems that use substantial training data. Furthermore, LLMTreeRec outperforms the baseline model in an A/B test on a Huawei industrial system. Consequently, LLMTreeRec demonstrates its effectiveness as an industry-friendly solution that has been successfully deployed online.
Hall B-BSession 5: Oral/Poster D
Information extraction and retrieval 2
Wed, Jan 229:00-10:300:00Barbara Plank9:4510:00
62
253PosterIn-person
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Collaborative Document Simplification Using Multi-Agent Systems
Dengzhao Fang, Jipeng Qiang, Xiaoye Ouyang, Yi Zhu, Yunhao Yuan and Yun Li
Jipeng Qiang
Jipeng Qiang
Research on text simplification has been ongoing for many years. However, the task of document simplification (DS) remains a significant challenge due to the need to consider complex factors such as technical terminology, metaphors, and overall coherence. In this work, we introduce a novel multi-agent framework for document simplification (\textit{AgentSimp}) based on large language models (LLMs). This framework emulates the collaborative process of a human expert team through the roles played by multiple agents, addressing the intricate demands of document simplification. We explore two communication strategies among agents (pipeline-style and synchronous) and two document reconstruction strategies (Direct and Iterative). According to both automatic evaluation metrics and human evaluation results, the documents simplified by AgentSimp are deemed to be more thoroughly simplified and more coherent across a variety of articles of different types and styles.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
63
255PosterVirtual
Ethical Reviewing:Language Modeling
Language Modeling
Distilling Rule-based Knowledge into Large Language Models
Wenkai Yang, Yankai Lin, Jie Zhou and Ji-Rong Wen
Wenkai Yang
Wenkai Yang
Large language models (LLMs) have shown incredible performance in completing various real-world tasks. The current paradigm of knowledge learning for LLMs is mainly based on learning from examples, in which LLMs learn the internal rule implicitly from a certain number of supervised examples. However, this learning paradigm may not learn complicated rules well, especially when the training examples are limited. We are inspired by the fact that humans can learn new tasks or knowledge in another way, namely by learning from rules: humans can learn new tasks or grasp new knowledge quickly and generalize well given only a detailed rule and a few optional examples. Therefore, in this paper, we aim to explore the feasibility of this new learning paradigm, which targets encoding rule-based knowledge into LLMs. We further propose rule distillation, which first uses the strong in-context abilities of LLMs to extract the knowledge from the textual rules, and then explicitly encodes the knowledge into the parameters of LLMs by learning from the above in-context signals produced inside the model. Our experiments show that making LLMs learn from rules by our method is much more efficient than example-based learning in terms of both sample size and generalization ability. Warning: This paper may contain examples with offensive content.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
64
257PosterVirtual
Ethical Reviewing:Machine Learning for CL/NLP
Machine Learning for CL/NLP
Exploring Backdoor Vulnerabilities of Chat Models
Wenkai Yang, Yunzhuo Hao and Yankai Lin
Wenkai Yang
Wenkai Yang
Recent research has shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack. The backdoored model will behave well in normal cases but exhibit malicious behaviours on inputs inserted with a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on single-turn instruction-tuned LLMs, while neglecting another realistic scenario where LLMs are fine-tuned on multi-turn conversational data to be chat models. Chat models are extensively adopted across various real-world scenarios, thus the security of chat models deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format instead increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds, and making the backdoor trigger only when all trigger scenarios have appeared in the historical conversations. Experimental results demonstrate that our method can achieve high attack success rates (e.g., over 90\% ASR on Vicuna-7B) while successfully maintaining the normal capabilities of chat models in providing helpful responses to benign user requests. Also, the backdoor cannot be easily removed by downstream re-alignment, highlighting the importance of continued research and attention to the security concerns of chat models. Warning: This paper may contain toxic examples.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
65
259OralIn-person
Ethical Reviewing:Phonology, Morphology, and Word Segmentation
Towards the Machine Translation of Scientific Neologisms
Paul Lerner and François Yvon
Paul LernerPaul Lerner
Scientific research continually discovers and invents new concepts, which are then referred to by new terms, neologisms, or neonyms in this context. As the vast majority of publications are written in English, disseminating this new knowledge to the general public often requires translating these terms. However, by definition, no parallel data exist to provide such translations. Therefore, we propose to leverage term definitions as a useful source of information for the translation process. As we discuss, Large Language Models are well suited for this task and can benefit from in-context learning with co-hyponyms and terms sharing the same derivation paradigm. These models, however, are sensitive to the superficial and morphological similarity between source and target terms. Their predictions are also impacted by subword tokenization, especially for prefixed terms.
Hall B-ASession 3: Oral/Poster B
Discourse, phonology and syntax
Tue, Jan 21
14:00-15:30
0:00Owen Rambow14:3014:45
66
260PosterVirtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
HyperIDP: Customizing Temporal Hypergraph Neural Networks for Multi-Scale Information Diffusion Prediction
Haowei Xu, Chao Gao, Xianghua Li and Zhen Wang
Haowei XuHaowei Xu
Information diffusion prediction is crucial for understanding how information spreads within social networks, addressing both macroscopic and microscopic prediction tasks. Macroscopic prediction assesses the overall impact of diffusion, while microscopic prediction focuses on identifying the next user likely to be influenced. However, few studies have focused on both scales of diffusion. This paper presents HyperIDP, a novel Hypergraph-based model designed to manage both macroscopic and microscopic Information Diffusion Prediction tasks. The model captures interactions and dynamics of cascades at the macro level with hypergraph neural networks (HGNNs) while integrating social homophily at the micro level. Considering the diverse data distributions across social media platforms, which necessitate extensive tuning of HGNN architectures, a search space is constructed to accommodate diffusion hypergraphs, with optimal architectures derived through differentiable search strategies. Additionally, cooperative-adversarial loss, inspired by multi-task learning, is introduced to ensure that the model can leverage the advantages of the shared representation when handling both tasks, while also avoiding potential conflicts. Experimental results show that the proposed model significantly outperforms baselines.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30
67
264PosterVirtual
Information Extraction:Ethical Reviewing
Information Extraction
Enhancing multi-modal Relation Extraction with Reinforcement Learning Guided Graph Diffusion Framework
Rui Yang and Rajiv Gupta
Rui YangRui Yang
With the massive growth of multi-modal information such as text, images, and other data, how to analyze and align these data has become very important. In our work, we introduce a new framework based on Reinforcement Learning Guided Graph Diffusion to address the complexity of multi-modal graphs and enhance interpretability, making the alignment of multi-modal information easier to understand. Our approach leverages pre-trained models to encode multi-modal data into scene graphs and combines them into a cross-modal graph (CMG). We design a reinforcement learning agent that filters nodes and modifies edges based on observations of the graph state, dynamically adjusting the graph structure to provide coarse-grained refinement. We then iteratively optimize edge weights and node selection to achieve fine-grained adjustment. We conduct extensive experiments on multi-modal relation extraction datasets and show that our model significantly outperforms existing multi-modal methods such as MEGA and MKGFormer. We also conduct an ablation study to demonstrate the importance of each key component, showing that performance drops significantly when any key element is removed. Our method uses reinforcement learning to better mine potential multi-modal information relevance, and its adjustments based on graph structure make it more interpretable.
GatherGather Session 3GatherTue, Jan 28
13:00-14:30
13:0014:30Yes
68
267OralIn-person
Dialogue and Conversational Interaction:Ethical Reviewing
Non-Emotion-Centric Empathetic Dialogue Generation
Yuanxiang Huangfu, Peifeng Li, Yaxin Fan and Qiaoming Zhu
Yuanxiang Huangfu
Yuanxiang Huangfu
email: hfyx0111@163.com
Previous work on empathetic response generation mainly focused on utilizing the speaker's emotions to generate responses. However, the performance of identifying fine-grained emotions is limited, introducing cascading errors into empathetic response generation. Moreover, due to the conflict between the information in the dialogue history and the recognized emotions, previous work often generated general and uninformative responses. To address the above issues, we propose a novel framework NEC (Non-Emotion-Centric empathetic dialogue generation) based on contrastive learning and context-sensitive entity and social commonsense, in which frequent replies and sentences with incorrect emotions are penalized through contrastive learning, thereby improving the empathy, diversity and informativeness of the responses. The experimental results demonstrate that our NEC enhances the quality of empathetic generation and generates more diverse responses in comparison with the state-of-the-art baselines. The code will be available at https://github.com/huangfu170/NEC-empchat
Hall B-CSession 4: Oral/Poster C
Dialogue and Conversational Interaction 1
Tue, Jan 21
16:00-17:30
0:00Li Zhang16:0016:15
69
270PosterVirtual
Ethical Reviewing:Reasoning, Question Answering, and Sentence-level Semantics
Reasoning, Question Answering, and Sentence-level Semantics
Aligning Retrieval with Reader Needs: Reader-Centered Passage Selection for Open-Domain Question Answering
Chunlei Xin, Shuheng Zhou, Xuanang Chen, Yaojie Lu, Huijia Zhu, weiqiang wang, Zhongyi Liu, Xianpei Han and Le Sun
Chunlei XinChunlei Xin
Open-Domain Question Answering (ODQA) systems often struggle with the quality of retrieved passages, which may contain conflicting information and be misaligned with the reader's needs. Existing retrieval methods aim to gather relevant passages but often fail to prioritize consistent and useful information for the reader. In this paper, we introduce a novel Reader-Centered Passage Selection (R-CPS) method, which enhances the performance of the retrieve-then-read pipeline by re-ranking and clustering passages from the reader's perspective. Our method re-ranks passages based on the reader's prediction probability distribution and clusters passages according to the predicted answers, prioritizing more useful and relevant passages to the top and reducing inconsistent information. Experiments on ODQA datasets demonstrate the effectiveness of our approach in improving the quality of evidence passages under zero-shot settings.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
70
274PosterVirtual
Ethical Reviewing:Interpretability and Explainability
Interpretability and Explainability
Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
Cheng Wang, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng and Kai-Wei Chang
Cheng Wang
Cheng Wang
The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member and non-member contexts. While previous work suggested that member contexts provide little information due to the minor distributional shift they induce, our analysis reveals that these subtle shifts can be effectively leveraged when contrasted with non-member contexts. In this paper, we propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts through contrastive decoding, amplifying subtle differences to enhance membership inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves state-of-the-art performance on the WikiMIA benchmark and is robust against various text manipulation techniques.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
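A simplified sketch of the general idea of contrasting member and non-member contexts for membership inference, loosely following the abstract above. GPT-2, the prompts, and the difference-of-log-likelihoods score are stand-ins; Con-ReCall's actual scoring may differ.

```python
# Contrast target likelihood under member vs. non-member prefix contexts (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_log_likelihood(prefix, target):
    """Average log-likelihood of `target` tokens given a `prefix` context."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    # logits at position i predict token i+1; score only the target span
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prefix_ids.shape[1] - 1
    picked = log_probs[start:, :].gather(1, ids[0, prefix_ids.shape[1]:].unsqueeze(1))
    return picked.mean().item()

def contrastive_score(target, member_ctx, non_member_ctx):
    """Higher score suggests the target's likelihood shifts more under the member context."""
    return target_log_likelihood(member_ctx, target) - target_log_likelihood(non_member_ctx, target)
```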
71
279PosterIn-person
Ethical Reviewing:Ethics, Bias, and Fairness
Citation Amnesia: On The Recency Bias of NLP and Other Academic Fields
Jan Philip Wahle, Terry Lima Ruas, Mohamed Abdalla, Bela Gipp and Saif M. Mohammad
Jan Philip Wahle
Jan Philip Wahle
This study examines the tendency to cite older work across 20 fields of study over 43 years (1980--2023). We put NLP's propensity to cite older work in the context of these 20 other fields to analyze whether NLP shows similar temporal citation patterns to them over time or whether differences can be observed. Our analysis, based on a dataset of ~240 million papers, reveals a broader scientific trend: many fields have markedly declined in citing older works (e.g., psychology, computer science).
The trend is strongest in NLP and ML research (-12.8% and -5.5% in citation age from previous peaks). Our results suggest that citing more recent works is not directly driven by the growth in publication rates (-3.4% across fields; -5.2% in humanities; -5.5% in formal sciences) --- even when controlling for an increase in the volume of papers. Our findings raise questions about the scientific community's engagement with past literature, particularly for NLP, and the potential consequences of neglecting older but relevant research. The data and a demo showcasing our results are publicly available.
AtriumSession 5: Oral/Poster DPoster Wed, Jan 229:00-10:309:0010:30
72
284PosterVirtual
Low-resourced and Less Studied Languages:Ethical Reviewing
Low-resourced and Less Studied Languages
Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation
Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu and Datao You
Yanxu MaoYanxu Mao
In recent years, text classification methods based on neural networks and pre-trained models have gained increasing attention and demonstrated excellent performance. However, these methods still have some limitations in practical applications: (1) They typically focus only on the matching similarity between sentences. However, there exists implicit high-value information both within sentences of the same class and across different classes, which is very crucial for classification tasks. (2) Existing methods such as pre-trained language models and graph-based approaches often consume substantial memory for training and text-graph construction. (3) Although some low-resource methods can achieve good performance, they often suffer from excessively long processing times. To address these challenges, we propose a low-resource and fast text classification model called LFTC. Our approach begins by constructing a compressor list for each class to fully mine the regularity information within intra-class data. We then remove redundant information irrelevant to the target classification to reduce processing time. Finally, we compute the similarity distance between text pairs for classification. We evaluate LFTC on 9 publicly available benchmark datasets, and the results demonstrate significant improvements in performance and processing time, especially under limited computational and data resources, highlighting its superior advantages.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
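A generic compression-distance classifier is sketched below only to illustrate the family of low-resource, training-free text classification methods the LFTC abstract builds on; this is not the LFTC algorithm itself, and the labelled data are toy examples.

```python
# Nearest-neighbour classification with normalized compression distance (generic illustration).
import zlib

def c(s: str) -> int:
    """Compressed length of a string in bytes."""
    return len(zlib.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = c(a), c(b), c(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query, labeled_examples):
    """Predict the label of the nearest training text under NCD."""
    best = min(labeled_examples, key=lambda pair: ncd(query, pair[0]))
    return best[1]

train = [("the team won the match", "sports"), ("stocks fell sharply today", "finance")]
print(classify("the striker scored twice and the team won", train))
```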
73
299PosterVirtual
Machine Learning for CL/NLP:Ethical Reviewing
Machine Learning for CL/NLP
Monte Carlo Tree Search Based Prompt Autogeneration for Jailbreak Attacks against LLMs
Suhuang Wu, Huimin Wang, Yutian Zhao, Xian Wu, Yefeng Zheng, Wei Li, Hui Li and Rongrong Ji
SUHUANG WU
SUHUANG WU
Jailbreak attacks craft specific prompts or append adversarial suffixes to prompts, thereby inducing language models to generate harmful or unethical content and bypassing the model's safety guardrails. With the recent blossoming of large language models (LLMs), there is a growing focus on jailbreak attacks to probe their safety. While current white-box attacks typically focus on meticulously identifying adversarial suffixes for specific models, their effectiveness and efficiency diminish when applied to different LLMs. In this paper, we propose a Monte Carlo Tree Search (MCTS) based Prompt Auto-generation (MPA) method to enhance the effectiveness and efficiency of attacks across various models. MPA automatically searches for and generates adversarial suffixes for valid jailbreak attacks. Specifically, we first identify a series of action candidates that could potentially trick LLMs into providing harmful responses. To streamline the exploration of adversarial suffixes, we design a prior confidence probability for each MCTS node. We then iteratively auto-generate adversarial prompts using the MCTS framework. Extensive experiments on multiple open-source models (like Llama, Gemma, and Mistral) and closed-source models (such as ChatGPT) show that our proposed MPA surpasses existing methods in search efficiency as well as attack effectiveness. The codes are available at https://github.com/KDEGroup/MPA.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
74
301PosterVirtual
Ethical Reviewing:Reasoning, Question Answering, and Sentence-level Semantics
Reasoning, Question Answering, and Sentence-level Semantics
LogiGraph: Logical Reasoning with Contrastive Learning and Lightweight Graph Networks
Xiang Li, Chen Shi, Yong Xu and Jun Huang
Xiang LiXiang Li
Logical reasoning is a crucial factor in machine reading comprehension (MRC) tasks. Existing methods struggle to balance semantic and explicit logical relation representations: some emphasize contextual semantics, while others pay more attention to explicit logical features. Additionally, previous methods utilize graph convolutional networks (GCN) for node updates, which still exhibit some shortcomings.
To address these challenges, in this paper, we propose a logical reasoning method with contrastive learning and lightweight graph networks (LogiGraph).
Our method focuses on a \textit{lightweight} design of the GCN, which greatly alleviates its shortcomings, and employs conjunctions and punctuation marks as two types of edges to construct a dual graph.
Besides, we combine contrastive learning with graph reasoning, altering the logical expression's content to serve as the negative sample of the original context, which enables the model to capture negative logical relationships and improves generalization ability.
We conduct extensive experiments on two public datasets, ReClor and LogiQA. Experimental results demonstrate that LogiGraph can achieve state-of-the-art performance on both datasets.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
75
304PosterVirtual
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Natural Language Generation, Summarization and Simplification
Explaining Relationships Among Research Papers
Xiangci Li and Jessica Ouyang
Xiangci LiXiangci Li
The rapid pace of research publications makes it challenging for researchers to stay up to date. There is a growing need for automatically generated, concise literature reviews to help researchers quickly identify papers relevant to their interests. Prior work over the past decade has focused on summarizing individual research papers, typically in the context of citation generation, while the relationships among multiple papers have largely been overlooked. Existing approaches primarily generate standalone citation sentences without addressing the need for expository and transition sentences to explain the relationships among multiple citations. In this work, we propose a feature-based, LLM-prompting approach to generate richer citation texts and simultaneously capture the complex relationships among multiple papers. Our expert evaluation reveals a strong correlation between human preference and integrative writing styles, indicating that readers favor high-level, abstract citations with transition sentences that weave them into a coherent narrative.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
76
305PosterIn-person
Ethical Reviewing:NLP and LLM Applications
From Generalist to Specialist: A Survey of Large Language Models for Chemistry
Yang Han, Ziping Wan, Lu Chen, Kai Yu and Xin Chen
Yang HanYang Han
Large Language Models (LLMs) have significantly transformed our daily life and established a new paradigm in natural language processing (NLP).
However, the predominant pretraining of LLMs on extensive web-based texts remains insufficient for advanced scientific discovery, particularly in chemistry.
The scarcity of specialized chemistry data, coupled with the complexity of multi-modal data such as 2D graphs, 3D structures and spectra, presents distinct challenges.
Although several studies have reviewed Pretrained Language Models (PLMs) in chemistry, there is a conspicuous absence of a systematic survey specifically focused on chemistry-oriented LLMs.
In this paper, we outline methodologies for incorporating domain-specific chemistry knowledge and multi-modal information into LLMs; we also conceptualize chemistry LLMs as agents that use chemistry tools and investigate their potential to accelerate scientific research.
Additionally, we summarize the existing benchmarks for evaluating the chemistry abilities of LLMs. Finally, we critically examine the current challenges and identify promising directions for future research.
Through this comprehensive survey, we aim to assist researchers in staying at the forefront of developments in chemistry LLMs and to inspire innovative applications in the field.
AtriumSession 2: Oral/Poster APoster Tue, Jan 21
11:00-12:30
11:0012:30
77
306PosterIn-person
Ethical Reviewing:Interpretability and Explainability
Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution
Milad Alshomary, Narutatsu Ri, Marianna Apidianaki, Ajay Patel, Smaranda Muresan and Kathleen McKeown
Milad Alshomary
Milad Alshomary
Recent state-of-the-art authorship attribution methods learn authorship representations of text in a latent, uninterpretable space, which hinders their usability in real-world applications. We propose a novel approach for interpreting learned embeddings by identifying representative points in the latent space and leveraging large language models to generate informative natural language descriptions of the writing style associated with each point. We evaluate the alignment between our interpretable and latent spaces and demonstrate superior prediction agreement over baseline methods.
Additionally, we conduct a human evaluation to assess the quality of these style descriptions and validate their utility in explaining the latent space. Finally, we show that human performance on the challenging authorship attribution task improves by +20% on average when aided with explanations from our method.
AtriumSession 4: Oral/Poster CPoster Tue, Jan 21
16:00-17:30
16:0017:30
78
310PosterVirtual
Multimodal and Grounded Language Acquisition, HRI:Ethical Reviewing
Multimodal and Grounded Language Acquisition, HRI
Read Before Grounding: Scene Knowledge Visual Grounding via Multi-step Parsing
HaiXiang Zhu, Lixian Su, ShuangMing Mao and Jing Ye
Haixiang Zhu
Haixiang Zhu
Visual grounding (VG) is an important task in vision and language that involves understanding the mutual relationship between query terms and images. However, existing VG datasets typically use simple and intuitive textual descriptions, with limited attribute and spatial information between images and text. Recently, the Scene Knowledge Visual Grounding (SK-VG) task has been introduced, which constructs VG datasets using visual knowledge and relational referential expressions. Due to the length of textual visual knowledge and the complexity of the referential relationships between entities, previous models have struggled with this task. Therefore, we propose ReadVG, a zero-shot, plug-and-play method that leverages the robust language understanding capabilities of Large Language Models (LLMs) to transform long visual knowledge texts into concise, information-dense visual descriptions. To improve the accuracy of target localisation, we employ a multi-step parsing algorithm that can progressively extract the query targets and their features from the visual knowledge and relational referencing expressions, thereby assisting multimodal models to more accurately localise the target for grounding purposes. Extensive experiments and case studies show that our approach can significantly improve the performance of multimodal grounding models.
GatherGather TBDGather
Gather TBD
TBDTBDTBD
79
314PosterIn-person
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller and Vera Schmitt
Qianli WangQianli Wang
Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
80
316PosterVirtual
Ethical Reviewing:Language Modeling
Language Modeling
BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation
Minchong Li, Feng Zhou and Xiaohui Song
Minchong Li
Minchong Li
mincoolee@gmail.com
In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-k teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.
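A minimal sketch of the top-k logits-difference idea described in the abstract, assuming PyTorch tensors of teacher and student logits; the exact BiLD formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def logits_difference_kl(lead_logits, follow_logits, k=8, tau=1.0):
    """KL between distributions over pairwise differences of the top-k logits,
    where the top-k positions are chosen by the *lead* model."""
    topk_idx = lead_logits.topk(k, dim=-1).indices              # (batch, k)
    lead_k = lead_logits.gather(-1, topk_idx)                   # (batch, k)
    follow_k = follow_logits.gather(-1, topk_idx)               # (batch, k)
    # pairwise differences encode the internal ranking of the top-k logits
    lead_diff = (lead_k.unsqueeze(-1) - lead_k.unsqueeze(-2)).flatten(1)
    follow_diff = (follow_k.unsqueeze(-1) - follow_k.unsqueeze(-2)).flatten(1)
    return F.kl_div(F.log_softmax(follow_diff / tau, dim=-1),
                    F.softmax(lead_diff / tau, dim=-1),
                    reduction="batchmean")

def bild_loss(teacher_logits, student_logits, k=8):
    # "bi-directional": one term led by the teacher's top-k, one by the student's
    t2s = logits_difference_kl(teacher_logits, student_logits, k)
    s2t = logits_difference_kl(student_logits, teacher_logits, k)
    return t2s + s2t
```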
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
81
327PosterVirtual
Ethical Reviewing:Low-resourced and Less Studied Languages
Low-resourced and Less Studied Languages
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz, Satak Kumar Dey, Ruwad Naswan, Hasnaen Adil, Khondker Salman Sayeed and Haz Sameen Shahgir
Tamzeed Mahfuz
Tamzeed Mahfuz
Each new generation of English-oriented Large Language Models (LLMs) exhibits enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low-resource languages. This prompts the question: Is there a need for LLMs dedicated to a particular low-resource language? We aim to explore this question for Bengali, a low-to-moderate resource Indo-Aryan language native to the Bengal region of South Asia.

We compare the performance of open-weight and closed-source LLMs such as LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse set of Bengali downstream tasks, including translation, summarization, paraphrasing, question-answering, and natural language inference. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent. Key challenges include inefficient tokenization of Bengali script by existing LLMs, leading to increased computational costs and potential performance degradation. Additionally, we highlight biases in machine-translated datasets commonly used for Bengali NLP tasks. We conclude that there is a significant need for a Bengali-oriented LLM, but the field currently lacks the high-quality pretraining and instruction-tuning datasets necessary to develop a highly effective model.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30
82
332PosterVirtual
Ethical Reviewing:Ethics, Bias, and Fairness
Ethics, Bias, and Fairness
Do language models practice what they preach? Examining language ideologies about gendered language reform encoded in LLMs
Julia Watson, Sophia S. Lee, Barend Beekhuizen and Suzanne Stevenson
Julia Watson
Julia Watson
We study language ideologies in text produced by LLMs through a case study on English gendered language reform (related to role nouns like congressperson/-woman/-man, and singular they). First, we find political bias: when asked to use language that is "correct" or "natural", LLMs use language most similarly to when asked to align with conservative (vs. progressive) values. This shows how LLMs' metalinguistic preferences can implicitly communicate the language ideologies of a particular political group, even in seemingly non-political contexts. Second, we find LLMs exhibit internal inconsistency: LLMs use gender-neutral variants more often when more explicit metalinguistic context is provided. This shows how the language ideologies expressed in text produced by LLMs can vary, which may be unexpected to users. We discuss the broader implications of these findings for value alignment.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
83
338PosterVirtual
NLP and LLM Applications:Ethical Reviewing
NLP and LLM Applications
T-MES: Trait-Aware Mix-of-Experts Representation Learning for Multi-trait Essay Scoring
Jiong Wang and Jie Liu
Jiong WangJiong Wang
Current research on automatic essay scoring tends to focus on evaluating the overall quality or a single trait of prompt-specific essays. However, when scoring essays in an educational context, it is essential not only to consider the overall score but also to provide feedback on various aspects of the writing. This helps students clearly identify areas for improvement, enabling them to engage in targeted practice. Although many methods have been proposed to address the scoring problem, they still suffer from insufficient learning of trait representations and overlook the diversity of, and correlations between, trait scores in the scoring process. To address this problem, we propose a novel multi-trait essay scoring method based on Trait-Aware Mix-of-Experts Representation Learning. Our method obtains trait-specific essay representations using a Mix-of-Experts scoring architecture. Furthermore, based on this scoring architecture, we propose a diversified trait-expert method to learn distinguishable expert weights. To facilitate multi-trait scoring, we introduce two trait-correlation learning strategies that learn the correlations among traits. Experimental results demonstrate the effectiveness of our method and show that, compared to existing methods, it achieves a further improvement in computational efficiency.
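A toy illustration of a trait-aware mixture-of-experts scoring head in the spirit of the abstract; the dimensions, gating scheme, and number of experts/traits are placeholder assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TraitMoEScorer(nn.Module):
    """Each trait gets its own softmax gate over a shared pool of experts,
    yielding one trait-specific representation and score per trait."""
    def __init__(self, dim=256, n_experts=4, n_traits=5):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gates = nn.Linear(dim, n_experts * n_traits)
        self.scorers = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_traits))
        self.n_experts, self.n_traits = n_experts, n_traits

    def forward(self, essay_repr):                          # (batch, dim)
        expert_out = torch.stack([e(essay_repr) for e in self.experts], dim=1)
        gate = self.gates(essay_repr).view(-1, self.n_traits, self.n_experts)
        gate = gate.softmax(dim=-1)                         # per-trait expert weights
        trait_repr = torch.einsum("bte,bed->btd", gate, expert_out)
        scores = torch.cat([s(trait_repr[:, t]) for t, s in enumerate(self.scorers)], dim=-1)
        return scores                                       # (batch, n_traits)
```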
GatherGather TBDGather
Gather TBD
TBDTBDTBD
84
344PosterVirtual
Multimodal and Grounded Language Acquisition, HRI:Ethical Reviewing
Multimodal and Grounded Language Acquisition, HRI
A Graph Interaction Framework on Relevance for Multimodal Named Entity Recognition with Multiple Images
Jiachen Zhao, Shizhou Huang and xin Lin
Jiachen Zhao
Jiachen Zhao
Posts containing multiple images offer significant research potential for Multimodal Named Entity Recognition. Previous methods determine whether images are related to the named entities in the text through similarity computation, for example using CLIP. However, this is not effective in some cases and is not conducive to task transfer, especially in multi-image scenarios. To address this issue, we propose a graph interaction framework on relevance (GIFR) for Multimodal Named Entity Recognition with multiple images. Humans can distinguish whether an image is relevant to named entities, but such capabilities are difficult to model. Therefore, we propose using reinforcement learning based on human preference to integrate this ability into the model and determine whether an image-text pair is relevant, which we refer to as relevance. To better leverage relevance, we construct a heterogeneous graph and introduce a graph transformer to enable information interaction. Experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
85
347PosterVirtual
Ethical Reviewing:Phonology, Morphology, and Word Segmentation
Phonology, Morphology, and Word Segmentation
Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation
Xuebin Wang, Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong and Yang Hou
Xuebin Wang
Xuebin Wang
Inspired by early research on exploring naturally annotated data for Chinese Word Segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to explicitly mine word boundaries from parallel speech-text data.
We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, giving pauses as candidate word boundaries.
Based on detailed analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries.
To more effectively utilize word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy.
We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2.
We have annotated about 1K sentences as the evaluation data of AISHELL2.
Experiments demonstrate the effectiveness of our proposed approach.
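A small sketch of the pause-mining step described above, operating on generic character-level alignments (character, start, end) rather than the MFA output format, whose parsing is omitted; the pause threshold is an assumption, and the paper's probability-based filtering is not reproduced.

```python
def candidate_boundaries(char_alignments, min_pause=0.05):
    """char_alignments: list of (char, start_sec, end_sec) from a forced aligner
    such as MFA (loading/parsing omitted). A silent gap longer than `min_pause`
    between consecutive characters is taken as a candidate word boundary."""
    boundaries = []
    for (c1, _, end1), (c2, start2, _) in zip(char_alignments, char_alignments[1:]):
        if start2 - end1 >= min_pause:
            boundaries.append((c1, c2, start2 - end1))  # boundary falls after c1
    return boundaries

# Example: a 120 ms pause between the 2nd and 3rd character yields one candidate.
print(candidate_boundaries([("我", 0.00, 0.18), ("们", 0.18, 0.40),
                            ("出", 0.52, 0.70), ("发", 0.70, 0.95)]))
```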
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
86
351OralIn-person
NLP and LLM Applications:Ethical Reviewing
RoBGuard: Enhancing LLMs to Assess Risk of Bias in Clinical Trial Documents
Changkai Ji, Bowen Zhao, Zhuoyao Wang, Yingwen Wang, Yuejie Zhang, Ying Cheng, Rui Feng and Xiaobo Zhang
Changkai JiChangkai Ji
Randomized Controlled Trials (RCTs) are rigorous clinical studies crucial for reliable decision-making, but their credibility can be compromised by bias. The Cochrane Risk of Bias tool (RoB 2) assesses this risk, yet manual assessments are time-consuming and labor-intensive. Previous approaches have employed Large Language Models (LLMs) to automate this process. However, they typically focus on manually crafted prompts and a restricted set of simple questions, limiting their accuracy and generalizability. Inspired by the human bias assessment process, we propose RoBGuard, a novel framework for enhancing LLMs to assess the risk of bias in RCTs. Specifically, RoBGuard integrates medical knowledge-enhanced question reformulation, multimodal document parsing, and multi-expert collaboration to ensure both completeness and accuracy. Additionally, to address the lack of suitable datasets, we introduce two new datasets: RoB-Item and RoB-Domain. Experimental results demonstrate RoBGuard's effectiveness on the RoB-Item dataset, outperforming existing methods.
Suite 7Session 6: Oral/Poster E
Applications 3
Wed, Jan 22
11:00-12:30
0:00Marco Rovera12:0012:15
87
352PosterVirtual
Ethical Reviewing:Information Extraction
Information Extraction
A Compressive Memory-based Retrieval Approach for Event Argument Extraction
Wanlong Liu, Enqi Zhang, shaohuan cheng, Dingyi Zeng, Li Zhou, Chen Zhang, Malu Zhang and Wenyu Chen
Wanlong LiuWanlong Liu
Recent works have demonstrated the effectiveness of retrieval augmentation in the Event Argument Extraction (EAE) task. However, existing retrieval-based EAE methods have two main limitations: (1) input length constraints and (2) the gap between the retriever and the inference model. These issues limit the diversity and quality of the retrieved information. In this paper, we propose a Compressive Memory-based Retrieval (CMR) mechanism for EAE, which addresses the two limitations mentioned above. Our compressive memory, designed as a dynamic matrix that effectively caches retrieved information and supports continuous updates, overcomes the limitations of input length. Additionally, after pre-loading all candidate demonstrations into the compressive memory, the model further retrieves and filters relevant information from the memory based on the input query, bridging the gap between the retriever and the inference model. Extensive experiments show that our method achieves new state-of-the-art performance on three public datasets (RAMS, WikiEvents, ACE05), significantly outperforming existing retrieval-based EAE methods.
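A generic associative-memory sketch, offered only as one plausible reading of a "dynamic matrix" that caches retrieved demonstrations without growing the input length; it is not the paper's CMR mechanism.

```python
import torch
import torch.nn.functional as F

class CompressiveMemory:
    """Fixed-size matrix memory: key/value pairs are written additively,
    then read back with a query vector."""
    def __init__(self, dim):
        self.M = torch.zeros(dim, dim)
        self.z = torch.zeros(dim)              # running normalizer

    def write(self, keys, values):             # keys, values: (n, dim)
        phi = F.elu(keys) + 1.0                # positive feature map
        self.M += phi.T @ values
        self.z += phi.sum(dim=0)

    def read(self, query):                     # query: (dim,)
        phi_q = F.elu(query) + 1.0
        return (phi_q @ self.M) / (phi_q @ self.z + 1e-6)
```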
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
88
355PosterIn-person
Ethical Reviewing:Machine Learning for CL/NLP
FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics
Yupei Du, Albert Gatt and Dong Nguyen
Yupei DuYupei Du
Despite the massive success of fine-tuning Pre-trained Language Models (PLMs), they remain susceptible to out-of-distribution input. Dataset cartography is a simple yet effective dual-model approach that improves the robustness of fine-tuned PLMs. It involves fine-tuning a model on the original training set (i.e. reference model), selecting a subset of important training instances based on the training dynamics, and fine-tuning again only on these selected examples (i.e. main model). However, this approach requires fine-tuning the same model twice, which is computationally expensive for large PLMs. In this paper, we show that 1) training dynamics are highly transferable across model sizes and pre-training methods, and that 2) fine-tuning main models using these selected training instances achieves higher training efficiency than empirical risk minimization (ERM). Building on these observations, we propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT). Compared with dataset cartography, FTFT uses more efficient reference models and aggressive early stopping. FTFT achieves robustness improvements over ERM while lowering the training cost by up to ~50%.
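A rough sketch of the dataset-cartography statistics that FTFT builds on, assuming the reference model's gold-label probabilities have already been recorded at each epoch; the selection rule shown (keeping high-variability examples) is the common cartography heuristic, not necessarily the paper's exact choice.

```python
import numpy as np

def cartography_stats(gold_probs):
    """gold_probs: (num_epochs, num_examples) probabilities the reference model
    assigns to the gold label at the end of each epoch."""
    confidence = gold_probs.mean(axis=0)      # mean gold-label probability
    variability = gold_probs.std(axis=0)      # how much it fluctuates across epochs
    return confidence, variability

def select_ambiguous(gold_probs, fraction=0.33):
    """Keep the most 'ambiguous' examples (highest variability), which cartography
    found most useful for robust fine-tuning of the main model."""
    _, variability = cartography_stats(gold_probs)
    n_keep = int(len(variability) * fraction)
    return np.argsort(-variability)[:n_keep]   # indices of selected training instances
```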
AtriumSession 4: Oral/Poster CPoster Tue, Jan 21
16:00-17:30
16:0017:30
89
357PosterIn-person
Low-resourced and Less Studied Languages:Ethical Reviewing
PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation
Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka and Masao Utiyama
Hour KaingHour Kaing
This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
90
358PosterVirtual
Reasoning, Question Answering, and Sentence-level Semantics:Ethical Reviewing
Reasoning, Question Answering, and Sentence-level Semantics
Relation Logical Reasoning and Relation-aware Entity Encoding for Temporal Knowledge Graph Reasoning
Longzhou Liu, Chenglong Xiao, Shanshan Wang and Tingwen Liu
Longzhou Liu
Longzhou Liu
Temporal Knowledge Graph Reasoning (TKGR) aims to predict future facts based on historical data. Current mainstream models primarily use embedding techniques, which predict missing facts by representing entities and relations as low-dimensional vectors. However, these models often consider only the structural information of individual entities and relations, overlooking the broader structure of the entire TKG. To address these limitations, we propose a novel model called Relation Logical Reasoning and Relation-aware Entity Encoding (RLEE), drawing inspiration from attention mechanisms and logical rule-based techniques. RLEE introduces a two-layer representation of the TKG: an entity layer and a relation layer. At the relation layer, we extract relation paths to mine potential logical correlations between different relations, learning relation embeddings through a process of relation logical reasoning. At the entity layer, we use the relation-aware attention mechanism to learn the entity embeddings specific to the predicted query relations. These learned relation and entity embeddings are then used to predict facts at future timestamps. When evaluated on five commonly used public datasets, RLEE consistently outperforms state-of-the-art baselines.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
91
359PosterIn-person
Ethical Reviewing:Reasoning, Question Answering, and Sentence-level Semantics
Awakening Augmented Generation: Learning to Awaken Internal Knowledge of Large Language Models for Question Answering
Huanxuan Liao, Shizhu He, Yao Xu, Yuanzhe Zhang, Shengping Liu, Kang Liu and Jun Zhao
Huanxuan Liao
Huanxuan Liao
Retrieval-Augmented-Generation and Generation-Augmented-Generation have been proposed to enhance the knowledge required for question answering with Large Language Models (LLMs) by leveraging richer context. However, the former relies on external resources, and both require incorporating explicit documents into the context, which increases execution costs and susceptibility to noisy data during inference. Recent works indicate that LLMs model rich knowledge, but it is often not effectively activated and awakened. Inspired by this, we propose a novel knowledge-augmented framework, Awakening-Augmented-Generation (AAG), which mimics the human ability to answer questions using only thinking and recalling to compensate for knowledge gaps, thereby awakening relevant knowledge in LLMs without relying on external resources. AAG consists of two key components for awakening richer context. Explicit awakening fine-tunes a context generator to create a synthetic, compressed document that functions as symbolic context. Implicit awakening utilizes a hypernetwork to generate adapters based on the question and synthetic document, which are inserted into LLMs to serve as parameter context. Experimental results on three datasets demonstrate that AAG exhibits significant advantages in both open-domain and closed-book settings, as well as in out-of-distribution generalization. Our code will be available at https://github.com/Xnhyacinth/IAG.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
92
368PosterVirtual
Discourse and Pragmatics:Ethical Reviewing
Discourse and Pragmatics
Dying or Departing? Euphemism Detection for Death Discourse in Historical Texts
Ali Al-Laith, Alexander Conroy, Jens Bjerring-Hansen, Bolette Pedersen, Carsten Levisen and Daniel Hershcovich
Ali Mohammed Ali Allaith
Ali Mohammed Ali Allaith
Euphemisms are a linguistic device used to soften discussions of sensitive or uncomfortable topics, with death being a prominent example. In this paper, we present a study on the detection of death-related euphemisms in historical literary texts from a corpus containing Danish and Norwegian novels from the late 19th century. We introduce an annotated dataset of euphemistic and literal references to death, including both common and rare euphemisms, ranging from well-established terms to more culturally nuanced expressions. We evaluate the performances of state-of-the-art pre-trained language models fine-tuned for euphemism detection. Our findings show that fixed, literal expressions of death became less frequent over time, while metaphorical euphemisms grew in prevalence. Additionally, euphemistic language was more common in historical novels, whereas contemporary novels tended to refer to death more literally, reflecting the rise of secularism. These results shed light on the shifting discourse on death during a period when the concept of death as final became prominent.
GatherGather Session 3GatherTue, Jan 28
13:00-14:30
13:0014:30Yes
93
369OralIn-person
Information Retrieval and Text Mining:Ethical Reviewing
ITERATE: Image-Text Enhancement, Retrieval, and Alignment for Transmodal Evolution with LLMs
Chenhan Fu, Guoming Wang, Juncheng Li, Wenqiao Zhang, Rongxing Lu and Siliang Tang
Chenhan Fu
Guoming Wang
Chenhan Fu
Guoming Wang
Inspired by human cognitive behavior, we introduce the visual modality to enhance the performance of pure text-based question-answering tasks, building on the development of multimodal models. However, obtaining corresponding images through manual annotation often entails high costs. Faced with this challenge, an intuitive strategy is to use search engines or web scraping techniques to automatically obtain relevant image information. However, the images obtained by this strategy may be of low quality and may not match the context of the original task, which may fail to improve, or even decrease, performance on downstream tasks. In this paper, we propose a novel framework named "ITERATE", aimed at retrieving and optimizing the quality of images to improve the alignment between text and images. Inspired by evolutionary algorithms in reinforcement learning and driven by the synergy of large language models (LLMs) and multimodal models, ITERATE employs a series of strategic actions such as filtering, optimizing, and retrieving to acquire higher quality images, and repeats this process over multiple generations to enhance the quality of the entire image cluster. Our experimental results on the ScienceQA, ARC-Easy, and OpenDataEval datasets also verify the effectiveness of our method, showing improvements of 3.5%, 5%, and 7%, respectively.
Hall B-BSession 5: Oral/Poster D
Information extraction and retrieval 2
Wed, Jan 229:00-10:300:00Barbara Plank10:0010:15
94
371PosterVirtual
Information Retrieval and Text Mining:Ethical Reviewing
Information Retrieval and Text Mining
Multi-Graph Co-Training for Capturing User Intent in Session-based Recommendation
zhe yang and Tiantian Liang
Tiantian Liang
Tiantian Liang
Session-based recommendation focuses on predicting the next item a user will interact with based on sequences of anonymous user sessions. A significant challenge in this field is data sparsity due to the typically short-term interactions. Most existing methods rely heavily on users' current interactions, overlooking the wealth of auxiliary information available. To address this, we propose a novel model, the Multi-Graph Co-Training model (MGCOT), which leverages not only the current session graph but also similar session graphs and a global item relation graph. This approach allows for a more comprehensive exploration of intrinsic relationships and better captures user intent from multiple views, enabling session representations to complement each other. Additionally, MGCOT employs multi-head attention mechanisms to effectively capture relevant session intent and uses contrastive learning to form accurate and robust session representations. Extensive experiments on three datasets demonstrate that MGCOT significantly enhances the performance of session-based recommendations, particularly on the Diginetica dataset, achieving improvements up to 2.00% in P@20 and 10.70% in MRR@20. Resources have been made publicly available in our GitHub repository https://github.com/liang-tian-tian/MGCOT.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
95
372PosterVirtual
Multimodal and Grounded Language Acquisition, HRI:Ethical Reviewing
Multimodal and Grounded Language Acquisition, HRI
CAST: Cross-modal Alignment Similarity Test for Vision Language Models
Gautier Dagan, Olga Loginova and Anil Batra
Gautier Dagan
Gautier Dagan
Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. This test involves asking the models to identify similarities between two scenes through text-only, image-only, or both and then assess the truthfulness of the similarities they generate. Since there is no ground-truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
96
380PosterIn-person
Ethical Reviewing:Language Modeling
Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models
Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley and Lina Yao
Chengkai Huang
Chengkai Huang
Retrieval-augmented large language models (LLMs) have been remarkably competent in various NLP tasks. However, previous works have observed that retrieval is not always helpful, especially when the LLM is already knowledgeable about the query to be answered. Motivated by this, Adaptive Retrieval-Augmented Generation (ARAG) studies retrieving only when the knowledge asked by the query is absent from the LLM. Previous ARAG works either require accessing the pre-training corpus or prompting with additional model inferences. Aiming to avoid such drawbacks, we propose to determine whether the model is knowledgeable about a query by inspecting the (contextualized) pre-trained token embeddings of LLMs. We hypothesize that such embeddings capture rich information about the model's intrinsic knowledge base, which enables an efficient way of judging the necessity to retrieve from an external corpus. Extensive experiments demonstrate our ARAG approach's superior performance across various benchmarks.
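A hypothetical sketch of deciding whether to retrieve based on the query's token embeddings; the "knowledge centroid" construction and the threshold are assumptions of this illustration, not the paper's criterion.

```python
import torch
import torch.nn.functional as F

def should_retrieve(token_embeddings, knowledge_centroids, threshold=0.45):
    """token_embeddings: (seq_len, dim) contextualized embeddings of the query.
    knowledge_centroids: (num_clusters, dim) centroids summarizing regions of the
    embedding space the model is assumed to 'know' well (an assumption of this
    sketch). Retrieve only when the query sits far from all of them."""
    q = F.normalize(token_embeddings.mean(dim=0), dim=-1)
    c = F.normalize(knowledge_centroids, dim=-1)
    familiarity = (c @ q).max().item()   # best cosine similarity to any centroid
    return familiarity < threshold       # unfamiliar query -> retrieve externally
```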
AtriumSession 6: Oral/Poster EPoster Wed, Jan 22
11:00-12:30
11:0012:30
97
386PosterIn-person
Ethical Reviewing:Lexical Semantics
Investigating the Contextualised Word Embedding Dimensions Specified for Contextual and Temporal Semantic Changes
Taichi Aida and Danushka Bollegala
Taichi AidaTaichi Aida
The sense-aware contextualised word embeddings (SCWEs) encode semantic changes of words within the contextualised word embedding (CWE) spaces.
Despite the superior performance of SCWEs on contextual/temporal semantic change detection (SCD) benchmarks, it remains unclear how these meaning changes are encoded in the embedding space.
To study this, we compare pre-trained CWEs and their fine-tuned versions on contextual and temporal semantic change benchmarks under Principal Component Analysis (PCA) and Independent Component Analysis (ICA) transformations.
Our experimental results reveal
(a) although only a small number of axes are specific to semantic changes of words in the pre-trained CWE space, this information gets distributed across all dimensions when fine-tuned, and
(b) in contrast to prior work studying the geometry of CWEs, we find that PCA better represents semantic changes than ICA within the top 10% of axes.
These findings encourage the development of more efficient SCD methods with a small number of SCD-aware dimensions.
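A brief sketch of the kind of PCA/ICA transformation applied to contextualized embeddings, using scikit-learn on stand-in data; the paper's actual SCD evaluation protocol is not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# X: (num_word_occurrences, dim) contextualized embeddings of target words;
# random data here stands in for real CWEs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))

pca = PCA(n_components=77).fit(X)                          # roughly top 10% of 768 axes
ica = FastICA(n_components=77, random_state=0, max_iter=1000).fit(X)

X_pca = pca.transform(X)   # scores along principal components
X_ica = ica.transform(X)   # scores along independent components
# A per-axis semantic-change score can then be computed, e.g. by comparing the
# distribution of scores for a word across two time periods or contexts.
```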
AtriumSession 6: Oral/Poster EPoster Wed, Jan 22
11:00-12:30
11:0012:30
98
387OralIn-person
Low-resourced and Less Studied Languages:Ethical Reviewing
Uncertainty Modelling in Under-Represented Languages with Bayesian Deep Gaussian Processes
Ubaid Azam, Imran Razzak, Shelly Vishwakarma and Shoaib Jameel
Imran Razzak
Imran Razzak
NLP models often face challenges with under-represented languages due to a lack of sufficient training data and language complexities. This can result in inaccurate predictions and a failure to capture the inherent uncertainties within these languages. This paper introduces a new method for modelling uncertainty in under-represented languages by employing deep Bayesian Gaussian Processes. We develop a novel framework that integrates prior knowledge and leverages kernel functions. This helps enable the quantification of uncertainty in predictions to overcome the data limitations in under-represented languages. The efficacy of our approach is validated through various experiments, and the results are benchmarked against existing methods to highlight the enhancements in prediction accuracy and measurement of uncertainty.
Suite 7Session 5: Oral/Poster D
Low-resource languages 1
Wed, Jan 229:00-10:300:00Maite Melero10:0010:15
99
388OralIn-person
Ethical Reviewing:Low-resourced and Less Studied Languages
Cross-lingual Text Classification Transfer: The Case of Ukrainian
Daryna Dementieva, Valeriia Khylenko and Georg Groh
Daryna Dementieva
Daryna Dementieva
Despite the extensive number of labeled datasets in the NLP text classification field, a persistent imbalance in data availability across languages remains evident. To support the fair development of NLP models, exploring the possibilities of effective knowledge transfer to new languages is crucial. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To the best of our knowledge, there is a substantial lack of Ukrainian corpora for typical text classification tasks, e.g., different types of style, harmful speech, or relationships between texts, while collecting such corpora from scratch requires considerable resources. In this work, we leverage state-of-the-art advances in NLP and explore cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test these approaches on three text classification tasks---toxicity classification, formality classification, and natural language inference (NLI)---providing the ``recipe'' for the optimal setup for each task.
Suite 7Session 5: Oral/Poster D
Low-resource languages 1
Wed, Jan 229:00-10:300:00Maite Melero9:009:15
100
390OralIn-person
Ethical Reviewing:NLP and LLM Applications
LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots
Dongge Han, Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Peter Bell and Amos Storkey
Dongge Han
Dongge Han
Large language models (LLMs) have shown significant potential for robotics applications, particularly task planning, by harnessing their language comprehension and text generation capabilities. However, in applications such as household robotics, a critical gap remains in the personalization of these models to household preferences. For example, an LLM planner may find it challenging to perform tasks that require personalization, such as deciding where to place mugs in a kitchen based on specific household preferences. We introduce LLM-Personalize, a novel framework designed to personalize LLM planners for household robotics. LLM-Personalize uses an LLM planner to perform iterative planning in multi-room, partially-observable household environments, utilizing a scene graph built dynamically from local observations. To personalize the LLM planner towards user preferences, our optimization pipeline integrates imitation learning and reinforced Self-Training. We evaluate LLM-Personalize on Housekeep, a challenging simulated real-world 3D benchmark for household rearrangements, demonstrating a more than 30 percent increase in success rate over existing LLM planners, showcasing significantly improved alignment with human preferences.
Suite 7Session 9: Oral/Poster G
Applications 4
Thu, Jan 239:00-10:300:00Yi Feng9:309:45