Columns: Paper ID | How Paper is being presented | How will paper be presented? | Track Theme | Track Theme for Program | Title | Authors Names | Who registered Paper? | Presenter Name | Abstract | Room Location | Session | Whova Session Titles | Sub-session (ex. ML 1, ML 2, etc.) | Session Date | Session time | Talk order | Session Chair | Start Time (UAE) | End Time (UAE) | Gather Time Added to Whova
Paper ID: 1 | Poster | Virtual
Dialogue and Conversational Interaction:Ethical Reviewing
Dialogue and Conversational Interaction
PreAct: Prediction Enhances Agent's Planning Ability
Dayuan Fu, Jianzhao Huang, Siyuan Lu, Guanting Dong, Yejie Wang, Keqing He and Weiran Xu
Dayuan Fu | Dayuan Fu
Addressing the disparity between predictions and actual results can enable individuals to expand their thought processes and stimulate self-reflection, thus promoting accurate planning.
In this research, we present **PreAct**, an agent framework that integrates **pre**diction, **rea**soning, and **act**ion. By utilizing the information derived from predictions, the large language model (LLM) agent can provide a wider range and more strategically focused reasoning. This leads to more efficient actions that aid the agent in accomplishing intricate tasks. Our experimental results show that PreAct surpasses the ReAct method in completing complex tasks and that PreAct's performance can be further improved when paired with other memory or selection strategy techniques. We presented the model with varying quantities of historical predictions and discovered that these predictions consistently enhance LLM planning.
The variances in single-step reasoning between PreAct and ReAct indicate that PreAct indeed has benefits in terms of diversity and strategic orientation over ReAct.
Room: Gather | Session: Gather Session 2 | Whova: Gather | Tue, Jan 28 | 07:00-08:30 | UAE: 7:00-8:30
Paper ID: 7 | Poster | In-person
Language Resources and Evaluation:Ethical Reviewing
The PRECOM-SM Corpus: Gambling in Spanish Social Media
Pablo Álvarez-Ojeda, María Victoria Cantero-Romero, Anastasia Semikozova and Arturo Montejo-Raez
Arturo Montejo-Ráez
Arturo Montejo-Ráez
Gambling addiction is a "silent problem" in society, especially among young people in recent years due to the easy access to betting and gambling sites on the Internet through smartphones and personal computers. As online communities in messaging apps, forums and other "teenagers gathering" sites keep growing day by day, more textual information is available for its study. This work focuses on collecting text from online Spanish-speaking communities and analysing it in order to find patterns in written language from frequent and infrequent users on the collected platforms so that an emerging gambling addiction problem can be detected. In this paper, a newly built corpus is introduced, as well as an extensive description of how it has been made. In addition, some baseline experiments on the data have been carried out, employing the generated features after the analysis of the text with different machine learning approaches like the bag of words model or deep neural network encodings.
Room: Atrium | Session: Session 4: Oral/Poster C | Whova: Poster | Tue, Jan 21 | 16:00-17:30 | UAE: 16:00-17:30
Paper ID: 12 | Poster | Virtual
Ethical Reviewing:Language Modeling
Language Modeling
How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities
Jerry Huang | Jerry Huang | Jerry Huang
Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous downstream use cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of models that are purported to support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.
Room: Gather | Session: Gather Session 3 | Whova: Gather | Tue, Jan 28 | 13:00-14:30 | UAE: 13:00-14:30 | Added to Whova: Yes
Paper ID: 16 | Poster | Virtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis
Kaiwei Sun and Mi Tian
Mi Tian | Mi Tian
Multimodal Sentiment Analysis (MSA) aims to identify human attitudes from diverse modalities such as visual, audio and text modalities. Recent studies suggest that the text modality tends to be the most effective, which has encouraged models to consider text as its core modality. However, previous methods primarily concentrate on projecting modalities other than text into a space close to the text modality and learning an identical representation, which does not fully make use of the auxiliary information provided by audio and visual modalities. In this paper, we propose a framework, Sequential Fusion of Text-close and Text-far Representations (SFTTR), aiming to refine multimodal representations from multimodal data which should contain both representations close to and far from the text modality. Specifically, we employ contrastive learning to sufficiently explore the information similarities and differences between text and audio/visual modalities. Moreover, to fuse the extracted representations more effectively, we design a sequential cross-modal encoder to sequentially fuse representations that are close to and far from the text modality.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 22 | Poster | Virtual
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Natural Language Generation, Summarization and Simplification
PoemBERT: A Dynamic Masking Content and Ratio Based Semantic Language Model For Chinese Poem Generation
Chihan Huang and Xiaobo Shen
Chihan Huang
Chihan Huang
Ancient Chinese poetry stands as a crucial treasure in Chinese culture. To address the absence of pre-trained models for ancient poetry, we introduced PoemBERT, a BERT-based model utilizing a corpus of classical Chinese poetry. Recognizing the unique emotional depth and linguistic precision of poetry, we incorporated sentiment and pinyin embeddings into the model, enhancing its sensitivity to emotional information and addressing challenges posed by the phenomenon of multiple pronunciations for the same Chinese character. Additionally, we proposed Character Importance-based masking and dynamic masking strategies, significantly augmenting the model's capability to extract imagery-related features and handle poetry-specific information. Fine-tuning our PoemBERT model on various downstream tasks, including poem generation and sentiment classification, resulted in state-of-the-art performance in both automatic and manual evaluations. We provided explanations for the selection of the dynamic masking rate strategy and proposed a solution to the issue of a small dataset size.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 29 | Poster | In-person
Sentiment Analysis, Opinion and Argument Mining:Ethical Reviewing
CDA^2: Counterfactual Diffusion Augmentation for Cross-Domain Adaptation in Low-Resource Sentiment Analysis
Dancheng Xin, Kaiqi Zhao, Jingyun Sun and Yang Li
Yang Li
Yang Li
Domain adaptation is widely employed in cross-domain sentiment analysis, enabling the transfer of models from label-rich source domains to a target domain with fewer or no labels. However, concerns have been raised regarding their robustness and sensitivity to data distribution shift, particularly when encountering significant disparities in data distribution between the different domains. To tackle this problem, we introduce CDA^2, a framework for cross-domain adaptation in low-resource sentiment analysis, which utilizes counterfactual diffusion augmentation. Specifically, it employs samples derived from domain-relevant word substitutions in source domain samples to guide the diffusion model for generating high-quality counterfactual target domain samples. We adopt a soft absorbing state and MMD loss during the training stage, and use advanced ODE solvers to expedite the sampling process. Our experiments demonstrate that CDA^2 generates high-quality target samples and achieves state-of-the-art performance in cross-domain sentiment analysis.
Room: Atrium | Session: Session 6: Oral/Poster E | Whova: Poster | Wed, Jan 22 | 11:00-12:30 | UAE: 11:00-12:30
Paper ID: 35 | Poster | In-person
Ethical Reviewing:Language Resources and Evaluation
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li and Jing Ma
Ziyang Luo | Ziyang Luo
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our benchmark is available at https://github.com/CodeLLM-Research/CodeJudge-Eval .
Room: Atrium | Session: Session 4: Oral/Poster C | Whova: Poster | Tue, Jan 21 | 16:00-17:30 | UAE: 16:00-17:30
Paper ID: 39 | Oral | In-person
Information Extraction:Ethical Reviewing
Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching
Tianshu Wang, Xiaoyang Chen, Hongyu Lin, Xuanang Chen, Xianpei Han, Le Sun, Hao Wang and Zhenyu Zeng
Tianshu Wang
Tianshu Wang
Entity matching (EM) is a critical step in entity resolution (ER). Recently, entity matching based on large language models (LLMs) has shown great promise. However, current LLM-based entity matching approaches typically follow a binary matching paradigm that ignores the global consistency among record relationships. In this paper, we investigate various methodologies for LLM-based entity matching that incorporate record interactions from different perspectives. Specifically, we comprehensively compare three representative strategies: matching, comparing, and selecting, and analyze their respective advantages and challenges in diverse scenarios. Based on our findings, we further design a compound entity matching framework (ComEM) that leverages the composition of multiple strategies and LLMs. ComEM benefits from the advantages of different sides and achieves improvements in both effectiveness and efficiency. Experimental results on 8 ER datasets and 10 LLMs verify the superiority of incorporating record interactions through the selecting strategy, as well as the further cost-effectiveness brought by ComEM.
Room: Hall B-B | Session: Session 4: Oral/Poster C | Sub-session: Information extraction and retrieval 1 | Tue, Jan 21 | 16:00-17:30 | Chair: Ge Shi | UAE: 16:15-16:30
Paper ID: 40 | Poster | Virtual
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Natural Language Generation, Summarization and Simplification
InstructGEC: Enhancing Unsupervised Grammatical Error Correction with Instruction Tuning
Jiayi Deng, Chen Chen, Chunyan Hou and Xiaojie Yuan
Jiayi Deng
Jiayi Deng
Recent works have proposed methods of generating synthetic data automatically for unsupervised Grammatical Error Correction (GEC). Although a large amount of synthetic data is generated at a low cost, it is unrealistic and of poor quality. The copying phenomenon of synthetic data prevents GEC models from learning the semantic knowledge of contextual language. In this paper, we design an instruction format and use the masking strategy in both an erroneous sentence and the corresponding instruction consistently to alleviate the impact of the copy phenomenon. We also propose a novel approach, InstructGEC, which integrates the knowledge of grammatical detection into GEC models with instruction tuning to address the low-quality issue. Experiments are conducted on English and Chinese GEC datasets and results demonstrate that our method outperforms state-of-the-art unsupervised GEC methods.
Room: Gather | Session: Gather Session 2 | Whova: Gather | Tue, Jan 28 | 07:00-08:30 | UAE: 7:00-8:30 | Added to Whova: Yes
Paper ID: 41 | Oral | In-person
Ethical Reviewing:Dialogue and Conversational Interaction
Sibyl: Empowering Empathetic Dialogue Generation in Large Language Models via Sensible and Visionary Commonsense Inference
Lanrui Wang, Jiangnan Li, Chenxu Yang, Zheng Lin, Hongyin Tang, Huan Liu, Yanan Cao, Jingang Wang and Weiping Wang
Lanrui Wang | Lanrui Wang
Recently, there has been a heightened interest in building chatbots based on Large Language Models (LLMs) to emulate human-like qualities in multi-turn conversations. Despite having access to commonsense knowledge to better understand the psychological aspects and causality of dialogue context, even these powerful LLMs struggle to achieve the goals of empathy and emotional support. Current commonsense knowledge derived from dialogue contexts is inherently limited and often fails to adequately anticipate the future course of a dialogue. This lack of foresight can mislead LLMs and hinder their ability to provide effective support. In response to this challenge, we present an innovative framework named Sensible and Visionary Commonsense Knowledge (Sibyl). Designed to concentrate on the immediately succeeding dialogue, this paradigm equips LLMs with the capability to uncover the implicit requirements of the conversation, aiming to elicit more empathetic responses. Experimental results demonstrate that incorporating our paradigm for acquiring commonsense knowledge into LLMs comprehensively enhances the quality of their responses.
Room: Hall B-C | Session: Session 4: Oral/Poster C | Sub-session: Dialogue and Conversational Interaction 1 | Tue, Jan 21 | 16:00-17:30 | Chair: Li Zhang | UAE: 16:30-16:45
Paper ID: 42 | Poster | Virtual
Multimodal and Grounded Language Acquisition, HRI:Ethical Reviewing
Multimodal and Grounded Language Acquisition, HRI
Noise-powered Multi-modal Knowledge Graph Representation Framework
Zhuo Chen, Yin Fang, Yichi Zhang, Lingbing Guo, Jiaoyan Chen, Jeff Z. Pan, Huajun Chen and Wen Zhang
Zhuo Chen | Zhuo Chen
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph (MMKG) representation learning framework. Such a framework is essential for embedding structured knowledge into multi-modal Large Language Models effectively, alleviating issues like knowledge misconceptions and multi-modal hallucinations. In this work, we explore the efficacy of models in accurately embedding entities within MMKGs through two pivotal tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking to robustly integrate multi-modal entity features in KGs. By incorporating specific training objectives for both MKGC and MMEA, our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility. Moreover, SNAG can not only function as a standalone model but also enhance other existing methods, providing stable performance improvements. Code and data are available at https://github.com/zjukg/SNAG.
Room: Gather | Session: Gather Session 3 | Whova: Gather | Tue, Jan 28 | 13:00-14:30 | UAE: 13:00-14:30 | Added to Whova: Yes
Paper ID: 49 | Poster | Virtual
Ethical Reviewing:Language Resources and Evaluation
Language Resources and Evaluation
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
Junjie Ye, Guanyu Li, SongYang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui and Xuanjing Huang
Junjie Ye | Junjie Ye
Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 52 | Poster | Virtual
Ethical Reviewing:Information Extraction
Information Extraction
Federated Incremental Named Entity Recognition
Zesheng Liu, Qiannan Zhu, Cuiping Li and Hong Chen
Zesheng Liu
Zesheng Liu
Federated learning-based Named Entity Recognition (FNER) has attracted widespread attention through decentralized training on local clients. However, most FNER models assume that entity types are pre-fixed, so in practical applications, local clients constantly receive new entity types without enough storage to access old entity types, resulting in severe forgetting of previously learned knowledge. In addition, new clients collecting only new entity types may join the global training of FNER irregularly, further exacerbating catastrophic forgetting. To overcome the above challenges, we propose a Forgetting-Subdued Learning (FSL) model which addresses the forgetting problem on old entity types from both the intra-client and inter-client aspects. Specifically, for the intra-client aspect, we propose a prototype-guided adaptive pseudo labeling and a prototypical relation distillation loss to surmount catastrophic forgetting of old entity types with semantic shift. Furthermore, for the inter-client aspect, we propose a task transfer detector. It can identify the arrival of new entity types that are protected by privacy and store the latest old global model for relation distillation. Qualitative experiments have shown that our model has made significant improvements compared to several baseline methods.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 54 | Poster | In-person
Natural Language Generation, Summarization and Simplification:Ethical Reviewing
Large Language Models are Good Annotators for Type-aware Data Augmentation in Grammatical Error Correction
Xinyuan Li and Yunshi Lan
Xinyuan Li | Xinyuan Li
Large Language Models (LLMs) have achieved outstanding performance across various NLP tasks.
Grammatical Error Correction (GEC) is a task aiming at automatically correcting grammatical errors in text, but it encounters a severe shortage of annotated data. Researchers have tried to make full use of the generalization capabilities of LLMs and prompt them to correct erroneous sentences, which however results in unexpected over-correction issues. In this paper, we rethink the role of LLMs in GEC tasks and propose a method, namely TypeDA, considering LLMs as the annotators for type-aware data augmentation in GEC tasks. Different from the existing data augmentation methods, our method prevents in-distribution corruption and is able to generate sentences with multi-granularity error types. Our experiments verify that our method can generally improve the GEC performance of different backbone models with only a small amount of augmented data. Further analyses verify the high consistency and diversity of the pseudo data generated via our method.
Room: Atrium | Session: Session 3: Oral/Poster B | Whova: Poster | Tue, Jan 21 | 14:00-15:30 | UAE: 14:00-15:30
Paper ID: 55 | Oral | In-person
Ethical Reviewing:Speech Recognition and Synthesis, and Spoken Language Understanding
Looks can be Deceptive: Distinguishing Repetition Disfluency from Reduplication
Arif A. Ahmad, Khyathi Gayathri Mothika and Pushpak Bhattacharyya
Arif Ahmad | Arif Ahmad
Reduplication and repetition, though similar in form, serve distinct linguistic purposes. Reduplication is a deliberate morphological process used to express grammatical, semantic, or pragmatic nuances, while repetition is often unintentional and indicative of disfluency. This paper presents the first large-scale study of reduplication and repetition in speech using computational linguistics. We introduce IndicRedRep, a new publicly available dataset containing Hindi, Telugu, and Marathi text annotated with reduplication and repetition at the word level. We evaluate transformer-based models for multi-class reduplication and repetition token classification, utilizing the Reparandum-Interregnum-Repair structure to distinguish between the two phenomena. Our models achieve macro F1 scores of up to 85.62% in Hindi, 83.95% in Telugu, and 84.82% in Marathi for reduplication-repetition classification.
Room: Hall B-A | Session: Session 10: Oral/Poster H | Sub-session: Multimodal NLP 2 | Thu, Jan 23 | 11:00-12:30 | Chair: Artem Shelmanov | UAE: 11:15-11:30
Paper ID: 56 | Oral | In-person
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Learning to Verify Summary Facts with Fine-Grained LLM Feedback
Jihwan Oh, Jeonghwan Choi, Nicole Hee-Yoen Kim, Taewon Yun and Hwanjun Song
Hwanjun Song
Jihwan Oh
email: jh.oh@kaist.ac.kr
Training automatic summary fact verifiers often faces the challenge of a lack of human-labeled data. In this paper, we explore an alternative way of leveraging Large Language Model (LLM)-generated feedback to address the inherent limitation of using human-labeled data. We introduce FineSumFact, a large-scale dataset containing fine-grained factual feedback on summaries. We employ 10 distinct LLMs for diverse summary generation and Llama-3-70B-Instruct for feedback. We utilize this dataset to fine-tune the lightweight open-source model Llama-3-8B-Instruct, optimizing resource efficiency while maintaining high performance. Our experimental results reveal that the model trained on extensive LLM-generated datasets surpasses that trained on smaller human-annotated datasets when evaluated using human-generated test sets. Fine-tuning fact verification models with LLM feedback can be more effective and cost-efficient than using human feedback. The dataset is available at https://github.com/DISL-Lab/FineSumFact.
Room: Hall B-B | Session: Session 12: Oral/Poster I | Sub-session: Natural Language Generation and Summarization 2 | Fri, Jan 24 | 10:30-12:00 | Chair: Barbara Di Eugenio | UAE: 10:30-10:45
Paper ID: 59 | Poster | Virtual
Language Modeling:Ethical Reviewing
Language Modeling
FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models
Tao Fan, Guoqiang Ma, Yan Kang, Hanlin Gu, Yuanfeng Song, Lixin Fan, Kai Chen and Qiang Yang
Tao Fan | Tao Fan
Recent research in federated large language models (LLMs) has primarily focused on enabling clients to fine-tune their locally deployed homogeneous LLMs collaboratively or on transferring knowledge from server-based LLMs to small language models (SLMs) at downstream clients. However, a significant gap remains in the simultaneous mutual enhancement of both the server's LLM and clients' SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient federated mutual knowledge transfer framework for large and small language models. This framework is designed to adaptively transfer knowledge from the server's LLM to clients' SLMs while concurrently enhancing the LLM with clients' unique domain insights. We facilitate token alignment using minimum edit distance (MinED) and then selective mutual knowledge transfer between client-side SLMs and a server-side LLM, aiming to collectively enhance their performance. Through extensive experiments across three distinct scenarios, we evaluate the effectiveness of FedMKT by utilizing diverse public LLMs and SLMs on a variety of NLP text generation tasks. Empirical results demonstrate that FedMKT simultaneously boosts the performance of both LLMs and SLMs. Our code has been contributed to the FATE open-source project and is now publicly accessible at https://github.com/FederatedAI/FATE-LLM/tree/main/python/fate_llm/algo/fedmkt.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 65 | Poster | Virtual
Sentiment Analysis, Opinion and Argument Mining:Ethical Reviewing
Sentiment Analysis, Opinion and Argument Mining
Dynamic Graph Neural ODE Network for Multi-modal Emotion Recognition in Conversation
Yuntao Shou, tao meng, wei ai and KEQIN LI
Yuntao Shou | Yuntao Shou
Multimodal emotion recognition in conversation (MERC) refers to identifying and classifying human emotional states by combining data from multiple different modalities (e.g., audio, images, text, video, etc.). Specifically, human emotional expressions are often complex and diverse, and these complex emotional expressions can be captured and understood more comprehensively through the fusion of multimodal information. Most existing graph-based multimodal emotion recognition methods can only use shallow GCNs to extract emotion features and fail to capture the temporal dependencies caused by dynamic changes in emotions. To address the above problems, we propose a Dynamic Graph Neural Ordinary Differential Equation Network (DGODE) for multimodal emotion recognition in conversation, which combines the dynamic changes of emotions to capture the temporal dependency of speakers' emotions. Technically, the key idea of DGODE is to use the graph ODE evolution network to characterize the continuous dynamics of node representations over time and capture temporal dependencies. Extensive experiments on two publicly available multimodal emotion recognition datasets demonstrate that the proposed DGODE model has superior performance compared to various baselines. Furthermore, the proposed DGODE can also alleviate the over-smoothing problem, thereby enabling the construction of a deep GCN network.
Room: Gather | Session: Gather TBD | Whova: Gather | Date: TBD | Time: TBD | UAE: TBD
Paper ID: 68 | Poster | Virtual
Ethical Reviewing:Multimodal and Grounded Language Acquisition, HRI
Multimodal and Grounded Language Acquisition, HRI
HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding
Peng Xia, Xingtong Yu, Ming Hu, Lie Ju, Zhiyong Wang, Peibo Duan and Zongyuan Ge
Lie Ju | Lie Ju
Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, revealing limitations in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting the hierarchical relationships. These efforts are constrained by their inability to perform effectively across varied granularity of categories. To tackle this issue, we propose a novel framework (**HGCLIP**) that effectively combines **CLIP** with a deeper exploitation of the **H**ierarchical class structure via **G**raph representation learning. We explore constructing the class hierarchy into a graph, with its nodes representing the textual or image features of each category. After passing through a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through the attention mechanism. Our approach demonstrates significant improvements on 11 diverse visual recognition benchmarks. Our codes are fully available at https://github.com/richard-peng-xia/HGCLIP.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30
Paper ID: 69 | Poster | Virtual
Ethical Reviewing:Information Retrieval and Text Mining
Information Retrieval and Text Mining
Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement
Chenkai Sun, Ke Yang, Revanth Gangi Reddy, Yi Fung, Hou Pong Chan, Kevin Small, ChengXiang Zhai and Heng Ji
Chenkai Sun | Chenkai Sun
The increasing demand for personalized interactions with large language models (LLMs) calls for methodologies capable of accurately and efficiently identifying user opinions and preferences. Retrieval augmentation emerges as an effective strategy, as it can accommodate a vast number of users without the costs from fine-tuning. Existing research, however, has largely focused on enhancing the retrieval stage and devoted limited exploration toward optimizing the representation of the database, a crucial aspect for tasks such as personalization. In this work, we examine the problem from a novel angle, focusing on how data can be better represented for more data-efficient retrieval in the context of LLM customization. To tackle this challenge, we introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts and collaborative refinement to effectively bridge knowledge gaps among users. In the evaluation of response prediction, Persona-DB demonstrates superior context efficiency in maintaining accuracy with a significantly reduced retrieval size, a critical advantage in scenarios with extensive histories or limited context windows. Our experiments also indicate a marked improvement of over 10% under cold-start scenarios, when users have extremely sparse data. Furthermore, our analysis reveals the increasing importance of collaborative knowledge as the retrieval capacity expands.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 72 | Oral | In-person
Language Resources and Evaluation:Ethical Reviewing
Style Over Substance: Evaluation Biases for Large Language Models
Minghao Wu and Alham Fikri Aji
Alham Fikri Aji
Alham Fikri Aji
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Ranking the relative performance of LLMs based on Elo ratings, according to human or LLM judgment, is gaining more popularity. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed, machine-generated answers. Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System (MERS). Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced evaluations, indicating the need for further investigation.
Room: Suite 7 | Session: Session 3: Oral/Poster B | Sub-session: Language Resources and Evaluation 1 | Tue, Jan 21 | 14:00-15:30 | Chair: Firoj Alam | UAE: 14:30-14:45
Paper ID: 86 | Poster | Virtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
Multimodal Aspect-Based Sentiment Analysis under Conditional Relation
Xinjing Liu, Ruifan Li, Shuqin Ye, guangwei zhang and Xiaojie WANG
Xinjing Liu | Xinjing Liu
Ruifan Li
Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect terms from text-image pairs and identify their sentiments. Previous methods are based on the premise that the image contains the objects referred by the aspects within the text. However, this condition cannot always be met, resulting in a suboptimal performance. In this paper, we propose the COnditional Relation based Sentiment Analysis framework (CORSA). Specifically, we design a conditional relation detector (CRD) to mitigate the impact of the unmet conditional image. Moreover, we design a visual object localizer (VOL) to locate the exact condition-related visual regions associated with the aspects. With CRD and VOL, our CORSA framework takes a multi-task form. In addition, to effectively learn CORSA we conduct two types of annotations. One is the conditional relation using a pretrained referring expression comprehension model; the other is the bounding boxes of visual objects by a pretrained object detection model. Experiments on our built C-MABSA dataset show that CORSA consistently outperforms existing methods. The code and data are available at https://github.com/Liuxj-Anya/CORSA.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30
Paper ID: 89 | Poster | In-person
Ethical Reviewing:Lexical Semantics
Semantic Role Labeling of NomBank Partitives
Adam Meyers, Advait Pravin Savant and John E. Ortega
John Ortega | John Ortega
This article is about Semantic Role Labeling for English partitive nouns (5%/REL of the price/ARG1; The price/ARG1 rose 5 percent/REL) in the NomBank annotated corpus. Several systems are described using traditional and transformer-based machine learning, as well as ensembling. Our highest scoring system achieves an F1 of 91.74% using "gold" parses from the Penn Treebank and 91.12% when using the Berkeley Neural parser. This research includes both classroom and experimental settings for system development.
Room: Atrium | Session: Session 3: Oral/Poster B | Whova: Poster | Tue, Jan 21 | 14:00-15:30 | UAE: 14:00-15:30
Paper ID: 99 | Oral | In-person
Ethical Reviewing:NLP and LLM Applications
MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation
Dongjun Lee, Choongwon Park, Jaehyuk Kim and Heesoo Park
Dongjun Lee
Dongjun Lee
Recent advancements in large language models (LLMs) have enabled in-context learning (ICL)-based methods that significantly outperform fine-tuning approaches for text-to-SQL tasks. However, their performance is still considerably lower than that of human experts on benchmarks that include complex schemas and queries, such as BIRD. This study considers the sensitivity of LLMs to the prompts and introduces a novel approach that leverages multiple prompts to explore a broader search space for possible answers and effectively aggregate them. Specifically, we robustly refine the database schema through schema linking using multiple prompts. Thereafter, we generate various candidate SQL queries based on the refined schema and diverse prompts. Finally, the candidate queries are filtered based on their confidence scores, and the optimal query is obtained through a multiple-choice selection that is presented to the LLM. When evaluated on the BIRD and Spider benchmarks, the proposed method achieved execution accuracies of 65.5\% and 89.6\%, respectively, significantly outperforming previous ICL-based methods.
Room: Suite 7 | Session: Session 2: Oral/Poster A | Sub-session: NLP Applications 1 | Tue, Jan 21 | 11:00-12:30 | Chair: Lingzi Hong | UAE: 11:00-11:15
Paper ID: 101 | Poster | Virtual
Ethical Reviewing:NLP and LLM Applications
NLP and LLM Applications
InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery
He Cao, Zijing Liu, Xingyu Lu, Yuan Yao and Yu Li
He CAO | He CAO
The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialists, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 102 | Poster | Virtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
Ambiguity-aware Multi-level Incongruity Fusion Network for Multi-Modal Sarcasm Detection
Kuntao Li, Yifan Chen, Qiaofeng Wu, Weixing Mai, Fenghuan Li and Yun Xue
Kuntao Li | Kuntao Li
Multi-modal sarcasm detection aims to identify whether a given image-text pair is sarcastic. The pivotal factor of the task lies in accurately capturing incongruities from different modalities. Although existing studies have achieved impressive success, they primarily committed to fusing the textual and visual information to establish cross-modal correlations, overlooking the significance of original unimodal incongruity information at the text-level and image-level. Furthermore, the utilized fusion strategies of cross-modal information neglected the effect of inherent ambiguity within text and image modalities on multimodal fusion. To overcome these limitations, we propose a novel Ambiguity-aware Multi-level Incongruity Fusion Network (AMIF) for multi-modal sarcasm detection. Our method involves a multi-level incongruity learning module to capture the incongruity information simultaneously at the text-level, image-level and cross-modal-level. Additionally, an ambiguity-based fusion module is developed to dynamically learn reasonable weights and interpretably aggregate incongruity features from different levels. Comprehensive experiments conducted on a publicly available dataset demonstrate the superiority of our proposed model over state-of-the-art methods.
Room: Gather | Session: Gather Session 2 | Whova: Gather | Tue, Jan 28 | 07:00-08:30 | UAE: 7:00-8:30 | Added to Whova: Yes
Paper ID: 103 | Poster | In-person
Language Resources and Evaluation:Ethical Reviewing
AdminSet and AdminBERT: a Dataset and a Pre-trained Language Model to Explore the Unstructured Maze of French Administrative Documents
Thomas Sebbag, Solen Quiniou, Nicolas Stucky and Emmanuel Morin
Thomas Sebbag
Thomas Sebbag
In recent years, Pre-trained Language Models (PLMs) have been widely used to analyze various documents, playing a crucial role in Natural Language Processing (NLP). However, administrative texts have rarely been used in information extraction tasks, even though this resource is available as open data in many countries. Most of these texts contain many specific domain terms. Moreover, especially in France, they are unstructured because many administrations produce them without a standardized framework. Due to this fact, current language models do not process these documents correctly. In this paper, we propose AdminBERT, the first French pre-trained language model for the administrative domain. Since interesting information in such texts corresponds to named entities and the relations between them, we compare this PLM with general domain language models, fine-tuned on the Named Entity Recognition (NER) task applied to administrative texts, as well as to a Large Language Model (LLM) and to a language model with an architecture different from the BERT one. We show that taking advantage of a PLM for French administrative data increases performance on these texts in both the administrative and general domains. We also release AdminBERT as well as AdminSet, the pre-training corpus of administrative texts in French, and the subset AdminSet-NER, the first NER dataset consisting exclusively of administrative texts in French.
Room: Atrium | Session: Session 4: Oral/Poster C | Whova: Poster | Tue, Jan 21 | 16:00-17:30 | UAE: 16:00-17:30
Paper ID: 105 | Poster | In-person
Language Resources and Evaluation:Ethical Reviewing
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Thibaut Thonet, Laurent Besacier and Jos Rozen
laurent besacier
laurent besacier
Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending the models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, we propose a new benchmark for long-context LLMs focused on a practical meeting assistant scenario in which the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, ELITR-Bench, augments the existing ELITR corpus by adding 271 manually crafted questions with their ground-truth answers, as well as noisy versions of meeting transcripts altered to target different Word Error Rate levels. Our experiments with 12 long-context LLMs on ELITR-Bench confirm the progress made across successive generations of both proprietary and open models, and point out their discrepancies in terms of robustness to transcript noise. We also provide a thorough analysis of our GPT-4-based evaluation, including insights from a crowdsourcing study. Our findings indicate that while GPT-4's scores align with human judges, its ability to distinguish beyond three score levels may be limited.
Room: Atrium | Session: Session 6: Oral/Poster E | Whova: Poster | Wed, Jan 22 | 11:00-12:30 | UAE: 11:00-12:30
Paper ID: 106 | Oral | In-person
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Positive Text Reframing under Multi-strategy Optimization
Shutong Jia, Biwei Cao, Qingqing Gao, Jiuxin Cao and Bo Liu
Biwei Cao | Biwei Cao
Differing from sentiment transfer, positive reframing seeks to substitute negative perspectives with positive expressions while preserving the original meaning. With the emergence of pre-trained language models (PLMs), it is possible to achieve acceptable results by fine-tuning PLMs. Nevertheless, generating fluent, diverse and task-constrained reframing text remains a significant challenge. To tackle this issue, a **m**ulti-**s**trategy **o**ptimization **f**ramework (MSOF) is proposed in this paper. Starting from the objective of positive reframing, we first design positive sentiment reward and content preservation reward to encourage the model to transform the negative expressions of the original text while ensuring the integrity and consistency of the semantics. Then, different decoding optimization approaches are introduced to improve the quality of text generation. Finally, based on the modeling formula of positive reframing, we propose a multi-dimensional re-ranking method that further selects candidate sentences from three dimensions: strategy consistency, text similarity and fluency. Extensive experiments on two Seq2Seq PLMs, BART and T5, demonstrate our framework achieves significant improvements on unconstrained and controlled positive reframing tasks.
Room: Hall B-C | Session: Session 3: Oral/Poster B | Sub-session: Interpretability and Explainability | Tue, Jan 21 | 14:00-15:30 | Chair: Jordan Kodner (jordan.kodner@stonybrook.edu) | UAE: 14:45-15:00
Paper ID: 114 | Poster | In-person
NLP and LLM Applications:Ethical Reviewing
RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration
Haoyu Huang, Tong Niu, Rui Yang and Luping Shi
Haoyu Huang
Haoyu Huang
Recently, many studies have focused on incorporating large language models (LLMs) into educational dialogues. In particular, within liberal arts dialogues, educators must balance Humanized communication, Teaching expertise, and Safety-ethics (HTS), besides the subject knowledge itself. However, because collecting massive amounts of HTS-compliant teaching dialogues from the real world as a training corpus is expensive, the outputs of existing LLMs in teaching dialogues fall short of human standards. To address this, we design a Retrieval-augmented Multi-role Multi-expert Collaboration (RAM2C) framework to automatically generate such dialogue data. Specifically, we first establish HTS-guided knowledge bases, encompassing domain knowledge in three areas: teaching skills, psychology, and safety ethics. Then, RAM2C organizes LLMs, which are retrieval-augmented by the above different knowledge bases, into multi-expert groups with distinct roles to generate the HTS-compliant educational dialogue dataset. We then fine-tune the LLMs using this dataset. Empirical evaluations indicate that RAM2C-empowered LLMs excel in Chinese reading teaching, offering more personalized and ethically safe teaching responses, demonstrating RAM2C's practicality and high quality. We release the experiments at https://github.com/ram2c/ram2c.
Room: Atrium | Session: Session 2: Oral/Poster A | Whova: Poster | Tue, Jan 21 | 11:00-12:30 | UAE: 11:00-12:30
Paper ID: 119 | Poster | In-person
Information Extraction:Ethical Reviewing
SURE: Mutually Visible Objects and Self-generated Candidate Labels For Relation Extraction
Yuxuan Feng, Qian Chen, Qianyou Wu, Xin GUO and Suge Wang
Yuxuan Feng
Yuxuan Feng
Joint relation extraction models effectively mitigate the error propagation problem inherently present in pipeline models. Nevertheless, joint models face challenges including high computational complexity, complex network architectures, difficult parameter tuning, and notably, limited interpretability. In contrast, recent advances in pipeline relation extraction models (PURE, PL-Marker) have attracted considerable attention due to their lightweight design and high extraction accuracy. A key advancement is the introduction of a marker mechanism, which enhances relation extraction (RE) process by highlighting entities. However, these models primarily focus on generating correct labels. In doing so, they neglect the label selection process. Moreover, they fail to adequately capture the intricate interactions between entity pairs. To overcome these limitations, we develop a Candidate Label Markers (CLMs) mechanism that prioritizes strategic label selection over simple label generation. Furthermore, we facilitate interactions among diverse relation pairs, enabling the identification of more intricate relational patterns. Experimental results show that we achieve a new SOTA performance. Specifically, based on the same Named Entity Recognition (NER) results as theirs, we improve the SOTA methods by 2.5%, 1.9%, 1.2% in terms of strict F1 scores on SciERC, ACE05 and ACE04.
Room: Atrium | Session: Session 2: Oral/Poster A | Whova: Poster | Tue, Jan 21 | 11:00-12:30 | UAE: 11:00-12:30
Paper ID: 121 | Poster | In-person
Ethical Reviewing:Multilinguality and Machine Translation
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
Yihong Liu, Chunlan Ma, Haotian Ye and Hinrich Schütze
Yihong Liu | Yihong Liu
Transliterating related languages that use different scripts into a common script is effective for improving crosslingual transfer in downstream tasks.
However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs).
This is undesirable because it requires a large computation budget.
A more promising way is to make full use of available mPLMs.
To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI).
TransMI can create strong baselines for data that is transliterated into a common script by exploiting an existing mPLM and its tokenizer without any training.
TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords.
We apply TransMI to three strong recent mPLMs.
Our experiments demonstrate that TransMI not only preserves the mPLM's ability to handle non-transliterated data, but also enables it to effectively process transliterated data, thereby facilitating crosslingual transfer across scripts.
The results show consistent improvements of 3% to 34% for different mPLMs and tasks.
We make our code and models publicly available at https://github.com/cisnlp/TransMI.
Room: Atrium | Session: Session 5: Oral/Poster D | Whova: Poster | Wed, Jan 22 | 9:00-10:30 | UAE: 9:00-10:30
Paper ID: 123 | Oral | In-person
Dialogue and Conversational Interaction:Ethical Reviewing
Two-stage Incomplete Utterance Rewriting on Editing Operation
Zhiyu Cao, Peifeng Li, Qiaoming Zhu and Yaxin Fan
Zhiyu Cao
Zhiyu Cao
Previous work on Incomplete Utterance Rewriting (IUR) has primarily focused on generating rewritten utterances based solely on dialogue context, ignoring the widespread phenomenon of coreference and ellipsis in dialogues. To address this issue, we propose a novel framework called TEO (Two-stage approach on Editing Operation) for IUR, in which the first stage generates editing operations and the second stage rewrites incomplete utterances utilizing the generated editing operations and the dialogue context. Furthermore, an adversarial perturbation strategy is proposed to mitigate cascading errors and exposure bias caused by the inconsistency between training and inference in the second stage. Experimental results on three IUR datasets show that our TEO outperforms the SOTA models significantly.
Room: Hall B-C | Session: Session 12: Oral/Poster I | Sub-session: Dialogue and Conversational Interaction 2 | Fri, Jan 24 | 10:30-12:00 | Chair: Frederic Bechet | UAE: 10:30-10:45
Paper ID: 135 | Poster | Virtual
Ethical Reviewing:Information Retrieval and Text Mining
Information Retrieval and Text Mining
QuickLLaMA: Query-aware Inference Acceleration for Large Language Models
Jingyao Li, Han Shi, Sitong Wu, Chuanyang Zheng, Zhenguo Li, Xin Jiang, Hong Xu and Jiaya Jia
Jingyao Li | Jingyao Li
The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn't require extra training and can be seamlessly integrated with any LLMs. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the $\infty$-bench. In the Needle-in-a-Haystack and BABILong task, Q-LLM improved upon the current SOTA by 7.0% and 6.1%. Our code is available at https://github.com/dvlab-research/Q-LLM.
Room: Gather | Session: Gather TBD | Whova: Gather | Date: TBD | Time: TBD | UAE: TBD
Paper ID: 139 | Poster | Virtual
Ethical Reviewing:Information Retrieval and Text Mining
Information Retrieval and Text Mining
SVD-GCL: A Noise-Augmented Hybrid Graph Contrastive Learning Framework for Recommendation
Liping Wang, Shichao Li, Hui Wang, Yuyan Gao and Mingyao Wei
Shichao Li | Shichao Li
Recently, deep graph neural networks (GNNs) have emerged as the predominant architecture for recommender systems based on collaborative filtering. Nevertheless, numerous GNN-based approaches confront challenges such as complex computations and skewed feature distributions, especially with high-dimensional, sparse, and noisy data, making it difficult to accurately capture user preferences. To tackle these issues, we introduce SVD-GCL, a streamlined graph contrastive learning recommendation model based on noise augmentation that integrates truncated singular value decomposition in the feature engineering stage. This hybrid optimization approach reduces the dimensionality and denoises the original data. Through extracting self-supervised signals and gradually adding noise to embeddings in the training phase to enrich data samples, the data sparsity is effectively alleviated. Experimental outcomes on three large public benchmark datasets illustrate that SVD-GCL effectively manages high-dimensional sparse data, remains stable in the presence of noise, and provides significant advantages in computational efficiency, recommendation performance, and robustness.
Room: Gather | Session: Gather Session 2 | Whova: Gather | Tue, Jan 28 | 07:00-08:30 | UAE: 7:00-8:30 | Added to Whova: Yes
Paper ID: 147 | Poster | Virtual
Ethical Reviewing:NLP and LLM Applications
NLP and LLM Applications
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL
Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, LinZheng Chai, Zhao Yan, Qian-Wen Zhang, di yin, Xing Sun and Zhoujun Li
Bing Wang | Bing Wang
Recent LLM-based Text-to-SQL methods usually suffer from significant performance degradation on "huge" databases and complex user questions that require multi-step reasoning. Moreover, most existing methods neglect the crucial significance of LLMs utilizing external tools and model collaboration. To address these challenges, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework. Our framework comprises a core decomposer agent for Text-to-SQL generation with few-shot chain-of-thought reasoning, accompanied by two auxiliary agents that utilize external tools or models to acquire smaller sub-databases and refine erroneous SQL queries. The decomposer agent collaborates with auxiliary agents, which are activated as needed and can be expanded to accommodate new features or tools for effective Text-to-SQL parsing. In our framework, we initially leverage GPT-4 as the strong backbone LLM for all agent tasks to determine the upper bound of our framework. We then fine-tune an open-source instruction-following model, SQL-Llama, by leveraging Code Llama 7B, to accomplish all tasks as GPT-4 does. Experiments show that SQL-Llama achieves a comparable execution accuracy of 43.94, compared to the baseline accuracy of 46.35 for vanilla GPT-4. At the time of writing, MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 149 | Poster | Virtual
Ethical Reviewing:Interpretability and Explainability
Interpretability and Explainability
Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?
Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du and Yongfeng Zhang
Mingyu Jin | Mingyu Jin
Large language models (LLMs) have shown remarkable performances across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of ``Concept Depth'' to suggest that more complex concepts are typically acquired in deeper layers. Specifically, we categorize concepts based on their level of abstraction, defining them in the order of increasing complexity within factual, emotional, and inferential tasks. We conduct extensive probing experiments using layer-wise representations across various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the three domains of tasks. Our findings reveal that models could efficiently conduct probing for simpler tasks in shallow layers, and more complex tasks typically necessitate deeper layers for accurate understanding.
Additionally, we examine how external factors, such as adding noise to the input and quantizing the model weights, might affect layer-wise representations. Our findings suggest that these factors can impede the development of a conceptual understanding of LLMs until deeper layers are explored. We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our codes are available at https://github.com/Luckfort/CD.
Room: Gather | Session: Gather Session 1 | Whova: Gather | Mon, Jan 27 | 19:00-20:30 | UAE: 19:00-20:30 | Added to Whova: Yes
Paper ID: 155 | Poster | In-person
Ethical Reviewing:Machine Learning for CL/NLP
Knowledge Graph Entity Typing with Curriculum Contrastive Learning
Hao Wang, Minghua Nuo and Shan Jiang
Hao WangHao Wang
The Knowledge Graph Entity Typing (KGET) task aims to predict missing type annotations for entities in knowledge graphs. Most recent studies only focus on the structural information from an entity's neighborhood or on semantic information from textual representations of entities or relations. In this paper, inspired by curriculum learning and contrastive learning, we propose the CCLET model, which uses a Curriculum Contrastive Learning strategy for KGET and fuses entity-related semantic information and the structural information of the Knowledge Graph (KG) with a Pre-trained Language Model (PLM) and a graph model, respectively. Our CCLET model consists of two main parts. In the Knowledge Fusion part, we design an Enhanced-MLP architecture to fuse the text of the entity's description, related triplets, and tuples. In the Curriculum Contrastive Learning part, we define the difficulty of the curriculum by controlling the level of added noise, aiming to learn accurately, from easy to difficult, with the curriculum contrastive learning strategy. Our extensive experiments demonstrate that the CCLET model outperforms recent state-of-the-art models, verifying its effectiveness on the KGET task.
AtriumSession 4: Oral/Poster CPoster Tue, Jan 21
16:00-17:30
16:0017:30
40
160OralIn-person
Language Modeling:Ethical Reviewing
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
Zihui Wu, Haichang Gao, Jianping He and Ping Wang
Haichang Gao
Haichang Gao
Large language models (LLMs) have demonstrated remarkable capabilities, but their power comes with significant security considerations. While extensive research has been conducted on the safety of LLMs in chat mode, the security implications of their function calling feature have been largely overlooked. This paper uncovers a critical vulnerability in the function calling process of LLMs, introducing a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our empirical study, conducted on six state-of-the-art LLMs including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-pro, reveals an alarming average success rate of over 90\% for this attack. We provide a comprehensive analysis of why function calls are susceptible to such attacks and propose defensive strategies, including the use of defensive prompts. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs, contributing to the field of AI safety by identifying a previously unexplored risk, designing an effective attack method, and suggesting practical defensive measures.
Hall B-BSession 7: Oral/Poster F
Language Modeling 1
Wed, Jan 22
14:00-15:30
0:00
Djamé Seddah djame.seddah@inria.fr
14:3014:45
41
162OralIn-person
Ethical Reviewing:Language Modeling
Adapters Selector: Cross-domains and Multi-tasks LoRA Modules Integration Usage Method
Yimin Tian, Bolin Zhang, Zhiying Tu and Dianhui Chu
Bolin ZhangBolin Zhang
Parameter-Efficient Fine-Tuning (PEFT) adapts large language models (LLMs) to specific domains by updating only a small portion of the parameters. Although fine-tuning on a single task within a specific domain has demonstrated promising results, there remains limited exploration of how to effectively integrate these adapters for optimal performance. In this paper, we propose Adapters Selector (AS): a novel framework for better integrating the use of multiple adapters by training a middleman adapter to select the appropriate adapter for inference. Our approach utilizes PEFT to train a selector that determines which input content corresponds to which task in which domain, and subsequently selects the corresponding adapter. In this way, AS can execute cross-domain, multi-task inference effectively by combining a compact model with multiple LoRA modules. Our code is publicly available.
Hall B-BSession 7: Oral/Poster F
Language Modeling 1
Wed, Jan 22
14:00-15:30
0:00
Djamé Seddah djame.seddah@inria.fr
14:0014:15
42
163PosterIn-person
Ethical Reviewing:Information Extraction
XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser
Xianfu Cheng, Hang Zhang, Jian Yang, Xiang Li, Weixiao Zhou, Fei Liu, Kui Wu, Xiangyuan Guan, Tao Sun, Xianjie Wu, Tongliang Li and Zhoujun Li
Xiang LiXiang Li
In the domain of Document AI, parsing semi-structured image forms is a crucial Key Information Extraction (KIE) task. The advent of pre-trained multimodal models significantly empowers Document AI frameworks to extract key information from form documents in different formats such as PDF, Word, and images. Nonetheless, form parsing is still encumbered by notable challenges, such as subpar capabilities in multilingual parsing and diminished recall in industrial contexts rich in text and visuals. In this work, we introduce a simple but effective Multimodal and Multilingual semi-structured FORM PARSER (XFormParser), which is anchored on a comprehensive Transformer-based pre-trained language model and innovatively amalgamates semantic entity recognition (SER) and relation extraction (RE) into a unified framework. Combined with Bi-LSTM, the performance of multilingual parsing is significantly improved. Furthermore, we develop InDFormSFT, a pioneering supervised fine-tuning (SFT) industrial dataset that specifically addresses the parsing needs of forms in a variety of industrial contexts. Through rigorous testing on established benchmarks, XFormParser has demonstrated its unparalleled effectiveness and robustness. Compared to existing state-of-the-art (SOTA) models, XFormParser notably achieves up to a 1.79% F1 score improvement on RE tasks in language-specific settings. It also exhibits exceptional improvements in cross-task performance in both multilingual and zero-shot settings.
AtriumSession 2: Oral/Poster APoster Tue, Jan 21
11:00-12:30
11:0012:30
43
165PosterVirtual
Ethical Reviewing:Ethics, Bias, and Fairness
Ethics, Bias, and Fairness
Debiasing by obfuscating with 007-classifiers promotes fairness in multi-community settings
Ingroj Shrestha and Padmini Srinivasan
Ingroj Shrestha
Ingroj Shrestha
While there has been a considerable amount of research on bias mitigation algorithms, two properties, a multi-community perspective and fairness to *all* communities, have not been given sufficient attention. Focusing on these, we propose an obfuscation-based data augmentation debiasing approach. In it, we add to the training data *obfuscated* versions of *all* false positive instances, irrespective of source community. We test our approach by debiasing toxicity classifiers built using 5 neural models (a multi-layer perceptron model and masked language models) and 3 datasets in a 4-community setting. We also explore 4 different obfuscators for debiasing. Results demonstrate the merits of our approach: bias is reduced for almost all of our runs without sacrificing false positive rates or F1 scores for minority or majority communities. In contrast, the 4 state-of-the-art baselines typically make performance sacrifices (often large) while reducing bias. Crucially, we demonstrate that it is possible to debias while maintaining standards for both minority and majority communities.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
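A minimal sketch of the obfuscation-based augmentation idea described in the abstract above: obfuscated copies of false-positive training instances are added back to the training set. The character-swap obfuscator, function names, and predictor interface are illustrative assumptions, not the authors' implementation (the paper explores several different obfuscators).

```python
# Toy obfuscation-based data augmentation sketch (assumed interfaces, not the paper's code).
import random

def obfuscate(text: str, rate: float = 0.15) -> str:
    """Randomly swap adjacent characters inside words as a toy obfuscator."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def augment_with_false_positives(train_set, model_predict):
    """Append an obfuscated copy of every false-positive training instance (label 0 predicted as 1)."""
    augmented = list(train_set)
    for text, label in train_set:
        if label == 0 and model_predict(text) == 1:  # predicted toxic, actually benign
            augmented.append((obfuscate(text), 0))
    return augmented

# Toy usage with a dummy predictor that flags any text containing "damn".
train = [("that was a damn good game", 0), ("you are awful", 1)]
augmented = augment_with_false_positives(train, lambda t: 1 if "damn" in t else 0)
```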
44
176PosterVirtual
Machine Learning for CL/NLP:Ethical Reviewing
Machine Learning for CL/NLP
Graph Representation Learning in Hyperbolic Space via Dual-Masked
Rui Gong, Zuyun Jiang and Daren Zha
Rui GongRui Gong
Graph representation learning (GRL) in hyperbolic space has gradually emerged as a promising approach. Meanwhile, masking and reconstruction-based (MR-based) methods lead to state-of-the-art self-supervised graph representation. However, existing MR-based methods do not fully consider deep node and structural information. Inspired by the recent active and emerging field of self-supervised learning, we propose a novel node and edge dual-masked self-supervised graph representation learning framework in hyperbolic space, named HDM-GAE. We design a graph dual-masked module and a hyperbolic structural self-attention encoder module to mask nodes or edges and to perform node aggregation within hyperbolic space, respectively. Comprehensive experiments and ablation studies on real-world multi-category datasets demonstrate the superiority of our method in downstream tasks such as node classification and link prediction.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
45
177PosterVirtual
Information Retrieval and Text Mining:Ethical Reviewing
Information Retrieval and Text Mining
Perturbation-driven Dual Auxiliary Contrastive Learning for Collaborative Filtering Recommendation
Caihong Mu, Keyang Zhang, Jialiang Zhou and Yi Liu
Jialiang Zhou
Jialiang Zhou
Graph collaborative filtering has made great progress in recommender systems, but these methods often struggle with the data sparsity issue in real-world recommendation scenarios. To mitigate the effect of data sparsity, graph collaborative filtering incorporates contrastive learning as an auxiliary task to improve model performance. However, existing contrastive learning-based methods generally use a single data augmentation graph to construct the auxiliary contrastive learning task, which suffers from problems such as loss of key information and low robustness. To address these problems, this paper proposes Perturbation-driven Dual Auxiliary Contrastive Learning for Collaborative Filtering Recommendation (PDACL). PDACL designs structure perturbation and weight perturbation to construct two data augmentation graphs. The Structure Perturbation Augmentation (SPA) graph perturbs the topology of the user-item interaction graph, while the Weight Perturbation Augmentation (WPA) graph reconstructs the implicit feedback unweighted graph into a weighted graph similar to explicit feedback. These two data augmentation graphs are combined with the user-item interaction graph to construct the dual auxiliary contrastive learning task, which extracts self-supervised signals without losing key information and is jointly optimized together with the supervised recommendation task, alleviating the data sparsity problem and improving performance. Experimental results on multiple public datasets show that PDACL outperforms numerous benchmark models, demonstrating that the dual-perturbation data augmentation graphs in PDACL can overcome the shortcomings of a single data augmentation graph, leading to superior recommendation results. The implementation of our work can be found at https://github.com/zky77/PDACL.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
46
194OralIn-person
Ethical Reviewing:Information Retrieval and Text Mining
Enhancing Reranking for Recommendation with LLMs through User Preference Retrieval
Haobo Zhang, Qiannan Zhu and Zhicheng Dou
Haobo Zhang
Haobo Zhang
Recently, large language models (LLMs) have shown the potential to enhance recommendations due to their rich knowledge and remarkable summarization ability. However, existing LLM-powered recommendation may create redundant output, generating irrelevant information about the user's preferences on candidate items from user behavior sequences. To address this issue, we propose UR4Rec, a framework that enhances reranking for recommendation with large language models through user preference retrieval. Specifically, UR4Rec develops a small transformer-based user preference retriever towards candidate items to build a bridge between LLMs and recommendation, focusing on producing the essential knowledge from user behavior sequences through LLMs to enhance reranking for recommendation. Our experimental results on three real-world public datasets demonstrate the superiority of UR4Rec over existing baseline models.
Hall B-BSession 4: Oral/Poster C
Information extraction and retrieval 1
Tue, Jan 21
16:00-17:30
0:00Ge Shi16:4517:00
47
196PosterVirtual
Ethical Reviewing:NLP and LLM Applications
NLP and LLM Applications
SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task
Zijie Zhong, Linqing Zhong, Zhaoze Sun, Qingyun Jin, Zengchang Qin and Xiaofan Zhang
Zijie ZhongZijie Zhong
Integrating Large Language Models (LLMs) with existing Knowledge Graph (KG) databases presents a promising avenue for enhancing LLMs' efficacy and mitigating their "hallucinations". Given that most KGs reside in graph databases accessible solely through specialized query languages (e.g., Cypher), it is critical to connect LLMs with KG databases by automating the translation of natural language into Cypher queries (termed as "Text2Cypher" task). Prior efforts tried to bolster LLMs' proficiency in Cypher generation through Supervised Fine-Tuning (SFT). However, these explorations are hindered by the lack of annotated datasets of Query-Cypher pairs, resulting from the labor-intensive and domain-specific nature of such annotation. In this study, we propose SyntheT2C, a methodology for constructing a synthetic Query-Cypher pair dataset, comprising two distinct pipelines: (1) LLM-based prompting and (2) template-filling. SyntheT2C is applied to two medical KG databases, culminating in the creation of a synthetic dataset, MedT2C. Comprehensive experiments demonstrate that the MedT2C dataset effectively enhances the performance of backbone LLMs on Text2Cypher task via SFT. Both the SyntheT2C codebase and the MedT2C dataset will be released.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
48
199PosterIn-person
Interpretability and Explainability:Ethical Reviewing
Language Models Encode the Value of Numbers Linearly
Fangwei Zhu, Damai Dai and Zhifang Sui
Fangwei Zhu
Fangwei Zhu
Large language models (LLMs) have exhibited impressive competence in various tasks, but their internal mechanisms on mathematical problems are still under-explored.
In this paper, we study a fundamental question: how language models encode the value of numbers, a basic element in math.
To study the question, we construct a synthetic dataset comprising addition problems and utilize linear probes to read out input numbers from the hidden states.
Experimental results support the existence of encoded number values in LLMs on different layers, and these values can be extracted via linear probes.
Further experiments show that LLMs store their calculation results in a similar manner, and we can intervene in the output via simple vector additions, proving the causal connection between encoded numbers and language model outputs.
Our research provides evidence that LLMs encode the value of numbers linearly, offering insights for better exploring, designing, and utilizing numeric information in LLMs.
AtriumSession 4: Oral/Poster CPoster Tue, Jan 21
16:00-17:30
16:0017:30
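A minimal linear-probe sketch of the kind described in the abstract above: hidden states at one layer are regressed onto the value of the input number. GPT-2, the prompt template, and the chosen layer are stand-in assumptions, not the models or setup used by the authors.

```python
# Linear probe on hidden states for number values (illustrative setup only).
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

feats, targets = [], []
with torch.no_grad():
    for n in range(10, 200):
        ids = tok(f"{n} + 7 =", return_tensors="pt")
        hs = model(**ids).hidden_states[6]   # pick one intermediate layer
        feats.append(hs[0, -1].numpy())       # last-token representation
        targets.append(float(n))

X_tr, X_te, y_tr, y_te = train_test_split(feats, targets, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("probe R^2 on held-out numbers:", probe.score(X_te, y_te))
```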
49
201PosterVirtual
Language Resources and Evaluation:Ethical Reviewing
Language Resources and Evaluation
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models
Shu Liu, Shangqing Zhao, Chenghao Jia, Xinlin Zhuang, Zhaoguang Long, Jie Zhou, Aimin Zhou, Man Lan and Yang Chong
Shu Liu
Shu Liu
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly focusing on data-driven thinking, remain uncertain. To bridge this gap, we introduce FinDABench, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs within this context. The benchmark comprises 15,200 training instances and 8,900 test instances, all meticulously crafted by human experts. FinDABench assesses LLMs across three dimensions: 1) Core Ability, evaluating the models' ability to perform financial indicator calculation and corporate sentiment risk assessment; 2) Analytical Ability, determining the models' ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) Technical Ability, examining the models' use of technical knowledge to address real-world data analysis challenges involving analysis generation and chart visualization from multiple perspectives. We will release FinDABench and the evaluation scripts at https://github.com/xxx. FinDABench aims to provide a measure for in-depth analysis of LLM abilities and to foster the advancement of LLMs in the field of financial data analysis.
GatherGather Session 1Gather Mon, Jan 27
19:00-20:30
19:0020:30Yes
50
213PosterVirtual
Ethical Reviewing:Language Resources and Evaluation
Language Resources and Evaluation
Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding
Nguyen Binh Nguyen and Yang He
Binh-Nguyen Nguyen
Binh-Nguyen Nguyen
Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on cross-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain examples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets demonstrate the effectiveness of our method, spanning various tasks and scales while significantly reducing computational resources.
GatherGather Session 3GatherTue, Jan 28
13:00-14:30
13:0014:30Yes
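A rough sketch of the scoring step described in the SCDP abstract above, assuming TF-IDF embeddings and a Weiszfeld-style geometric median; the retention rule and function names are illustrative, not the authors' exact procedure.

```python
# TF-IDF + geometric-median distance scoring for dataset pruning (illustrative sketch).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def geometric_median(X, n_iter=100, eps=1e-8):
    """Weiszfeld's algorithm: the point minimizing the sum of Euclidean distances."""
    median = X.mean(axis=0)
    for _ in range(n_iter):
        dists = np.clip(np.linalg.norm(X - median, axis=1), eps, None)
        weights = 1.0 / dists
        new_median = (weights[:, None] * X).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < eps:
            break
        median = new_median
    return median

def score_samples(texts):
    """Score each text by its distance to the geometric median of TF-IDF embeddings."""
    X = TfidfVectorizer(max_features=5000).fit_transform(texts).toarray()
    return np.linalg.norm(X - geometric_median(X), axis=1)

# Usage: for a small dataset, keep the most distant (most diverse) half.
texts = ["cheap flights to paris", "book a hotel room", "weather in tokyo", "translate hello"]
scores = score_samples(texts)
keep = np.argsort(scores)[-len(texts) // 2:]
```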
51
214PosterIn-person
Language Resources and Evaluation:Ethical Reviewing
SLARD: A Chinese Superior Legal Article Retrieval Dataset
Zhe Chen, Pengjie Ren, Fuhui Sun, Xiaoyan Wang, Yujun Li, Siwen Zhao and Tengyi Yang
Zhe ChenZhe Chen
Retrieving superior legal articles involves identifying relevant legal articles that hold higher legal effectiveness. This process is crucial in legislative work because superior legal articles form the legal basis for drafting new laws. However, most existing legal information retrieval research focuses on retrieving legal documents, with limited research on retrieving superior legal articles. This gap restricts the digitization of legislative work. To advance research in this area, we propose SLARD: A Chinese Superior Legal Article Retrieval Dataset, which filters 2,627 queries and 9,184 candidates from over 4.3 million effective Chinese regulations, covering 32 categories such as environment, agriculture, and water resources. Each query is manually annotated, and the candidates include superior articles at both the provincial and national levels. We conducted detailed experiments and analyses on the dataset and found that existing retrieval methods struggle to achieve ideal results. The best method achieved an R@1 of only 0.4719. Additionally, we found that existing large language models (LLMs) lack prior knowledge of the content of superior legal articles. This indicates the necessity for further exploration and research in this field.
AtriumSession 6: Oral/Poster EPoster Wed, Jan 22
11:00-12:30
11:0012:30
52
223OralIn-person
Dialogue and Conversational Interaction:Ethical Reviewing
Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations
Nuo Chen, Hongguang Li, Jianhui Chang, Juhua Huang, Baoyuan Wang and Jia Li
hongguang li
hongguang li
Existing retrieval-based methods have made significant strides in maintaining long-term conversations. However, these approaches face challenges in memory database management and accurate memory retrieval, hindering their efficacy in dynamic, real-world interactions. This study introduces a novel framework, COmpressive Memory-Enhanced Dialogue sYstems (COMEDY), which eschews traditional retrieval modules and memory databases. Instead, COMEDY adopts a ``One-for-All'' approach, utilizing a single language model to manage memory generation, compression, and response generation. Central to this framework is the concept of compressive memory, which integrates session-specific summaries, user-bot dynamics, and past events into a concise memory format. To support COMEDY, we collect the largest Chinese long-term conversation dataset, Dolphin, derived from real user-chatbot interactions. Comparative evaluations demonstrate COMEDY's superiority over traditional retrieval-based methods in producing more nuanced and human-like conversational experiences.
Hall B-CSession 4: Oral/Poster C
Dialogue and Conversational Interaction 1
Tue, Jan 21
16:00-17:30
0:00Li Zhang17:1517:30
53
227PosterIn-person
Ethical Reviewing:Language Resources and Evaluation
Refined Evaluation for End-to-End Grammatical Error Correction Using an Alignment-Based Approach
Junrui Wang, Mengyang Qiu, Yang Gu, Zihao Huang and Jungyeul Park
Mengyang Qiu
Mengyang Qiu
We propose a refined alignment-based method to assess end-to-end grammatical error correction (GEC) systems, aiming to reproduce and improve results from existing evaluation tools, such as \texttt{errant}, even when applied to raw text input—reflecting real-world language learners' writing scenarios. Our approach addresses challenges arising from sentence boundary detection deviations in text preprocessing, a factor overlooked by current GEC evaluation metrics. We demonstrate its effectiveness by replicating results through a re-implementation of \texttt{errant}, utilizing \texttt{stanza} for error annotation and simulating end-to-end evaluation from raw text. Additionally, we propose a potential multilingual \texttt{errant}, presenting Chinese and Korean GEC results. Previously, Chinese and Korean \texttt{errant} were implemented independently for each language, with different annotation formats. Our approach generates consistent error annotations across languages, establishing a basis for standardized grammatical error annotation and evaluation in multilingual GEC contexts.
AtriumSession 6: Oral/Poster EPoster Wed, Jan 22
11:00-12:30
11:0012:30
54
228PosterVirtual
Ethical Reviewing:Dialogue and Conversational Interaction
Dialogue and Conversational Interaction
LLMs on interactive feature collections with implicit dynamic decision strategy
Juyeon Heo, Vihari Piratla, Kyunghyun Lee, Hyonkeun Joh and Adrian Weller
Juyeon HeoJuyeon Heo
In real-world contexts such as medical diagnosis and business consulting, effective problem-solving often requires gathering relevant information through interactions and targeted questioning to pinpoint the root cause of a problem.
However, Large Language Models (LLMs) often struggle to efficiently narrow down the search space, leading to either missing key information or asking redundant questions when guided by implicit methods like Chain-of-Thought (CoT). Some approaches employ external engineered systems to guide reasoning paths, but these methods may not fully utilize the inherent problem-solving capabilities of LLMs and often require multiple expensive API calls.
This study explores how we can implicitly guide LLMs to enhance their interactive feature collection abilities within a single prompt. Instead of employing explicit search algorithms or step-by-step external guidance, we provide high-level guidelines that allow LLMs to dynamically adjust their strategies and iteratively refine their decision-making processes independently. Evaluations on synthetic 20-Questions games and real-world scenarios, including business and medical diagnosis cases, demonstrate that LLMs guided by these strategies perform more effective interactive feature collection, asking fewer and more strategic questions and achieving better problem-solving efficiency.
GatherGather TBDGather
Gather TBD
TBDTBDTBD
55
232PosterVirtual
Information Retrieval and Text Mining:Ethical Reviewing
Information Retrieval and Text Mining
Pre-trained Semantic Interaction based Inductive Graph Neural Networks for Text Classification
Shiyu Wang, Gang Zhou, Jicang Lu, Jing Chen and Ningbo Huang
Shiyu WangShiyu Wang
Research on Text Classification (TC) based on graph neural networks (GNNs) is on the rise. Both inductive methods and transductive methods have made significant progress. For transductive methods, the semantic interaction between texts plays a crucial role in learning effective text representations. However, it is difficult to perform inductive learning while modeling interactions between texts on the graph. To give a universal solution, we propose a graph neural network based on pre-trained semantic interaction called PaSIG. Firstly, we construct a text-word heterogeneous graph and design an asymmetric structure to ensure one-way message passing from words to the test texts. Meanwhile, we use the context representation capability of the pre-trained language model to construct node features that contain classification semantic information. Afterward, we explore adaptive aggregation methods with a gated fusion mechanism. Extensive experiments on five datasets have shown the effectiveness of PaSIG, with accuracy exceeding the baselines by 2.7% on average. While achieving state-of-the-art performance, we have also taken measures of subgraph sampling and intermediate state preservation to achieve fast inference.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
56
233PosterIn-person
Natural Language Generation, Summarization and Simplification:Ethical Reviewing
From Superficial to Deep: Integrating External Knowledge for Follow-up Question Generation Using Knowledge Graph and LLM
Jianyu Liu, Yi Huang, Sheng Bi, Junlan Feng and Guilin Qi
hongguang li
Jianyu Liu
In a conversational system, dynamically generating follow-up questions based on context can help users explore information and provide a better user experience. Humans are usually able to ask questions that involve general life knowledge and demonstrate higher-order cognitive skills. However, the questions generated by existing methods are often limited to shallow contextual questions that are uninspiring and fall far short of the human level. In this paper, we propose a three-stage external knowledge-enhanced follow-up question generation method, which generates questions by identifying contextual topics, constructing a knowledge graph (KG) online, and finally combining these with a large language model to generate the final question. The model generates information-rich and exploratory follow-up questions by introducing external common-sense knowledge and performing a knowledge fusion operation. Experiments show that, compared to baseline models, our method generates questions that are more informative and closer to human questioning levels while maintaining contextual relevance.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
57
236PosterVirtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
AGCL: Aspect Graph Construction and Learning for Aspect-level Sentiment Classification
Zhongquan Jian, Daihang Wu, Shaopan Wang, Yancheng Wang, Junfeng Yao, Meihong Wang and Qingqiang Wu
Zhongquan Jian
Zhongquan Jian
Prior studies on Aspect-level Sentiment Classification (ALSC) emphasize modeling interrelationships among aspects and contexts but overlook the crucial role of aspects themselves as essential domain knowledge. To this end, we propose AGCL, a novel Aspect Graph Construction and Learning method, aimed at furnishing the model with finely tuned aspect information to bolster its task-understanding ability. AGCL's pivotal innovations reside in Aspect Graph Construction (AGC) and Aspect Graph Learning (AGL), where AGC harnesses intrinsic aspect connections to construct the domain aspect graph, and AGL iteratively updates the introduced aspect graph to enhance its domain expertise, making it more suitable for the ALSC task. Hence, this domain aspect graph can serve as a bridge connecting unseen aspects with seen aspects, thereby enhancing the model's generalization capability. Experimental results on three widely used datasets demonstrate the significance of aspect information for ALSC and highlight AGL's superiority in aspect learning, greatly surpassing state-of-the-art baselines. Code is available at https://github.com/jian-projects/agcl.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
58
237PosterVirtual
Language Modeling:Ethical Reviewing
Language Modeling
TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution
Jiuding Yang, Shengyao Lu, Weidong Guo, Xiangyang Li, Kaitong Yang, Yu Xu and Di Niu
Jiuding Yang
Jiuding Yang
The fine-tuning of Large Language Models (LLMs) specialized in code generation has seen notable advancements through the use of open-domain coding queries. Despite the successes, existing methodologies like \textit{Evol-Instruct} encounter performance limitations, impeding further enhancements in code generation tasks. This paper examines the constraints of existing prompt evolution techniques and introduces a novel approach, Instruction Fusion (IF). IF innovatively combines two distinct prompts through a hybridization process, thereby enhancing the evolution of training prompts for code LLMs. Our experimental results reveal that the proposed novel method effectively addresses the shortcomings of prior methods, significantly improving the performance of Code LLMs across five code generation benchmarks, namely HumanEval, HumanEval+, MBPP, MBPP+ and MultiPL-E, which underscore the effectiveness of Instruction Fusion in advancing the capabilities of LLMs in code generation.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30TBD
59
242WithdrawnWithdrawn
Ethical Reviewing:Machine Learning for CL/NLP
withdrawnwithdrawnwithdrawnTBAWithdrawn
60
243PosterVirtual
NLP and LLM Applications:Ethical Reviewing
NLP and LLM Applications
LLaMA-E: Empowering E-commerce Authoring with Object-Interleaved Instruction Following
Kaize Shi, Xueyao Sun, Dingxian Wang, Yinlin Fu, Guandong Xu and Qing Li
Kaize ShiKaize Shi
E-commerce authoring entails creating engaging, diverse, and targeted content to enhance preference elicitation and the retrieval experience. While Large Language Models (LLMs) have revolutionized content generation, they often fall short in e-commerce applications due to their limited memorization of domain-specific features. This paper proposes LLaMA-E, a set of unified e-commerce authoring models that address the contextual preferences of customers, sellers, and platforms, the essential objects in e-commerce operation. We design an instruction set derived from the tasks of ad generation, query-enhanced product title rewriting, product classification, purchase intent speculation, and general e-commerce Q&A. The instruction formulation ensures interleaved coverage of the presented and required object features, allowing the alignment of base models to parameterize e-commerce knowledge comprehensively. The proposed LLaMA-E models achieve state-of-the-art evaluation performance and exhibit advantages in zero-shot practical applications. To our knowledge, this is the first LLM tailored to empower authoring applications with comprehensive scenario understanding by integrating features focused on the participating objects.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30
61
249OralIn-person
Information Retrieval and Text Mining:Ethical Reviewing
LLMTreeRec: Unleashing the Power of Large Language Models for Cold-Start Recommendations
Wenlin Zhang, Chuhan Wu, Xiangyang Li, Yuhao Wang, Kuicai Dong, Yichao Wang, Xinyi Dai, Xiangyu Zhao, Huifeng Guo and Ruiming Tang
Wenlin Zhang
Wenlin Zhang
The lack of training data gives rise to the system cold-start problem in recommendation systems, making them struggle to provide effective recommendations. To address this problem, Large Language Models (LLMs) can model recommendation tasks as language analysis tasks and provide zero-shot results based on their vast open-world knowledge. However, the large scale of the item corpus poses a challenge to LLMs, leading to substantial token consumption that makes it impractical to deploy them in real-world recommendation systems. To tackle this challenge, we introduce a tree-based LLM recommendation framework, LLMTreeRec, which structures all items into an item tree to improve the efficiency of the LLM's item retrieval. LLMTreeRec achieves state-of-the-art performance under the system cold-start setting on two widely used datasets, and it is even competitive with conventional deep recommendation systems that use substantial training data. Furthermore, LLMTreeRec outperforms the baseline model in an A/B test on a Huawei industrial system. Consequently, LLMTreeRec demonstrates its effectiveness as an industry-friendly solution that has been successfully deployed online.
Hall B-BSession 5: Oral/Poster D
Information extraction and retrieval 2
Wed, Jan 229:00-10:300:00Barbara Plank9:4510:00
62
253PosterIn-person
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Collaborative Document Simplification Using Multi-Agent Systems
Dengzhao Fang, Jipeng Qiang, Xiaoye Ouyang, Yi Zhu, Yunhao Yuan and Yun Li
Jipeng Qiang
Jipeng Qiang
Research on text simplification has been ongoing for many years. However, the task of document simplification (DS) remains a significant challenge due to the need to consider complex factors such as technical terminology, metaphors, and overall coherence. In this work, we introduce a novel multi-agent framework for document simplification (\textit{AgentSimp}) based on large language models (LLMs). This framework emulates the collaborative process of a human expert team through the roles played by multiple agents, addressing the intricate demands of document simplification. We explore two communication strategies among agents (pipeline-style and synchronous) and two document reconstruction strategies (Direct and Iterative). According to both automatic evaluation metrics and human evaluation results, the documents simplified by AgentSimp are deemed to be more thoroughly simplified and more coherent across a variety of articles of different types and styles.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
63
255PosterVirtual
Ethical Reviewing:Language Modeling
Language Modeling
Distilling Rule-based Knowledge into Large Language Models
Wenkai Yang, Yankai Lin, Jie Zhou and Ji-Rong Wen
Wenkai Yang
Wenkai Yang
Large language models (LLMs) have shown incredible performance in completing various real-world tasks. The current paradigm of knowledge learning for LLMs is mainly based on learning from examples, in which LLMs learn the internal rule implicitly from a certain number of supervised examples. However, this learning paradigm may not learn complicated rules well, especially when the training examples are limited. We are inspired by the fact that humans can learn new tasks or knowledge in another way, namely by learning from rules: humans can learn new tasks or grasp new knowledge quickly and generalize well given only a detailed rule and a few optional examples. Therefore, in this paper, we aim to explore the feasibility of this new learning paradigm, which targets encoding rule-based knowledge into LLMs. We further propose rule distillation, which first uses the strong in-context abilities of LLMs to extract the knowledge from the textual rules, and then explicitly encodes the knowledge into the parameters of LLMs by learning from the above in-context signals produced inside the model. Our experiments show that making LLMs learn from rules by our method is much more efficient than example-based learning in terms of both sample size and generalization ability. Warning: This paper may contain examples with offensive content.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
64
257PosterVirtual
Ethical Reviewing:Machine Learning for CL/NLP
Machine Learning for CL/NLP
Exploring Backdoor Vulnerabilities of Chat Models
Wenkai Yang, Yunzhuo Hao and Yankai Lin
Wenkai Yang
Wenkai Yang
Recent research has shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack. The backdoored model will behave well in normal cases but exhibit malicious behaviours on inputs inserted with a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on single-turn instruction-tuned LLMs, while neglecting another realistic scenario where LLMs are fine-tuned on multi-turn conversational data to be chat models. Chat models are extensively adopted across various real-world scenarios, thus the security of chat models deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format instead increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds, and making the backdoor trigger only when all trigger scenarios have appeared in the historical conversations. Experimental results demonstrate that our method can achieve high attack success rates (e.g., over 90\% ASR on Vicuna-7B) while successfully maintaining the normal capabilities of chat models in providing helpful responses to benign user requests. Also, the backdoor cannot be easily removed by downstream re-alignment, highlighting the importance of continued research and attention to the security concerns of chat models. Warning: This paper may contain toxic examples.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
65
259OralIn-person
Ethical Reviewing:Phonology, Morphology, and Word Segmentation
Towards the Machine Translation of Scientific Neologisms
Paul Lerner and François Yvon
Paul LernerPaul Lerner
Scientific research continually discovers and invents new concepts, which are then referred to by new terms, neologisms, or neonyms in this context. As the vast majority of publications are written in English, disseminating this new knowledge to the general public often requires translating these terms. However, by definition, no parallel data exist to provide such translations. Therefore, we propose to leverage term definitions as a useful source of information for the translation process. As we discuss, Large Language Models are well suited for this task and can benefit from in-context learning with co-hyponyms and terms sharing the same derivation paradigm. These models, however, are sensitive to the superficial and morphological similarity between source and target terms. Their predictions are also impacted by subword tokenization, especially for prefixed terms.
Hall B-ASession 3: Oral/Poster B
Discourse, phonology and syntax
Tue, Jan 21
14:00-15:30
0:00Owen Rambow14:3014:45
66
260PosterVirtual
Ethical Reviewing:Sentiment Analysis, Opinion and Argument Mining
Sentiment Analysis, Opinion and Argument Mining
HyperIDP: Customizing Temporal Hypergraph Neural Networks for Multi-Scale Information Diffusion Prediction
Haowei Xu, Chao Gao, Xianghua Li and Zhen Wang
Haowei XuHaowei Xu
Information diffusion prediction is crucial for understanding how information spreads within social networks, addressing both macroscopic and microscopic prediction tasks. Macroscopic prediction assesses the overall impact of diffusion, while microscopic prediction focuses on identifying the next user likely to be influenced. However, few studies have focused on both scales of diffusion. This paper presents HyperIDP, a novel Hypergraph-based model designed to manage both macroscopic and microscopic Information Diffusion Prediction tasks. The model captures interactions and dynamics of cascades at the macro level with hypergraph neural networks (HGNNs) while integrating social homophily at the micro level. Considering the diverse data distributions across social media platforms, which necessitate extensive tuning of HGNN architectures, a search space is constructed to accommodate diffusion hypergraphs, with optimal architectures derived through differentiable search strategies. Additionally, cooperative-adversarial loss, inspired by multi-task learning, is introduced to ensure that the model can leverage the advantages of the shared representation when handling both tasks, while also avoiding potential conflicts. Experimental results show that the proposed model significantly outperforms baselines.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30
67
264PosterVirtual
Information Extraction:Ethical Reviewing
Information Extraction
Enhancing multi-modal Relation Extraction with Reinforcement Learning Guided Graph Diffusion Framework
Rui Yang and Rajiv Gupta
Rui YangRui Yang
With the massive growth of multi-modal information such as text, images, and other data, how to analyze and align these data has become very important. In our work, we introduce a new framework based on Reinforcement Learning Guided Graph Diffusion to address the complexity of multi-modal graphs and enhance interpretability, making the alignment of multi-modal information easier to understand. Our approach leverages pre-trained models to encode multi-modal data into scene graphs and combines them into a cross-modal graph (CMG). We design a reinforcement learning agent that filters nodes and modifies edges based on observations of the graph state, dynamically adjusting the graph structure to provide coarse-grained refinement. We then iteratively optimize edge weights and node selection to achieve fine-grained adjustment. We conduct extensive experiments on multi-modal relation extraction datasets and show that our model significantly outperforms existing multi-modal methods such as MEGA and MKGFormer. We also conduct an ablation study to demonstrate the importance of each key component, showing that performance drops significantly when any key element is removed. Our method uses reinforcement learning to better mine potential multi-modal information relevance, and its adjustments based on graph structure make it more interpretable.
GatherGather Session 3GatherTue, Jan 28
13:00-14:30
13:0014:30Yes
68
267OralIn-person
Dialogue and Conversational Interaction:Ethical Reviewing
Non-Emotion-Centric Empathetic Dialogue Generation
Yuanxiang Huangfu, Peifeng Li, Yaxin Fan and Qiaoming Zhu
Yuanxiang Huangfu
Yuanxiang Huangfu
email: hfyx0111@163.com
Previous work on empathetic response generation mainly focused on utilizing the speaker's emotions to generate responses. However, the performance of identifying fine-grained emotions is limited, introducing cascading errors into empathetic response generation. Moreover, due to the conflict between the information in the dialogue history and the recognized emotions, previous work often generated general and uninformative responses. To address the above issues, we propose a novel framework NEC (Non-Emotion-Centric empathetic dialogue generation) based on contrastive learning and context-sensitive entity and social commonsense, in which frequent replies and sentences with incorrect emotions are penalized through contrastive learning, thereby improving the empathy, diversity and informativeness of the responses. The experimental results demonstrate that our NEC enhances the quality of empathetic generation and generates more diverse responses in comparison with the state-of-the-art baselines. The code will be available at https://github.com/huangfu170/NEC-empchat
Hall B-CSession 4: Oral/Poster C
Dialogue and Conversational Interaction 1
Tue, Jan 21
16:00-17:30
0:00Li Zhang16:0016:15
69
270PosterVirtual
Ethical Reviewing:Reasoning, Question Answering, and Sentence-level Semantics
Reasoning, Question Answering, and Sentence-level Semantics
Aligning Retrieval with Reader Needs: Reader-Centered Passage Selection for Open-Domain Question Answering
Chunlei Xin, Shuheng Zhou, Xuanang Chen, Yaojie Lu, Huijia Zhu, weiqiang wang, Zhongyi Liu, Xianpei Han and Le Sun
Chunlei XinChunlei Xin
Open-Domain Question Answering (ODQA) systems often struggle with the quality of retrieved passages, which may contain conflicting information and be misaligned with the reader's needs. Existing retrieval methods aim to gather relevant passages but often fail to prioritize consistent and useful information for the reader. In this paper, we introduce a novel Reader-Centered Passage Selection (R-CPS) method, which enhances the performance of the retrieve-then-read pipeline by re-ranking and clustering passages from the reader's perspective. Our method re-ranks passages based on the reader's prediction probability distribution and clusters passages according to the predicted answers, prioritizing more useful and relevant passages to the top and reducing inconsistent information. Experiments on ODQA datasets demonstrate the effectiveness of our approach in improving the quality of evidence passages under zero-shot settings.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
70
274PosterVirtual
Ethical Reviewing:Interpretability and Explainability
Interpretability and Explainability
Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
Cheng Wang, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng and Kai-Wei Chang
Cheng Wang
Cheng Wang
The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member and non-member contexts. While previous work suggested that member contexts provide little information due to the minor distributional shift they induce, our analysis reveals that these subtle shifts can be effectively leveraged when contrasted with non-member contexts. In this paper, we propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts through contrastive decoding, amplifying subtle differences to enhance membership inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves state-of-the-art performance on the WikiMIA benchmark and is robust against various text manipulation techniques.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
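A simplified sketch of the general idea of contrasting member and non-member contexts for membership inference, loosely following the abstract above. GPT-2, the prompts, and the difference-of-log-likelihoods score are stand-ins; Con-ReCall's actual scoring may differ.

```python
# Contrast target likelihood under member vs. non-member prefix contexts (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_log_likelihood(prefix, target):
    """Average log-likelihood of `target` tokens given a `prefix` context."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    # logits at position i predict token i+1; score only the target span
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prefix_ids.shape[1] - 1
    picked = log_probs[start:, :].gather(1, ids[0, prefix_ids.shape[1]:].unsqueeze(1))
    return picked.mean().item()

def contrastive_score(target, member_ctx, non_member_ctx):
    """Higher score suggests the target's likelihood shifts more under the member context."""
    return target_log_likelihood(member_ctx, target) - target_log_likelihood(non_member_ctx, target)
```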
71
279PosterIn-person
Ethical Reviewing:Ethics, Bias, and Fairness
Citation Amnesia: On The Recency Bias of NLP and Other Academic Fields
Jan Philip Wahle, Terry Lima Ruas, Mohamed Abdalla, Bela Gipp and Saif M. Mohammad
Jan Philip Wahle
Jan Philip Wahle
This study examines the tendency to cite older work across 20 fields of study over 43 years (1980--2023). We put NLP's propensity to cite older work in the context of these 20 other fields to analyze whether NLP shows similar temporal citation patterns to them over time or whether differences can be observed. Our analysis, based on a dataset of ~240 million papers, reveals a broader scientific trend: many fields have markedly declined in citing older works (e.g., psychology, computer science).
The trend is strongest in NLP and ML research (-12.8% and -5.5% in citation age from previous peaks). Our results suggest that citing more recent works is not directly driven by the growth in publication rates (-3.4% across fields; -5.2% in humanities; -5.5% in formal sciences) --- even when controlling for an increase in the volume of papers. Our findings raise questions about the scientific community's engagement with past literature, particularly for NLP, and the potential consequences of neglecting older but relevant research. The data and a demo showcasing our results are publicly available.
AtriumSession 5: Oral/Poster DPoster Wed, Jan 229:00-10:309:0010:30
72
284PosterVirtual
Low-resourced and Less Studied Languages:Ethical Reviewing
Low-resourced and Less Studied Languages
Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation
Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu and Datao You
Yanxu MaoYanxu Mao
In recent years, text classification methods based on neural networks and pre-trained models have gained increasing attention and demonstrated excellent performance. However, these methods still have some limitations in practical applications: (1) They typically focus only on the matching similarity between sentences. However, there exists implicit high-value information both within sentences of the same class and across different classes, which is very crucial for classification tasks. (2) Existing methods such as pre-trained language models and graph-based approaches often consume substantial memory for training and text-graph construction. (3) Although some low-resource methods can achieve good performance, they often suffer from excessively long processing times. To address these challenges, we propose a low-resource and fast text classification model called LFTC. Our approach begins by constructing a compressor list for each class to fully mine the regularity information within intra-class data. We then remove redundant information irrelevant to the target classification to reduce processing time. Finally, we compute the similarity distance between text pairs for classification. We evaluate LFTC on 9 publicly available benchmark datasets, and the results demonstrate significant improvements in performance and processing time, especially under limited computational and data resources, highlighting its superior advantages.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
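A generic compression-distance classifier is sketched below only to illustrate the family of low-resource, training-free text classification methods the LFTC abstract builds on; this is not the LFTC algorithm itself, and the labelled data are toy examples.

```python
# Nearest-neighbour classification with normalized compression distance (generic illustration).
import zlib

def c(s: str) -> int:
    """Compressed length of a string in bytes."""
    return len(zlib.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = c(a), c(b), c(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query, labeled_examples):
    """Predict the label of the nearest training text under NCD."""
    best = min(labeled_examples, key=lambda pair: ncd(query, pair[0]))
    return best[1]

train = [("the team won the match", "sports"), ("stocks fell sharply today", "finance")]
print(classify("the striker scored twice and the team won", train))
```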
73
299PosterVirtual
Machine Learning for CL/NLP:Ethical Reviewing
Machine Learning for CL/NLP
Monte Carlo Tree Search Based Prompt Autogeneration for Jailbreak Attacks against LLMs
Suhuang Wu, Huimin Wang, Yutian Zhao, Xian Wu, Yefeng Zheng, Wei Li, Hui Li and Rongrong Ji
SUHUANG WU
SUHUANG WU
Jailbreak attacks craft specific prompts or append adversarial suffixes to prompts, thereby inducing language models to generate harmful or unethical content and bypassing the model's safety guardrails. With the recent blossoming of large language models (LLMs), there is a growing focus on jailbreak attacks to probe their safety. While current white-box attacks typically focus on meticulously identifying adversarial suffixes for specific models, their effectiveness and efficiency diminish when applied to different LLMs. In this paper, we propose a Monte Carlo Tree Search (MCTS) based Prompt Auto-generation (MPA) method to enhance the effectiveness and efficiency of attacks across various models. MPA automatically searches for and generates adversarial suffixes for valid jailbreak attacks. Specifically, we first identify a series of action candidates that could potentially trick LLMs into providing harmful responses. To streamline the exploration of adversarial suffixes, we design a prior confidence probability for each MCTS node. We then iteratively auto-generate adversarial prompts using the MCTS framework. Extensive experiments on multiple open-source models (like Llama, Gemma, and Mistral) and closed-source models (such as ChatGPT) show that our proposed MPA surpasses existing methods in search efficiency as well as attack effectiveness. The codes are available at https://github.com/KDEGroup/MPA.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
74
301PosterVirtual
Ethical Reviewing:Reasoning, Question Answering, and Sentence-level Semantics
Reasoning, Question Answering, and Sentence-level Semantics
LogiGraph: Logical Reasoning with Contrastive Learning and Lightweight Graph Networks
Xiang Li, Chen Shi, Yong Xu and Jun Huang
Xiang LiXiang Li
Logical reasoning is a crucial factor in machine reading comprehension (MRC) tasks. Existing methods struggle to balance semantic and explicit logical relation representations: some emphasize contextual semantics, while others pay more attention to explicit logical features. Additionally, previous methods utilize graph convolutional networks (GCN) for node updates, which still exhibit some shortcomings.
To address these challenges, in this paper, we propose a logical reasoning method with contrastive learning and lightweight graph networks (LogiGraph).
Our method focuses on a \textit{lightweight} design of the GCN, which greatly alleviates its shortcomings, and employs conjunctions and punctuation marks as two types of edges to construct a dual graph.
Besides, we combine contrastive learning with graph reasoning, altering the logical expression's content to serve as the negative sample of the original context, which enables the model to capture negative logical relationships and improves generalization ability.
We conduct extensive experiments on two public datasets, ReClor and LogiQA. Experimental results demonstrate that LogiGraph can achieve state-of-the-art performance on both datasets.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
75
304PosterVirtual
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Natural Language Generation, Summarization and Simplification
Explaining Relationships Among Research Papers
Xiangci Li and Jessica Ouyang
Xiangci LiXiangci Li
The rapid pace of research publications makes it challenging for researchers to stay up to date. There is a growing need for automatically generated, concise literature reviews to help researchers quickly identify papers relevant to their interests. Prior work over the past decade has focused on summarizing individual research papers, typically in the context of citation generation, while the relationships among multiple papers have largely been overlooked. Existing approaches primarily generate standalone citation sentences without addressing the need for expository and transition sentences to explain the relationships among multiple citations. In this work, we propose a feature-based, LLM-prompting approach to generate richer citation texts and simultaneously capture the complex relationships among multiple papers. Our expert evaluation reveals a strong correlation between human preference and integrative writing styles, indicating that readers favor high-level, abstract citations with transition sentences that weave them into a coherent narrative.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
76
305PosterIn-person
Ethical Reviewing:NLP and LLM Applications
From Generalist to Specialist: A Survey of Large Language Models for Chemistry
Yang Han, Ziping Wan, Lu Chen, Kai Yu and Xin Chen
Yang HanYang Han
Large Language Models (LLMs) have significantly transformed our daily life and established a new paradigm in natural language processing (NLP).
However, the predominant pretraining of LLMs on extensive web-based texts remains insufficient for advanced scientific discovery, particularly in chemistry.
The scarcity of specialized chemistry data, coupled with the complexity of multi-modal data such as 2D graphs, 3D structures and spectra, presents distinct challenges.
Although several studies have reviewed Pretrained Language Models (PLMs) in chemistry, there is a conspicuous absence of a systematic survey specifically focused on chemistry-oriented LLMs.
In this paper, we outline methodologies for incorporating domain-specific chemistry knowledge and multi-modal information into LLMs; we also conceptualize chemistry LLMs as agents that use chemistry tools and investigate their potential to accelerate scientific research.
Additionally, we summarize the existing benchmarks for evaluating the chemistry abilities of LLMs. Finally, we critically examine the current challenges and identify promising directions for future research.
Through this comprehensive survey, we aim to assist researchers in staying at the forefront of developments in chemistry LLMs and to inspire innovative applications in the field.
AtriumSession 2: Oral/Poster APoster Tue, Jan 21
11:00-12:30
11:0012:30
77
306PosterIn-person
Ethical Reviewing:Interpretability and Explainability
Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution
Milad Alshomary, Narutatsu Ri, Marianna Apidianaki, Ajay Patel, Smaranda Muresan and Kathleen McKeown
Milad Alshomary
Milad Alshomary
Recent state-of-the-art authorship attribution methods learn authorship representations of text in a latent, uninterpretable space, which hinders their usability in real-world applications. We propose a novel approach for interpreting learned embeddings by identifying representative points in the latent space and leveraging large language models to generate informative natural language descriptions of the writing style associated with each point. We evaluate the alignment between our interpretable and latent spaces and demonstrate superior prediction agreement over baseline methods.
Additionally, we conduct a human evaluation to assess the quality of these style descriptions and validate their utility in explaining the latent space. Finally, we show that human performance on the challenging authorship attribution task improves by +20% on average when aided with explanations from our method.
AtriumSession 4: Oral/Poster CPoster Tue, Jan 21
16:00-17:30
16:0017:30
78
310PosterVirtual
Multimodal and Grounded Language Acquisition, HRI:Ethical Reviewing
Multimodal and Grounded Language Acquisition, HRI
Read Before Grounding: Scene Knowledge Visual Grounding via Multi-step Parsing
HaiXiang Zhu, Lixian Su, ShuangMing Mao and Jing Ye
Haixiang Zhu
Haixiang Zhu
Visual grounding (VG) is an important task in vision and language that involves understanding the mutual relationship between query terms and images. However, existing VG datasets typically use simple and intuitive textual descriptions, with limited attribute and spatial information between images and text. Recently, the Scene Knowledge Visual Grounding (SK-VG) task has been introduced, which constructs VG datasets using visual knowledge and relational referential expressions. Due to the length of textual visual knowledge and the complexity of the referential relationships between entities, previous models have struggled with this task. Therefore, we propose ReadVG, a zero-shot, plug-and-play method that leverages the robust language understanding capabilities of Large Language Models (LLMs) to transform long visual knowledge texts into concise, information-dense visual descriptions. To improve the accuracy of target localisation, we employ a multi-step parsing algorithm that can progressively extract the query targets and their features from the visual knowledge and relational referencing expressions, thereby assisting multimodal models to more accurately localise the target for grounding purposes. Extensive experiments and case studies show that our approach can significantly improve the performance of multimodal grounding models.
GatherGather TBDGather
Gather TBD
TBDTBDTBD
79
314PosterIn-person
Ethical Reviewing:Natural Language Generation, Summarization and Simplification
Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller and Vera Schmitt
Qianli WangQianli Wang
Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
80
316PosterVirtual
Ethical Reviewing:Language Modeling
Language Modeling
BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation
Minchong Li, Feng Zhou and Xiaohui Song
Minchong Li
Minchong Li
mincoolee@gmail.com
In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-k teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.
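A minimal sketch of the top-k logits-difference idea described in the abstract, assuming PyTorch tensors of teacher and student logits; the exact BiLD formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def logits_difference_kl(lead_logits, follow_logits, k=8, tau=1.0):
    """KL between distributions over pairwise differences of the top-k logits,
    where the top-k positions are chosen by the *lead* model."""
    topk_idx = lead_logits.topk(k, dim=-1).indices              # (batch, k)
    lead_k = lead_logits.gather(-1, topk_idx)                   # (batch, k)
    follow_k = follow_logits.gather(-1, topk_idx)               # (batch, k)
    # pairwise differences encode the internal ranking of the top-k logits
    lead_diff = (lead_k.unsqueeze(-1) - lead_k.unsqueeze(-2)).flatten(1)
    follow_diff = (follow_k.unsqueeze(-1) - follow_k.unsqueeze(-2)).flatten(1)
    return F.kl_div(F.log_softmax(follow_diff / tau, dim=-1),
                    F.softmax(lead_diff / tau, dim=-1),
                    reduction="batchmean")

def bild_loss(teacher_logits, student_logits, k=8):
    # "bi-directional": one term led by the teacher's top-k, one by the student's
    t2s = logits_difference_kl(teacher_logits, student_logits, k)
    s2t = logits_difference_kl(student_logits, teacher_logits, k)
    return t2s + s2t
```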
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
81
327PosterVirtual
Ethical Reviewing:Low-resourced and Less Studied Languages
Low-resourced and Less Studied Languages
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz, Satak Kumar Dey, Ruwad Naswan, Hasnaen Adil, Khondker Salman Sayeed and Haz Sameen Shahgir
Tamzeed Mahfuz
Tamzeed Mahfuz
Each new generation of English-oriented Large Language Models (LLMs) exhibits enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low-resource languages. This prompts the question: Is there a need for LLMs dedicated to a particular low-resource language? We aim to explore this question for Bengali, a low-to-moderate resource Indo-Aryan language native to the Bengal region of South Asia.

We compare the performance of open-weight and closed-source LLMs such as LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse set of Bengali downstream tasks, including translation, summarization, paraphrasing, question-answering, and natural language inference. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent. Key challenges include inefficient tokenization of Bengali script by existing LLMs, leading to increased computational costs and potential performance degradation. Additionally, we highlight biases in machine-translated datasets commonly used for Bengali NLP tasks. We conclude that there is a significant need for a Bengali-oriented LLM, but the field currently lacks the high-quality pretraining and instruction-tuning datasets necessary to develop a highly effective model.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30
82
332PosterVirtual
Ethical Reviewing:Ethics, Bias, and Fairness
Ethics, Bias, and Fairness
Do language models practice what they preach? Examining language ideologies about gendered language reform encoded in LLMs
Julia Watson, Sophia S. Lee, Barend Beekhuizen and Suzanne Stevenson
Julia Watson
Julia Watson
We study language ideologies in text produced by LLMs through a case study on English gendered language reform (related to role nouns like congressperson/-woman/-man, and singular they). First, we find political bias: when asked to use language that is "correct" or "natural", LLMs use language most similarly to when asked to align with conservative (vs. progressive) values. This shows how LLMs' metalinguistic preferences can implicitly communicate the language ideologies of a particular political group, even in seemingly non-political contexts. Second, we find LLMs exhibit internal inconsistency: LLMs use gender-neutral variants more often when more explicit metalinguistic context is provided. This shows how the language ideologies expressed in text produced by LLMs can vary, which may be unexpected to users. We discuss the broader implications of these findings for value alignment.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
83
338PosterVirtual
NLP and LLM Applications:Ethical Reviewing
NLP and LLM Applications
T-MES: Trait-Aware Mix-of-Experts Representation Learning for Multi-trait Essay Scoring
Jiong Wang and Jie Liu
Jiong WangJiong Wang
Current research on automatic essay scoring tends to focus on evaluating the overall quality or a single trait of prompt-specific essays. However, when scoring essays in an educational context, it is essential not only to consider the overall score but also to provide feedback on various aspects of the writing. This helps students clearly identify areas for improvement, enabling them to engage in targeted practice. Although many methods have been proposed to address the scoring problem, they still suffer from insufficient learning of trait representations and overlook the diversity of, and correlations between, trait scores in the scoring process. To address this problem, we propose a novel multi-trait essay scoring method based on Trait-Aware Mix-of-Experts Representation Learning. Our method obtains trait-specific essay representations using a Mix-of-Experts scoring architecture. Furthermore, based on this scoring architecture, we propose a diversified trait-expert method to learn distinguishable expert weights. To facilitate multi-trait scoring, we introduce two trait-correlation learning strategies that learn the correlations among traits. Experimental results demonstrate the effectiveness of our method and show that, compared to existing methods, it achieves a further improvement in computational efficiency.
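A toy illustration of a trait-aware mixture-of-experts scoring head in the spirit of the abstract; the dimensions, gating scheme, and number of experts/traits are placeholder assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TraitMoEScorer(nn.Module):
    """Each trait gets its own softmax gate over a shared pool of experts,
    yielding one trait-specific representation and score per trait."""
    def __init__(self, dim=256, n_experts=4, n_traits=5):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gates = nn.Linear(dim, n_experts * n_traits)
        self.scorers = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_traits))
        self.n_experts, self.n_traits = n_experts, n_traits

    def forward(self, essay_repr):                          # (batch, dim)
        expert_out = torch.stack([e(essay_repr) for e in self.experts], dim=1)
        gate = self.gates(essay_repr).view(-1, self.n_traits, self.n_experts)
        gate = gate.softmax(dim=-1)                         # per-trait expert weights
        trait_repr = torch.einsum("bte,bed->btd", gate, expert_out)
        scores = torch.cat([s(trait_repr[:, t]) for t, s in enumerate(self.scorers)], dim=-1)
        return scores                                       # (batch, n_traits)
```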
GatherGather TBDGather
Gather TBD
TBDTBDTBD
84
344PosterVirtual
Multimodal and Grounded Language Acquisition, HRI:Ethical Reviewing
Multimodal and Grounded Language Acquisition, HRI
A Graph Interaction Framework on Relevance for Multimodal Named Entity Recognition with Multiple Images
Jiachen Zhao, Shizhou Huang and xin Lin
Jiachen Zhao
Jiachen Zhao
Posts containing multiple images offer significant research potential for Multimodal Named Entity Recognition. Previous methods determine whether images are related to the named entities in the text through similarity computation, for example using CLIP. However, this is not effective in some cases and is not conducive to task transfer, especially in multi-image scenarios. To address this issue, we propose a graph interaction framework on relevance (GIFR) for Multimodal Named Entity Recognition with multiple images. Humans can distinguish whether an image is relevant to named entities, but such capabilities are difficult to model. Therefore, we propose using reinforcement learning based on human preference to integrate this ability into the model and determine whether an image-text pair is relevant, which we refer to as relevance. To better leverage relevance, we construct a heterogeneous graph and introduce a graph transformer to enable information interaction. Experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
85
347PosterVirtual
Ethical Reviewing:Phonology, Morphology, and Word Segmentation
Phonology, Morphology, and Word Segmentation
Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation
Xuebin Wang, Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong and Yang Hou
Xuebin Wang
Xuebin Wang
Inspired by early research on exploring naturally annotated data for Chinese Word Segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to explicitly mine word boundaries from parallel speech-text data.
We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, giving pauses as candidate word boundaries.
Based on detailed analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries.
To more effectively utilize word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy.
We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2.
We have annotated about 1K sentences as the evaluation data of AISHELL2.
Experiments demonstrate the effectiveness of our proposed approach.
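A small sketch of the pause-mining step described above, operating on generic character-level alignments (character, start, end) rather than the MFA output format, whose parsing is omitted; the pause threshold is an assumption, and the paper's probability-based filtering is not reproduced.

```python
def candidate_boundaries(char_alignments, min_pause=0.05):
    """char_alignments: list of (char, start_sec, end_sec) from a forced aligner
    such as MFA (loading/parsing omitted). A silent gap longer than `min_pause`
    between consecutive characters is taken as a candidate word boundary."""
    boundaries = []
    for (c1, _, end1), (c2, start2, _) in zip(char_alignments, char_alignments[1:]):
        if start2 - end1 >= min_pause:
            boundaries.append((c1, c2, start2 - end1))  # boundary falls after c1
    return boundaries

# Example: a 120 ms pause between the 2nd and 3rd character yields one candidate.
print(candidate_boundaries([("我", 0.00, 0.18), ("们", 0.18, 0.40),
                            ("出", 0.52, 0.70), ("发", 0.70, 0.95)]))
```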
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
86
351OralIn-person
NLP and LLM Applications:Ethical Reviewing
RoBGuard: Enhancing LLMs to Assess Risk of Bias in Clinical Trial Documents
Changkai Ji, Bowen Zhao, Zhuoyao Wang, Yingwen Wang, Yuejie Zhang, Ying Cheng, Rui Feng and Xiaobo Zhang
Changkai JiChangkai Ji
Randomized Controlled Trials (RCTs) are rigorous clinical studies crucial for reliable decision-making, but their credibility can be compromised by bias. The Cochrane Risk of Bias tool (RoB 2) assesses this risk, yet manual assessments are time-consuming and labor-intensive. Previous approaches have employed Large Language Models (LLMs) to automate this process. However, they typically focus on manually crafted prompts and a restricted set of simple questions, limiting their accuracy and generalizability. Inspired by the human bias assessment process, we propose RoBGuard, a novel framework for enhancing LLMs to assess the risk of bias in RCTs. Specifically, RoBGuard integrates medical knowledge-enhanced question reformulation, multimodal document parsing, and multi-expert collaboration to ensure both completeness and accuracy. Additionally, to address the lack of suitable datasets, we introduce two new datasets: RoB-Item and RoB-Domain. Experimental results demonstrate RoBGuard's effectiveness on the RoB-Item dataset, outperforming existing methods.
Suite 7Session 6: Oral/Poster E
Applications 3
Wed, Jan 22
11:00-12:30
0:00Marco Rovera12:0012:15
87
352PosterVirtual
Ethical Reviewing:Information Extraction
Information Extraction
A Compressive Memory-based Retrieval Approach for Event Argument Extraction
Wanlong Liu, Enqi Zhang, shaohuan cheng, Dingyi Zeng, Li Zhou, Chen Zhang, Malu Zhang and Wenyu Chen
Wanlong LiuWanlong Liu
Recent works have demonstrated the effectiveness of retrieval augmentation in the Event Argument Extraction (EAE) task. However, existing retrieval-based EAE methods have two main limitations: (1) input length constraints and (2) the gap between the retriever and the inference model. These issues limit the diversity and quality of the retrieved information. In this paper, we propose a Compressive Memory-based Retrieval (CMR) mechanism for EAE, which addresses the two limitations mentioned above. Our compressive memory, designed as a dynamic matrix that effectively caches retrieved information and supports continuous updates, overcomes the limitations of input length. Additionally, after pre-loading all candidate demonstrations into the compressive memory, the model further retrieves and filters relevant information from the memory based on the input query, bridging the gap between the retriever and the inference model. Extensive experiments show that our method achieves new state-of-the-art performance on three public datasets (RAMS, WikiEvents, ACE05), significantly outperforming existing retrieval-based EAE methods.
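A generic associative-memory sketch, offered only as one plausible reading of a "dynamic matrix" that caches retrieved demonstrations without growing the input length; it is not the paper's CMR mechanism.

```python
import torch
import torch.nn.functional as F

class CompressiveMemory:
    """Fixed-size matrix memory: key/value pairs are written additively,
    then read back with a query vector."""
    def __init__(self, dim):
        self.M = torch.zeros(dim, dim)
        self.z = torch.zeros(dim)              # running normalizer

    def write(self, keys, values):             # keys, values: (n, dim)
        phi = F.elu(keys) + 1.0                # positive feature map
        self.M += phi.T @ values
        self.z += phi.sum(dim=0)

    def read(self, query):                     # query: (dim,)
        phi_q = F.elu(query) + 1.0
        return (phi_q @ self.M) / (phi_q @ self.z + 1e-6)
```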
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
88
355PosterIn-person
Ethical Reviewing:Machine Learning for CL/NLP
FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics
Yupei Du, Albert Gatt and Dong Nguyen
Yupei DuYupei Du
Despite the massive success of fine-tuning Pre-trained Language Models (PLMs), they remain susceptible to out-of-distribution input. Dataset cartography is a simple yet effective dual-model approach that improves the robustness of fine-tuned PLMs. It involves fine-tuning a model on the original training set (i.e. reference model), selecting a subset of important training instances based on the training dynamics, and fine-tuning again only on these selected examples (i.e. main model). However, this approach requires fine-tuning the same model twice, which is computationally expensive for large PLMs. In this paper, we show that 1) training dynamics are highly transferable across model sizes and pre-training methods, and that 2) fine-tuning main models using these selected training instances achieves higher training efficiency than empirical risk minimization (ERM). Building on these observations, we propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT). Compared with dataset cartography, FTFT uses more efficient reference models and aggressive early stopping. FTFT achieves robustness improvements over ERM while lowering the training cost by up to ~50%.
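A rough sketch of the dataset-cartography statistics that FTFT builds on, assuming the reference model's gold-label probabilities have already been recorded at each epoch; the selection rule shown (keeping high-variability examples) is the common cartography heuristic, not necessarily the paper's exact choice.

```python
import numpy as np

def cartography_stats(gold_probs):
    """gold_probs: (num_epochs, num_examples) probabilities the reference model
    assigns to the gold label at the end of each epoch."""
    confidence = gold_probs.mean(axis=0)      # mean gold-label probability
    variability = gold_probs.std(axis=0)      # how much it fluctuates across epochs
    return confidence, variability

def select_ambiguous(gold_probs, fraction=0.33):
    """Keep the most 'ambiguous' examples (highest variability), which cartography
    found most useful for robust fine-tuning of the main model."""
    _, variability = cartography_stats(gold_probs)
    n_keep = int(len(variability) * fraction)
    return np.argsort(-variability)[:n_keep]   # indices of selected training instances
```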
AtriumSession 4: Oral/Poster CPoster Tue, Jan 21
16:00-17:30
16:0017:30
89
357PosterIn-person
Low-resourced and Less Studied Languages:Ethical Reviewing
PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation
Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka and Masao Utiyama
Hour KaingHour Kaing
This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
90
358PosterVirtual
Reasoning, Question Answering, and Sentence-level Semantics:Ethical Reviewing
Reasoning, Question Answering, and Sentence-level Semantics
Relation Logical Reasoning and Relation-aware Entity Encoding for Temporal Knowledge Graph Reasoning
Longzhou Liu, Chenglong Xiao, Shanshan Wang and Tingwen Liu
Longzhou Liu
Longzhou Liu
Temporal Knowledge Graph Reasoning (TKGR) aims to predict future facts based on historical data. Current mainstream models primarily use embedding techniques, which predict missing facts by representing entities and relations as low-dimensional vectors. However, these models often consider only the structural information of individual entities and relations, overlooking the broader structure of the entire TKG. To address these limitations, we propose a novel model called Relation Logical Reasoning and Relation-aware Entity Encoding (RLEE), drawing inspiration from attention mechanisms and logical rule-based techniques. RLEE introduces a two-layer representation of the TKG: an entity layer and a relation layer. At the relation layer, we extract relation paths to mine potential logical correlations between different relations, learning relation embeddings through a process of relation logical reasoning. At the entity layer, we use the relation-aware attention mechanism to learn the entity embeddings specific to the predicted query relations. These learned relation and entity embeddings are then used to predict facts at future timestamps. When evaluated on five commonly used public datasets, RLEE consistently outperforms state-of-the-art baselines.
GatherGather Session 2GatherTue, Jan 28
07:00-08:30
7:008:30Yes
91
359PosterIn-person
Ethical Reviewing:Reasoning, Question Answering, and Sentence-level Semantics
Awakening Augmented Generation: Learning to Awaken Internal Knowledge of Large Language Models for Question Answering
Huanxuan Liao, Shizhu He, Yao Xu, Yuanzhe Zhang, Shengping Liu, Kang Liu and Jun Zhao
Huanxuan Liao
Huanxuan Liao
Retrieval-Augmented-Generation and Generation-Augmented-Generation have been proposed to enhance the knowledge required for question answering with Large Language Models (LLMs) by leveraging richer context. However, the former relies on external resources, and both require incorporating explicit documents into the context, which increases execution costs and susceptibility to noisy data during inference. Recent works indicate that LLMs model rich knowledge, but it is often not effectively activated and awakened. Inspired by this, we propose a novel knowledge-augmented framework, Awakening-Augmented-Generation (AAG), which mimics the human ability to answer questions using only thinking and recalling to compensate for knowledge gaps, thereby awakening relevant knowledge in LLMs without relying on external resources. AAG consists of two key components for awakening richer context. Explicit awakening fine-tunes a context generator to create a synthetic, compressed document that functions as symbolic context. Implicit awakening utilizes a hypernetwork to generate adapters based on the question and synthetic document, which are inserted into LLMs to serve as parameter context. Experimental results on three datasets demonstrate that AAG exhibits significant advantages in both open-domain and closed-book settings, as well as in out-of-distribution generalization. Our code will be available at https://github.com/Xnhyacinth/IAG.
AtriumSession 3: Oral/Poster BPoster Tue, Jan 21
14:00-15:30
14:0015:30
92
368PosterVirtual
Discourse and Pragmatics:Ethical Reviewing
Discourse and Pragmatics
Dying or Departing? Euphemism Detection for Death Discourse in Historical Texts
Ali Al-Laith, Alexander Conroy, Jens Bjerring-Hansen, Bolette Pedersen, Carsten Levisen and Daniel Hershcovich
Ali Mohammed Ali Allaith
Ali Mohammed Ali Allaith
Euphemisms are a linguistic device used to soften discussions of sensitive or uncomfortable topics, with death being a prominent example. In this paper, we present a study on the detection of death-related euphemisms in historical literary texts from a corpus containing Danish and Norwegian novels from the late 19th century. We introduce an annotated dataset of euphemistic and literal references to death, including both common and rare euphemisms, ranging from well-established terms to more culturally nuanced expressions. We evaluate the performances of state-of-the-art pre-trained language models fine-tuned for euphemism detection. Our findings show that fixed, literal expressions of death became less frequent over time, while metaphorical euphemisms grew in prevalence. Additionally, euphemistic language was more common in historical novels, whereas contemporary novels tended to refer to death more literally, reflecting the rise of secularism. These results shed light on the shifting discourse on death during a period when the concept of death as final became prominent.
GatherGather Session 3GatherTue, Jan 28
13:00-14:30
13:0014:30Yes
93
369OralIn-person
Information Retrieval and Text Mining:Ethical Reviewing
ITERATE: Image-Text Enhancement, Retrieval, and Alignment for Transmodal Evolution with LLMs
Chenhan Fu, Guoming Wang, Juncheng Li, Wenqiao Zhang, Rongxing Lu and Siliang Tang
Chenhan Fu
Guoming Wang
Chenhan Fu
Guoming Wang
Inspired by human cognitive behavior, we introduce the visual modality to enhance the performance of pure text-based question-answering tasks, building on the development of multimodal models. However, obtaining corresponding images through manual annotation often entails high costs. Faced with this challenge, an intuitive strategy is to use search engines or web scraping techniques to automatically obtain relevant image information. However, the images obtained by this strategy may be of low quality and may not match the context of the original task, which may fail to improve, or even decrease, performance on downstream tasks. In this paper, we propose a novel framework named "ITERATE", aimed at retrieving and optimizing the quality of images to improve the alignment between text and images. Inspired by evolutionary algorithms in reinforcement learning and driven by the synergy of large language models (LLMs) and multimodal models, ITERATE employs a series of strategic actions such as filtering, optimizing, and retrieving to acquire higher quality images, and repeats this process over multiple generations to enhance the quality of the entire image cluster. Our experimental results on the ScienceQA, ARC-Easy, and OpenDataEval datasets also verify the effectiveness of our method, showing improvements of 3.5%, 5%, and 7%, respectively.
Hall B-BSession 5: Oral/Poster D
Information extraction and retrieval 2
Wed, Jan 229:00-10:300:00Barbara Plank10:0010:15
94
371PosterVirtual
Information Retrieval and Text Mining:Ethical Reviewing
Information Retrieval and Text Mining
Multi-Graph Co-Training for Capturing User Intent in Session-based Recommendation
zhe yang and Tiantian Liang
Tiantian Liang
Tiantian Liang
Session-based recommendation focuses on predicting the next item a user will interact with based on sequences of anonymous user sessions. A significant challenge in this field is data sparsity due to the typically short-term interactions. Most existing methods rely heavily on users' current interactions, overlooking the wealth of auxiliary information available. To address this, we propose a novel model, the Multi-Graph Co-Training model (MGCOT), which leverages not only the current session graph but also similar session graphs and a global item relation graph. This approach allows for a more comprehensive exploration of intrinsic relationships and better captures user intent from multiple views, enabling session representations to complement each other. Additionally, MGCOT employs multi-head attention mechanisms to effectively capture relevant session intent and uses contrastive learning to form accurate and robust session representations. Extensive experiments on three datasets demonstrate that MGCOT significantly enhances the performance of session-based recommendations, particularly on the Diginetica dataset, achieving improvements up to 2.00% in P@20 and 10.70% in MRR@20. Resources have been made publicly available in our GitHub repository https://github.com/liang-tian-tian/MGCOT.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
95
372PosterVirtual
Multimodal and Grounded Language Acquisition, HRI:Ethical Reviewing
Multimodal and Grounded Language Acquisition, HRI
CAST: Cross-modal Alignment Similarity Test for Vision Language Models
Gautier Dagan, Olga Loginova and Anil Batra
Gautier Dagan
Gautier Dagan
Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. This test involves asking the models to identify similarities between two scenes through text-only, image-only, or both and then assess the truthfulness of the similarities they generate. Since there is no ground-truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
GatherGather Session 1GatherMon, Jan 27
19:00-20:30
19:0020:30Yes
96
380PosterIn-person
Ethical Reviewing:Language Modeling
Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models
Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley and Lina Yao
Chengkai Huang
Chengkai Huang
Retrieval-augmented large language models (LLMs) have been remarkably competent in various NLP tasks. However, previous works have observed that retrieval is not always helpful, especially when the LLM is already knowledgeable about the query to be answered. Motivated by this, Adaptive Retrieval-Augmented Generation (ARAG) studies retrieving only when the knowledge asked by the query is absent from the LLM. Previous ARAG works either require accessing the pre-training corpus or prompting with additional model inferences. Aiming to avoid such drawbacks, we propose to determine whether the model is knowledgeable about a query by inspecting the (contextualized) pre-trained token embeddings of LLMs. We hypothesize that such embeddings capture rich information about the model's intrinsic knowledge base, which enables an efficient way of judging the necessity to retrieve from an external corpus. Extensive experiments demonstrate our ARAG approach's superior performance across various benchmarks.
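A hypothetical sketch of deciding whether to retrieve based on the query's token embeddings; the "knowledge centroid" construction and the threshold are assumptions of this illustration, not the paper's criterion.

```python
import torch
import torch.nn.functional as F

def should_retrieve(token_embeddings, knowledge_centroids, threshold=0.45):
    """token_embeddings: (seq_len, dim) contextualized embeddings of the query.
    knowledge_centroids: (num_clusters, dim) centroids summarizing regions of the
    embedding space the model is assumed to 'know' well (an assumption of this
    sketch). Retrieve only when the query sits far from all of them."""
    q = F.normalize(token_embeddings.mean(dim=0), dim=-1)
    c = F.normalize(knowledge_centroids, dim=-1)
    familiarity = (c @ q).max().item()   # best cosine similarity to any centroid
    return familiarity < threshold       # unfamiliar query -> retrieve externally
```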
AtriumSession 6: Oral/Poster EPoster Wed, Jan 22
11:00-12:30
11:0012:30
97
386PosterIn-person
Ethical Reviewing:Lexical Semantics
Investigating the Contextualised Word Embedding Dimensions Specified for Contextual and Temporal Semantic Changes
Taichi Aida and Danushka Bollegala
Taichi AidaTaichi Aida
The sense-aware contextualised word embeddings (SCWEs) encode semantic changes of words within the contextualised word embedding (CWE) spaces.
Despite the superior performance of SCWEs on contextual/temporal semantic change detection (SCD) benchmarks, it remains unclear how these meaning changes are encoded in the embedding space.
To study this, we compare pre-trained CWEs and their fine-tuned versions on contextual and temporal semantic change benchmarks under Principal Component Analysis (PCA) and Independent Component Analysis (ICA) transformations.
Our experimental results reveal
(a) although only a small number of axes are specific to semantic changes of words in the pre-trained CWE space, this information gets distributed across all dimensions when fine-tuned, and
(b) in contrast to prior work studying the geometry of CWEs, we find that PCA better represents semantic changes than ICA within the top 10% of axes.
These findings encourage the development of more efficient SCD methods with a small number of SCD-aware dimensions.
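A brief sketch of the kind of PCA/ICA transformation applied to contextualized embeddings, using scikit-learn on stand-in data; the paper's actual SCD evaluation protocol is not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# X: (num_word_occurrences, dim) contextualized embeddings of target words;
# random data here stands in for real CWEs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))

pca = PCA(n_components=77).fit(X)                          # roughly top 10% of 768 axes
ica = FastICA(n_components=77, random_state=0, max_iter=1000).fit(X)

X_pca = pca.transform(X)   # scores along principal components
X_ica = ica.transform(X)   # scores along independent components
# A per-axis semantic-change score can then be computed, e.g. by comparing the
# distribution of scores for a word across two time periods or contexts.
```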
AtriumSession 6: Oral/Poster EPoster Wed, Jan 22
11:00-12:30
11:0012:30
98
387OralIn-person
Low-resourced and Less Studied Languages:Ethical Reviewing
Uncertainty Modelling in Under-Represented Languages with Bayesian Deep Gaussian Processes
Ubaid Azam, Imran Razzak, Shelly Vishwakarma and Shoaib Jameel
Imran Razzak
Imran Razzak
NLP models often face challenges with under-represented languages due to a lack of sufficient training data and language complexities. This can result in inaccurate predictions and a failure to capture the inherent uncertainties within these languages. This paper introduces a new method for modelling uncertainty in under-represented languages by employing deep Bayesian Gaussian Processes. We develop a novel framework that integrates prior knowledge and leverages kernel functions. This helps enable the quantification of uncertainty in predictions to overcome the data limitations in under-represented languages. The efficacy of our approach is validated through various experiments, and the results are benchmarked against existing methods to highlight the enhancements in prediction accuracy and measurement of uncertainty.
Suite 7Session 5: Oral/Poster D
Low-resource languages 1
Wed, Jan 229:00-10:300:00Maite Melero10:0010:15
99
388OralIn-person
Ethical Reviewing:Low-resourced and Less Studied Languages
Cross-lingual Text Classification Transfer: The Case of Ukrainian
Daryna Dementieva, Valeriia Khylenko and Georg Groh
Daryna Dementieva
Daryna Dementieva
Despite the extensive number of labeled datasets in the NLP text classification field, a persistent imbalance in data availability across languages remains evident. To support the fair development of NLP models, exploring the possibilities of effective knowledge transfer to new languages is crucial. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To the best of our knowledge, there is a substantial lack of Ukrainian corpora for typical text classification tasks, e.g., different types of style, harmful speech, or relationships between texts, while collecting such corpora from scratch requires considerable resources. In this work, we leverage state-of-the-art advances in NLP and explore cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test these approaches on three text classification tasks---toxicity classification, formality classification, and natural language inference (NLI)---providing the ``recipe'' for the optimal setup for each task.
Suite 7Session 5: Oral/Poster D
Low-resource languages 1
Wed, Jan 229:00-10:300:00Maite Melero9:009:15
100
390OralIn-person
Ethical Reviewing:NLP and LLM Applications
LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots
Dongge Han, Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Peter Bell and Amos Storkey
Dongge Han
Dongge Han
Large language models (LLMs) have shown significant potential for robotics applications, particularly task planning, by harnessing their language comprehension and text generation capabilities. However, in applications such as household robotics, a critical gap remains in the personalization of these models to household preferences. For example, an LLM planner may find it challenging to perform tasks that require personalization, such as deciding where to place mugs in a kitchen based on specific household preferences. We introduce LLM-Personalize, a novel framework designed to personalize LLM planners for household robotics. LLM-Personalize uses an LLM planner to perform iterative planning in multi-room, partially-observable household environments, utilizing a scene graph built dynamically from local observations. To personalize the LLM planner towards user preferences, our optimization pipeline integrates imitation learning and reinforced Self-Training. We evaluate LLM-Personalize on Housekeep, a challenging simulated real-world 3D benchmark for household rearrangements, demonstrating a more than 30 percent increase in success rate over existing LLM planners, showcasing significantly improved alignment with human preferences.
Suite 7Session 9: Oral/Poster G
Applications 4
Thu, Jan 239:00-10:300:00Yi Feng9:309:45