KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation
ACL (Findings), 2024
Despite significant advances in keyphrase extraction and keyphrase generation methods, the predominant evaluation approach still relies on exact matching with human references. This scheme fails to recognize systems that generate keyphrases semantically equivalent to the references, or diverse keyphrases that carry practical utility. To better assess the capability of keyphrase systems, we propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility. For each aspect, we design semantic-based metrics to reflect the evaluation objectives. Meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences than a range of previously used metrics. Using KPEval, we re-evaluate 20 keyphrase systems and discover that (1) the best model differs depending on the evaluated aspect; (2) utility in downstream tasks does not always correlate with reference agreement; and (3) large language models exhibit strong performance with few-shot prompting, especially under reference-free evaluation.
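To make the contrast with exact matching concrete, below is a minimal sketch of embedding-based soft matching between predicted and reference keyphrases. The encoder choice (`all-MiniLM-L6-v2`) and the 0.7 similarity threshold are illustrative assumptions, not the paper's exact metric definitions.

```python
# Sketch of semantic (soft) matching, in contrast to exact string matching.
# The embedding model and threshold are illustrative assumptions only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not KPEval's

def semantic_f1(predicted: list[str], references: list[str],
                threshold: float = 0.7) -> float:
    """Soft-match F1: a phrase counts as matched when its best cosine
    similarity against the other set meets `threshold`."""
    if not predicted or not references:
        return 0.0
    # Pairwise cosine similarity: rows = predictions, columns = references.
    sim = util.cos_sim(model.encode(predicted), model.encode(references))
    precision = (sim.max(dim=1).values >= threshold).float().mean().item()
    recall = (sim.max(dim=0).values >= threshold).float().mean().item()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# "neural networks" can match "deep neural network" semantically,
# whereas exact matching would score the pair zero.
print(semantic_f1(["neural networks", "keyphrase generation"],
                  ["deep neural network", "keyphrase extraction"]))
```

Under exact matching, both predictions above would be counted as wrong; a semantic metric of this kind credits near-paraphrases, which is the motivation behind KPEval's reference-agreement aspect.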