MetaKP

Highlights

We introduce on-demand keyphrase generation, a novel paradigm that requires keyphrase predictions to conform to specific high-level goals or intents. We release MetaKP, a large-scale benchmark covering four datasets, 7500 documents, and 3760 goals from the news and biomedical text domains.

Benchmark Construction

We build a scalable labeling pipeline that combines GPT-4 and human annotators to construct high-quality goals from keyphrases.

We run the pipeline on KPTimes, DUC2001, KPBiomed, and PubMed to collect a diverse dataset covering two domains, 7500 documents, and 3760 goals.

Modeling

We develop both unsupervised and supervised methods to perform on-demand keyphrase generation.

For the unsupervised approach, we design a self-consistency prompting method that leverages the frequency and rank information in samples collected from large language models (LLMs). Concretely, we collect K samples, each of which consists of a sequence of keyphrases. Each phrase is then scored according to how often it appears across the samples and how highly it is ranked within them.
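As a rough illustration of aggregating K samples by frequency and rank, the sketch below sums reciprocal ranks across samples. The reciprocal-rank rule, the function name `aggregate_samples`, and the `top_n` parameter are assumptions made for this example; they are not the scoring formula used in the paper.

```python
from collections import defaultdict


def aggregate_samples(samples, top_n=5):
    """Rank phrases by summed reciprocal rank over all samples.

    A phrase scores higher when it appears in many samples (frequency)
    and near the front of each sample (rank). This rule is an
    illustrative assumption, not the paper's formula.
    """
    scores = defaultdict(float)
    for sample in samples:
        seen = set()
        for rank, phrase in enumerate(sample, start=1):
            key = phrase.strip().lower()
            # Skip empty strings and repeats within the same sample.
            if not key or key in seen:
                continue
            seen.add(key)
            scores[key] += 1.0 / rank
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Three hypothetical LLM samples for one (document, goal) pair.
samples = [
    ["vaccine rollout", "public health", "supply chain"],
    ["public health", "vaccine rollout"],
    ["vaccine rollout", "funding"],
]
top = aggregate_samples(samples, top_n=2)
```

Here "vaccine rollout" comes first because it appears in all three samples and leads two of them.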

For the supervised method, we design a multi-task learning approach to fine-tune a sequence-to-sequence pre-trained language model to self-determine the relevance of a goal and selectively generate keyphrases.
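One possible way to let a single sequence-to-sequence model both judge goal relevance and generate keyphrases is to give irrelevant goals a dedicated rejection target. The sketch below builds such training pairs. The input template, the `NO_KEYPHRASE` marker, and the `build_example` helper are hypothetical choices for illustration and may differ from the formulation in the paper.

```python
# Hypothetical marker the model learns to emit for an irrelevant goal.
NO_KEYPHRASE = "<no_relevant_keyphrase>"


def build_example(document, goal, keyphrases):
    """Build one text-to-text training pair for a (document, goal) input.

    If the goal has gold keyphrases, the target is their joined string.
    Otherwise the target is the rejection marker, so the same model also
    learns to decide whether the goal is relevant.
    """
    source = f"goal: {goal} document: {document}"
    target = " ; ".join(keyphrases) if keyphrases else NO_KEYPHRASE
    return {"source": source, "target": target}


doc = "The city expanded its vaccine rollout to rural clinics."
relevant = build_example(doc, "public health measures", ["vaccine rollout"])
irrelevant = build_example(doc, "sports results", [])
```

Mixing both kinds of pairs in one training set lets the model abstain at inference time instead of forcing keyphrases for a goal the document does not address.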

Results

Our experiments reveal the following insights:

MetaKP is a challenging benchmark for keyphrase generation. Flan-T5-XL, the strongest fine-tuned model, achieves an average Satisfaction Rate (SR) of only 0.609 across all the datasets, and zero-shot prompting of GPT-4o, a strong LLM, achieves only 0.492 SR.

The proposed self-consistency prompting approach greatly improves the performance of LLMs, enabling GPT-4o to achieve 0.548 SemF1, surpassing the performance of a fully fine-tuned BART-base model.

Supervised fine-tuning can fail to generalize to out-of-distribution test data. By contrast, the LLM-based unsupervised method achieves consistent performance across all domains, especially in the news domain, where GPT-4o outperforms the supervised Flan-T5-XL by 19% in out-of-distribution testing.

We provide more analyses on both methods as well as qualitative examples in the paper.

Application

Finally, we demonstrate the potential of on-demand keyphrase generation as general NLP infrastructure. Specifically, we use event detection for epidemic prediction as a test bed. By constructing simple goals from an event ontology and extracting relevant keyphrases from social media text, we show that an on-demand keyphrase generation model can extract epidemic-related trends similar to those found by an event detection model trained on task-specific data.
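The trend-extraction idea can be sketched as counting, per day, the posts for which at least one ontology-derived goal yields a keyphrase. Everything below is an assumption for illustration: the goal strings, the `daily_trend` helper, and `toy_generate`, which is a keyword-matching stand-in for a real on-demand keyphrase generation model.

```python
from collections import Counter


def daily_trend(posts, goals, generate):
    """Count posts per date that produce a keyphrase for any goal.

    `posts` is an iterable of (date, text) pairs and `generate` is any
    callable (text, goal) -> list of keyphrases.
    """
    counts = Counter()
    for date, text in posts:
        if any(generate(text, goal) for goal in goals):
            counts[date] += 1
    return dict(sorted(counts.items()))


def toy_generate(text, goal):
    """Toy stand-in for a model: return the words that start with the goal."""
    return [w for w in text.lower().split() if w.startswith(goal)]


# Hypothetical goals derived from event types in an epidemic ontology.
goals = ["infection", "symptom"]
posts = [
    ("2020-03-01", "new infection reported"),
    ("2020-03-01", "nice weather today"),
    ("2020-03-02", "fever is a common symptom"),
    ("2020-03-02", "another infection cluster"),
]
trend = daily_trend(posts, goals, toy_generate)
```

Swapping `toy_generate` for a trained model keeps the aggregation logic unchanged, which is what makes the keyphrase generator usable as shared infrastructure.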

BibTeX

@article{wu2024metakp,
  title={MetaKP: On-Demand Keyphrase Generation},
  author={Di Wu and Xiaoxian Shen and Kai-Wei Chang},
  year={2024},
  eprint={2407.00191},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.00191},
}