Research Output
Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
  Reproducibility is a cornerstone of scientific research, ensuring the reliability and generalisability of findings. The ReproNLP Shared Task on Reproducibility of Evaluations in NLP aims to assess the reproducibility of human evaluation studies. This paper presents a reproduction study of the human evaluation experiment reported in "Hierarchical Sketch Induction for Paraphrase Generation" by Hosking et al. (2022). The original study employed a human evaluation on Amazon Mechanical Turk, assessing the quality of paraphrases generated by their proposed model using three criteria: meaning preservation, fluency, and dissimilarity. In our reproduction study, we focus on the meaning preservation criterion and utilise the Prolific platform for participant recruitment, following the ReproNLP challenge's common approach to reproduction. We discuss the methodology, results, and implications of our reproduction study, comparing them to the original findings. Our findings contribute to the understanding of reproducibility in NLP research and highlight the potential impact of platform changes and evaluation criteria on the reproducibility of human evaluation studies.

  • Date:

    09 April 2024

  • Publication Status:

    Accepted

  • Funders:

    Edinburgh Napier Funded

Citation

Watson, L. N., & Gkatzia, D. (in press). Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation.

Authors

L. N. Watson and D. Gkatzia
