Statistical Significance Testing for Natural Language Processing

Dror, Rotem; Peled-Cohen, Lotem; Shlomov, Segev

Product Description

Data-driven experimental analysis has become the main evaluation tool for Natural Language Processing (NLP) algorithms. In fact, in the last decade, it has become rare to see an NLP paper, particularly one that proposes a new algorithm, that does not include extensive experimental analysis, and the number of tasks, datasets, domains, and languages involved is constantly growing. This emphasis on empirical results highlights the role of statistical significance testing in NLP research: if we, as a community, rely on empirical evaluation to validate our hypotheses and reveal the correct language processing mechanisms, we had better be sure that our results are not coincidental.

The goal of this book is to discuss the main aspects of statistical significance testing in NLP. Our guiding assumption throughout the book is that the basic question NLP researchers and engineers deal with is whether or not one algorithm can be considered better than another. This question drives the field forward, because answering it enables steady progress toward better technology for language processing challenges. In practice, researchers and engineers would like to draw the right conclusion from a limited set of experiments, and this conclusion should hold for other experiments as well, whether on datasets they do not have at their disposal or experiments they cannot run because of limited time and resources. The book hence discusses the opportunities and challenges in using statistical significance testing in NLP, from the point of view of experimental comparison between two algorithms. We cover topics such as choosing an appropriate significance test for the major NLP tasks, dealing with the unique aspects of significance testing for non-convex deep neural networks, accounting for a large number of comparisons between two NLP algorithms in a statistically valid manner (multiple hypothesis testing), and, finally, the unique challenges posed by the nature of the data and the practices of the field.
