EMNLP 2021 | Roche and Boğaziçi University Research Team Proposes Balancing Strategies for Long-Tailed Distributions in Multi-Label Text Classification

Posted by: 我在思考中

2021-11-10 15:37

Lead: Balanced loss functions offer an effective strategy for applying multi-label text classification.

Author | Yi Huang

About the author: Yi Huang, first author of the paper, is a data scientist at Roche whose research focuses on biomedical applications of natural language processing.

Paper: https://arxiv.org/pdf/2109.04712.pdf

Source code: https://github.com/Roche/BalancedLossNLP

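Abstract

Multi-label text classification is a classic task in natural language processing: a model is trained to assign a variable number of category labels to a given text. In practice, however, the amount of training data often differs widely across labels (the class imbalance problem) and may even follow a long-tailed distribution, which degrades the resulting model. Resampling and reweighting are commonly used to address class imbalance, but because labels are correlated in the multi-label setting, existing methods end up oversampling the frequent labels. In this work we explore strategies that modify the loss function, in particular the use of balanced loss functions for multi-label text classification. In experiments on a general-domain dataset (Reuters-21578, 90 labels) and a biomedical dataset (PubMed, 18,211 labels), we find that the distribution-balanced loss performs better overall than commonly used loss functions. Researchers recently showed that this type of loss improves image recognition models; our work further demonstrates its effectiveness in natural language processing.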

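Introduction

Multi-label text classification is one of the core tasks in natural language processing (NLP). It aims to find multiple relevant labels for a given text from a label set, and is used in scenarios such as search (Prabhu et al., 2018) and product categorization (Agrawal et al., 2013). Figure 1 shows sample data from Reuters-21578, a general-purpose multi-label text classification dataset (Hayes and Weinstein, 1990).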

Figure 1. Sample data from Reuters-21578 (article titles only).

The number after each label is the number of data instances in the dataset that carry that label.

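When the label data exhibit a long-tailed distribution (class imbalance) and label linkage (class co-occurrence), multi-label text classification becomes more complicated (Figure 2). A long-tailed distribution is an imbalanced setting in which a small subset of labels (head labels) have many data instances while most labels (tail labels) have only a few. Label linkage refers to head labels co-occurring with tail labels, which tilts the model's weight toward the head labels. Existing NLP solutions include, among others, resampling tail labels during classification (Estabrooks et al., 2004; Charte et al., 2015), incorporating class co-occurrence information into model initialization (Kurata et al., 2016), and a multi-task architecture that mixes head and tail labels (Yang et al., 2020). These approaches, however, either depend on a specially designed model architecture or are not suited to long-tailed data.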

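Figure 2. Long-tailed distribution and label linkage in Reuters-21578.

The heatmap matrix shows the conditional probability p(i|j) of the label in column i among the data instances that carry the label in row j.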
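
As a rough illustration of how such a heatmap could be computed, the sketch below derives p(i|j) from a binary instance-by-label matrix. The function name `conditional_cooccurrence` and the toy matrix are hypothetical and are not taken from the paper's code.

```python
def conditional_cooccurrence(label_matrix):
    """Return m where m[j][i] = p(label i present | label j present).

    label_matrix: list of rows, one per instance, each a list of 0/1 flags.
    Rows of m for labels that never occur are left as all zeros.
    """
    n_labels = len(label_matrix[0])
    m = [[0.0] * n_labels for _ in range(n_labels)]
    for j in range(n_labels):
        rows_with_j = [row for row in label_matrix if row[j] == 1]
        if not rows_with_j:
            continue
        for i in range(n_labels):
            m[j][i] = sum(row[i] for row in rows_with_j) / len(rows_with_j)
    return m


# Toy data: label 0 is a "head" label (4 instances), label 1 a "tail" label (1 instance).
toy = [[1, 1], [1, 0], [1, 0], [1, 0]]
cond = conditional_cooccurrence(toy)
print(cond[1][0])  # p(head | tail)
print(cond[0][1])  # p(tail | head)
```

In this toy case the tail label always appears together with the head label (p = 1.0), while the head label appears with the tail label only a quarter of the time (p = 0.25). That asymmetry is what label linkage refers to.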

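In recent years, multi-label classification has also been studied extensively in computer vision (CV). Loss-function optimization strategies have been applied to a range of CV tasks, such as object recognition (Durand et al., 2019; Milletari et al., 2016), semantic segmentation (Ge et al., 2018), and medical imaging (Li et al., 2020a). Balanced loss functions such as Focal loss (Lin et al., 2017), Class-balanced loss (Cui et al., 2019), and Distribution-balanced loss (Wu et al., 2020) provide solutions to the long-tail and label-linkage problems in multi-label image classification. Because a modified loss function can be plugged into common models independently of their architecture, similar loss-function strategies are gradually being explored in NLP as well (Li et al., 2020b; Cohan et al., 2020). For example, Li et al. (2020b) brought Dice loss (Milletari et al., 2016) from medical image segmentation into NLP and significantly improved model performance on several tasks.

In this work we introduce a new class of balanced loss functions into NLP for multi-label text classification, and run experiments on Reuters-21578 (a small general-domain dataset) and PubMed (a large biomedical dataset). On both datasets the distribution-balanced loss outperforms the other loss functions on the overall metrics and markedly improves performance on tail labels. We believe balanced loss functions offer an effective strategy for applying multi-label text classification.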

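Methods

Loss functions

In multi-label text classification, binary cross entropy (BCE) is a commonly used loss function (Bengio et al., 2013). Plain BCE is easily dominated by the large number of head labels or negative samples. In recent years several new loss functions have rebalanced training by adjusting the weights in BCE. We review three such designs here.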

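Focal loss (FL) weights BCE according to how "hard" it is for the model to assign each label to a data instance (Lin et al., 2017). For the same instance, labels that are hard to assign (p far from the true value) receive a higher weight, relative to BCE, than labels that are easy to classify (p close to the true value). Because FL adapts well during training, the two loss functions below also use it as a component.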
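
A minimal sketch of the per-label BCE and focal terms may make the weighting concrete. It assumes the standard focal form from Lin et al. (2017) with a focusing parameter `gamma`; the value 2.0 and the function names are illustrative assumptions, not settings reported in this article.

```python
import math


def bce_term(p, y):
    """Binary cross entropy for one label: p is the predicted probability, y is 0 or 1."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_term(p, y, gamma=2.0):
    """Focal term for one label: BCE scaled by (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true outcome
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Ratio of focal to BCE for an easy and a hard positive label.
easy_ratio = focal_term(0.95, 1) / bce_term(0.95, 1)
hard_ratio = focal_term(0.30, 1) / bce_term(0.30, 1)
print(easy_ratio, hard_ratio)
```

The scaling factor never exceeds 1, so the emphasis is relative: the easy label keeps about 0.25% of its BCE loss, while the hard label keeps 49%.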

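Class-balanced focal loss (CB) estimates the effective number of samples, so that the marginal benefit of each additional training instance for a label is taken into account, and adjusts the weights across labels that are supported by different amounts of training data (Cui et al., 2019).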
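
The effective-number idea can be sketched as follows, assuming the weight form (1 - beta) / (1 - beta^n) from Cui et al. (2019). The default `beta=0.9` and the function name are assumptions made for illustration.

```python
def class_balanced_weight(n_instances, beta=0.9):
    """Inverse of the effective number of samples for a label with n_instances examples."""
    effective_number = (1.0 - beta ** n_instances) / (1.0 - beta)
    return 1.0 / effective_number


# Each additional instance adds less and less to the effective number,
# so the weight falls quickly at first and then flattens out.
for n in (1, 5, 50, 1000):
    print(n, class_balanced_weight(n))
```

A label with a single instance gets weight 1.0, while a label with 1,000 instances approaches the floor of 1 - beta. In CB this weight would multiply the focal term of the corresponding label.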

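Distribution-balanced loss (DB) adds two components on top of FL (Wu et al., 2020). The first, the Rebalancing component, reduces the redundant information introduced by label linkage. The second, the Negative Tolerant Regularization (NTR) component, adjusts the weights across labels with different numbers of positive and negative samples and lowers the threshold for tail labels.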
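
The two components can be sketched roughly as below, following the general form described by Wu et al. (2020). This is a simplified illustration under stated assumptions: the hyperparameter values (`alpha`, `beta`, `mu`, `lam`, `gamma`) are placeholders, `class_bias` is a hypothetical per-label offset whose frequency-based definition is not reproduced here, and the function names are not those of the paper's repository.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rebalancing_weights(label_row, label_counts, alpha=0.1, beta=10.0, mu=0.3):
    """Smoothed rebalancing weight for every label of one instance.

    label_row:    0/1 flags for the instance, one per label.
    label_counts: number of training instances per label.
    alpha, beta, mu are placeholder smoothing hyperparameters.
    """
    # How likely the instance is to be drawn when sampling through any of its
    # positive labels; instances with many or rare labels are drawn more often.
    p_instance = sum(1.0 / n for y, n in zip(label_row, label_counts) if y == 1)
    if p_instance == 0.0:
        raise ValueError("the instance needs at least one positive label")
    weights = []
    for n in label_counts:
        r = (1.0 / n) / p_instance  # class-level over instance-level probability
        weights.append(alpha + sigmoid(beta * (r - mu)))
    return weights


def ntr_focal_term(z, y, class_bias=0.0, lam=2.0, gamma=2.0):
    """NTR-style focal term for one label, computed from the logit z.

    Negative labels use a sigmoid sharpened by lam and a loss scaled by 1/lam,
    so confidently rejected negatives contribute less.
    """
    if y == 1:
        q = sigmoid(z - class_bias)
        return -((1.0 - q) ** gamma) * math.log(q)
    q = sigmoid(lam * (z - class_bias))
    return -(1.0 / lam) * (q ** gamma) * math.log(1.0 - q)


# Instance tagged with a head label (1000 instances) and a tail label (10 instances).
w_head, w_tail = rebalancing_weights([1, 1], [1000, 10])
print(w_head, w_tail)
```

For this instance the tail label receives a much larger weight than the co-occurring head label, which is how the Rebalancing component counteracts label linkage. Multiplying these weights into `ntr_focal_term` would give a DB-style per-label loss.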

The exact form of these loss functions is shown in Figure 3 (the summation and averaging terms are omitted for simplicity).

Figure 3. Detailed design of the loss functions.

Datasets

We use two multi-label text classification datasets that differ in size and domain (Table 1). Reuters-21578 contains more than ten thousand news articles published by Reuters in 1987 (Hayes and Weinstein, 1990). We follow the train-test split used by Yang and Liu (1999) and divide the 90 labels evenly into head (30 labels, each with ≥35 instances), medium (31 labels, each with 8-35 instances), and tail (30 labels, each with ≤8 instances) subsets. The PubMed dataset comes from the BioASQ challenge (licence: 8283NLM123) and contains the titles and abstracts of PubMed articles together with their biomedical subject heading annotations (MeSH) (Tsatsaronis et al., 2015; Coordinators, 2017). Similarly, its 18,211 labels are split by quantile into head (6,018 labels, each with ≥50 instances), medium (5,581 labels, each with 15-50 instances), and tail (6,612 labels, each with ≤15 instances) subsets.
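
A grouping of this kind could be sketched as follows. The stated ranges share their boundary values, so this sketch assumes that a count equal to a boundary goes to the head or tail group; that choice, the function name, and the toy counts are all illustrative assumptions.

```python
def split_by_frequency(label_counts, head_min, tail_max):
    """Group labels into head / medium / tail by their training instance counts."""
    groups = {"head": [], "medium": [], "tail": []}
    for label, n in sorted(label_counts.items()):
        if n >= head_min:
            groups["head"].append(label)
        elif n <= tail_max:
            groups["tail"].append(label)
        else:
            groups["medium"].append(label)
    return groups


# Made-up counts, grouped with the Reuters-21578 thresholds quoted above.
toy_counts = {"label_a": 400, "label_b": 35, "label_c": 20, "label_d": 8, "label_e": 1}
groups = split_by_frequency(toy_counts, head_min=35, tail_max=8)
print(groups)
```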

Table 1. Basic information on the experimental datasets.

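Experiments

We compare the different loss functions against a classic SVM one-vs-rest model. For each dataset and model we compute micro-F1 and macro-F1 scores over the full label set and over the head, medium, and tail label subsets (Wu et al., 2019; Lipton et al., 2014). Table 2 summarizes the results for the different loss functions. On Reuters-21578, BCE performs worst. The effect of the long-tailed distribution can be seen by comparing micro-F1 with macro-F1 and by comparing scores across the label groups. On PubMed, where the imbalance is more pronounced, the effect of the long tail is larger.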
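
To see why the gap between micro-F1 and macro-F1 reflects the long tail, consider this small sketch. It uses the usual definitions (micro-F1 pools counts over all labels, macro-F1 averages per-label F1); the helper names and toy predictions are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positive, false positive and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for 0/1 label matrices given as lists of rows."""
    n_labels = len(y_true[0])
    per_label_f1 = []
    total_tp = total_fp = total_fn = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label_f1.append(f1_from_counts(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    macro = sum(per_label_f1) / n_labels
    return micro, macro


# The head label (column 0) is always predicted correctly; the tail label (column 1) is missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

Here micro-F1 stays high (8/9) because the head label dominates the pooled counts, while macro-F1 drops to 0.5 because the missed tail label counts as much as the head label.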

Table 2. Comparison of experimental results.

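On Reuters-21578, the FL, CB, R-FL, and NTR-FL losses perform similarly to BCE on head labels but better than BCE on medium and tail labels, which shows that they mitigate the imbalance problem. DB gives the largest improvement on tail labels, and its overall performance also exceeds earlier solutions evaluated on the same dataset, such as Binary Relevance, EncDec, CNN, CNN-RNN, Optimal Completion Distillation, and GNN (Nam et al., 2017; Pal et al., 2020; Tsai and Lee, 2020). On PubMed, BCE fails on medium and tail labels, so we use FL as a stronger baseline. All other loss functions outperform FL on medium and tail labels. DB again shows strong results overall and on medium and tail labels.

We further ablate DB by removing one component at a time: removing NTR gives R-FL, removing Rebalancing gives NTR-FL, and removing FL gives DB-0FL. Comparing the three ablated models reveals the contribution of each component. As Table 2 shows, on both datasets removing the NTR component (R-FL) or the FL component (DB-0FL) lowers performance in every subgroup. Removing the Rebalancing component (NTR-FL) yields a similar overall micro-F1, but the overall macro-F1 and the F1 scores on medium and tail labels fall short of DB, which shows the value of adding the Rebalancing component. Finally, we also combine NTR-FL with CB to obtain a new loss function, CB-NTR, which beats CB on every F1 score on both datasets. The only difference between CB-NTR and DB is that CB weights replace the Rebalancing weights; DB performs better than or very close to CB-NTR on medium and tail labels, possibly because handling label linkage through the Rebalancing weights improves the model.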

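Conclusion

To address class imbalance in multi-label text classification, we studied loss-function optimization strategies and systematically compared a range of balanced loss functions. We introduce DB into NLP for the first time and design a new balanced loss function, CB-NTR. Experiments on the open datasets Reuters-21578 (90 labels, general domain) and PubMed (18,211 labels, biomedical domain) show that DB outperforms the other loss functions. This study demonstrates that optimizing the loss function can effectively address class imbalance in multi-label text classification. Because the strategy only requires changing the loss function, it is compatible with a wide range of neural-network-based model frameworks and is also applicable to other NLP tasks affected by long-tailed distributions.

Shi Fujun, CIO of Roche Pharma China: This work comes from the collaborative team's exploration of deep learning applications in biomedicine. Compared with everyday text, biomedical corpora tend to be more specialized and more sparsely annotated, so AI applications face a "last mile" deployment challenge. Starting from problems such as the long-tailed distribution of sparse annotations, the paper brings in loss functions from cutting-edge CV research and refines them, so that existing NLP models can, without changing their architecture, shift training resources toward classes with fewer instances and thereby improve overall performance. We are glad to see that the strategy is equally effective on everyday text facing similar problems, and we hope to keep working closely with universities and companies on the research and application of frontier technologies.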

References:

Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13–24.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Francisco Charte, Antonio J Rivera, María J del Jesus, and Francisco Herrera. 2015. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163:3–16.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.

NCBI Resource Coordinators. 2017. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 46(D1):D8–D13.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. Class-balanced loss based on effective number of samples. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9260–9269.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

T. Durand, N. Mehrasa, and G. Mori. 2019. Learning a deep convnet for multi-label classification with partial labels. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 647–657, Los Alamitos, CA, USA. IEEE Computer Society.

Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz. 2004. A multiple resampling method for learning from imbalanced data sets. Computational intelligence, 20(1):18–36.

Weifeng Ge, Sibei Yang, and Yizhou Yu. 2018. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Philip J. Hayes and Steven P. Weinstein. 1990. Construe/tis: A system for content-based indexing of a database of news stories. In Proceedings of the Second Conference on Innovative Applications of Artificial Intelligence, IAAI ’90, pages 49–64. AAAI Press.

Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 521–526.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

Jianqiang Li, Guanghui Fu, Yueda Chen, Pengzhi Li, Bo Liu, Yan Pei, and Hui Feng. 2020a. A multilabel classification model for full slice brain computerised tomography image. BMC Bioinformatics, 21(6):200.

Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. 2020b. Dice loss for data-imbalanced NLP tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 465–476, Online. Association for Computational Linguistics.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, Los Alamitos, CA, USA. IEEE Computer Society.

Zachary C. Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. 2014. Optimal thresholding of classifiers to maximize f1 measure. In Machine Learning and Knowledge Discovery in Databases, pages 225–239, Berlin, Heidelberg. Springer Berlin Heidelberg.

Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571.

Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J Kim, and Johannes Fürnkranz. 2017. Maximizing subset accuracy with recurrent neural networks in multilabel classification. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Ankit Pal, Muru Selvakumar, and Malaikannan Sankarasubbu. 2020. Magnet: Multi-label text classification using attention-based graph neural network. In ICAART (2), pages 494–505.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. 2018. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pages 993–1002.

Che-Ping Tsai and Hung-yi Lee. 2020. Order-free learning alleviating exposure bias in multi-label classification. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 6038–6045. AAAI Press.

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artieres, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. 2015. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Jiawei Wu, Wenhan Xiong, and William Yang Wang. 2019. Learning to learn and predict: A meta-learning approach for multi-label classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4354–4364, Hong Kong, China. Association for Computational Linguistics.

Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. 2020. Distribution-balanced loss for multi-label classification in long-tailed datasets. In Computer Vision – ECCV 2020, pages 162–178, Cham. Springer International Publishing.

Wenshuo Yang, Jiyi Li, Fumiyo Fukumoto, and Yanming Ye. 2020. HSCNN: A hybrid-Siamese convolutional neural network for extremely imbalanced multi-label text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6716–6722, Online. Association for Computational Linguistics.

Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, pages 42–49, New York, NY, USA. Association for Computing Machinery.

This article is copyrighted by Leiphone (雷峰網); reproduction without authorization is prohibited.
