Using machine learning for the topic annotation of oral speech corpus texts
Pogodaeva Elena Nikolaevna
Tomsk State University
Submitted: 20.02.2024
Abstract. The research aims to determine how effective the thesaurus method is for compiling a list of topic classes when machine learning is used for the topic classification of sociolinguistic interview texts. The paper considers the potential of machine learning for the topic annotation of linguistic corpus materials. The analyzed material is polytopical because it belongs to the genre of dialogical speech. The hierarchical structure of the topics, identified through a preliminary introspective analysis of the texts, can be described using a thesaurus. The paper discusses the results of applying an unsupervised machine learning method with two sets of topic class names: the list of topics used in manual text annotation and an extended list of micro-topics whose names were selected from a Russian-language thesaurus. The novelty of the paper is that it is the first to propose the thesaurus method for selecting topic labels for the zero-shot classification of weakly structured Russian texts. The findings show that a more detailed lexical description of topic classes improves the classification result.
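The comparison of the two label sets can be illustrated with a minimal sketch. It is not the method used in the paper: the zero-shot model is replaced here by a simple token-match scorer so that the example runs without external dependencies, and the topic names, the thesaurus fragment, and the sample text are all hypothetical. The sketch only shows why a list of micro-topics drawn from a thesaurus might detect topics that a short list of coarse names misses in a polytopical fragment.

```python
from collections import Counter

# Hypothetical coarse topic names, standing in for the manual annotation list.
COARSE_TOPICS = ["family", "work"]

# Hypothetical thesaurus fragment: coarse topic -> micro-topic names.
THESAURUS = {
    "family": ["family", "parents", "children", "marriage"],
    "work": ["work", "salary", "profession", "colleagues"],
}

PUNCTUATION = ".,!?;:"


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each token."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]


def score_labels(text, labels):
    """Stand-in for a zero-shot model: count exact token matches per label."""
    counts = Counter(tokenize(text))
    return {label: counts[label] for label in labels}


def classify_coarse(text):
    """Assign every coarse topic whose own name occurs in the text."""
    scores = score_labels(text, COARSE_TOPICS)
    return {topic for topic, score in scores.items() if score > 0}


def classify_with_thesaurus(text):
    """Assign a coarse topic if any of its micro-topic names occurs in the text."""
    found = set()
    for topic, micro_topics in THESAURUS.items():
        scores = score_labels(text, micro_topics)
        if any(score > 0 for score in scores.values()):
            found.add(topic)
    return found


# Hypothetical interview fragment that touches two topics at once.
fragment = (
    "My parents moved here when I was small. "
    "Later the salary was low, so I changed my profession."
)

print(sorted(classify_coarse(fragment)))
print(sorted(classify_with_thesaurus(fragment)))
```

In an actual experiment, `score_labels` would be replaced by a zero-shot classifier that scores each candidate label against the text. The mapping from micro-topics back to coarse topics could stay the same.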
Key words and phrases: linguistic corpus, machine learning, topic classification, data annotation, dialogical speech