Building a linguistic corpus based on natural language processing tools: Planning software solutions
Gorozhanov Alexey Ivanovich
Moscow State Linguistic University
Submitted: 17.03.2023
Abstract. The paper aims to build a model of a linguistic corpus generated according to the rules of the spaCy natural language processing library. Its scientific novelty lies in applying the method of modelling within humanities research, combining it with a corpus approach and taking the technological (software) component into account from the goal-setting stage onward. First, a general structural model of a linguistic corpus as a sequence of blocks was defined, and standard queries to the database were formulated. Second, a model of a corpus manager interface able to implement these standard queries was built. Third, the proposed model was analysed with the help of mini-programs that make it possible to assess the technical feasibility of the queries and their practical value. At this stage, the linguistic material comprised the texts of fictional works by German-language (F. Kafka, E. M. Remarque) and English-language (A. C. Doyle, G. Orwell) writers. The results showed that the constructed model has a number of advantages and a limited number of disadvantages, is flexible with regard to further development, and can be implemented in software in the short term.
Key words and phrases: modelling, corpus linguistics, corpus manager, graphical user interface, spaCy
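The abstract describes a corpus database with standard queries, checked by mini-programs. The sketch below illustrates what such a mini-program might look like; it is not the author's implementation. The table layout, the function names (`build_corpus`, `forms_of_lemma`, `pos_frequencies`) and the sample rows are all assumptions made for illustration. The columns merely mirror the kind of per-token annotation spaCy exposes (`text`, `lemma_`, `pos_`), and the rows are hand-written stand-ins so that the example runs without loading a spaCy model.

```python
import sqlite3

# Hypothetical sample data: one row per token. The values are written by hand
# to imitate spaCy-style annotation (token text, lemma, coarse part of speech);
# they are NOT the output of an actual spaCy run.
ROWS = [
    # (sentence_id, token_index, text, lemma, pos)
    (1, 0, "He", "he", "PRON"),
    (1, 1, "opened", "open", "VERB"),
    (1, 2, "the", "the", "DET"),
    (1, 3, "door", "door", "NOUN"),
    (1, 4, ".", ".", "PUNCT"),
    (2, 0, "The", "the", "DET"),
    (2, 1, "doors", "door", "NOUN"),
    (2, 2, "were", "be", "AUX"),
    (2, 3, "opening", "open", "VERB"),
    (2, 4, ".", ".", "PUNCT"),
]


def build_corpus(rows):
    """Create an in-memory database with a single (assumed) token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token ("
        "sentence_id INTEGER, token_index INTEGER, "
        "text TEXT, lemma TEXT, pos TEXT)"
    )
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def forms_of_lemma(conn, lemma, pos=None):
    """Standard query 1 (assumed): all word forms of a lemma, in text order.

    Optionally restricts the result to one part-of-speech tag.
    """
    sql = "SELECT sentence_id, text FROM token WHERE lemma = ?"
    params = [lemma]
    if pos is not None:
        sql += " AND pos = ?"
        params.append(pos)
    sql += " ORDER BY sentence_id, token_index"
    return conn.execute(sql, params).fetchall()


def pos_frequencies(conn):
    """Standard query 2 (assumed): token counts per part-of-speech tag."""
    cursor = conn.execute("SELECT pos, COUNT(*) FROM token GROUP BY pos")
    return dict(cursor.fetchall())


conn = build_corpus(ROWS)
print(forms_of_lemma(conn, "open", pos="VERB"))
print(pos_frequencies(conn))
```

In a real pipeline the rows would come from iterating over a processed spaCy document rather than from a hand-written list; an in-memory SQLite database is used here only to keep the sketch self-contained.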