Problems of extracting semi-structured textual information based on Text Mining technology (using the material of the Russian and Chuvash languages)
Gubanov Aleksey Rafailovich, Danilov Andrey Anatolyevich, Isaev Yuri Nikolaevich, Gubanova Galina Fedorovna
Chuvash State University
Chuvash State Institute of Humanities
Submitted: 10.07.2024
Abstract. The study aims to identify models and algorithms for processing textual information on the modal correction of schemes of intentional relations in languages with different structures, using Text Mining technology. The growing flow of heterogeneous textual information on the internet, much of it in documents with complex organization, poses challenges for analysts. These challenges concern differentiated knowledge extraction, the task to which Text Mining is applied when diverse textual information is analyzed. The paper proposes an approach to analyzing information on the modal correction of schemes of intentional semantic relations in languages with different structures, drawing on methods of computational linguistics and Text Mining. Russian and Chuvash corpora held in the Datastores database were analyzed using the Language Resources library; the information was transferred on the basis of an analysis of integration and compatibility problems for data in various document types from different sources. Based on the proposed conceptual approach, clustering is performed at the level of document clusters and of the text corpus. The scientific novelty of the study lies in developing a set of models and algorithms for analyzing intentional relations in languages with different structures, in particular Russian and Chuvash, that ensures accuracy and completeness when extracting information for search queries. The focus is on the content of linguistic resources, which are classified according to class-modes of intentional semantic relations. An approach to formalizing lexico-syntactic templates is defined, and on that basis a taxonomy for the concept of intentional semantic relations is constructed. The study finds that the proposed method is effective for Text Mining tasks and for interpreting their results.
Key words and phrases: artificial intelligence, languages with different structures, intentional semantic relations (ISR), Text Mining, GATE, Data Mining
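The abstract mentions formalizing lexico-syntactic templates and classifying linguistic resources by class-modes of intentional semantic relations. The following is only a minimal sketch of that general idea, not the authors' implementation. The class-mode labels (`purpose`, `desire`), the Russian marker words, and the function names are all illustrative assumptions. Chuvash markers would have to be supplied from the actual linguistic resources.

```python
import re
from collections import defaultdict

# Hypothetical templates: each class-mode label maps to a regular expression
# over lowercased text. Labels and marker words are illustrative assumptions,
# not the inventory used in the study.
TEMPLATES = {
    "purpose": re.compile(r"\b(?:для того чтобы|чтобы)\b"),
    "desire": re.compile(r"\b(?:хочу|хочет|хотят)\b"),
}


def match_class_modes(sentence):
    """Return the labels of all templates that match the sentence."""
    text = sentence.lower()
    return [label for label, pattern in TEMPLATES.items() if pattern.search(text)]


def group_by_class_mode(sentences):
    """Group sentences under each matching label; others go to 'unmatched'."""
    groups = defaultdict(list)
    for sentence in sentences:
        labels = match_class_modes(sentence) or ["unmatched"]
        for label in labels:
            groups[label].append(sentence)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "Он пришёл, чтобы помочь.",
        "Она хочет учиться.",
        "Идёт дождь.",
    ]
    for label, items in group_by_class_mode(sample).items():
        print(label, items)
```

A flat regex table of this kind stands in for a real template language. A fuller system would match over morphologically analyzed tokens rather than raw strings.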