Automatic extraction of named entities in the Chinese-Russian corpus of parallel and comparable texts on political topics
Zhu Hui, Mitrofanova Olga Aleksandrovna
Dalian University of Foreign Languages
Saint Petersburg State University
Submitted: 24.07.2024
Abstract. The research aims to experimentally identify and interpret standard and nested named entities in Chinese and Russian political texts, both those common to the two languages and those specific to each, using the HanLP and spaCy libraries. A Chinese-Russian corpus of parallel and comparable political texts was created for the study. The scientific novelty of the research lies in presenting the results of recognizing various named entities and systematizing the types of errors observed in this corpus. The study found that the most frequent named entities in original Chinese and Russian political texts are location names, followed by organization names, with person names being the least frequent. Most high-frequency named entities in the original Chinese texts and in their Russian translations correspond to each other, which indicates that translators often use literal translation when rendering named entities from Chinese into Russian in political texts. The research also systematizes information on nested named entities in political texts, identifying and analyzing the following types: [[location]LOCATION], [[location]ORGANIZATION], [[number]ORGANIZATION], [[location]OBJECT], [[location]PROJECT].
Key words and phrases: named entity recognition, nested named entities, text corpus, parallel corpus, political texts
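The frequency ranking of entity types and the [[inner]OUTER] notation for nested entities can be illustrated with a minimal sketch. It assumes that entity spans have already been produced by a recognizer such as HanLP or spaCy; the `Entity` tuple, the function names, and the sample spans are hypothetical and do not come from the article, and the labels follow the abstract's notation rather than either library's tag set.

```python
from collections import Counter
from typing import NamedTuple


class Entity(NamedTuple):
    """A recognized span: character offsets [start, end) plus a type label."""
    start: int
    end: int
    label: str


def label_frequencies(entities):
    """Rank entity types from most to least frequent."""
    return Counter(e.label for e in entities).most_common()


def nested_patterns(entities):
    """Count [[inner]OUTER] patterns, where the inner span lies inside a different, larger span."""
    patterns = Counter()
    for outer in entities:
        for inner in entities:
            contained = outer.start <= inner.start and inner.end <= outer.end
            same_span = (inner.start, inner.end) == (outer.start, outer.end)
            if contained and not same_span:
                patterns[f"[[{inner.label.lower()}]{outer.label}]"] += 1
    return patterns


# Hypothetical spans; the first two cover "Saint Petersburg State University".
# Offsets are illustrative, not real recognizer output.
sample = [
    Entity(0, 33, "ORGANIZATION"),  # the whole university name
    Entity(0, 16, "LOCATION"),      # "Saint Petersburg" nested inside it
    Entity(40, 46, "LOCATION"),     # an unrelated standalone location
]

freqs = label_frequencies(sample)
nested = nested_patterns(sample)
```

On these sample spans, `freqs` ranks LOCATION above ORGANIZATION, and `nested` records one [[location]ORGANIZATION] pattern. Containment by character offsets is only one possible criterion; the article may define nesting differently.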