A Review of Part-of-Speech Tagging for Languages with Latin and Non-Latin Scripts: An In-Depth Look at Persian

Authors

  • Meisam Moghadam, Fasa University
  • Niloufar Jafarpour, Technische Universität Darmstadt, Germany

DOI:

https://doi.org/10.22046/LA.2021.05

Keywords:

part-of-speech tagging, Latin-script languages, non-Latin-script languages, Persian, RTL system.

Abstract

This article presents a comprehensive review of part-of-speech (POS) tagging for the written forms of Latin-script and non-Latin-script languages, with particular attention to Persian. It surveys the progress of POS tagging in the twenty-three most widely spoken languages in the world. Some of these languages, such as Arabic, Urdu, and Persian, use a right-to-left (RTL) writing system and face their own problems and challenges in POS tagging. These challenges may be specific to a single language or shared across several languages, and some of them are discussed here. Through a critical review of recent studies on POS tagging, the article examines the challenges facing Persian. Reviewing earlier research and examining the features, issues, challenges, and tools of POS tagging leads to the conclusion that the challenges of POS tagging in Persian arise mostly at the tokenization level and are related to the conditions of the Arabic script.



Published

2021-02-28

Article type

Article
