A Survey of Part of Speech Tagging of Latin and non-Latin Script Languages: A more vivid view on Persian
Keywords: Part of Speech Tagging, Latin Script Language, Non-Latin Script Language, RTL System, Persian Language
Abstract: This paper gives a general overview of part-of-speech (POS) tagging for Latin-script languages, with a specific focus on non-Latin-script languages, especially Persian. The study reviews progress in POS tagging for the 23 languages with the most native speakers in the world. Some of these languages, such as Arabic, Urdu, and Persian, use a right-to-left (RTL) writing system, which raises its own specific issues in POS tagging. The paper also examines the issues and challenges that occur during tokenization and POS tagging of these languages; a challenge may be common to several languages or specific to one. Persian is the main interest of this paper, and an attempt is made to critically review recent studies on Persian POS tagging and to enumerate the specific challenges that arise in them. After reviewing the literature and examining the features, challenges, issues, and POS tagging tools for Persian, it was concluded that the significant challenges in research on Persian were generally at the tokenization level and were mostly a result of using the Arabic script and its characteristics.