Word Vector Space for Text Classification and Prediction According to Author
İlknur Dönmez1*, Elnaz Pashaei 2, Elham Pashaei 3
1İstanbul Aydın University, İstanbul, Turkey
2İstanbul Aydın University, İstanbul, Turkey
3İstanbul Gelisim University, İstanbul, Turkey
* Corresponding author: ilknurdonmez@aydin.edu.tr
Presented at the International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA2019), Ürgüp, Turkey, Jul 05, 2019
SETSCI Conference Proceedings, 2019, 8, Pages 106-111, https://doi.org/10.36287/setsci.4.5.021
Published Date: 12 October 2019
Abstract
Word embeddings are dense vector representations of words that capture semantic relations between them. In this study, we propose a new method for classifying text by author using word embeddings and author-specific vector spaces. Using each author's books as training data, we build a separate word vector space per author, consisting of vector representations of all the words that author uses in the training dataset. The aim is to answer the question "Which author is most likely to have written a given text?". For a given text, we propose a method that measures how well its consecutive word vectors fit each author's vector space. The method is based on a basic principle used in constructing word embedding vectors. Extensive results on the datasets show that the proposed method outperforms state-of-the-art methods in terms of accuracy.
Keywords - word embedding, word vectors, Turkish, text classification by author