Open Access

Comparison of Text Classification Performances in Balanced and Imbalanced Datasets

Mehmet Fatih Karaca1*, Şafak Bayır2
1Department of Computer Technologies, Gaziosmanpaşa University, Tokat, Turkey
2Department of Computer Engineering, Karabük University, Karabük, Turkey
* Corresponding author: mfkaraca@gmail.com

Presented at the Ist International Symposium on Innovative Approaches in Scientific Studies (ISAS 2018), Kemer-Antalya, Turkey, Apr 11, 2018

SETSCI Conference Proceedings, 2018, 2, Page (s): 223-223

Published Date: 23 June 2018

Text mining, which aims to derive unknown and useful information from available textual data, is one of the subfield of data mining. Text mining transforms unstructured textual data into structured form by utilizing various methods. Data mining techniques can also be applied to textual data as a result of transformation to structured form. Text classification which is one of the widely favoured subject of text mining is the process of assigning texts into predefined classes. As a result of this, classification is realized more rapidly and consistently. In this study, text classification performances in balanced and imbalanced datasets were compared. Corpora which consist of Turkish and English texts were utilized. The features of 4 datasets including 5 classes in each were as follows: Corpus 1 and Corpus 3 include Turkish contents and Corpus 2 and Corpus 4 include English contents. 3375 training and 1125 test documents were included in the Corpus 1 and Corpus 2 which are balanced datasets, whereas 1825 training and 825 test documents were included in the Corpus 3 and Corpus 4 which are imbalanced datasets. Documents included in datasets were pre-processed, weighted as binary and tf-idf and document vectors were obtained. kNN was preferred for text classification and Manhattan Distance, Harmonic Mean, Inner Product, Squared Chord and Dice’s Coefficient were selected for the measurement of similarities of document vectors. Results were taken into consideration in terms of classification success as well as process time. Since the number of the documents within the classes are equal, it is seen that more successful classification was obtained in terms of average values in both Turkish and English content balanced datasets but the process time was longer.  

Keywords - text mining, text classification, kNN, balanced dataset, imbalanced dataset

0
Citations (Crossref)
363
Total Views
22
Total Downloads

Licence Creative Commons This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
SETSCI 2025
info@set-science.com
Copyright © 2025 SETECH
Tokat Technology Development Zone Gaziosmanpaşa University Taşlıçiftlik Campus, 60240 TOKAT-TÜRKİYE