Article title: Text Classification for Authorship Attribution Using Naive Bayes Classifier with Limited Training Data
Publication type
Scientific journal
Authors (Arabic)
فاطمة هويدي، مازنيزا موحد
Authors (English)
Fatma Howedi, Masnizah Mohd
English abstract
Authorship attribution (AA) is the task of identifying the authors of disputed or anonymous texts. It can be seen as a single-label, multi-class text classification task that is concerned with writing style rather than topic. The scalability issue in traditional AA studies concerns the effect of data size, that is, the amount of data available per candidate author. This issue has not yet been probed in much depth, since most stylometry research focuses on long texts per author or on multiple short texts, because stylistic choices occur less frequently in short texts. This paper investigates authorship attribution on short historical Arabic texts written by 10 different authors. Several experiments are conducted on these texts by extracting lexical and character features of each author's writing style, using word-level n-grams (n = 1, 2, 3, and 4) and character-level n-grams (n = 1, 2, 3, and 4) as the text representation. A Naive Bayes (NB) classifier is then employed to assign the texts to their authors, in order to show the robustness of the NB classifier for AA on very short texts compared to Support Vector Machines (SVMs). Using a dataset (called AAAT) that consists of 3 short texts per author's book, it is shown that our method is at least as effective as Information Gain (IG) for selecting the most significant n-grams. Moreover, the significance of punctuation marks for distinguishing between authors is explored, showing that an increase in performance can be achieved. The NB classifier also achieved high accuracy: the experiments on the AAAT dataset yielded a best classification accuracy of up to 96% using word-level 1-grams.
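The pipeline the abstract describes (n-gram extraction followed by Naive Bayes classification) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name NaiveBayesAttributor, the add-one smoothing, the whitespace tokenization, and the toy English sentences (standing in for the AAAT texts) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams of `text` (whitespace tokens)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayesAttributor:
    """Hypothetical multinomial Naive Bayes over n-gram counts.

    Add-one (Laplace) smoothing is an assumption; the abstract does not
    state which smoothing the paper uses.
    """

    def __init__(self, extract):
        self.extract = extract              # function: text -> list of n-grams
        self.counts = defaultdict(Counter)  # author -> n-gram counts
        self.docs = Counter()               # author -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, authors):
        for text, author in zip(texts, authors):
            grams = self.extract(text)
            self.counts[author].update(grams)
            self.docs[author] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = self.extract(text)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_author, best_score = None, -math.inf
        for author in self.docs:
            total = sum(self.counts[author].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.docs[author] / total_docs)
            for gram in grams:
                score += math.log(
                    (self.counts[author][gram] + 1) / (total + vocab_size)
                )
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Toy placeholder data (not from the AAAT dataset): two texts per "author".
train_texts = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog ran in the park",
    "a dog barked at night",
]
train_authors = ["A", "A", "B", "B"]

# Word-level 1-grams, the configuration the abstract reports as best.
word_model = NaiveBayesAttributor(lambda t: word_ngrams(t, 1))
word_model.fit(train_texts, train_authors)
predicted = word_model.predict("the cat sat")
```

Passing `lambda t: char_ngrams(t, 3)` instead would give a character-level 3-gram model with the same classifier, which is how the word-level and character-level representations in the abstract could be compared under this sketch.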
Publication date
04/01/2014
Publisher
Computer Engineering and Intelligent Systems
Volume number
5
Issue number
4
ISSN/ISBN
2222-2863
Pages
48 - 56
External link
https://iiste.org/Journals/index.php/CEIS/article/view/12132/12484
Keywords
Authorship attribution, Text classification, Naive Bayes classifier, Character n-grams features, Word n-grams features.