A Study of Local and Global Thresholding Techniques in Text Categorization

Wanas, N., Said, D., Darwish, N. and Hegazy, N.

    Feature Filtering is an approach that is widely used for dimensionality reduction in text categorization. In this approach feature scoring methods are used to evaluate features leading to selection. Thresholding is then applied to select the highest scoring features either locally or globally. In this paper, we investigate several local and global feature selection methods. The usage of Standard Deviation (STD) and Maximum Deviation (MD) as globalization schemes is suggested. This work provides a comparative study among fourteen thresholding techniques using different scoring methods and benchmark datasets of diverse nature. This includes investigation of normalizing feature scores before combining them in the global pool. The results suggest that normalized MD outperforms other methods in thresholding Document Frequency (DF) scores using even and moderate diverse data-sets. Furthermore, the results indicated that normalizing feature scores improves the performance of rare categories and balances the bias of some techniques to frequent categories.
Cite as: Wanas, N., Said, D., Darwish, N. and Hegazy, N. (2006). A Study of Local and Global Thresholding Techniques in Text Categorization. In Proc. Fifth Australasian Data Mining Conference (AusDM2006), Sydney, Australia. CRPIT, 61. Peter, C., Kennedy, P. J., Li, J., Simoff, S. J. and Williams, G. J., Eds. ACS. 91-101.
pdf (from crpit.com) pdf (local if available) BibTeX EndNote GS