Term signal is an existing text representation that
depicts a term as a vector of its occurrence frequencies
across a number of user-defined partitions of a
document. Although term signal augments the traditional
vector space model with patterns of term
occurrences, its document division does not follow
the actual logical structure of a document. In
this paper, we propose a novel document model,
termed Structure-Based Document Model with Discrete
Wavelet Transforms (SDMDWT), that exploits
the structural information of documents and mathematical
transforms for document representation. The
proposed SDMDWT model enhances the existing
term signal concept by additionally taking a document's
structural information into account during
document division. We evaluated the proposed model
on standard data sets from two different domains, WebKB
4-Universities and TREC Genomics 2005, using binary
classification with Support Vector Machines.
The experimental results show that representing documents
with our SDMDWT model yields promising improvements
in classification performance over existing
document models.
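
As a rough illustration of the term signal idea and of structure-based document division, the following Python sketch builds a term's signal from a list of partitions and applies a wavelet transform to it. The function names, the toy sections, and the choice of the Haar wavelet are assumptions made for illustration only; the abstract does not state which wavelet or which structural units SDMDWT uses.

```python
import math


def equal_width_partitions(tokens, k):
    """Split a token list into k contiguous, roughly equal partitions
    (the user-defined division used by the plain term signal)."""
    n = len(tokens)
    return [tokens[i * n // k:(i + 1) * n // k] for i in range(k)]


def term_signal(partitions, term):
    """Term signal: the frequency of `term` in each partition."""
    return [part.count(term) for part in partitions]


def haar_dwt(signal):
    """Full orthonormal Haar wavelet decomposition.

    Assumption: the Haar wavelet is used here only because it is the
    simplest discrete wavelet. The signal length must be a power of two.
    """
    out = [float(x) for x in signal]
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (approximation) and differences (detail).
        approx = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:length] = approx + detail
        length = half
    return out


# Hypothetical document split along its logical structure
# (for example title, abstract, body, references) instead of equal widths.
sections = [
    ["gene", "expression"],
    ["gene", "protein", "cell"],
    ["cell", "gene", "gene", "protein"],
    ["protein"],
]
structure_signal = term_signal(sections, "gene")
print(structure_signal)
print(haar_dwt(structure_signal))
```

Passing the output of `equal_width_partitions` to `term_signal` instead of `sections` gives the fixed-width division of the original term signal, so the two divisions can be compared on the same document.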
Cite as: Thaicharoen, S., Altman, T. and Cios, K.J. (2008). Structure-Based Document Model with Discrete Wavelet Transforms and Its Application to Document Classification. In Proc. Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South Australia. CRPIT, 87. Roddick, J. F., Li, J., Christen, P. and Kennedy, P. J., Eds. ACS. 209-217.