Domain Relevance on Term Weighting
Marko Brunzel; Myra Spiliopoulou
In: 12th International Conference on Applications of Natural Language to Information Systems. NLDB-2007, June 27-29, CNAM, Paris, France, Lecture Notes in Computer Science, Vol. 4592, Springer, 2007.
The TFxIDF term weighting scheme is the standard approach to the vectorization of textual data. For a data set in which textual data stemming from web document structure is to be vectorized [DBLP:conf/kdxd/BrunzelS06], the need for an enhanced term weighting scheme arose. In this publication we introduce a term weighting scheme that improves on the behavior of the traditional TFxIDF scheme by adding a component based on the linguistically inspired notion of domain relevance. Domain relevance measures the degree to which a term is more relevant within a data set than within a reference data set. By means of this external component, a potential weakness of TFxIDF on data sets with non-standard distributions is overcome. The weighting scheme favours domain-relevant terms, which can be regarded as more useful in settings where the clustering result is to be consumed by a human supervisor, e.g., for semi-automatic ontology learning.