Wednesday, March 26th, 2008 http://blog.semantichacker.com/?cat=3
TextWise Semantic Signatures® are based on vector spaces with thousands of dimensions, each corresponding to a single concept in some domain of interest. We use a special semantic dictionary to map the content of a given text document into a point within that conceptual space; and one can then gauge the similarity of two documents from the distance between their points in the conceptual space.
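To make the idea of gauging similarity concrete, here is a minimal sketch in Python. It assumes a signature is stored as a sparse mapping from concept name to weight, which is an illustrative choice and not a description of the TextWise format. The post does not say which distance measure is used, so the sketch shows two common candidates, cosine similarity and Euclidean distance; the function names and concept labels are hypothetical.

```python
import math


def cosine_similarity(sig_a, sig_b):
    """Cosine of the angle between two sparse concept vectors.

    Each signature is a dict of concept -> weight (an assumed layout).
    Returns 0.0 if either signature has no nonzero weights.
    """
    dot = sum(w * sig_b.get(concept, 0.0) for concept, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(sig_a, sig_b):
    """Straight-line distance between two points in the concept space."""
    concepts = set(sig_a) | set(sig_b)
    return math.sqrt(
        sum((sig_a.get(c, 0.0) - sig_b.get(c, 0.0)) ** 2 for c in concepts)
    )


# Hypothetical signatures for two documents.
doc_a = {"sports": 0.8, "health": 0.2}
doc_b = {"sports": 0.6, "finance": 0.4}
similarity = cosine_similarity(doc_a, doc_b)
distance = euclidean_distance(doc_a, doc_b)
```

Cosine similarity ignores the overall length of each vector, which can be convenient when documents differ greatly in size; Euclidean distance does not.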
High-dimensional vector spaces should be quite familiar to information retrieval specialists and users of Salton’s SMART system. One should note, however, that SMART is based solely on counting word occurrences and so is not semantic. The highly skewed distribution of words in text, as described by Zipf’s Law, means that the dimensions of such a word-count space will be highly unbalanced, much like all the extra folded-up physical dimensions of strings in current grand unification theories.
Semantic spaces behave better, in large part because we get to choose the concepts for their dimensions. One wants those concepts to be independent, well balanced, and representative of the kind of text content to be described. This turns out to be a challenging set of requirements, but the underlying ideas are quite straightforward.
TextWise takes a purely statistical approach to semantics. Each concept in a semantic space has to be defined by a big sample of text documents related to that concept. We can then apply standard language modeling methods to such data to estimate the conditional probabilities of certain terms being associated with certain concepts; and these numbers, with a few adjustments, will then constitute our semantic dictionaries. This whole process is called “training.”
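The training step can be sketched roughly as follows. This is not the TextWise procedure, whose language models and "few adjustments" are not described here; it is a minimal illustration that estimates P(term | concept) from per-concept sample documents using add-one (Laplace) smoothing. The function name, the whitespace tokenizer, and the smoothing parameter alpha are all assumptions.

```python
from collections import Counter, defaultdict


def train_dictionary(samples, alpha=1.0):
    """Estimate P(term | concept) from sample documents.

    samples: dict of concept -> list of document strings (assumed layout).
    alpha:   additive smoothing constant (an assumption, not from the post).
    Returns a dict of term -> {concept: probability}.
    """
    # Count term occurrences per concept with a naive whitespace tokenizer.
    counts = {
        concept: Counter(tok for doc in docs for tok in doc.lower().split())
        for concept, docs in samples.items()
    }

    # The shared vocabulary is every term seen under any concept.
    vocab = set()
    for counter in counts.values():
        vocab.update(counter)

    dictionary = defaultdict(dict)
    for concept, counter in counts.items():
        total = sum(counter.values()) + alpha * len(vocab)
        for term in vocab:
            # Smoothing keeps unseen terms from getting zero probability.
            dictionary[term][concept] = (counter[term] + alpha) / total
    return dict(dictionary)


# Toy training data; real dictionaries would need far larger samples.
toy_samples = {
    "sports": ["ball game ball"],
    "cooking": ["bake bread"],
}
toy_dictionary = train_dictionary(toy_samples)
```

With this toy data, "ball" ends up far more probable under "sports" than under "cooking", which is the kind of term-to-concept association a semantic dictionary records.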
TextWise has already built several large semantic dictionaries, most notably one with categories and training data from the USPTO and another with categories and indexed web pages from the ODP (the Open Directory Project, also known as DMOZ). The latter is probably the best choice for working with web applications, but one should note that many ODP categories have had to be consolidated or eliminated in order to satisfy minimum training data requirements for a dictionary.
A Semantic Signature® derived with the latest ODP dictionary will have over ten thousand dimensions. This would be hard to work with, but for typical web pages, only a few of those dimensions will have a significant weight. So our API currently keeps only the top 30 dimensions of a signature, which should be plenty to work with. A subsequent release will introduce signatures with variable degrees of truncation according to the actual statistical significance of weights.
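Keeping only the top 30 dimensions amounts to a simple top-k selection by weight. The sketch below assumes a full signature is a dict of concept to non-negative weight; the function name and the concept labels are hypothetical, and this is not the API's actual implementation.

```python
import heapq


def truncate_signature(weights, k=30):
    """Keep only the k highest-weighted dimensions of a signature.

    weights: dict of concept -> weight (assumed non-negative).
    Returns a new dict with at most k entries.
    """
    top = heapq.nlargest(k, weights.items(), key=lambda item: item[1])
    return dict(top)


# A made-up signature with 10,000 dimensions, most of them near zero.
full_signature = {f"concept_{i}": 1.0 / (i + 1) for i in range(10000)}
signature = truncate_signature(full_signature)
```

A variable truncation scheme, as mentioned for a later release, would replace the fixed k with a cutoff on the weights themselves; how statistical significance would be measured is not specified in the post.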
Because they are statistical, Semantic Signatures® will always have a small unavoidable degree of noise. It is possible that some highly weighted categories in a signature will be wrong, just as it is possible for the house to lose some bets in a casino. With proper handling of training data, though, one should be able to ensure that the house will in fact still win most of the time.