connotation
Probability that a message has a particular connotation:
                  p1 * ... * pn
p = -----------------------------------------
    p1 * ... * pn + (1 - p1) * ... * (1 - pn)
where px is the probability that word x has the given connotation.
Or, if you prefer Haskell,
probability ps = product ps / (product ps + product [1 - px | px <- ps])
Approach
Part I: Count tokens
For each connotation, do the following:
- Get all the text labelled as a connotation (the corpus).
- Convert the corpus to lowercase.
- Break the corpus into tokens. For now (\w+) should be a sufficient regex.
- Count the number of times a token appears in the full corpus.
- Store total counts for each token to use for future reference.
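The counting steps above can be sketched in Haskell roughly as follows. This is a minimal sketch, not the project's actual code: the names `tokenize`, `countTokens`, and `totalCounts` are hypothetical, the hand-written splitter only approximates the regex (\w+), and an in-memory `Map` stands in for whatever storage is really used.

```haskell
import Data.Char (isAlphaNum, toLower)
import qualified Data.Map.Strict as Map

-- | Lowercase the text and split it into runs of word characters.
--   Approximates the regex (\w+): letters, digits, and underscore.
tokenize :: String -> [String]
tokenize = go . map toLower
  where
    isWord c = isAlphaNum c || c == '_'
    go s = case dropWhile (not . isWord) s of
      ""   -> []
      rest -> let (tok, rest') = span isWord rest in tok : go rest'

-- | Count how often each token occurs in one connotation's corpus.
countTokens :: String -> Map.Map String Int
countTokens corpus = Map.fromListWith (+) [(tok, 1) | tok <- tokenize corpus]

-- | Sum the per-connotation counts into total counts per token.
totalCounts :: [Map.Map String Int] -> Map.Map String Int
totalCounts = Map.unionsWith (+)
```

For example, `countTokens "Good, good movie"` maps "good" to 2 and "movie" to 1.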
Part II: Evaluate a blob
- Convert the blob to lowercase.
- Break the blob into tokens the same way the corpus was broken into tokens.
- Get the, say, 15 most significant tokens for each connotation (in Graham's scheme, those whose probability is furthest from 0.5).
- Get the probability that the blob has each connotation by using The Formula.
- Ignore any connotations having a probability less than, say, 0.7.
- If there is only one connotation left, mark the blob as having that connotation. If there is more than one probable connotation, do not mark the blob.
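The evaluation steps above might look like this in Haskell. It is only a sketch under a few assumptions: the per-token probabilities for each connotation have already been looked up, "most significant" is taken to mean furthest from 0.5 (as in Graham's essay), and the names `combine`, `mostSignificant`, and `classify` are made up for illustration.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- | Combine per-token probabilities into one probability for the blob.
combine :: [Double] -> Double
combine ps = product ps / (product ps + product [1 - p | p <- ps])

-- | Keep the n probabilities furthest from the neutral value 0.5.
mostSignificant :: Int -> [Double] -> [Double]
mostSignificant n = take n . sortOn (\p -> Down (abs (p - 0.5)))

-- | Given (connotation, per-token probabilities) pairs, return the single
--   connotation whose combined probability reaches the threshold.
--   If none or more than one qualifies, the blob is left unmarked.
classify :: Int -> Double -> [(String, [Double])] -> Maybe String
classify n threshold perClass =
  case [c | (c, ps) <- perClass, combine (mostSignificant n ps) >= threshold] of
    [c] -> Just c
    _   -> Nothing
```

With the values suggested above, a call would be `classify 15 0.7 scores`, where `scores` pairs each connotation with the token probabilities found in the blob.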
The Formula
Here is Paul Graham's Common Lisp code for computing the probability that a word indicates spam[1]:
(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))
In our case, the bias toward good (the doubling of g) does not mean much. So, dropping it, in pseudo-Haskell that's
probability word = if g + b >= 5 then
    max .01 (min .99 (min 1 (b / nbad) / (min 1 (g / ngood) + min 1 (b / nbad))))
  where g = or (gethash word good) 0
        b = or (gethash word bad) 0
We also need to be able to rate words in different classes. Graham's spam detection uses only one class, spam (and the implicit class not spam). We need to extend his system four-fold:
- positive (and not positive)
- negative (and not negative)
- neutral (and not neutral)
- useless (and not useless)
There's a possibility that we can skip rating for neutral and mark a blob as such when it has ~0.5 probabilities for positive and negative, but the need to generalize the function remains.
probability word class = if total >= 5 then
    max .01 (min .99 (min 1 (in / nin) / (min 1 (in / nin) + min 1 (out / nout))))
  where total = count word
        nout = count outSet
        nin = count inSet
        out = total - in
        in = or (gethash word class) 0
Database
In the database the following are used as abbreviations for different connotations:
- + positive
- - negative
- 0 neutral
- u useless
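To tie these abbreviations to the generalized per-word probability, here is one possible runnable Haskell sketch. It follows the shape of Graham's formula, with the class's own ratio in the numerator. Everything named here (`Connotation`, `ClassStats`, `Stats`, `wordProbability`) is hypothetical, the in-memory `Map` merely stands in for the database, and `classSize` is assumed to be the number of labelled texts in a class, by analogy with Graham's nbad and ngood.

```haskell
import qualified Data.Map.Strict as Map

-- | The four connotations.
data Connotation = Positive | Negative | Neutral | Useless
  deriving (Eq, Ord, Show, Enum, Bounded)

-- | Database abbreviation for each connotation.
abbreviation :: Connotation -> Char
abbreviation Positive = '+'
abbreviation Negative = '-'
abbreviation Neutral  = '0'
abbreviation Useless  = 'u'

-- | Per-class data (assumed layout): class size and token counts.
data ClassStats = ClassStats
  { classSize   :: Int
  , tokenCounts :: Map.Map String Int
  }

type Stats = Map.Map Connotation ClassStats

-- | Probability that a word has the given connotation, or Nothing when
--   the word has been seen fewer than 5 times in total.
wordProbability :: Stats -> String -> Connotation -> Maybe Double
wordProbability stats word cls
  | total < 5 = Nothing
  | otherwise = Just (max 0.01 (min 0.99 (inRatio / (inRatio + outRatio))))
  where
    countIn c = maybe 0 (Map.findWithDefault 0 word . tokenCounts) (Map.lookup c stats)
    size c    = maybe 0 classSize (Map.lookup c stats)
    others    = filter (/= cls) [minBound .. maxBound]
    inCount   = countIn cls
    outCount  = sum (map countIn others)
    total     = inCount + outCount
    inRatio   = ratio inCount (size cls)
    outRatio  = ratio outCount (sum (map size others))
    -- Ratio capped at 1; an empty class contributes 0.
    ratio _ 0 = 0
    ratio k n = min 1 (fromIntegral k / fromIntegral n)
```

Returning `Maybe` makes the "too few observations" case explicit, so a caller can skip such words rather than treat them as neutral.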
References
[1] Paul Graham's "A Plan for Spam"