1 Definition of the data

You have a dataset D, which is a collection of n_d documents D = {D_1, ..., D_{n_d}}.

Each document D_i is composed of a collection of m_i words w_i = {w_{i,1}, ..., w_{i,m_i}}.

A global list of words (present or not in the documents) is denoted y = {y_1, ..., y_m}.

2 TF: Term Frequency

Count the occurrences of each word and derive its frequency.

Let {w_1, ..., w_m} be the m words of a document, and {y_1, ..., y_n} the n distinct words present in it (or any other list of words). The frequency of word y_i is:

F_i = \frac{1}{m}\sum_{j=1}^m \delta_{w_j, y_i}

where \delta_{w_j, y_i} = 1 if w_j = y_i, else 0.

This shows which words are the most frequent in a given text.
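
The frequency formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `term_frequency`, the optional `vocabulary` argument and the sample sentence are assumptions made for the example.

```python
from collections import Counter

def term_frequency(words, vocabulary=None):
    """Return {y_i: F_i}, where F_i is the share of `words` equal to y_i."""
    m = len(words)
    counts = Counter(words)          # number of occurrences of each word
    if vocabulary is None:
        vocabulary = sorted(counts)  # the distinct words of the document
    if m == 0:
        return {y: 0.0 for y in vocabulary}
    # Counter returns 0 for a word that never occurs, so F_i = 0 in that case.
    return {y: counts[y] / m for y in vocabulary}

# Hypothetical sample document.
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
most_frequent = max(tf, key=tf.get)
```

Passing another word list as `vocabulary` covers the "any other list of words" case: words absent from the document simply get a frequency of 0.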

2.1 Variants

  • Binary variant: the word is present (1) or not (0).
    Not very interesting. Summed over the documents, it gives the number of documents containing the word, i.e. the document frequency up to a factor |D|.

  • Normal f_{i,j}: frequency of word y_i in document D_j. Problem: if a word is far too frequent, the IDF term will not compensate enough to let more interesting words stand out.

  • Log normalization 1 + log(f_{i,j}): dampens the importance of the frequency by following a logarithmic growth. This formula comes from Wikipedia but seems strange here: the frequency can be 0, where the logarithm is undefined, and the result is negative for small frequencies. A better formula would be log(1 + f_{i,j}), which avoids negative and undefined values.

  • Half-max normalization

    \frac{1}{2} + \frac{1}{2} \frac{f_{i, j}}{\max_{k \in D_j} f_{k, j}}
    The most frequent word of the document gets 1, and all the others get less. With short documents, the TF-IDF calculation can look awkward. Very strange: if f = 0, TF = 0.5, which is nonsense for an absent word.

  • Max normalization

    K + (1 - K) \frac{f_{i, j}}{\max_{k \in D_j} f_{k, j}}
    Same criticism as above.
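
The variants listed above can be compared with a few small functions. This is only a sketch: the function names are made up for the example, `f` stands for the frequency f_{i,j}, `f_max` for the highest frequency in the document, and the natural logarithm is assumed.

```python
import math

def tf_binary(f):
    """1 if the word is present, 0 otherwise."""
    return 1.0 if f > 0 else 0.0

def tf_log_plus_one(f):
    """1 + log(f). math.log raises ValueError when f = 0."""
    return 1.0 + math.log(f)

def tf_log_smoothed(f):
    """log(1 + f): defined at f = 0 and never negative."""
    return math.log(1.0 + f)

def tf_half_max(f, f_max):
    """1/2 + 1/2 * f / f_max: equals 0.5 for an absent word."""
    return 0.5 + 0.5 * f / f_max

def tf_max_norm(f, f_max, k):
    """K + (1 - K) * f / f_max: equals K for an absent word."""
    return k + (1.0 - k) * f / f_max
```

Evaluating `tf_log_plus_one(0.1)` gives a negative value, and `tf_half_max(0.0, f_max)` gives 0.5 whatever `f_max` is, which illustrates the two objections raised in the list.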

3 IDF: Inverse Document Frequency

Looking at word frequencies alone, the most common words, which are used for grammatical purposes, have a very high frequency. The IDF term reduces the weight of words that appear in many documents:

IDF_i = \log \frac{|D|}{|\{D_j : y_i \in D_j\}|}

i.e., it is the logarithm of the ratio between the number of documents studied and the number of documents in which word y_i is present.

Other normalizations exist (e.g. smoothing with log(1 + ...)).
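
The unsmoothed IDF definition above could be computed along these lines. This is a sketch under assumptions: documents are lists of words, the natural logarithm is used, and returning `None` for a word that appears in no document (where the ratio is undefined) is a choice made for the example only.

```python
import math

def inverse_document_frequency(documents, vocabulary):
    """Return {y_i: log(|D| / number of documents containing y_i)}."""
    d = len(documents)
    doc_sets = [set(doc) for doc in documents]  # presence only, counts are ignored
    idf = {}
    for y in vocabulary:
        n = sum(1 for words in doc_sets if y in words)
        # Assumption: the ratio is undefined when no document contains y.
        idf[y] = math.log(d / n) if n > 0 else None
    return idf

# Hypothetical sample corpus.
docs = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
idf = inverse_document_frequency(docs, ["the", "cat", "fish"])
```

A word present in every document, such as "the" here, gets an IDF of log(1) = 0, which is how grammatical words are suppressed.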


4 TFIDF

For the TFIDF of a given word y_i:

TFIDF_i = (Mean_j TF_{i,j}) \cdot IDF_i
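
As an illustration of the product above, here is a small sketch that combines the mean normal frequency over all documents with the unsmoothed IDF. The function name `tfidf`, the sample corpus, the natural logarithm and the score of 0 given to words found in no document are assumptions of the example; documents are assumed to be non-empty lists of words.

```python
import math
from collections import Counter

def tfidf(documents, vocabulary):
    """Return {y_i: mean_j(TF_ij) * IDF_i} using the normal frequency as TF."""
    d = len(documents)
    counters = [Counter(doc) for doc in documents]
    scores = {}
    for y in vocabulary:
        # Normal frequency of y in each document.
        tfs = [c[y] / len(doc) for c, doc in zip(counters, documents)]
        # Number of documents in which y is present.
        n = sum(1 for c in counters if c[y] > 0)
        if n == 0:
            scores[y] = 0.0  # assumption: an absent word gets a score of 0
            continue
        scores[y] = (sum(tfs) / d) * math.log(d / n)
    return scores

# Hypothetical sample corpus.
docs = ["the cat sat".split(), "the dog ran".split(), "the cat ran".split()]
scores = tfidf(docs, ["the", "cat", "dog", "fish"])
```

In this corpus "the" scores 0 because it is in every document, and "dog", present in a single document, scores higher than "cat", present in two.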

5 Analysis

Assume that the frequency is the same in every document, or that its average is f over the documents where the word is present. For simplicity, d denotes the number of documents and n the number of documents containing the word.

TF/IDF       In no document   In 1 document    In n documents        In all d documents
f=0          0                X                X                     X
f=0.1 (few)  X                0.1 * log(d)     n * 0.1 * log(d/n)    0.1 * d * log(d/d) = 0
f=0.5 (big)  X                0.5 * log(d)     n * 0.5 * log(d/n)    0.5 * d * log(d/d) = 0

(X: not applicable)

For:

  • Normal frequency.
    TFIDF(n) = f n \log(\frac{d}{n})
    \frac{\partial TFIDF(n)}{\partial n} = f \left( \log(\frac{d}{n}) - 1\right)
    The derivative vanishes at n = d/e, where TFIDF(n) is maximal, and TFIDF(n) is minimal (zero) at n = d. Between these two values, the fewer documents contain the word, the more visible it is.
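
The behaviour of TFIDF(n) = f n log(d/n) can be checked numerically. The sketch below assumes the natural logarithm and arbitrary illustrative values d = 1000 and f = 0.1; the function name `tfidf_uniform` is made up for the example.

```python
import math

def tfidf_uniform(n, d, f):
    """TFIDF(n) = f * n * log(d / n) for a word present in n of d documents."""
    return f * n * math.log(d / n)

d, f = 1000, 0.1  # illustrative values only
values = {n: tfidf_uniform(n, d, f) for n in range(1, d + 1)}

n_best = max(values, key=values.get)  # where the score peaks
score_everywhere = values[d]          # word present in every document
score_single = values[1]              # word present in one document
```

With these values the score is zero when the word is in every document, and `n_best` falls next to d/e, the point where the derivative f (log(d/n) - 1) changes sign.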

6 Sources


Not very good at all. Check whether there is something new elsewhere to improve this page.