Way of measuring distances

(*S*, *A*, *C*, *π*)

According to Bellman equation :

*C*(*s*_{i})_{tot} = *C*(*s*_{i})+∑_{j ∈ S}*γ**A*(*i*, *j*)*C*(*s*_{j})

*γ**i**n*(0, 1]

*γ* is replaced by a kind of Poisson process exp^{−λt} = *γ*(*t*)

Distance between grade of observator when labelling data

$$\kappa = \frac{Pr(a) - Pr(e)}{1 - Pr(e)}$$

*P**r*(*e*) Random consensus *P**r*(*a*) Relative consensus between observator

*κ* = 1 Total accord *κ* = 0 No agreement

For a ℝ^{n} space, max 1D-projection distance

*m**a**x*(*A*, *B*)=*m**a**x*{|*a*_{1} − *b*_{i}|,...,|*a*_{n} − *b*_{n}|}

Similarity between two sets

number of bigram / trigram similar in two strings

Dice similarity coefficient

$$DSC = \frac{|U \cap V|}{|U|+|V|}$$

On a ℝ^{n} space :

With **x**_{j} = (*a*_{1, j}, ..., *a*_{n, j})

$$d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{\sum^n_i (a_{i, 1} - a_{i,2})^2}$$

*E**P**E* = *E*{*Y* − *f̂*(*x*)}^{2} = *E*{*Y* − *f*(*x*)}^{2} + {*E*(*f̂*(*x*)) − *f*(*x*)}^{2} + *E*{*f̂*(*x*)−*E*(*f̂*(*x*))}^{2}

*E**P**E* = *V**a**r*(*Y*)+*B**i**a**s*^{2} + *V**a**r*(*f̂*(*x*))

- Error of the measure (even if the model is right)
- Bias : Mispefying the model
- Error using a sample (lack of continuity and covering the space)

A geodesic is a curve which follow the minimal distance path.

*γ* : *I* → *M* where *I* is the unit interval space and *M* is the metric space.

*γ* is a geodesic if there exist *v* ≥ 0∀*t* ∈ *I*∃*J*|∀(*t*_{1}, *t*_{2})∈*J*

*d*(*γ*(*t*_{1}),*γ*(*t*_{2})) = *v*|*t*_{1} − *t*_{2}|

Distance from predicted classification to effective classification, for a set of items *J* items, which classification value (before rounding) is *f*_{i}. If *f*_{i} is close to 0 or 1, the error related to item *i* is minimal. If *f*_{i} is close to 0.5, ie randomly choosen, the error is maximal

$$I_G(f) = \sum_{i=1}^J f_i(1 - f_i)$$

Distance in bits between two strings.

If they don't have the same size, the extra bits of the longest can be considered as 1.

*d*(*x*_{1}, *x*_{2})=∑*X**O**R*(*x*_{1}, *x*_{2})

XOR table :

0 | 1 | |
---|---|---|

0 | 0 | 1 |

1 | 1 | 0 |

Distance of two points on a sphere.

First :

$$hav(\theta) = \sin^2 \left(\frac{\theta}{2}\right) = \frac{1 - \cos(\theta)}{2}$$

$$hav^{-1} (h) = 2 \arcsin(\sqrt(h))$$

Next :

*d*: greatest distance (unknown)*r*: radius of the sphere*ϕ*: latitude*λ*: longitude

$$hav \left(\frac{d}{r}\right) = hav (\phi_2 - \phi_1) + \cos(\phi_1) \cos(\phi_2) hav(\lambda_2 - \lambda_1)$$

For a set of points *U*,

*û* = *J**C*(*U*)=*a**r**g**m**i**n*_{us ∈ U}*m**a**x*_{ui ∈ U}*D**i**s**t*(*u*_{s}, *u*_{i})

It is the point which is the most central, ie., the maximal distance to the other point is the lowest.

**Question** :

How to find it efficiently ?

- is this center nearby the mean of the cluster ?? ==> No, as if too much point are concentrated elsewhere than in the middle, it does not work.

Find the maximal distance between all points, and find the point nearby the middle of this biggest axis

Length of the shortest computer programm that is able to generate the "string" / "byte string" from scratch in a given language.

Dissimilarity between two probability distribution.

Dissimilarity of *Q* according to *P*

$$D_{KL}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

$$D_{KL}(P||Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} dx$$

Distance in strings counting the number of characters between two words. Number of character to add/remove/change between two strings.

Exemple, with the code :

- I : insert
- exchange

- R remove

Distance(Alice, Claie) = 3

- a l i c e
- c l a i e
- l * * e

In order to calculate the distance, there are three common algorithms :

- The
**Levenshtein**algorithm, for short length strings - The
**Jaccard**algorithm **TF-IDF**algorithm

TODO

Use a matrice of size (*n* + 1)×(*m* + 1)

TODO

$$d(x_1, x_2) = \sum^n_i |a_{i, 1} - a_{i,2}|$$

Compare the similarity between 2 datasets of *N*-Dimensional object. Try to preserve the distance between two objects in a distance space

*N*= 2

$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$

Stress / Strain : Loss function

$Stress =\sqrt{\frac{\sum (f(x) - d)^2}{\sum d^2}}$

Visualisation of x without any y by using similarity KL divergence, ...

Distance on the sphere

*Φ*, *λ*: latitude, longitude

*Δ**σ*: angle at the middle of the sphere between two points

*Δ**σ* = 2*a**r**c**c**o**s*(sin *Φ*_{1} ⋅ sin *Φ*_{2} + cos *Φ*_{1} ⋅ cos *Φ*_{2} ⋅ cos *Δ**λ*)

*d* = *r**Δ**σ*

- 1D : 2 adjacent pixels

X **A** Y

- 2D : 4 adjacent pixels

X | ||
---|---|---|

X | A |
X |

X |

- 3D : 6 Adjacents pixels

And for voxels : 2D : Hexagonal, so 6 neighbours. Same distance than Moore in that case

- 2D : Manhattant distance = 1

NW | N | NE |
---|---|---|

w | A |
E |

SW | S | SE |

*Paturi : clustering malware*

*p*, the shortest program to compute *x* from *y* or *y* from *x*.

|*p*|=*m**a**x*{*K*(*x*|*y*),*K*(*y*|*x*)}

*K*(*x*|*y*): Algorithmic information of *x* given *y*

To express similarity, normalize by average, and obtain the **Normalized Information Distance**:

$$NID(x,y) = \frac{max\{K(x|y),K(y|x)\}}{max\{K(x),K(y)\}}$$

As the distance is not always computable, given a compressor *Z*. *Z*(*x*) is the bitlength of x compressed using a specific algorithm, such as *gzip*, *PPMZ* or others.

$$NCD_Z (x,y) = \frac{Z(xy) - min\{Z(x),Z(y)\}}{max\{Z(x),Z(y)\}}$$

NCD is robust to noise, and free of parameter (just the compressor *Z*).

Give the semantic similarity of two words *x* and *y*:

$$NGD (x,y) = \frac{max\{log(f(x)), log(f(y))} - log(f(x,y))\}{log N - min\{log f(x), log f(y)\}}$$

Use the page hit count *f*(*x*) : number of page containing x *f*(*x*, *y*): number of page containing x AND y *N* : number of pages indexed