<< Go back to Tech

# Introduction

Graphs are everywhere.

Communication and social networks are two good examples.

Your message might be encrypted, but your belief is not !

Here I show an example using the polblog dataset. 1

# Dataset

In the Political Blog dataset 1, you have several informations:

• blogs name;
• blogs outgoing links;
• blogs political orientation (left or right).

It was collected for the 2004 US presidential election, and there are about 1500 blogs in it.

# Visualization

As you can see, the red doesn’t mix well with the blue.

There are some outliers of course, like in any dataset. On average, if you look at the neighborhood of a blog, you are able to guess the party it belongs to.

So there is no secret !

» Click here for the interactive view. «

# Processing

## Notation

We have a directed graph $$G=(V, E)$$. $$V$$ represents the blogs, while $$E$$ the link from one blog to another.

We define the outgoing $$\mathcal{N}$$ and incoming neighborhood $$\mathcal{N}^{-1}$$ as:

$\mathcal{N}(u) = \{v | (u, v) \in E\}$ $\mathcal{N}^{-1}(u) = \{v | (v, u) \in E\}$

And $$\|\mathcal{N}(u)\|$$ represents the number of neighbors $$u$$ can directly reach.

## Pruning

First, we select the largest connected component. There are small groups, but made of less than 10 nodes which are not necessary to consider.

Second, we discarded nodes with too few outgoing links. We removed all the nodes with less than $$2$$ links. In a graph, links are the information. So, we cannot learn that much on these badly connected nodes. However, in some situation, you may keep them, especially in power-law distributed networks, as badly connected nodes are the majortiy. In our case, we do not have this issue. Pruning them improves the computational complexity, so we end-up with 1.000 nodes.

## Getting more information.

To compare two nodes, the best option is to look at their surrounding.

However, if you take two nodes at random, the overlap between their neighborhood is probably of 0 nodes.

To create a finer description of a node, we perform a random walk. Each indirect neighbors is weighted according to the probability to reach it.

We can recursively compute the random walk weight $$w^u(v)$$ starting from $$u$$:

$w^u(v) = \sum_{x \in \mathcal{N}^{-1}(v)} \frac{w^u(x)}{|\mathcal{N}(x)|}$

For $$u$$, we have $$w^u(u) = 1$$

### Avoiding the Chains

If my node is connected to a single one, you get a chain, and the weight is the same for the two next. If you have more nodes like that, you keep a strong weight while $n$ steps away from your initial node.

In a network, word-of-mouth, the information is modified after several hop. To take into account this factor, we add a decay factor $$\alpha=0.9$$.

$w^u(v) = \alpha \sum_{x \in \mathcal{N}^{-1}(v)} \frac{w^u(x)}{|\mathcal{N}(x)|}$

With $$5$$ hops, this mecanically decrease the influence to $$0.6$$. This allows to represent the direct and indirect neighborhood of the nodes, without paying too much attention to far away nodes.

The value $\alpha$ must be choosen by taking into account the graph diameter. If the diameter is very small, it means you can reach any nodes within a few steps, so a small alpha is recommended.

## Comparing Two Nodes

As we have weights, the cosine similarity is great to compare two nodes:

$Sim(u, v) = \frac{\mathbf{w}^u \mathbf{w}^v}{\|\mathbf{w}^u\|\|\mathbf{w}^v\|}= \frac{\sum_{i \in \mathcal{N}} w^u_i w^v_i}{\sqrt{(\sum_{i \in \mathcal{N}} w^u_i )(\sum_{i \in \mathcal{N}} w^v_i )}}$

This return a value in $$[0, 1]$$, where a value close to $$1$$ means that the two nodes are very close.

## Building an Embedding

We use $$t$$-SNE 2 for this purpose. $$t$$-SNE is an embedding algorithm. It is used to represent high dimensional data into a low dimensional space, where it is easy to visualize the data.

The input of this algorithm is a set of vectors or a distance matrix. Here we “convert” the similarities we obtained by inversion to get a “distance”. You can see the full details on ArXiv 3.

1. Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 US election: Divided they blog. In Proc. Int. Workshop on Link Discov., pages 36–43, 2005  2

2. L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008

3. G. Candel and D. Naccache, Generating Local Maps of Science using Deep Bibliographic Coupling, ISSI 2021.

>> You can subscribe to my mailing list here for a monthly update. <<