1 State of the Art :

SPAM : unsolicited, unwanted e-mail sent indiscriminately

SPIT : Spam over Internet Telephony

1.1 Impact of spam

  • annoyance to individual users,
  • less reliable e-mails,
  • loss of work productivity,
  • misuse of network bandwidth,
  • wastage of file-server storage space and computational power,
  • spread of viruses, worms, and Trojan horses, and
  • financial losses through phishing, Denial of Service (DoS), directory harvesting attacks, etc.

[Siponen and Stucke, 2006].

2 Mathematical ranking of the classification models :

  • Social Engineering
  • Advertising
  • Phishing
  • Drive-by-exploit
  • Watering hole
  • Man in the middle
  • SQL injection

  • Social spam > spam aimed at users of a social network such as MySpace, Facebook or LinkedIn. It usually aims to promote an idea, a product or a website.

  • Backscatter SPAM > spam exploiting automatic replies such as "subscribe to the mailing list" or "Out of the office" messages

  • Image spam > detected with OCR (Optical Character Recognition)

  • Pump and dump > Pump and dump is a scheme that attempts to boost the price of a stock through recommendations based on false, misleading or greatly exaggerated statements. The perpetrators of this scheme, who already have an established position in the company's stock, sell their positions after the hype has led to a higher share price. This practice is illegal based on securities law and can lead to heavy fines.

  • Ponzi scheme : Pyramid scheme > The last person in the chain is the loser.

  • Post Truth / Poop and scoop > A highly illegal practice occurring mainly on the internet. A small group of informed people attempt to push down a stock by spreading false information and rumors. If they are successful, they can purchase the stock at bargain prices, as the overall marketplace will have sold off the security, causing the price to fall dramatically.

2.1.1 Spamming activities on Twitter

  • Link farming
  • Spamming trending topics
  • Phishing
  • Aggressive posting with social bots

2.1.2 URL

Detection with:

  • Black-list (most common, but malicious URLs evolve quickly)
  • Heuristic (signature-based detection, but a signature has to be designed for each known threat)
  • Machine learning (trained on static and dynamic features of the page)

Features :

  • Black-list F.
    • IP address
    • Directory Structure similarity
    • Top level domain
    • Brand name Equivalence
    • Query string substitution
  • Lexical F.
    • Property of the URL
    • Specific characters
    • Kolmogorov complexity
  • Host-based F.
    • Location
    • Identity of the host
    • Management style
  • Content-based F. : downloaded with the entire website
    • Lexical content
    • HTML DOM
    • JS
    • CANTINA
    • Active X
    • Visual Features
  • Others :
    • Popularity
    • Context
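As an illustration of the lexical features above, a minimal sketch of extracting descriptive properties from a URL. The exact feature set here (length, dot count, digit ratio, IP-like host, path depth) is a hypothetical choice, not a fixed standard:

```python
from urllib.parse import urlparse

def lexical_features(url):
    """Extract a few illustrative lexical features from a URL.

    This feature set is an illustrative example only; real systems
    combine many more lexical, host-based and content-based features.
    """
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "digit_ratio": sum(c.isdigit() for c in url) / len(url),
        # Hosts written as raw IP addresses are a classic suspicion signal
        "host_is_ip": host != "" and all(p.isdigit() for p in host.split(".")),
        "path_depth": parsed.path.count("/"),
    }

feats = lexical_features("http://192.168.0.1/login/secure-update.php")
```

Feature vectors like this one can then feed any of the classifiers discussed below.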

2.2 Preprocessing

  • Stop-words removal

a, "and", "or", ...

  • Lexical Analysis = Tokenization

Study the terms present in the email

  • Stemming

Transform words to their root form, i.e. no plurals, conjugations, ...

  • Representation

Change of format, in order to use machine learning
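The four preprocessing steps can be chained as follows. The stop-word list and the crude suffix-stripping stemmer are toy stand-ins for real resources (e.g. a Porter stemmer):

```python
import re
from collections import Counter

STOP_WORDS = {"a", "and", "or", "the", "to", "of"}  # toy list

def tokenize(text):
    # Lexical analysis: split into lowercase word tokens
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Very crude suffix stripping as a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    stems = [stem(t) for t in tokens]
    return Counter(stems)  # bag-of-words representation

bow = preprocess("Winning and winning: the winners win prizes")
```

The resulting word-count dictionary is the "representation" step: a format a classifier can consume.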

2.3 Machine Learning classification algorithms

Depending on the availability of the label of each data:

  • Supervised Learning
  • Semi supervised Learning
  • Unsupervised Learning
  • Similarity Learning (find the distance between two points)
  • String pattern analysis

Depending on the way you process, iterate and update your model:

  • Batch Learning (Stack of data)
  • Online (Stream of data, scalable)
  • Representation Learning
  • Others

Extreme Learning Machine (ELM): Hybrid features + Spherical classification

Qualities of a detection system

  • High Accuracy (high TP/TN, low FP/FN)
  • Fast detection
  • High Scalability
  • Strong adaptation : adapt to concept drift
  • High Flexibility

2.3.1 Online algorithms

Source : Malicious URL Detection using Machine Learning, March 2017

  • Usually better designed, as they must ensure the model is learnt correctly from a stream

Three kinds :

  • First Order :
    • Update the model given the current input
    • Passive aggressive L. (PA)
  • Second Order
    • Use statistics, suppose the distribution of weights
    • ex : Confidence Weighted (CW)
  • Cost-sensitive OL
    • Weight the two error types differently (tolerate some False Positives, minimize False Negatives)
  • Online Active Learning
    • Query the label of an unlabelled sample only when confidence is low

Passive Aggressive Learning

  • Passiveness : prevent the model from diverging too much from the previous one
  • Aggressiveness : change the model by the amount necessary to correct the prediction mistake (for a { + 1, −1} classification, reaching a margin of 1 is enough)
  • Assumes that the two classes are linearly separable.

Used more for descriptive features (statistical properties of lexical f.)

Improvement : PA-I, PA-II
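The PA update and its PA-I / PA-II variants can be sketched as follows; the hinge loss and the step size τ = loss / ‖x‖² follow the standard Passive-Aggressive formulation:

```python
def pa_update(w, x, y, variant="PA", C=1.0):
    """One Passive-Aggressive update for a {+1, -1} linear classifier.

    w and x are equal-length lists of floats; y is +1 or -1.
    PA-I caps the step at C; PA-II smooths it; both handle
    non-separable data.
    """
    dot = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * dot)            # hinge loss on the margin
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w                               # passive: no mistake, no change
    if variant == "PA":
        tau = loss / sq_norm                   # aggressive: fully restore margin 1
    elif variant == "PA-I":
        tau = min(C, loss / sq_norm)
    else:  # PA-II
        tau = loss / (sq_norm + 1.0 / (2 * C))
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = pa_update([0.0, 0.0], [1.0, 2.0], +1)
```

After the update the margin y · w·x is exactly 1: the model changed by just the amount needed to correct the mistake.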

Confidence Weighted

Same as PA, but uses second-order information:

  • Passiveness : prevent the distribution from changing too much from the previous one
  • Aggressiveness : correct the prediction mistake AND the classification confidence

Improvement : Adaptive Regularization of Weights (AROW) for non-separable data

Used more for lexical features.

2.4 KNN : k-Nearest Neighbours

3 Feature selection algorithms

Link

3.1 LSA Latent Semantic Analysis

Two documents are similar if they contain similar words.

D = {D1, ..., Dn} : set of documents

X = {X1, ..., Xm} : set of words present in the corpus

di = (xi1, ..., xim) : frequency of each word in document di

tj = (x1j, ..., xnj) : frequency of word j in each document

X is an m × n matrix (rows are terms t, columns are documents d)

Correlation:

  • Between two terms: tj^T tp (entries of X X^T)
  • Between two documents: dj^T dp (entries of X^T X)

Singular Value Decomposition:

X = U Σ V^T

This allows mapping terms and documents into the concept space.
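The term-term and document-document correlation matrices X Xᵀ and Xᵀ X can be computed directly; a pure-Python sketch on a toy 3-term × 2-document count matrix:

```python
def matmul(A, B):
    # Plain matrix product of nested lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

# Toy term-document count matrix X (m = 3 terms, n = 2 documents)
X = [[2, 0],
     [1, 1],
     [0, 3]]

term_corr = matmul(X, transpose(X))   # X X^T : term-term (m x m)
doc_corr = matmul(transpose(X), X)    # X^T X : document-document (n x n)
```

In practice the SVD of X (e.g. `numpy.linalg.svd`) would then truncate these correlations to the top concepts.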

3.2 Probabilistic LSA.

  • Asymmetric : MASHA (Multinomial ASymmetric Hierarchical Analysis)
  • Symmetric : HPLSA (Hierarchical Probabilistic Latent Semantic Analysis)

3.3 Chi Square

Link

b(w)= \frac{n_{spam\ with\ w}}{n_{spam}}
: ratio of the number of spam containing word w to the number of spam

g(w)= \frac{n_{ham\ with\ w}}{n_{ham}}
: ratio of the number of ham containing word w to the number of ham

p(w)= \frac{b(w)}{b(w)+g(w)}

s : strength given to the background information

x : probability that an unseen word belongs to a spam

n : number of messages containing w

f(w) = \frac{(s \cdot x) + (n \cdot p(w))}{s + n}
: belief that word w indicates spam
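The belief f(w) can be computed as follows; taking n as the total number of messages containing w is an assumption consistent with Robinson-style word-probability smoothing:

```python
def spam_belief(n_spam_with_w, n_spam, n_ham_with_w, n_ham,
                s=1.0, x=0.5):
    """Smoothed belief that a word w indicates spam.

    s is the strength given to the background prior and x the prior
    probability assumed for an unseen word (the values here are
    common defaults, not mandated ones).
    """
    b = n_spam_with_w / n_spam          # spam frequency of w
    g = n_ham_with_w / n_ham            # ham frequency of w
    if b + g == 0:
        return x                        # word never seen: fall back on prior
    p = b / (b + g)
    n = n_spam_with_w + n_ham_with_w    # assumption: messages containing w
    return (s * x + n * p) / (s + n)

f = spam_belief(n_spam_with_w=20, n_spam=100, n_ham_with_w=5, n_ham=100)
```

The smoothing pulls rare words toward the neutral prior x, so a word seen once does not dominate the classification.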

3.4 MI: Mutual Information

3.5 GINI: Improved Gini index

3.6 IG : Information Gain

Kullback-Leibler divergence for decision trees

3.7 OCFS : Orthogonal Centroid Feature Selection

Cluster based approach

3.8 CMFS : Comprehensively Measure Feature Selection

Derivative

3.9 DICE coefficient

3.10 Deviations from Poisson Feature Selection (DFPFS)

3.11 DIA association factor

3.12 DSC : DICE similarity coefficient

3.13 SNA : Social Network Analysis

  • Further develops the idea of creating a social network graph for inferring reputation ratings of individuals or e-mail addresses.

3.14 Graph clustering

3.15 Analysis of click streams

Click down - Click Up - Wheel - move - double click

3.16 Empirical Evaluation and New Design for Fighting Evolving Twitter Spammers

2017

Local clustering coefficient

number of existing links / number of possible links among the neighbours


LC(v)= \frac{2|e^v|}{K_v \cdot (K_v - 1)}

|e^v| : total number of edges among v's neighbours; K_v : indegree + outdegree of v
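A minimal sketch of the local clustering coefficient on an undirected graph stored as adjacency sets (for an undirected graph, K_v reduces to the plain degree):

```python
def local_clustering(adj, v):
    """Local clustering coefficient LC(v) on an undirected graph.

    adj maps each node to the set of its neighbours; LC(v) is the
    fraction of possible links among v's neighbours that exist.
    """
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count each edge among the neighbours once (unordered pairs u < w)
    links = sum(1 for u in neighbours for w in neighbours
                if u < w and w in adj[u])
    return 2.0 * links / (k * (k - 1))

# Toy graph: a triangle a-b-c plus a pendant node d attached to a
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
lc_a = local_clustering(adj, "a")
```

Node a has one edge (b-c) among the three possible between its neighbours, so LC(a) = 1/3, while LC(b) = 1 since both of b's neighbours are connected.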

Betweenness centrality


BC(v)= \frac{1}{(n-1)(n-2)} \sum_{s\neq v \neq t \in V} \frac{\delta_{st}(v)}{\delta_{st}}

δ_st : number of shortest paths between s and t; δ_st(v) : number of those that pass through v; n : number of vertices of the graph

Bidirectional Links Ratio

Represents the reciprocity between two accounts


R_{blink} = \frac{N_{bilink}}{N_{fing}}

N_bilink : number of bidirectional links; N_fing : number of accounts followed

Average neighbours' followers


A_{nfer}=\frac{1}{|N_{fing}(v)|} \sum_{u\in N_{fing}(v)} N_{fer}(u)
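Both account-level features can be sketched on a toy follower graph; the dictionary-of-sets representation of followings and followers is an illustrative choice:

```python
def bilink_ratio(following, v):
    """R_blink: fraction of v's followings that follow v back."""
    fing = following[v]
    bilinks = sum(1 for u in fing if v in following.get(u, set()))
    return bilinks / len(fing) if fing else 0.0

def avg_neighbour_followers(following, followers, v):
    """A_nfer: mean follower count over the accounts v follows."""
    fing = following[v]
    if not fing:
        return 0.0
    return sum(len(followers.get(u, set())) for u in fing) / len(fing)

# Toy network: v follows a and b; only a follows v back
following = {"v": {"a", "b"}, "a": {"v"}, "b": set()}
followers = {"a": {"v"}, "b": {"v"}, "v": {"a"}}

r = bilink_ratio(following, "v")
anf = avg_neighbour_followers(following, followers, "v")
```

Spam accounts tend to have a low bidirectional-link ratio: they follow many accounts that never reciprocate.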

3.17 Social Reputation

3.18 SNA : ROBUST REPUTATION-BASED RANKING ON MULTIPARTITE RATING

May 2017

Unsupervised learning for detecting spamming attacks on rating websites, based on the relationships between users. It ranks the identified users and the marks they assign. The dataset contains tuples of (id, mark, item, timestamp).

Robustness : the ability to resist spamming attacks

Push attack : Try to increase the rating of an article

Nuke attack : Try to decrease the rating

Random spamming : Assign random marks to random items

Love/Hate attack : combines the Push and Nuke attacks. Select one item to promote and another one to demote

Reputation attack : select one item to promote / demote, and rate SEVERAL other items close to their current scores (so that the scores of the latter do not change)

3.19 ENWalk: Learning Network Features for Spam Detection in Twitter

NB : bad detection rate (70%; we can do better). April 2017

First, define the success rate from Twitter's point of view :

sr_u = \frac{|Followers_u|}{|Followed_u|}
, where |.| stands for the cardinality.

3.19.1 Type of spammers

  • Follow-flood spammer (sru < 1) : they add many friends and flood the network. Not very successful.

  • Vigilant spammer (sru > 1) : they add new content, create a community, develop relationships.

3.19.2 Fraudulence

Tweets are considered as fraudulent if :

  • They contain adult words
  • They contain blacklisted links
  • They are promotional
  • They are duplicated
  • They contain violent words

fr_u = \frac{|Fraudulent Tweets_u|}{|Tweets_u|}

Average fraudulence : flood spammers 0.86, vigilant spammers 0.34

3.20 Twitter: The paradigm-shift of social spambots

January 2017

  • Graph clustering
  • Stream clustering
  • Bot or Not

3.21 Discovery, Retrieval, and Analysis of ’Star Wars’ botnet in Twitter

April 2017

Purposes of bots :

  • Spamming: advertising, malware
  • Fake trending topics: create a popular topic from scratch
  • Opinion manipulation: a group of bots representing a "public opinion"
  • Astroturfing attack: create a fake sense of agreement by hiding the initiator of a news item within a community
  • Fake followers: bought online to improve an account's attractiveness
  • Streaming API contamination: ?

Bot : a machine able to send messages to specific targets based on code. Botmaster : the owner of the bots. Botnet : a network of bots built from the same code.

3.22 Detecting Sockpuppets in Deceptive Opinion Spam

March 2017

The problem here is that spammers can take on an identity, spam with it, and then switch to another one. The aim is therefore to reconstruct a global identity of the user from the activities of its multiple identities.

They use spy sets, which are labelled sets transformed into unlabelled sets in some way ... [TODO : learn more about]

Spy induction recovers the labels given the nearest and farthest neighbours

Authorship Attribution (AA) : given a set of authors with their sets of documents, and a set of unattributed documents, attribute each document to its author. Assumes a large enough dataset per author (which makes the problem different from sockpuppetry)

Authorship Verification (AV) : given a set of documents written by a user, determine whether a new document was written by that user. Allows detecting imposters

SockPuppet detection : detectable with SVM and Part-of-Speech features, for detecting the several identities of a user

... Not the most interesting paper... but it develops the concept of Farthest Neighbours

No proper dataset; the data were generated with about 40 people ...

3.23 TF-IDF, IR Information Retrieval

3.24 Challenge-response system

  • CAPTCHA : Completely Automated Public Turing test to tell Computers and Humans Apart

3.25 HIN : Heterogeneous Information Network

G = (V, E) with r > 1 types of nodes and s > 1 types of relations.

  • Each node v ∈ V belongs to a node type n ∈ N
  • Each link e ∈ E belongs to a link type l ∈ L

If two links are of the same type, then the starting nodes of the two links are of the same type, and so are the ending nodes:

for v_{i,1} --(e_i)-- v_{i,2} and v_{j,1} --(e_j)-- v_{j,2},

if e_i, e_j ∈ l_k then: - v_{i,1}, v_{j,1} ∈ n_l - v_{i,2}, v_{j,2} ∈ n_m

The nodes belong to the same type, but are not necessarily equal! There is a hierarchy of types.

3.26 Ensemble classifier

Helps to combine new and old filters in order to evolve

3.26.1 Node centrality (1978)

Find central segments

3.26.2 cluster analysis

3.26.3 Topological structure

3.27 MELA : Message Linguistic Analysis

WVSM : Word-Vector Space Model; n-grams; longest common substring

CTA : Call To Action

3.28 MPA : Message Pattern Analysis

4 To do

4.1 Preprocessing

  • Frequency of Words
  • LSA reduction / transformation
  • Remove improbable words and too-frequent words
  • Subset of spam / not spam

4.1.1 Text preprocessing

  • hashtag handlers
  • number handlers
  • URL handlers
  • user handlers
  • diacritic removal : ê, é, è == e
  • punctuation removal
  • lower casing [... to be completed]
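A sketch of the handlers above chained into one normalizer; the placeholder tokens (URL, USR, HASHTAG, NUM) are illustrative choices:

```python
import re
import unicodedata

def normalize_tweet(text):
    """Normalize a tweet before tokenization.

    Replaces URLs, user handles, hashtags and numbers with placeholder
    tokens, strips diacritics and punctuation, and lower-cases.
    """
    text = re.sub(r"https?://\S+", " URL ", text)   # url handler
    text = re.sub(r"@\w+", " USR ", text)           # user handler
    text = re.sub(r"#\w+", " HASHTAG ", text)       # hashtag handler
    text = re.sub(r"\d+", " NUM ", text)            # number handler
    # Diacritic removal: decompose accented chars, drop combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", " ", text)            # punctuation removal
    return " ".join(text.lower().split())           # lower casing

out = normalize_tweet("Café #promo! Win 100$ at http://spam.example @bob")
```

The order matters: URLs must be replaced before punctuation removal, otherwise their slashes and dots are destroyed first.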

4.1.2 Tokenization

  • word n-grams
  • character n-grams
  • skip-grams

4.1.3 Weighting scheme

  • TF
  • IDF
  • TFIDF
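The three weighting schemes combine as follows; this sketch uses raw counts for TF and log(N/df) for IDF, one of several common variants:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenized documents.

    tf = raw count of the term in the document,
    idf = log(N / df) where df is the number of documents
    containing the term.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["cheap", "pills", "cheap"], ["meeting", "notes"], ["cheap", "meeting"]]
w = tfidf(docs)
```

A term present in every document gets weight 0 (log 1), which is exactly the "remove too-frequent words" effect asked for above.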

4.1.4 Classifier

  • SVM

4.1.5 Features

TODO : improve this section

  • Behavioural : metadata of the review, not its content (early time frame, threshold rating deviation of the review)

  • Linguistic : features extracted from the content (first person, exclamation symbols)

  • User behavioural : burstiness, grades given to the businesses, negative ratio

  • User linguistic : average content similarity, maximum content similarity

4.2 To pay attention :

  • False Positive : a good email classified as spam

For a binary classification :

  • Ham ==> Good email
  • Spam ==> Undesirable email

For a n-classification : Classes of emails

  • Work
  • Mailing list
  • Commercial
  • Family
  • ...

5 Targets

5.1 Opinion spamming

  • Yelp
  • Lafourchette
  • Expedia
  • Hotels.com
  • Orbitz
  • Priceline
  • TripAdvisor

6 Related publications

6.1 Estimating the Prevalence of Deception in Online Review Communities

April 2012

Before :

  • Search for duplication (but it does not inform about deception)
  • Subjective annotations
  • n-grams
  • http://www.mturk.com

6.2 Contextual Multi-armed Bandits for the Prevention of Spam in VoIP Networks - CMABFAS

July 2012

  • Online algorithm == improves as data arrive
  • Low memory consumption : keeps "balls", i.e. subsets representing the data (not samples themselves)
  • A better solution each time (given the known data)

Tree decomposition of the MAB problem, with the variance reduced as precision increases.

6.2.1 Data

Uses the header data of the messages. Uses the Hamming distance to compare the differences over 16 fields.

6.2.2 Variables involved

In the MAB problem, there is a set of actions A = {a1,...,ak} representing the decisions that can be taken at each step.

B_i^a, i ∈ {1,...,n_B^a} : the i-th ball of action a

n_B^a : number of balls for action a

x(B_i^a) ∈ χ : location of ball B_i^a

r(B_i^a) ∈ [0,1] : radius of ball B_i^a

n_t(B_i^a) : number of samples present in B_i^a at time t

q_t(B_i^a) : reward at time t

d(x, x(B_i^a)) : distance from a point x to a ball; the point belongs to the ball if the distance is smaller than 1.

avg_t(.) = \frac{q_t(.)}{n_t(.)}

conf_t(.) = c\sqrt{\frac{\log T}{n_t(.)}}

size_t(.) = 2λr(.)

with (.) = B_i^a

A ball is full if conf_t(B_i^a) < r(B_i^a)
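The confidence term and the fullness test can be sketched directly (c is the exploration constant and T the horizon, as in the definitions above):

```python
import math

def conf(n, T, c=1.0):
    """Confidence width of a ball after n samples, over horizon T."""
    return c * math.sqrt(math.log(T) / n) if n > 0 else float("inf")

def is_full(n, radius, T, c=1.0):
    """A ball is full once its confidence width drops below its
    radius, which is what triggers the creation of a child ball of
    half the radius."""
    return conf(n, T, c) < radius

# With T = 1000 and radius 0.5, the ball fills up as samples accumulate
T, r = 1000, 0.5
full_states = [is_full(n, r, T) for n in (1, 10, 100)]
```

The more samples fall inside a ball, the narrower its confidence interval, so the algorithm progressively refines the context space where data actually arrive.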

6.2.3 How the algorithm works

Initialization

For each action a, a ball of radius 1 (covering the whole space) is defined. The number of elements in each ball is zero: n_0(B_1^a) = 0

q_0 = 0 (no reward yet); r(B_1^a) = 1 (maximal size)

Iterative step

At each time t, a data point with position x_t arrives. The set of balls in which x_t lies is computed, giving the set of active balls A_a(x_t) = {B_i^a | x_t ∈ B_i^a}.

Among this set of active balls, the set of relevant balls is defined: the balls that are not full, or that allow the creation of a child ball.

The child balls have:

  • a radius of
    \frac{1}{2}
    that of the parent ball,
  • their centre contained in the parent ball,
  • and must not overlap with the balls of the same level, i.e. the centre x_t must not belong to another sphere.

Having selected the candidate balls, the weight function is computed for each possible action:

u(x_t, a) = min[avg_t(.) + conf_t(.) + size_t(.)]

The chosen action is the one maximizing the weight function:

a^* = argmax_a u(x_t, a)

All the balls corresponding to the chosen action are then updated:

q_{t+1} = q_t + r_{a^*}(x_t), n_{t+1} = n_t + 1

6.2.4 Actions and penalties

The algorithm itself works provided it is given:

  • a distance function allowing two elements to be compared,
  • the possible actions A = {a1,...,ak},
  • a definition of the reward function.

In the original publication, two security tests are defined. The three actions in that case are:

  • A1 : do nothing, let the call through
  • A2 : apply test 1; if the test is passed, the call is forwarded, otherwise it is flagged as SPIT
  • A3 : apply security test 2, following the same procedure as A2

The variable taken into account is the call duration, with SPIT calls lasting 30 seconds on average against 120 seconds for normal calls.

We write Exp(j) = exp(j). The tests have a cost: the mere fact of running a test is already a sign of failure.

For spam:

  • Action 1 :

r_{A1}(x) = Exp(30)

  • Actions 2 & 3, if the test is passed :

r_{A2−3}(x) = Exp(30) − Cost(A2−A3)

  • If tests 2 and 3 are not passed :

r_{A2−3}(x) = 100 − Cost(A2−A3)

For non-spam:

  • Action 1 :

r_{A1}(x) = Exp(120)

  • Actions 2 & 3, if the test is passed :

r_{A2−3}(x) = Exp(120) − Cost(A2−A3)

  • If tests 2 and 3 are not passed :

r_{A2−3}(x) = 100 − Cost(A2−A3)

6.3 Modeling Review Spam Using Temporal Patterns and Co-Bursting Behavior

November 2016

Hidden Markov Model with 2 states :

  • Active, time between posts is "fast"
  • Inactive, time between posts is "slow"

The state is determined by the time difference between two consecutive posts.

The inter-posting time is modelled by an exponential distribution (the inter-arrival time of a Poisson process).

Exponential distribution:

f(λ, x) = λ exp(−λx) if x > 0, f(λ, x) = 0 otherwise

E(X) = \frac{1}{\lambda}

A large λ means a short time between two events.
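The relation E(X) = 1/λ can be checked empirically with the standard library:

```python
import random

# Inter-arrival times of a Poisson process follow an exponential
# distribution; a large lambda means a short mean time between events.
random.seed(0)
lam = 2.0
samples = [random.expovariate(lam) for _ in range(10000)]
mean = sum(samples) / len(samples)  # close to 1/lam = 0.5
```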

This is done in order to discover "raised raters": some accounts are trained to publish real reviews before becoming spammer accounts.

Non-spammer :

  • Active -> Inactive Similarly
  • Inactive -> Active Differently

Spammer :

  • Active -> Inactive Differently
  • Inactive -> Active Similarly

Similarly : they change state at around the same time (vertical bar: given an emission state at a given time, we know it will switch to the other state)

Differently : they change state after a certain time (horizontal bar)

6.4 DDoS answer