SPAM : an unsolicited, unwanted e-mail that was sent indiscriminately
SPIT is an acronym for spam in internet telephony
[Siponen and Stucke, 2006].
SQL injection
Social spam >Spam destiné à un utilisateur d'un réseau social comme MySpace, Facebook ou LinkedIn. Le spam a habituellement comme objectif de promouvoir une idée, un produit ou un site Web.
Backscatter SPAM > Spam for "subscribe to the mailling list" or "Out of the office"
Image spam : > Use a OCR (Optical Caracter Recognition)
Pump and dump > Pump and dump is a scheme that attempts to boost the price of a stock through recommendations based on false, misleading or greatly exaggerated statements. The perpetrators of this scheme, who already have an established position in the company's stock, sell their positions after the hype has led to a higher share price. This practice is illegal based on securities law and can lead to heavy fines.
Ponzi scheme : Pyramid scheme > The last person on the chain is the looser.
Post Truth / Poop and scoop > A highly illegal practice occurring mainly on the internet. A small group of informed people attempt to push down a stock by spreading false information and rumors. If they are successful, they can purchase the stock at bargain prices, as the overall marketplace will have sold off the security, causing the price to fall dramatically.
link farming spamming trending topics phishing aggresive posting with social bot
Detection with:
Features :
a, "and", "or", ...
Study the terms present into the email
Transform words to the original form, ie no plurial, conjugaison ...
Change of format, to use Machine Learning
Depending on the availability of the label of each data:
Depending on the way you process, iterate and update your model:
Extreme Learning Machine (ELM): Hybrid features + Spherical classification
Qualities of a detection system
Source : Malicious URL Detection using Machine Learning Mars 2017
Three kinds :
Passive Aggressive Learning
Used more for descriptive features (statistical properties of lexical f.)
Improvement : PA-I, PA-II
Confidence Weighted
Same as PA, but use second order info - Passiveness : Avoid the distribution to change too much from the previous one - Aggressiveness : Correct the prediction mistake AND the classification confidence
Improvement : Adaptive Regularisation of Weights (ARoW) for non separable data
Used more for lexical features.
If two document are similar based on the similarity of the words
D = {D1, ..., Dn} Set of documents
X = {X1, ..., Xm} Set of words present in the corpus
di = {xi1, ..., xim} Frequency of each word in the document di
ti = {x1j, ..., xnj} Frequency of the word j in all the documents
X is a m × n matrices (row are t, col are d)
Correlation:
Single Value Decomposition:
X = UΣVT
Allow you to go to the concept space.
s : Strength that we give to a background information
x: Proba that an unseen word will belong to a spam
Kullback-Leibler divergence for decision trees
Cluster based approach
Click down - Click Up - Wheel - move - double click
2017
Local clustering coefficient
number of effective / number of possible links
|ev| total number of edges of v neighbours Kv Indegree + outdegree of v
Betweeness centrality
δst : Number of shortest paths between s and t δst(v) that go through v n : number of vertices of the graph
Bidirectional Links Radio
Représente la reciprocité de deux comptes
Nbilink number of bidirectional links N[fing] number of following
Average neighbours followers
May 2017
Unsupervised learning for detecting spamming attack over ranking websites, based on the relationship between users Here, it try to rank between identified users and rank they assign. The dataset contain tupples of (id, marks, item, timestamp)
Robustness : if we are able to resist to spamming attack
Push attack : Try to increase the rating of an article
Nuke attack : Try to decrease the rating
Random spamming : Assign random marks to random items
Love/Hate attack : See Push / Nuke attacks. Select one item to increase and one other to decrease
Reputation attack : Select one item to decrease / increase, and select SEVERAL and rate them with something similar (so the scores of the latests do not change)
NB : Bad detection rate (70%, We can do better) April 2017
Defining first the success rate for Twitter point of view :
Follow-flood spammer sru < 1 They add a lot of friends and flood the network. It is not very successful
Vigilant spammer sru > 1 They add new content, create a community, develop relationships
Tweets are considered as fraudulent if :
Flood spammer : 0.86, Vigilant : 0.34
Janvier 2017
April 2017
Purposes of bots :
Bot : Machine able to send message to specific target based on a code Botmaster : Owner of the bots Botnet : Network of bot from the same code
Mars 2017
The problem here is that spamming people can take an identity, spam with it, and change to another. So the aim is to reconstruct, assemble a global identity of this user given its multiple identity activities.
They use spy sets, which are labelled sets transformed to unlabelled set in some way ...[TODO : learn more about]
Spy induction is to recover the labels given the neirest and farthest neighbors
Authorship Attribution (AA) : Given a set of authors with their set of documents, and a set of unatributed documents, reassemble author. Assume a big enough data set per author (which makes the problem different from sockpuppet)
Authorship Verification (AV) : Given a set of document written by a user, determine if a new document has been writted by the user. Allow to detect imposters
SockPuppet detection : detectability with SVM and PartOfSpeach, for detecting several ids of a user
... Not the most interesting paper... But develop the concept of Farthest Neighbors
No proper dataset, the data have been generated with 40 people or so ...
G = (V, E) r > 1 types of nodes and s > 1 types of relations.
If 2 links are of the same type, the starting and ending nodes of the two links are the same.
for vi, 1 − −(ei)− − vi, 2 and vj, 1 − −(ej)− − vj, 2.
If ei, ej ∈ lk then: - vi1, vj1 ∈ nl - vi2, vj2 ∈ nm
The nodes belongs to the same type, but at not necessarily equal ! There is a hierarchy level
Help to use new and old filter in order to evolve
Find central segment
WVSM : Word-Vector Space Model n-grams Longuest common substring
CTA : Call To Action
Behavioral Metadata of the review, but not on the content Early time frame, Threshold rating deviation of review
Linguistic Features extracted from the content First person, exclamation symbols
User behavioral burtiness, grade given the businesses, Negative Ratio
User linguistic Average content similarity Maximum content similarity
For a binary classification :
For a n-classification : Classes of emails
April 2012
Before :
Juillet 2012
Decomposition en arbre du probleme MAB, avec réduction de la variance au fur et à mesure de la précision
Utilises les données de headeur des messages. Utilise la distance de Hamming pour comparer les différences sur 16 champs.
Dans le problème MAB, il y a un set d'actions A = {a1,...,ak} qui représente les décisions qui peuvent être prise à chaque étape.
Bia = {1,...,nBa} : index de la ieme balle
nBa : nombre de balles au moment de l'action a
x(Bia)∈χ : localisation de la balle Bia
r(Bia)∈[0,1] : rayon de la balle Bia
nt(Bia) : nombre d'échantillon présent dans Bia au temps t
qt(Bia) : récompense au temps t
d(x, x(Bia)) : distance d'un point x à une balle. Elle y appartient si la distance est inférieure à 1.
Sizet(.)=2λr(.)
with (.) = Bia
A Ball is full if conft(Bia)<r(Bia)
Initialisation
Pour chaque action a, une balle de rayon 1 (recouvrant tout l'espace) est définie. Le nombre d'élement dans chaque balle est nul n0(B1a)=0
q0 = 0, pas de récompense, r(B1a)=1 Taille maximale
Etape iterative
A chaque instant t arrive une donnée de position xt. L'ensemble des balles dans laquelle xt se positionne est calculé. Cela donne l'ensemble des balles actives Aa(xt)={Bia|xt∈Bia}
Parmi ce set de balles actives, le set des balles pertinantes est défini, qui correspond aux balles permettant la création d'un enfant ou non pleine.
Les balles filles ont :
Ayant sélectionné les balles potentielles, calcul de la fonction de poids pour chaque action possible :
u(xt, a)=min[avgt(.)+conft(.)+sizet(.)]
L'action choisie est celle maximisant la fonction de poids.
$ a^* = argmin u(x_t, a)$
On met ensuite à jour toutes les balles correspondant à l'action choisie.
qt + 1 = qt + rat*(xt) nt + 1 = nt + 1
L'algorithme en lui même fonctionne si on lui fourni :
Dans la publication originelle, deux tests de sécurité sont définis. Les trois actions dans ce cas sont :
La variable prise en compte est le temps des appels, en prenant comme durée moyenne des appels SPIT de 30 secondes, contre 120 pour les appels normaux.
On notera Exp(j)=exp(j) Les tests ont un cout : le fait de faire le test est déja un signe de défaillance.
Pour les spams :
rA1(x)=Exp(30)
rA2 − 3(x)=Exp(30)−Cout(A2 − A3)
rA2 − 3(x)=100 − Cout(A2 − A3)
Pour les non-spams :
rA1(x)=Exp(120)
rA2 − 3(x)=Exp(120)−Cout(A2 − A3)
rA2 − 3(x)=100 − Cout(A2 − A3)
November 2016
Hidden Markov Model with 2 states :
The state is determined given the time difference between
The posting position is modelised by a Poisson Law ( = Exponential distribution).
Poisson law :
f(λ, x)=λexp(−λx) Si x > 0 f(λ, x)=0 Sinon
λ grand : temps entre deux évenements court
This is done in order to discover which is a "raised rater" Some account are trained to publish real review, for after being a spammer account.
Non-spammer :
Spammer :
Similarly : it's around the same time that they change of state (Vertical Bar : with an emission state of a given time, we know that it will change to the other state)
Differently : They change of state after a certain time (Horizontal bar)