Machine Learning in Spam Filtering (II)
Support Vector Machine (SVM)
SVM is a domain-based algorithm: it focuses on the boundary between classes rather than on the distribution of the data set. It searches for the boundary with the maximum margin between the positive and negative cases, and it can handle a large number of features without feature selection. Because of this characteristic, the main difficulty is the heterogeneous setting of users: a different standard, such as a different choice of words, forces the boundary to be redrawn, while a single global filter will be highly biased.
- Haider et al. (2007) apply SVM through an incremental supervised clustering algorithm, because they believe spam messages are generated from templates.
- Kanaris et al. (2007) show that n-grams + information gain + SVM is effective.
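To make the margin idea concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss over character n-gram counts. It is not the method of either paper above: the information gain step is left out, and the toy messages, the function names, and the hyperparameters (`epochs`, `lam`, `lr`) are assumptions chosen for illustration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased message."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def score(w, b, x):
    """Signed distance-like score of a sparse feature dict x."""
    return sum(w.get(f, 0.0) * v for f, v in x.items()) + b

def train_linear_svm(samples, labels, epochs=200, lam=0.01, lr=0.1):
    """Minimise the L2-regularised hinge loss with sub-gradient descent.

    samples: list of sparse feature dicts; labels: +1 (spam) or -1 (ham).
    """
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * score(w, b, x)
            # The regulariser shrinks every weight a little on each step.
            for f in w:
                w[f] *= 1.0 - lr * lam
            # Only examples inside the margin move the boundary.
            if margin < 1.0:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + lr * y * v
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if score(w, b, x) >= 0 else -1

# Hypothetical toy corpus: +1 is spam, -1 is legitimate mail.
messages = [
    "win free money now",
    "meeting at noon tomorrow",
    "free prize claim now",
    "lunch with the team",
]
labels = [1, -1, 1, -1]
samples = [char_ngrams(m) for m in messages]
w, b = train_linear_svm(samples, labels)
print(predict(w, b, char_ngrams("claim your free prize")))
```

Note that the weights depend only on the examples that fall inside the margin, which is why a user with different word choices ends up with a different boundary.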
Artificial Neural Network (ANN)
ANN is a broad concept that I will cover in depth in later passages. To describe all the ANN algorithms (such as Kalman Filter, Self-Organizing Map, deep belief network, etc.) in general terms, I will use the definition from Wikipedia:
“An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one neuron to the input of another.”
Its applications in spam filtering do not differ much from its applications elsewhere.
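As a rough illustration of "the output of one neuron feeding the input of another", the sketch below runs a forward pass through a tiny fully connected network. The layer sizes, the hand-picked weights, and the sigmoid activation are all assumptions for the example; a real network would learn its weights from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, layers):
    """Propagate inputs through fully connected layers.

    layers is a list of (weights, biases) pairs, where weights[j][i]
    is the connection from input i to neuron j of that layer.
    """
    activations = inputs
    for weights, biases in layers:
        # Each neuron sums its weighted inputs and applies the activation;
        # the resulting outputs become the inputs of the next layer.
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + bias)
            for row, bias in zip(weights, biases)
        ]
    return activations

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = ([[0.8, -0.4, 0.3], [-0.5, 0.9, 0.1]], [0.0, 0.1])
output = ([[1.2, -0.7]], [-0.2])

# Inputs could be word-presence flags for a message (an assumption here).
result = forward([1.0, 0.0, 1.0], [hidden, output])
print(result)
```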
Logistic Regression
Logistic regression performs categorization by regression. The algorithm proposed by Goodman and Yih (2006) can be described as a simple linear model whose features are the words in the body and headers of each message; the weights of the model are trained using online gradient descent on a logistic regression model. In other words, this algorithm is just the model mentioned in the second section, plus a gradient descent step that updates the weights to “maximize the log of the probability of the training data”.
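A minimal sketch of this kind of online training might look as follows. It is not Goodman and Yih's implementation: it uses only whitespace-separated body words as binary features, and the toy messages, learning rate, and number of passes are assumptions.

```python
import math

def sigmoid(z):
    # Numerically stable for large negative or positive scores.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def features(message):
    """Binary word-presence features (headers are ignored in this sketch)."""
    return set(message.lower().split())

def spam_probability(w, b, message):
    return sigmoid(b + sum(w.get(t, 0.0) for t in features(message)))

def train_online(stream, lr=0.5, epochs=20):
    """Update the weights after every message, one message at a time."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for message, label in stream:
            # label - p is the gradient of the log-likelihood with respect
            # to the score, so each step climbs the log-probability of the data.
            error = label - spam_probability(w, b, message)
            b += lr * error
            for t in features(message):
                w[t] = w.get(t, 0.0) + lr * error
    return w, b

# Hypothetical labelled stream: 1 is spam, 0 is legitimate mail.
stream = [
    ("win free money now", 1),
    ("meeting at noon tomorrow", 0),
    ("free prize claim now", 1),
    ("lunch with the team", 0),
]
w, b = train_online(stream)
print(spam_probability(w, b, "free money"))
print(spam_probability(w, b, "team meeting"))
```

Because each update touches only the words in the current message, the filter can keep learning as new mail arrives instead of being retrained from scratch.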
Artificial Immune System
One of the important artificial immune system algorithms that I ran into while researching novelty detection is the negative selection algorithm. During the training stage it generates antibodies, and it then compares new data against those antibodies to detect novel cases.
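The two stages can be sketched roughly as below, using real-valued points in the unit square. The representation, the matching rule (Euclidean distance under a fixed radius), the number of candidates, and all names are assumptions for illustration; published variants differ in how antibodies are encoded and matched.

```python
import math
import random

def generate_detectors(self_samples, n_candidates=2000, radius=0.15, seed=0):
    """Training stage: keep random candidates that match no normal sample."""
    rng = random.Random(seed)
    detectors = []
    for _ in range(n_candidates):
        candidate = (rng.random(), rng.random())
        # Negative selection: a candidate that reacts to "self" is discarded.
        if all(math.dist(candidate, s) > radius for s in self_samples):
            detectors.append(candidate)
    return detectors

def is_novel(point, detectors, radius=0.15):
    """Detection stage: a point matched by any surviving detector is novel."""
    return any(math.dist(point, d) <= radius for d in detectors)

# Hypothetical normal data clustered near (0.2, 0.2).
self_samples = [(0.20, 0.20), (0.25, 0.18), (0.18, 0.24), (0.22, 0.27)]
detectors = generate_detectors(self_samples)

print(is_novel((0.20, 0.20), detectors))
print(is_novel((0.90, 0.90), detectors))
```

By construction no training sample can be flagged, since every surviving detector lies farther than `radius` from all of them.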
Because most of the algorithms mentioned above, such as ANN, logistic regression, and artificial immune systems, are applied in a similar way, the major takeaway from spam filtering is Bag of Words (BoW): represent each message as a vector, with or without feature extraction, feed it to a traditional machine learning algorithm, and get good results.
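For reference, a bare-bones version of the BoW representation can be sketched like this. Whitespace tokenization, the sorted vocabulary, and the function names are assumptions; real filters usually add steps such as stop-word removal or feature selection.

```python
from collections import Counter

def build_vocabulary(messages):
    """Map every distinct lowercased word to a fixed vector position."""
    words = sorted({t for m in messages for t in m.lower().split()})
    return {word: index for index, word in enumerate(words)}

def bag_of_words(message, vocabulary):
    """Count vector in vocabulary order; unseen words are dropped."""
    counts = Counter(message.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

# Hypothetical training messages.
training = ["free money free", "meeting at noon"]
vocabulary = build_vocabulary(training)  # at, free, meeting, money, noon

print(bag_of_words("free money free", vocabulary))  # [0, 2, 0, 1, 0]
print(bag_of_words("free lunch", vocabulary))       # [0, 1, 0, 0, 0]
```

Word order is discarded entirely, which is what lets the same vector be handed to any of the classifiers discussed above.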
## References
- Goodman, J., & Yih, W. (2006). Online discriminative spam filter training. In Proceedings of the Third Conference on Email and Anti-Spam.
- Guzella, T. S., & Caminhas, W. M. (2009). A review of machine learning approaches to spam filtering. Expert Systems with Applications, 36(7), 10206–10222. doi:10.1016/j.eswa.2009.02.037