Machine Learning for Anomaly Detection
An anomaly is a data point that differs significantly from the other data points, and it is often very difficult to identify. Various statistical techniques can help identify anomalous values, but this article focuses on a few machine learning techniques.
Unsupervised Techniques
- k-means clustering
- Isolation Forest
- One Class Support Vector Machine
The Data
For this project I focused on three separate datasets that can all be found on the Numenta Anomaly Benchmark (NAB) GitHub webpage.
- Amazon East Coast AWS datacenter CPU usage
- Temperature of an internal component of an industrial machine
- Ambient temperature in an office space
Each of these datasets contains only two features: time and value.
Methodology
- Feature Engineering: expand the available features in the hope of improving machine learning performance.
- Dimensionality Reduction using PCA: k-means clustering relies on Euclidean distance, which tends to become less informative as the number of dimensions grows. Since feature engineering adds dimensions, I reduce the data back down to two components before clustering.
- Outlier Fraction = 0.001: each of these algorithms needs a value that determines how many anomalies to flag. This value can be tuned to capture more or fewer anomalies, and the right setting depends heavily on the data and on how many anomalies you expect to find.
Feature Engineering
For the first dataset there is a lot of fluctuation minute to minute, as well as hour to hour and day to day. So for additional features I pulled minutes, hours, and days.
There didn’t seem to be as much fluctuation for the industrial machine part, so I simply used hours and days as additional features.
For the office temperature, there was a clear day-to-day pattern on weekdays, as well as on weekends. So for this data I extracted hours, days, working hours, and weekdays as additional features.
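To make the feature extraction concrete, here is a minimal sketch using pandas. It is not the exact code behind this project: the `add_time_features` helper, the synthetic data, the reading of "days" as day of week, and the 08:00 to 17:59 definition of working hours are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with calendar features derived from 'timestamp'."""
    out = df.copy()
    ts = pd.to_datetime(out["timestamp"])
    out["minute"] = ts.dt.minute
    out["hour"] = ts.dt.hour
    out["day"] = ts.dt.dayofweek  # assumption: 0 = Monday ... 6 = Sunday
    out["weekday"] = (ts.dt.dayofweek < 5).astype(int)  # 1 on Mon-Fri
    # Assumption: working hours run from 08:00 through 17:59.
    out["working_hours"] = ts.dt.hour.between(8, 17).astype(int)
    return out


# Synthetic stand-in for a NAB series: one week of hourly readings.
rng = np.random.default_rng(0)
n_rows = 7 * 24
raw = pd.DataFrame({
    "timestamp": pd.date_range("2014-01-01", periods=n_rows,
                               freq=pd.Timedelta(hours=1)),
    "value": 20 + rng.normal(0, 1, n_rows),
})

features = add_time_features(raw)
print(features.head())
```

For a given dataset you would keep only the columns that make sense, for example dropping `minute` when the readings do not vary much within the hour.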
PCA and K-means Clustering
Once all the data had the additional features included, principal component analysis was applied to reduce the data down to two principal components. The elbow method was then used to look for the best number of clusters for each dataset. Below you can see the k-means scores for each:
The number of clusters is open to interpretation, but the ‘elbow method’ is most often used here: we take the smallest number of clusters beyond which the k-means score stops improving noticeably. For the three datasets I chose clusters=10, clusters=8, and clusters=5, respectively.
Below we can see the clusters for each dataset, along with the anomaly values detected (second image). These anomaly values are found by taking the centroid of each cluster and computing the distance from each point in the cluster to that centroid. The points with the largest distances, up to the outlier fraction of 0.001, are flagged as anomalies.
We can then overlay the anomaly values on top of the original data to see a clear picture of what anomalies this method is detecting.
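The PCA and elbow steps can be sketched with scikit-learn as follows. This is an illustration under assumptions rather than the code used for the plots: the random matrix `X` stands in for the engineered features, standardizing before PCA is my own choice, and the range of 1 to 14 clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the engineered feature matrix (rows = observations).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

# Assumption: features are standardized so no single one dominates the PCA.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Fit k-means for a range of cluster counts and record the score.
# KMeans.score returns the negative of the within-cluster sum of squares,
# so values closer to zero indicate tighter clusters.
cluster_range = range(1, 15)
scores = []
for k in cluster_range:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d)
    scores.append(model.score(X_2d))

for k, score in zip(cluster_range, scores):
    print(f"clusters={k:2d}  score={score:.1f}")
```

Plotting `scores` against `cluster_range` gives the elbow curve; the number of clusters is then read off where the curve flattens.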
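The centroid-distance rule described above can be sketched like this. Treat it as one possible reading, not the exact implementation: the data is synthetic, and I apply a single distance threshold across all clusters, which is an assumption (a per-cluster threshold would also fit the description).

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the two PCA components of one dataset.
rng = np.random.default_rng(0)
X_2d = rng.normal(size=(5000, 2))
outlier_fraction = 0.001

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_2d)

# Distance from every point to the centroid of the cluster it belongs to.
own_centroids = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(X_2d - own_centroids, axis=1)

# Flag the points with the largest distances, up to the outlier fraction.
n_outliers = max(1, int(round(outlier_fraction * len(X_2d))))
threshold = np.sort(distances)[-n_outliers]
is_anomaly = distances >= threshold

print(f"flagged {is_anomaly.sum()} of {len(X_2d)} points")
```

The boolean mask `is_anomaly` lines up row for row with the original series, so it can be used directly to pick out the timestamps and values to overlay.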
Isolation Forest
Isolation forest is an ensemble of randomized decision trees that was first proposed for anomaly detection in 2008. After applying feature engineering, it was very easy to apply this technique using scikit-learn in Python. Below we can see the anomalies detected using this method:
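For reference, a minimal sketch of this step with scikit-learn's `IsolationForest` might look like the following. The synthetic matrix `X` stands in for the engineered features, and passing the outlier fraction as `contamination` is my assumption about how the two map onto each other.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder for the engineered feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
outlier_fraction = 0.001

# contamination sets the expected share of anomalies in the data.
forest = IsolationForest(contamination=outlier_fraction, random_state=0)
labels = forest.fit_predict(X)  # +1 = inlier, -1 = anomaly

anomaly_idx = np.flatnonzero(labels == -1)
print(f"flagged {len(anomaly_idx)} of {len(X)} points")
```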
One Class Support Vector Machine
Support vector machines separate data with a hyperplane in a feature space that can have more dimensions than the original input. The one-class variant learns a boundary around the bulk of the data and treats points outside it as anomalies. The Python implementation can be found in the scikit-learn documentation. The anomalies detected using this method can be seen below:
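A minimal sketch of this step with scikit-learn's `OneClassSVM` is shown here. The data is synthetic, and both the standardization step and the choice to pass the outlier fraction as `nu` are assumptions of mine, not settings confirmed by this project.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Placeholder for the engineered feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
outlier_fraction = 0.001

# Assumption: features are standardized, since the RBF kernel is
# sensitive to differences in feature scale.
X_scaled = StandardScaler().fit_transform(X)

# nu loosely controls the share of training points treated as outliers;
# the number actually flagged can differ from it.
svm = OneClassSVM(kernel="rbf", gamma="scale", nu=outlier_fraction)
labels = svm.fit_predict(X_scaled)  # +1 = inlier, -1 = anomaly

anomaly_idx = np.flatnonzero(labels == -1)
print(f"flagged {len(anomaly_idx)} of {len(X)} points")
```

Because `nu` only loosely controls the flagged share, it is worth checking the count that comes back instead of assuming it matches the outlier fraction.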
Conclusion
As we can see, the SVM consistently flags more anomaly values than the other two methods. Of course, when working with these algorithms it’s advisable to experiment with the various hyperparameters available, along with the outlier fraction. The right settings depend entirely on the data involved and should be tested on a case-by-case basis. Depending on the field or data you are applying these techniques to, there might be other methods that work better and should be considered. As an example, I applied these techniques to credit card data, treating fraud as the anomaly, and found that k-means clustering identified the fraud best with clusters=7 and outlier fraction = 0.1. Eventually, if there is enough data, we can move away from the unsupervised setting and start treating labeled anomalies as a supervised problem.