When designing ML applications, particularly classification systems, we often come across a common problem: the dataset is skewed, i.e. some classes of data appear far more often than others. Take the example of the number of indicators of compromise (IoCs) compared to threat actors in threat intelligence feeds in the cyber security domain.
While building such models for classification, we need to ensure they are not biased towards the class that has more data. As an example, consider a dataset with 50 threat actors and 2,000 IoCs. If the model labels every sample as an IoC and never predicts a threat actor, its accuracy would still be almost 98% in theory, yet the model would clearly not be very useful in practice, as the sketch below shows. Such a model has a strong tendency to be biased towards the 'IoC' class.
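To make the numbers concrete, here is a toy sketch in Python (with made-up labels) of how such a majority-class-only predictor scores:

```python
# Hypothetical illustration: a 'model' that always predicts 'IoC'
# on a dataset of 2,000 IoCs and 50 threat actors.
y_true = ["IoC"] * 2000 + ["threat_actor"] * 50
y_pred = ["IoC"] * 2050  # the biased model's output

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.1%}")  # ~97.6%, despite missing every threat actor
```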
ML approaches on natural-language CTI datasets suffer acutely from this imbalance because of a significant cold start problem: the sources of threat intelligence, cyber threat intel analysts, are so few in number that there are not enough data points across the entire industry for conventional methods to produce balanced datasets. Expensive techniques such as having analysts train models themselves can overcome this, but they tend to require such significant investment that new operational problems emerge.
A dataset with skewed class proportions is called an imbalanced dataset. Classes that make up a large proportion of the data are called majority classes, while those making up a smaller proportion are called minority classes. When building a classification system on such data, we need to handle the imbalance properly; otherwise we may end up with a sub-optimal, biased classifier.
Strategies to handle an imbalanced dataset
We can categorize most techniques for handling an imbalanced dataset into three classes:
- Data driven methods
- Algorithmic methods
- Synthetic data driven methods
1. Data driven methods:
Data driven methods mostly involve resampling the original dataset. The main objective is either to increase the frequency of the minority class (over-sampling) or to decrease the frequency of the majority class (under-sampling). Either way, the dataset ends up with approximately the same number of instances for all classes, which in turn results in better models.
- Over-sampling
Oversampling means adding or duplicating samples of the minority class so that its size approaches that of the majority class. It can be a good choice when you don't have a lot of data to train your models. The advantage of oversampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept. The key disadvantage is that duplicating minority samples can make the resulting model prone to overfitting. A minimal sketch of the random variant follows the references below.
There are various versions of over-sampling; some of the popular ones in practice are:
- Random Over-sampling
- Modified synthetic minority oversampling
- Informed Over Sampling
Check out "MSMOTE" (2009) – Shengguo Hu & Liang, or "SMOTE for high-dimensional class-imbalanced data" (2013) – Rok Blagus & Lara Lusa, to learn more about these techniques.
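As a rough illustration of the random variant, here is a minimal NumPy sketch, assuming a binary problem with a feature matrix `X` and label vector `y` as arrays; in practice you would more likely reach for a library such as imbalanced-learn:

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    # Duplicate minority-class rows at random until the two classes balance.
    # Assumes a binary problem; `minority_label` is a hypothetical argument
    # naming the under-represented class.
    if rng is None:
        rng = np.random.default_rng(42)
    minority_idx = np.where(y == minority_label)[0]
    majority_count = len(y) - len(minority_idx)
    # Sample extra minority indices with replacement to match the majority count
    extra = rng.choice(minority_idx, size=majority_count - len(minority_idx), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X = np.array([[0.1], [0.2], [0.3], [0.9]])
y = np.array(["IoC", "IoC", "IoC", "actor"])
X_bal, y_bal = random_oversample(X, y, minority_label="actor")
print(list(y_bal))  # three 'IoC' and three 'actor' labels
```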
- Under-sampling
Undersampling means removing samples of the majority class until its size is similar to that of the minority class. It can be a good choice when you have a lot of data to work with. However, since we are throwing information away, we may lose valuable signal, which can lead to underfitting and poor generalization on our testing dataset.
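A mirror-image sketch of random under-sampling, under the same binary/NumPy assumptions as the over-sampling example above:

```python
import numpy as np

def random_undersample(X, y, majority_label, rng=None):
    # Randomly discard majority-class rows until the classes balance.
    # Assumes a binary problem; `majority_label` names the over-represented class.
    if rng is None:
        rng = np.random.default_rng(42)
    majority_idx = np.where(y == majority_label)[0]
    minority_idx = np.where(y != majority_label)[0]
    # Keep only as many majority rows as there are minority rows
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([kept_majority, minority_idx])
    return X[keep], y[keep]
```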
2. Algorithmic methods:
The main idea here is to combine several classifiers, often trained in stages, and aggregate their predictions so that the final ensemble is more accurate than any single model.
There are two main methodologies through which this can be accomplished:
- Bootstrap Aggregating
In this method we generate several different bootstrap training samples by sampling with replacement, train a classifier on each bootstrap separately, and finish by aggregating the resulting predictions. The advantages of this method are that it reduces overfitting, lowers the misclassification rate, and copes better with noisy examples, resulting in a stable model with more reliable accuracy and other metrics.
Bootstrap aggregating, or Bagging for short, is a very robust method and we will delve deeper into it in our next post, particularly its application to token classification/NER on imbalanced datasets. A minimal sketch of the idea follows.
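Here is a rough sketch (not our production pipeline), assuming NumPy arrays and non-negative integer class labels; scikit-learn's `BaggingClassifier` packages up the same steps:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_estimators=10, rng=None):
    # Train one tree per bootstrap resample, then majority-vote their predictions.
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y_train)
    all_preds = []
    for _ in range(n_estimators):
        # Bootstrap: draw n training rows with replacement
        idx = rng.choice(n, size=n, replace=True)
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        all_preds.append(tree.predict(X_test))
    # Aggregate: majority vote across the ensemble for each test row
    # (np.bincount assumes labels are non-negative integers)
    votes = np.array(all_preds)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```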
- Boosting
This is another classical ensemble learning technique, in which many weak base learners are combined to produce a strong learner/classifier with higher accuracy. There are several popular implementations, such as AdaBoost, XGBoost and CatBoost. To learn more about this method, check out "The boosting approach" (2003) – Schapire.
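As a minimal, self-contained illustration, here is AdaBoost via scikit-learn on a synthetic imbalanced dataset (the data here is generated purely for demonstration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary data, roughly 95% majority class
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# AdaBoost fits a sequence of weak learners (decision stumps by default),
# re-weighting misclassified samples at each stage
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

Note that the per-class precision and recall in the report are far more informative than raw accuracy on imbalanced data.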
3. Synthetic Data driven methods:
A technique similar to resampling is to create synthetic samples. In this approach, we generate new samples that follow the training distribution we already have, and use them to train the model. There are multiple ways this can be done; the most popular method is SMOTE (Synthetic Minority Oversampling Technique), which uses a nearest neighbors algorithm to generate new, synthetic data we can use for training our model.
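A minimal sketch using the `SMOTE` implementation from the third-party imbalanced-learn package, on a made-up toy dataset:

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

# Toy features: 95 majority-class (0) and 5 minority-class (1) samples
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# SMOTE interpolates between each minority sample and its nearest
# minority-class neighbours to synthesise new points
X_res, y_res = SMOTE(k_neighbors=4, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # classes are balanced after resampling
```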
At Elemendar we have used such techniques to create tagged datasets for training semi-supervised LSTM RNN models, which recognise entities and keyphrases in source text as identified through word vectors. This helps us overcome the paucity of machine-readable data points on human-derived CTI, which otherwise makes conventional NLP inference too inaccurate for entity extraction.