In statistics, synthetic minority oversampling technique (SMOTE) is a method for oversampling the minority class when the classes in a dataset are imbalanced. The problem with statistical inference and modeling on imbalanced datasets is that the results will be biased towards the majority class. Another way to address imbalanced data is to undersample the majority class until it is represented equally with the minority class. In contrast to undersampling, SMOTE oversamples the minority class by generating synthetic samples.[1][2]
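For contrast with SMOTE, the undersampling alternative mentioned above can be sketched in a few lines. This is only an illustrative sketch: the function name `random_undersample`, the `seed` parameter, and the toy data are assumptions made for the example and do not come from the cited sources.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class is as small as the smallest class."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is cut down to the size of the smallest class.
    target = min(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        out_samples.extend(rng.sample(xs, target))
        out_labels.extend([y] * target)
    return out_samples, out_labels


# Toy data: eight majority samples (label 0) and two minority samples (label 1).
samples = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_samples, balanced_labels = random_undersample(samples, labels)
print(Counter(balanced_labels))
```

After undersampling, both classes contain two samples, so six of the eight majority samples have been discarded. SMOTE avoids this loss of data by adding minority samples instead.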
Limitations
SMOTE does come with some limitations and challenges:[3]
Variations
SMOTE-N: handles nominal features; the nearest neighbors are computed using a modified version of the Value Difference Metric (VDM), which looks at the overlap of feature values over all feature vectors
ADASYN: uses a weighted distribution for different minority class examples according to their level of difficulty in learning[5][6]
Borderline-SMOTE: oversamples only the minority examples near the borderline[5][7]
SMOTE-Tomek: applies Tomek links to the oversampled training set as a data cleaning step to remove samples overlapping the category boundaries[8]
SMOTE-ENN: uses the Edited Nearest Neighbor Rule, which removes any example whose class label differs from the class of at least two of its three nearest neighbors[8]
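The Edited Nearest Neighbor rule used by SMOTE-ENN is simple enough to sketch directly. The example below covers only the cleaning rule as worded above (three neighbors, removal when at least two disagree), not the preceding oversampling step. The function name `enn_keep_indices`, the use of Euclidean distance, and the toy data are assumptions made for illustration.

```python
import math


def enn_keep_indices(samples, labels):
    """Return the indices of the samples that survive the ENN rule."""
    kept = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Rank every other sample by Euclidean distance to x (assumed metric).
        others = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(x, samples[j]),
        )
        neighbours = others[:3]

        # Count the neighbours whose class differs from the sample's own class.
        disagreeing = sum(1 for j in neighbours if labels[j] != y)

        # Remove the sample when at least two of its three neighbours disagree.
        if disagreeing < 2:
            kept.append(i)
    return kept


# Toy data: a class-0 cluster, a class-1 cluster, and one class-1 sample
# (index 7) that lies inside the class-0 cluster.
samples = [[0.0], [1.0], [2.0], [3.0], [50.0], [51.0], [52.0], [1.5]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

kept = enn_keep_indices(samples, labels)
print(kept)  # [0, 1, 2, 3, 4, 5, 6]
```

Only the sample at index 7 is removed, because all three of its nearest neighbors belong to the other class.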
Algorithm
The SMOTE algorithm can be abstracted with the following pseudocode:[2]
if N < 100; then
    Randomize the T minority class samples
    T = (N/100) ∗ T
    N = 100
endif
N = (int)(N/100)
k = Number of nearest neighbors
numattrs = Number of attributes
Sample[ ][ ]: array for original minority class samples
newindex: keeps a count of number of synthetic samples generated, initialized to 0
Synthetic[ ][ ]: array for synthetic samples
for i <- 1 to T
    Compute k nearest neighbors for i, and save the indices in the nnarray
    Populate(N, i, nnarray)
endfor

Populate(N, i, nnarray):
while N != 0
    Choose a random number between 1 and k, call it nn
    for attr <- 1 to numattrs
        Compute: dif = Sample[nnarray[nn]][attr] − Sample[i][attr]
        Compute: gap = random number between 0 and 1
        Synthetic[newindex][attr] = Sample[i][attr] + gap ∗ dif
    endfor
    newindex++
    N = N − 1
endwhile
return
where
N is the amount of SMOTE, expressed as a percentage; amounts of 100% or more are assumed to be integer multiples of one hundred
T is the number of minority class samples
k is the number of nearest neighbors
Populate() is the generating function for new synthetic minority samples
If N is less than 100%, the minority class samples will be randomized, as only a random subset of them will have SMOTE applied to them.
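The pseudocode above can be turned into a short runnable sketch. The version below is an illustration using only the Python standard library, not a reference implementation. The function name `smote`, the `seed` parameter, and the use of Euclidean distance for the neighbor search are assumptions; the neighbor search runs over all minority samples, and at least two minority samples are required.

```python
import math
import random


def smote(minority, n_percent, k, seed=0):
    """Generate synthetic minority samples.

    minority  -- list of minority class samples, each a list of floats
    n_percent -- amount of SMOTE in percent (N in the pseudocode)
    k         -- number of nearest neighbours
    """
    rng = random.Random(seed)
    samples = list(minority)

    if n_percent < 100:
        # Only a random subset of the minority samples gets SMOTE applied.
        rng.shuffle(samples)
        t = int(n_percent / 100 * len(samples))
        n_percent = 100
    else:
        t = len(samples)
    per_sample = int(n_percent / 100)  # synthetic samples per original sample

    synthetic = []
    for i in range(t):
        # Indices of the k nearest minority samples, excluding sample i itself.
        nnarray = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(samples[i], samples[j]),
        )[:k]

        for _ in range(per_sample):
            nn = rng.choice(nnarray)  # pick one of the k neighbours at random
            new_point = []
            for a, b in zip(samples[i], samples[nn]):
                gap = rng.random()  # drawn per attribute, as in the pseudocode
                new_point.append(a + gap * (b - a))
            synthetic.append(new_point)
    return synthetic


minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]]
new_points = smote(minority, n_percent=200, k=2)
print(len(new_points))  # 8: two synthetic samples for each of the four originals
```

Because each attribute is interpolated between a sample and one of its neighbors, every synthetic value lies between two existing minority values for that attribute.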
Implementations
Since the introduction of the SMOTE method, there have been a number of software implementations:
Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P. (2011-06-09), "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research, 16: 321–357, arXiv:1106.1813, doi:10.1613/jair.953