This section describes the algorithms studied. As we focus on ensemble-based semi-supervised learning from imbalanced class distributions, specifically ensembles of self-training and co-training classifiers, we will first provide background on self-training and co-training, and also on ensemble learning. Then, we will describe the supervised ensemble approach used as a baseline in our evaluation, and finally, our proposed self-training and co-training ensemble variants.
Self-training
Self-training, also known as self-teaching or bootstrapping, is an iterative meta-algorithm that can be wrapped around any base classifier. Yarowsky [37] originally introduced self-training and applied it to a natural language processing problem, namely word-sense disambiguation. The first step in self-training is to build a classifier using the labeled data. Then, the labeled dataset is augmented with the most confidently predicted instances from the unlabeled pool, and the model is rebuilt. The process is repeated until a criterion is met, e.g., until the unlabeled dataset has been fully classified or a fixed number of iterations has been reached. In our work, we classify a sub-sample of unlabeled data at each iteration (as opposed to all unlabeled data) in order to increase computation speed. The most confidently classified instances are assigned the predicted class and used to retrain the model. The remaining instances, classified with less confidence, are discarded. The algorithm iterates until the unlabeled dataset has been exhaustively sampled.
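To make the procedure concrete, the following sketch outlines the self-training loop using scikit-learn's GaussianNB as a stand-in base classifier; the parameters sample_size and top_k (and their default values) are illustrative assumptions rather than the exact settings used in our experiments.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def self_train(X_lab, y_lab, X_unlab, sample_size=100, top_k=10, seed=0):
    # Self-training with per-iteration sub-sampling of the unlabeled pool:
    # classify a random sub-sample, keep only the top_k most confident
    # predictions, discard the rest, retrain, and repeat until the pool is empty.
    rng = np.random.default_rng(seed)
    X_lab, y_lab = np.asarray(X_lab), np.asarray(y_lab)
    pool = np.asarray(X_unlab)
    model = GaussianNB().fit(X_lab, y_lab)
    while len(pool) > 0:
        idx = rng.choice(len(pool), size=min(sample_size, len(pool)), replace=False)
        sample = pool[idx]
        pool = np.delete(pool, idx, axis=0)            # the sub-sample leaves the pool
        proba = model.predict_proba(sample)
        keep = np.argsort(proba.max(axis=1))[-top_k:]  # most confident predictions
        X_lab = np.vstack([X_lab, sample[keep]])
        y_lab = np.concatenate([y_lab, model.classes_[proba[keep].argmax(axis=1)]])
        model = GaussianNB().fit(X_lab, y_lab)         # retrain on the augmented set
        # the remaining, less confident instances in the sub-sample are discarded
    return model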
Co-training
Blum and Mitchell [38] introduced co-training, also an iterative meta-algorithm, to solve the problem of identifying course pages among other academic web pages. Similar to self-training, co-training is applicable to any base classifier. Unlike self-training, which is a single-view algorithm, co-training requires two independent and sufficient views (i.e., feature representations) of the same data in order to learn two classifiers. At each iteration, both classifiers label the unlabeled instances, and the labeled training data of one classifier is augmented with the most confidently labeled instances predicted by the other classifier. As in self-training, in our work we classify only a sub-sample of unlabeled data at each iteration. Instances from the sub-sample classified with low confidence are discarded. The algorithm iterates until the unlabeled dataset has been exhaustively sampled.
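A corresponding sketch of the co-training loop is given below; the two views are passed as separate feature matrices over the same instances, and GaussianNB together with the sample_size and top_k parameters are again illustrative stand-ins rather than our exact configuration.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             sample_size=100, top_k=10, seed=0):
    # Co-training over two feature views of the same instances: at each
    # iteration a sub-sample of the unlabeled pool is drawn, and each view's
    # classifier passes its top_k most confident predictions to the *other*
    # view's labeled set; the rest of the sub-sample is discarded.
    rng = np.random.default_rng(seed)
    X = [np.asarray(X1_lab), np.asarray(X2_lab)]       # per-view labeled features
    y = [np.asarray(y_lab).copy(), np.asarray(y_lab).copy()]
    pool = [np.asarray(X1_unlab), np.asarray(X2_unlab)]
    models = [GaussianNB(), GaussianNB()]
    while len(pool[0]) > 0:
        for v in range(2):
            models[v].fit(X[v], y[v])
        idx = rng.choice(len(pool[0]), size=min(sample_size, len(pool[0])), replace=False)
        sample = [pool[v][idx] for v in range(2)]
        pool = [np.delete(pool[v], idx, axis=0) for v in range(2)]
        for v, other in ((0, 1), (1, 0)):
            proba = models[v].predict_proba(sample[v])
            keep = np.argsort(proba.max(axis=1))[-top_k:]
            # classifier v "teaches" the other view its most confident labels
            X[other] = np.vstack([X[other], sample[other][keep]])
            y[other] = np.concatenate([y[other],
                                       models[v].classes_[proba[keep].argmax(axis=1)]])
    return models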
Ensembles
Ensemble learning exploits the idea that combinations of weak learners can lead to better performance. Moreover, it is known that diversity among subclassifiers is an important requirement for the success of ensemble learning [38, 39]. However, learning Naïve Bayes classifiers from bootstrap replicates will not always lead to sufficiently "diverse" models, especially for problems with highly imbalanced distributions. To ensure sufficient diversity among the training data subsets of our highly imbalanced datasets, we used a technique initially recommended by Liu et al. [39], who proposed training each subclassifier of the ensemble on a balanced subset of the data, giving the subclassifiers the opportunity to learn each class equally, while the ensemble as a whole continues to reflect the original class distribution. An implementation of this technique by Li et al. [11] proved successful for the problem of sentiment classification and served as inspiration for our work.
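A minimal sketch of this balancing step is shown below, assuming a binary problem in which the positive class is the minority; the function name and parameters are illustrative.

import numpy as np

def balanced_subsets(X, y, pos_label=1, seed=0):
    # Pair all minority (positive) instances with disjoint, equally sized
    # slices of the majority (negative) instances, so each subclassifier sees
    # a balanced class distribution while the ensemble as a whole still
    # reflects the original data.
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    pos = np.where(y == pos_label)[0]
    neg = rng.permutation(np.where(y != pos_label)[0])
    n_subsets = max(1, len(neg) // len(pos))           # roughly the imbalance degree
    subsets = []
    for chunk in np.array_split(neg, n_subsets):
        idx = np.concatenate([pos, chunk])
        subsets.append((X[idx], y[idx]))
    return subsets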
Supervised Lower Bound
Generally, supervised models trained only on the available labeled data are used as baselines for semi-supervised algorithms. Thus, the hypothesis that unlabeled data helps is verified against supervised models that entirely ignore unlabeled instances. Because our focus is on ensemble methods and ensembles of classifiers typically outperform single classifiers, the lower bound for our approaches is an ensemble of supervised classifiers. Specifically, we train ensembles of Naïve Bayes classifiers using resampled balanced subsets and use their averaged predictions to classify the test instances. This approach is referred to as the Lower Bound Ensemble (LBE).
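A minimal sketch of LBE is given below; it assumes balanced subsets built as in the previous sketch and uses scikit-learn's GaussianNB as an illustrative stand-in for our Naïve Bayes subclassifiers.

import numpy as np
from sklearn.naive_bayes import GaussianNB

class LowerBoundEnsemble:
    # Supervised baseline: one Naive Bayes subclassifier per balanced subset,
    # with test predictions obtained by averaging the class probabilities of
    # all subclassifiers.
    def fit(self, subsets):                        # e.g., the output of balanced_subsets
        self.models_ = [GaussianNB().fit(Xs, ys) for Xs, ys in subsets]
        return self

    def predict_proba(self, X):
        return np.mean([m.predict_proba(X) for m in self.models_], axis=0)

    def predict(self, X):
        return self.models_[0].classes_[self.predict_proba(X).argmax(axis=1)]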
Ensembles inspired by the original approach: CTEO and STEO
In [11], co-training classifiers were augmented with the topmost confidently labeled positive and negative instances, as found by classifiers trained on balanced labeled subsets. The authors set the number of iterations at 50, and classified all unlabeled instances at each iteration. Moreover, the two views of the co-training classifiers were created at each iteration, using "dynamic subspace generation" (random feature splitting into two views), in order to ensure diverse subclassifiers.
However, this exact approach did not produce satisfactory results in our case, so we modified the algorithm from [11] in order to better accommodate our problem. We named the resulting approach Co-Training Ensemble inspired by the Original approach (CTEO). We also experimented with a variant where co-training was replaced with self-training, and named this variant Self-Training Ensemble inspired by the Original approach (STEO). The pseudocode for both CTEO and STEO variants is illustrated in Algorithm 1. As can be seen, Steps 7-9 are described for co-training (first line) and self-training (second line, in italic font), separately.
The first modification we made to the original ensemble-based approach, for both self-training and co-training variants, is that we kept the features fixed, i.e., used "static" instead of "dynamic subspace generation." For co-training, we used a nucleotide/position representation as one view, and a 3-nucleotide/position representation as the second view, under the assumption that each view is sufficient to make accurate predictions, and the views are (possibly) independent given the class.
The second modification is that we did not classify all unlabeled instances at each iteration; instead, we classified only a fixed subsample of the unlabeled data, as proposed in the classical co-training algorithm [38]. This alteration speeds up computation. The last modification is that once a subsample was labeled and the most confidently labeled instances were selected to augment the originally labeled dataset, we simply discarded the rest of the subsample, thereby differing from both the classical co-training approach [38] and the original co-training ensemble approach [11]. This change also leads to faster computation times and, based on our experimentation, reduces the risk of adding mistakenly labeled instances to the labeled set in subsequent iterations. Furthermore, the last two adjustments lead to a fixed number of semi-supervised iterations, as the algorithm ends when the unlabeled data pool is exhausted. We use a subsample size that depends on the dataset size, selected such that the algorithm iterates approximately the same number of times (50) for each set of experiments at a given imbalance degree. After the iterations terminate, the ensemble is used to classify the test set by averaging the predictions of the constituent subclassifiers.
An important observation regarding Step 9 in Algorithm 1 is that, in the case of co-training, when the two classifiers based on view1 and view2, respectively, make their predictions, an instance is added to the pseudolabeled set P only if (1) no conflict exists between the classifiers, i.e., both classifiers agree on the label, and (2) one classifier predicts the label with high confidence, while the other predicts the same label with low confidence. These conditions ensure that the two views inform each other of their best predictions, thereby enhancing each other's learning.
Algorithm 1 Ensembles inspired by the original approach [11] - CTEO/STEO
1: Given: a training set comprised of labeled and unlabeled data D = (Dl, Du), |Dl| ≪ |Du|
2: Create U by picking S random instances from Du and update Du = Du - U, S = sample size
3: Generate N balanced subsets from Dl: Dl1, . . ., DlN
4: repeat
5: Initialize P = ∅
6: for i = 1 to N do
7: CT: Train subclassifiers Ci1 on view1 and Ci2 on view2 of balanced subset Dli
   ST: Train subclassifier Ci on combined views of balanced subset Dli
8: CT: Classify instances in U using the classifiers Ci1 and Ci2
   ST: Classify instances in U using subclassifier Ci
9: CT: Use Ci1 and Ci2 to select 2 positive and 2 negative instances and add them to P
   ST: Use Ci to select 2 positive and 2 negative instances, and add them to P
10: end for
11: Augment each balanced subset with the instances from P
12: Discard remaining unused instances from U
13: Create a new unlabeled sample U and update Du = Du - U
14: until U is empty (i.e., the unlabeled data is exhausted)
As mentioned above, STEO differs from the co-training based ensemble, CTEO, at Steps 7-9 in Algorithm 1: instead of using two subclassifiers trained on two different views, only one classifier is built using all features (view1 and view2 combined), and this classifier is then used to select the best two positive predictions and the best two negative predictions. Because each subclassifier in CTEO contributes one positive and one negative instance, after one iteration, the set P of pseudo-labeled instances contains 2N positive instances and 2N negative instances. Therefore, in STEO, we add the top two positives and top two negatives as predicted by the same subclassifier Ci in order to maintain an augmentation rate identical to that of CTEO. After the semi-supervised iterations terminate, the ensemble is used to predict the labels of the test set. The predictions of every subclassifier in the ensemble on a test instance are combined via averaging, and the resulting probabilities represent the final class distribution of the instance.
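To illustrate the Step 9 selection rule for the co-training variant, the sketch below shows one possible implementation of the agreement and asymmetric-confidence conditions; the hi/lo thresholds, the per_class limit, and the pos_col argument (the column of the positive class in the probability matrices) are illustrative assumptions, not the exact criteria used in our implementation.

import numpy as np

def cteo_select(proba1, proba2, pos_col, hi=0.9, lo=0.6, per_class=2):
    # Candidate instances are eligible only if (1) both view classifiers agree
    # on the label and (2) one is highly confident while the other is not.
    # The hi/lo thresholds are illustrative assumptions.
    pred1, pred2 = proba1.argmax(axis=1), proba2.argmax(axis=1)
    conf1, conf2 = proba1.max(axis=1), proba2.max(axis=1)
    agree = pred1 == pred2
    asym = ((conf1 >= hi) & (conf2 < lo)) | ((conf2 >= hi) & (conf1 < lo))
    eligible = np.where(agree & asym)[0]
    order = eligible[np.argsort(np.maximum(conf1, conf2)[eligible])[::-1]]  # most confident first
    positives = [i for i in order if pred1[i] == pos_col][:per_class]
    negatives = [i for i in order if pred1[i] != pos_col][:per_class]
    return positives, negatives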
Ensembles using dynamic balancing with positive: STEP and CTEP
The following two approaches use the dynamic balancing technique proposed in [15], which was found to be successful for the classical self-training algorithm when the dataset exhibits an imbalanced distribution. The dynamic balancing occurs during the semi-supervised iterations of the algorithm and uses only the instances that the classifier (or the subclassifiers in the ensemble) predicted as positive to augment the originally labeled set. In the ensemble context, the subclassifiers are used to select the most confidently predicted positive instances. These variants are named Co-Training Ensemble with Positive (CTEP) and Self-Training Ensemble with Positive (STEP) and are illustrated in Algorithm 2. As before, the co-training and self-training variants differ at Steps 7-9. For CTEP, during Step 9, the instance classified as positive with topmost confidence in one view and with low confidence in the other view is added to P, and vice versa. For STEP, the two most confidently labeled positive instances are added to P, such that the augmentation rate is identical to that of CTEP.
Algorithm 2 Ensembles using dynamic balancing with positive - STEP/CTEP
1: Given: a training set comprised of labeled and unlabeled data D = (Dl, Du), |Dl| ≪ |Du|
2: Create U by picking S random instances from Du and update Du = Du - U, S = sample size
3: Generate N balanced subsets from Dl: Dl1, . . ., DlN
4: repeat
5: Initialize P = ∅
6: for i = 1 to N do
7: CT: Train subclassifiers Ci1 on view1 and Ci2 on view2 of balanced subset Dli
   ST: Train subclassifier Ci on combined views of balanced subset Dli
8: CT: Classify instances in U using subclassifiers Ci1 and Ci2
   ST: Classify instances in U using subclassifier Ci
9: CT: Use Ci1 and Ci2 to select 2 positive instances and add them to P
   ST: Use Ci to select 2 positive instances and add them to P
10: end for
11: Augment each balanced subset with the instances from P
12: Discard remaining unused instances from U
13: Create a new unlabeled sample U and update Du = Du - U
14: until U is empty (i.e., the unlabeled data is exhausted)
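For the self-training variant, the positive-only selection in Step 9 can be sketched as follows; the function and parameter names are illustrative. For the co-training variant, the same agreement and asymmetric-confidence conditions illustrated earlier apply, with only the positive selections retained.

import numpy as np

def step_select_positives(proba, classes, pos_label=1, per_model=2):
    # Keep only instances predicted as positive, ranked by the positive-class
    # probability; per_model=2 mirrors the augmentation rate in Algorithm 2.
    pos_col = int(np.where(classes == pos_label)[0][0])
    predicted_pos = np.where(proba.argmax(axis=1) == pos_col)[0]
    ranked = predicted_pos[np.argsort(proba[predicted_pos, pos_col])[::-1]]
    return ranked[:per_model]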
Ensembles that distribute the newly labeled instances: CTEOD and STEOD
Our next semi-supervised ensemble variants are based on CTEO and STEO, respectively, and distribute the most confidently labeled instances among the classifiers in the ensemble. They are referred to as Co-Training Ensemble Original Distributed (CTEOD) and Self-Training Ensemble Original Distributed (STEOD) and are shown in Algorithm 3. In CTEOD and STEOD, as opposed to CTEO and STEO, instances are distributed such that each balanced subset receives two unique instances, one positive and one negative, from each view, instead of every balanced subset receiving all instances from P. The motivation for this change is that different instance distributions help maintain a certain level of diversity among the constituent classifiers of the ensemble. In Algorithm 3, the co-training and self-training variants differ at Steps 6-8. The main difference compared to CTEO and STEO is at Step 9, where the labeled data of classifier Ci1 (trained on view1) is augmented with the top positive and top negative instances predicted by classifier Ci2 (trained on view2), and vice versa. Therefore, each balanced subset is augmented with two positive and two negative instances, and the ensemble better preserves its initial diversity.
Algorithm 3 Ensembles that distribute newly labeled instances - CTEOD/STEOD
1: Given: a training set comprised of labeled and unlabeled data D = (Dl, Du), |Dl| ≪ |Du|
2: Create U by picking S random instances from Du and update Du = Du - U, S = sample size
3: Generate N balanced subsets from Dl: Dl1, . . ., DlN
4: repeat
5: for i = 1 to N do
6: CT: Train subclassifiers Ci1 on view1 and Ci2 on view2 of balanced subset Dli
   ST: Train subclassifier Ci on combined views of balanced subset Dli
7: CT: Classify instances in U using subclassifiers Ci1 and Ci2
   ST: Classify instances in U using subclassifier Ci
8: CT: Use Ci1 and Ci2 to select 2 positive instances and 2 negative instances
   ST: Use Ci to select 2 positive instances and 2 negative instances
9: Augment current balanced subset, Dli, with selected positive and negative instances
10: end for
11: Discard remaining unused instances from U
12: Create a new unlabeled sample U and update Du = Du - U
13: until U is empty (i.e., the unlabeled data is exhausted)
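The distribution step that distinguishes CTEOD/STEOD from CTEO/STEO can be sketched as follows, assuming that selections[i] collects the instances chosen for the i-th balanced subset during the current iteration; the function is illustrative rather than our exact implementation.

import numpy as np

def distribute(subsets, selections):
    # Instead of adding the entire pseudo-labeled pool P to every balanced
    # subset, subset i receives only the instances selected while its own
    # subclassifiers were active, which keeps the subsets (and thus the
    # subclassifiers) distinct. selections[i] is an (X_new, y_new) pair.
    augmented = []
    for (Xs, ys), (X_new, y_new) in zip(subsets, selections):
        augmented.append((np.vstack([Xs, X_new]), np.concatenate([ys, y_new])))
    return augmented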
Ensembles that distribute only positive instances: CTEPD and STEPD
Our last semi-supervised ensemble variants are based on CTEP and STEP. We again use the dynamic balancing technique from [15], which adds only positive instances during the semi-supervised iterations. In addition, the newly labeled instances are distributed among the balanced labeled subsets so that the subclassifiers are trained on sufficiently different instance subsets, thereby maintaining the diversity of the constituent classifiers of the ensemble. The resulting variants are named Co-Training Ensemble with Positive Distributed (CTEPD) and Self-Training Ensemble with Positive Distributed (STEPD), and are shown in Algorithm 4. The co-training and self-training variants differ at Steps 6-8. Overall, at each iteration, 2N unique positive instances augment the ensemble, where N is the imbalance degree, since two instances originate from each pair of co-training subclassifiers. More specifically, each of the N balanced subsets receives two positive instances, different from those received by the other subsets.