java-wowa-training
The WOWA operator (Torra) is a powerful aggregation operator that combines multiple input values into a single score. This is particularly interesting for detection and ranking systems that rely on multiple heuristics: the system can use WOWA to produce a single meaningful score.
A Java implementation of WOWA is available at https://github.com/tdebatty/java-aggregation.
The WOWA operator requires two sets of parameters: the p weights and the w weights. In this project, a genetic algorithm computes the best values for the p and w weights. For the training, the algorithm uses a dataset of input vectors together with the expected aggregated score of each vector.
This project is a Java implementation of the PHP wowa-training project.
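For readers unfamiliar with the operator, here is a minimal, self-contained sketch of WOWA following Torra's definition with a piecewise-linear quantifier. It is illustrative only, and is not the implementation from java-aggregation:

```java
import java.util.Arrays;
import java.util.Comparator;

public class WowaSketch {

    // Piecewise-linear quantifier interpolating the points (i/n, w1+...+wi).
    static double quantifier(double x, double[] w) {
        int n = w.length;
        if (x <= 0) return 0;
        if (x >= 1) return 1;
        int i = (int) Math.ceil(x * n);            // segment index, 1..n
        double cumBefore = 0;
        for (int j = 0; j < i - 1; j++) cumBefore += w[j];
        return cumBefore + (x - (double) (i - 1) / n) * n * w[i - 1];
    }

    // WOWA aggregation of values a, with importance weights p and OWA weights w.
    public static double wowa(double[] p, double[] w, double[] a) {
        int n = a.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort indices by decreasing value.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> a[i]).reversed());
        double result = 0;
        double cumP = 0;
        for (int i = 0; i < n; i++) {
            double prev = quantifier(cumP, w);
            cumP += p[order[i]];
            double omega = quantifier(cumP, w) - prev;
            result += omega * a[order[i]];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] p = {0.4, 0.3, 0.2, 0.1};
        double[] w = {0.125, 0.375, 0.375, 0.125};
        double[] a = {0.1, 0.8, 0.3, 0.4};
        System.out.println(wowa(p, w, a));
    }
}
```

With uniform w weights the sketch reduces to a weighted mean driven by p, and with uniform p weights it reduces to a plain OWA driven by w, which is the defining property of the operator.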
Installation
Using Maven:
<dependency>
<groupId>be.cylab</groupId>
<artifactId>java-wowa-training</artifactId>
<version>0.0.4</version>
</dependency>
https://mvnrepository.com/artifact/be.cylab/java-wowa-training
Usage
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
// plus Trainer, TrainerParameters, SolutionDistance and AbstractSolution from this library

public static void main(String[] args) {
    Logger logger = Logger.getLogger(Trainer.class.getName());
    logger.setLevel(Level.INFO);
    int population_size = 100;
    int crossover_rate = 60;
    int mutation_rate = 10;
    int max_generation = 110;
    int selection_method = TrainerParameters.SELECTION_METHOD_RWS;
    int generation_population_method = TrainerParameters.POPULATION_INITIALIZATION_RANDOM;
    TrainerParameters parameters = new TrainerParameters(logger, population_size,
            crossover_rate, mutation_rate, max_generation, selection_method,
            generation_population_method);

    // Input data
    List<List<Double>> data = new ArrayList<>();
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.5, 0.8)));
    data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));

    // Expected aggregated value for each data vector
    List<Double> expected = new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4, 0.5, 0.6));

    // Create the object defining the type of solution (fitness score evaluation)
    SolutionDistance solution_type = new SolutionDistance(data.get(0).size());

    // Create the trainer object and run the training
    Trainer trainer = new Trainer(parameters, solution_type);
    AbstractSolution solution = trainer.run(data, expected);

    // Display the solution
    System.out.println(solution);
}
The example above will produce something like:
SolutionDistance{
weights_w=[0.1403303611048977, 0.416828569516884, 0.12511121306189063, 0.1872211165629538, 0.1305087298401635],
weights_p=[0.0123494228072248, 0.10583088288437666, 0.5459452827654444, 0.17470250892324257, 0.1611718492107217],
distance=8.114097675242476}
The run method returns a solution object consisting of the p weights and w weights to use with the WOWA operator, plus the total distance between the expected aggregated values given as parameter and the aggregated values computed by WOWA using these weights.
The run method can be used with ArrayList arguments, as in the example above, or with two JSON file names: one file contains the data and the other contains the expected results.
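The total distance mentioned above can be sketched as a sum of absolute differences between the aggregated values computed by WOWA and the expected values. This is an illustrative sketch, not the library's code:

```java
import java.util.Arrays;
import java.util.List;

public class DistanceFitness {

    // Sum of absolute differences between computed and expected aggregated values.
    public static double distance(List<Double> computed, List<Double> expected) {
        double total = 0;
        for (int i = 0; i < computed.size(); i++) {
            total += Math.abs(computed.get(i) - expected.get(i));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Double> computed = Arrays.asList(0.12, 0.25, 0.28);
        List<Double> expected = Arrays.asList(0.1, 0.2, 0.3);
        System.out.println(distance(computed, expected)); // approximately 0.09
    }
}
```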
Parameters description
- population_size: size of the population used by the genetic algorithm. Suggested value: 100
- crossover_rate: percentage of the population generated by crossover. Must be between 1 and 100. Suggested value: 60
- mutation_rate: probability that a random element of the population is changed. Must be between 1 and 100. Suggested value: 15
- selection_method: method used to select elements in the population (to generate the next generation): SELECTION_METHOD_RWS for Roulette Wheel Selection, SELECTION_METHOD_TOS for Tournament Selection.
- max_generation: maximum number of iterations of the algorithm.
- generation_population_method: method used to generate the initial population: POPULATION_INITIALIZATION_RANDOM for a fully random initialization, POPULATION_INITIALIZATION_QUASI_RANDOM for a population seeded with specific elements.
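To illustrate the selection_method parameter, Roulette Wheel Selection can be sketched as below. This is an illustrative sketch, not the library's code; it assumes a non-negative, higher-is-better fitness, so a distance-based fitness (lower is better) would first need to be transformed so that better chromosomes get larger slices of the wheel:

```java
import java.util.Arrays;
import java.util.Random;

public class RouletteWheel {

    // Select an index with probability proportional to its fitness.
    public static int select(double[] fitness, Random rng) {
        double total = 0;
        for (double f : fitness) total += f;
        double r = rng.nextDouble() * total;   // spin the wheel
        double cumulative = 0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (r < cumulative) return i;
        }
        return fitness.length - 1;             // guard against rounding
    }

    public static void main(String[] args) {
        double[] fitness = {1.0, 3.0, 6.0};
        Random rng = new Random(42);
        int[] counts = new int[3];
        for (int i = 0; i < 10000; i++) counts[select(fitness, rng)]++;
        // Counts are roughly proportional to 1 : 3 : 6.
        System.out.println(Arrays.toString(counts));
    }
}
```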
Solution type
The algorithm is built to be used with different methods to evaluate the fitness score of each chromosome. Two criteria are already implemented: distance and AUC.
- Distance: for each element in the population, the WOWA function is computed on all examples of the dataset. Then, the difference between each computed WOWA result and the expected result from the training dataset is taken. All these differences are summed to obtain the distance, which is the fitness score of the chromosome. The smaller the distance, the better the chromosome.
- AUC: the Area Under the Curve (AUC) fitness score is designed for binary classification. To obtain the AUC, the Receiver Operating Characteristic (ROC) curve is built first. Concretely, the WOWA function is computed on all elements of the training dataset, then the ROC curve is built from these results. The AUC of this ROC curve is the fitness score of the element. The larger the AUC, the better the chromosome.
It is possible to create a new Solution type with a new evaluation criterion. The new Solution type must extend the AbstractSolution class and override the computeScoreTo method. It is also necessary to modify the createSolutionObject method in the Factory class.
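The AUC criterion described above can be illustrated with the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve. This is a sketch, not the library's implementation:

```java
public class AucSketch {

    // AUC via the Mann-Whitney statistic: the fraction of (positive, negative)
    // pairs where the positive example receives the higher score.
    public static double auc(double[] scores, int[] labels) {
        double pairs = 0, wins = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) continue;
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0) continue;
                pairs++;
                if (scores[i] > scores[j]) wins++;
                else if (scores[i] == scores[j]) wins += 0.5; // ties count half
            }
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.3, 0.2};
        int[] labels = {1, 1, 0, 0};
        // Positives are all ranked above negatives, so the AUC is 1.0.
        System.out.println(auc(scores, labels));
    }
}
```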
Cross-validation
Example
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
// plus Trainer, TrainerParameters, SolutionDistance and AbstractSolution from this library

public static void main(String[] args) {
    Logger logger = Logger.getLogger(Trainer.class.getName());
    logger.setLevel(Level.INFO);
    int population_size = 100;
    int crossover_rate = 60;
    int mutation_rate = 10;
    int max_generation = 110;
    int selection_method = TrainerParameters.SELECTION_METHOD_RWS;
    int generation_population_method = TrainerParameters.POPULATION_INITIALIZATION_RANDOM;
    TrainerParameters parameters = new TrainerParameters(logger, population_size,
            crossover_rate, mutation_rate, max_generation, selection_method,
            generation_population_method);

    // Input data
    List<List<Double>> data = new ArrayList<>();
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.5, 0.8)));
    data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));

    // Expected aggregated value for each data vector
    List<Double> expected = new ArrayList<>(Arrays.asList(
            1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0));

    // Create the object defining the type of solution (fitness score evaluation)
    SolutionDistance solution_type = new SolutionDistance(data.get(0).size());

    // Create the trainer object and run the cross-validation
    Trainer trainer = new Trainer(parameters, solution_type);
    HashMap<AbstractSolution, Double> solution = trainer.runKFold(data, expected, 2, 2);

    // Display the solutions
    for (Map.Entry<AbstractSolution, Double> val : solution.entrySet()) {
        System.out.println(val);
    }
}
The runKFold method performs a k-fold cross-validation. Concretely, it splits the dataset into k folds. For each fold, that fold is retained as validation data for testing the model, and the remaining k − 1 folds are used as training data. The cross-validation process is repeated k times, so each of the k folds is used exactly once as validation data. The k results can then be averaged to produce a single estimation. For each tested fold, the Area Under the Curve is also computed to evaluate the classification performance (this only works if the expected vector contains only 0 and 1 values).
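The fold construction can be sketched as below; the exact splitting strategy used by the library (for example, whether the data is shuffled first) may differ:

```java
import java.util.ArrayList;
import java.util.List;

public class KFoldSketch {

    // Partition the indices 0..size-1 into k folds of (nearly) equal size
    // by dealing them out round-robin.
    public static List<List<Integer>> folds(int size, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < size; i++) folds.get(i % k).add(i);
        return folds;
    }

    public static void main(String[] args) {
        // With 11 examples and 2 folds: fold 0 gets 6 indices, fold 1 gets 5.
        System.out.println(folds(11, 2));
    }
}
```

Each fold in turn serves as the validation set while the others are used for training, which is exactly the loop runKFold performs k times.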
The code above produces a result similar to:
SolutionDistance{
weights_w=[0.8673383311511217, 0.04564604584006219, 0.0647437341741078, 0.022271888834708403],
weights_p=[0.5933035227430291, 0.10784413855996985, 0.03387258778518031, 0.26497975091182074],
fitness score=2.2260299633096268}=
0.16666666666666666
SolutionDistance{
weights_w=[0.7832984118592771, 0.12307744745817546, 0.07982187970335382, 0.013802260979193624],
weights_p=[0.01945033161182157, 0.3466399858254755, 0.18834296208558235, 0.44556672047712065],
fitness score=1.7056044468736795}=
0.4166666666666667
As output, the runKFold method returns a HashMap that maps the best solution found for each fold to the AUC achieved by that solution. The method takes as arguments the dataset (data and expected results), the number of folds used in the cross-validation, and a value that increases the number of alert examples when this number is too low. This is useful to increase the penalty for failing to detect an alert.
As with classical training, the runKFold method can be used as in the example above or with JSON files. In that case, the arguments are Strings containing the file names.
References
- The WOWA operator: a review (V. Torra)
- Selection methods for genetic algorithms (K. Jebari and M. Madiafi)
- Continuous Genetic Algorithms (R. Haupt and S. Haupt)
- A comparison of Active Set Method and Genetic Algorithm approaches for learning weighting vectors in some aggregation operators (D. Nettleton and V. Torra)