The **run** method returns a solution object, consisting of p weights and w weights.
The algorithm is built to be used with different methods to evaluate the fitness score of each chromosome. Two different criteria are already implemented : *distance* and *AUC*.
- **Distance**: for each element in the population, the WOWA function is computed on all examples of the dataset. Then, the difference between the WOWA result just computed and the value given by the training dataset is calculated. All these differences are summed to obtain the distance, which is the fitness score of the chromosome (see the sketch after this list). The smaller the distance, the better the chromosome.
- **AUC**: the Area Under the Curve (AUC) fitness score is designed for binary classification. To obtain the AUC, the Receiver Operating Characteristic (ROC) curve is built first. Concretely, the WOWA function is computed on all elements of the training dataset, and the ROC curve is built from these results. The AUC of this ROC curve is the fitness score of the element. The larger the AUC, the better the chromosome.
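As an illustration of the distance criterion, here is a minimal sketch. The `wowa` helper, the use of absolute differences, and the exact signature are assumptions made for this example, not the library's actual API.

``` java
// Sketch of the distance fitness score (assumed helper names, not the library API).
// wowa(p, w, x) stands for the WOWA aggregation of the vector x with weights p and w.
static double distanceFitness(List<List<Double>> data, List<Double> expected,
                              List<Double> p, List<Double> w) {
    double distance = 0.0;
    for (int i = 0; i < data.size(); i++) {
        double aggregated = wowa(p, w, data.get(i)); // hypothetical WOWA call provided by the library
        // Difference between the WOWA output and the expected value from the training dataset
        distance += Math.abs(aggregated - expected.get(i));
    }
    return distance; // the smaller the distance, the better the chromosome
}
```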
It is possible to create a new Solution type with a new evaluation criterion. The new Solution type must inherit from the *AbstractSolution* class and override the *computeScoreTo* method. It is also necessary to modify the *createSolutionObject* method in the *Factory* class.
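A minimal sketch of such an extension is shown below. The constructor, the signature of *computeScoreTo*, and the accessor/setter names are assumptions that would need to be adapted to the real *AbstractSolution* class.

``` java
// Sketch of a custom Solution type (assumed signatures; adapt to the actual AbstractSolution API).
public class SolutionMeanError extends AbstractSolution {

    public SolutionMeanError(int weight_number) {
        super(weight_number); // assumed constructor
    }

    @Override
    public void computeScoreTo(List<List<Double>> data, List<Double> expected) {
        // Assumed signature: compute the WOWA output of this chromosome on every example
        // and store the resulting fitness score.
        double error = 0.0;
        for (int i = 0; i < data.size(); i++) {
            double aggregated = wowa(getWeightsP(), getWeightsW(), data.get(i)); // hypothetical accessors
            error += Math.abs(aggregated - expected.get(i));
        }
        setFitnessScore(error / data.size()); // hypothetical setter
    }
}
```

A corresponding case would also have to be added in *Factory.createSolutionObject* so that the trainer can instantiate the new type.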
## Cross-validation
### Example
``` java
public static void main(String[] args) {
    Logger logger = Logger.getLogger(Trainer.class.getName());
    logger.setLevel(Level.INFO);

    // Genetic algorithm parameters
    int population_size = 100;
    int crossover_rate = 60;
    int mutation_rate = 10;
    int max_generation = 110;
    int selection_method = TrainerParameters.SELECTION_METHOD_RWS;
    int generation_population_method = TrainerParameters.POPULATION_INITIALIZATION_RANDOM;
    TrainerParameters parameters = new TrainerParameters(logger, population_size,
            crossover_rate, mutation_rate, max_generation, selection_method, generation_population_method);

    // Input data
    List<List<Double>> data = new ArrayList<>();
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.5, 0.8)));
    data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));

    // Expected aggregated value for each data vector
    List<Double> expected = new ArrayList<>(Arrays.asList(1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0));

    // Create object for the type of Solution (fitness score evaluation)
    SolutionDistance solution_type = new SolutionDistance(data.get(0).size());

    // Create trainer object and run a k-fold cross-validation (here k = 2)
    Trainer trainer = new Trainer(parameters, solution_type);
    HashMap<AbstractSolution, Double> solution = trainer.runKFold(data, expected, 2, 2);

    // Display the best solution and AUC of each fold
    for (Map.Entry val : solution.entrySet()) {
        System.out.println(val);
    }
}
```
The **runKFold** method runs a k-fold cross-validation. Concretely, it splits the dataset into k folds. In each iteration, a single fold is retained as the validation data for testing the model, and the remaining k − 1 folds are used as training data. The cross-validation process is repeated k times, with each of the k folds used exactly once as validation data. The k results can then be averaged to produce a single estimate.
For each tested fold, the Area Under the Curve is also computed to evaluate the classification performance (this works only if the expected vector contains 0 and 1 values).
The code above produces a result similar to:
```
SolutionDistance{weights_w=[0.8673383311511217, 0.04564604584006219, 0.0647437341741078, 0.022271888834708403],
weights_p=[0.5933035227430291, 0.10784413855996985, 0.03387258778518031, 0.26497975091182074],
fitness score=2.2260299633096268}=0.16666666666666666
SolutionDistance{weights_w=[0.7832984118592771, 0.12307744745817546, 0.07982187970335382, 0.013802260979193624],
weights_p=[0.01945033161182157, 0.3466399858254755, 0.18834296208558235, 0.44556672047712065],
fitness score=1.7056044468736795}=0.4166666666666667
```
As output, the **runKFold** method returns a HashMap that contains the best solution for each fold and the AUC corresponding to this solution.
The **runKFold** method takes as arguments the dataset (data and expected results), the number of folds used in the cross-validation, and a value that can increase the number of alerts if this number is too low. This option is useful to increase the penalty for failing to detect an alert.
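For example, continuing the example above, the fold solution with the highest AUC can be retrieved from the map returned by **runKFold** with plain Java; nothing here assumes more about the library than the `HashMap<AbstractSolution, Double>` return type already shown.

``` java
// Pick the fold solution with the highest AUC from the runKFold result.
HashMap<AbstractSolution, Double> solutions = trainer.runKFold(data, expected, 2, 2);
AbstractSolution best = null;
double best_auc = Double.NEGATIVE_INFINITY;
for (Map.Entry<AbstractSolution, Double> entry : solutions.entrySet()) {
    if (entry.getValue() > best_auc) {
        best_auc = entry.getValue();
        best = entry.getKey();
    }
}
System.out.println("Best AUC: " + best_auc);
System.out.println("Best solution: " + best);
```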
## References