diff --git a/README.md b/README.md
index 809c203140aec350948040584d1cfc9b0627230b..8e03f7e3dfc195da67840f3935224497b44482fc 100644
--- a/README.md
+++ b/README.md
@@ -99,10 +99,73 @@ The **run** method returns a solution object, consisting of p weights and w weig
 The algorithm is built to be used with different methods to evaluate the fitness score of each chromosome. Two different criteria are already implemented: *distance* and *AUC*.
 - **Distance**: for each element in the population, the WOWA function is computed on all examples of the dataset. Then, for each example, the difference between the WOWA result just computed and the result given by the training dataset is taken. All these differences are added together to obtain the distance, which is the fitness score of a chromosome. The smaller the distance, the better the chromosome.
-- **AUC*: the Area Under the Curve (AUC) fitness score is designed for binary classification. The obtain the AUC, the Receiver Operating Characteristics (ROC) is built first. Concretely, the WOWA function is computed on all elements of the training dataset. Then, on these results, the ROC curve is built. The AUC of this ROC curve is the fitness score of an element. The biggest is the AUC, the best is the chromosome.
+- **AUC**: the Area Under the Curve (AUC) fitness score is designed for binary classification. To obtain the AUC, the Receiver Operating Characteristic (ROC) curve is built first. Concretely, the WOWA function is computed on all elements of the training dataset, and the ROC curve is built on these results. The AUC of this ROC curve is the fitness score of an element. The larger the AUC, the better the chromosome.
 
 It is possible to create a new Solution type with a new evaluation criterion. The new Solution type must inherit from the *AbstractSolution* class and override the method *computeScoreTo*. It is also necessary to modify the *createSolutionObject* method in the *Factory* class.
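The two criteria can be sketched in a few lines of Java. This is a minimal illustration, not the library's implementation: the class `FitnessSketch` and its `aggregate` method are hypothetical names, and `aggregate` uses a plain weighted mean as a stand-in for the actual WOWA operator. The AUC is computed here by counting concordant positive/negative pairs, which for binary labels is equivalent to the area under the ROC curve.

``` java
import java.util.List;

class FitnessSketch {

    // Hypothetical stand-in for the WOWA operator: a plain weighted mean.
    static double aggregate(double[] weights, List<Double> example) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * example.get(i);
        }
        return sum;
    }

    // Distance criterion: sum of absolute differences between the aggregated
    // value and the expected value over the dataset; smaller is better.
    static double distance(double[] weights, List<List<Double>> data, List<Double> expected) {
        double total = 0.0;
        for (int i = 0; i < data.size(); i++) {
            total += Math.abs(aggregate(weights, data.get(i)) - expected.get(i));
        }
        return total;
    }

    // AUC criterion for binary labels (0/1): the fraction of positive/negative
    // pairs where the positive example receives the higher aggregated score
    // (ties count one half); bigger is better.
    static double auc(double[] weights, List<List<Double>> data, List<Double> expected) {
        double concordant = 0.0;
        int pairs = 0;
        for (int i = 0; i < data.size(); i++) {
            if (expected.get(i) != 1.0) continue;
            double pos = aggregate(weights, data.get(i));
            for (int j = 0; j < data.size(); j++) {
                if (expected.get(j) != 0.0) continue;
                double neg = aggregate(weights, data.get(j));
                pairs++;
                if (pos > neg) concordant += 1.0;
                else if (pos == neg) concordant += 0.5;
            }
        }
        return pairs == 0 ? 0.0 : concordant / pairs;
    }

    public static void main(String[] args) {
        double[] weights = {0.4, 0.3, 0.2, 0.1};
        List<List<Double>> data = List.of(
                List.of(0.1, 0.2, 0.3, 0.4),
                List.of(0.5, 0.1, 0.2, 0.3));
        List<Double> expected = List.of(1.0, 0.0);
        System.out.println("distance = " + distance(weights, data, expected));
        System.out.println("AUC      = " + auc(weights, data, expected));
    }
}
```

With these definitions, the genetic algorithm would rank chromosomes by minimizing `distance` or by maximizing `auc`.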
+
+## Cross-validation
+### Example
+``` java
+public static void main(String[] args) {
+
+    Logger logger = Logger.getLogger(Trainer.class.getName());
+    logger.setLevel(Level.INFO);
+    int population_size = 100;
+    int crossover_rate = 60;
+    int mutation_rate = 10;
+    int max_generation = 110;
+    int selection_method = TrainerParameters.SELECTION_METHOD_RWS;
+    int generation_population_method = TrainerParameters.POPULATION_INITIALIZATION_RANDOM;
+
+    TrainerParameters parameters = new TrainerParameters(logger, population_size,
+            crossover_rate, mutation_rate, max_generation, selection_method, generation_population_method);
+
+    //Input data
+    List<List<Double>> data = new ArrayList<>();
+    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
+    data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
+    data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
+    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.5, 0.8)));
+    data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
+    data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));
+    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
+    data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
+    data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
+    data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
+    data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));
+
+    //Expected aggregated value for each data vector
+    List<Double> expected = new ArrayList<>(Arrays.asList(1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0));
+
+    //Create object for the type of Solution (fitness score evaluation)
+    SolutionDistance solution_type = new SolutionDistance(data.get(0).size());
+
+    //Create trainer object
+    Trainer trainer = new Trainer(parameters, solution_type);
+
+    HashMap<AbstractSolution, Double> solution = trainer.runKFold(data, expected, 2, 2);
+
+    //Display solution
+    for (Map.Entry<AbstractSolution, Double> val : solution.entrySet()) {
+        System.out.println(val);
+    }
+}
+```
+The
+method **runKFold** runs a k-fold cross-validation. Concretely, it splits the dataset into k folds. For each fold, that single fold is retained as the validation data for testing the model, and the remaining k − 1 folds are used as training data. The cross-validation process is thus repeated k times, with each of the k folds used exactly once as the validation data. The k results can then be averaged to produce a single estimate.
+For each tested fold, the Area Under the Curve is also computed to evaluate the classification performance (this works only if the expected vector contains only 0 and 1 values).
+
+The code above produces a result similar to:
+```
+SolutionDistance{weights_w=[0.8673383311511217, 0.04564604584006219, 0.0647437341741078, 0.022271888834708403],
+weights_p=[0.5933035227430291, 0.10784413855996985, 0.03387258778518031, 0.26497975091182074],
+fitness score=2.2260299633096268}=0.16666666666666666
+SolutionDistance{weights_w=[0.7832984118592771, 0.12307744745817546, 0.07982187970335382, 0.013802260979193624],
+weights_p=[0.01945033161182157, 0.3466399858254755, 0.18834296208558235, 0.44556672047712065],
+fitness score=1.7056044468736795}=0.4166666666666667
+```
+As output, the method **runKFold** returns a HashMap that maps the best solution found for each fold to the AUC obtained with that solution.
+The method **runKFold** takes as arguments the dataset (data and expected results), the number of folds used in the cross-validation, and a value that can increase the number of alerts if this number is too low. This last argument is useful to increase the penalty for failing to detect an alert.
+
 ## References