Evaluation - DigestPath2019

Task 1: Signet ring cell detection

Each team's submission is first ranked separately on each of the evaluation metrics below. A team's overall rank is the average of its ranks across those metrics.
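The average-rank rule can be sketched as follows. This is only an illustration: the function names are hypothetical, every metric is assumed to be "higher is better", and tied teams are assumed to share the best rank, which the rules above do not specify.

```python
def rank_teams(scores):
    """Rank teams on one metric; rank 1 is the highest score.

    Assumption: tied teams share the best (lowest) rank.
    """
    ordered = sorted(scores.values(), reverse=True)
    return {team: ordered.index(score) + 1 for team, score in scores.items()}


def average_rank(metric_tables):
    """Average each team's per-metric ranks into one overall value."""
    per_metric = [rank_teams(table) for table in metric_tables]
    teams = per_metric[0].keys()
    return {
        team: sum(ranks[team] for ranks in per_metric) / len(per_metric)
        for team in teams
    }


# Made-up scores for three hypothetical teams on three metrics.
example = average_rank([
    {"A": 0.9, "B": 0.8, "C": 0.7},
    {"A": 90, "B": 95, "C": 80},
    {"A": 0.6, "B": 0.7, "C": 0.5},
])
```

A lower average rank is better; in this made-up example team B would come first.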

Detection evaluation measures include (1) instance-level recall, (2) normal region false positives, (3) Free-response Receiver Operating Characteristic (FROC).

Instance-level recall: Because signet ring cells occur in overcrowded regions and vary in appearance, a perfect annotation is impossible, as the image above shows. The yellow cells there may be signet ring cells that were left unlabeled in an overcrowded region. Pathologists can only guarantee that the labeled cells really are signet ring cells; unlabeled cells may be signet ring cells as well. For this reason the task emphasizes instance-level recall, provided that precision is above 20%. The test data contains two types of images: positive images contain signet ring cells and negative images do not. Instance-level recall is the number of matched ground truth boxes divided by the total number of ground truth boxes, and ranges from 0 to 1.
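A minimal sketch of instance-level recall is given below. The matching rule is not stated on this page, so the sketch assumes a greedy one-to-one match by intersection over union (IoU) with a hypothetical `iou_threshold` of 0.5; the official evaluation may match boxes differently.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def instance_recall(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Matched ground truth boxes divided by all ground truth boxes.

    Assumption: each prediction is matched greedily to the unmatched
    ground truth box with the highest IoU at or above `iou_threshold`.
    """
    matched = set()
    for pred in pred_boxes:
        best_index, best_iou = None, iou_threshold
        for index, gt in enumerate(gt_boxes):
            if index in matched:
                continue
            overlap = iou(pred, gt)
            if overlap >= best_iou:
                best_index, best_iou = index, overlap
        if best_index is not None:
            matched.add(best_index)
    return len(matched) / len(gt_boxes) if gt_boxes else 0.0


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(1, 1, 10, 10), (50, 50, 60, 60)]
recall = instance_recall(ground_truth, predictions)
```

Here one of the two ground truth boxes is matched, so the recall is 0.5.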

Normal region false positives: This is the average number of false positive predictions per negative image. For evaluation, the value is converted to a score of max(100 - normal region false positives, 0), so fewer false positives give a higher score.
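As a small worked example of this conversion, assuming the input is a list with one false positive count per negative image (the function name is hypothetical):

```python
def normal_region_fp_score(fp_counts):
    """Convert per-image false positive counts on negative images to a score.

    `fp_counts` holds one count per negative image and must not be empty.
    """
    if not fp_counts:
        raise ValueError("at least one negative image is required")
    average_fp = sum(fp_counts) / len(fp_counts)
    return max(100 - average_fp, 0)


score_few_fps = normal_region_fp_score([0, 10, 20])   # average of 10 per image
score_many_fps = normal_region_fp_score([300, 100])   # average of 200 per image
```

An average of 10 false positives per negative image gives a score of 90, while any average of 100 or more is clipped to 0.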

FROC: Adjusting the confidence threshold yields different sets of predictions. FROC is the average recall over the prediction sets for which the number of normal region false positives is 1, 2, 4, 8, 16, and 32.
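One way to read this definition is sketched below. It rests on several assumptions that this page does not confirm: each prediction has already been labelled as a matched cell (`"tp"`), a false positive in a negative image (`"normal_fp"`), or anything else (`"other"`); the false positive level is the average per negative image; and the recall at each level is the last recall reached before that level is exceeded.

```python
def froc_score(predictions, n_gt, n_negative_images,
               fp_levels=(1, 2, 4, 8, 16, 32)):
    """Average recall at several normal-region false positive levels.

    `predictions` is a list of (confidence, kind) pairs, where kind is
    "tp", "normal_fp", or "other" (hypothetical labels for this sketch).
    """
    ordered = sorted(predictions, key=lambda p: p[0], reverse=True)
    recalls = []
    for level in fp_levels:
        tp = fp = 0
        recall_at_level = 0.0
        # Lower the threshold one prediction at a time.
        for _, kind in ordered:
            if kind == "tp":
                tp += 1
            elif kind == "normal_fp":
                fp += 1
            if fp / n_negative_images > level:
                break
            recall_at_level = tp / n_gt
        recalls.append(recall_at_level)
    return sum(recalls) / len(recalls)


toy_predictions = [
    (0.9, "tp"), (0.8, "normal_fp"), (0.7, "tp"),
    (0.6, "normal_fp"), (0.5, "tp"),
]
froc = froc_score(toy_predictions, n_gt=4, n_negative_images=1)
```

In this toy case the recall is 0.5 at level 1 and 0.75 at every higher level. Predictions with identical confidence are not handled specially here.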

For submission: When you write your prediction results to XML files, apply your own confidence threshold first. FROC handles the threshold question through the 'confidence' value, but Recall and FP@normal are calculated over ALL predicted cells in the submission.

Task 2: Colonoscopy tissue segmentation and classification

Each team's submission is first ranked separately on each of the evaluation metrics below. A team's overall rank is the average of its ranks across those metrics.

Evaluation measures include the lesion segmentation Dice Similarity Coefficient (DSC) and the classification area under the curve (AUC).

Dice Similarity Coefficient (DSC): The Dice metric measures the area overlap between segmentation results and annotations. Dice is computed by

$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

where A is the set of foreground pixels in the annotation and B is the corresponding set of foreground pixels in the segmentation result.
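A minimal sketch of the Dice computation on binary masks follows. Masks are assumed to be equally sized nested lists of 0/1 values, and returning 1.0 when both masks are empty is a convention chosen for this sketch, not something the page specifies.

```python
def dice_coefficient(annotation, prediction):
    """Dice overlap between two binary masks given as nested lists."""
    a = [bool(value) for row in annotation for value in row]
    b = [bool(value) for row in prediction for value in row]
    if len(a) != len(b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    # Assumption: two empty masks count as a perfect match.
    return 2 * intersection / total if total else 1.0


annotation_mask = [[1, 1], [0, 0]]
predicted_mask = [[1, 0], [0, 0]]
dice = dice_coefficient(annotation_mask, predicted_mask)
```

The masks share one foreground pixel out of three in total, so the result is 2/3.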

Area under the curve (AUC): AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative'). This can be written as follows:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\left(\mathrm{FPR}^{-1}(x)\right)\,dx = P(X_1 > X_0)$$

where X_1 is the score for a positive instance and X_0 is the score for a negative instance. TPR is the true positive rate and FPR is the false positive rate.
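The probabilistic reading of AUC can be checked directly by comparing every positive score with every negative score. The sketch below does this; counting ties as half a win is the usual convention for this estimate and is an assumption here, since the page does not mention ties.

```python
def auc_pairwise(positive_scores, negative_scores):
    """Estimate P(X_1 > X_0) by comparing all positive/negative pairs."""
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(positive_scores) * len(negative_scores))


auc = auc_pairwise([0.9, 0.8, 0.4], [0.7, 0.3])
```

Five of the six pairs are ordered correctly here, giving an AUC of 5/6. The double loop is quadratic and is meant for illustration rather than for large test sets.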

The mask files are JPG images. We strongly advise participants to binarize the masks with a threshold of 128 before training. During evaluation, the same threshold of 128 is applied to both your predictions and the ground truth masks before the DSC is calculated.
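Because JPG compression is lossy, mask pixels are generally not exactly 0 or 255, which is why a threshold is needed. The sketch below binarizes a grayscale mask given as nested lists of 0-255 values. Whether a pixel equal to 128 counts as foreground is not stated on this page; the sketch assumes that it does.

```python
def binarize_mask(mask, threshold=128):
    """Turn a grayscale mask (values 0-255) into a 0/1 mask.

    Assumption: pixels at or above `threshold` are foreground.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in mask]


gray_mask = [
    [0, 3, 127],
    [128, 250, 255],
]
binary_mask = binarize_mask(gray_mask)
```

The first row becomes all background and the second row all foreground.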