The ANODE09 Study

This document describes the ANODE09 study. The overall goal of the study is to compare the various methods available for automatic detection of pulmonary nodules in thoracic CT scans. Those interested in participating should read this document carefully. Certain details of the study may not be finalized; return to this web page to check the status of ANODE09. The last modification date will be listed here: June 24, 2010. If anything is not clear, or if you have additional questions, please e-mail Bram van Ginneken (bram.vanginneken@radboudumc.nl). There is now a publication about ANODE09 available. This paper describes many aspects of the study that were previously explained on this page. Please read the paper; only information not contained in it is listed here. If you do not have access to this journal, you can request a reprint by following this link.

Goal and outcome of the study

We hope ANODE09 will yield several results that are worthwhile for the CAD research community:

  • It provides an opportunity for participants to test their algorithms on a large common database, representative of what will be encountered in a lung screening setting.
  • The results listed on this website serve as an indication of how well CAD algorithms perform. The site will remain open and submitted results are dated, so improvements in CAD technology over time can be quantified.

Rules

The collection of the data, the organization of the ANODE09 study, and the maintenance of this website requires a large effort. We are committed to maintaining this site as a public repository of benchmark results for nodule detection on a common database in the spirit of cooperative scientific progress. In return, we ask everyone who uses this site to respect the rules below.

These rules amount to a simple tit for tat: we actively encourage anyone to use this data for testing lung nodule detection algorithms, but you must, in return, send us the results of your method and a document that describes your method. The score of your algorithm and your description will be publicly available on this site.

We do not claim any ownership or rights to the algorithms or uploaded documents, and do not want to create any obstacles for publishing methods that use the ANODE09 data.

The following rules apply to those who register a team and download the data:

Motivation for the study

There are several related reasons why we are organizing this study. One is that the competitive nature of this study may stimulate research groups to improve their algorithms, following the paradigm of revolution through competition. Another one is that we believe progress in the development of computer-aided diagnosis systems is impeded by the fact that it is hard to compare the relative merits of different CAD systems and approaches when published performance measures are obtained on different data sets, and these data sets are not accessible to other researchers. The ANODE09 study allows a more direct comparison because the same data and the same evaluation procedure are used for all submitted results.

The latter is the reason why the reference annotations for the test data cannot be downloaded, and will not be made available in the future. Previous experiences with making data sets publicly available have taught us that if we released the 'truth' for the test data, researchers would perform slightly or vastly different types of evaluations. It could also lead to data picking, i.e. reporting results on a subset of scans or nodules only. While different ways of evaluating algorithms exist for good reason, and while it may make perfect sense to apply an algorithm only to a subset of a database, all of this leads to incomparable results reported in papers, even though the same database was used. To avoid this, we decided on the current procedure, which ensures that each system is evaluated in exactly the same way.

Training data

For this study, 5 example scans and 50 test scans can be downloaded. There is no separate data set available to train algorithms. Of course you are also allowed to use your own training data for any algorithm you develop and apply to the ANODE09 test data set. However, you should not use the ANODE09 test data in any way for training your system as this would positively bias and thus invalidate your results. We have no way of checking if teams use the (results on) test data to train, tune or tweak their performance, but we ask you to not do that.

Data and data format

Downloaded files end with .tar.bz2. They should first be decompressed with bzip2 and subsequently untarred. Many programs, for example the free program 7-Zip, can do this.
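
If you prefer not to install a separate archiver, Python's standard tarfile module can decompress and untar these files in one step. A minimal sketch (the archive and output directory names are placeholders):

import tarfile

# Extract a downloaded ANODE09 archive; "r:bz2" handles the bzip2 compression.
with tarfile.open("test_part1.tar.bz2", "r:bz2") as archive:
    archive.extractall(path="anode09_data")  # writes the contained files to this directory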

Each downloaded file contains original CT scans, stored in Meta format. This format stores an image as an ASCII readable header file with extension .mhd and a separate binary file for the image data with extension .raw. This format is ITK compatible; documentation is available here. An application that can read the data is SNAP. If you want to write your own code to read the data, note that in the header file you can find the dimensions of the scan and the voxel spacing. In the raw file the values for each voxel are stored consecutively, with the index running first over x, then y, then z. The pixel type is short.
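
To illustrate the format, the sketch below reads a scan directly with NumPy, using the standard MetaIO header fields (DimSize, ElementSpacing, ElementDataFile). The file name is a placeholder; an ITK-based reader such as SimpleITK is the more robust option in practice.

import os
import numpy as np

def read_mhd(header_path):
    # Parse the ASCII header into key/value pairs.
    header = {}
    with open(header_path) as f:
        for line in f:
            key, _, value = line.partition("=")
            header[key.strip()] = value.strip()

    dims = [int(v) for v in header["DimSize"].split()]              # x, y, z
    spacing = [float(v) for v in header["ElementSpacing"].split()]  # voxel spacing
    raw_path = os.path.join(os.path.dirname(header_path), header["ElementDataFile"])

    # Voxel values are short (16-bit signed) with the index running first over
    # x, then y, then z, so the NumPy array is reshaped to (z, y, x).
    voxels = np.fromfile(raw_path, dtype=np.int16).reshape(dims[::-1])
    return voxels, spacing

scan, spacing = read_mhd("example01.mhd")   # placeholder file name
print(scan.shape, spacing)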

The annotation file that is distributed with the example scans is an ASCII readable text file that contains one finding per line. Each line holds the scan name (example01 to example05); the x, y, and z coordinates of the finding in voxels; and the type of finding: 1 for a true nodule, or 2 for a finding that is not a relevant nodule according to the protocol used in the screening study from which the data originates, but that may represent a lesion. If a CAD system marks a finding labeled '2', this does not count as a false positive.
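
A minimal sketch of parsing this annotation file, assuming whitespace-separated fields; the file name is a placeholder:

def read_annotations(path):
    findings = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            scan, x, y, z, kind = line.split()
            # kind is 1 for a true nodule, 2 for an irrelevant finding that
            # does not count as a false positive.
            findings.append((scan, float(x), float(y), float(z), int(kind)))
    return findings

annotations = read_annotations("annotations_example.txt")
true_nodules = [a for a in annotations if a[4] == 1]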

All data has been provided by the Nelson study, the largest CT lung cancer screening trial in Europe. More information about the acquisition process and the screening study from which the data originates is provided in this paper and in the technical report.

Evaluation protocol, reference standard and submitting results

This is described in detail in the ANODE09 paper.

Results are evaluated with free-response receiver operating characteristic (FROC) analysis. This means that the sensitivity (the fraction of all true nodules in the test scans that is detected by the system) is plotted as a function of the average number of false positive markers per scan. To obtain a point on the FROC curve, only those findings of a CAD system whose degree of suspicion is above a threshold t are selected, and the number of false positives and true positives is determined. All thresholds t that define a unique point on the FROC curve are evaluated. Between these points, straight lines are drawn to produce the FROC curve. The point with the lowest false positive rate is connected to (0,0). From the point with the highest false positive rate, the FROC curve is extended by a horizontal straight line.
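
The sketch below illustrates how such FROC points can be computed. It assumes that each finding already carries a flag saying whether it hits a true nodule (the matching of findings to the reference standard is described in the ANODE09 paper) and that irrelevant (type 2) findings have been removed; all names are illustrative only.

import numpy as np

def froc_points(suspicion, is_true_positive, n_nodules, n_scans):
    """suspicion: degree of suspicion per finding; is_true_positive: matching flag per finding."""
    suspicion = np.asarray(suspicion, dtype=float)
    is_tp = np.asarray(is_true_positive, dtype=bool)

    points = []
    for t in np.unique(suspicion):                 # every threshold that defines a unique point
        kept = suspicion >= t                      # findings at or above the threshold
        tp = np.count_nonzero(kept & is_tp)
        fp = np.count_nonzero(kept & ~is_tp)
        points.append((fp / n_scans, tp / n_nodules))   # (FPs per scan, sensitivity)
    return sorted(points)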

To extract a single score from the FROC curve, we measure the sensitivity at 7 predefined false positive rates: 1/8, 1/4, 1/2, 1, 2, 4, and 8 FPs per scan. Note that since we connect points on the FROC curve with straight lines as outlined above, we can always compute these sensitivities exactly from the curve, even if there is no threshold t that precisely produces these false positive rates. These 7 sensitivities are averaged to obtain an overall score for a system. Clearly a perfect system will have a score of 1, and the lowest possible score is 0. Most CAD systems in clinical use today have their internal threshold set to operate somewhere between 1 and 4 false positives per scan on average (some systems allow the user to vary the threshold). To make the task more challenging, we included lower false positive rates in our evaluation. This determines whether a system can also identify a significant percentage of nodules with very few false alarms, as might be needed for CAD algorithms that operate more or less autonomously.
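
Under the same illustrative assumptions as the previous sketch, the score can be obtained by reading the sensitivity off the curve at the seven false positive rates with linear interpolation (the curve connected to (0,0) on the left and extended horizontally on the right) and averaging the seven values:

import numpy as np

def anode09_score(points):
    """points: sorted list of (FPs per scan, sensitivity) pairs."""
    fps = np.array([0.0] + [p[0] for p in points])    # connect the curve to (0, 0)
    sens = np.array([0.0] + [p[1] for p in points])

    targets = [1 / 8, 1 / 4, 1 / 2, 1, 2, 4, 8]
    # np.interp interpolates linearly between points and holds the last
    # sensitivity constant beyond the highest false positive rate reached,
    # matching the horizontal extension described above.
    return float(np.mean(np.interp(targets, fps, sens)))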

From the previous exposition, it should be clear that to obtain a good score, systems should include enough findings in their results to reach the point of 8 FPs per scan. It is probably also beneficial to include enough different values for the degree of suspicion to produce a decent number of unique points on the FROC curve. In the extreme case that all findings are assigned the same degree of suspicion, only one point on the curve will be defined, a straight line will be drawn from (0,0) to this point, and a horizontal line will extend from that point to the right. Finally, note that no more than 2000 findings will be processed. If a submitted result file contains more findings, they will be sorted on the supplied degrees of suspicion and findings not among the 2000 most suspicious will be discarded.
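
If your system produces a very large number of findings, you can apply the same cap yourself before submitting. A trivial sketch, assuming each finding is a tuple whose last element is the degree of suspicion:

# Keep only the 2000 most suspicious findings, as the evaluation would do anyway.
findings.sort(key=lambda finding: finding[-1], reverse=True)
findings = findings[:2000]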

Format of the result file

The result file must be a simple ASCII readable text file that contains a string and four numbers per line. Each line holds one finding. The string and numbers must be separated by spaces or tab characters.

The string is the name of the scan in which the finding is located. Possible values for the string are test01, test02, ... , test50. This string is followed by 3 numbers giving the x, y, and z coordinates of the finding, respectively. They can be floating point values if desired (use a decimal point, not a comma). Note that the first voxel in the supplied data has coordinates (0,0,0). Coordinates are given in 'voxel coordinates', not in world coordinates or millimeters. You can verify that you use the correct way of addressing voxels by checking the supplied coordinates for the example data; the coordinates should refer to locations that hold a nodule.

The final number is a degree of suspicion that should be higher for findings more likely to represent true nodules (can be a floating point value, use a decimal point in that case). The FROC curve of the system will be determined by thresholding this value.

Between the string and the numbers there should be one or more spaces or tab characters. The order of the lines is irrelevant.
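
A minimal sketch of writing findings in this format; the function and variable names are placeholders:

def write_results(path, findings):
    """findings: iterable of (scan_name, x, y, z, suspicion) tuples, coordinates in voxels."""
    with open(path, "w") as f:
        for scan_name, x, y, z, suspicion in findings:
            # Space-separated fields, decimal points (not commas) for floating point values.
            f.write(f"{scan_name} {x} {y} {z} {suspicion}\n")

write_results("results.txt", [("test07", 108, 212, 254.443, 0.313433)])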

The following is an example result file:

test07 108 212 254.443 0.313433
test20 228 234 106.316 0.791044
test14 94 328 316 0.432836
test41 218 298 109.561 0.238806
test04 388 156 363 0.432836
test04 182 401 238 0.865671
test08 330 306.8 148.8 0.820895
test01 116 124 45 0.373134
test15 194 208 312 0.507463

This file contains 9 findings (obviously far too few). There are 8 unique likelihood values, so there are 8 unique thresholds that each produce a distinct set of findings (each threshold discards the findings below it). That means there will be 8 points on the FROC curve.
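
Before uploading, it may be worth running a simple self-check on the result file. A sketch along the lines of the format rules above (the file name is a placeholder):

valid_names = {f"test{i:02d}" for i in range(1, 51)}   # test01 ... test50

def check_results(path):
    suspicions = []
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue
            fields = line.split()
            assert len(fields) == 5, f"line {n}: expected a scan name plus 4 numbers"
            assert fields[0] in valid_names, f"line {n}: unknown scan name {fields[0]}"
            x, y, z, suspicion = map(float, fields[1:])   # raises if a field is not numeric
            suspicions.append(suspicion)
    print(f"{len(suspicions)} findings, {len(set(suspicions))} unique degrees of suspicion")

check_results("results.txt")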

How often can I submit results?

Teams can upload as often as they want, but all submitted results will appear on the website and should be substantially different from previous submissions. These differences must be evident from the submitted pdf description file. In other words, you cannot submit a long series of results, using the same pdf file. This is done to avoid clutter on the result page and to avoid 'training on the test set'.

If you have developed a new algorithm, and you are already listed on the site with another system, you can either use the same team or create a new team for your new algorithm.

Currently we do not offer the possibility for teams to remove submitted results. If you believe there are good reasons to remove certain results that you have submitted, please contact Bram van Ginneken (bram.vanginneken@radboudumc.nl).

Description of the algorithms; checklist

With every result file that is submitted on this website, a pdf file that describes the system that generated the results must be uploaded. Ideally this document is a paper (scientific publication or technical report) describing the system that has been used to generate the results in such detail that others can reimplement it.

References to accessible literature in the description are also acceptable; there is no need to repeat information that can be found in such publications. If you have not yet written a detailed paper, or have submitted this for publication and do not want it to become publicly available yet, or if you have other reasons why you want to withhold detailed information about your method, please indicate the reasons for this in the pdf file you submit and describe the system only briefly, using the checklist given below. If your system is commercial and you do not want to divulge details, please provide whatever you can release, and state the name and the version of the system and the company that is associated with it.

The reason we require the upload of a descriptive document with every submitted result is that we believe that it is far less interesting to report results of systems whose working is unknown. We reserve the right to not process results that are not accompanied by a suitable description.

If desired you can change the pdf file that describes your system. Send the new pdf, together with the submission number it applies to (listed at the top of the result page), to Bram van Ginneken (bram.vanginneken@radboudumc.nl).

For convenience, we provide a checklist below of items that we believe should be mentioned in a description of a CAD system for nodule detection in thoracic CT scans.

  • Does your system use training data? If so, describe the characteristics of the training data.
  • Give the overall structure of the algorithm. In many cases this will be lung segmentation, candidate detection, feature computation for each candidate, and classification of candidates.
  • Briefly describe each step in the structure of the algorithm (how were the lungs segmented, how were candidates detected, what features were used, what classifier was used).
  • List limitations of the algorithm. Is the algorithm specifically designed to detect only certain types of nodules? What size range was the algorithm optimized for? Can the method detect nodules connected to vasculature or the pleural surface? Can it detect non-solid and part-solid nodules? Was it optimized for scans with thick or thin slices, and are other technical scan parameters expected to influence detection performance?
  • If the algorithm has been tested on other databases, you could consider including those results.