Affiliations:
1. Department of Electrical Engineering, Stanford University, Stanford, California, USA.
2. Department of Dermatology, Stanford University, Stanford, California, USA.
3. Department of Pathology, Stanford University, Stanford, California, USA.
4. Dermatology Service, Veterans Affairs Palo Alto Health Care System, Palo Alto, California, USA.
5. Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, California, USA.
6. Department of Computer Science, Stanford University, Stanford, California, USA.
Esteva A, et al. Nature. 2017 Jun 28;546(7660):686. doi: 10.1038/nature22985. PMID: 28658222.
Skin cancer, the most common human malignancy, is primarily diagnosed visually, beginning with an initial clinical screening and followed potentially by dermoscopic analysis, a biopsy and histopathological examination. Automated classification of skin lesions using images is a challenging task owing to the fine-grained variability in the appearance of skin lesions. Deep convolutional neural networks (CNNs) show potential for general and highly variable tasks across many fine-grained object categories. Here we demonstrate classification of skin lesions using a single CNN, trained end-to-end from images directly, using only pixels and disease labels as inputs. We train a CNN using a dataset of 129,450 clinical images (two orders of magnitude larger than previous datasets) consisting of 2,032 different diseases. We test its performance against 21 board-certified dermatologists on biopsy-proven clinical images with two critical binary classification use cases: keratinocyte carcinomas versus benign seborrheic keratoses; and malignant melanomas versus benign nevi. The first case represents the identification of the most common cancers; the second represents the identification of the deadliest skin cancer. The CNN achieves performance on par with all tested experts across both tasks, demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists. Outfitted with deep neural networks, mobile devices can potentially extend the reach of dermatologists outside of the clinic. It is projected that 6.3 billion smartphone subscriptions will exist by the year 2021 (ref. 13); these devices could therefore potentially provide low-cost universal access to vital diagnostic care.
Illustrative example of the inference procedure using a subset of the taxonomy and mock training/inference classes. Inference classes (for example, malignant and benign lesions) correspond to the red nodes in the tree. Training classes (for example, amelanotic melanoma, blue nevus), which were determined using the partitioning algorithm with maxClassSize = 1,000, correspond to the green nodes in the tree. White nodes represent either nodes that are contained in an ancestor node's training class or nodes that are too large to be individual training classes. The equation represents the relationship between the probability of a parent node u and its children C(u): P(u) = Σ_{v ∈ C(u)} P(v), that is, the sum of the child probabilities equals the probability of the parent. The CNN outputs a distribution over the training nodes. To recover the probability of any inference node it therefore suffices to sum the probabilities of the training nodes that are its descendants. A numerical example is shown for the benign inference class: P(benign) = 0.6 = 0.1 + 0.05 + 0.05 + 0.3 + 0.02 + 0.03 + 0.05.
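This marginalization step is straightforward to implement. Below is a minimal Python sketch; the class names and probability values mirror the mock example in the figure, and the descendant mapping is an illustrative assumption rather than the paper's actual taxonomy.

```python
# Minimal sketch of recovering an inference-class probability by summing the
# CNN's probabilities over its descendant training classes. Class names and
# values mirror the figure's mock example; the mapping below is illustrative.

# CNN output: a probability distribution over (mock) training classes.
training_probs = {
    "benign-a": 0.10, "benign-b": 0.05, "benign-c": 0.05, "benign-d": 0.30,
    "benign-e": 0.02, "benign-f": 0.03, "benign-g": 0.05,
    "malignant-a": 0.25, "malignant-b": 0.15,
}

# For each inference node, the training nodes that are its descendants.
inference_to_training = {
    "benign": ["benign-a", "benign-b", "benign-c", "benign-d",
               "benign-e", "benign-f", "benign-g"],
    "malignant": ["malignant-a", "malignant-b"],
}

def inference_probability(node: str) -> float:
    """P(inference node) = sum of P(training node) over its descendants."""
    return sum(training_probs[t] for t in inference_to_training[node])

print(inference_probability("benign"))     # 0.6, matching the worked example
print(inference_probability("malignant"))  # 0.4
```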
Confusion matrices for the CNN and both dermatologists for the nine-way classification task of the second validation strategy reveal similarities in misclassification between human experts and the CNN. Element (i, j) of each confusion matrix represents the empirical probability of predicting class j given that the ground truth was class i, with i and j referencing classes from Extended Data Table 2d. Note that both the CNN and the dermatologists noticeably confuse benign and malignant melanocytic lesions (classes 7 and 8) with each other, with dermatologists erring on the side of predicting malignant. The distribution across column 6 (inflammatory conditions) is pronounced in all three plots, demonstrating that many lesions are easily confused with this class. The distribution across row 2 in all three plots shows the difficulty of classifying malignant dermal tumours, which appear as little more than cutaneous nodules under the skin. The dermatologist matrices are each computed using the 180 images from the nine-way validation set. The CNN matrix is computed using a random sample of 684 images (equally distributed across the nine classes) from the validation set.
a–i, Saliency maps for example images from each of the nine clinical disease classes of the second validation strategy reveal the pixels that most influence a CNN's prediction. Saliency maps show the pixel gradients with respect to the CNN's loss function. Darker pixels represent those with more influence. We see clear correlation between the lesions themselves and the saliency maps. Conditions with a single lesion (a–f) tend to exhibit tight saliency maps centred around the lesion. Conditions with spreading lesions (g–i) exhibit saliency maps that similarly occupy multiple points of interest in the images. (A minimal code sketch of this gradient-based saliency computation follows the panel list below.)
a, Malignant melanocytic lesion (source image: https://www.dermquest.com/imagelibrary/large/020114HB.JPG).
b, Malignant epidermal lesion (source image: https://www.dermquest.com/imagelibrary/large/001883HB.JPG).
c, Malignant dermal lesion (source image: https://www.dermquest.com/imagelibrary/large/019328HB.JPG).
d, Benign melanocytic lesion (source image: https://www.dermquest.com/imagelibrary/large/010137HB.JPG).
e, Benign epidermal lesion (source image: https://www.dermquest.com/imagelibrary/large/046347HB.JPG).
f, Benign dermal lesion (source image: https://www.dermquest.com/imagelibrary/large/021553HB.JPG).
g, Inflammatory condition (source image: https://www.dermquest.com/imagelibrary/large/030028HB.JPG).
h, Genodermatosis (source image: https://www.dermquest.com/imagelibrary/large/030705VB.JPG).
i, Cutaneous lymphoma (source image: https://www.dermquest.com/imagelibrary/large/030540VB.JPG).
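The legend above describes saliency as the gradient of the loss with respect to the input pixels. A minimal PyTorch sketch of that computation follows; the model weights, input tensor, and class label are placeholders, and the paper's exact loss and post-processing may differ.

```python
# Sketch: gradient-based saliency map, i.e. the gradient of the loss with
# respect to the input pixels. Model, input and label are placeholders;
# the paper's exact loss and post-processing may differ.
import torch
import torch.nn.functional as F
import torchvision

# torchvision >= 0.13 weights API; older versions use pretrained=True instead.
model = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
model.eval()

image = torch.rand(1, 3, 299, 299, requires_grad=True)  # placeholder input image
label = torch.tensor([0])                                # placeholder class index

logits = model(image)                      # in eval mode, returns main logits only
loss = F.cross_entropy(logits, label)
loss.backward()                            # populates image.grad with pixel gradients

# Collapse the colour channels; larger magnitude = more influence on the loss.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)
print(saliency.shape)                      # torch.Size([299, 299])
```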
a, Identical plots and results as shown in Fig. 3a, except that dermatologists were asked if a lesion appeared to be malignant or benign. This is a somewhat unnatural question to ask; in the clinic, the only actionable decision is whether or not to biopsy or treat a lesion. The blue curves for the CNN are identical to Fig. 3.
b, Figure 3b reprinted for visual comparison to a.
Our classification technique is a deep CNN. Data flow is from left to right: an image of a skin lesion (for example, melanoma) is sequentially warped into a probability distribution over clinical classes of skin disease using Google Inception v3 CNN architecture pretrained on the ImageNet dataset (1.28 million images over 1,000 generic object classes) and fine-tuned on our own dataset of 129,450 skin lesions comprising 2,032 different diseases. The 757 training classes are defined using a novel taxonomy of skin disease and a partitioning algorithm that maps diseases into training classes (for example, acrolentiginous melanoma, amelanotic melanoma, lentigo melanoma). Inference classes are more general and are composed of one or more training classes (for example, malignant melanocytic lesions—the class of melanomas). The probability of an inference class is calculated by summing the probabilities of the training classes according to taxonomy structure (see Methods). Inception v3 CNN architecture reprinted from https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html.
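The transfer-learning setup described in this legend (an ImageNet-pretrained Inception v3 fine-tuned over 757 training classes) can be sketched roughly as below, written in PyTorch for illustration; the optimizer, learning rate, auxiliary-loss weight, and placeholder batch are assumptions, and the authors' actual implementation and hyperparameters are described in their Methods.

```python
# Rough sketch of fine-tuning an ImageNet-pretrained Inception v3 on a new
# 757-class task, as described in the figure legend. The optimizer, learning
# rate, and aux-loss weight below are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torchvision

NUM_TRAINING_CLASSES = 757

model = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
# Replace the ImageNet classification heads with 757-way heads.
model.fc = nn.Linear(model.fc.in_features, NUM_TRAINING_CLASSES)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_TRAINING_CLASSES)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed optimizer/lr
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of 299x299 RGB images and class indices."""
    model.train()
    optimizer.zero_grad()
    logits, aux_logits = model(images)            # training mode returns both heads
    loss = criterion(logits, labels) + 0.4 * criterion(aux_logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with a random placeholder batch:
loss = train_step(torch.rand(8, 3, 299, 299),
                  torch.randint(0, NUM_TRAINING_CLASSES, (8,)))
print(loss)
```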
a, A subset of the top of the tree-structured taxonomy of skin disease. The full taxonomy contains 2,032 diseases and is organized based on visual and clinical similarity of diseases. Red indicates malignant, green indicates benign, and orange indicates conditions that can be either. Black indicates melanoma. The first two levels of the taxonomy are used in validation. Testing is restricted to the tasks of b.
b, Malignant and benign example images from two disease classes. These test images highlight the difficulty of malignant versus benign discernment for the three medically critical classification tasks we consider: epidermal lesions, melanocytic lesions and melanocytic lesions visualized with a dermoscope. Example images reprinted with permission from the Edinburgh Dermofit Library (https://licensing.eri.ed.ac.uk/i/software/dermofit-image-library.html).
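The training classes referred to in these legends come from a partitioning algorithm that walks the taxonomy under a maxClassSize constraint (1,000 images, per the earlier legend). The paper's exact algorithm is described in its Methods; the recursive top-down sketch below is only one plausible reading of that description, and the toy taxonomy with its image counts is invented for illustration.

```python
# Plausible sketch of top-down taxonomy partitioning: a node becomes a training
# class if its subtree holds at most max_class_size images; otherwise we recurse
# into its children. Illustrative reading only, not necessarily the paper's
# exact algorithm; the toy taxonomy and image counts below are invented.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    image_count: int = 0                 # images attached at this node (leaves only here)
    children: list["Node"] = field(default_factory=list)

    def subtree_size(self) -> int:
        return self.image_count + sum(c.subtree_size() for c in self.children)

def partition(node: Node, max_class_size: int, classes: list[Node] | None = None) -> list[Node]:
    """Collect training classes: the highest nodes whose subtrees fit the size cap."""
    if classes is None:
        classes = []
    if node.subtree_size() <= max_class_size or not node.children:
        classes.append(node)             # small enough (or a leaf): one training class
    else:
        for child in node.children:      # too large: split into children
            partition(child, max_class_size, classes)
    return classes

# Toy taxonomy with invented image counts at the leaves.
root = Node("melanocytic lesions", children=[
    Node("melanoma", children=[Node("amelanotic melanoma", 700),
                               Node("lentigo melanoma", 400)]),
    Node("nevus", children=[Node("blue nevus", 300),
                            Node("congenital nevus", 500)]),
])
print([n.name for n in partition(root, max_class_size=1000)])
# ['amelanotic melanoma', 'lentigo melanoma', 'nevus']
```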
a, The deep learning CNN outperforms the average of the dermatologists at skin cancer classification using photographic and dermoscopic images. Our CNN is tested against at least 21 dermatologists at keratinocyte carcinoma and melanoma recognition. For each test, previously unseen, biopsy-proven images of lesions are displayed, and dermatologists are asked if they would: biopsy/treat the lesion or reassure the patient. Sensitivity, the true positive rate, and specificity, the true negative rate, measure performance. A dermatologist outputs a single prediction per image and is thus represented by a single red point. The green points are the average of the dermatologists for each task, with error bars denoting one standard deviation (calculated from n = 25, 22 and 21 tested dermatologists for keratinocyte carcinoma, melanoma and melanoma under dermoscopy, respectively). The CNN outputs a malignancy probability P per image. We fix a threshold probability t such that the prediction ŷ for any image is ŷ = P ≥ t, and the blue curve is drawn by sweeping t in the interval 0–1. (A minimal code sketch of this threshold sweep is given after panel b below.) The AUC is the CNN's measure of performance, with a maximum value of 1. The CNN achieves superior performance to a dermatologist if the sensitivity–specificity point of the dermatologist lies below the blue curve, which most do. Epidermal test: 65 keratinocyte carcinomas and 70 benign seborrheic keratoses. Melanocytic test: 33 malignant melanomas and 97 benign nevi. A second melanocytic test using dermoscopic images is displayed for comparison: 71 malignant and 40 benign. The slight performance decrease reflects differences in the difficulty of the images tested rather than the diagnostic accuracies of visual versus dermoscopic examination.
b, The deep learning CNN exhibits reliable cancer classification when tested on a larger dataset. We tested the CNN on more images to demonstrate robust and reliable cancer classification. The CNN's curves are smoother owing to the larger test set.
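The blue curve described in panel a is a standard ROC-style sweep over the threshold t. A minimal numpy sketch under placeholder probabilities and labels follows; the grid of thresholds and the trapezoidal AUC are implementation choices, not details taken from the paper.

```python
# Sketch: sensitivity-specificity curve obtained by sweeping the malignancy
# threshold t over [0, 1], plus a trapezoidal AUC. Probabilities and labels
# below are placeholders, not the study's data.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)                              # 1 = malignant, 0 = benign
probs = np.clip(labels * 0.6 + rng.normal(0.2, 0.25, 200), 0, 1)   # fake CNN outputs

thresholds = np.linspace(0.0, 1.0, 101)
sensitivity, specificity = [], []
for t in thresholds:
    pred = probs >= t                                  # predict malignant if P >= t
    tp = np.sum(pred & (labels == 1)); fn = np.sum(~pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0)); fp = np.sum(pred & (labels == 0))
    sensitivity.append(tp / (tp + fn))                 # true positive rate
    specificity.append(tn / (tn + fp))                 # true negative rate

# AUC via the trapezoidal rule over (1 - specificity, sensitivity).
fpr = 1 - np.array(specificity)
sens = np.array(sensitivity)
order = np.argsort(fpr)
fpr, sens = fpr[order], sens[order]
auc = np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2.0)
print(f"AUC = {auc:.3f}")
```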
Here we show the CNN's internal representation of four important disease classes by applying t-SNE, a method for visualizing high-dimensional data, to the last hidden layer representation in the CNN of the biopsy-proven photographic test sets (932 images). Coloured point clouds represent the different disease categories, showing how the algorithm clusters the diseases. Insets show images corresponding to various points. Images reprinted with permission from the Edinburgh Dermofit Library (https://licensing.eri.ed.ac.uk/i/software/dermofit-image-library.html).
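A visualization of this kind can be reproduced in outline by extracting the penultimate-layer features and running t-SNE over them; the feature-extraction trick (replacing the final fully connected layer with an identity) and the t-SNE settings below are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: t-SNE over last-hidden-layer CNN features, coloured by disease class.
# Feature extraction (global-pooled activations before the final fc layer) and
# the t-SNE settings are illustrative assumptions.
import numpy as np
import torch
import torchvision
from sklearn.manifold import TSNE

model = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()          # expose the 2048-d penultimate features
model.eval()

# Placeholder batch standing in for the biopsy-proven test images.
images = torch.rand(32, 3, 299, 299)
class_ids = np.random.randint(0, 4, size=32)   # four disease categories (placeholder)

with torch.no_grad():
    features = model(images).numpy()    # shape (32, 2048)

embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(features)
print(embedding.shape)                  # (32, 2)

# embedding[:, 0] and embedding[:, 1] can then be scatter-plotted, coloured by class_ids.
```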
Grant support:
AR063963/US National Institutes of Health/International
R21 AG044815/AG/NIA NIH HHS/United States
R01 NS089533/NS/NINDS NIH HHS/United States
AG020961/US National Institutes of Health/International
R01 AG009521/AG/NIA NIH HHS/United States
UL1 TR001085/TR/NCATS NIH HHS/United States