International evaluation of an AI system for breast cancer screening


Scott Mayer McKinney *, Marcin T. Sieniek *, Varun Godbole *, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian *, Trevor Back, Mary Chesus, Greg C Corrado *, Ara Darzi *, Mozziyar Etemadi *, Florencia Garcia-Vicente *, Fiona J Gilbert *, Mark Halling-Brown *, Demis Hassabis, Sunny Jansen *, Alan Karthikesalingam, Christopher J Kelly, Dominic King, Joseph Ledsam, David Melnick *, Hormuz Mostofi *, Bernardino Romera Paredes, Lily Peng *, Joshua Jay Reicher *, Richard Sidebottom *, Mustafa Suleyman, Daniel Tse *, Kenneth C. Young *, Jeffrey De Fauw, Shravya Shetty * (*external authors)

Breast cancer is the second leading cause of death from cancer in women, but outcomes have been shown to improve when the disease is caught and treated early. This is why many countries around the world have set up breast cancer screening programmes, which aim to identify breast cancer at earlier stages, when treatment can be more successful.

However, interpreting mammograms (breast x-rays) remains challenging, as evidenced by the high variability of experts’ performance in detecting cancer. In this collaborative research with Google Health & Cancer Research UK Imperial Centre, Northwestern University, and Royal Surrey County Hospital, now published in Nature, we developed an AI system capable of surpassing clinical specialists from the UK and US in predicting breast cancer from mammograms, as confirmed by biopsy.

Breast cancer screening datasets

Breast cancer screening programmes vary from country to country. In the US, women are typically screened every one to two years, and their mammograms are interpreted by a single radiologist. In the UK, women are screened every three years, but each mammogram is interpreted by two radiologists, with an arbitration process in case of disagreement. We utilised large datasets collected in both countries to develop and evaluate this AI system.

The UK evaluation dataset consisted of a random sample of 10% of all women with screening mammograms at two sites in London between 2012 and 2015. It included 25,856 women, 785 of whom had a biopsy and 414 of whom had cancer diagnosed within three years of imaging. These de-identified data were collected as part of the OPTIMAM database effort by Cancer Research UK, and are subject to strict privacy constraints.

The US evaluation dataset consisted of de-identified screening mammograms of 3,097 women collected between 2001 and 2018 from one academic medical centre. We included images from all 1,511 women who were biopsied during this time period and a random subset of women who never underwent biopsy. Among the women who received a biopsy, 686 were diagnosed with cancer within two years of imaging.

Each mammogram has four images - two of each breast from different angles.

Assessing the performance of the AI system

We compared the performance of the AI system against the decisions made by individual human specialists at the original screening visit. In this evaluation, the AI achieved an absolute reduction in false positives (women incorrectly referred for further investigation) of 5.7% for US subjects and 1.2% for UK subjects, and a reduction in false negatives (women with cancer who were not referred for further investigation) of 9.4% for US subjects and 2.7% for UK subjects, compared to human experts. See the paper for more extensive results.
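To make the terminology concrete, here is a minimal sketch of how false positive and false negative rates, and the absolute difference between two readers, can be computed from confusion-matrix counts. It assumes that "absolute reduction" means the difference between the two rates in percentage points. The class and function names are illustrative, and the counts are hypothetical numbers chosen for easy arithmetic, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Confusion-matrix counts for one reader (human or AI) on one dataset."""
    true_pos: int   # women with cancer who were referred
    false_pos: int  # women without cancer who were referred
    true_neg: int   # women without cancer who were not referred
    false_neg: int  # women with cancer who were not referred


def false_positive_rate(c: ScreeningCounts) -> float:
    # Share of cancer-free women who were referred anyway.
    return c.false_pos / (c.false_pos + c.true_neg)


def false_negative_rate(c: ScreeningCounts) -> float:
    # Share of women with cancer who were not referred.
    return c.false_neg / (c.false_neg + c.true_pos)


def absolute_reduction_pp(reader_rate: float, ai_rate: float) -> float:
    # Difference in percentage points; positive means the AI made fewer errors.
    return (reader_rate - ai_rate) * 100.0


# Hypothetical counts (NOT from the paper): 1,000 women without cancer, 100 with.
reader = ScreeningCounts(true_pos=60, false_pos=120, true_neg=880, false_neg=40)
ai = ScreeningCounts(true_pos=65, false_pos=100, true_neg=900, false_neg=35)

fp_reduction = absolute_reduction_pp(false_positive_rate(reader), false_positive_rate(ai))
fn_reduction = absolute_reduction_pp(false_negative_rate(reader), false_negative_rate(ai))

print(f"False positive reduction: {fp_reduction:.1f} percentage points")
print(f"False negative reduction: {fn_reduction:.1f} percentage points")
```

With these made-up counts, the false positive rate drops from 12% to 10% and the false negative rate from 40% to 35%. The two rates use different denominators (women without cancer and women with cancer respectively), so their reductions are not directly comparable in size.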

The AI system accurately predicts, solely from screening mammograms, whether a patient will have a biopsy positive for breast cancer (note that only a small fraction of screening visits will result in a biopsy). It does so more accurately than individual human specialists, with lower false positive rates (for cancer prediction within three years for the UK dataset, and within two years for the US dataset). The top left indicates peak performance, with no false positives or false negatives. Credit: McKinney et al, Nature

Generalisation across populations

To evaluate whether the AI system was able to generalise across populations and screening settings, we ran an experiment in which the AI was only allowed to learn from data from UK subjects, and then evaluated it on data from US subjects. This experiment showed that the AI system still surpassed human expert performance on US data.

This is an encouraging avenue for future research and gives more confidence about the robustness of the AI system. An AI diagnostic system might be beneficial even in areas where there is not a significant history of screening mammography on which to train it.

Future research & potential applications

We’ve yet to determine how best to deploy an AI system for clinical use in mammography. However, we investigated one such scenario by using the AI system as a “second reader”. We simulated this by treating the prediction of the AI system as an independent second opinion for every mammogram, taking the place of the second reader in the UK “double reading” system. When the AI and the clinician disagreed, the existing arbitration process would take place. In these simulated experiments, we showed that an AI-aided double-reading system could achieve non-inferior performance to the UK system with only 12% of the current second reader workload.
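The simulated workflow described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code used in the study. The function name, the boolean "refer / do not refer" encoding, and the `arbitrate` callback are all hypothetical. The sketch also assumes that the share of cases escalated to arbitration is a reasonable proxy for the remaining human second-read workload.

```python
from typing import Callable, List, Sequence, Tuple


def ai_aided_double_reading(
    first_reader: Sequence[bool],
    ai_reader: Sequence[bool],
    arbitrate: Callable[[int], bool],
) -> Tuple[List[bool], float]:
    """Combine a human first read with an AI second read.

    Each decision is True (refer for further investigation) or False.
    `arbitrate(i)` is a hypothetical stand-in for the existing arbitration
    process; it is only called for case i when the two reads disagree.
    Returns the final decisions and the fraction of cases escalated.
    """
    if len(first_reader) != len(ai_reader):
        raise ValueError("Both readers must assess the same cases")
    if not first_reader:
        raise ValueError("At least one case is required")

    final: List[bool] = []
    escalated = 0
    for i, (human, ai) in enumerate(zip(first_reader, ai_reader)):
        if human == ai:
            # Agreement: the shared decision stands, no further human read.
            final.append(human)
        else:
            # Disagreement: fall back to the arbitration process.
            escalated += 1
            final.append(arbitrate(i))
    return final, escalated / len(first_reader)


# Toy example with five made-up cases (not data from the study).
first = [True, False, False, True, False]
ai = [True, False, True, True, False]
arbitration_outcomes = [True, False, False, True, False]

decisions, escalated_fraction = ai_aided_double_reading(
    first, ai, lambda i: arbitration_outcomes[i]
)
print(decisions, f"{escalated_fraction:.0%} of cases escalated")
```

In this toy run the two reads disagree on one case out of five, so only that case reaches arbitration. In a real evaluation, the final decisions would then be compared against biopsy-confirmed outcomes to test for non-inferiority.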

Further research, including prospective clinical studies, will be required to understand the full extent to which this technology can benefit breast cancer screening programmes.