Regulatory responses and approval status of artificial intelligence medical devices with a focus on China
As indicated in Table 1, the absence of an official database for AIMDs approved in Europe presents significant challenges in collecting and comparing associated information. Therefore, this paper focuses on a comparative analysis between the USA and China. For a detailed analysis of Europe, interested readers are referred to previous studies7, with a comprehensive review available in Supplementary Tables 1 and 2. The list published by the FDA on October 5, 2022 includes 521 devices, with the first approval dating back to 1995. This number substantially exceeds the 59 AIMDs approved by the NMPA, and even allowing for potential underestimation of the number of AIMDs in China, a large gap remains between the two totals. One contributing factor is the different interpretations and definitions of AI: the FDA's broader definition has led to more AI-enabled medical devices being approved since 1995. Another is that AI adoption in healthcare began later in China than in the USA; the first AIMD in China was not approved until 2020, a device known for computing fractional flow reserve (FFR) using deep learning algorithms. From the information provided by the NMPA, 50 of the 59 devices explicitly mention the use of deep learning, indicating that the scope of AIMDs in China is largely confined to deep learning. By contrast, FDA-approved AIMDs may cover not only third-generation AI (deep learning-based) but also first-generation (rule-based) and second-generation (machine learning-based) AI.
The analysis of AIMD approvals
A common trend in both countries is the concentration of AIMD approvals in radiology, with 23 devices (39%) in China and 392 devices (75·2%) in the USA. This trend can be attributed to the continuously growing volume of radiological imaging data, which has grown disproportionately relative to the number of available trained radiologists9. The primary motivation behind the development and integration of AI-enabled tools in radiology has been to alleviate the increasing workload on radiologists and to enhance the efficiency of clinical care. The remarkable success of deep learning algorithms in image detection, segmentation, and classification tasks has led to the wider application of AI in computer-assisted diagnostics to meet clinical needs. Compared to other medical specialties, radiology data are also more readily accessible and provide a rich resource for scientific and medical exploration.
During our study period, seven devices received approval from both the FDA and the NMPA. Notably, six of these were approved by the FDA in the USA before gaining approval in China, as shown in Fig. 7. The exception is Keya's DeepVessel FFR, which received NMPA approval first. A previous study indicated that most AI-based medical devices first receive CE marking in Europe and subsequently gain approval in the USA10, suggesting that the FDA's regulatory approach for medical devices may be more rigorous than Europe's. Following this logic, companies might prefer to seek FDA approval before pursuing approval in China. It should be noted, however, that approval timing reflects neither the strictness nor the efficiency of a country's regulation. In addition, we identified the home countries of the companies, which might plausibly relate to where a device is first approved. It turned out that all of these companies were based in China, and home country showed little correlation with approval order.
Among the 521 AI-based medical devices with FDA approval, 500 (96%) were approved through the 510(k) regulatory pathway, 18 through the De Novo pathway, and only three via the Premarket Approval pathway. Notably, nearly all of these devices were classified as Class II in the USA. In contrast, all the identified AIMDs approved in China were reviewed as Class III devices. This difference in class distribution might stem from differences in AIMD classification rules between the USA and China. Specifically, in China the determination of an AIMD's category relies on a combination of the product's intended purpose and the maturity of the algorithm used. If the algorithm has low maturity (i.e., it has not been approved before, or its safety and effectiveness have not previously been proved) and is used for decision-support purposes, the associated AIMD is classified as Class III. Since the NMPA approved its first AIMD only in 2020, most algorithms are new to the medical device field and are thus classified as Class III.
China’s regulatory framework includes point-by-point requirements during the review process, as outlined in regulations and standards, aiming to establish a comprehensive life-cycle supervision model. While all regulatory bodies require manufacturers to provide evidence of safety and performance, the NMPA has published a review guideline detailing how manufacturers should provide such evidence. This guideline includes explicit checklists covering the entire algorithm development process, from algorithm specification, data collection, algorithm selection and training, to algorithm testing and validation. It essentially requires developing a ‘digital twin’ of the algorithm to ensure traceability, enabling digital oversight of the entire process and, potentially, reproduction of the algorithm model. This can be interpreted as a “rules” approach, whereas the FDA uses a “standards” approach. Under a rules approach, well-established rules or formulas simplify and speed up the review process and reduce noise, i.e., yield more consistent outcomes that are easier to compare. The standards approach in the USA, by contrast, may leave manufacturers more room for innovation. This difference is consistent with the fact that most devices seek FDA approval first.
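As a concrete illustration, a traceability record of this kind might be serialized as below. This is a minimal sketch only: the field names and values are hypothetical stand-ins, not an NMPA-prescribed schema.

```python
import json

# Minimal, hypothetical traceability record covering the stages the
# guideline enumerates (specification, data collection, training,
# testing/validation); field names are illustrative, not NMPA-specified.
record = {
    "algorithm_specification": {
        "task": "lung nodule detection",
        "architecture": "3D convolutional neural network",
        "version": "1.2.0",
    },
    "data_collection": {
        "sites": 5,
        "cases": 12000,
        "inclusion_criteria": "chest CT, slice thickness <= 1.25 mm",
    },
    "training": {"random_seed": 42, "epochs": 80, "train_val_split": [0.8, 0.2]},
    "testing_validation": {"external_test_sites": 3, "sensitivity": 0.94,
                           "specificity": 0.91},
}

# Persist the record so the full development process can be audited
# and, in principle, the model reproduced.
with open("development_record.json", "w") as f:
    json.dump(record, f, indent=2)
```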
The analysis of regulatory requirements
The NMPA review guideline places special emphasis on an evidence-based approach relying on data, highlighting the significance of data sufficiency, diversity, and representativeness, and addressing data bias in AI algorithm development. Most AI algorithms, particularly those based on deep learning, are heavily data-driven: with the same model structure, different training data can yield vastly different model parameters, so training a model on an “inappropriate” dataset can result in poor performance. The guideline strongly recommends using training data from multiple sources, such as various hospitals, and ensuring balance across disease subtypes and demographic subgroups. This is consistent with the findings of a previous study6, which highlighted that evaluating deep learning models at a single site alone can mask weaknesses in the models and lead to worse performance across multiple sites; in that case study, significant performance drop-offs were observed when within-site tests were compared with evaluations at other sites. To demonstrate data sufficiency, the guideline recommends plotting model performance against the amount of training data used. As the size of the training set increases, model performance is expected to improve and eventually stabilize, which indicates that enough data have been provided. Regarding algorithm testing, while balance in the testing data distribution is not required, the testing data should reflect real-world scenarios, with disease prevalence closely matching that of the target population.
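A minimal sketch of such a data-sufficiency curve follows, assuming scikit-learn and using a synthetic dataset as a stand-in for a manufacturer's own training data; the model and metric are illustrative choices, not requirements from the guideline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a curated medical dataset (hypothetical).
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# Evaluate validation AUC at increasing training-set sizes; a curve
# that rises and then plateaus suggests the training data suffice.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="roc_auc",
)

for n, auc in zip(sizes, val_scores.mean(axis=1)):
    print(f"training size = {n:5d}  mean validation AUC = {auc:.3f}")
```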
In addition, stratified performance evaluations across disease subtypes, device types, geographical regions, and demographic subgroups are specified for analysis. By quantifying how AI performs across diverse populations, such analysis helps to explore the factors (e.g., device type, disease type, demographic group) that might affect model performance, thereby improving the explainability of the devices and yielding a more comprehensive understanding. Moreover, unlike the general performance evaluation on the testing data, which is essential and mandatory for every AIMD, adversarial and stress tests are advisable where applicable to explore the limits of AI models. Besides the functional performance evaluation described above (e.g., sensitivity and specificity for classification tasks, the DICE metric for segmentation tasks), non-functional performance such as timeliness is also important to report and submit: as AI algorithms grow more complex, running time can vary significantly, and waiting time is an important indicator of user satisfaction.
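As an illustration of such stratified, functional evaluation, the sketch below reports sensitivity and specificity per subgroup and a DICE score for a toy segmentation mask; the subgroup labels and predictions are hypothetical.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    # Functional metrics for a binary classification task.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn), tn / (tn + fp)

def dice(mask_true, mask_pred):
    # DICE overlap metric for a segmentation task.
    inter = np.sum(mask_true & mask_pred)
    return 2 * inter / (np.sum(mask_true) + np.sum(mask_pred))

# Hypothetical test set with a subgroup label per case
# (e.g., scanner vendor, region, or demographic group).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array(["A", "A", "A", "B", "B", "B", "B", "A"])

# Stratified evaluation: report metrics per subgroup so that
# performance gaps across populations become visible.
for g in np.unique(group):
    sel = group == g
    sens, spec = sensitivity_specificity(y_true[sel], y_pred[sel])
    print(f"subgroup {g}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")

# Toy segmentation masks for the DICE metric.
mask_a = np.zeros((4, 4), dtype=bool); mask_a[1:3, 1:3] = True
mask_b = np.zeros((4, 4), dtype=bool); mask_b[1:3, 1:4] = True
print(f"DICE = {dice(mask_a, mask_b):.2f}")
```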
Note that before data are ready for AI model development, the data preparation stage, which includes data collection, organization, preprocessing, and annotation, is also an indispensable part of the AI product lifecycle. All of these steps affect data quality, and data annotation deserves special attention because the performance of the developed AI models depends directly on annotation quality. Thus, the details of how the dataset is curated and annotated are also required to be specified.
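One common way to quantify annotation quality is inter-annotator agreement; the sketch below computes Cohen's kappa on hypothetical duplicate labels. The guideline does not mandate this particular statistic; it is shown only as an example of an annotation-quality check.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators' binary labels.
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    observed = np.mean(labels_a == labels_b)
    p_a, p_b = labels_a.mean(), labels_b.mean()
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Hypothetical duplicate annotations of the same 10 cases.
annotator_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
print(f"Cohen's kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```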
We believe that this regulatory strategy, which requires recording every potentially useful piece of information generated during the algorithm development process, is crucial for advancing transparency and enhancing understanding of these medical devices among various stakeholders (e.g., users, buyers, researchers). This rich information, encompassing data aspects, safety, and clinical performance, not only enables learning from the strengths and weaknesses of the medical devices but also facilitates more intuitive comparisons between devices with similar applications. The NMPA has even developed regulation documents for specific diseases, pre-defining the information that manufacturers need to provide and the performance indicators that should be evaluated. As illustrated in Fig. 1, the review guideline outlines general principles, the standards serve as complementary instructions, and the disease/application-specific regulations detail how requirements vary across device types. By adhering to consistent guidelines for similar devices, the information provided by manufacturers becomes uniform, simplifying comparative analysis. However, a common issue in both the USA and China is that this information is currently accessible only to regulators and reviewers, with the public receiving vague and limited details. Greater transparency from the FDA and NMPA would give companies and researchers a clearer understanding of the current landscape and of potential gaps for clinical improvement, and would enable stakeholders to compare and assess these tools more effectively when making informed decisions about commercial AI solutions for clinical use. The recent surge in academic progress has been matched by a flurry of activity in companies, with a multitude of commercial AI-based solutions now available (as shown in Fig. 3). For instance, 11 devices have been approved for lung nodule detection and 8 for diagnosing diabetic retinopathy or chronic glaucoma. Without sufficient information from the NMPA, it is challenging for radiologists to choose the most suitable commercial AI solution for their specific needs.
Even though AIMD approvals are trending upward, the number of approved AIMDs is still far smaller than the number of AI models reported in research. In research, there are also guidelines that researchers follow for reporting and assessing the risk of bias in AI models, such as TRIPOD-AI11, SPIRIT-AI12, CONSORT-AI10, and DECIDE-AI13. These reporting guidelines are being developed in close collaboration with key stakeholders, including clinicians, computer scientists, journal editors, researchers, industry leaders, regulators, funders, policy makers, and patient groups, to generate a consensus applicable across the community. Furthermore, groups from typically underrepresented regions are also engaged to ensure the guidelines are applicable internationally14. By contrast, different regulatory agencies use highly variable methods for definition, production, approval, marketing, and postmarketing surveillance. In addition, research guidelines typically focus on one part of the development pathway for AI systems, with an emphasis on scientific evaluation and reporting, whereas government regulations cover whole-lifecycle supervision and pay attention to practical implementation and ethical aspects. A recent review likewise placed legal and regulatory aspects among the key elements of ethics15. Moreover, compliance with government regulations must be demonstrated through a series of systematic reviews before AIMD approval, and approved AIMDs remain under the sole accountability of the manufacturer. One study found that, among its cases, journal publications did not acknowledge the key limitations of the available evidence identified in regulatory documents16. Given the somewhat different focuses of research guidelines and regulatory assessment, as well as the differences across regulatory jurisdictions35, there are gaps between the guidelines and government regulations. These gaps are part of the reason why the number of approved AI models is far smaller than the number published in research. It is also important to note that, besides regulations, which serve as a downstream gatekeeper for the AI industry, upstream factors in the AI supply chain, such as base model development and dataset quality, may be another part of the reason and warrant attention.