Ability of Current Machine Learning Algorithms to Predict and Detect Hypoglycemia in Patients With Diabetes Mellitus: Meta-analysis

Background: Machine learning (ML) algorithms have been widely introduced to diabetes research, including those for the identification of hypoglycemia.
Objective: The objective of this meta-analysis is to assess the current ability of ML algorithms to detect hypoglycemia (ie, alert to hypoglycemia coinciding with its symptoms) or predict hypoglycemia (ie, alert to hypoglycemia before its symptoms have occurred).
Methods: Electronic literature searches (from January 1, 1950, to September 14, 2020) were conducted using the Dialog platform, which covers 96 databases of peer-reviewed literature. Included studies had to train the ML algorithm in order to build a model to detect or predict hypoglycemia and test its performance. The set of 2 × 2 data (ie, number of true positives, false positives, true negatives, and false negatives) was pooled with a hierarchical summary receiver operating characteristic model.
Results: A total of 33 studies (14 studies for detecting hypoglycemia and 19 studies for predicting hypoglycemia) were eligible. For detection of hypoglycemia, pooled estimates (95% CI) of sensitivity, specificity, positive likelihood ratio (PLR), and negative likelihood ratio (NLR) were 0.79 (0.75-0.83), 0.80 (0.64-0.91), 8.05 (4.79-13.51), and 0.18 (0.12-0.27), respectively. For prediction of hypoglycemia, pooled estimates (95% CI) were 0.80 (0.72-0.86) for sensitivity, 0.92 (0.87-0.96) for specificity, 10.42 (5.82-18.65) for PLR, and 0.22 (0.15-0.31) for NLR.
Conclusions: Judged against the criteria for diagnostic tests in the Users’ Guide to Medical Literature (PLR should be ≥5 and NLR should be ≤0.2 for moderate reliability), current ML algorithms have insufficient ability to detect ongoing hypoglycemia and considerable ability to predict impending hypoglycemia in patients with diabetes mellitus using hypoglycemic drugs. However, the clinical applicability of these ML algorithms should be evaluated according to patients’ risk profiles, such as the risk of hypoglycemia and its associated complications (eg, arrhythmia, neuroglycopenia), as well as the average ability of the ML algorithms. Continued research is required to develop more accurate ML algorithms than those that currently exist and to enhance the feasibility of applying ML in clinical settings.
Trial Registration: PROSPERO International Prospective Register of Systematic Reviews CRD42020163682; http://www.crd.york.ac.uk/PROSPERO/display_record.php?ID=CRD42020163682


Introduction
Hypoglycemia is a major barrier to achieving the tight glycemic control in patients with diabetes mellitus (DM) that is required to delay the progression of late DM-related complications. Although many patients exhibit symptoms of hypoglycemia such as anxiety, heart palpitations, and confusion, a significant number have a diminished ability to recognize these hypoglycemic symptoms [1,2], which is defined as "impaired awareness of hypoglycemia" [3]. This impaired awareness can lead to severe hypoglycemia, which is associated with seizures, coma, and death. Real-time glucose monitoring can help patients maintain optimal glycemic control while avoiding symptomatic or asymptomatic hypoglycemia [4]. However, the traditional monitoring method, intermittent glucose monitoring by finger stick, provides only a limited number of readings and is unlikely to detect hypoglycemia of a short duration. Continuous glucose monitoring (CGM) typically produces a reading every 5 minutes and can alert the patient not only to the occurrence of hypoglycemia but also to impending hypoglycemia [5]. The accuracy of CGM has progressively improved, with overall measurement errors reduced twofold compared with the first commercially available CGM devices introduced in 2000 [5].
However, even if CGM advancements enabled patients to continuously track their subcutaneous glucose levels, the statistical disadvantage of the CGM data stream would remain as a major limitation. The autocorrelation of the CGM reading vanishes after 30 minutes, meaning that the projection of blood glucose levels more than 30 minutes ahead would be inaccurate [6]. This finding suggests that the algorithm for identifying hypoglycemia should consider a patient's contextual information such as diet, physical activity, and medications (including insulin) as well as various features of the CGM trend arrow [7].
Machine learning (ML) algorithms have been widely introduced to diabetes research, including those for identification of hypoglycemia. The growing use of mobile health (mHealth) apps, sensors, wearables, and other point-of-care devices, including CGM sensors for self-monitoring and management of DM, has made possible the generation of automated and continuous diabetes-related data and created the opportunity for applying ML to automated decision support systems [8].
Combining ML-based decision support systems with the abundance of generated data has the potential to identify hypoglycemia with greater accuracy.
Conventionally, ML has been applied to detect abnormalities in blood glucose levels using physiological parameters that are highly correlated with hypoglycemia (eg, changes in brain or cardiac electrical activities) [7]. Recently, in addition to the detection of hypoglycemia, ML-based decision support systems have been proposed for predicting hypoglycemia by using various historical data (eg, series of blood glucose data, other laboratory and demographic data, verbal data in medical records, or secure messages suggesting occurrence of hypoglycemic events) [8]. Despite many reports of ML algorithms for detecting or predicting hypoglycemia, their abilities have not been comprehensively or quantitatively assessed. This meta-analysis aims to assess the current ability of ML algorithms to detect or predict hypoglycemia in patients with DM.

Protocol Registration
The study protocol has been registered in the international prospective register of systematic reviews (PROSPERO; Registration ID: CRD42020163682).

Literature Searches
We used Dialog to perform the electronic literature searches. The platform allows users to access and search 96 databases of peer-reviewed literature. Publication dates ranged from January 1, 1950, to September 14, 2020. Search terms consisted of 2 elements: (1) thesaurus and text words related to ML and (2) text terms related to hypoglycemia and thesaurus terms related to glucose monitoring or blood glucose. The use of thesaurus terms was limited to 2 databases: EMBASE (EMTREE terms) and MEDLINE (MeSH terms). The above 2 elements were combined using the Boolean operator "AND" (Multimedia Appendix 1). Manual searches were added to review reference lists in relevant studies. If eligible studies were obtained from the reference lists, the reference lists in those studies were also examined. Manual searches were continued until no eligible study was found in the reference lists.
Study inclusion criteria were (1) all participants had DM; (2) study endpoint was hypoglycemia; (3) researchers clarified that they originally trained the ML algorithm using training data to build a model for detecting or predicting hypoglycemia or the same researchers trained the ML algorithm in a previous study; (4) the model's performance was tested using the test data; and (5) sensitivity and specificity for detection or prediction of hypoglycemia were presented or could be calculated.
Exclusion criteria were (1) an event-based study (ie, specificity could not be estimated because nonhypoglycemia data were not included in the test data), (2) a case study (ie, training and test data were derived from only 1 patient), and (3) a 2 × 2 contingency table consisting of the number of true positives, false positives, true negatives, and false negatives could not be reproduced. If studies met all of the inclusion criteria but did not allow the reproduction of a 2 × 2 contingency table, we asked the corresponding author of these studies for the total number of test data sets (N-total) and events (N-hypo) so that we could reproduce the 2 × 2 table. If the same test data were shared by 2 or more eligible studies, we chose the most recent study in which the ML algorithm was considered to show the best performance.
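The reproduction of a 2 × 2 table from reported summary figures can be illustrated with a minimal Python sketch. It assumes that a study reports sensitivity and specificity and that N-total and N-hypo are known; the function name `reconstruct_2x2` and the example values are hypothetical and are not taken from any included study. Rounding to whole counts is an assumption of the sketch.

```python
def reconstruct_2x2(sensitivity, specificity, n_total, n_hypo):
    """Rebuild 2 x 2 counts from reported sensitivity and specificity.

    n_total: total number of test data sets (N-total)
    n_hypo:  number of hypoglycemic events among them (N-hypo)
    Counts are rounded to whole numbers (an assumption of this sketch).
    """
    n_non_hypo = n_total - n_hypo
    tp = round(sensitivity * n_hypo)        # events correctly flagged
    fn = n_hypo - tp                        # events missed
    tn = round(specificity * n_non_hypo)    # non-events correctly passed
    fp = n_non_hypo - tn                    # false alarms
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical study: 1000 test data sets, 100 of them hypoglycemic.
table = reconstruct_2x2(sensitivity=0.80, specificity=0.90,
                        n_total=1000, n_hypo=100)
print(table)
```

An event-based study has no nonhypoglycemic test data, so `n_non_hypo` would be 0 and neither TN nor FP could be filled in, which is why such studies were excluded.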
The outcome of meta-analyses of diagnostic or prognostic tests is the extent of consistency between an index test and a reference standard. The index test is defined as a new test that is proposed when the method for perfectly diagnosing a target condition in all individuals does not exist or cannot be used. In this meta-analysis, it corresponded to an ML algorithm that classified the input data as either hypoglycemia or nonhypoglycemia. The reference standard is defined as a procedure that is considered the best available method for categorizing participants into having or not having a target condition. In this meta-analysis, it corresponded to methods for diagnosing hypoglycemia in clinical practice, which included measurement of glucose levels, the International Classification of Diseases (ICD) code for hypoglycemia, or experts' subjective judgment. The evaluated item was the ability of ML algorithms to detect hypoglycemia (ie, alert to hypoglycemia coinciding with its symptoms) or the ability to predict hypoglycemia (ie, alert to hypoglycemia before its symptoms have occurred). In studies that assessed the ability for detection, data used for the index test (ie, the ML algorithm) and data used for the reference standard (ie, diagnosing hypoglycemia) had to be examined at the same time. In studies assessing predictive ability, the data input into the ML algorithm had to be examined before the diagnosis of hypoglycemia.

Data Extraction
Data were extracted by two authors (SK and KF). Disagreements were resolved by discussion with a third author (HiS). As a rule, we selected 1 datum if there were 2 or more extractable data for a set of test data in an individual study. If an individual study tested 2 or more ML classification methods or 2 or more models for 1 ML classifier, we extracted the datum related to the classifier or model that the study proposed as the best. If 2 or more different results were presented for the same model depending on the prediction window or horizon, we extracted the result for the longest prediction window or horizon.
The following study characteristics were extracted: first author, publication year, evaluated item (ie, detecting or predicting hypoglycemia), country, type of DM (ie, type 1 or type 2), number of study participants, N-total, N-hypo, mean or range of the patients' age, time of day of hypoglycemic events, place of supposed hypoglycemic episode (ie, experimental, in-hospital, and out-of-hospital), ML algorithm used for classification into hypoglycemia and nonhypoglycemia, threshold of glucose level for hypoglycemia, method for diagnosing hypoglycemia, method for separating the database into training and test data, and profiling data that were input into ML algorithms for performance testing.

Study Quality
To evaluate study quality, we used the revised Quality Assessment of Diagnostic Accuracy Studies tool (QUADAS-2). The QUADAS-2 consists of 4 domains: selection of participants, index test, reference standard, and flow and timing. All 4 domains were used for assessment of risk of bias and the first 3 domains were used to assess concerns regarding applicability. Each domain has 1 query in relation to the risk of bias or applicability consisting of 7 questions (Multimedia Appendix 2) [9]. A "Yes" answer was assigned 1 point.

Data Synthesis
The ability of ML algorithms to detect hypoglycemia and predict hypoglycemia was independently assessed. For data that were used to test the model's performance, the number of true positives, false positives, true negatives, and false negatives was calculated. The set of 4 data was pooled with a hierarchical summary receiver operating characteristic (HSROC) model [10]. Indicators for the model's performance included sensitivity, specificity, positive likelihood ratio (PLR), which is calculated as (sensitivity/[1-specificity]), and negative likelihood ratio (NLR), which is calculated as ([1-sensitivity]/specificity). Study heterogeneity was assessed by calculating I² values for PLR and NLR based on a multivariate random-effects meta-regression that considered within- and between-study correlations [11] and classifying them into quartiles (0% to <25%, low; 25% to <50%, low-to-moderate; 50% to <75%, moderate-to-high; ≥75%, high) [12]. Publication bias was statistically assessed as proposed by Deeks et al [13], wherein the logarithm of the diagnostic odds ratio is regressed against its corresponding inverse of the square root of the effective sample size.
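For a single study, the 4 performance indicators defined above follow directly from the 2 × 2 counts. The following Python sketch is illustrative only: the function name `diagnostic_metrics` and the counts are hypothetical, and the pooled estimates in this meta-analysis came from the HSROC model rather than from this per-study arithmetic.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Per-study indicators computed from 2 x 2 counts.

    PLR = sensitivity / (1 - specificity)
    NLR = (1 - sensitivity) / specificity
    Note: PLR is undefined when specificity equals 1 (no false positives),
    and NLR is undefined when specificity equals 0.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    plr = sensitivity / (1 - specificity)
    nlr = (1 - sensitivity) / specificity
    return {"sensitivity": sensitivity, "specificity": specificity,
            "PLR": plr, "NLR": nlr}


# Hypothetical counts: 100 hypoglycemic and 900 nonhypoglycemic test data.
metrics = diagnostic_metrics(tp=80, fp=90, fn=20, tn=810)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```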
Sensitivity analyses were added in which the analysis was limited to studies that shared similar characteristics in terms of the type of DM, time of day when hypoglycemia occurred, place of supposed hypoglycemic events, and the profiling data input into the ML algorithm. It is of note that at least 4 data sets are necessary to perform these sensitivity analyses because the HSROC model has 4 parameters: sensitivity, specificity, accuracy, and threshold. A two-sided P value <.05 was considered statistically significant. All statistical analyses were performed using Stata 16 (StataCorp).
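The regression used for the publication bias assessment in this section can be sketched in plain Python. This is a minimal illustration, not the Stata procedure used in the analysis. It assumes the effective sample size is 4·n1·n2/(n1+n2), where n1 and n2 are the numbers of hypoglycemic and nonhypoglycemic test data, that the regression is weighted by the effective sample size, and that 0.5 is added to every cell when any cell is 0. All function names and example tables are hypothetical, and the significance test on the slope is omitted.

```python
import math


def effective_sample_size(n_hypo, n_non_hypo):
    """Effective sample size: 4 * n1 * n2 / (n1 + n2)."""
    return 4 * n_hypo * n_non_hypo / (n_hypo + n_non_hypo)


def log_diagnostic_odds_ratio(tp, fp, fn, tn):
    """ln(DOR) = ln((TP * TN) / (FP * FN)), with a 0.5 continuity
    correction when any cell is zero (an assumption of this sketch)."""
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (cell + 0.5 for cell in (tp, fp, fn, tn))
    return math.log((tp * tn) / (fp * fn))


def weighted_linear_fit(xs, ys, weights):
    """Weighted least squares for y = intercept + slope * x."""
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def deeks_regression(tables):
    """Regress ln(DOR) on 1/sqrt(ESS), weighting each study by its ESS.

    tables: iterable of (tp, fp, fn, tn) tuples, one per study.
    Returns (slope, intercept); a slope far from 0 suggests asymmetry.
    """
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in tables:
        ess = effective_sample_size(tp + fn, tn + fp)
        xs.append(1 / math.sqrt(ess))
        ys.append(log_diagnostic_odds_ratio(tp, fp, fn, tn))
        ws.append(ess)
    return weighted_linear_fit(xs, ys, ws)


# Hypothetical 2 x 2 tables (tp, fp, fn, tn) for three studies.
example_tables = [(80, 90, 20, 810), (30, 10, 10, 50), (15, 5, 5, 20)]
slope, intercept = deeks_regression(example_tables)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

In practice the slope would be tested against 0, with the two-sided P value threshold stated above.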

Principal Findings
Overall, the PLR and NLR of ML algorithms for detecting hypoglycemia were 4.05 and 0.26, respectively. These estimates were almost unchanged throughout several sensitivity analyses that were limited to studies that shared 1 characteristic in common. According to the Users' Guide to Medical Literature with regard to diagnostic tests [56], the PLR should be 5 or more to moderately increase the probability of persons having or developing a disease and the NLR should be 0.2 or less to moderately decrease the probability of having or developing a disease after taking the index test. In summary, the current ML algorithms had insufficient ability to detect the occurrence of hypoglycemia. However, that would not mean that ECG or EEG monitoring in combination with ML, which was the case with 79% (11/14) of the included studies, was useless in detecting hypoglycemia. For example, for patients with both DM and high cardiovascular risk, in particular, those who are vulnerable to cardiac arrhythmias, using ECGs for detecting hypoglycemia is useful considering that a hypoglycemia-induced arrhythmia could contribute to increased cardiovascular mortality [57]. Similarly, for patients with repeated episodes of hypoglycemia, the combination of ML and EEG was indicated to be beneficial to prevent hypoglycemia-induced neuroglycopenia resulting in cognitive impairment and ultimately death, because blood glucose levels alone do not appear to predict that condition [58].
Thus, the clinical applicability of these devices should be evaluated according to the individual's risk of hypoglycemia and its related arrhythmia and neuroglycopenia as well as the overall ability of the ML algorithms.
The overall sensitivity, specificity, PLR, and NLR for predicting hypoglycemia were 0.80, 0.92, 10.42, and 0.22, respectively. Applying the above-described guidelines for diagnostic tests to these results, it is worth considering the use of current ML algorithms as a tool for alerting patients to impending hypoglycemic events. In addition, a test with a PLR over 10 is considered to have particularly strong power to alter the posttest probability of the targeted disease compared with the pretest probability [56]. If a positive test result were to be received, patients with DM who are administered hypoglycemic treatments would be strongly recommended to pay more attention to the possibility of impending hypoglycemic events than they would before receiving the predictive test for hypoglycemia. However, considering that the PLR and NLR values indicate relative risk (ie, risk of disease at posttest compared with that at pretest), the accuracy of predictive ability depends on patients' risk of hypoglycemia in daily life. For example, even a less than 10% false-positive rate (8% in this meta-analysis) may be acceptable in patients at high risk of hypoglycemia but not in low-risk individuals because of too frequent false alarms. In such a case, there is fear that these patients will ignore the alarms and therefore miss the opportunity to take corrective action when the alarm is indeed true [59]. It is emphasized that the utility of an ML algorithm depends on the extent of the patient's risk of hypoglycemia. In addition, as indicated in the "Results" section, there was high between-study heterogeneity. Specifically, when limiting analyses to the studies that predicted nocturnal hypoglycemia, the predictive ability was insufficient (pooled estimate: 3.98 for PLR; 0.31 for NLR).
Considering that nocturnal hypoglycemia is the most common type of hypoglycemia among all hypoglycemic episodes [60], continued research is needed for further development of ML algorithms to predict hypoglycemia.
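The dependence of an alarm's meaning on the patient's baseline risk can be made concrete with the standard odds form of Bayes' theorem (posttest odds = pretest odds × likelihood ratio). The sketch below uses the pooled PLR and NLR for prediction reported above; the pretest probabilities of 5% and 30% are hypothetical values chosen only to contrast a low-risk and a high-risk patient, and the function name is illustrative.

```python
def posttest_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability into a posttest probability.

    posttest odds = pretest odds * likelihood ratio
    """
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


POOLED_PLR = 10.42  # pooled PLR for prediction in this meta-analysis
POOLED_NLR = 0.22   # pooled NLR for prediction in this meta-analysis

# Hypothetical pretest probabilities of an impending hypoglycemic event.
for pretest in (0.05, 0.30):
    after_alarm = posttest_probability(pretest, POOLED_PLR)
    after_no_alarm = posttest_probability(pretest, POOLED_NLR)
    print(f"pretest {pretest:.0%}: "
          f"alarm -> {after_alarm:.1%}, no alarm -> {after_no_alarm:.1%}")
```

Under these assumed pretest probabilities, most alarms in the low-risk patient would still be false, whereas most alarms in the high-risk patient would be true, which is the alarm-fatigue concern raised above.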
Several limitations of this meta-analysis should be addressed. First, the major limitation is the pooling of studies among which there was much variability in the type of DM, profiling data for detecting or predicting hypoglycemia, time of day when hypoglycemic events occurred, setting of supposed hypoglycemic events, and ML classification methods. In particular, although the ability for predicting hypoglycemia depended largely on the ML classification methods [33], this meta-analysis did not consider the difference in the test performance among various ML methods. Instead, the meta-analysis focused on ML's comprehensive ability across studies using data in relation to the best model in each study, if 2 or more models existed, rather than comparisons among 2 or more models within 1 study. Given that generalization of evidence is among the most important roles in all meta-analyses, the issue of the variation in ML methods, in particular, the difference between old and new ML techniques, might be beyond the scope of this meta-analysis. Nevertheless, it should be emphasized that successful application of ML lies in the correct understanding of the advantages and disadvantages of different ML methods. Second, only 3 studies exclusively targeted patients with type 2 DM. With the increasing use of insulin to treat type 2 DM in the elderly, the prevalence of hypoglycemia is likely to escalate. In addition, the response to hypoglycemia is different between type 1 and type 2 DM [61]. Future studies should aim to develop and validate ML algorithms for detecting or predicting hypoglycemia in type 2 DM. Third, in most of the included studies, the ML classification models were developed in an experimental setting or by using previously recorded data as training and testing data instead of live data. Future studies need to train and test the algorithm on data from patients with DM in everyday clinical practice to determine feasibility.

Conclusion
Overall, current ML algorithms have insufficient ability to detect ongoing hypoglycemia and considerable ability to predict hypoglycemia in patients with DM receiving hypoglycemic treatments. However, the clinical applicability of these ML algorithms should be evaluated according to patients' risk profiles such as for hypoglycemia and its associated complications (eg, arrhythmia, neuroglycopenia) as well as the average ability of the ML algorithm. Continued research is required to further develop ML algorithms to enhance their feasibility, considering the inaccuracy of CGM in the hypoglycemic range, the increased prevalence of hypoglycemia in the elderly, and increasing evidence for the effectiveness of tight glycemic control in preventing microvascular complications [62].