Performance of Risk Assessment Models for Prevalent or Undiagnosed Type 2 Diabetes Mellitus in a Multi-Ethnic Population—The Helius Study

Background: Most risk assessment models for type 2 diabetes (T2DM) have been developed in Caucasians and Asians; little is known about their performance in other ethnic groups. Objective(s): We aimed to identify existing models for the risk of prevalent or undiagnosed T2DM and externally validate them in a multi-ethnic population currently living in the Netherlands. Methods: A literature search to identify risk assessment models for prevalent or undiagnosed T2DM was performed in PubMed until December 2017. We validated these models in 4,547 Dutch, 3,035 South Asian Surinamese, 4,119 African Surinamese, 2,326 Ghanaian, 3,598 Turkish, and 3,894 Moroccan origin participants from the HELIUS (Healthy LIfe in an Urban Setting) cohort study performed in Amsterdam. Model performance was assessed in terms of discrimination (C-statistic) and calibration (Hosmer-Lemeshow test). Results: We identified 25 studies containing 29 models for prevalent or undiagnosed T2DM. C-statistics varied between 0.77–0.92 in Dutch, 0.66–0.83 in South Asian Surinamese, 0.70–0.82 in African Surinamese, 0.61–0.81 in Ghanaian, 0.69–0.86 in Turkish, and 0.69–0.87 in the Moroccan populations. The C-statistics were generally lower among the South Asian Surinamese, African Surinamese, and Ghanaian populations and highest among the Dutch. Calibration was poor (Hosmer-Lemeshow p < 0.05) for all models except one. Conclusions: Generally, risk models for prevalent or undiagnosed T2DM show moderate to good discriminatory ability in different ethnic populations living in the Netherlands, but poor calibration. Therefore, these models should be recalibrated before use in clinical practice and should be adapted to the situation of the population they are intended to be used in.


Introduction
There is evidence that a substantial proportion of people living with T2DM may be undiagnosed [1]. Due to its asymptomatic nature at onset, T2DM diagnosis is often delayed. Consequently, by the time of diagnosis, most patients have one or more vascular complications [2]. Therefore, early detection of those with T2DM is essential for the initiation of relevant interventions, which could prevent or delay associated complications. Based on recent systematic reviews [3,4], a substantial number of risk assessment models have been developed for prevalent or undiagnosed T2DM. However, about 20% of these models have not been externally validated [5-7]. Evaluating model performance in the population in which the model was developed gives optimistic results. Therefore, external validation in an independent population is necessary to examine the model's generalizability [8]. This may be even more urgent as most of these models were developed in Caucasian and Asian populations [3,4]. Different ethnic groups have varying risks of T2DM. For example, South Asian origin populations have a higher risk compared to Caucasian populations [9]. Furthermore, T2DM risk factors could differ between ethnic groups; hence, risk factors that are informative in one ethnic group may be uninformative in another. The burden of T2DM in migrant ethnic minorities is an increasing health concern in Europe. Due to factors like socio-economic status, high-risk migrant groups are usually under-treated. This has led to higher T2DM prevalence among migrant groups than among the native population [10]. Therefore, more attention to screening of high-risk individuals in these ethnic minorities is warranted. Yet, very few studies (<2%) have externally validated risk assessment models for prevalent or undiagnosed T2DM in ethnic populations different from the development population (validation C-statistics ranging from 0.59-0.80) [3].
Consequently, little is known about the performance of such models across different ethnic groups. We therefore aimed to identify existing risk assessment models for the risk of prevalent or undiagnosed T2DM and externally validate them in a large multi-ethnic cohort, including validation in the ethnic subgroups represented therein.

Search strategy and selection
From literature, we identified existing risk assessment models for the risk of prevalent or undiagnosed T2DM and subsequently externally validated them in a large multi-ethnic cohort. We searched PubMed for relevant studies conducted in humans until December 2017. The search string is presented in Appendix A. Furthermore, we performed a reference search of the papers identified based on our search to find more relevant articles.
Articles were included if 1) the primary aim of the article was the development of a new risk assessment model for prevalent or undiagnosed T2DM (including impaired glucose regulation), 2) the risk assessment models had at least two prediction variables, 3) participants were adults, and 4) the articles were written in English. Articles were excluded if 1) they only validated an already existing risk assessment model, 2) the population was pre-selected for risk factors or disease (i.e., not the general population), and/or 3) the models included variables not present in our study and of which we could not use a proxy variable as substitute. For example, we excluded models with predictors such as monthly income and gestational diabetes because we did not have such variables and reliable proxies in our dataset.
The titles and abstracts of all articles were independently screened by one person (MO) as well as a group of reviewers (FR, IV, and MBS). The full-text review of all the articles was done by one reviewer (MO) as well as a group of reviewers (FR and IV). One reviewer (MO) extracted data from all the articles, while another reviewer (IV) duplicated the data extraction for a randomly selected third of the articles. In articles that had more than one risk assessment model, the model that was recommended by the authors was chosen. The extracted data items included the first author's name, year of publication, country, population, population age, outcome definition, number of cases and sample size, risk predictors in the model, statistical model, measures of model performance, and whether the model was internally or externally validated (Table 1). Conflicts between reviewers were resolved through review by at least one other reviewer to reach consensus (IV and MBS).

HELIUS (Validation study)
The HELIUS study is a population-based, multi-ethnic cohort which has been described in detail elsewhere [12,13]. In brief, baseline data were collected between 2011 and 2015 and included adults (18-70 years) living in Amsterdam, the Netherlands. It included people of Dutch, Ghanaian, South Asian Surinamese, African Surinamese, Moroccan, and Turkish ethnic origin. Participants were randomly sampled, stratified by ethnic group, from the Amsterdam municipality register. The HELIUS study has been approved by the Academic Medical Center (AMC) Ethical Review Board. All participants provided written informed consent. We used baseline data of all participants for whom questionnaire data as well as data from the physical examination were available (n = 22,165). We excluded those ethnic groups with small numbers (n = 500) and those of other/unknown ethnic origin (n = 48). We additionally excluded participants who had missing data on the prevalence of T2DM (n = 98). Therefore, 21,519 participants were available for analyses, including 4,547 Dutch, 3,035 South Asian Surinamese, 4,119 African Surinamese, 2,326 Ghanaian, 3,598 Turkish, and 3,894 Moroccan participants (Figure 1).

Predictors and other variables in HELIUS
A questionnaire was used to measure age, sex, smoking status (current/never/former), alcohol use in the past 12 months (yes/no), and family history of diabetes (yes/no/unknown). Physical activity (total minutes/week) was measured using the validated Short Questionnaire to Assess Health-Enhancing Physical Activity (SQUASH) [14], where time spent on various activities during a normal week in the past few months was recorded. Participants also brought their prescribed medications at baseline to record their medication use (e.g., use of antihypertensive drugs).
Physical examination was used to measure blood pressure (mmHg) and anthropometric measures (weight in kg, height, hip, and waist circumference in cm). Fasting blood samples were obtained to measure levels of HbA1c (mmol/mol), glucose (mmol/l), triglycerides (mmol/l), total cholesterol (mmol/l), HDL (mmol/l), and LDL (mmol/l). More information on these measurements has been described elsewhere [13].

Ethnicity in HELIUS
Ethnicity was defined according to the participant's country of birth as well as that of his/her parents [15]. Specifically, a participant is considered to be of non-Dutch ethnic origin if: 1) he/she was born abroad and has at least one parent born abroad (first generation) or 2) he/she was born in Netherlands but both his/her parents were born abroad (second generation).
For the Dutch sample, we invited people who were born in the Netherlands and whose parents were born in the Netherlands. After data collection, participants of Surinamese ethnic origin were further classified according to self-reported ethnic origin (obtained by questionnaire) into 'African', 'South-Asian', 'Javanese', or 'other'.
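The country-of-birth rule above can be written as a small decision function. This is a minimal sketch for illustration only, not the HELIUS classification code; the function name, argument names, and returned labels are hypothetical.

```python
def migration_generation(born_abroad: bool,
                         mother_born_abroad: bool,
                         father_born_abroad: bool) -> str:
    """Apply the two country-of-birth rules stated in the text.

    The labels returned here are illustrative (assumed), not taken
    from the HELIUS data dictionary.
    """
    parents_abroad = int(mother_born_abroad) + int(father_born_abroad)
    # Rule 1: born abroad and at least one parent born abroad.
    if born_abroad and parents_abroad >= 1:
        return "first generation"
    # Rule 2: born in the Netherlands but both parents born abroad.
    if not born_abroad and parents_abroad == 2:
        return "second generation"
    # Neither rule applies, so the person is not classified as non-Dutch.
    return "not non-Dutch"


print(migration_generation(True, True, False))    # first generation
print(migration_generation(False, True, True))    # second generation
print(migration_generation(False, True, False))   # not non-Dutch
```

Note that the sketch covers only the two stated rules; the further split of Surinamese participants by self-reported origin is a separate step.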

T2DM assessment in HELIUS
T2DM was considered to be present if the participant's fasting glucose level was ≥7.0 mmol/l, and/or the HbA1c level was ≥7% (53 mmol/mol), and/or the participant was using glucose-lowering medication.
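As an illustration, the criteria listed above can be combined into a single outcome flag. The sketch below is hypothetical (function and argument names are assumptions) and encodes only the three criteria stated in this section, with the thresholds as given in the text.

```python
def has_t2dm(fasting_glucose_mmol_l: float,
             hba1c_mmol_mol: float,
             uses_glucose_lowering_medication: bool) -> bool:
    """Return True if any of the criteria listed in the text is met.

    Thresholds are copied from the text: fasting glucose >= 7.0 mmol/l,
    HbA1c >= 53 mmol/mol, or use of glucose-lowering medication.
    """
    return (
        fasting_glucose_mmol_l >= 7.0
        or hba1c_mmol_mol >= 53
        or uses_glucose_lowering_medication
    )


print(has_t2dm(5.4, 38, False))  # False: no criterion met
print(has_t2dm(7.2, 38, False))  # True: fasting glucose criterion
print(has_t2dm(5.4, 38, True))   # True: medication criterion
```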

Data analysis
We used the published regression coefficients and intercept values from the original models for validation. We contacted the authors to obtain the coefficients of the original model if missing. From these, we calculated the probabilities of having undiagnosed and/or prevalent T2DM in our study. We did this for the whole population and stratified by ethnic group. We replaced the original predictor with a proxy variable if a direct match was not available in our dataset. We used family history of diabetes (i.e., if any of the parents and/or siblings had been diagnosed with diabetes) even when the model specified history of diabetes in parents only or siblings only. We used a total activity of less than 600 minutes/week as a proxy for sitting time of more than six hours. Total activity of less than 600, 600 to 3,000, and above 3,000 minutes/week was used as a proxy for low, moderate, and high physical activity categories, respectively. For models that had ethnicity dichotomised as white versus others, we used the Dutch and Turkish ethnic groups as white and the other ethnic groups as others (Supplementary Table 2). Furthermore, we defined the outcome variable in our study as closely as possible to the outcome definition in the development population (e.g., if the original model predicted risk of undiagnosed T2DM, we also defined our outcome as undiagnosed T2DM).
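The calculation of a predicted probability from published coefficients can be sketched as follows. This is a minimal illustration that assumes a logistic regression model; the predictor names and coefficient values in the usage example are invented for demonstration and do not come from any of the validated models. The activity cut-offs are those stated above.

```python
import math


def predicted_probability(intercept: float,
                          coefficients: dict,
                          predictors: dict) -> float:
    """Predicted risk from a logistic model: 1 / (1 + exp(-linear predictor))."""
    linear_predictor = intercept + sum(
        coefficients[name] * predictors[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def activity_category(total_minutes_per_week: float) -> str:
    """Proxy categories from the text: <600 low, 600-3,000 moderate, >3,000 high."""
    if total_minutes_per_week < 600:
        return "low"
    if total_minutes_per_week <= 3000:
        return "moderate"
    return "high"


# Hypothetical model: these coefficients are made up for illustration.
example_coefficients = {"age_years": 0.05, "male": 0.4, "family_history": 0.8}
example_person = {"age_years": 55, "male": 1, "family_history": 1}
risk = predicted_probability(-6.0, example_coefficients, example_person)
print(round(risk, 3), activity_category(450))
```

Applying the same function per participant, once for the whole cohort and once within each ethnic group, gives the predicted risks used in the stratified validation.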
Model performance was assessed using measures of discrimination and calibration. Discrimination denotes the ability of the model to distinguish those at high risk of having T2DM from those at low risk, while calibration indicates the ability of the model to correctly estimate the absolute risks. Discrimination was assessed using the area under the curve (AUC), also known as the C-statistic. A C-statistic of 0.5 reflects a random guess, whereas a C-statistic of 1.0 reflects perfect discrimination. We considered C-statistics of 0.6-0.7 as poor, >0.7-0.8 as moderate, >0.8-0.9 as good, and >0.9 as excellent discrimination [16]. For each model, we statistically compared the highest versus the lowest AUC across ethnic groups to ascertain whether they differed. This was done using the bootstrapping method (2,000 bootstraps) in the R package pROC [17].
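To make the discrimination measure concrete, the sketch below computes the C-statistic as the proportion of case/non-case pairs in which the case has the higher predicted risk (ties count as one half) and maps it to the categories defined above. It is an illustrative Python version rather than the pROC routine used in the analysis; the function names are hypothetical and the bootstrap comparison is not reproduced here.

```python
def c_statistic(y_true, y_score):
    """Proportion of concordant case/non-case pairs (ties count 0.5)."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


def discrimination_label(auc):
    """Categories from the text; values below 0.6 are not classified there."""
    if auc < 0.6:
        return "not classified"
    if auc <= 0.7:
        return "poor"
    if auc <= 0.8:
        return "moderate"
    if auc <= 0.9:
        return "good"
    return "excellent"


auc = c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc, discrimination_label(auc))  # 0.75 moderate
```

The pairwise loop is quadratic in the sample size, which is fine for an illustration but slow for a cohort of this size; a rank-based formula gives the same value more efficiently.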
Calibration was evaluated using the Hosmer-Lemeshow goodness-of-fit test (non-significant values indicate adequate calibration). We additionally inspected the calibration plots visually. Miscalibration due to differences between the prevalence of T2DM in our study and that in the development populations was 'corrected' for by adjusting the intercept (recalibration). This was done by adding a correction factor to the intercept of the original model; more information on this is published elsewhere [18].
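The two calibration steps can be illustrated with a short sketch. The correction factor used here, the difference between the log-odds of the observed prevalence and the log-odds of the mean predicted risk, is one simple approximation and is an assumption on our part; the exact procedure is the one described in [18]. The Hosmer-Lemeshow function returns only the test statistic (the p-value would come from a chi-square distribution), and all function names are hypothetical.

```python
import math


def recalibrate_intercept(probs, observed_prevalence):
    """Shift every prediction by one correction factor on the log-odds scale.

    Assumed correction: logit(observed prevalence) - logit(mean predicted risk).
    """
    mean_predicted = sum(probs) / len(probs)
    correction = (math.log(observed_prevalence / (1 - observed_prevalence))
                  - math.log(mean_predicted / (1 - mean_predicted)))
    return [1 / (1 + math.exp(-(math.log(p / (1 - p)) + correction)))
            for p in probs]


def hosmer_lemeshow_statistic(y_true, probs, groups=10):
    """Sum over risk-ordered groups of (observed - expected)^2 / variance."""
    pairs = sorted(zip(probs, y_true))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        variance = expected * (1 - expected / len(chunk))
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance
    return statistic


original = [0.2, 0.4, 0.6, 0.8]
recalibrated = recalibrate_intercept(original, observed_prevalence=0.25)
print([round(p, 3) for p in recalibrated])
print(hosmer_lemeshow_statistic([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5], groups=2))
```

Because the same constant is added to every log-odds value, the ordering of participants by predicted risk is unchanged, so an intercept adjustment of this kind cannot alter the C-statistic.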
Most predictors had missing values less than 1% with the exception of family history of T2DM (11.9%). Therefore, we did a multiple imputation of the missing values with 10 imputations using the multivariate imputation by chained equations (MICE) package in R.
All statistical analyses were conducted using R version 3.2.5.

Results
Our database search yielded 1,250 hits and 14 articles were added from the reference lists of the identified articles. After removing the duplicates (n = 1) and reviewing the titles and abstracts, 1,202 articles were excluded (Figure 2). After a full-text screening of the remaining 61 articles, 35 articles were excluded. Reasons for exclusion are provided in Supplementary File 1 (excel file attached). Twenty-six articles met our inclusion criteria; however, one of these articles had HbA1c as a predictor and was excluded because HbA1c is used for the outcome definition in our study. In total, 25 articles were included for validation (Table 1). Four articles reported risk assessment models for men and women separately [7,19-21]; thus, the current study validated a total of 29 risk assessment models from 25 articles. Based on the four domains assessed in the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the overall risk of bias (ROB) of the studies was generally low for the domains on predictors (100%) and outcome (97%), while it was high for most studies on the analysis domain (86%). The majority of the studies (62%) had a low ROB for the domain on participants (Supplementary Figure 1). Only 7% and 3% of the studies had unclear information to assess their ROB in the domains on participants and outcome, respectively (Supplementary Figure 1).
As depicted in Table 1, the internal validation AUCs ranged from 0.67-0.88 in the development populations. One study [19] did not report AUCs but only the Nagelkerke R² (0.104 for men and 0.031 for women), which is the proportion of variance in T2DM explained by the models. Seven studies [7,19,21,24,25,29,37] did not report an external validation of their models, while three [5,23,34] only compared their models with the performance of other validated models. Sample sizes ranged from 429 [27] to 41,809 [31]. All these models included T2DM predictors that can be assessed non-invasively (i.e., without blood sampling or other invasive assessments).
As depicted in Table 2, on average, the risk assessment models showed moderate to good discrimination, with external validation AUCs varying between 0.77-0.92 in Dutch, 0.66-0.83 in South Asian Surinamese, 0.70-0.82 in African Surinamese, 0.61-0.81 in Ghanaian, 0.69-0.86 in Turkish, and 0.69-0.87 in Moroccan participants. In general, the AUCs were consistently lowest among Ghanaians and highest among the Dutch. The highest and the lowest AUCs were statistically different (p-value < 0.05) across ethnic groups for all the models. There were no clear patterns in model performance depending on whether the models were developed in Asian, Caucasian, or multi-ethnic populations. The four models developed separately for men and women showed that, across the ethnic groups, the AUCs for women were generally slightly higher than those for men.
Calibration was poor for all models (Hosmer-Lemeshow p < 0.001) except one [26], and that model was adequately calibrated only in the Dutch (Hosmer-Lemeshow p-value = 0.39). Generally, from the calibration plots, the majority of the models overestimated T2DM risk in all ethnic groups, especially at higher predicted risks. However, calibration improved after recalibration; see, for example, the calibration plots for the model by Lawati et al. (Supplementary Figure 2). The recalibrated models had lower Hosmer-Lemeshow statistics, although the tests were still statistically significant (p < 0.05). Since the ranking of the predicted risks is not affected by recalibration, discrimination did not change.

Discussion
An external evaluation of the performance of 29 non-invasive risk assessment models for prevalent or undiagnosed T2DM in a multi-ethnic cohort showed that the models had a moderate to good discriminatory ability per ethnic group. However, the AUCs were heterogeneous across ethnic populations, with the AUCs being consistently lowest among the Ghanaians and highest among the Dutch, compared to other ethnic groups. Furthermore, no clear patterns in performance of the models were witnessed depending on whether the models were developed in Asian, Caucasian, or multi-ethnic populations. The models showed poor calibration with most overestimating the predicted probabilities which improved, to some extent but not adequately, after recalibrating the models.
Our study had several strengths. All models were identified through a literature search. Our validation sample was large with participants from six ethnic populations, thereby enabling us to carry out validation stratified by ethnic group. We additionally tried to define the outcome variable as closely as possible to the outcome definition in the development population. Finally, we included undiagnosed T2DM cases, thereby reducing the false negative cases which might arise when using only verified T2DM cases. However, undiagnosed T2DM ascertainment was partly based on a single fasting plasma glucose measurement, which might lead to some false positives. Limitations of our study include the lack of certain predictors; nevertheless, where applicable, we used available predictors that were as close as possible to the missing predictors as proxies. We also did not have data on, or a suitable proxy for, gestational diabetes, which is an important risk factor in women. The two-hour glucose was not used to ascertain T2DM in our study and thus we might have underestimated T2DM cases. Finally, we only did validation in the ethnic groups living in Amsterdam, which may affect the generalizability of our results to similar ethnic groups living in their countries of origin.
Our study is unprecedented; hence, no direct comparisons can be made because, to the best of our knowledge, no review has externally validated risk assessment models for prevalent or undiagnosed T2DM in a multi-ethnic population. However, there have been recent systematic reviews comparing different models for prevalent and/or undiagnosed T2DM [3,4]. Unfortunately, only about 2% of these models have been externally validated in ethnic populations different from the development population, with a moderate performance on average (validation AUCs ranging from 0.59-0.80) [3]. Another study assessed the prevalence of known and newly detected T2DM, developed a risk model, and validated it in Hindustani Surinamese, African Surinamese, and Dutch ethnic groups in the Netherlands [42]. The Hindustani Surinamese had the highest T2DM prevalence while the Dutch had the lowest. The model performed moderately well with AUCs ranging from 0.74-0.80, but no calibration results were reported. All these results are broadly consistent with our results.
Another study, by Gray et al. [39], developed a risk assessment model for undiagnosed T2DM in a multi-ethnic cohort with 24% non-Caucasians. This model was externally validated in a multi-ethnic (26.3% non-Caucasians) study called STAR (Screening Those At Risk) [39]. The same model performed similarly in our study with an AUC of 0.75 in the total population and a range of 0.71-0.81 in the different ethnic groups. However, this and similar models were validated in populations with limited ethnic variation consisting predominantly of Caucasians and Asians. Therefore, due to the lack of studies stratifying by other ethnic groups, we cannot directly compare our results for the different ethnic groups.
Generally, it is expected that a risk assessment model would perform worse in an independent external validation dataset than the development dataset. Nevertheless, most models had moderate discrimination with some having better AUCs in our study than in their development study. This could be explained by the differences in heterogeneity between our study and theirs [43]. Larger heterogeneity in the validation study might lead to higher AUCs than the development population.
The majority of the models overestimated T2DM risk, especially in those with higher observed risk in some ethnic groups. This could partly be explained by poor calibration of the model in the development dataset and by differences between the validation and the development populations, such as the prevalence of T2DM and how the outcome and predictors were measured. These differences might influence predictions during validation, thereby affecting calibration [43]. We 'corrected' for the difference in T2DM prevalence by adjusting the intercept of the models, which improved calibration, but not sufficiently. The disagreement between predicted and observed risks that remained even after intercept adjustment could be because of the sensitivity of the Hosmer-Lemeshow test to sample size. With a large sample size, small differences between predicted and observed risks are likely to be considered significant even when visual inspection of the calibration plot suggests good calibration.
However, overestimation of T2DM risk in those with higher observed risk might not directly affect public health strategies like screening. Certain thresholds for absolute disease risk are usually set before initiation of public health strategies. Therefore, overestimation of risk beyond this threshold might not cause a change in the intervention strategies. However, some models overestimated T2DM risk in those with lower observed risk. These predictions could well be under the thresholds for intervention initiation, therefore such models should be calibrated before use in clinical practice.
Performance was consistently lower in the South Asian Surinamese, African Surinamese, and Ghanaian populations compared to the other ethnic groups. This could partly be explained by the fact that risk factors for T2DM may differ across ethnic groups, and it is plausible that some of the risk factors that are very informative in these populations were not captured in any of the models. Informative risk factors such as socio-economic status (e.g., education level or income) and access to healthcare vary substantially between ethnic groups. Unfortunately, none of the models that we validated had these variables. Therefore, validating a model with only the original risk factors within another population is likely suboptimal. Moreover, the models showed similar performance in the African Surinamese and Ghanaian populations, and likewise in the Turkish and Moroccan populations. This suggests that, perhaps, performance is comparable in ethnic groups having more or less similar cultural beliefs and practices, such as religion, diet choice, and cooking habits. These factors might have an influence on the predictors of T2DM risk and how they are measured across ethnic groups. For example, assessing alcohol use in ethnic groups in which most people adhere to a religion that prohibits alcohol use might lead to 'socially desirable' responses. These potential differences in the way predictors are measured across ethnic groups could also in part explain the heterogeneity in model performance.
We expected the validity to be better when evaluating models developed in a similar ethnic group as the validation ethnic group. Nevertheless, there were no clear patterns in this regard, which could be due to the fact that generally the models were not ethnic specific, hence not directly analogous to the ethnic groups in our study. Additionally, some of the models were not developed in ethnic populations living in a different country than their country of origin. Studies have shown that migrant ethnic groups might be affected by underlying factors related to their new host country and country of origin, and these factors influence their risk for T2DM. Similar studies have also shown that those with migrant status have a higher risk for T2DM [10], so validating models developed in ethnic populations living in their country of origin in populations with ethnic groups with migrant status might explain the lack of clear patterns. This is perhaps strengthened by the good performance of the models in the Dutch population without a migrant status.
Model performance, in terms of discrimination, was generally better across the ethnic groups in women compared to men for the studies that developed models separately for both sexes. These results are in line with a validation study of prediction models for T2DM in a European population that also showed higher performance in women than men [44]. There is evidence to suggest that the pathogenesis of T2DM may be different between men and women. Men, compared to women, usually have higher fasting glucose levels and a lower BMI before the onset of T2DM [45,46]. Previous studies have also shown that several factors predicting T2DM differ between sexes (e.g., family history of diabetes) [47,48], possibly explaining, in part, the difference in performance for some of these models. Therefore, risk assessment models for T2DM should likely be formulated separately for men and women.
The models' discrimination was heterogeneous and calibration was poor, therefore, external validation, including recalibration, is necessary before use in clinical practice. We additionally recommend that such models should be adapted to the situation of the population they are intended to be used in and more such models developed or improved for certain ethnic subgroups (e.g., South Asian Surinamese, African Surinamese, and Ghanaian populations).
In conclusion, existing risk assessment models for prevalent or undiagnosed T2DM showed moderate to good discriminatory ability in different ethnic populations living in the Netherlands. However, these models had poor calibration with most of them overestimating the predicted probabilities for T2DM. Furthermore, these models show heterogeneous discrimination per ethnic group with consistently lower performance in the South Asian Surinamese, African Surinamese, and Ghanaian populations, hence the need for external validation of such models in different ethnic groups and possibly an appropriate adaptation to the local setting before clinical use.

Data Accessibility Statement
Restrictions apply to the availability of data generated or analyzed during this study to preserve patient confidentiality or because they were used under license. The corresponding author will on request detail the restrictions and any conditions under which access to some data may be provided.

Additional Files
The additional files for this article can be found as follows: