Using Predictive Models to Improve Care for Patients Hospitalized with COVID-19


Structured Abstract

Background:

The COVID-19 pandemic has placed significant burdens on the United States. The purpose of this project was to generate evidence for critical questions in clinical decision-making by identifying clinical and social risk factors for COVID-19 outcomes and by developing and validating electronic health record (EHR) data-based prediction models for patient outcomes.

Objectives:

This project was conducted in close collaboration with the INSIGHT clinical research network (CRN), which is part of PCORnet®, the National Patient Centered Clinical Research Network. This project had 3 aims:

  • Aim 1: Predict the intensive care unit (ICU) need among patients hospitalized for COVID-19.
  • Aim 2: Predict the risk of mortality among patients hospitalized for COVID-19.
  • Aim 3: Predict the course and outcome of intubation among patients hospitalized for COVID-19.

Given the rapid shift in ICU designations in New York City (NYC) hospitals during the pandemic, the ICU label in the EHR data may not accurately and comprehensively identify all critical care patients. We therefore combined aims 1 and 3 and used intubation as a proxy for ICU need.

Methods:

The 2 revised specific aims of this project were (1) to predict the risk of intubation among patients hospitalized for COVID-19 and (2) to predict the risk of mortality among patients hospitalized for COVID-19. In this project, we used a COVID-19 research database developed by the INSIGHT CRN. The database includes EHR data from 5 health systems: Weill Cornell Medicine, Columbia University, Montefiore, Mount Sinai Hospital, and New York University. For aim A, we developed logistic regression models, random forests, and classification and regression tree (CART) models to predict the need for intubation from COVID-19 using data from the beginning of the pandemic (March 1, 2020, through February 8, 2021). We considered a broad set of variables as candidate predictors, including patient demographics, baseline comorbidities, and presenting laboratory tests. Our main predictors were demographic characteristics (eg, age, sex, race, ethnicity), clinical comorbidities (eg, hypertension, hyperlipidemia, chronic obstructive pulmonary disease [COPD], cancer, coronary artery disease, heart failure, asthma, diabetes), and vital signs (eg, body mass index, systolic and diastolic blood pressure). We also measured the time course of the pandemic as the number of weeks from March 2020 (the beginning of the pandemic in NYC) and measured social vulnerability as the quintiles of the Social Deprivation Index (SDI) at the zip code level among NYC residents. For aim B, we constructed logistic regression models, random forests, and CART models to predict mortality. The analysis used the same demographic characteristics, clinical comorbidities, and vital signs as in aim A as key predictors. Across aims A and B, we derived and validated 4 biologically distinct subphenotypes of patients with COVID-19 by using clustering analysis to understand the heterogeneity of clinical characteristics among patients with COVID-19. We also examined the variation in our key clinical outcomes (eg, in-hospital mortality and intubation) across subphenotypes.

Results:

  • Aim A: A total of 30 016 patients from the INSIGHT COVID-19 database were included in the analysis: 11 254 patients who were seen in the emergency department and discharged home and 18 762 hospitalized patients (2902 patients were intubated and 3554 patients died during the study period). The prediction models using logistic regressions and random forests for intubation performed well. The range of the area under the receiver operating characteristic curve (AUROC) across the folds was 0.66 to 0.74 for logistic regression, 0.66 to 0.73 for random forests, and 0.53 to 0.54 for CART. Our logistic models were well calibrated, as indicated by the low Brier scores (<0.25) and by the joint test of the hypotheses that the intercept = 0 and the slope = 1 for the calibration curves (P > .99 for all models). Including time (ie, number of weeks from March 1, 2020, to the date of the COVID-19 encounter) and the interactions of time with the main predictors modestly improved the prediction accuracy of intubation using logistic regression and random forests. Including SDI quintiles did not improve the prediction accuracy of intubation for any method. We identified 4 clinically distinct subphenotypes, and our analysis demonstrated variation in the rates of intubation across these subphenotypes. Subphenotype I consisted of more young and female patients and had the lowest rate of intubation. Subphenotype IV included patients who were older and predominantly male, with more abnormal values across all clinical variables; this group had the highest intubation rate.
  • Aim B: The models using logistic regressions and random forests performed well in predicting in-hospital mortality. The AUROC across the folds ranged from 0.78 to 0.85 for logistic regression, 0.79 to 0.85 for random forests, and 0.64 to 0.68 for CART. Including time and interactions of time with the main predictors slightly improved the prediction accuracy using logistic regression and random forests. Including SDI quintiles did not improve prediction accuracy. Similarly, mortality varied across subphenotypes. Subphenotype I had the lowest rate of mortality, while subphenotype IV had the highest rate of mortality.

Conclusions:

Our models using logistic regressions and random forests to predict in-hospital mortality and intubation showed good performance in a large sample of patients with COVID-19 in NYC. We validated the prediction models across different periods of the pandemic and across different levels of patient neighborhood socioeconomic status. We also derived 4 subphenotypes to understand the clinical heterogeneity of patients with COVID-19 and examined demographics, clinical characteristics, and COVID-19 outcomes across these subphenotypes. These findings provide important evidence to improve outcomes among patients with COVID-19. The prediction models and subphenotypes can be implemented in health systems using their real-time EHR data to identify patients at high risk for adverse outcomes. Clinicians can target early interventions to these patients and perhaps improve patient outcomes.

Limitations:

Although we drew on a robust COVID-19 patient cohort from 5 major health systems in NYC, findings may not be generalizable to other patients in the NYC area or patients in other parts of the country because of the evolution of the virus. We were not able to extract presenting symptoms and other nonroutinely collected clinical data when we developed prediction models, and we did not include clinical encounters at health systems outside the INSIGHT CRN in our clinical analyses. We used data from patients with COVID-19 in the first 2 waves of the pandemic in NYC. Therefore, findings may not be generalizable to the later waves. Our prediction models need to be validated using patients with COVID-19 from later waves. Furthermore, our study used data when vaccines were not widely available, and our models should be reevaluated and updated in a vaccinated cohort.

Background

The COVID-19 pandemic has been an unprecedented public health crisis globally, including in the United States, where New York City (NYC) became the initial epicenter in March 2020.1-4 As of early June 2021, NYC reported approximately a million confirmed cases, more than 100 000 hospitalizations, and 33 000 confirmed deaths.5 As the pandemic continues in the United States and globally, better understanding the clinical characteristics and outcomes for patients with COVID-19 is important to inform clinical decision-making and public health policy.

The course of COVID-19 is clinically distinct from that of other coronaviruses.1-4,6,7 Given the surge of hospitalizations among patients with COVID-19 in the United States and many other countries, clinicians and clinical leaders face significant uncertainty in making optimal decisions about triage, discharge, resource allocation, and staffing because of the lack of robust evidence-based decision support tools.8-10

The academic community has responded quickly to this public health crisis by developing various prediction models.11-13 A review published early in the pandemic identified 107 models to predict hospitalization of, diagnosis of, and prognosis for patients with COVID-19.14 Serious biases were identified from these models, however, primarily because of poorly collected data, small sample sizes, and biased sample selection, among other methodologic concerns.14,15 For example, most studies have focused on patients from a single hospital or a single health system.2-4,7,8 There is significant variation, however, in clinical characteristics and outcomes across health systems, and findings from a single health system may not be generalizable to other patient populations. In addition, most studies have not followed patients after hospital discharge because data on postdischarge outcomes are often unavailable.3,4,7,8 Although some studies on prognostic predictive modeling of COVID-19 outcomes are available, this study is unique in the richness and longitudinality of its data.16-18

COVID-19 is heterogeneous, and our understanding of the biological mechanisms of host response to the viral infection remains limited. Previous studies have uncovered substantial variation in the host response to SARS-CoV-2 and the variable clinical manifestations of this disease, including respiratory failure, kidney injury, and cardiovascular dysfunction.19-23 The pathophysiology of differential organ dysfunction in COVID-19 remains unclear across varied patient populations. In this context, there is a need to identify subphenotypes of COVID-19, which is a promising route to disentangle the heterogeneity of the disease and has seen notable achievements in studying complex syndromic diseases.24 In COVID-19, identification of meaningful clinical subphenotypes may benefit pathophysiologic study, clinical practice, and clinical trials.

In this project, we used 1 of the largest COVID-19 electronic health record (EHR) data sets, including a diverse group of patients with COVID-19 from 5 major health systems in NYC, to develop and validate prediction models for in-hospital mortality and intubation for patients hospitalized with COVID-19. Findings will be helpful to inform patient-centered clinical decision-making and improve outcomes for patients hospitalized with COVID-19.

Our objectives for developing prediction models were to identify clinical risk factors for in-hospital mortality and intubation and to understand how these risk factors changed over time and across socioeconomic subgroups. The rationale for the emphasis on how risk factors changed over time was a recognition of the evolving landscape of preventive measures (eg, vaccines), treatments (eg, medications), and disease management. The rationale for the emphasis on how risk factors changed across socioeconomic subgroups was the mounting evidence of how socioeconomic factors determine who gets COVID-19, which led us to investigate whether such disparities exist when patients present to hospitals.

In close collaboration with the INSIGHT clinical research network (CRN), this project had 3 aims:

  • Aim 1. Predict the intensive care unit (ICU) need among patients hospitalized with COVID-19.
  • Aim 2. Predict the risk of mortality among patients hospitalized with COVID-19.
  • Aim 3. Predict the course and outcome of intubation among patients hospitalized with COVID-19.

For aim 1, our clinical research team was aware that NYC hospitals provided ICU care in non-ICU locations during the peak of the pandemic. These rapidly shifting ICU designations were not consistently captured in the EHR data. Therefore, the ICU label in the EHR data would not accurately and comprehensively identify all critical care patients. Our critical care experts, however, determined that intubation was better captured in the EHR and served as a more accurate outcome for patients who require critical care. As a result, we combined aim 1 and aim 3 and used intubation as a proxy for ICU need. Therefore, our revised specific aims for this project are:

  • Aim A. Predict the risk of intubation among patients hospitalized with COVID-19.
  • Aim B. Predict the risk of mortality among patients hospitalized with COVID-19.

This project was built on expertise and experience gained from the parent grant, “Identifying and Predicting Patients with Preventable High Utilization.” First, our experience with the INSIGHT clinical database from the parent grant expedited this supplement project. In the parent grant, we used the INSIGHT clinical database for 1 million Medicare patients and linked it with Medicare claims data and social determinants of health (SDOH) data. Through this work, we gained familiarity with the structure and data elements available in the INSIGHT database and developed methods and algorithms to identify patient medical, behavioral, and social characteristics. Second, we used the expertise in machine learning (ML)-based prediction models that we developed in the parent grant. In the parent grant, we developed prediction models for patients with high preventable utilization by using ML methods such as random forest models. We used similar methods in this grant to predict outcomes among patients with COVID-19. The expertise we gained through the parent grant, including linking INSIGHT with external databases and applying ML algorithms to clinical data, allowed us to conduct this supplemental project in a timely manner.

Participation of Patients and Other Stakeholders

As a key component of this project, our team engaged with various stakeholders across the NYC health care landscape to structure and evaluate our aims, goals, analysis, and dissemination of this work. Because of the rapid societal changes and operational limitations that the COVID-19 pandemic brought to our research processes, including various state and city policies (eg, stay-at-home orders), we continued to use the stakeholder engagement approach from the parent grant, including previous guidance from various health system stakeholders through several patient and stakeholder advisory committees. These inputs included prior discussions with advisory committees on effective methods for implementation and dissemination of the predictive models developed in the parent grant and subsequently in this COVID-19 enhancement award. To ensure that the results from this COVID-19-specific work would be clinically meaningful and add value to the existing research base, we engaged, early in the project, several leading clinician researchers with experience caring for patients with COVID-19 across Weill Cornell Medicine, Memorial Sloan Kettering Cancer Center, and NewYork-Presbyterian Hospital so that we could accurately address the most pressing issues and patient outcomes. The team included Dr Dhruv Khullar, a hospitalist and assistant professor of health policy and economics; Dr Edward Schenck, an intensivist and assistant attending physician in pulmonary critical care; Dr James Flory, an endocrinologist and assistant professor of population health sciences; Dr Justin Choi, a hospitalist in internal medicine; Dr Nathaniel Hupert, associate professor of population health sciences; Dr Parag Goyal, assistant professor of medicine and a cardiologist; and Dr Peter Steel, assistant professor of emergency medicine and director of clinical services. Each of these research team members brought a unique perspective from their daily encounters with patients with COVID-19. These clinician researchers shared lessons and challenges from their experiences on the front lines treating patients during the pandemic, providing our research with insights into the most salient issues of COVID-19 care for both patients and clinicians.

Early in this project, the clinician researchers shaped the nuanced questions that reflected the most pressing issues in clinical care of patients with COVID-19. This step was crucial to ensure that our team was asking the right questions in our efforts to develop the most useful research products. For example, these experts raised important questions about high-risk comorbidities and how best to manage the care of these patients. As many clinicians suggested, we paid special attention to SDOH and COVID-19 outcomes in this project. In contrast to the existing literature, this work took a more holistic approach to understanding and helping patients with COVID-19.1,25-28 In addition, this work illuminated some of the specific social determinants that may put individuals at higher risk.

Through their clinical work, several of our clinician researchers, including Drs Khullar, Goyal, Steel, and Schenck, directly engaged with patients with COVID-19, and their conversations with patients and families played a critical role in informing this research. In particular, these clinicians frequently encountered how social challenges—crowded housing, food insecurity, the need to continue to work despite safety concerns—were concentrated among patients from specific communities, which in turn increased their risk of infection and having adverse COVID-19-related outcomes. For many patients, it was difficult to develop a safe discharge plan because they could not safely isolate at home or did not have stable housing. Another question revolved around predicting which patients would go on to develop severe COVID-19 disease. Patients and families often asked about prognosis, which was challenging to provide in the absence of robust outcomes data and predictive modeling. The questions this project aimed to understand, including the clinical and socioeconomic factors involved in determining severe outcomes, were informed by our clinical team members' experiences working with and engaging directly with their patients.

Additionally, Dr Kaushal was a prominent leader on sound policy for managing the COVID-19 pandemic, helping government officials develop safe policies to address salient societal issues that have been a product of the pandemic. Her piece on reopening schools led to multiple other speaking and interview opportunities to further inform and guide public policy efforts to safely reopen schools.29 Among these speaking engagements was a webinar with the US Chamber of Commerce Foundation; she also had interviews with several publication outlets to further contribute to the broader discourse on data-driven public policy during the pandemic.30-33

Methods

Study Overview

This project aimed to identify clinical and social risk factors for COVID-19 outcomes as well as develop and validate EHR data-based prediction models for these patient outcomes. In aim A, we developed models to predict the risk of intubation among patients hospitalized with COVID-19. In aim B, we developed models to predict the risk of in-hospital mortality among patients hospitalized with COVID-19. Both aims used logistic regressions and ML methods (ie, random forests and classification and regression tree [CART]) with a COVID-19-specific patient data set that the INSIGHT CRN developed.

Across aims A and B, we derived and validated 4 biologically distinct subphenotypes of patients with COVID-19 using clustering analysis to understand the heterogeneity of clinical characteristics among the patients. We also examined the variation in our key clinical outcomes (ie, mortality and intubation) across subphenotypes.

Study Setting

In this project, we used EHR data from the INSIGHT CRN. INSIGHT is the largest urban clinical network in the nation. Bringing together top academic medical centers across NYC, INSIGHT collects comprehensive clinical records for 13 million unique patients. New York City was disproportionately affected by the COVID-19 pandemic. In response to the urgent need to support COVID-19-related research, INSIGHT rapidly developed a COVID-19 research database accessible to institutions and researchers. The database includes data from 5 of its partnering institutions: Weill Cornell Medicine, Columbia University, Montefiore, Mount Sinai Hospital, and New York University. The INSIGHT COVID-19 research database contains deidentified clinical data (a limited data set), with patients and sites masked in the database. All patient and facility information is also masked in our analysis and results.

The INSIGHT CRN is a member of PCORnet®, the National Patient Centered Clinical Research Network. PCORnet® is funded by PCORI and represents a diverse set of patients and institutions, ranging from academic medical centers to local community health clinics. This collaboration includes a network of networks made up of 9 CRNs, 2 health plan research networks, and a coordinating center, each representing health systems across the United States, to develop and maintain a fully integrated, nationally representative database using a common data model (CDM) of health data to support patient-centered research projects.

Participants

Our patient population included all patients with COVID-19 in the INSIGHT COVID-19 database. Our sample was restricted to adults 18 years of age and older who presented to the emergency department (ED), whether later admitted to the hospital or discharged from the ED, or who were hospitalized between March 1, 2020, and February 8, 2021, with confirmed COVID-19, defined as having at least 1 positive laboratory test result on real-time reverse transcription polymerase chain reaction or at least 1 ICD-10 diagnosis code for COVID-19. We excluded patients who were in a nursing home before they presented to the ED or were hospitalized (N = 7441) because we could not rely on those patients' residential zip codes to merge with zip code-level social conditions. The resulting cohort consisted of 30 016 patients with confirmed COVID-19, including 11 254 patients who were seen in the ED and discharged home and 18 762 hospitalized patients.

Study Outcomes

The 2 outcomes of interest in this project were mortality and intubation. The development of the prediction models used mortality during a hospital or ED stay as the outcome. The development of the subphenotypes used the mortality rate 60 days after COVID-19 confirmation as the outcome. Both the prediction models and the subphenotype analysis used intubation, defined as mechanical ventilation during a hospital or ED stay. We first identified the dates of intubation by using the procedure file merged with the encounter file. We then identified intubations that fell between the admission date and discharge date of an ED or inpatient encounter among patients with COVID-19.
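
As a rough illustration of this linkage step, the sketch below (in R) joins a procedure table to an encounter table and keeps ventilation procedures dated within an ED or inpatient stay. The table names, column names, and the code list are assumptions for illustration; they are not the actual INSIGHT common data model fields or the exact code set used.

    library(dplyr)

    # Hypothetical mechanical ventilation code list (illustrative only)
    vent_codes <- c("5A1945Z", "5A1955Z", "94002", "94003")

    intubated_patients <- procedures %>%                 # one row per procedure
      filter(procedure_code %in% vent_codes) %>%
      inner_join(encounters, by = "encounter_id") %>%    # ED and inpatient encounters
      filter(encounter_type %in% c("ED", "IP"),
             procedure_date >= admit_date,
             procedure_date <= discharge_date) %>%       # procedure falls within the stay
      distinct(patient_id) %>%
      mutate(intubated = 1L)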

Data Collection and Sources

The COVID-19 database contains structured EHR data with data elements such as demographics, diagnoses, encounters, procedures, laboratory results, medications, death, and vitals (including height and weight, body mass index [BMI], blood pressure, and smoking status). In collaboration with the research team conducting this project, INSIGHT expanded the traditional PCORnet CDM to include additional COVID-19-related data elements that can be derived from EHRs and loaded into existing Observational Medical Outcomes Partnership/PCORnet tables. These data elements include assisted ventilation, remdesivir use, and COVID-19 antigen test data. There are specific ICD-10 and CPT codes used to identify ventilation records within the COVID-19 data sets.

Although the underlying data for all analyses came from the INSIGHT COVID-19 database, several of our subanalyses differed slightly in the specific analytical methods, time frames, and inclusion or exclusion criteria used. The subphenotyping analyses used the first version of the INSIGHT COVID-19 database, developed by July 2020, including 14 418 patients with COVID-19 who presented to the ED or were hospitalized between March 1 and June 12, 2020, from all 5 health systems affiliated with INSIGHT. The prediction aims used an updated COVID-19 database of 3 health systems, developed in March 2021, including patients with confirmed COVID-19 between March 1, 2020, and February 8, 2021.

Analytical and Statistical Approaches

Prediction Model Predictors

Our primary analysis focused on predicting the need for intubation and in-hospital mortality as well as evaluating whether the effect of risk factors on the risk of severe COVID-19 outcomes (intubation and death) temporally changed over the course of the pandemic and across the socioeconomic strata of NYC.

Predictors were identified by considering their availability in the EHR across all health systems as well as input from the literature and clinicians. Once the important predictors were identified, data extractions from the EHRs were carried out. Among the data extracted, all laboratory values were excluded from the analysis because of the high degree of missing data. All other predictors were used in the data analysis, and no variable selection took place. The primary predictors included demographics, baseline comorbidities, and vital signs. Demographics included age, sex, race (White, Black, Asian, other, or unknown), and ethnicity (Hispanic, non-Hispanic, or unknown). Age was categorized into 4 groups: 18 to 24 years, 25 to 44 years, 45 to 64 years, and 65 years and older. Sex included female and male. Baseline comorbidities included hypertension, diabetes, coronary artery disease (CAD), heart failure (HF), chronic obstructive pulmonary disease (COPD), asthma, cancer, obesity, and hyperlipidemia. We identified these conditions by using ICD-10 diagnosis codes compiled by the Centers for Medicare & Medicaid Services.34 Vital signs included BMI and systolic and diastolic blood pressure.

To capture the temporal change in COVID-19 outcomes and risk factors, we first examined the temporal distribution of COVID-19 hospitalizations, intubations, and deaths during the period March 1, 2020, through February 8, 2021, by constructing 7-day averages of counts of hospitalizations, intubations, and deaths, as recorded in the INSIGHT data. To study the temporal change in risk factors for severe COVID-19 outcomes, we categorized the period from March 1, 2020, through February 8, 2021, into 4 distinct periods: the initial surge (March 2020), the decline in cases (April 1, 2020, to June 8, 2020), the plateau period (June 9, 2020, to November 1, 2020), and the second wave (November 2, 2020, to February 8, 2021). We also included a continuous measure of the time course of the pandemic as the number of weeks from March 1, 2020, to the date of the COVID-19 encounter. Both the categorical and continuous age and time variables were used in the prediction models, and we found similar prediction accuracy. We present the results using age as a categorical variable and time as a continuous variable in this report.
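
A minimal sketch of how the continuous week counter and the 4 pandemic periods could be constructed from the encounter date is shown below; the data frame and column names (cohort, encounter_date) are assumptions.

    library(dplyr)

    cohort <- cohort %>%
      mutate(
        # continuous time course: weeks elapsed since March 1, 2020
        weeks_from_march = as.numeric(encounter_date - as.Date("2020-03-01")) / 7,
        # categorical pandemic period used in the descriptive analyses
        period = cut(encounter_date,
                     breaks = as.Date(c("2020-03-01", "2020-04-01", "2020-06-09",
                                        "2020-11-02", "2021-02-09")),
                     labels = c("Initial surge", "Decline", "Plateau", "Second wave"),
                     right = FALSE)
      )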

We linked clinical data with social data at the zip code tabulation area level from the Robert Graham Center for Policy Studies in Family Medicine and Primary Care and the 2018 American Community Survey.35,36 Following the previous literature,37,38 we first used the Social Deprivation Index (SDI) to measure overall neighborhood social conditions and then divided the cohort into 5 groups based on SDI quintile. The SDI is a composite score based on 7 socioeconomic characteristics. Although other, similar social indices exist, such as the Area Deprivation Index or the Social Vulnerability Index, we chose the SDI because it is publicly available at the zip code tabulation area level.35,39
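
The linkage and quintile assignment might look like the sketch below, assuming a zip code tabulation area-level SDI file (sdi_zcta) with columns zcta and sdi_score; these names are illustrative rather than the actual file layout.

    library(dplyr)

    cohort <- cohort %>%
      left_join(sdi_zcta, by = c("residential_zip" = "zcta")) %>%
      mutate(sdi_quintile = ntile(sdi_score, 5))   # 1 = least deprived, 5 = most deprived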

Predictive Modeling

In the parent grant, we used logistic regressions and ML methods to predict patients with high health care utilization and in-hospital mortality. We used logistic regression and random forest models that had good performance in the previous analysis. We also tested a new method—CART—in this study.

For both aims A and B, our main analysis focused on fitting a sequence of 3 logistic regression models for intubation and mortality. In the first model, we included demographic characteristics, comorbidities, and vital signs as predictors. We also included a fixed effect for the facility to control for the variability in clinical practice and policy across the various facilities of the INSIGHT data. Because of the nature of this novel disease, the treatment protocol for patients hospitalized with COVID-19 varied across the health systems within INSIGHT. The treatment protocol also changed over the course of the pandemic as facilities and hospitals gained knowledge and experience with this novel disease. Health systems also instituted policies for treating patients hospitalized with COVID-19 that were influenced by factors such as resource allocation (eg, ICU beds), FDA approval of treatment (eg, remdesivir), and case load in NYC. Similarly, facilities serve diverse patient populations that vary across various socioeconomic indicators. In the second model, we added a main effect of time and included 2-way interactions between time and all predictors of the first model. Similarly, in the third model we included main effects of SDI quintiles and 2-way interactions between SDI quintiles and predictors of the first model. The goal of these sequential models was to test the gain in prediction accuracy with and without the inclusion of interaction effects of our main predictors with time and SDI quintiles.
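
In R, the 3 sequential logistic regression models could be specified roughly as below; the variable names are placeholders, and the interaction lists are abbreviated to a few terms rather than the full predictor set described above.

    # Model 1: main predictors plus a facility fixed effect
    m1 <- glm(intubated ~ age_group + sex + race + ethnicity +
                hypertension + diabetes + cad + hf + copd + asthma +
                cancer + obesity + hyperlipidemia +
                bmi + sbp + dbp + factor(facility),
              family = binomial, data = cohort)

    # Model 2: add time (weeks since March 1, 2020) and its 2-way interactions
    # with the model 1 predictors (abbreviated here)
    m2 <- update(m1, . ~ . + weeks_from_march +
                   weeks_from_march:(age_group + sex + hf + dbp + factor(facility)))

    # Model 3: add SDI quintile and its 2-way interactions with the model 1 predictors
    m3 <- update(m1, . ~ . + factor(sdi_quintile) +
                   factor(sdi_quintile):(age_group + sex + hf + dbp + factor(facility)))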

Because logistic regression models consider simpler, lower-order interactions between predictors and time or SDI, such as 2-way interactions between time and 1 of the other predictors, we sought to explore higher-order interactions (eg, interactions between time and 2 or more other predictors) of time and SDI effects using ML algorithms, including random forests and CART, a method that naturally explores higher-order interactions and produces interpretable and usable clinical decision rules. Although highly interpretable, CART models tend to have lower prediction accuracy. In contrast, random forests consider an ensemble of decision trees over bootstrapped samples of the data and improve prediction accuracy at the cost of interpretability. We built the random forest models by averaging over 5000 trees. For each split in a single tree, we randomly chose the square root of the total number of predictors in the data set (√p) to be considered for that particular split. We used 2 measures to determine the importance of each predictor in the random forest models: (1) the mean decrease in accuracy, which attributes to each predictor the improvement in prediction accuracy accumulated across all splits, in all trees in the forest, that are based on that predictor; and (2) the mean decrease in the Gini index, a measure of node purity, accumulated for each predictor in the same way as the mean decrease in accuracy.40 Node purity is a measure of the homogeneity of the labels at a node; it is calculated as the overall variance across all categories of label classes at a node, and a smaller value means that a node contains observations predominantly from 1 class. Variable importance is therefore determined by the mean decrease in the Gini impurity index.
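
A sketch of the random forest specification and the 2 importance measures, using the randomForest package, follows; train_x and train_y are placeholder names for the predictor matrix and the binary outcome, and the settings shown simply mirror the description above.

    library(randomForest)

    p <- ncol(train_x)                       # number of candidate predictors

    rf_fit <- randomForest(
      x = train_x,
      y = factor(train_y),                   # intubation (or death) indicator
      ntree = 5000,                          # trees averaged in the ensemble
      mtry = floor(sqrt(p)),                 # predictors considered at each split
      importance = TRUE                      # compute both importance measures
    )

    importance(rf_fit, type = 1)             # mean decrease in accuracy
    importance(rf_fit, type = 2)             # mean decrease in Gini (node impurity)
    varImpPlot(rf_fit)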

To determine the incremental effect of including the interactions of our main predictors with the time course of the pandemic and with SDI quintiles, we compared estimates of the area under the receiver operating characteristic curve (AUROC) obtained with 5-fold cross-validation. Because of the evolving clinical landscape of this novel disease, health systems adapted to the emergent need and matured in their clinical practice over time. To ensure the robustness of our results and to mimic this real-world phenomenon, we additionally estimated the AUROC in a time-dependent cross-validation scheme, in which the training data were divided into 5 equally sized folds of nonoverlapping time periods based on the date of ED or inpatient admission; models were trained on the first time period of the data (patients admitted from the start of the pandemic through March 31, 2020), the AUROC was calculated on the next period, and so on. Similarly, awareness of this novel disease and its safety protocols varied widely across socioeconomic strata. Moreover, the effects of lockdown, mode of work (in person or remote), and the need to use public transportation for work likely varied substantially across the socioeconomic strata of the city. To ensure that our estimates of prediction accuracy were robust to such variations, we performed a 5-fold cross-validation in which each fold was defined by SDI quintile: models were trained on 4 quintiles and tested on the remaining quintile, and the process was repeated for each quintile.
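
One plausible reading of the time-dependent scheme is sketched below: patients are ordered by admission date, cut into 5 sequential folds, and each model is trained on the data up to a fold and evaluated on the next fold. The fold construction, the abbreviated predictor set, and the column names are assumptions, not the exact implementation.

    library(pROC)

    # 5 equally sized, nonoverlapping time periods ordered by admission date
    cohort <- cohort[order(cohort$admit_date), ]
    cohort$time_fold <- cut(seq_len(nrow(cohort)), breaks = 5, labels = FALSE)

    aucs <- numeric(4)
    for (k in 1:4) {
      train <- subset(cohort, time_fold <= k)        # earlier period(s)
      test  <- subset(cohort, time_fold == k + 1)    # next period
      fit   <- glm(intubated ~ age_group + sex + bmi + sbp + dbp + hf,
                   family = binomial, data = train)
      pred  <- predict(fit, newdata = test, type = "response")
      aucs[k] <- as.numeric(auc(roc(test$intubated, pred, quiet = TRUE)))
    }
    range(aucs)    # an analogous loop can use SDI quintile as the fold variable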

It is important to evaluate the calibration of predictive models to ensure that the predicted probabilities are accurately estimated, which is critical for clinical decision-making. Calibration of our logistic regression models was assessed by visual inspection of the calibration curve (plotting actual vs predicted probabilities); by estimating the intercept and slope of the calibration curve (a logistic regression of observed outcomes on predicted probabilities on the logit scale); by testing whether the intercept equals 0 and the slope equals 1; and by computing the Brier score, which measures the mean squared difference between the actual outcomes and the predicted probabilities. We performed the calibration analysis on the whole sample because we did not have a separate internal or external validation cohort. We did not calibrate the random forest models because random forests are known to provide consistent estimates of prediction probabilities in binary and multiclass classification.41,42 The CART models showed poor discrimination based on the AUROC and are not recommended for future use or for estimation of probabilities; therefore, we did not calibrate the CART models.41,42
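
The calibration quantities described here can be computed as in the sketch below, where pred holds predicted probabilities and obs the observed 0/1 outcomes (placeholder names).

    # Brier score: mean squared difference between outcome and predicted probability
    brier <- mean((pred - obs)^2)

    # Calibration intercept and slope: logistic regression of the observed outcome
    # on the predicted probability on the logit scale (ideal: intercept 0, slope 1)
    cal_fit <- glm(obs ~ qlogis(pred), family = binomial)
    coef(cal_fit)

    # Visual check: observed vs predicted risk by decile of predicted probability
    decile <- cut(pred, quantile(pred, probs = seq(0, 1, 0.1)), include.lowest = TRUE)
    plot(tapply(pred, decile, mean), tapply(obs, decile, mean),
         xlab = "Predicted probability", ylab = "Observed proportion")
    abline(0, 1, lty = 2)                    # 45-degree line = perfect calibration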

We did not exclude patients because of missing data in the analysis. Missing data in predictors (range, 3.6%-12.0%) were imputed using single imputation with a random forest imputation algorithm. Random forest imputation considers higher-order interactions between variables to predict missing values and does not rely on distributional assumptions (eg, normality of the data), making it a robust strategy for imputation.
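
A minimal sketch of single random forest imputation, assuming the missForest package and a predictor data frame named predictors_df, is:

    library(missForest)

    # Factors should be stored as factor columns so they are imputed as categories
    set.seed(2020)
    imp <- missForest(predictors_df)

    predictors_complete <- imp$ximp     # singly imputed data used in all models
    imp$OOBerror                        # out-of-bag estimate of imputation error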

Deriving COVID-19 Subphenotypes

Note: The following text was adapted from Su C, Zhang Y, Flory JH, et al. Clinical subphenotypes in COVID-19: derivation, validation, prediction, temporal patterns, and interaction with social determinants of health. NPJ Digit Med. 2021;4(1):110. doi:10.1038/s41746-021-00481-w [PMC free article: PMC8280198] [PubMed: 34262117] [CrossRef]

Cohort development

Considering the population diversity of the 5 medical centers that contributed data to the first version of the INSIGHT COVID-19 database, we combined patients from 4 centers and randomly divided them into the development cohort (70%) and internal validation cohort (30%). Patients of the last center were used as the external validation cohort.

Candidate measures for subphenotype derivation

We considered 30 clinical variables associated with COVID-19 onset, symptoms, or outcomes and available in the INSIGHT database as the candidate variables to derive subphenotypes, as shown in Table 1. The variables included inflammatory markers, inflammatory and hepatic markers, hepatic markers, markers of cardiovascular conditions, markers of kidney dysfunctions, markers of hematologic dysfunctions, and oxygen saturation. Seven variables with high missingness (missing more than 70% of values) were excluded, and the remaining 23 variables were used for deriving subphenotypes.

Table 1. Use of the Candidate Clinical Variables in Subphenotype Analysis.

For each patient, we extracted the first value of each clinical variable within the collection window, which was defined as (1) the time period from COVID-19 confirmation until the admission date of the first inpatient encounter, if the patient had an inpatient admission within 14 days after confirmation, or (2) the full 14-day period after COVID-19 confirmation if there were only ED encounters but no inpatient admissions following the COVID-19 confirmation. If there was no record in the collection window, we extracted the last value within 3 days before confirmation.
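
The extraction rule could be expressed as in the sketch below, assuming a long laboratory table (labs) and a per-patient table of confirmation dates and window end dates (windows); all names are illustrative.

    library(dplyr)

    first_values <- labs %>%
      inner_join(windows, by = "patient_id") %>%
      arrange(patient_id, variable, result_date) %>%
      group_by(patient_id, variable) %>%
      summarise(
        value = {
          in_window <- value[result_date >= confirm_date & result_date <= window_end]
          before    <- value[result_date >= confirm_date - 3 & result_date < confirm_date]
          if (length(in_window)) first(in_window)      # first value within the window
          else if (length(before)) last(before)        # else last value in prior 3 days
          else NA_real_
        },
        .groups = "drop"
      )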

We also examined other clinical characteristics of the patients, including demographics, comorbidities, and BMI. Demographics included age, sex, and race. Baseline comorbidities included hypertension, diabetes, CAD, HF, COPD, asthma, cancer, obesity, and hyperlipidemia. For each patient, we collected the most recent BMI data. We analyzed the 2 outcomes from aims A and B, intubation and mortality, for all patients. The mortality examined across subphenotypes was primarily in-hospital mortality, although some health systems also captured mortality after discharge.

Clustering methods

We first derived subphenotypes by using the development cohort. More specifically, agglomerative hierarchical clustering with Euclidean distance calculation and the Ward linkage criterion was applied to the 23 clinical variables after data preparation.43 We used agglomerative hierarchical clustering because it is robust to different types of data distributions and typically produces a dendrogram that visualizes the data structure to help determine the optimal cluster number. In addition to the dendrogram, we calculated 21 clustering evaluation measures provided by the NbClust R package to determine the optimal number of clusters (ie, subphenotypes).44
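
In R, the derivation step might look like the sketch below, with X standing for the matrix of the 23 prepared clinical variables; ward.D2 is one common Ward implementation in hclust, and the NbClust arguments shown are illustrative.

    library(NbClust)

    d  <- dist(X, method = "euclidean")        # pairwise Euclidean distances
    hc <- hclust(d, method = "ward.D2")        # agglomerative clustering, Ward linkage
    plot(hc)                                   # dendrogram used to judge cluster number

    # Internal validity indices vote on the optimal number of clusters
    nb <- NbClust(X, distance = "euclidean", min.nc = 2, max.nc = 8,
                  method = "ward.D2", index = "all")

    subphenotype <- cutree(hc, k = 4)          # assign the 4 derived subphenotypes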

Subphenotype validation

To evaluate reproducibility, we validated our subphenotypes in 3 ways. First, we performed sensitivity analyses using the development cohort to evaluate (1) sensitivity to quality control of missingness of features and of patients with outlier feature values and (2) sensitivity to the choice of clustering algorithm. To assess sensitivity to quality control of missingness and of patients with outlier values, we incorporated all 30 candidate variables without excluding any features because of a high missingness rate and excluded patients who had outlier values, defined as values outside the range [μ − 5σ, μ + 5σ], where μ and σ are the mean and SD of a specific feature. Then, as in the primary analysis, we performed agglomerative hierarchical clustering to re-derive subphenotypes and determined the optimal cluster number using the dendrogram and NbClust. To assess the sensitivity of the number and characteristics of the identified subphenotypes to different clustering algorithms, we re-derived subphenotypes using a Gaussian mixture model, a probabilistic model for clustering based on a mixture of Gaussian distributions.45 The optimal cluster number for the Gaussian mixture model was determined by jointly considering the Akaike information criterion, the Bayesian information criterion, and the median probability of group membership.
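
The outlier rule and the Gaussian mixture sensitivity analysis could be sketched as below, assuming the mclust package and a matrix X_all of all 30 candidate variables; the AIC is computed by hand from the fitted log-likelihood because Mclust reports BIC by default.

    library(mclust)

    # Exclude patients with any feature outside [mu - 5*sd, mu + 5*sd]
    mu   <- colMeans(X_all, na.rm = TRUE)
    sdv  <- apply(X_all, 2, sd, na.rm = TRUE)
    keep <- apply(X_all, 1, function(row) all(abs(row - mu) <= 5 * sdv, na.rm = TRUE))
    X_qc <- X_all[keep, ]

    # Gaussian mixture model over a range of cluster numbers
    gmm <- Mclust(X_qc, G = 2:8)
    gmm$BIC                                    # Bayesian information criterion
    2 * gmm$df - 2 * gmm$loglik                # Akaike information criterion
    median(apply(gmm$z, 1, max))               # median probability of group membership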

Second, we used the internal validation cohort and re-derived subphenotypes by using the same agglomerative hierarchical clustering procedure as in the primary analysis. The optimal cluster number was determined by using the dendrogram and the NbClust package. Third, we repeated the same clustering procedure in the external validation cohort.

Changes to the Original Study Protocol

Given the changing clinical environment during the COVID-19 crisis, our team assessed our analytical approach to produce meaningful contributions to the existing COVID-19 literature. Several factors, such as health system ICU capacity during the surge period of COVID-19 cases and changing standards of clinical care, required additional data validation and new methodologies for this research. Our clinician researchers and critical care experts concluded that across NYC, many hospital spaces other than ICUs were used to provide ICU care, and these changes were not captured in the EHR data. Therefore, EHR data would not accurately and comprehensively identify all critical care patients. As a result, the need for ICU designation, as captured in EHRs, would underestimate the disease severity of patients hospitalized with COVID-19 and make the models for predicting ICU need as a proxy of disease severity less reliable. Therefore, our research team determined, in consultation with PCORI, that combining aims 1 and 3 and using intubation as the primary outcome would most accurately enable us to develop models for predicting the severity of COVID-19 and the need for critical care.

Results

Aim A. Predict the Risk of Intubation Among Patients Hospitalized With COVID-19

Figures 1 and 2 depict the distribution of COVID-19 cases and of severe outcomes (intubation and death) over time during the period from March 1, 2020, to February 8, 2021. Specifically, there was an initial surge that peaked toward the end of March 2020, followed by a period of decline and then a plateau phase in which cases and outcomes were stable; finally, a second wave of cases peaked in late January 2021. During the second wave, the rates of severe COVID-19 outcomes were lower than in the first wave in March-April 2020.

Figure 1. 7-Day Average of COVID-19 Cases.

Figure 2. 7-Day Average of Severe COVID-19 Outcomes of Intubation and Death.

In the cohort of 30 016 patients with COVID-19, the median (IQR) age was 59.5 (43.2-72.4) years, 50.8% were men, 63.5% were of a race other than White, and 36.4% were of Hispanic ethnicity. Patients in this cohort had a history of comorbid disorders: 53.6% had hypertension, 38.6% had hyperlipidemia, 13.4% had COPD, 25.1% had CAD, 14.0% had HF, 15.6% had asthma, and 32.9% had diabetes. The mean (SD) systolic and diastolic blood pressures for the cohort were 127 (21.7) mm Hg and 73.9 (14.1) mm Hg, respectively. Overall, 11.8% of patients died from COVID-19.

Table 2 provides a detailed breakdown of the risk factors (demographics, history of comorbidities, and vital signs) for the outcome of intubation. All risk factors were significantly different by intubation status, except race and history of asthma.

Table 2. Baseline Characteristics of Patients With COVID-19 Included in the Study, by Intubation Status.

Table 3. Cross-Validated AUROC Comparing 3 Models Across Varying Statistical Methods Predicting Intubation.

The models with the primary predictors showed good performance; the AUROC of these models ranged from 0.55 to 0.72 across the 3 statistical methods. An AUROC of 0.50 suggests no discrimination, that is, no better ability to predict an outcome than a random guess. An AUROC between 0.70 and 0.80 is considered acceptable, between 0.80 and 0.95 is considered great discrimination, and between 0.95 and 1.0 is excellent; an AUROC of 1.0 indicates perfect discrimination. Inclusion of time and its interactions with the main predictors modestly improved the prediction accuracy for intubation using logistic regression and random forests, but inclusion of SDI quintiles and their interactions with the main predictors did not improve the prediction accuracy of intubation for any method. Of note, the logistic regression and random forest models had similar discrimination as measured by the AUROC. The CART model, however, performed poorly, with an AUROC only slightly higher than chance alone. Our logistic models were well calibrated, as shown in Figure 3. The Brier scores for all models of intubation were low, and for all models the joint test failed to reject the null hypothesis that the calibration curve has an intercept of 0 and a slope of 1. This finding implies that our logistic models provide accurate estimates of risk probabilities in addition to having good discrimination, which is critical for clinical decision-making.46

Figure 3. Calibrations of Logistic Regressions for Intubation.

The time-dependent cross-validation demonstrated that training and testing the model in patients who were admitted to the ED or hospital at different time periods slightly reduced the prediction accuracy. For the outcome of intubation, the range of AUROC across the folds was 0.70 to 0.79 for logistic regressions, 0.69 to 0.79 for random forests, and 0.50 to 0.62 for CART models. To capture differential effects of the pandemic-related restrictions on different socioeconomic strata, we additionally performed a 5-fold cross-validation strategy that captured this variation, where the folds were determined by each SDI quintile. We trained the model using data from 4 folds and tested the model using data from the remaining fold. We repeated this process 5 times; patients from an SDI quintile were used to test the model each time.

For the outcome of intubation, this 5-fold cross-validation strategy had a slightly reduced AUROC of 0.66 to 0.74 for logistic regressions, 0.66 to 0.73 for random forests, and 0.53 to 0.54 for CART models. Detailed regression output and figures are included in Appendix A Figures 1 through 10 and Appendix A Tables 1 through 4.

For the outcome of intubation, the facility of treatment (P < .001), a history of HF (P < .001), age (P = .009), and diastolic blood pressure (P < .001) had significant interactions with time in the logistic regression models (Figures 4-7). Facility 1 had the steepest decline in risk of intubation over time, while facility 5 had the least steep decline. The risk of intubation among patients with a history of HF was similar to the risk among patients without a history of HF at the beginning of the pandemic in NYC; toward the end of the study period, however, the risk of intubation was higher among patients with HF than among patients without a history of HF. Risk of intubation was higher among middle-aged (45-64 years of age) and older (>64 years of age) patients at the beginning of the pandemic and remained higher toward the end of the study period, while risk of intubation among younger adults (18-44 years of age) was lower than among middle-aged and older adults during the entire study period. Risk of intubation declined over the course of the pandemic for all quartiles of diastolic blood pressure, with the lowest quartile having the highest risk and the highest quartile the lowest risk.

Figure 4. The Probability of Intubation for COVID-19 (Vertical Axis) Over Time (Weeks From March 1, 2020—Horizontal Axis), by Facility.

Figure 5. The Probability of Intubation for COVID-19 (Vertical Axis) Over Time (Weeks From March 1, 2020—Horizontal Axis), by Heart Failure.

Figure 6. The Probability of Intubation for COVID-19 (Vertical Axis) Over Time (Weeks From March 1, 2020—Horizontal Axis), by Age.

Figure 7. The Probability of Intubation for COVID-19 (Vertical Axis) Over Time (Weeks From March 1, 2020—Horizontal Axis), by Diastolic Blood Pressure.

Deriving COVID-19 Subphenotypes and the Variation of Intubation

A total of 14 418 patients with confirmed COVID-19 between March 1 and June 12, 2020, treated in the ED or inpatient setting were included in the subphenotyping analysis. The development cohort contained 8199 patients, with a median (IQR) age of 65.35 (50.57-75.17) years and consisting of 3787 (46.2%) women, 2036 (24.8%) White patients, and 2155 (26.3%) Black patients. The internal validation cohort, which was created by randomly selecting 30% of patients from the 4 INSIGHT-affiliated health systems, contained 3519 patients with characteristics similar to those of the development cohort, with a median (IQR) age of 63.51 (50.95-75.17) years and consisting of 1585 (45.0%) women, 838 (23.8%) White patients, and 915 (26%) Black patients. The external validation cohort contained 2700 patients. It had a median (IQR) age of 65.85 (51.08-77.38) years and consisted of 1305 (48.3%) women, 675 (25.0%) White patients, and 545 (20.2%) Black patients. Across the development, internal, and external cohorts, the overall mortality rates 60 days after COVID-19 confirmation were 18.65%, 19.78%, and 20.59%, respectively.

In the development cohort, the agglomerative hierarchical clustering model identified 4 distinct subphenotypes based on the presenting clinical data of the patients. Characteristics, including demographics, clinical variables, comorbidities, clinical outcomes, and medication-based treatments, across the 4 subphenotypes are presented in Table 4. Results for the internal and external validation cohorts are presented in Appendix A Tables 5 and 6.

Table 4. Characteristics of the Identified Subphenotypes (Development Cohort).

Subphenotype I consisted of 2707 (33.02%) patients. Compared with the others, it included more young (median [IQR] age, 57.45 [42.70-70.02] years) and female (n = 1601 [59.15%]) patients. Those patients had more normal values across all clinical variables and a lower chronic comorbidity burden. The patients also had better clinical outcomes, with a lower rate of intubation (n = 190 [7.02%]).

Subphenotype II consisted of 3047 (37.16%) patients. Compared with other subphenotypes, it included more male patients (n = 1941 [63.70%]) and was likely to have more abnormal inflammatory markers (such as C-reactive protein, erythrocyte sedimentation rate, interleukin 6, lactate dehydrogenase, lymphocyte count, neutrophil count, white blood cell count, and ferritin) and markers of hepatic dysfunctions (such as ferritin, alanine aminotransferase, aspartate aminotransferase, and bilirubin). Subphenotype II had a higher rate of intubation than subphenotype I (n = 527 [17.30%]).

Subphenotype III included 1486 (18.12%) patients and consisted of older patients (median [IQR] age, 69.45 [57.05-79.62] years) and a higher proportion of Black patients (n = 503 [33.85%]) compared with subphenotypes I and II. Patients in subphenotype III were likely to have more abnormal kidney dysfunction markers (such as blood urea nitrogen, creatinine, chloride, and sodium) and hematologic dysfunction markers (such as dimerized plasmin fragment, hemoglobin, and red blood cell distribution width). Subphenotype III had a lower intubation rate than subphenotypes II and IV (n = 195 [13.12%]).

Subphenotype IV included 959 (11.70%) patients. Compared with other subphenotypes, it included older (median [IQR] age, 75.53 [64.10-84.83] years) and male (n = 588 [61.31%]) patients. Those patients of subphenotype IV had more abnormal values across all clinical variables and higher chronic comorbidity burden than the others. In line with its biological characteristics, subphenotype IV had the worst clinical outcomes, with the highest rate of intubation (n = 242 [25.23%]). In addition, medications, including antibiotics, corticosteroids, and vasopressors, were more frequently used in subphenotype IV.

Figure 8 presents the temporal trends of the COVID-19 subphenotypes in the development cohort since the outbreak in NYC (ie, March 1, 2020). Except for weeks 1 and 14, each of which had few confirmed COVID-19 cases, the composition of the 4 subphenotypes per week evolved over time. In general, the number of patients with confirmed COVID-19 increased rapidly during the first month of the outbreak and reached its peak in week 5 (early April). Subphenotype I (mild symptoms) and subphenotype II (moderate symptoms, low comorbidity burden) dominated the period before the peak (the first 4 weeks after the outbreak). In contrast, subphenotype IV (severe symptoms, high comorbidity burden) accounted for a low proportion within the first 4 weeks but showed a greatly increased proportion from weeks 6 through 9. From week 10 onward, the proportion of subphenotype I gradually increased, while the other subphenotypes, especially subphenotype IV, decreased. Subphenotype III (moderate symptoms, high comorbidity burden) had a relatively stable proportion over time. Results of the internal and external cohorts are presented in Appendix A Figures 11 and 12.

Figure 8. Temporal Trends of COVID-19 Subphenotypes (Development Cohort).

Aim B. Predict the Risk of In-Hospital Mortality Among Patients Hospitalized With COVID-19

A full description of the COVID-19 patient cohort (N = 30 016) used in our prediction models is provided in Table 2. Table 5 provides a detailed breakdown of the risk factors (demographics, history of comorbidities, and vital signs) by the death outcome. All risk factors were significantly associated with the risk of death, with the exception of asthma.

Table 5. Baseline Characteristics of Patients With COVID-19 Included in the Study, by the Death Outcome.

Next, we examined the accuracy of the prediction model using our main predictors of demographic variables, comorbidities, and vital signs (Table 6) to predict death and compared it with the accuracy obtained after including time and the interactions of time with the main predictors and after including SDI quintiles and their interactions with the main predictors (Table 6). The goal of this analysis was to determine whether the effect of the risk factors on severe COVID-19 outcomes changed during the pandemic and across the socioeconomic strata of NYC. The AUROC of the main models ranged from 0.64 to 0.81. Inclusion of time and its interactions with the main predictors modestly improved the prediction accuracy for the risk of in-hospital mortality using logistic regression and random forests. Inclusion of SDI quintiles and their interactions with the main predictors, however, did not improve the prediction accuracy of in-hospital mortality for any method. Our logistic models were well calibrated, as shown in Figure 9. The Brier scores for all models were low, and for all models the joint test failed to reject the null hypothesis that the calibration curve has an intercept of 0 and a slope of 1. This finding implies that our logistic models provide accurate estimates of risk probabilities in addition to having good discrimination, both of which are critical for clinical decision-making.46

Table 6. Cross-Validated AUROC Comparing 3 Models Across Varying Statistical Methods Predicting In-Hospital Mortality.

Figure 9. Calibrations of Logistic Regressions for Inpatient Mortality.

As a sensitivity analysis, we performed the time-dependent cross-validation scheme to capture the changing clinical landscape of this novel disease. The time-dependent cross-validation produced similar results, demonstrating that training and testing the model in different time periods slightly improved the prediction accuracy. For the outcome of death, the range of AUROC across the folds was 0.78 to 0.85 for logistic regression, 0.79 to 0.85 for random forests, and 0.64 to 0.68 for the CART model. To capture differential effects of the pandemic-related restrictions on different socioeconomic strata, we additionally performed a 5-fold cross-validation strategy on model 1 that captured this variation, where the folds were determined by SDI quintile. The outcome of death had an AUROC range of 0.76 to 0.79 for logistic regression, 0.77 to 0.79 for random forests, and 0.59 to 0.61 for the CART model. Detailed regression output and figures are included in Appendix A Figures 3 through 7 as well as Appendix A Tables 1 through 4.

The modest improvement in prediction accuracy when the course of the pandemic (ie, the number of weeks since March 2020) was included in the model along with its interactions with demographic, comorbidity, and vital sign predictors suggests that risk factors for severe outcomes of COVID-19 did change over the course of the pandemic as clinical practice evolved. We therefore focused on the 2-way interaction effects that were statistically significant in predicting death in the logistic regression model. The probability of death declined over time, but the rate of decline varied significantly (P < .001) across the 3 facilities (deidentified) of the INSIGHT CRN (Figure 10). Facility 1 had the steepest decline in the probability of death from COVID-19 over time, while facility 8 had the least steep decline. The probability of death from COVID-19 also declined significantly over time across quartiles of systolic blood pressure (P < .001), with the lowest quartile of systolic blood pressure having the highest risk both at the beginning of the pandemic and toward the end, compared with the median and the third quartile of systolic blood pressure (Figure 11).

Figure 10. Probability of Death From COVID-19 Over Time, by Facility.

Figure 11. Probability of Death From COVID-19 Over Time, by Systolic Blood Pressure.
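As a simplified illustration of this type of interaction analysis (not our exact model specification), the following Python sketch fits a logistic regression with time-by-facility and time-by-systolic-blood-pressure-quartile interactions; the column names death, week, facility, and sbp_quartile are hypothetical:

import pandas as pd
import statsmodels.formula.api as smf

def fit_time_interaction_model(df: pd.DataFrame):
    # Logistic regression of in-hospital death on weeks since March 2020, facility,
    # and systolic blood pressure quartile, with 2-way interactions between time
    # and each grouping variable.
    formula = "death ~ week * C(facility) + week * C(sbp_quartile)"
    return smf.logit(formula, data=df).fit(disp=0)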

Deriving COVID-19 Subphenotypes and Their Risk of Mortality

A detailed description of each COVID-19 subphenotype is provided in the section “Deriving COVID-19 Subphenotypes and the Variation of Intubation.” The mortality rates for each subphenotype generally followed the same pattern observed for intubation. Subphenotype I had a low 60-day mortality (n = 188 [6.94%]). Despite its lower overall comorbidity burden, subphenotype II had a mortality rate (n = 528 [17.33%]) closer to that of subphenotype III (n = 337 [22.68%]). As with intubation, subphenotype IV had the highest mortality rate (n = 476 [49.64%]).

Discussion

In this project, we developed prediction models for 2 clinically important outcomes among patients with COVID-19: intubation and in-hospital mortality. Using patient demographics, comorbidities, vital signs, and other characteristics as predictors, these models showed good accuracy in predicting intubation and in-hospital mortality. In addition, we developed and validated subphenotypes of patients with COVID-19 using clustering analysis; we found that patient characteristics and outcomes differ significantly across subphenotypes. These results provided important evidence about the heterogeneity of patients with COVID-19 and can be used to inform clinical decision-making.

The rapid spread of COVID-19 has imposed significant burdens on patients and hospitals in the United States. Prediction models have been an important tool for facilitating clinical decision-making in various clinical settings. Therefore, developing and validating prediction models for patients with COVID-19 has great potential to improve patient outcomes during the pandemic. Prediction models can help clinicians identify patients at higher risk of adverse outcomes because of preexisting conditions and other risk factors, and clinicians can tailor treatment for these high-risk patients using the information that the prediction models provide. In this study, we used 1 of the largest COVID-19 EHR databases, which includes patients from multiple health systems in NYC across different stages of the pandemic. By using this unique data set, a major contribution of this study is the validation of our findings in a more representative and diverse sample. Our results showed that the prediction models performed consistently across different periods of the pandemic and across different socioeconomic strata. These findings are consistent with studies showing a decline in mortality risk over time.47,48 To the best of our knowledge, compared with previous studies with similar objectives, our study is unique in analyzing data from multiple health systems in a diverse US metropolitan city and in using data that span nearly a year from the beginning of the pandemic.23,48-50 These findings indicate that, for a novel disease with great morbidity and mortality, prediction models can be developed to identify patients at higher risk of adverse outcomes, and it is important to evaluate the stability of risk predictors over time and across socioeconomic subgroups.

In addition, we used statistical methods to develop and validate subphenotypes to understand the clinical heterogeneity of patients with COVID-19, and we examined how mortality and intubation varied across them. All validation approaches confirmed the reproducibility of the 4-cluster structure of the data and the clinical characteristics of the identified subphenotypes. The 4 subphenotypes we identified differed significantly in demographics, clinical variables, and chronic comorbidities and were strongly predictive of the mortality outcome. For example, subphenotype IV included more older, male patients; had abnormal markers indicating hyperinflammation, liver injury, cardiovascular problems, kidney dysfunction, and coagulation disorders; and carried a higher comorbidity burden compared with the other subphenotypes. In contrast, subphenotype I was composed of relatively healthy, younger female patients who had more normal values across all markers and a lower comorbidity burden compared with the other subphenotypes. Given the routine collection of the variables used in our analyses, our models and derived subphenotypes can easily be implemented in clinical practice as well as in clinical trial enrollment; they can also be made readily available for clinicians.
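As a simplified illustration of this type of clustering analysis (not necessarily our exact pipeline, which also included cluster-number selection and validation), the following Python sketch applies agglomerative clustering with Ward's linkage to standardized presenting laboratory markers, where marker_cols is a hypothetical list of marker columns:

import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

def derive_subphenotypes(df: pd.DataFrame, marker_cols, n_clusters=4) -> pd.Series:
    # Standardize the markers, then group patients into clusters using
    # Ward's hierarchical agglomerative clustering.
    X = StandardScaler().fit_transform(df[marker_cols])
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(X)
    return pd.Series(labels, index=df.index, name="subphenotype")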

Previous studies have examined the temporal trends of COVID-19 outcomes, such as the in-hospital mortality rate over the course of the pandemic, but limited attention has been paid to evolving patterns of COVID-19 phenotypes.47,51 We filled this gap in the present study. Our observations suggested varied temporal trends of the identified subphenotypes during the first 14 weeks of the pandemic in NYC. Interestingly, after the COVID-19 outbreak in NYC on March 1, 2020, subphenotypes I and II dominated the period before the peak (the first 4 weeks of the outbreak), while the proportion of subphenotype IV increased within the second month (April 2020), after the peak in spread, consistent with the high mortality rate in NYC in April. This finding suggests that younger, biologically stronger patients (subphenotypes I and II) developed infections early and accelerated the spread, while older, biologically vulnerable patients (subphenotype IV) accounted for a larger share of subsequent infections. After that, the proportion of subphenotype I among all patients confirmed per week gradually expanded, while the proportions of the other subphenotypes, especially subphenotype IV, shrank.

Significant disparities in COVID-19 outcomes have been identified since the beginning of the pandemic. Patients from disadvantaged social conditions are more likely to be infected and to experience adverse outcomes.34,52,53 We included the SDI at the patient neighborhood level, along with its interactions with the other major predictors, in the prediction model, and we did not find that it increased the prediction accuracy. It is possible that SDI is highly correlated with other predictors in the model, such as patient comorbidities, and therefore did not further improve the prediction accuracy. For example, previous studies found that patients from socially disadvantaged neighborhoods are more likely to be members of racial/ethnic minority groups and to have multiple chronic conditions.54-56 It is also possible that social conditions at the individual patient level (eg, occupation, income, household crowding, education), rather than at the neighborhood level, are stronger predictors of COVID-19 outcomes. Future studies that include more granular social determinants of health (SDOH) information (eg, SDOH at the individual patient level) as predictors may be warranted.

Limitations

Although this study is a new contribution to the efforts to predict adverse COVID-19 outcomes and parse the biological heterogeneity of COVID-19, several limitations remain.

First, our data-driven approach relied on the availability of patient data. In this study, we developed prediction models and identified subphenotypes by using routinely collected clinical variables correlated with COVID-19 and available in the INSIGHT database. We were not able to extract other important patient characteristics, such as presenting symptoms. Incorporating such data would add new insights for clinical care.

Second, in our study, the analyzed data were collected at ED or hospital presentation, so the time from COVID-19 symptom onset to ED or hospital presentation could be a covariate of disease severity and clinical outcomes. Such data, however, were not available in the INSIGHT database. In addition, deaths in our data were primarily inpatient deaths. Although some health systems tracked deaths after discharge in the first version of the COVID-19 database, it is possible that deaths in the community or in other health systems were underreported in our data.

Third, our study was based on ED and inpatient clinical data, such that each patient was characterized by a snapshot in time. Although this approach makes the prediction models and subphenotypes available to clinicians in a timely manner, fuller use of longitudinal patient data may capture the complexity of the disease and thereby support better prediction models and more informative subphenotypes. The collection of multivariate, longitudinal data in large cohorts remains challenging, and modeling such data to develop prediction models and identify subphenotypes requires improved data-driven methods.

Fourth, this is a multi-institution analysis in NYC. To evaluate the generalizability of the identified subphenotypes, further validation using data collected from other geographic areas in the United States is needed in future work.

Fifth, AUROCs of our predictive models were estimated using 5-fold cross-validation, and the calibration properties of our models were evaluated in the whole cohort. Our predictive models were designed to demonstrate the predictive ability of common predictors and evaluate how the risk estimates of these predictors on intubation and in-hospital mortality changed over time and across SDI groups. Therefore, further research is needed to externally validate our prediction models.

Sixth, we found that risk of intubation and in-hospital mortality declined over time during the period of the study. Moreover, our study was restricted to data from the period before vaccines became widely available. Therefore, risk of intubation and in-hospital mortality could potentially decrease further in a vaccinated population. Our predictive models should be updated and recalibrated for application to contemporary or future patients with COVID-19. Nevertheless, our predictive models provide a foundation based on the Alpha variant of the virus and could remain useful for current or future variants as the virus continues to evolve. In addition, discoveries regarding the Alpha variant may have implications for future pandemics.

Finally, our risk-prediction models included patients who presented to the ED regardless of whether they were discharged from the ED or admitted to the hospital. Patients who were seen in the ED without hospitalization had lower probabilities of intubation and death compared with hospitalized patients. Further research should be conducted to investigate risk-prediction models in the subgroup that was hospitalized. In addition, we excluded patients admitted from nursing homes in this study. Patients in nursing homes are more likely to be older and sicker than other patients. Therefore, results from this project may not be generalizable to patients in nursing homes.

Future Research

Large numbers of patients who have been infected with SARS-CoV-2 continue to experience various symptoms long after they have recovered from the initial stages of COVID-19. Further research is needed to follow up with patients to understand the long-term impact of COVID-19 on patient outcomes. Prediction models, such as those that we developed, will be useful for identifying patients at high risk of developing adverse long-term outcomes so that effective interventions can be targeted. Our prediction models were developed using data from before vaccines were widely available. Hospitalized patients who are vaccinated may have different outcomes than those who are not vaccinated. Therefore, our models should be reevaluated and updated in vaccinated cohorts.57 Our prediction models should also be evaluated in the high-risk subgroup of patients with COVID-19 who are hospitalized from the ED.

Conclusions

Using multi-institutional EHR data from the INSIGHT CRN, we developed prediction models for death and intubation among patients with COVID-19. Our prediction models, which use routinely collected patient demographics, comorbidities, and vital signs, showed strong performance in predicting severe outcomes of COVID-19. With the large sample of patients with COVID-19 in NYC from the INSIGHT CRN, we validated our prediction models across different time periods of the pandemic and across different neighborhood socioeconomic statuses. We also derived 4 subphenotypes to understand the clinical heterogeneity of patients with COVID-19, demonstrating the significant variation in demographic characteristics, clinical characteristics, and COVID-19 outcomes among clustered groups of patients with COVID-19. By using variables routinely collected in EHR systems, our models and biological subphenotypes provide accessible and generalizable tools that health care practitioners can use to improve health outcomes among patients with COVID-19.58

References

1.
Richardson S, Hirsch JS, Narasimhan M, et al. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area. JAMA. 2020;323(20):2052-2059. doi:10.1001/jama.2020.6775 [PMC free article: PMC7177629] [PubMed: 32320003] [CrossRef]
2.
Grasselli G, Zangrillo A, Zanella A, et al. Baseline characteristics and outcomes of 1591 patients infected with SARS-CoV-2 admitted to ICUs of the Lombardy region, Italy. JAMA. 2020;323(16):1574-1581. doi:10.1001/jama.2020.5394 [PMC free article: PMC7136855] [PubMed: 32250385] [CrossRef]
3.
Goodman JL, Borio L. Finding effective treatments for COVID-19: scientific integrity and public confidence in a time of crisis. JAMA. 2020;323(19):1899-1900. doi:10.1001/jama.2020.6434 [PubMed: 32297900] [CrossRef]
4.
Wang D, Hu B, Hu C, et al. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China. JAMA. 2020;323(11):1061-1069. doi:10.1001/jama.2020.1585 [PMC free article: PMC7042881] [PubMed: 32031570] [CrossRef]
5.
COVID-19: data. NYC Health. Accessed June 15, 2022. https://www1​.nyc.gov​/site/doh/covid/covid-19-data.page
6.
Arentz M, Yim E, Klaff L, et al. Characteristics and outcomes of 21 critically ill patients with COVID-19 in Washington state. JAMA. 2020;323(16):1612-1614. doi:10.1001/jama.2020.4326 [PMC free article: PMC7082763] [PubMed: 32191259] [CrossRef]
7.
Inciardi RM, Lupi L, Zaccone G, et al. Cardiac involvement in a patient with coronavirus disease 2019 (COVID-19). JAMA Cardiol. 2020;5(7):819-824. doi:10.1001/jamacardio.2020.1096 [PMC free article: PMC7364333] [PubMed: 32219357] [CrossRef]
8.
Moghadas SM, Shoukat A, Fitzpatrick MC, et al. Projecting hospital utilization during the COVID-19 outbreaks in the United States. Proc Natl Acad Sci U S A. 2020;117(16):9122-9126. doi:10.1073/pnas.2004064117 [PMC free article: PMC7183199] [PubMed: 32245814] [CrossRef]
9.
Weissman GE, Crane-Droesch A, Chivers C, et al. Locally informed simulation to predict hospital capacity needs during the COVID-19 pandemic. Ann Intern Med. 2020;173(1):21-28. doi:10.7326/M20-1260 [PMC free article: PMC7153364] [PubMed: 32259197] [CrossRef]
10.
Cavallo JJ, Donoho DA, Forman HP. Hospital capacity and operations in the coronavirus disease 2019 (COVID-19) pandemic—planning for the Nth patient. JAMA Health Forum. 2020;1(3):e200345. doi:10.1001/jamahealthforum.2020.0345 [PubMed: 36218595] [CrossRef]
11.
Feng C, Wang L, Chen X, et al. A novel artificial intelligence-assisted triage tool to aid in the diagnosis of suspected COVID-19 pneumonia cases in fever clinics. Ann Transl Med. 2021;9(3):201. doi:10.21037/atm-20-3073 [PMC free article: PMC7940949] [PubMed: 33708828] [CrossRef]
12.
Yue H, Yu Q, Liu C, et al. Machine learning-based CT radiomics method for predicting hospital stay in patients with pneumonia associated with SARS-CoV-2 infection: a multicenter study. Ann Transl Med. 2020;8(14):859. doi:10.21037/atm-20-3026 [PMC free article: PMC7396749] [PubMed: 32793703] [CrossRef]
13.
Xie J, Hungerford D, Chen H, et al. Development and external validation of a prognostic multivariable model on admission for hospitalized patients with COVID-19. medRxiv. Preprint posted online April 7, 2020. doi:10.1101/2020.03.28.20045997 [CrossRef]
14.
Wynants L, Van Calster B, Bonten MMJ, et al. Prediction models for diagnosis and prognosis of COVID-19 infection: systematic review and critical appraisal. BMJ. 2020;369:m1328. doi:10.1136/bmj.m1328 [PMC free article: PMC7222643] [PubMed: 32265220] [CrossRef]
15.
Sperrin M, Grant SW, Peek N. Prediction models for diagnosis and prognosis in COVID-19. BMJ. 2020;369:m1464. doi:10.1136/bmj.m1464 [PubMed: 32291266] [CrossRef]
16.
Vaid A, Somani S, Russak AJ, et al. Machine learning to predict mortality and critical events in a cohort of patients with COVID-19 in New York City: model development and validation. J Med Internet Res. 2020;22(11):e24018. doi:10.2196/24018 [PMC free article: PMC7652593] [PubMed: 33027032] [CrossRef]
17.
Levy TJ, Richardson S, Coppa K, et al. Development and validation of a survival calculator for hospitalized patients with COVID-19. medRxiv. Preprint posted online April 27, 2020. doi:10.1101/2020.04.22.20075416 [CrossRef]
18.
Vazquez Guillamet MC, Vazquez Guillamet R, Kramer AA, et al. Toward a COVID-19 score-risk assessments and registry. medRxiv. Preprint posted online April 20, 2020. doi:10.1101/2020.04.15.20066860 [CrossRef]
19.
Tabata S, Imai K, Kawano S, et al. Clinical characteristics of COVID-19 in 104 people with SARS-CoV-2 infection on the Diamond Princess cruise ship: a retrospective analysis. Lancet Infect Dis. 2020;20(9):1043-1050. doi:10.1016/S1473-3099(20)30482-5 [PMC free article: PMC7292609] [PubMed: 32539988] [CrossRef]
20.
Desai N, Neyaz A, Szabolcs A, et al. Temporal and spatial heterogeneity of host response to SARS-CoV-2 pulmonary infection. Nat Commun. 2020;11(1):6319. doi:10.1038/s41467-020-20139-7 [PMC free article: PMC7725958] [PubMed: 33298930] [CrossRef]
21.
Wiersinga WJ, Rhodes A, Cheng AC, Peacock SJ, Prescott HC. Pathophysiology, transmission, diagnosis, and treatment of coronavirus disease 2019 (COVID-19): a review. JAMA. 2020;324(8):782-793. doi:10.1001/jama.2020.12839 [PubMed: 32648899] [CrossRef]
22.
Gupta S, Wang W, Hayek SS, et al. Association between early treatment with tocilizumab and mortality among critically ill patients with COVID-19. JAMA Intern Med. 2021;181(1):41-51. doi:10.1001/jamainternmed.2020.6252 [PMC free article: PMC7577201] [PubMed: 33080002] [CrossRef]
23.
Domecq JP, Lal A, Sheldrick CR, et al. Outcomes of patients with coronavirus disease 2019 receiving organ support therapies: the International Viral Infection and Respiratory Illness Universal Study Registry. Crit Care Med. 2021;49(3):437-448. doi:10.1097/CCM.0000000000004879 [PMC free article: PMC9520995] [PubMed: 33555777] [CrossRef]
24.
Weng C, Shah NH, Hripcsak G. Deep phenotyping: embracing complexity and temporality—towards scalability, portability, and interoperability. J Biomed Inform. 2020;105:103433. doi:10.1016/j.jbi.2020.103433 [PMC free article: PMC7179504] [PubMed: 32335224] [CrossRef]
25.
Argenziano MG, Bruce SL, Slater CL, et al. Characterization and clinical course of 1000 patients with coronavirus disease 2019 in New York: retrospective case series. BMJ. 2020;369:m1996. doi:10.1136/bmj.m1996 [PMC free article: PMC7256651] [PubMed: 32471884] [CrossRef]
26.
Cummings MJ, Baldwin MR, Abrams D, et al. Epidemiology, clinical course, and outcomes of critically ill adults with COVID-19 in New York City: a prospective cohort study. Lancet. 2020;395(10239):1763-1770. doi:10.1016/S0140-6736(20)31189-2 [PMC free article: PMC7237188] [PubMed: 32442528] [CrossRef]
27.
Goyal P, Choi JJ, Pinheiro LC, et al. Clinical characteristics of COVID-19 in New York City. N Engl J Med. 2020;382(24):2372-2374. doi:10.1056/NEJMc2010419 [PMC free article: PMC7182018] [PubMed: 32302078] [CrossRef]
28.
Petrilli CM, Jones SA, Yang J, et al. Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study. BMJ. 2020;369:m1966. doi:10.1136/bmj.m1966 [PMC free article: PMC7243801] [PubMed: 32444366] [CrossRef]
29.
Das L, Abramson E, Kaushal R. Reopening US schools in the era of COVID-19: practical guidance from other nations. JAMA Health Forum. 2020;1(6):e200789. doi:10.1001/jamahealthforum.2020.0789 [PubMed: 36218528] [CrossRef]
30.
Kamenetz A. Are the risks of reopening schools exaggerated? NPR. October 21, 2020. Accessed June 15, 2022. https://www​.npr.org/2020​/10/21/925794511​/were-the-risks-of-reopening-schools-exaggerated
31.
Brown KV. Testing shows schools aren't propelling COVID-19 outbreaks. Bloomberg. November 3, 2020. Accessed June 15, 2022. https://www.bloomberg.com/news/articles/2020-11-03/testing-shows-schools-aren-t-propelling-covid-19-outbreaks
32.
Watson SK. The risks of three back-to-school plans, ranked. Popular Science. August 19, 2020. Accessed June 15, 2022. https://www​.popsci.com​/story/health/coronavirus-school-reopening-risks/
33.
Brookshire B, Cunningham A, Garcia de Jesús E, Lambert J, Sanders L. Five big questions about when and how to open schools amid COVID-19. Science News. August 4, 2020. Accessed June 15, 2022. https://www​.sciencenews​.org/article/covid-19-coronavirus-kids-schools-opening-when-how-risks
34.
Chronic conditions. Chronic Conditions Data Warehouse. Accessed June 15, 2022. https://www2.ccwdata.org/web/guest/condition-categories-chronic
35.
Social Deprivation Index. Robert Graham Center. Accessed June 15, 2022. https://www.graham-center.org/rgc/maps-data-tools/sdi/social-deprivation-index.html
36.
US Census Bureau. American Community Survey, 2018. Accessed May 21, 2021. https://www​.census.gov​/programs-surveys/acs
37.
Meyers DJ, Mor V, Rahman M, Trivedi AN. Growth In Medicare Advantage greatest among Black and Hispanic enrollees. Health Aff (Millwood). 2021;40(6):945-950. doi:10.1377/hlthaff.2021.00118 [PMC free article: PMC8297509] [PubMed: 34097525] [CrossRef]
38.
Jin J, Agarwala N, Kundu P, et al. Individual and community-level risk for COVID-19 mortality in the United States. Nat Med. 2021;27(2):264-269. doi:10.1038/s41591-020-01191-8 [PubMed: 33311702] [CrossRef]
39.
Butler DC, Petterson S, Phillips RL, Bazemore AW. Measures of social deprivation that predict health care access and need within a rational area of primary care service delivery. Health Serv Res. 2013;48(2 pt 1):539-559. doi:10.1111/j.1475-6773.2012.01449.x [PMC free article: PMC3626349] [PubMed: 22816561] [CrossRef]
40.
Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer; 2009.
41.
Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A. Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med. 2012;51(1):74-81. doi:10.3414/ME00-01-0052 [PMC free article: PMC3250568] [PubMed: 21915433] [CrossRef]
42.
Kruppa J, Liu Y, Biau G, et al. Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J. 2014;56(4):534-563. doi:10.1002/bimj.201300068 [PubMed: 24478134] [CrossRef]
43.
Murtagh F, Legendre P. Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? J Classif. 2014;31(3):274-295. doi:10.1007/s00357-014-9161-z [CrossRef]
44.
Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: an R package for determining the relevant number of clusters in a data set. J Stat Software. 2014;61(6):1-36. doi:10.18637/jss.v061.i06 [CrossRef]
45.
Reynolds D. Gaussian mixture models. In: Li SZ, Jain AK, eds. Encyclopedia of Biometrics. Springer; 2015:827-832.
46.
Van Calster B, McLernon DJ, van Smeden M, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. doi:10.1186/s12916-019-1466-7 [PMC free article: PMC6912996] [PubMed: 31842878] [CrossRef]
47.
Jorge A, D'Silva KM, Cohen A, et al. Temporal trends in severe COVID-19 outcomes in patients with rheumatic disease: a cohort study. Lancet Rheumatol. 2021;3(2):e131-e137. doi:10.1016/S2665-9913(20)30422-7 [PMC free article: PMC7758725] [PubMed: 33392516] [CrossRef]
48.
Anesi GL, Jablonski J, Harhay MO, et al. Characteristics, outcomes, and trends of patients with COVID-19-related critical illness at a learning health system in the United States. Ann Intern Med. 2021;174(5):613-621. doi:10.7326/M20-5327 [PMC free article: PMC7901669] [PubMed: 33460330] [CrossRef]
49.
Schwab P, Mehrjou A, Parbhoo S, et al. Real-time prediction of COVID-19 related mortality using electronic health records. Nat Commun. 2021;12(1):1058. doi:10.1038/s41467-020-20816-7 [PMC free article: PMC7886884] [PubMed: 33594046] [CrossRef]
50.
Churpek MM, Gupta S, Spicer AB, et al. Hospital-level variation in death for critically ill patients with COVID-19. Am J Respir Crit Care Med. 2021;204(4):403-411. doi:10.1164/rccm.202012-4547OC [PMC free article: PMC8480242] [PubMed: 33891529] [CrossRef]
51.
Asch DA, Sheils NE, Islam MN, et al. Variation in US hospital mortality rates for patients admitted with COVID-19 during the first 6 months of the pandemic. JAMA Intern Med. 2021;181(4):471-478. doi:10.1001/jamainternmed.2020.8193 [PMC free article: PMC7756246] [PubMed: 33351068] [CrossRef]
52.
Wadhera RK, Wadhera P, Gaba P, et al. Variation in COVID-19 hospitalizations and deaths across New York City boroughs. JAMA. 2020;323(21):2192-2195. doi:10.1001/jama.2020.7197 [PMC free article: PMC7191469] [PubMed: 32347898] [CrossRef]
53.
Millett GA, Jones AT, Benkeser D, et al. Assessing differential impacts of COVID-19 on Black communities. Ann Epidemiol. 2020;47:37-44. doi:10.1016/j.annepidem.2020.05.003 [PMC free article: PMC7224670] [PubMed: 32419766] [CrossRef]
54.
Zhang Y, Ancker JS, Hall J, Khullar D, Wu Y, Kaushal R. Association between residential neighborhood social conditions and health care utilization and costs. Med Care. 2020;58(7):586-593. doi:10.1097/MLR.0000000000001337 [PubMed: 32520834] [CrossRef]
55.
Zhang Y, Zhang Y, Sholle E, et al. Assessing the impact of social determinants of health on predictive models for potentially avoidable 30-day readmission or death. PLoS One. 2020;15(6):e0235064. doi:10.1371/journal.pone.0235064 [PMC free article: PMC7316307] [PubMed: 32584879]
56.
Ryvicker M, Gallo WT, Fahs MC. Environmental factors associated with primary care access among urban older adults. Soc Sci Med. 2012;75(5):914-921. doi:10.1016/j.socscimed.2012.04.029 [PMC free article: PMC3383917] [PubMed: 22682664] [CrossRef]
57.
Hippisley-Cox J, Coupland CAC, Mehta N, et al. Risk prediction of COVID-19 related death and hospital admission in adults after COVID-19 vaccination: national prospective cohort study. BMJ. 2021;374:n2244. doi:10.1136/bmj.n2244 [PMC free article: PMC8446717] [PubMed: 34535466] [CrossRef]
58.
Su C, Zhang Y, Flory JH, et al. Novel clinical subphenotypes in COVID-19: derivation, validation, prediction, temporal patterns, and interaction with social determinants of health. medRxiv. January 2021:2021.02.28.21252645. doi:10.1101/2021.02.28.21252645 [PMC free article: PMC8280198] [PubMed: 34262117] [CrossRef]

Related Publications

This project has generated several manuscripts, from which some sections have been quoted in this report. They include the following:

•.
Zhang Y, Khullar D, Wang F, Steel P, Wu Y, et al. Socioeconomic variation in characteristics, outcomes, and healthcare utilization of COVID-19 patients in New York City. PLoS One. 2021;16(7):e0255171. doi:10.1371/journal.pone.0255171 [PMC free article: PMC8321227] [PubMed: 34324574] [CrossRef]
•.
Su C, Zhang Y, Flory JH, et al. Clinical subphenotypes in COVID-19: derivation, validation, prediction, temporal patterns, and interaction with social determinants of health. NPJ Digit Med. 2021;4(1):110. doi:10.1038/s41746-021-00481-w [PMC free article: PMC8280198] [PubMed: 34262117] [CrossRef]
•.
Banerjee S, Schenck E, Weiner M, et al. Temporal changes in risk factors for severe COVID-19 in NYC. [Manuscript in preparation for submission to a peer-reviewed journal]
•.
Goyal P, Banerjee S, Weiner M, et al. The modification of risk factors of severe COVID-19 across socioeconomic strata in New York City. [Manuscript in preparation for submission to a peer-reviewed journal]

Acknowledgment

Research reported in this report was funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (HSD-1604-35187-C19). Further information available at: https://www.pcori.org/research-results/2016/methods-identify-and-predict-which-patients-will-have-high-healthcare-needs-and-use-pcornetr-study

Appendices

Appendix A.

Tables and Figures (PDF, 1.9M)

Footnotes

U:p: P value from the χ2 test with 2 degrees of freedom for testing unreliability (H0: intercept = 0, slope = 1). Brier: Brier score. Intercept: Intercept from the logistic calibration model. Slope: Slope from the logistic calibration model. The solid line represents a calibration curve for the relation between predicted and observed probabilities based on logistic regression. The dotted line represents a nonparametric smooth curve for the relation between predicted and observed probabilities. Perfect calibration is represented by the gray area. Triangles are based on groups of patients with similar predicted probabilities. Patients were categorized into 30 groups based on the quantiles of the predicted probability. The distribution of predicted probabilities is shown above the x-axis. Figures were created using all patients in the updated data set from the 3 health systems affiliated with the INSIGHT clinical research network (N = 30 016).

U:p: P value from the χ2 test with 2 degrees of freedom for testing unreliability (H0: intercept = 0, slope = 1). Brier: Brier score. Intercept: Intercept from the logistic calibration model. Slope: Slope from the logistic calibration model. The solid line represents a calibration curve for the relation between predicted and observed probabilities based on logistic regression. The dotted line represents a nonparametric, smooth curve for the relation between predicted and observed probabilities. Perfect calibration is represented by the gray area. Triangles are based on groups of patients with similar predicted probabilities. Patients are categorized into 30 groups based on the quantiles of the predicted probability. The distribution of predicted probabilities is shown above the x-axis. Figures were created using all patients in the updated data set from the 3 health systems affiliated with the INSIGHT clinical research network (N = 30 016).

Original Project Title: Using Predictive Models to Improve Care for Hospitalized Patients with Novel Coronavirus Disease
Institutions Receiving Award: Weill Cornell Medicine, New York-Presbyterian
PCORI ID: HSD-1604-35187

Suggested citation:

Kaushal R, Zhang Y, Banerjee S, et al. (2022). Using Predictive Models to Improve Care for Patients Hospitalized with COVID-19. Patient-Centered Outcomes Research Institute (PCORI). http://doi.org/10.25302/01.2023.HSD.160435187_C19

Disclaimer

The views, statements, and opinions presented in this report are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute® (PCORI®), its Board of Governors or Methodology Committee.

Copyright © 2023. Weill Cornell Medical College. All Rights Reserved.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits noncommercial use and distribution provided the original author(s) and source are credited. (See https://creativecommons.org/licenses/by-nc-nd/4.0/.)

Bookshelf ID: NBK604778; PMID: 38976624; DOI: 10.25302/01.2023.HSD.160435187_C19
