Analysis of congenital heart surgery results requires a reliable method of estimating the risk of adverse outcomes. Two major systems in current use are based on projections of risk or complexity that were predominantly subjectively derived. Our goal was to create an objective, empirically based index that can be used to identify the statistically estimated risk of in-hospital mortality by procedure and to group procedures into risk categories.
Methods
Mortality risk was estimated for 148 types of operative procedures using data from 77,294 operations entered into the European Association for Cardiothoracic Surgery (EACTS) Congenital Heart Surgery Database (33,360 operations) and the Society of Thoracic Surgeons (STS) Congenital Heart Surgery Database (43,934 operations) between 2002 and 2007. Procedure-specific mortality rate estimates were calculated using a Bayesian model that adjusted for small denominators. Each procedure was assigned a numeric score (the STS–EACTS Congenital Heart Surgery Mortality Score [2009]) ranging from 0.1 to 5.0 based on the estimated mortality rate. Procedures were also sorted by increasing risk and grouped into 5 categories (the STS–EACTS Congenital Heart Surgery Mortality Categories [2009]) that were chosen to be optimal with respect to minimizing within-category variation and maximizing between-category variation. Model performance was subsequently assessed in an independent validation sample (n = 27,700) and compared with 2 existing methods: Risk Adjustment for Congenital Heart Surgery (RACHS-1) categories and Aristotle Basic Complexity scores.
Results
Estimated mortality rates ranged across procedure types from 0.3% (atrial septal defect repair with patch) to 29.8% (truncus plus interrupted aortic arch repair). The proposed STS–EACTS score and STS–EACTS categories demonstrated good discrimination for predicting mortality in the validation sample (C-index = 0.784 and 0.773, respectively). For procedures with at least 40 occurrences, the Pearson correlation coefficient between a procedure's STS–EACTS score and its actual mortality rate in the validation sample was 0.80. In the subset of procedures for which RACHS-1 and Aristotle Basic Complexity scores are defined, discrimination was highest for the STS–EACTS score (C-index = 0.787), followed by STS–EACTS categories (C-index = 0.778), RACHS-1 categories (C-index = 0.745), and Aristotle Basic Complexity scores (C-index = 0.687). When patient covariates were added to each model, the C-index improved: STS–EACTS score (C-index = 0.816), STS–EACTS categories (C-index = 0.812), RACHS-1 categories (C-index = 0.802), and Aristotle Basic Complexity scores (C-index = 0.795).
Conclusion
The proposed risk scores and categories have a high degree of discrimination for predicting mortality and represent an improvement over existing consensus-based methods. Risk models incorporating these measures may be used to compare mortality outcomes across institutions with differing case mixes.
Cardiac surgeons have recognized and emphasized the need to establish clinical registries and quantitative tools for responsible reporting of outcomes. Large multi-institutional databases, such as the Society of Thoracic Surgeons (STS) Adult Cardiac Surgery Database, among others, have developed, applied, and validated methods of risk adjustment in reporting outcomes. This has addressed appropriate concerns that the reporting of raw, unadjusted mortality data is misleading and potentially penalizes surgeons and centers that manage high-risk patients and complex procedures because observed mortality rates might be higher than in centers dealing with less challenging cases. The kinds of statistical tools and risk models that have been developed to address these issues when the clinical substrate is adult patients with acquired cardiovascular disease cannot simply be applied to the population of pediatric and adult patients with congenital heart disease. Here the problem is considerably more complex, in large part because the individual diagnoses and distinct types of surgical procedures number in the hundreds, despite the fact that the universe of patients with congenital heart disease is considerably smaller than that of adult patients with ischemic and valvular heart disease. As a result, the number of patients in some diagnostic and procedural groups is quite small. Nonetheless, it is recognized that the need to establish tools for case-mix adjustment is fundamental to any systematic attempt to measure outcomes, compare performance, and sustain a program of continual quality improvement.
As a response to the need for case-mix adjustment of outcome data but in the absence of significant amounts of registry data in 2000, the Aristotle Complexity score was developed.
Using the expert opinions of 50 internationally based surgeons, the Aristotle Basic Complexity (ABC) score was constructed for 145 distinct congenital heart surgery procedures. Three components (potential for mortality, potential for morbidity, and technical difficulty) were subjectively scored, and the sum became the ABC score.
Separately, another group of researchers developed the Risk Adjustment for Congenital Heart Surgery (RACHS-1) system, also using an expert panel.
RACHS-1 groups procedures into 6 levels of increasing risk of mortality. This allocation of procedures was subsequently refined using empirical data from 2 multi-institutional registries. When compared with the ABC score, the RACHS-1 categories appear to have better discrimination for predicting mortality, whereas the ABC score covers a larger proportion of congenital heart surgery case volume.
The largest validation study of the ABC score was recently conducted by using a combined sample of nearly 36,000 patients from the STS Congenital Heart Surgery Database and the European Association for Cardiothoracic Surgery (EACTS) Congenital Heart Surgery Database.
In that study there was a significant increasing association between the ABC score and in-hospital mortality, with an overall C-index of 0.70. Although it was clear that the ABC score generally discriminated between low-risk and high-risk procedures, it was also clear that for a relatively small number of individual procedures, the initial estimation of mortality risk by the Aristotle international panel of surgical experts did not accurately predict the actual empirical estimates observed over the ensuing decade.
The goal of the present study was to derive a new system for classifying congenital heart surgery procedures based on their potential for in-hospital mortality using empirical data from the STS and EACTS databases. There were 3 specific objectives.
First, we sought to estimate procedure-specific relative risks of in-hospital mortality using a statistical model that accounts for uncertainty in procedures with small sample sizes.
Second, we sought to convert these procedure-specific mortality estimates into a scale ranging from 0.1 to 5.0. The range of this scale was chosen for consistency with the Aristotle method. The resulting score has been named the STS–EACTS Congenital Heart Surgery Mortality Score (2009) (or, briefly, the STS–EACTS score).
Third, we sought to group procedures with similar estimated mortality risk into a small number of relatively homogeneous categories (the STS–EACTS Congenital Heart Surgery Mortality Categories [2009] or, briefly, the STS–EACTS categories). These categories are intended to serve as a stratification variable that can be used to adjust for case mix when analyzing outcomes and comparing institutions.
Materials and Methods
Study Population
The STS Congenital Heart Surgery Database and the EACTS Database are described elsewhere.
The study population consisted of patients who underwent a congenital cardiovascular operation at an STS-participating hospital between January 1, 2002, and December 31, 2006, or at an EACTS-participating hospital between January 1, 2002, and April 4, 2007. Data from 1 STS center were excluded because this participant did not consistently report outcomes during the study period. Only the first operation of each hospital admission was analyzed. Operations were included if they involved one of the 148 cardiovascular procedures listed in Table 1. This list includes all cardiovascular procedures that were included in the short-list nomenclature of the STS and EACTS databases and appeared at least once as the primary procedure of an operation in the STS–EACTS dataset. Patients weighing less than or equal to 2500 g undergoing patent ductus arteriosus ligation as their primary procedure were excluded from the analysis because they are not included in mortality calculations in the EACTS and STS Congenital Database reports. In addition, 244 (0.3%) patients with missing in-hospital mortality status were excluded. The final study population consisted of 43,934 operations from 57 centers in the STS database and 33,360 operations from 91 centers in the EACTS database for a total of 77,294 operations.
Table 1. Procedure names, proposed scores and categories, and data for model development
The risk tool developed using this dataset was subsequently validated in a separate sample of STS and EACTS patients meeting the same inclusion criteria described above. This validation sample consisted of 20,042 operations performed between January 1, 2007, and June 30, 2008, in the STS database and 7658 operations performed between April 5, 2007, and April 8, 2008, in the EACTS database.
Hospitals participating in the STS and EACTS registries are required to comply with local regulatory and privacy guidelines. The Duke Clinical Research Institute serves as the data analysis center for the STS database and has an agreement, as well as institutional review board approval, to analyze the aggregate deidentified data for research purposes.
Classification of Multiple-Procedure Operations
Several procedures listed in Table 1 are actually combinations of 2 or more procedures. These combinations were identified by the Aristotle expert panel because they occur frequently in the STS and EACTS databases and because the complexity of the combination is regarded as being different from the complexity of the component procedures when performed in isolation. For all other operations involving combinations of procedures, the operation was classified according to the most technically complex procedure, as determined by the difficulty component of the 2007 update of the ABC score. The ABC score contains some ties and is not defined for 3 of the procedures listed in Table 1. To deal with undefined or tied Aristotle scores, 6 of the study authors independently ranked the difficulty of each procedure listed in Table 1. Undefined or tied Aristotle scores were adjudicated by assigning the operation to the procedure with the highest average ranking determined by the 6 graders. The difficulty rankings are included in Table 1 so that users of the risk tool will be able to replicate our method of classifying multiple-procedure operations.
End Point
The study end point was in-hospital mortality, which was defined as death during the same hospitalization as surgery regardless of cause.
Estimation of Procedure-Specific Mortality Rates
Mortality estimates were calculated by using a Bayesian random effects model that adjusted each procedure's mortality rate based on the size of the denominator. Using a statistical model was considered advantageous because several individual procedures had small denominators, and hence their unadjusted mortality rates were susceptible to chance fluctuations. Unlike conventional methods, random effects models use data from all of the procedures in the database when estimating the probability of mortality for any single procedure. This “borrowing of information” across procedures produces estimates with good statistical properties, including smaller standard errors than conventional estimates. Heuristically, the model-based estimate is a weighted average of a procedure's actual observed mortality rate and the overall average mortality rate for all procedures in the database. The model weights an individual procedure's own data more heavily when the denominator is large enough to be reliable and weights the overall average mortality rate more heavily when the denominator is too small to support a reliable mortality estimate. For procedures with more than 200 occurrences, the model-based estimates were virtually identical to the usual unadjusted (raw) mortality percentages (Appendix 1).
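The weighted-average heuristic described above can be illustrated with a small sketch. It does not reproduce the Bayesian random effects model; it swaps in a simpler shrinkage rule in which the weight on a procedure's own data is n / (n + prior_strength). The function name, the `prior_strength` constant, and the example counts are all hypothetical and chosen only for illustration.

```python
def shrunken_rate(deaths, n, overall_rate, prior_strength=50.0):
    """Weighted average of a procedure's raw rate and the overall rate.

    prior_strength is a hypothetical tuning constant (not from the study):
    it behaves like a number of pseudo-observations at the overall rate.
    """
    if n < 0 or deaths < 0 or deaths > n:
        raise ValueError("need 0 <= deaths <= n")
    weight = n / (n + prior_strength)  # approaches 1 as the denominator grows
    raw = deaths / n if n > 0 else overall_rate
    return weight * raw + (1.0 - weight) * overall_rate


# Made-up counts: identical raw rates (20%), very different denominators.
overall = 0.043  # overall mortality in the development sample (4.3%)
small = shrunken_rate(deaths=2, n=10, overall_rate=overall)
large = shrunken_rate(deaths=400, n=2000, overall_rate=overall)
```

With these inputs the small-denominator estimate is pulled most of the way toward the overall rate, whereas the large-denominator estimate stays close to its raw 20%, which is the qualitative behavior the text describes.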
Creation of the Mortality Score
Each procedure was assigned a numeric score (STS–EACTS score) ranging from 0.1 to 5.0. The scores were assigned by shifting and rescaling the estimated procedure-specific mortality rates to lie in the interval from 0.1 to 5.0 and then rounding to one decimal place. The following formula was used:
$$\text{Score}_j = 0.1 + 4.9 \times \frac{\hat{p}_j - \min}{\max - \min}$$
where $\hat{p}_j$ denotes the estimated risk of the j-th procedure, and max and min denote the maximum and minimum values of $\hat{p}_j$ across the 148 procedures.
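The shift-and-rescale step can be sketched in a few lines, assuming a linear mapping of the smallest estimated rate to 0.1 and the largest to 5.0 followed by rounding to one decimal place. The function name and the three example rates are illustrative only; the two extreme rates are the minimum and maximum estimates reported in this study.

```python
def mortality_scores(estimated_rates, low=0.1, high=5.0):
    """Linearly rescale estimated mortality rates to [low, high], 1 decimal."""
    lo_rate = min(estimated_rates)
    hi_rate = max(estimated_rates)
    if hi_rate == lo_rate:
        raise ValueError("rates must not all be equal")
    span = high - low
    return [
        round(low + span * (r - lo_rate) / (hi_rate - lo_rate), 1)
        for r in estimated_rates
    ]


# 0.3% and 29.8% are the reported extremes; 5% is an arbitrary middle value.
rates = [0.003, 0.05, 0.298]
scores = mortality_scores(rates)
```

Because the mapping depends on the minimum and maximum across all procedures, a score computed this way is only meaningful relative to the full set of 148 estimates.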
Creation of Mortality Categories
Procedures were sorted by increasing estimated risk and partitioned into 5 relatively homogeneous categories (STS–EACTS categories). Five categories was the smallest number that did not result in excessive within-category heterogeneity. Within-category homogeneity was measured objectively using a weighted sum of squares criterion (Appendix 2).
A dynamic programming algorithm was then used to find the categorization that maximizes the homogeneity criterion. This data-driven approach ensures that procedures in the same category will be as similar as possible with respect to their estimated mortality risk.
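A minimal sketch of such a dynamic program follows. It assumes the criterion is equivalent to minimizing the weighted within-category sum of squared deviations from each category's weighted mean, with procedure volumes as weights; the exact criterion is the one defined in Appendix 2 and may differ. All names and the toy data are hypothetical.

```python
def optimal_partition(values, weights, k):
    """Split sorted values into k contiguous groups with minimal weighted
    within-group sum of squares. Returns (cost, exclusive end indices)."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(values)")
    # Prefix sums give the cost of any contiguous segment in O(1).
    W, S, Q = [0.0], [0.0], [0.0]
    for v, w in zip(values, weights):
        W.append(W[-1] + w)
        S.append(S[-1] + w * v)
        Q.append(Q[-1] + w * v * v)

    def cost(i, j):  # weighted sum of squares for values[i:j]
        w, s, q = W[j] - W[i], S[j] - S[i], Q[j] - Q[i]
        return max(q - s * s / w, 0.0) if w > 0 else 0.0

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for g in range(1, k + 1):          # number of groups used so far
        for j in range(g, n + 1):      # first j values are covered
            for i in range(g - 1, j):  # last group is values[i:j]
                c = best[g - 1][i] + cost(i, j)
                if c < best[g][j]:
                    best[g][j], back[g][j] = c, i
    # Walk the back-pointers to recover the end index of each group.
    cuts, j = [], n
    for g in range(k, 0, -1):
        cuts.append(j)
        j = back[g][j]
    cuts.reverse()
    return best[k][n], cuts


# Toy example: five sorted risk estimates with made-up volumes.
risks = [0.01, 0.012, 0.05, 0.055, 0.30]
volumes = [100, 80, 50, 60, 10]
total_cost, cuts = optimal_partition(risks, volumes, k=3)
```

Restricting groups to contiguous runs of the sorted list is what makes the exhaustive search tractable: the run time grows as k times the square of the number of procedures, which is small for 148 procedures.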
To determine the number of categories, we evaluated the performance of different categorizations consisting of 2 to 20 categories. Performance was assessed internally based on 2 criteria. First, we evaluated the internal homogeneity of the categories using the criterion described in Appendix 2. Second, we assessed the discrimination of the categories as predictors of mortality. Discrimination was quantified by the area under the receiver operating characteristic curve (also known as the C-index).
The C-index is interpreted as the probability that a randomly selected patient who died was considered to be higher risk than a randomly selected patient who survived. The C-index generally ranges from 0.5 to 1.0, with 0.5 representing no discrimination (ie, a coin flip) and 1.0 representing perfect discrimination.
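The pairwise interpretation above translates directly into code. The sketch below compares every patient who died with every patient who survived and counts ties in predicted risk as one half, a common convention that is assumed here rather than taken from the study. The function name and example data are hypothetical.

```python
def c_index(risk, died):
    """Fraction of (death, survivor) pairs in which the patient who died
    had the higher predicted risk; ties in risk count as 0.5."""
    deaths = [r for r, d in zip(risk, died) if d]
    survivors = [r for r, d in zip(risk, died) if not d]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for p in deaths:
        for q in survivors:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))


# Made-up predicted risks and outcomes (1 = died, 0 = survived).
example = c_index([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

This brute-force version compares all pairs and is meant only to make the definition concrete; for samples the size of the registries described here, a rank-based calculation would be preferable.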
Models Combining Scores and Categories With Patient-Level Risk Factors
Two logistic regression models were developed to illustrate the utility of modeling the proposed scores and categories together with patient-level risk factors. The first model included the STS–EACTS score (modeled as a continuous variable) plus 3 patient-level factors: age, weight, and preoperative length of stay. To allow for possible nonlinear effects, the score and the square of the score were both entered in the model. Age and weight were modeled jointly by converting them into a single categorical variable with 7 levels (see Results). Preoperative length of stay was dichotomized as less than or equal to 2 days versus more than 2 days. The second model was identical but used the STS–EACTS categories (modeled as a set of category indicators) instead of the STS–EACTS score. Additional patient factors, such as comorbidities, were not included because these data were not available to us for the EACTS subset at the time of analysis.
Comparisons With RACHS-1 Categories and ABC Scores
The models described above were also estimated with RACHS-1 categories in place of the STS–EACTS categories and with the ABC score in place of the STS–EACTS score to facilitate comparisons with existing methods. Briefly, the ABC score of a procedure is a number ranging from 1.5 to 15 points that reflects the Aristotle expert panel's assessment of that type of procedure's potential for mortality, morbidity, and technical difficulty. When analyzing operations with multiple procedures, the ABC score was defined as the maximum ABC score across all procedures in the operation. The RACHS-1 methodology divides procedures into 6 categories based on an expert panel's assessment of the procedure's average mortality risk, where category 1 has the lowest risk of mortality and category 6 has the highest. Unlike the ABC method, the classification of some procedures is allowed to depend on the patient's age. When analyzing operations with multiple procedures, the operation is assigned to the procedure with the highest RACHS-1 category. Because very few data points were available in RACHS-1 category 5, it was combined with category 6 for analysis. The “full” RACHS-1 methodology involves fitting a logistic regression model that includes indicator variables for the RACHS-1 categories together with an indicator variable for single versus multiple cardiac procedures, plus additional adjustment for 3 patient-level risk factors: age, prematurity, and presence of a major noncardiac structural anomaly. Because the required patient-level risk factors were not available in our dataset, we did not implement the full RACHS-1 methodology but instead focused on evaluating the discrimination of the RACHS-1 categories with and without adjustment for patient age, weight, and preoperative length of stay.
Independent Validation Using 2007–2008 Data
The performance of each model was assessed in a separate, more contemporary sample of STS and EACTS data. Overall discrimination was quantified by the C-index. The ability of the proposed score to predict the risk of individual procedures was quantified by calculating the Pearson correlation coefficient between the score and the actual calculated procedure-specific mortality rate in the validation sample. Because sampling variation in the validation sample might artificially increase or decrease the Pearson correlation coefficient, procedures with fewer than 40 occurrences in the validation sample were excluded when calculating the Pearson correlation coefficient. For graphing the association between the proposed score and observed mortality, data from procedures with the same score were aggregated, and the mortality rate of each group of procedures was plotted as a function of the score, excluding groups with fewer than 40 cases. The entire validation was also repeated in the subset of procedures having at least 200 cases in the development sample. Finally, to permit a fair comparison with RACHS-1 and ABC scores, the performance of each model was assessed in the subset of procedures for which both RACHS-1 categories and ABC scores are defined (n = 25,106 patient operations). Statistical comparisons of the C-index for different models were performed using the method of DeLong and colleagues.
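The procedure-level correlation check can be sketched as follows, assuming each procedure is represented by its score, its number of cases, and its number of deaths in the validation sample. The dictionary keys, function names, and example counts are hypothetical; only the 40-case threshold comes from the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def score_vs_observed(procedures, min_cases=40):
    """Correlate scores with observed mortality, excluding procedures with
    fewer than min_cases occurrences. Returns (r, procedures kept)."""
    kept = [p for p in procedures if p["n"] >= min_cases]
    scores = [p["score"] for p in kept]
    rates = [p["deaths"] / p["n"] for p in kept]
    return pearson(scores, rates), len(kept)


# Made-up validation counts; the last procedure is too rare to include.
procs = [
    {"score": 0.1, "n": 100, "deaths": 1},
    {"score": 1.0, "n": 50, "deaths": 3},
    {"score": 3.0, "n": 40, "deaths": 8},
    {"score": 5.0, "n": 10, "deaths": 0},
]
r, n_kept = score_vs_observed(procs)
```

The rare procedure in this toy data has zero observed deaths despite the highest score, which illustrates why small denominators are dropped before the correlation is computed.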
A total of 77,294 patient operations were analyzed, including 3308 (4.3%) in-hospital deaths. There were 71 procedures with at least 200 occurrences, 104 procedures with at least 50 occurrences, and 133 procedures with at least 20 occurrences. Procedures with at least 200 occurrences accounted for 94% of the total patients and 91% of the deaths.
Mortality Rates for Individual Procedures
The frequency of in-hospital mortality for individual procedures ranged from 0% to 40.0%. There were 18 procedures with zero deaths; all of these had sample sizes smaller than 200. When Bayesian modeling was used to estimate mortality risk for individual procedures, the estimates ranged from 0.3% (atrial septal defect repair with patch) to 29.8% (truncus plus interrupted aortic arch repair, Figure 1). For the procedures with more than 200 cases, the raw and model-based estimates were virtually identical (Pearson correlation coefficient > 0.999, Appendix 1).
Names of the procedures analyzed in this study are listed in Table 1, along with their raw and model-based mortality estimates and their proposed scores and categories. The STS–EACTS score takes on values between 0.1 and 5.0 and has 29 unique values. The STS–EACTS categories consist of 5 groups labeled 1 to 5, with higher numbers implying higher mortality risk. The number of patients and procedures per category and their aggregated mortality rates are summarized in Table 2.
Table 2. Characteristics of proposed risk categories in 2002–2007 STS and EACTS data

| STS–EACTS mortality category | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Range of scores | 0.1–0.3 | 0.4–0.7 | 0.8–1.2 | 1.3–2.6 | 2.7–5.0 |
| No. of procedures | 26 | 52 | 27 | 37 | 6 |
| No. of patients | 28,363 | 23,235 | 9026 | 13,862 | 2808 |
| No. of deaths | 234 | 601 | 449 | 1374 | 650 |
| Mortality | 0.8% | 2.6% | 5.0% | 9.9% | 23.1% |

STS–EACTS, Society of Thoracic Surgeons–European Association for Cardiothoracic Surgery.
The within-category homogeneity criterion and the C-index were plotted as functions of the number of categories to help us determine the optimal number of mortality categories. As shown in Figure 2, A, within-category homogeneity increases rapidly with the number of categories when the number of categories is small. With more than 4 or 5 categories, the homogeneity continues to increase, but the marginal improvement per additional category approaches zero. Similarly, Figure 2, B, shows that the estimated discrimination of the categories changes dramatically when the number of groups is varied between 2 and 5, but using more than 5 categories has a relatively modest effect on the C-index. Five categories were chosen as the smallest number that produces both acceptable within-category homogeneity and good discrimination.
Figure 2. Association between number of procedure categories and within-category homogeneity of mortality risk (Panel A) and discrimination for predicting mortality (Panel B). Performance improves with increasing numbers of categories. See Appendix 2 for definition of within-category homogeneity.
Examples of regression models using the proposed scores and categories are summarized in Table 3. The C-index was 0.814 for the model that combined patient factors with the STS–EACTS score and 0.810 for the model that combined patient factors with the STS–EACTS categories. For comparison, when age, weight, and preoperative length of stay were analyzed in a logistic regression model without adjustment for the STS–EACTS scores or categories, the C-index was 0.755.
Table 3. Summary of logistic regression models combining the proposed STS–EACTS scores and categories with patient-level risk factors. Values are odds ratios (95% confidence intervals).

| Variable | Model 1: STS–EACTS score + patient factors | Model 2: STS–EACTS categories + patient factors |
| --- | --- | --- |
| STS–EACTS mortality score | | |
| 0.5 vs 0.25 | 1.4 (1.4–1.5) | – |
| 1.0 vs 0.25 | 2.6 (2.4–2.8) | – |
| 2.0 vs 0.25 | 6.3 (5.6–7.1) | – |
| 4.0 vs 0.25 | 9.4 (8.2–10.8) | – |
| STS–EACTS mortality category | | |
| Category 1 | – | Reference |
| Category 2 | – | 2.9 (2.4–3.3) |
| Category 3 | – | 4.3 (3.6–5.0) |
| Category 4 | – | 7.5 (6.5–8.7) |
| Category 5 | – | 15.9 (13.3–18.9) |
| Age and weight category | | |
| Age ≥1 y | Reference | Reference |
| Age 1–11 mo, weight ≥6.0 kg | 1.0 (0.8–1.2) | 0.9 (0.8–1.1) |
| Age 1–11 mo, weight 4.0–5.9 kg | 1.4 (1.2–1.6) | 1.3 (1.2–1.5) |
| Age 1–11 mo, weight <4.0 kg | 2.6 (2.2–3.0) | 2.6 (2.3–3.0) |
| Age <1 mo, weight ≥3.0 kg | 2.0 (1.8–2.2) | 1.9 (1.7–2.2) |
| Age <1 mo, weight 2.0–2.9 kg | 3.3 (2.8–3.8) | 3.2 (2.8–3.7) |
| Age <1 mo, weight <2.0 kg | 4.9 (4.2–5.8) | 4.9 (4.2–5.7) |
| Preoperative LOS | | |
| ≤2 d | Reference | Reference |
| >2 d | 1.4 (1.3–1.6) | 1.4 (1.3–1.5) |

STS–EACTS, Society of Thoracic Surgeons–European Association for Cardiothoracic Surgery; LOS, length of stay.
There was a strong positive association between the proposed STS–EACTS score and actual observed mortality in the validation sample (C-index = 0.784). For the 82 procedures with at least 40 occurrences in the validation sample, the Pearson correlation coefficient between the score of a procedure and its actual observed mortality rate in the validation sample was 0.80. An increasing association between the score and mortality was observed across the range of scores, although several groups of procedures had lower than expected mortality (Figure 3).
Figure 3. Association between Society of Thoracic Surgeons–European Association for Cardiothoracic Surgery score and in-hospital mortality in the validation sample. Square dots represent the aggregate mortality rate of procedures sharing the same risk score. Data points with fewer than 40 observations were excluded from the figure. Vertical lines represent 95% binomial confidence intervals.
The observed mortality rate in the validation sample was slightly lower than in the development sample (3.9% vs 4.3%, P = .004), reflecting a trend toward lower mortality in a more contemporary sample. This lower mortality was seen in each of the 5 STS–EACTS categories (Figure 4). Despite the trend toward lower absolute mortality in 2007–2008, the chosen categories continued to perform well at discriminating between high-risk and low-risk procedures (C-index = 0.773). Receiver operating characteristic curves for the proposed scores and categories are displayed in Figure 5. When the validation was repeated in the subset of 73 procedures with at least 200 cases in the development sample, there was a similarly high level of discrimination (C-index = 0.790 for STS–EACTS scores; C-index = 0.782 for STS–EACTS categories) and high correlation between the STS–EACTS score and procedure-specific mortality rates (Pearson correlation coefficient = 0.87).
Figure 4. Association between proposed risk categories and observed in-hospital mortality.
Figure 5. Receiver operating characteristic curves for the Society of Thoracic Surgeons–European Association for Cardiothoracic Surgery scores (A) and categories (B) as predictors of in-hospital mortality in the validation sample. The diagonal line is provided as a reference. It is the receiver operating characteristic curve that would be observed hypothetically if the scores and categories were not associated with mortality.
To assess whether the proposed method discriminates mortality better than the existing RACHS-1 categories and Aristotle scores, each of these was evaluated in the validation sample using the subset of procedures for which both RACHS-1 categories and ABC scores are defined. As summarized in Table 4, discrimination was highest for the STS–EACTS score (C-index = 0.787), followed by the STS–EACTS categories (C-index = 0.778), RACHS-1 categories (C-index = 0.745), and ABC scores (C-index = 0.687, all differences P < .0001). Adding patient-level covariates substantially improved each model's discrimination. With the addition of these patient variables, discrimination was highest for the STS–EACTS score (C-index = 0.816), followed by STS–EACTS categories (C-index = 0.812; comparison with STS–EACTS score, P = .035), RACHS-1 categories (C-index = 0.802; comparison vs STS–EACTS categories, P = .008), and ABC scores (C-index = 0.795; comparison vs STS–EACTS score, P < .0001).
Table 4. Comparison of C-index for models using the STS–EACTS score, STS–EACTS categories, RACHS-1 categories, and ABC scores∗

| Method of modeling procedures | Model without patient covariates (C-index) | Model with patient covariates (C-index) |
| --- | --- | --- |
| STS–EACTS score | 0.787 | 0.816 |
| STS–EACTS categories | 0.778 | 0.812 |
| RACHS-1 categories | 0.745 | 0.802 |
| ABC score | 0.687 | 0.795 |

STS–EACTS, Society of Thoracic Surgeons–European Association for Cardiothoracic Surgery; RACHS-1, Risk Adjustment for Congenital Heart Surgery; ABC, Aristotle Basic Complexity.
∗ Validation sample, subset of procedures for which both RACHS-1 categories and ABC scores are defined.
The goal of this study was to derive a valid tool that can be used to stratify congenital heart surgery procedures based on their relative risk of in-hospital mortality. Using the combined resources of the STS and EACTS databases, we estimated the average mortality rate of 148 procedures and then applied a data-driven algorithm to determine the grouping of procedures that was optimal in the sense of creating internally homogeneous strata. The resulting scores and categories are intended to serve as tools for case-mix adjustment when comparing outcomes of hospitals that perform congenital heart surgery. These measures can be used to perform a stratified analysis that adjusts for type of procedure or they can be included along with patient-level variables in a comprehensive risk adjustment model.
Previous investigators have used a combination of expert opinion and empirical data to group procedures with a similar risk of in-hospital mortality. Experts initially used clinical judgment to group procedures with a similar potential for in-hospital mortality to create the RACHS-1 risk categories. This allocation of procedures was subsequently refined by using empirical data from 2 multi-institutional registries. The goals of the present study were similar to those of RACHS-1 in that we also sought to create internally homogeneous procedure categories using the end point of discharge mortality. A major difference between our approach and the derivation of RACHS-1 categories is that our procedure categories were determined empirically without the input of an expert panel. When the proposed methodology was assessed in an independent validation sample, models based on the STS–EACTS score and categories had substantially better discrimination than comparable models based on RACHS-1 categories and ABC scores.
Despite the advantages of an empirically based risk stratification system, there are several limitations and caveats.
First, our study focused on estimating procedural mortality and determining homogeneous procedure categories. Additional research is needed to determine the best method of combining these procedural variables with adjustment for patient-specific risk factors.
Second, despite the large database, several individual procedures had small sample sizes, and the true mortality of these procedures may have been estimated with error. We attempted to minimize this error by using a statistical model, which accounted for small denominators.
Third, because the EACTS and STS registries are voluntary, it is possible that the results observed in this database will differ from those of other nonparticipating institutions.
Fourth, because auditing of the STS and EACTS databases has been limited to a small number of sites, the completeness and accuracy of the data are largely unknown. In an audit of 200 patient records from 10 different STS centers, there was 99.0% agreement in the reporting of discharge mortality by STS sites versus independent auditors and no evidence of selective reporting based on discharge mortality status (personal communication, unpublished STS data).
Another potential limitation rests in the fact that mortality was determined only on the basis of status at the time of discharge. Operative mortality has been defined by the STS Congenital Database Taskforce and the Joint STS–EACTS Congenital Database Committee.
It requires knowledge not only of status at discharge but of patient status at 30 days after the operation. Going forward, validation of the STS–EACTS scores and categories using this definition will be possible as the completeness of these data fields in the STS and EACTS databases improves (Appendix 3).
In summary, we have developed a new tool for grouping procedures with a similar empirically estimated risk of in-hospital mortality. Empirically based mortality stratification was possible to a considerable extent because of the large sample sizes of the STS and EACTS congenital databases. The resulting scores and categories can be incorporated into case-mix adjustment methods, such as stratification and regression analysis, to compare institutions on a level playing field.
Appendix 1. Statistical Model for Estimating Procedure-Specific Mortality Rates
Procedure-specific mortality rates were estimated by using a hierarchical (random effects) model. For each of the 148 procedures in the analysis, the number of deaths was modeled by using the following binomial distribution:
$$y_j \sim \mathrm{Binomial}(n_j, p_j), \qquad j = 1, \ldots, 148,$$
where $p_j$ denotes the unknown theoretical probability of mortality for the j-th procedure, $n_j$ denotes the number of patients undergoing the procedure in the database (denominator), and $y_j$ denotes the actual observed number of mortalities in the database (numerator). Variation in the theoretical probability of mortality was modeled by assuming the log odds were normally distributed. Thus the model is as follows:
$$\log\!\left(\frac{p_j}{1 - p_j}\right) \sim \mathrm{Normal}(\mu, \sigma^2),$$
where $\mu$ and $\sigma^2$ denote the unknown mean and variance, respectively, of the assumed normal random effects distribution. Parameters of the model were estimated in a Bayesian framework using WinBUGS software. A vague (noninformative) prior distribution was chosen for the parameters $\mu$ and $\sigma^2$. The WinBUGS code for this model is available from the authors on request.
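To show how such a model adjusts for small denominators, the sketch below computes the posterior mean mortality for a single procedure by numerical integration over the log odds. It is a simplified illustration, not the authors' WinBUGS code: the random-effects mean and standard deviation (`mu`, `sigma`) are held fixed at hypothetical values instead of being estimated under vague priors, and the function name and example counts are invented.

```python
import numpy as np


def posterior_mean_mortality(deaths, operations, mu, sigma, grid_size=4001):
    """Posterior mean of p when deaths ~ Binomial(operations, p)
    and logit(p) ~ Normal(mu, sigma^2), using a uniform grid on the log odds."""
    theta = np.linspace(mu - 8.0 * sigma, mu + 8.0 * sigma, grid_size)
    p = 1.0 / (1.0 + np.exp(-theta))
    log_prior = -0.5 * ((theta - mu) / sigma) ** 2
    log_lik = deaths * np.log(p) + (operations - deaths) * np.log1p(-p)
    log_post = log_prior + log_lik
    weights = np.exp(log_post - log_post.max())  # unnormalized posterior on the grid
    return float(np.sum(weights * p) / np.sum(weights))


# Assumed (not estimated) random-effects parameters: median mortality of 4%.
mu = float(np.log(0.04 / 0.96))
sigma = 1.0

# Hypothetical procedures: a rare one (2 deaths in 5) and a common one (40 in 1000).
rare = posterior_mean_mortality(2, 5, mu, sigma)
common = posterior_mean_mortality(40, 1000, mu, sigma)
print(f"rare:   raw 40.0%, model {100 * rare:.1f}%")
print(f"common: raw  4.0%, model {100 * common:.1f}%")
```

The rare procedure's estimate is pulled well below its raw 40% rate toward the overall distribution, while the estimate for the common procedure stays close to its raw rate. This matches the pattern the article reports for Figure 6.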
As shown in Figure 6, A, there was a high degree of correlation between the Bayesian model–based estimate of a procedure's risk and the simple raw unadjusted mortality percentage; however, several procedures had large discrepancies. The difference between the model-based and raw estimates decreased with increasing sample size. For procedures with more than 200 cases, the raw and model-based estimates were virtually identical (Pearson correlation coefficient > 0.999; Figure 6).
Figure 6. Relationship between Bayesian model–based estimates and unadjusted mortality rates for individual procedures in the development sample.
Appendix 2. Methodology for Creating Internally Homogeneous Risk Categories
Procedures were first sorted in order of increasing estimated risk (based on the model in Appendix 1) and then grouped into homogeneous categories to create the risk categories. Let $p_i$ denote the true unknown mortality for the i-th procedure, and let $\hat{p}_i$ denote the corresponding estimate. We first sorted procedures so that $\hat{p}_1 \le \hat{p}_2 \le \cdots \le \hat{p}_{148}$. Let k denote the number of categories and let $c = (c_1, \ldots, c_{k-1})$ denote a set of category cut points that partition the procedures into k groups. The symbol $c_j$ denotes a number between 1 and 148 and represents the index of the highest-risk procedure in the j-th category. Also, define $c_0 = 0$ and $c_k = 148$. For any particular choice of k and $c$, within-category homogeneity is measured by the weighted sum-of-squares criterion:
$$\mathrm{WSS}(c; p) = \sum_{j=1}^{k} \; \sum_{i = c_{j-1}+1}^{c_j} n_i \left(p_i - \bar{p}_j\right)^2,$$
where $n_i$ denotes the number of patients undergoing the i-th procedure (as in Appendix 1) and $\bar{p}_j$ denotes the average risk of mortality among all procedures in the j-th category. This criterion is similar to one that has been used previously for defining optimum cut points for categorizing a continuous explanatory variable.
The notation $\mathrm{WSS}(c; p)$ is intended to emphasize that WSS is a function of the chosen cut points and also depends on the unknown procedure-specific probabilities $p_i$. If the $p_i$ were known instead of unknown, then the “optimal” cut points could (in theory) be determined by enumerating all possible choices for the $c_j$ and choosing the one that minimizes the WSS. Because the $p_i$ are unknown, we instead chose cut points that minimize the Bayesian estimate of $\mathrm{WSS}(c; p)$. Specifically, we chose the cut points that minimize the estimated Bayesian posterior mean as follows:
$$\widehat{E}\left[\mathrm{WSS}(c; p) \mid \text{data}\right] = \frac{1}{M} \sum_{m=1}^{M} \mathrm{WSS}\!\left(c; p^{(m)}\right),$$
where $p^{(m)}$ denotes a random draw from the joint posterior distribution of the $p_i$'s and $M$ is the number of draws. Finding the set of cut points that minimizes this quantity exactly is technically challenging and required the use of a novel dynamic programming algorithm (unpublished).
The criterion described above gets smaller as the within-category homogeneity improves. For plotting the change in homogeneity versus k, it is intuitively appealing to use a criterion that increases rather than decreases. The criterion used in Figure 2 (and throughout the article) is defined as follows:
This criterion ranges from 0.0 to 1.0 and increases as the categories become more homogeneous.
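The cut-point search described in this appendix can be sketched with a standard dynamic program over contiguous groups of the sorted procedures. This is not the authors' unpublished algorithm. It rests on several assumptions: each procedure is weighted by its number of patients, the category average is the patient-weighted mean, and the posterior mean of the criterion is approximated by averaging over posterior draws. The `homogeneity` function is one hypothetical criterion with the stated properties (0.0 to 1.0, increasing with homogeneity), namely 1 minus the ratio of the k-category criterion to the single-category criterion. All names and numbers are invented for illustration.

```python
import numpy as np


def segment_cost(draws, n, a, b):
    """Average over posterior draws of the weighted sum of squares for the
    sorted procedures a..b-1 (0-based, half-open) treated as one category."""
    p = draws[:, a:b]
    w = n[a:b]
    seg_mean = (p * w).sum(axis=1, keepdims=True) / w.sum()
    return float((w * (p - seg_mean) ** 2).sum(axis=1).mean())


def optimal_cut_points(draws, n, k):
    """Return (cut points c_1..c_{k-1} as 1-based indices, minimized criterion)."""
    draws = np.asarray(draws, dtype=float)
    n = np.asarray(n, dtype=float)
    J = draws.shape[1]
    best = np.full((k + 1, J + 1), np.inf)  # best[g, b]: first b procedures in g groups
    back = np.zeros((k + 1, J + 1), dtype=int)
    best[0, 0] = 0.0
    for g in range(1, k + 1):
        for b in range(g, J + 1):
            for a in range(g - 1, b):  # last group covers procedures a..b-1
                total = best[g - 1, a] + segment_cost(draws, n, a, b)
                if total < best[g, b]:
                    best[g, b] = total
                    back[g, b] = a
    cuts, b = [], J
    for g in range(k, 0, -1):  # walk back from the last group to the first
        cuts.append(int(b))
        b = back[g, b]
    return cuts[::-1][:-1], float(best[k, J])


def homogeneity(draws, n, k):
    """Hypothetical criterion: 1 - (criterion with k groups) / (criterion with 1 group)."""
    _, wss_k = optimal_cut_points(draws, n, k)
    _, wss_1 = optimal_cut_points(draws, n, 1)
    return 1.0 - wss_k / wss_1


# Invented example: 2 posterior draws for 6 procedures already sorted by risk.
draws = [[0.010, 0.012, 0.015, 0.20, 0.22, 0.25],
         [0.011, 0.013, 0.014, 0.21, 0.23, 0.24]]
n = [500, 300, 200, 50, 40, 30]
cuts, wss = optimal_cut_points(draws, n, k=2)
print(cuts, homogeneity(draws, n, k=2))
```

As written, the search recomputes segment costs inside the loops, so it is slow for large inputs but easy to follow. Caching the cost of each contiguous segment would be the natural first optimization.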
Appendix 3. Completeness of STS Mortality Data
The mortality end point for this study was mortality status at the time of discharge, ie, in-hospital mortality. It was chosen over operative mortality (ie, death prior to discharge or after discharge but within 30 days of surgery) or 30-day mortality status in large part because 30-day status is frequently missing whereas discharge mortality is rarely missing. As shown in Figure 7, the completeness of 30-day mortality status has improved over time. In the future, it may be feasible to adapt the STS–EACTS methodology (or develop a new methodology) to predict the end point of operative mortality or 30-day mortality, assuming the completeness of 30-day mortality reporting continues to improve (Figure 7).
Figure 7. Decreasing percentage of missing data in the fields “mortality discharge status” (alive or dead) and “status at 30 days after surgery” (alive, dead, or unknown) in the Society of Thoracic Surgeons Congenital Database from 2002 to 2006.
Case complexity scores in congenital heart surgery: a comparative study of the Aristotle Basic Complexity score and the Risk Adjustment in Congenital Heart Surgery (RACHS-1) system.
What is operative mortality? Defining death in a surgical registry database: a report of the STS Congenital Database Taskforce and the Joint EACTS-STS Congenital Database Committee.