content: BCI essay v2 + modern_idolatry to drafts

@@ -60,7 +60,7 @@ A full accounting of what this build process has actually produced is available

## The Computing Environment

::: dropcap

Every tool I use was chosen rather than accepted. This distinction matters: the default path through the software ecosystem, particularly for developers, routes through choices made by convenience, employer mandate, or inertia rather than deliberation. I have consistently refused that path — not out of contrarianism, but because I find that the tools one uses become constitutive of the work one produces. This site is downstream of those choices in every particular.

I am, like many passionate nerds in computing, obsessive about my technological choices: they are the subject of constant critique, review, and revision. I believe in putting deep thought into the systems one interacts with, rather than accepting the first convenient option and going with the flow. A system one interacts with is an experience more so than a mere tool.

:::

### Desktop

@@ -72,23 +72,21 @@ My primary desktop is a rig I built myself, running **Gentoo Linux** with **Hyprland**

- It is, in my experience, the best-maintained Linux distribution I have ever used.
- The community is phenomenal.

I have strong hardware preferences to match: [AMD]{.smallcaps} processors and graphics throughout. This is not tribal loyalty — I have used multiple distinct products from each major vendor and formed the preference empirically. As far as x86-64 is concerned, AMD is the clear winner over Intel; as for graphics, not only do I quite disagree with the business direction of NVIDIA, but AMD's VRAM offerings are simply superior.

### Laptop

For mobile computing, I use a [P]{.smallcaps}-series ThinkPad running **Arch Linux** — the same Hyprland environment, the same *Levshell*, the same muscle memory down to the config file, but without the compilation overhead. Gentoo is impractical on battery-constrained hardware: compilation times are simply too long, and running the compiler continuously on battery is not a reasonable tradeoff. Arch is a sound alternative, configurable enough to pass. Portage is better than pacman in my opinion, but pacman is still far better than horrid package managers like apt.

The Hyprland environment runs identically on both machines. This is deliberate — I do not want to maintain two workflows. Every keybinding, every visual detail, every tool behaves the same. The continuity is itself a productivity commitment.

### Editor

Everything on this site — every word of prose, every line of Haskell, every [CSS]{.smallcaps} rule — was written in **Emacs**. I have used most of the major editors, and consistently experiment with others, but have yet to find any that come close to its power: Emacs is configurable to a degree that other editors treat as pathological. Running `make watch` alongside an Emacs buffer gives something close to a [WYSIWYG]{.smallcaps} experience for this workflow: save the file, the Hakyll watcher recompiles, the browser updates. The loop is tight enough that the editor and the rendered output feel nearly continuous. I also intend to complete a project called "Pmacs," which will introduce more modern parallelization, among other features, to Emacs; I plan to tackle it at high intensity in the summer of 2026.

### Privacy-First Computing

My email and [VPN]{.smallcaps} are self-hosted; I use Thunderbird as a client for the former. For browsing I use [LibreWolf](https://librewolf.net/), not Firefox: the Chromium monopoly and Mozilla's evident incompetence at browser development are, to me, equally concerning developments, and LibreWolf is the most coherent response to both. My phone runs [GrapheneOS](https://grapheneos.org/) — the only reasonably secure and private option for a mobile device, and one whose restrictions are, frankly, a feature rather than a limitation.

The principle underlying all of these choices is the same one underlying the site's **No Tracking** policy: privacy is an architectural decision, not a settings toggle. You build the infrastructure first. Bolting on privacy after the fact, whether in a browser or on a website, is not privacy — it is the appearance of privacy.

---

@@ -161,5 +159,4 @@ The tradition of the personal website is one that is built on a sense of community

This site is unfinished. Several portals have no content yet. The annotated bibliography is sparse. I am in the process of migrating content, so stay tuned!

The colophon itself is a living document. When the site changes substantially, this page will change with it. The git history functions as an authoritative record; the git repository on Forgejo (hosted on the git subdomain here) should always be considered to take precedence, and this page is my personal annotated summary.

@@ -0,0 +1,41 @@
---
title: "The Modern Idolatry"
date: 2026-04-06
abstract: >
  Thoughts on idolizing notions of success, whether extrinsic or intrinsic, prompted by my upcoming graduation from Brown University and a recent week spent in Paris.
tags:
- miscellany
- philosophy
- personal
- personal/travel
authors:
- "Levi Neuwirth | /me.html"
status: "Draft"
history:
- date: "2026-04-06"
---
Travel affects me profoundly, and the effect is strangely uniform. There is a hierarchical structure of dichotomies that seems to define most aspects of my life, and my interactions with place are no exception to this rule. One of the dichotomies is as follows: I am rather accustomed to moving around in my adult life to date, never spending more than 4 months in a place before spending at least a few weeks somewhere else, and yet I rapidly develop a sense of "home" wherever I am - a stagnation of sorts, an acceptance of the region in which I reside and an abstraction away of the remainder of the world to some vast, esoteric TERRA INCOGNITA. Perhaps the most profound, persistent personal effect of travel on me is that it knocks me out of this mental state of spatial hibernation, reminding me that there is an entire world beyond that which I consistently perceive, and that I have the means to do something to have a positive impact on it. This sensation has been profoundly important to me for many years now, and is thus one reason that travel is consistently a high priority for me.

This is often combined with a sense of grand melancholy, the sort that for me is nearly ubiquitous in the presence of grandeur and beauty. It is a different incarnation of the same melancholy^[I should emphasize here that while "melancholy" may in general carry a negative connotation, I do not feel that this is a negative emotion whatsoever. To me, the primary effect of melancholy, or at least melancholy of this sort, is an amplification of the imposing impetus, usually some sense of grandeur. The melancholy is like delicate cinnamon powder added to the top of a pristine flat white.] that I feel when I listen to a profound piece of music, view a painting that I enjoy, or reach the summit of a mountain that I have been embracing for hours. In this case the strength perhaps arises from the confluence of the grandeur of the natural world - the vastness of space, the mystery of distinct regions that I have yet to know and the warm embrace of returning to those which I know but not well - and that of the human world - the various cultures, languages, beliefs, institutions, and above all people that are present in various places.

This grand, amplified melancholy typically has three causes in my life, two of which I have already mentioned. The third is instances of outward-facing "success" - I typically feel melancholic and pensive when I have done something or crossed some milestone^["Milestones" here are not terms of my own choosing, nor guidelines or aspects of some personal timeline or plan, but rather things that society imposes. They don't mean much to me on a personal level, but do unavoidably impact how I feel, since I cannot avoid societal influences as much as I sometimes wish I could.] that many folks see as an indicator of success (or the potential for it). One might imagine, then, that I felt quite a sensation as I was travelling in Paris during my most recent spring break, on the verge of graduating from Brown University after four years of work and extreme personal growth, and such an imagination would be highly warranted. As I took endless walks on the [Champ de Mars](https://en.wikipedia.org/wiki/Champ_de_Mars) and along the [Seine](https://en.wikipedia.org/wiki/Seine), many thoughts and musings were prompted by the sensations of emotion, grandeur, and wonder that I felt. They are largely concentrated around the theme of modern idolatry in the name of "success" and the implications of this, on both a personal level and a broader philosophical and societal one. My attempts to collect them into a format that I can share follow.

## Dichotomies

<figure class="prose-excerpt">
<blockquote>
"Everything is a dichotomy; that is perhaps the grandeur of life, of the Universe itself."
</blockquote>
<figcaption>Levi's personal journal, 29 January 2026</figcaption>
</figure>

::: dropcap
What of "success" do I understand, and what of it have I cumulatively failed to understand? Of course, this question depends on one's chosen definition of "success," so perhaps the most interesting approach is to parameterize our choice of definition. Indeed, SUCCESS is a concept that means different things to different people, so perhaps such parameterization is implicitly necessary. Yet such parameterization unsettles me greatly on a personal level. It is the first example of dichotomy that we, together, may explore.
:::

Society widely seems to view success as the fulfillment of goals rooted in extrinsic motivations. The credentialist nature of our society seems to conflate one's ability to earn a title with competence, experience, and, in some cases, worthiness - and who, exactly, is worthy of success, or, rather, is it success that deems one worthy in the eyes of the world? In more ways than one, it seems that we have been conditioned somehow through our institutions, both explicit and implicit, to conflate worthiness with success, and this conflation is perhaps grounded in the idea that success will be transitive; that is, that one's continued association with successful people leads to more successful outcomes. This seems to imply that "success" is somehow a communal thing, so inherently extrinsic that it diffuses and saturates, so long as those who have it^[For the sake of illustration here we are assuming that "success" is something to be had, a notion that will be debunked later.] are willing to continue associating with those who have less of it.

Yet this is in direct contrast to what is arguably the foundation of our^[I use "our" here to refer to citizens of the United States, my country of birth and the culture that largely influenced my perception of success.] success. The extrinsic nature of such success is not problematic, but the communal aspect is. The ethos of the [American Dream](https://en.wikipedia.org/wiki/American_Dream) is largely that of individualism - the promise that intense individual effort leads to success.

@@ -1,216 +0,0 @@
---
title: "Beyond Comorbidity Indices"
date: 2026-03-28
abstract: >
  A permutation-invariant deep-learning model that learns ICD-10-CM code representations improved 30-day readmission prediction (AUC 0.7496 vs 0.6553 for CCI) and postdischarge in-hospital mortality prediction (AUC 0.8557 vs 0.7844 for age-adjusted CCI) compared with standard comorbidity-index baselines in a national claims database of over 51 million adult hospitalizations.
tags:
- research
- research/machine-learning
authors:
- "Levi Neuwirth | /me.html"
- "Liqi Shu"
- "Xilin Wang"
- "Henry Zheng"
affiliation:
- "Department of Neurology, Warren Alpert Medical School, Brown University"
- "Department of Computer Science, Brown University | https://cs.brown.edu"
- "Department of Mathematics, Brown University | https://mathematics.brown.edu/"
- "Department of Computer Science, Northeastern University"
status: "Durable"
confidence: 80
importance: 3
evidence: 5
scope: average
novelty: moderate
practicality: moderate
history:
- date: "2026-03-28"
  note: Preprint auto-formatted for levineuwirth.org
---

::: {.annotation .annotation--collapsible}
**KEY POINTS**

**Question.** Among adult hospitalizations in a national claims database, does a permutation-invariant model that learns ICD-10-CM representations improve prediction of 30-day outcomes compared with Charlson and Elixhauser comorbidity indices?

**Findings.** In a temporally held-out national cohort (2021--2022), the ICD-10-CM embedding model achieved higher discrimination than comorbidity-index models for 30-day unplanned readmission (AUC, 0.7496) and 30-day postdischarge in-hospital mortality (AUC, 0.8557). The highest-performing comorbidity-index comparators achieved AUCs of 0.6553 and 0.7844, respectively.

**Meaning.** Learning representations directly from ICD-10-CM codes may improve claims-based risk adjustment and help prioritize transitional-care resources at discharge.
:::

## Introduction

::: dropcap
Accurate prediction and risk adjustment for short-term clinical outcomes, such as 30-day hospital mortality and readmission, are critical for enhancing healthcare research quality, allowing fair assessment of healthcare outcomes and quality metrics [@cms_hrrp]. Most claims-based risk adjustment continues to rely on comorbidity indices such as the Charlson Comorbidity Index (CCI) and Elixhauser Comorbidity Index (ECI), which map diagnosis codes to a limited set of conditions [@charlson1987; @elixhauser1998]. While these indices are interpretable and widely deployed, they inevitably discard granularity and may miss clinically meaningful comorbidity patterns and interactions among diagnoses.
:::

Recent machine-learning approaches increasingly use high-dimensional ICD code inputs and have demonstrated improved prediction for a range of outcomes [@deschepper2020; @lelay2022; @qiao2022]. However, many approaches simplify or truncate ICD codes, aggregate diagnosis lists in ways that depend on code order, or are trained and evaluated in settings where coding practices differ across sites---each of which can limit robustness and transportability. In addition, many claims-based studies focus on in-hospital mortality and do not evaluate postdischarge mortality as a discharge-time outcome [@qiao2022; @davis2022; @harerimana2021; @matsui2022; @nguyen2017].

We developed and temporally validated a claims-based prediction model that uses trainable ICD-10-CM embeddings and permutation-invariant aggregation of diagnosis lists. Using the Nationwide Readmissions Database (2016--2022), we assessed discrimination, calibration, and recall-weighted performance for 30-day unplanned readmission and 30-day postdischarge in-hospital mortality and compared results with Charlson and Elixhauser comorbidity-index models; we also generated diagnosis-level attributions using Integrated Gradients.

## Methods

### Study Design, Data Source, and Oversight

We conducted a retrospective cohort study using the Healthcare Cost and Utilization Project (HCUP) Nationwide Readmissions Database (NRD), 2016--2022. Adult discharges from 2016--2020 were used for model development, and discharges from 2021--2022 were reserved for temporal external testing. Discharges in December of each year were excluded to allow complete 30-day follow-up within the same calendar year.

Use of the NRD was governed by the HCUP data use agreement. Because the NRD contains deidentified data, the institutional review board determined the study was not human participants research and that informed consent was not required.

### Cohort Definition

We included hospitalizations for patients aged 18 years or older with a valid patient linkage identifier within each calendar year. For mortality analyses, index hospitalizations with in-hospital death were excluded from the primary outcome definition (postdischarge in-hospital mortality) and examined in a prespecified secondary analysis.

### Outcomes

The coprimary outcomes were (1) 30-day unplanned readmission and (2) in-hospital mortality within 30 days of discharge. Readmissions were classified as unplanned using the HCUP algorithm. Postdischarge mortality was defined as inpatient death occurring during a subsequent hospitalization within 30 days after discharge; deaths outside the hospital are not captured in the NRD.

### Predictors

For each index hospitalization, we used up to 40 ICD-10-CM diagnosis codes (principal and secondary diagnoses). Diagnosis codes were label-encoded into integer identifiers for model input. Demographic and socioeconomic covariates included age, sex, primary payer, and ZIP-code median income quartile. Age was standardized; categorical variables (sex, payer, income quartile) were one-hot encoded with explicit handling of missing values. No diagnosis codes were filtered, simplified, or reordered.

### Model Development

#### Embedding Model Architecture

The embedding model mapped each ICD-10-CM code to a trainable embedding vector and used a Deep Sets architecture [@zaheer2017] for permutation-invariant aggregation. The Deep Sets encoder processed individual diagnosis embeddings independently through shared multilayer perceptrons (MLPs); outputs were summed (permutation-invariant pooling) and passed to a decoder MLP. Demographic and socioeconomic covariates were processed through a separate 2-layer MLP and concatenated with the Deep Sets output before the final predictor MLP. The model was implemented in TensorFlow [@abadi2016] and trained with binary cross-entropy loss, the Adam optimizer, and early stopping on validation loss.

#### Training Strategy

Models were trained separately for each outcome. To address outcome imbalance, we randomly downsampled majority-class encounters in the training set to a 1:1 case-control ratio (validation and temporal test sets were not downsampled). Predicted probabilities were recalibrated on the validation set using logistic calibration. Hyperparameters were tuned via random search (32 trials per outcome), prioritizing validation AUC with $F_2$ as a secondary criterion.[^hparams]

[^hparams]: Hyperparameter search details and final selected configurations are provided in the supplementary eMethods 1 and eTable 1.

### Comparator Models

We implemented logistic regression models based on the Charlson and Elixhauser comorbidity indices. For the Charlson model, we used both the unweighted comorbidity count and the traditional age-adjusted score [@ahrq_elixhauser; @fernando2019; @quan2005; @deyo1992; @quan2011; @stagg2006]. For the Elixhauser model, we used the unweighted comorbidity count; the refined AHRQ algorithm was applied for ICD-10-CM mapping [@ahrq_elixhauser]. All comparator models included age, sex, primary payer, and income quartile as covariates.

### Statistical Analysis and Performance Evaluation

Primary performance evaluation used a prespecified stratified random subsample of 2021--2022 discharges (n = 3,226,831; sampling fraction 0.10). Discrimination was assessed with AUC-ROC and 95% CIs (DeLong method); pairwise comparisons used DeLong tests [@youden1950; @delong1988]. Calibration was evaluated with calibration plots and the Brier score. Binary classification thresholds were selected on the validation set by maximizing the Youden index and applied unchanged to the temporal test set. Threshold-dependent metrics included sensitivity, specificity, precision, $F_1$, and $F_2$ (which weights recall twice as heavily as precision).

### Interpretability Analysis

We used Integrated Gradients [@sundararajan2017] to generate code-level attributions, quantifying each ICD-10-CM code's contribution to predicted risk. Attributions were averaged across occurrences for the 10 most positively and negatively influential codes per outcome (minimum 50 occurrences in the interpretability cohort).[^ig]

[^ig]: Additional details on training, testing, and interpretability methods are provided in supplementary eMethods 1--3.

### Ablation Studies

We conducted three prespecified ablation studies to evaluate the contribution of key architectural choices: (1) addition of transformer blocks for attention-based contextualization, (2) replacement of Deep Sets with a permutation-variant flattening comparator, and (3) removal of demographic and socioeconomic covariates.[^ablation]

[^ablation]: Full ablation details and results are in supplementary eMethods 4 and eTable 3.

## Results

### Cohort Characteristics

The study included 19,120,000 adult hospitalizations from 2016--2020 (development cohort) and 32,268,308 from 2021--2022 (temporal test cohort). The 30-day unplanned readmission rate was 11.0% in the development cohort and 10.7% in the temporal test cohort. The 30-day postdischarge mortality rate was 0.6% in both cohorts. Additional cohort characteristics are reported in Table 1.

### Primary Performance Comparison

For 30-day readmission, the embedding model achieved an AUC of 0.7496 (95% CI, 0.7488--0.7504), outperforming the Charlson model (AUC, 0.6553; 95% CI, 0.6544--0.6562; P < .001) and the Elixhauser model (AUC, 0.6363; 95% CI, 0.6353--0.6372; P < .001). For 30-day postdischarge in-hospital mortality, the embedding model achieved an AUC of 0.8557 (95% CI, 0.8532--0.8581) vs the best-performing comparator (age-adjusted Charlson: 0.7844; 95% CI, 0.7813--0.7874; P < .001). Performance metrics are summarized in Table 2.

The embedding model also showed superior recall-weighted performance. For 30-day readmission, $F_2$ was 0.4848 vs 0.4066 for the best comparator; for postdischarge mortality, $F_2$ was 0.0530 vs 0.0480. Calibration plots demonstrated good agreement between predicted and observed risks across the probability range.[^calibration]

[^calibration]: Calibration reliability plots are provided in supplementary eFigure 2.

::: {.annotation .annotation--static}
**Figure 1 --- [placeholder]** Receiver operating characteristic curves for 30-day readmission (left) and 30-day postdischarge in-hospital mortality (right), comparing the embedding model against Charlson and Elixhauser comparators on the temporal test set.
:::

::: {.annotation .annotation--static}
**Figure 2 --- [placeholder]** Calibration reliability plots for the embedding model and comparators on the temporal test set.
:::

### Interpretability Analysis

Integrated Gradients attributions identified clinically relevant patterns. For readmission, codes indicating prior readmissions, chronic conditions requiring ongoing management, and social determinants (e.g., housing issues) showed high positive attributions. For postdischarge mortality, codes reflecting severe illness (e.g., sepsis, respiratory failure) and palliative care had high positive attributions, while codes for routine procedures (e.g., colonoscopy) had negative attributions.

::: {.annotation .annotation--static}
**Figure 3 --- [placeholder]** Top 10 positively and negatively attributed ICD-10-CM codes by mean Integrated Gradients attribution for 30-day readmission.
:::

::: {.annotation .annotation--static}
**Figure 4 --- [placeholder]** Top 10 positively and negatively attributed ICD-10-CM codes by mean Integrated Gradients attribution for 30-day postdischarge in-hospital mortality.
:::

### Ablation Studies

Addition of transformer blocks did not improve discrimination or $F_2$ score for either outcome.[^transformer] Replacing Deep Sets with a permutation-variant flattening comparator reduced AUC (P < .001 for readmission; P = .014 for mortality) and $F_2$ score.[^flattening] Removing demographic and socioeconomic covariates slightly reduced performance (P = .020 for readmission; P < .001 for mortality).[^demographics]

[^transformer]: Reported P values for AUROC differences were P < .001 for readmission and P = .57 for postdischarge mortality. Full results in supplementary eTable 3A.

[^flattening]: Full results in supplementary eTable 3B.

[^demographics]: Full results in supplementary eTable 3C.

::: {.annotation .annotation--static}
**Figure 5 --- [placeholder]** Ablation study results comparing the base embedding model against transformer-augmented, permutation-variant, and ICD-only variants for both outcomes.
:::

## Discussion

In a large national claims database, a permutation-invariant embedding model that learned ICD-10-CM representations achieved higher discrimination than Charlson and Elixhauser comorbidity-index models for predicting 30-day unplanned readmission and 30-day postdischarge in-hospital mortality. The embedding model also showed better calibration and recall-weighted performance, supporting its potential utility for discharge-time risk stratification and claims-based risk adjustment.

### Comparison With Prior Work

Prior studies have demonstrated the value of machine learning for readmission and mortality prediction [@deschepper2020; @lelay2022; @qiao2022; @davis2022; @harerimana2021; @matsui2022; @nguyen2017]. However, many approaches rely on permutation-variant aggregation (e.g., recurrent networks or attention over ordered sequences), which can be sensitive to code ordering---a dimension that varies across sites and coders. Our permutation-invariant approach addresses this limitation and may improve transportability. In addition, many claims-based studies focus on in-hospital mortality and do not evaluate postdischarge mortality as a discharge-time outcome. Our focus on postdischarge mortality provides a more clinically relevant endpoint for discharge planning and transitional care.

### Implications for Practice and Policy

Readmission reduction is a key quality and policy goal, with financial penalties under the Hospital Readmissions Reduction Program [@cms_hrrp; @joynt2013; @desai2016; @zuckerman2017]. However, current risk adjustment often relies on comorbidity indices that may under-adjust for complexity, leading to unfair comparisons across hospitals serving different populations [@obermeyer2019; @kind2014; @joynt2011]. Our findings suggest that learning representations directly from ICD-10-CM codes could improve fairness and precision in risk adjustment.

At the patient level, improved risk stratification at discharge could help prioritize transitional-care resources (e.g., care coordination, medication reconciliation, follow-up calls) to high-risk patients [@hansen2011; @coleman2006]. The interpretability analysis also highlights specific diagnoses that drive risk, which may inform discharge planning conversations.

### Limitations

Our study has several limitations. First, the NRD captures only readmissions to acute-care hospitals within the same state and does not include deaths outside the hospital; both outcomes are therefore underestimated. Second, the model was trained and evaluated in a single national database; external validation in other claims databases and prospective evaluation are needed to assess generalizability and clinical utility. Third, while Integrated Gradients provides code-level attributions, the embedding model remains less interpretable than simple comorbidity indices [@rudin2019]. Fourth, we did not incorporate additional predictors (e.g., laboratory values, vital signs, discharge disposition) that may be available in some settings and could further improve performance. Finally, residual confounding by unmeasured factors (e.g., social determinants, functional status) may affect model predictions and limit clinical deployment without further validation.

### Conclusions

In a large national claims database, a permutation-invariant model that learned ICD-10-CM representations improved prediction of 30-day readmission and postdischarge in-hospital mortality compared with Charlson and Elixhauser index models. These findings support the use of high-dimensional diagnosis information for claims-based risk adjustment and discharge-time risk stratification, with prospective evaluation needed before clinical deployment.

## Tables

### Table 1: Cohort Characteristics {#table-1}

| Characteristic | Development (2016--2020) | Temporal Test (2021--2022) |
|:---|---:|---:|
| Total hospitalizations | 19,120,000 | 32,268,308 |
| 30-day readmission rate (%) | 11.0 | 10.7 |
| 30-day postdischarge mortality rate (%) | 0.6 | 0.6 |

### Table 2: Model Performance Comparison (Temporal Test Set) {#table-2}

| Outcome | Model | AUC-ROC | 95% CI | Precision | Recall | $F_1$ | $F_2$ |
|:---|:---|---:|:---:|---:|---:|---:|---:|
| 30-day readmission | Embedding | 0.7496 | 0.7488--0.7504 | 0.1881 | 0.8006 | 0.3046 | 0.4848 |
| | CCI | 0.6553 | 0.6544--0.6562 | 0.1493 | 0.6819 | 0.2453 | 0.4066 |
| | CCI (age-adj) | 0.6483 | 0.6474--0.6491 | 0.1469 | 0.6794 | 0.2416 | 0.4016 |
| | ECI | 0.6363 | 0.6353--0.6372 | 0.1426 | 0.6750 | 0.2357 | 0.3925 |
| 30-day postdischarge mortality | Embedding | 0.8557 | 0.8532--0.8581 | 0.0111 | 0.8756 | 0.0220 | 0.0530 |
| | CCI | 0.7217 | 0.7180--0.7253 | 0.0075 | 0.8026 | 0.0149 | 0.0371 |
| | CCI (age-adj) | 0.7844 | 0.7813--0.7874 | 0.0093 | 0.7846 | 0.0185 | 0.0480 |
| | ECI | 0.6686 | 0.6645--0.6728 | 0.0068 | 0.7901 | 0.0135 | 0.0336 |

::: {.annotation .annotation--static}
**Full title:** Development of an ICD-10-CM Embedding Model for Predicting 30-Day Readmission and Postdischarge In-Hospital Mortality in the Nationwide Readmissions Database

**Keywords:** ICD-10-CM; readmission; mortality; claims data; risk adjustment; deep learning; model interpretability
:::

::: {.annotation .annotation--collapsible}
**Structured Abstract**

**Importance.** Comorbidity indices are widely used for claims-based risk adjustment but compress diagnostic information and may under-adjust for clinical complexity.

**Objective.** To develop and temporally validate a permutation-invariant model that learns ICD-10-CM representations to predict 30-day unplanned readmission and 30-day postdischarge mortality and to compare performance with Charlson and Elixhauser comorbidity-index models.

**Design, Setting, and Participants.** Retrospective cohort study of adult hospitalizations in the Healthcare Cost and Utilization Project Nationwide Readmissions Database (NRD), 2016--2022. Models were developed using discharges from 2016--2020 and temporally tested using 2021--2022 discharges. Primary performance evaluation was conducted in a prespecified stratified random subsample of 2021--2022 discharges (n = 3,226,831).

**Exposure.** Up to 40 discharge diagnosis codes (ICD-10-CM) were mapped to trainable embeddings and aggregated with a permutation-invariant Deep Sets architecture; age, sex, primary payer, and ZIP code income quartile were also included as covariates.

**Main Outcomes and Measures.** Outcomes were 30-day unplanned readmission and 30-day postdischarge in-hospital mortality. Discrimination was assessed with the area under the receiver operating characteristic curve (AUC) and 95% CIs; calibration and threshold-dependent metrics (including $F_2$) were evaluated. Performance was compared with optimized logistic regression models based on Charlson and Elixhauser indices.

**Results.** In temporal testing, the embedding model showed higher discrimination for 30-day readmission (AUC, 0.7496 [95% CI, 0.7488--0.7504]) than Charlson (0.6553 [95% CI, 0.6544--0.6562]) and Elixhauser (0.6363 [95% CI, 0.6353--0.6372]). For 30-day postdischarge in-hospital mortality, the embedding model achieved an AUC of 0.8557 (95% CI, 0.8532--0.8581) vs the best-performing comparator (age-adjusted Charlson: 0.7844 [95% CI, 0.7813--0.7874]); DeLong tests were significant for each comparison (P < .001). Recall-weighted performance similarly favored the embedding model ($F_2$: 0.4848 vs 0.4066 for readmission; 0.0530 vs 0.0480 for postdischarge mortality).

**Conclusions and Relevance.** In a large national claims database, a permutation-invariant model that learned ICD-10-CM representations improved prediction of 30-day readmission and postdischarge in-hospital mortality compared with Charlson and Elixhauser index models. These findings support the use of high-dimensional diagnosis information for claims-based risk adjustment and discharge-time risk stratification, with prospective evaluation needed before clinical deployment.
:::

After Width: | Height: | Size: 183 KiB
After Width: | Height: | Size: 631 KiB
After Width: | Height: | Size: 370 KiB
After Width: | Height: | Size: 345 KiB
After Width: | Height: | Size: 85 KiB
After Width: | Height: | Size: 88 KiB
After Width: | Height: | Size: 52 KiB
After Width: | Height: | Size: 80 KiB
After Width: | Height: | Size: 207 KiB
After Width: | Height: | Size: 232 KiB

@@ -0,0 +1,367 @@
---
title: "Beyond Comorbidity Indices"
date: 2026-04-09
abstract: >
  A deep learning model using ICD-10-CM diagnosis codes with a permutation-invariant Deep Sets aggregator improved 30-day unplanned readmission (AUC 0.7496 vs 0.6553 for CCI) and 30-day postdischarge in-hospital mortality (AUC 0.8557 vs 0.7844 for age-adjusted CCI) compared with Charlson and Elixhauser comorbidity-index benchmarks in a national claims database of over 113 million adult hospitalizations.
tags:
- research
- research/machine-learning
authors:
- "Levi Neuwirth | /me.html"
- "Liqi Shu"
- "Xilin Wang"
- "Henry Zheng"
affiliation:
- "Department of Neurology, Warren Alpert Medical School, Brown University"
- "Department of Computer Science, Brown University | https://cs.brown.edu"
- "Department of Mathematics, Brown University | https://mathematics.brown.edu/"
- "Department of Computer Science, Northeastern University"
status: "Durable"
confidence: 80
importance: 3
evidence: 5
scope: average
novelty: moderate
practicality: moderate
bibliography: data/bci-paper.bib
repository: "https://git.levineuwirth.org/neuwirth/beyond_comorbidity_indices"
history:
- date: "2026-03-28"
  note: Preprint auto-formatted for levineuwirth.org
---

::: {.annotation .annotation--collapsible}
**KEY POINTS**

**Question.** Among adult hospitalizations in a national claims database, does a deep learning model using ICD-10-CM diagnosis codes improve prediction of 30-day unplanned readmission and 30-day postdischarge in-hospital mortality compared with benchmark models based on Charlson and Elixhauser comorbidity indices?

**Findings.** In this cohort study of 3,226,831 temporally held-out discharges, the ICD-10-CM--based model showed better discrimination than benchmark comorbidity-index models for both outcomes.

**Meaning.** Using the full set of discharge diagnosis codes may improve short-term claims-based outcome prediction beyond summary comorbidity indices.
:::

## Introduction (Background and Significance)

::: dropcap
Accurate prediction and risk adjustment for short-term clinical outcomes, such as 30-day mortality and readmission, are critical for enhancing healthcare research quality, allowing fair assessment of healthcare outcomes and quality metrics [@cms_hrrp]. Most claims-based risk adjustment continues to rely on comorbidity indices such as the Charlson Comorbidity Index (CCI) and Elixhauser Comorbidity Index (ECI), which map diagnosis codes to a limited set of conditions [@charlson1987; @elixhauser1998]. While these indices are interpretable and widely deployed, they inevitably discard granularity and may miss clinically meaningful comorbidity patterns and interactions among diagnoses.
:::

Recent machine-learning approaches increasingly use a set of ICD-10-CM diagnosis codes and have demonstrated improved prediction for a range of outcomes [@deschepper2020; @lelay2022; @qiao2022]. However, many approaches simplify or truncate ICD codes, aggregate diagnosis lists in ways that depend on code order, or are trained and evaluated in settings where coding practices differ across sites---each of which can limit generalizability across settings. In addition, many claims-based studies focus on in-hospital mortality and do not evaluate postdischarge mortality among outcomes relevant at the time of discharge [@qiao2022; @davis2022; @harerimana2021; @matsui2022; @nguyen2017].

In this study, we developed and temporally validated a claims-based deep learning model using ICD-10-CM diagnosis codes to predict 30-day unplanned readmission and 30-day postdischarge mortality in the Nationwide Readmissions Database. We compared its performance with benchmark models based on the Charlson and Elixhauser comorbidity indices, which are widely used for claims-based risk adjustment but were not originally designed for these specific outcomes. We also evaluated the model under different architectural designs and examined diagnosis-level contributions to model predictions.

## Materials and Methods

### Study Design, Data Source, and Oversight

We conducted a retrospective cohort study using the Healthcare Cost and Utilization Project (HCUP) Nationwide Readmissions Database (NRD), 2016--2022. Adult discharges from 2016 through 2020 were used for model development, and a later, temporally separated cohort from 2021 through 2022 was reserved for temporal validation. Discharges in December of each year were excluded to allow complete 30-day follow-up within the same calendar year.

Use of the NRD was governed by the HCUP data use agreement. Because the NRD contains deidentified data, the institutional review board determined the study was not human participants research and that informed consent was not required.

### Cohort Definition

We included hospitalizations for patients aged 18 years or older with a valid patient linkage identifier within each calendar year. For both the readmission and mortality analyses, index hospitalizations ending in in-hospital death were excluded because patients were not at risk for postdischarge outcomes. In-hospital death during the index hospitalization was examined in a prespecified secondary mortality analysis (eResults 1).

### Outcomes

The coprimary outcomes were (1) 30-day unplanned readmission and (2) 30-day postdischarge in-hospital mortality (hereafter, postdischarge mortality). Readmissions were classified as unplanned if they were coded as nonelective admissions in the HCUP database. Postdischarge mortality was defined as inpatient death occurring during a subsequent hospitalization within 30 days after discharge. Deaths outside the hospital are not captured in the NRD.

### Predictors

For each index hospitalization, we used up to 40 ICD-10-CM diagnosis codes (principal and secondary) and patient-level covariates (age, sex, primary payer, and ZIP-code median income quartile). Age was standardized, and categorical variables were represented using one-hot encoding. Analyses were restricted to records with nonmissing outcome ascertainment and complete covariates.
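
As a minimal illustration of this predictor layout, the sketch below label-encodes a diagnosis list and pads or truncates it to the fixed 40-code input width. The helper name, vocabulary, and example codes are hypothetical, not taken from the study pipeline.

```python
import numpy as np

MAX_CODES = 40  # up to 40 principal and secondary diagnoses per hospitalization
PAD_ID = 0      # assumed reserved identifier for padding positions

def encode_diagnoses(codes: list[str], vocab: dict[str, int]) -> np.ndarray:
    """Label-encode a diagnosis list and pad it to the fixed model input width."""
    ids = [vocab[c] for c in codes[:MAX_CODES] if c in vocab]
    return np.array(ids + [PAD_ID] * (MAX_CODES - len(ids)), dtype=np.int32)

# Example with a hypothetical label encoding: a two-code list padded to 40 slots.
vocab = {"I5023": 1, "N179": 2, "E1122": 3}
print(encode_diagnoses(["I5023", "N179"], vocab))  # -> [1 2 0 0 ... 0]
```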

### Comparator Models

For benchmarking, we computed the Elixhauser Comorbidity Index (ECI) and Charlson Comorbidity Index (CCI) for each index hospitalization and treated each index as a continuous risk score [@charlson1987; @elixhauser1998]. The ECI identifies 30+ distinct conditions from administrative data, serving as a critical tool for risk adjustment in studies evaluating in-hospital mortality and short-term readmissions [@elixhauser1998; @ahrq_elixhauser; @fernando2019; @quan2005]. The CCI consolidates up to 19 comorbid conditions into a weighted numeric score, including variants that adjust for age, primarily predicting long-term mortality and readmissions [@charlson1987; @fernando2019; @quan2005; @deyo1992; @quan2011]. The ECI was computed using an ICD-10-CM--adapted AHRQ approach that identifies chronic comorbidities primarily from secondary diagnoses [@quan2005]. The CCI was computed using ICD-10-CM mappings to 17 comorbidity categories; both raw CCI and age-adjusted CCI were evaluated [@stagg2006]. These benchmark models used the index score alone as the predictor; age, sex, primary payer, and ZIP-code income quartile were not added separately. Discrimination and threshold-dependent classification metrics were derived directly from the score distributions, with operating thresholds selected on the validation set and then applied unchanged to the temporal test evaluation subsample.
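
As a concrete illustration of how such a score is assembled, this sketch computes a weighted Charlson total with the traditional age adjustment from already-mapped comorbidity categories. The two weights shown follow the original Charlson weighting; the remaining category weights and the ICD-10-CM mapping itself are elided, and all names are illustrative rather than the study's code.

```python
# Hypothetical excerpt of the 17-category Charlson weight table (weights 1-6).
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "metastatic_solid_tumor": 6,
    # ... remaining mapped categories ...
}

def age_adjusted_cci(comorbidities: set[str], age: int) -> int:
    """Sum category weights, then add the traditional age points."""
    score = sum(CHARLSON_WEIGHTS.get(c, 0) for c in comorbidities)
    # Age adjustment: +1 point per decade starting at age 50, capped at +4.
    if age >= 50:
        score += min((age - 40) // 10, 4)
    return score

print(age_adjusted_cci({"myocardial_infarction"}, age=72))  # -> 1 + 3 = 4
```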

### Model Architecture

We developed a deep learning framework that embeds each patient's diagnosis list along with demographic and socioeconomic information to predict the outcome. Each ICD-10-CM code was mapped into a dense vector representation through a learned numerical transformation. To obtain a single representation for each patient while avoiding reliance on diagnosis ordering, we used an aggregation approach that does not depend on code order:

$$f(X) = \rho\!\left(\sum_{x \in X} \phi(x)\right)$$

where $X$ denotes the set of embedded diagnosis vectors. Functions $\phi$ and $\rho$ were implemented as multilayer perceptrons with ReLU activations [@zaheer2017].

Demographic and socioeconomic variables were processed via a separate 2-layer multilayer perceptron. The resulting vector was concatenated with the aggregated diagnosis representation and passed through fully connected layers with ReLU activations and dropout regularization. A final sigmoid output layer produced a predicted probability for each outcome.
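
A minimal TensorFlow/Keras sketch of this architecture follows. The vocabulary size, embedding dimension, layer widths, and covariate width are assumed placeholders; the published configuration was selected by random search (eTable 1), so none of the numbers below should be read as the study's hyperparameters.

```python
import tensorflow as tf

VOCAB_SIZE = 70_000  # assumed size of the label-encoded ICD-10-CM vocabulary (0 = padding)
MAX_CODES = 40       # up to 40 diagnoses per hospitalization
EMB_DIM = 64         # assumed embedding dimension
N_DEMO = 12          # assumed width of the one-hot covariate vector

codes = tf.keras.Input(shape=(MAX_CODES,), dtype="int32", name="icd_codes")
demo = tf.keras.Input(shape=(N_DEMO,), dtype="float32", name="demographics")

# Trainable ICD-10-CM embeddings; padding slots (id 0) are masked out below.
emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(codes)            # (B, 40, 64)

# phi: shared MLP applied to each diagnosis embedding independently.
per_code = tf.keras.layers.Dense(128, activation="relu")(emb)
per_code = tf.keras.layers.Dense(128, activation="relu")(per_code)     # (B, 40, 128)

# Zero out padded slots, then sum-pool: the sum is what makes f order-invariant.
mask = tf.keras.layers.Lambda(
    lambda c: tf.cast(tf.not_equal(c, 0), tf.float32)[..., tf.newaxis])(codes)
pooled = tf.keras.layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([per_code, mask])    # (B, 128)

# rho: decoder MLP over the pooled set representation.
set_repr = tf.keras.layers.Dense(128, activation="relu")(pooled)

# Separate 2-layer tower for the demographic and socioeconomic covariates.
demo_repr = tf.keras.layers.Dense(32, activation="relu")(demo)
demo_repr = tf.keras.layers.Dense(32, activation="relu")(demo_repr)

x = tf.keras.layers.Concatenate()([set_repr, demo_repr])
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.3)(x)
risk = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[codes, demo], outputs=risk)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])
```

Sum pooling (rather than a recurrent or attention pass over an ordered sequence) is the design choice that guarantees the prediction is identical under any permutation of the diagnosis list.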

### Model Development and Temporal Validation

Data from 2016--2020 were split into training (90%) and validation (10%) sets. For each outcome, models were trained to minimize binary cross-entropy loss. To address class imbalance, majority-class downsampling was applied during training (see Supplementary eMethods 1). Because majority-class downsampling altered the effective outcome prevalence in the training data, predicted probabilities were corrected using the original training-set prevalence before reporting calibration, temporal-test probabilities, and web-calculator outputs [@pozzolo2015]. This deterministic correction affects probability scaling but not rank-based discrimination. Hyperparameters (embedding dimension, Deep Sets depth/width, demographic tower width, predictor multilayer perceptron configuration, and dropout rate) were tuned using random search; the configuration with the best validation AUROC (with recall-weighted metrics used as secondary criteria) was selected (see Supplementary eTable 1).
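
A short sketch of this prevalence correction, following the closed form in Dal Pozzolo et al. (2015) for undersampled training data; the function name is illustrative, and `beta` is the fraction of majority-class (negative) discharges retained to reach the 1:1 training ratio.

```python
import numpy as np

def correct_downsampled_probability(p_s: np.ndarray, prevalence: float) -> np.ndarray:
    """Map scores from a 1:1-downsampled model back to the original prevalence.

    p_s        : predicted probabilities from the downsample-trained model
    prevalence : outcome prevalence in the full (non-downsampled) training data
    """
    beta = prevalence / (1.0 - prevalence)  # P(keep a negative) for a 1:1 ratio
    return beta * p_s / (beta * p_s - p_s + 1.0)

# Example: a raw score of 0.50 from the balanced model maps back to roughly the
# 0.6% scale of a rare outcome, which is what calibration reporting requires.
print(correct_downsampled_probability(np.array([0.5]), prevalence=0.006))
```

Because the correction is a monotone transform of the scores, it changes calibration but leaves rank-based metrics such as AUROC unchanged, exactly as stated above.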

Temporal validation was based on eligible 2021--2022 discharges. To support computational feasibility while preserving outcome prevalence, primary performance evaluation was conducted in a prespecified stratified random subsample of the eligible 2021--2022 temporal test cohort, with 10% of outcome-positive and 10% of outcome-negative discharges sampled for each outcome (Supplementary eMethods 2). Models were implemented in Python using TensorFlow [@abadi2016].

### Performance Metrics

For threshold-dependent metrics, binary classification thresholds were selected on the validation set by maximizing the Youden index (sensitivity + specificity − 1) and then applied unchanged to the temporal test evaluation subsample for each model [@youden1950].

Because outcomes were imbalanced, we emphasized discrimination and precision--recall performance. Primary metrics included AUROC with 95% confidence intervals (CIs), average precision, precision, recall, $F_1$ score, and $F_2$ score (placing greater weight on recall).
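
A compact sketch of the threshold selection and recall-weighted scoring described above: the operating point maximizes Youden's J on validation predictions and is then applied unchanged to the test subsample. Function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve, fbeta_score

def youden_threshold(y_val: np.ndarray, p_val: np.ndarray) -> float:
    """Pick the threshold maximizing J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    return thresholds[np.argmax(tpr - fpr)]

def recall_weighted_metrics(y_test, p_test, threshold):
    """Binarize at the frozen threshold and report F1 and recall-weighted F2."""
    y_hat = (p_test >= threshold).astype(int)
    return {
        "F1": fbeta_score(y_test, y_hat, beta=1.0),
        "F2": fbeta_score(y_test, y_hat, beta=2.0),  # weights recall over precision
    }
```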

### Statistical Analysis

AUROCs and their 95% CIs were estimated using DeLong's nonparametric method [@delong1988]. Pairwise comparisons in AUROC between the embedding model and each comorbidity-index comparator were performed using DeLong tests for correlated ROC curves [@delong1988]. Resulting $P$ values are unadjusted and interpreted alongside effect sizes and 95% CIs.
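
The comparisons above use DeLong's method; as a self-contained illustration only, the sketch below estimates a CI for the paired AUROC difference with a nonparametric bootstrap instead, a common substitute when a DeLong implementation is unavailable. All names are illustrative, and inputs are assumed to be aligned NumPy arrays over the same discharges.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_auc_diff_ci(y, p_model, p_comparator, n_boot=2000, seed=0):
    """Bootstrap the paired AUROC difference between two models on shared labels."""
    rng = np.random.default_rng(seed)
    n, diffs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)        # resample discharges with replacement
        if len(np.unique(y[idx])) < 2:     # an AUC needs both classes present
            continue
        diffs.append(roc_auc_score(y[idx], p_model[idx]) -
                     roc_auc_score(y[idx], p_comparator[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), (float(lo), float(hi))
```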

We conducted prespecified ablation analyses to estimate the incremental contribution of key model components, including addition of transformer blocks, replacement of the order-invariant Deep Sets aggregator with a permutation-variant flattening comparator, and removal of demographic and socioeconomic inputs; details are provided in Supplementary eMethods 4.

### Model Interpretation

We used Integrated Gradients (IG) to estimate code-level contributions to model predictions for each outcome [@sundararajan2017; @placido2023]. Attribution values were summarized at the ICD-10-CM code level, with positive values indicating higher predicted risk and negative values indicating lower predicted risk. To reduce instability from rare codes, ranked summaries were restricted to codes with at least 50 occurrences in the temporal test evaluation subsample. Additional implementation details are provided in Supplementary eMethods 3.
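
A sketch of IG over the diagnosis-embedding inputs, following Sundararajan et al. (2017): gradients are accumulated along a straight-line path from a zero-embedding baseline to the actual embeddings, then summed per code slot. Splitting the model into `embed_fn` (codes to embeddings) and `score_fn` (embeddings plus demographics to risk) is an assumption of this sketch, as is the left-Riemann approximation of the path integral.

```python
import tensorflow as tf

def integrated_gradients(embed_fn, score_fn, codes, demo, steps=64):
    """Per-code IG attributions for a single encounter (batch size 1)."""
    x = embed_fn(codes)                      # (1, 40, emb_dim) actual embeddings
    baseline = tf.zeros_like(x)              # zero-embedding baseline
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps), (steps, 1, 1, 1))
    path = baseline + alphas * (x - baseline)            # (steps, 1, 40, emb_dim)
    path = tf.reshape(path, (-1, x.shape[1], x.shape[2]))
    with tf.GradientTape() as tape:
        tape.watch(path)
        demo_tiled = tf.repeat(demo, steps, axis=0)
        preds = score_fn(path, demo_tiled)               # (steps, 1) risks
    grads = tape.gradient(preds, path)                   # d(risk)/d(embedding)
    avg_grads = tf.reduce_mean(
        tf.reshape(grads, (steps, 1, x.shape[1], x.shape[2])), axis=0)
    ig = (x - baseline) * avg_grads                      # (1, 40, emb_dim)
    return tf.reduce_sum(ig, axis=-1)                    # one value per code slot
```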

### Web Application

A public, read-only web calculator accepts discharge diagnosis lists and returns risk estimates with code-level explanations; inputs are not stored. The tool is intended for research and demonstration purposes rather than clinical decision-making. The calculator is available at: <https://levineuwirth.github.io/icd_embeddings/>. Implementation details are provided in Supplementary eFigure 4.

## Results

### Cohort size and event prevalence

The NRD included 80,217,696 discharges in 2016--2020 for model development and 33,322,761 discharges in 2021--2022 for temporal testing (Figure 1). After application of outcome-specific eligibility criteria, the validation cohort included 7,828,015 discharges. Primary performance evaluation was conducted in a prespecified stratified random subsample of 3,226,831 discharges from the eligible 2021--2022 temporal test cohort, and this evaluation subsample was not downsampled. For model training, majority-class downsampling was applied to address class imbalance, yielding analytic training samples of 17,200,994 discharges for the readmission analysis and 544,138 discharges for the postdischarge mortality analysis. In the temporal test evaluation subsample, 30-day unplanned readmission occurred in 362,696 discharges (11.2%), and 30-day postdischarge in-hospital mortality occurred in 13,071 discharges (0.4%).

### Model Performance

Detailed performance metrics for the ICD-10-CM--based model and benchmark comorbidity-index models are shown in Table 1. In the temporal test evaluation subsample, the ICD-10-CM--based model showed higher discrimination than comparator models for both 30-day unplanned readmission and 30-day postdischarge mortality (Figure 2). For readmission, the AUROC was 0.750 (95% CI, 0.749--0.750) for the ICD-10-CM--based model, compared with 0.655 (95% CI, 0.654--0.656) for the CCI model, 0.644 (95% CI, 0.644--0.645) for the age-adjusted CCI model, and 0.636 (95% CI, 0.635--0.637) for the ECI model. For postdischarge mortality, the AUROC was 0.856 (95% CI, 0.853--0.858) for the ICD-10-CM--based model, compared with 0.784 (95% CI, 0.781--0.787) for the best-performing comparator, the age-adjusted CCI model; the AUROC for the ECI model was 0.641 (95% CI, 0.636--0.647). DeLong tests comparing the ICD-10-CM--based model with each comorbidity-index comparator were significant for all pairwise comparisons ($P < .001$). Calibration curves showed overall agreement between predicted and observed risk for both outcomes, with greater deviation at higher predicted-risk ranges; this deviation was less pronounced for postdischarge mortality than for readmission (Figure 3).

At the prespecified threshold selected on the validation set, the ICD-10-CM--based model showed higher recall-weighted performance than comparator models, with $F_2$ scores of 0.485 vs 0.407 for 30-day readmission and 0.053 vs 0.048 for postdischarge mortality. Threshold-dependent metrics, including precision, recall, and specificity, are shown in Table 1. Because the classification threshold was selected using the Youden index, these metrics reflect a balance of sensitivity and specificity rather than optimization for a specific clinical use case. For readmission, these gains were accompanied by only modest precision, consistent with the difficulty of predicting this heterogeneous outcome.

In the prespecified secondary analysis expanding mortality to include in-hospital death during the index hospitalization, the ICD-10-CM--based model achieved an AUROC of 0.965 (95% CI, 0.965--0.966), exceeding that of the best-performing comparator model (age-adjusted CCI: AUROC, 0.750 [95% CI, 0.749--0.751]) (eTable 2).

### Ablation Studies

We evaluated a set of prespecified model variants that removed or augmented architectural components (eg, ICD-only inputs and insertion of transformer blocks) to estimate the incremental contribution of each element. We also compared the order-invariant aggregation approach with an order-dependent flattening-based aggregator to quantify any performance tradeoff attributable to enforcing invariance. In covariate ablation, removing demographic and socioeconomic inputs (age, sex, payer, and ZIP-income quartile) caused modest attenuation in performance (readmission AUROC, 0.750 vs 0.748; postdischarge mortality AUROC, 0.856 vs 0.848; similar $F_2$ scores), suggesting that diagnosis patterns captured most, but not all, of the predictive signal. Implementation details are provided in Supplementary eMethods 4; results are summarized in eTable 3.

### Feature Importance

ICD-10-CM codes with the 10 highest positive and negative contributions to both prediction outcomes are shown in Figure 4. For 30-day readmission, acute myeloblastic leukemia, in relapse (C9202), had the greatest positive contribution, whereas encounter for care and examination of mother immediately after delivery (Z390) had the greatest negative contribution. For 30-day postdischarge mortality, C9202 also had the greatest positive contribution, whereas assault by unspecified sharp object, initial encounter (X999XXA) had the greatest negative contribution. The most influential diagnosis codes for 30-day mortality prediction including inpatient death are shown in Supplementary eFigure 2.
|
||||
|
||||
## Discussion
|
||||
|
||||
In this national claims-based cohort study, a deep learning model using the full set of discharge diagnosis codes showed better discrimination than benchmark models based on Charlson and Elixhauser comorbidity indices for both 30-day unplanned readmission and 30-day postdischarge in-hospital mortality. The performance gain was larger for postdischarge mortality than for readmission. Performance remained favorable in a later, temporally separated NRD cohort, supporting robustness across subsequent years of the same database [@davis2020; @collins2024]. At the same time, these comparisons should be interpreted as benchmarking against widely used summary comorbidity approaches rather than as head-to-head comparisons with models purpose-built for these exact outcomes.
|
||||
|
||||
### Comparison with prior work
|
||||
|
||||
This pattern is consistent with the structure of the compared methods. Charlson and Elixhauser indices compress diagnosis information into a limited set of predefined conditions and were designed primarily for broad case-mix adjustment rather than high-resolution outcome prediction [@charlson1987; @elixhauser1998]. By contrast, the present model learns from the full diagnosis-code set and can represent co-occurrence patterns that are not captured by summary indices [@morgan2019; @beam2018]. Unlike many prior deep-learning approaches that depend on richer electronic health record inputs and site-specific preprocessing, this framework was designed for portability within claims-based settings by using routinely available diagnosis, demographic, and payer-related variables [@rajkomar2018]. This design also preserves the broader diagnostic context of each hospitalization rather than reducing diagnoses to fixed summary weights as ECI and CCI.

### Interpretability

Interpretability in this setting should not be viewed as an afterthought to an otherwise opaque model. Using Integrated Gradients, the model provided code-level attributions that were generally clinically plausible and helped explain why predicted risk increased or decreased for a given patient. For example, diagnoses associated with high treatment burden or advanced systemic illness, such as relapsed acute myeloid leukemia and alcoholic cirrhosis with ascites, tended to increase predicted risk, whereas postpartum encounters and some assault-related injuries tended to decrease it. These findings suggest that learning-based models can yield clinically meaningful information rather than functioning only as "black boxes," even when they are more flexible than traditional summary indices [@sundararajan2017; @placido2023; @rudin2019].

One plausible explanation for the performance gap between the present model and the Charlson and Elixhauser indices is that the prognostic contribution of a diagnosis is not fixed across patients. Summary comorbidity indices assign prespecified, static weights to diagnosis groups, effectively assuming that a given condition contributes similarly regardless of the broader diagnostic context. By contrast, in the present model, the contribution of a diagnosis could vary according to the full set of co-occurring diagnoses, which is more consistent with how risk is often understood clinically. We did not directly test this mechanism, so it should be interpreted as a hypothesis supported by the attribution patterns rather than as a proven explanation for the observed performance differences. Nevertheless, this dynamic view of diagnosis contribution may help explain why retaining the full diagnosis-code context improved prediction beyond summary comorbidity scores. As with other attribution methods, these explanations improve transparency but do not establish causality.

### Clinical and policy implications

These findings have two potential implications. First, in discharge-facing workflows, a claims-compatible model could be evaluated in read-only settings to identify patients who may warrant closer follow-up, medication reconciliation, or transitional-care outreach [@hansen2011; @coleman2006]. Second, in research and quality measurement, more granular use of diagnosis data may improve outcome prediction when summary comorbidity indices underrepresent diagnostic complexity [@joynt2013; @desai2016; @zuckerman2017]. This tool is intended for clinical prioritization and equitable quality measurement, not for coverage denial or utilization gatekeeping [@obermeyer2019].

Because demographic and socioeconomic factors are known to influence postdischarge outcomes, we assessed their incremental contribution beyond diagnosis patterns using ablation [@kind2014; @joynt2011]. Removing age, sex, payer, and neighborhood income produced minimal changes in performance, suggesting that much of the predictive signal available to this model was already captured by diagnosis patterns. This finding should not be interpreted to mean that demographic or socioeconomic factors are unimportant. Rather, within this claims-based framework, coded diagnoses may already capture part of the risk signal associated with demographic and socioeconomic differences, whether through differences in disease burden, comorbidity clustering, or patterns of healthcare use.

The public read-only calculator is intended for research and demonstration rather than clinical deployment. Future work should focus on external validation, prospective evaluation in read-only workflows, monitoring for coding and case-mix drift, and recalibration when needed [@davis2020; @collins2024; @collins2015].

### Limitations

This study has several limitations. First, the NRD captures deaths only during inpatient encounters; therefore, the mortality outcome reflects postdischarge in-hospital mortality rather than all-cause 30-day mortality. Second, claims data are subject to coding error and variation and do not directly capture functional status, physiologic severity, or many social risk factors. Third, although temporal validation in later NRD years reduces optimism, it is not a substitute for external validation, and performance may differ in other health systems or data sources with different coding practices, case mix, and discharge workflows. Fourth, this study evaluated predictive performance rather than downstream improvement in confounding control, hospital profiling, or other risk-adjustment applications; thus, better discrimination does not by itself establish superior risk adjustment, and because the model was trained specifically for 30-day unplanned readmission and postdischarge in-hospital mortality, performance may not generalize to other outcomes without separate validation. Fifth, Charlson and Elixhauser indices were included as benchmark comparators because of their widespread use in claims-based analyses, but they were not originally developed for these specific outcomes; accordingly, these comparisons should be interpreted as benchmarking rather than definitive head-to-head testing. Finally, attribution methods may improve transparency but do not establish causality [@rudin2019].

## Conclusions

A deep learning model using ICD-10-CM diagnosis codes improved prediction of 30-day unplanned readmission and 30-day postdischarge mortality compared with Charlson and Elixhauser comorbidity-index models in the Nationwide Readmissions Database. Prospective validation, drift monitoring, and attention to intended use will be essential before implementation for clinical decision support or policy applications.

## Data Sharing Statement

The study used de-identified data from the Healthcare Cost and Utilization Project (HCUP) Nationwide Readmissions Database under a data-use agreement. Data are available from HCUP to qualified researchers.

## Code Availability

The analytic code (including the non-elective readmission implementation, model training, and evaluation) will be made publicly available at publication in a GitHub repository (<https://github.com/Rice-wxl/icd-10-embedding>), with a versioned release tag/commit to support reproducibility. HCUP NRD data cannot be shared by the authors under the data-use agreement.

## Conflict of Interest Disclosures

The authors report no conflicts of interest related to this work.

## Tables

### Table 1: Performance metrics for the ICD model vs. CCI and ECI {#table-1}

Performance metrics for the ICD model vs. CCI and ECI for 30-day readmission (a) and 30-day postdischarge mortality (b) in the temporal test evaluation subsample.

**(a) 30-day readmission**

| Methods | AUC-ROC | Accuracy | Precision | Recall | $F_1$ | $F_2$ |
|:-------------------------------|:-----------------------------|---------:|-----------:|-----------:|-----------:|-----------:|
| ICD Model (Threshold: 0.5022) | **0.7496** [0.7488, 0.7504] | 0.5892 | **0.1881** | **0.8006** | **0.3046** | **0.4848** |
| CCI | 0.6553 [0.6544, 0.6562] | **0.6962** | 0.1844 | 0.4973 | 0.2690 | 0.3713 |
| CCI Age-Adjusted | 0.6444 [0.6435, 0.6453] | 0.6479 | 0.1673 | 0.5360 | 0.2550 | 0.3720 |
| ECI | 0.6363 [0.6353, 0.6372] | 0.5708 | 0.1598 | 0.6622 | 0.2575 | 0.4066 |

**(b) 30-day postdischarge mortality**

| Methods | AUC-ROC | Accuracy | Precision | Recall | $F_1$ | $F_2$ |
|:-------------------------------|:-----------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|
| ICD Model (Threshold: 0.4644) | **0.8557** [0.8532, 0.8581] | 0.6848 | **0.0111** | **0.8756** | **0.0220** | **0.0530** |
| CCI | 0.7621 [0.7585, 0.7657] | 0.6987 | 0.0093 | 0.6963 | 0.0184 | 0.0442 |
| CCI Age-Adjusted | 0.7844 [0.7813, 0.7874] | 0.7352 | 0.0102 | 0.6700 | 0.0201 | 0.0480 |
| ECI | 0.6414 [0.6358, 0.6469] | **0.7763** | 0.0089 | 0.4915 | 0.0175 | 0.0415 |

*Primary performance evaluation used a prespecified stratified random subsample of eligible 2021--2022 discharges. Classification thresholds were selected on the validation set by maximizing the Youden index and then applied unchanged to the temporal test evaluation subsample.*
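
As an arithmetic cross-check (no new results), the reported $F_1$ and $F_2$ values follow directly from the tabulated precision and recall; for the ICD-model readmission row:

$$
F_\beta = (1+\beta^2)\,\frac{P\,R}{\beta^2 P + R},\qquad
F_1 = \frac{2 \times 0.1881 \times 0.8006}{0.1881 + 0.8006} \approx 0.3046,\qquad
F_2 = \frac{5 \times 0.1881 \times 0.8006}{4 \times 0.1881 + 0.8006} \approx 0.4848.
$$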

## Figures

::: {.annotation .annotation--static}
**Figure 1 --- [placeholder]** Flow chart of discharge records in NRD for training, validation, and testing cohorts. Primary performance evaluation used a prespecified stratified random subsample of eligible 2021--2022 discharges. This temporal test evaluation subsample was not downsampled. Classification thresholds were selected on the validation set by maximizing the Youden index and then applied unchanged to the temporal test subsample.
:::

**Figure 2.** Receiver operating characteristic (ROC) curves and area under the curve (AUC) for 30-day readmission and postdischarge mortality in the temporal test evaluation subsample. Each curve depicts the trade-off between sensitivity and specificity across different thresholds.

![ROC curve readmission](/images/bci/roc_readmission.svg)

![ROC curve postdischarge mortality](/images/bci/roc_mortality.svg)

**Figure 3.** Calibration curves on the temporal test evaluation subsample for 30-day readmission and postdischarge mortality.

![Calibration readmission](/images/bci/calibration_readmission.svg)

![Calibration postdischarge mortality](/images/bci/calibration_mortality.svg)

**Figure 4.** Top influential ICD codes for model prediction. Mean Integrated Gradients attribution per occurrence (positive values indicate higher predicted risk; negative values indicate lower predicted risk).

![IG attributions readmission](/images/bci/ig_readmission.svg)

![IG attributions postdischarge mortality](/images/bci/ig_mortality.svg)

## Supplement

### eMethods

1. **Training and Hyperparameter Tuning**
2. **Temporal Test Set Construction and Threshold Selection**
3. **Integrated Gradients for Code-Level Attribution**
4. **Ablation Analyses**
    1. Transformer Blocks
    2. Deep Sets vs Permutation-Variant Flattening Comparator
    3. Demographic and Socioeconomic Inputs

### eTables

1. **Hyperparameter Configurations Used in Models**
2. **Performance Comparison for 30-Day Mortality Including Index-Hospital Death**
3. **Ablation Study Results**
    - (A) Addition of Transformer Blocks
    - (B) Replacement of Deep Sets With Flattening Comparator
    - (C) Removal of Demographic and Socioeconomic Inputs (ICD-Only)

### eFigures

1. **Permutation-invariant ICD Embedding Model**
2. **Top ICD-10-CM Codes by Integrated Gradients for 30-Day Mortality Including Index-Hospital Death**
3. **Calibration Reliability Plots for Temporal Test Set Predictions**
    - 30-Day Readmission
    - 30-Day Postdischarge Mortality
4. **Web Calculator Interface Examples**

### eMethods 1. Training and Hyperparameter Tuning

Models were trained using batch size 128 for 10 epochs with the Adam optimizer at a learning rate of 2e-5. Early stopping was applied with a patience of 2 epochs, retaining the checkpoint with the best validation performance. To address outcome imbalance, we randomly downsampled majority-class encounters in the training set to achieve a target case-control ratio of 1:1 (validation data and the temporal test evaluation subsample were not downsampled). Because majority-class downsampling changes the effective outcome prevalence and can bias predicted probabilities, we readjusted predicted probabilities using the original downsampling ratio of the training set and used the readjusted model outputs in all metrics, plots, and the web calculator. Random seeds for the train/validation split, downsampling, and model initialization were set to ensure reproducibility.
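
Because the training set was rebalanced, raw model outputs overstate risk. Below is a minimal sketch of the standard closed-form correction for undersampling [@pozzolo2015], offered under the assumption that the readjustment takes this form; `beta` denotes the fraction of majority-class (outcome-negative) training discharges retained by downsampling:

```python
def correct_undersampled_probability(p_s: float, beta: float) -> float:
    """Map a predicted probability p_s from a model trained on downsampled
    data back to the original outcome prevalence. beta is the fraction of
    majority-class (outcome-negative) training discharges that were kept."""
    return beta * p_s / (beta * p_s - p_s + 1.0)
```

As a sanity check, at 1:1 downsampling `beta` equals the positive-to-negative ratio of the full training set, and an output of `p_s = 0.5` maps back to the original outcome prevalence.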

Hyperparameters were tuned via random search (32 trials per outcome). eTable 1 reports the selected configurations. Hyperparameter definitions: $d_{\text{embed}}$ (ICD embedding dimension); $d_{\text{hidden}}$ (Deep Sets hidden dimension); $r_{\text{deepset}} \times d_{\text{hidden}}$ (Deep Sets output dimension); $n_{\text{encode}}$ and $n_{\text{decode}}$ (numbers of Deep Sets encoding/decoding layers); $d_{\text{demo}}$ (first-layer width of the demographic/socioeconomic MLP; second layer set to $d_{\text{demo}}/2$); $d_{\text{mlp}}$ (first-layer width of the predictor MLP, halving each layer to a minimum of 32 over 4 layers); $r_{\text{dropout}}$ (dropout rate). Model selection prioritized validation AUROC; recall-weighted metrics (including $F_2$) were used as secondary criteria.
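
For concreteness, the halving rule for the predictor MLP can be written as a small helper; `predictor_mlp_widths` is a hypothetical name sketching the stated rule, not the authors' code:

```python
def predictor_mlp_widths(d_mlp: int, n_layers: int = 4, floor: int = 32) -> list:
    """Layer widths for the predictor MLP: start at d_mlp and halve each
    subsequent layer, never dropping below the floor of 32 units."""
    widths, w = [], d_mlp
    for _ in range(n_layers):
        widths.append(max(w, floor))
        w //= 2
    return widths

print(predictor_mlp_widths(480))  # [480, 240, 120, 60]
```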

### eMethods 2. Temporal Test Set Construction and Threshold Selection

Temporal testing used eligible 2021--2022 data. To enable computationally feasible evaluation while preserving outcome prevalence, we created the temporal test evaluation subsample by stratified random sampling of the combined eligible 2021--2022 cohort, sampling 10% of outcome-positive and 10% of outcome-negative discharges for each outcome. This yielded approximately 3.2 million discharges for primary performance evaluation. Interpretability analyses used the same temporal test evaluation subsample.
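
In dataframe terms, the prevalence-preserving draw is a per-stratum sample. A minimal sketch with a stand-in cohort follows; the `cohort` dataframe, column name, and prevalence are illustrative, not from the NRD:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the eligible 2021-2022 cohort: one row per discharge,
# "outcome" is the binary label (the real prevalence differs).
cohort = pd.DataFrame({"outcome": rng.binomial(1, 0.12, size=100_000)})

# Sampling 10% within each outcome stratum preserves prevalence by construction.
subsample = (
    cohort.groupby("outcome", group_keys=False)
          .sample(frac=0.10, random_state=42)
)
```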

Binary classification thresholds for each model were selected on the validation set by maximizing the Youden index (sensitivity + specificity − 1) and then applied unchanged to the temporal test evaluation subsample.
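
Operationally, this reduces to an argmax over the validation ROC curve; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Return the threshold maximizing J = sensitivity + specificity - 1
    (equivalently TPR - FPR), chosen on validation data and then frozen."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]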

### eMethods 3. Integrated Gradients for Code-Level Attribution

Integrated Gradients (IG) was used to quantify code-level influence on model predictions. The baseline input was defined as a neutral input corresponding to an empty diagnosis list (ie, no diagnosis codes). The straight-line interpolation path from baseline to the observed input was discretized into 32 steps. At each step, gradients of the model logit were computed with respect to diagnosis embeddings; gradients were accumulated across steps and summed across embedding dimensions to yield a scalar attribution per code occurrence. Attribution values retained sign, with positive values indicating higher predicted risk and negative values indicating lower predicted risk. To reduce instability from rare codes, ICD-10-CM codes with fewer than 50 total occurrences in the temporal test evaluation subsample were excluded from ranked summaries. For each outcome, we reported the 10 codes with the largest mean positive attributions and the 10 codes with the largest mean negative attributions.
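
A minimal sketch of this procedure under the stated settings (32 path steps; a zero embedding matrix standing in for the empty-diagnosis-list baseline); `model_logit_fn` is a hypothetical callable mapping an embedding matrix to the model logit, not the authors' API:

```python
import tensorflow as tf

def integrated_gradients(model_logit_fn, embeddings, steps=32):
    """Signed IG attribution per diagnosis-code occurrence.

    model_logit_fn: callable mapping an (n_codes, d_embed) embedding
                    matrix to a scalar logit (hypothetical interface).
    embeddings:     (n_codes, d_embed) tensor for the observed codes.
    Baseline: zero matrix, standing in for an empty diagnosis list.
    """
    baseline = tf.zeros_like(embeddings)
    grad_sum = tf.zeros_like(embeddings)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint discretization of the path
        point = baseline + alpha * (embeddings - baseline)
        with tf.GradientTape() as tape:
            tape.watch(point)
            logit = model_logit_fn(point)
        grad_sum += tape.gradient(logit, point)
    avg_grad = grad_sum / steps
    # Sum over embedding dimensions -> one signed scalar per code occurrence.
    return tf.reduce_sum((embeddings - baseline) * avg_grad, axis=-1)
```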

### eMethods 4. Ablation Analyses

#### eMethods 4.1 Transformer blocks

To evaluate whether attention-based contextualization improves performance, we added three multi-head transformer blocks operating over individual ICD embeddings. Each block followed a standard transformer design with multi-head attention, residual connections, normalization, and feed-forward sublayers. We used 3 attention heads, dropout 0.3, embedding dimension $d_{\text{embed}}$, and feed-forward dimension $4 \times d_{\text{embed}}$.
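
For reference, a block matching the stated configuration could look like the following Keras sketch; the sublayer ordering (post-norm here) and the ReLU activation are assumptions, not details from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_block(d_embed: int, n_heads: int = 3, dropout: float = 0.3):
    """One encoder block over the set of ICD embeddings (post-norm sketch)."""
    codes = tf.keras.Input(shape=(None, d_embed))  # variable number of codes
    attn = layers.MultiHeadAttention(
        num_heads=n_heads, key_dim=max(1, d_embed // n_heads), dropout=dropout
    )(query=codes, value=codes)
    x = layers.LayerNormalization()(layers.Add()([codes, attn]))
    ff = layers.Dense(4 * d_embed, activation="relu")(x)  # 4 x d_embed wide
    ff = layers.Dropout(dropout)(layers.Dense(d_embed)(ff))
    out = layers.LayerNormalization()(layers.Add()([x, ff]))
    return tf.keras.Model(codes, out)
```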

Comparative results are shown in eTable 3, Panel A. Across both outcomes, transformer blocks did not materially improve AUROC or $F_2$ score relative to the base model. Reported $P$ values for AUROC differences were $P < .001$ for readmission and $P = .57$ for postdischarge mortality.

#### eMethods 4.2 Deep Sets vs permutation-variant flattening comparator

To test whether permutation invariance via Deep Sets reduced predictive performance, we compared the base model with a permutation-variant alternative: a flattening layer that converts the 2-dimensional ICD embedding matrix into a 1-dimensional vector, followed by two MLP layers with $d_{\text{hidden}}$ and $r_{\text{deepset}} \times d_{\text{hidden}}$ units to mirror the Deep Sets hidden/output sizes.
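
A minimal sketch of that comparator, assuming a fixed maximum code count with zero-padding (the padding scheme and `max_codes` are assumptions, since flattening requires a static input length):

```python
import tensorflow as tf
from tensorflow.keras import layers

def flattening_comparator(max_codes: int, d_embed: int,
                          d_hidden: int, r_deepset: float):
    """Permutation-variant aggregator: flatten the embedding matrix, then
    two dense layers mirroring the Deep Sets hidden/output widths."""
    codes = tf.keras.Input(shape=(max_codes, d_embed))  # padded, fixed length
    x = layers.Flatten()(codes)                         # code order now matters
    x = layers.Dense(d_hidden, activation="relu")(x)
    x = layers.Dense(int(r_deepset * d_hidden), activation="relu")(x)
    return tf.keras.Model(codes, x)
```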

Results are shown in eTable 3, Panel B. The base Deep Sets models outperformed the flattening comparators on AUROC and $F_2$ score. Reported $P$ values for AUROC differences were $P < .001$ for readmission and $P = .014$ for postdischarge mortality.

#### eMethods 4.3 Demographic and socioeconomic inputs

To evaluate the incremental value of non-diagnosis covariates, we removed the 2-layer demographic/socioeconomic MLP and trained ICD-only variants.

Results are shown in eTable 3, Panel C. ICD-only variants had slightly worse AUROC and $F_2$ score. Reported $P$ values for AUROC differences were $P = .020$ for readmission and $P < .001$ for postdischarge mortality.

### eTable 1. Hyperparameter configurations used in models

| Outcome | $d_{\text{embed}}$ | $d_{\text{hidden}}$ | $r_{\text{deepset}}$ | $n_{\text{encode}}$ | $n_{\text{decode}}$ | $d_{\text{demo}}$ | $d_{\text{mlp}}$ | $r_{\text{dropout}}$ |
|:--------------------------------------------------|---:|----:|----:|---:|---:|---:|----:|----:|
| 30-day readmission | 32 | 416 | 0.5 | 1 | 3 | 64 | 480 | 0.1 |
| 30-day postdischarge mortality | 64 | 320 | 0.6 | 2 | 1 | 64 | 384 | 0.1 |
| 30-day mortality including index-hospital death | 64 | 416 | 0.8 | 3 | 3 | 64 | 448 | 0.4 |

*Model configurations were determined through hyperparameter tuning for each outcome separately. All models share the same batch size and learning rate.*

### eTable 2. Performance comparison for 30-day mortality including index-hospital death

| Methods | AUC-ROC | Precision | Recall | $F_1$ | $F_2$ |
|:-----------------|:-------------------------|----------:|-------:|-------:|-------:|
| ICD Model | 0.9651 [0.9647, 0.9656] | 0.2107 | 0.9165 | 0.3427 | 0.5489 |
| CCI | 0.7217 [0.7203, 0.7231] | 0.0663 | 0.6270 | 0.1200 | 0.2331 |
| CCI Age-Adjusted | 0.7501 [0.7489, 0.7513] | 0.0724 | 0.6043 | 0.1294 | 0.2448 |
| ECI | 0.6158 [0.6139, 0.6177] | 0.0637 | 0.4427 | 0.1114 | 0.2022 |

### eTable 3. Ablation study results

**(A) Addition of transformer blocks**

| Outcome Variable | Model Variant | AUC-ROC | AUC-ROC CI | Precision | Recall | $F_1$ | $F_2$ |
|:-------------------------------|:---------------------|--------:|:-----------------|----------:|-------:|-------:|-------:|
| 30-day readmission | Full Model | 0.7496 | [0.7488, 0.7504] | 0.1881 | 0.8006 | 0.3046 | 0.4848 |
| | 3 Transformer Blocks | 0.7472 | [0.7464, 0.7479] | 0.1974 | 0.7565 | 0.3131 | 0.4829 |
| 30-day postdischarge mortality | Full Model | 0.8557 | [0.8532, 0.8581] | 0.0111 | 0.8756 | 0.0220 | 0.0530 |
| | 3 Transformer Blocks | 0.8547 | [0.8523, 0.8572] | 0.0114 | 0.8662 | 0.0225 | 0.0542 |

**(B) Replacement of Deep Sets with flattening comparator**

| Outcome Variable | Model Variant | AUC-ROC | AUC-ROC CI | Precision | Recall | $F_1$ | $F_2$ |
|:-------------------------------|:----------------|--------:|:-----------------|----------:|-------:|-------:|-------:|
| 30-day readmission | Full Model | 0.7496 | [0.7488, 0.7504] | 0.1881 | 0.8006 | 0.3046 | 0.4848 |
| | Without DeepSet | 0.7474 | [0.7466, 0.7482] | 0.1933 | 0.7724 | 0.3092 | 0.4829 |
| 30-day postdischarge mortality | Full Model | 0.8557 | [0.8532, 0.8581] | 0.0111 | 0.8756 | 0.0220 | 0.0530 |
| | Without DeepSet | 0.8513 | [0.8488, 0.8538] | 0.0108 | 0.8777 | 0.0214 | 0.0515 |

**(C) Removal of demographic and socioeconomic inputs (ICD-only)**

| Outcome Variable | Model Variant | AUC-ROC | AUC-ROC CI | Precision | Recall | $F_1$ | $F_2$ |
|:-------------------------------|:----------------|--------:|:-----------------|----------:|-------:|-------:|-------:|
| 30-day readmission | Full Model | 0.7496 | [0.7488, 0.7504] | 0.1881 | 0.8006 | 0.3046 | 0.4848 |
| | ICD Inputs Only | 0.7483 | [0.7475, 0.7490] | 0.1907 | 0.7868 | 0.3070 | 0.4842 |
| 30-day postdischarge mortality | Full Model | 0.8557 | [0.8532, 0.8581] | 0.0111 | 0.8756 | 0.0220 | 0.0530 |
| | ICD Inputs Only | 0.8483 | [0.8457, 0.8509] | 0.0110 | 0.8627 | 0.0218 | 0.0525 |

### eFigure 1. Permutation-invariant ICD Embedding Model

![Model architecture](/images/bci/efigure1_architecture.svg)

### eFigure 2. Top ICD-10-CM codes by Integrated Gradients for 30-day mortality including index-hospital death

![IG attributions, mortality including index death](/images/bci/efigure2_ig_mortality_incl_index.svg)

### eFigure 3. Calibration reliability plots for temporal test evaluation subsample predictions

*(A) 30-day unplanned readmission and (B) 30-day postdischarge mortality.*

::: {.annotation .annotation--static}
**eFigure 3 --- [placeholder]** Calibration reliability plots for temporal test evaluation subsample predictions.
:::

### eFigure 4. Web Calculator Interface Examples

*(A) With demographics and (B) Without demographics.*

**(A)**

![Web calculator with demographics](/images/bci/calculator_with_demo.png)

**(B)**

![Web calculator without demographics](/images/bci/calculator_without_demo.png)

% bci-paper.bib — Beyond Comorbidity Indices
% BibLaTeX format. References for the "Deep Learning of Diagnosis Codes
% for Readmission and Postdischarge Mortality Prediction" manuscript.
% Numbered order matches the docx reference list.

@misc{cms_hrrp,
  author = {{Centers for Medicare \& Medicaid Services}},
  title = {Hospital Readmissions Reduction Program ({HRRP})},
  url = {https://www.cms.gov/medicare/payment/prospective-payment-systems/acute-inpatient-pps/hospital-readmissions-reduction-program-hrrp}
}

@article{charlson1987,
  author = {Charlson, M. E. and Pompei, P. and Ales, K. L. and MacKenzie, C. R.},
  title = {A new method of classifying prognostic comorbidity in longitudinal studies: development and validation},
  journal = {Journal of Chronic Diseases},
  year = {1987},
  volume = {40},
  number = {5},
  pages = {373--383}
}

@article{elixhauser1998,
  author = {Elixhauser, A. and Steiner, C. and Harris, D. R. and Coffey, R. M.},
  title = {Comorbidity measures for use with administrative data},
  journal = {Medical Care},
  year = {1998},
  volume = {36},
  number = {1},
  pages = {8--27}
}

@article{deschepper2020,
  author = {Deschepper, Mathieu and Eeckloo, Kristof and Vogelaers, Dominique and Waegeman, Willem},
  title = {Using structured pathology data to predict hospital-wide mortality at admission},
  journal = {PLoS One},
  year = {2020},
  volume = {15},
  number = {6},
  pages = {e0235117}
}

@article{lelay2022,
  author = {Le Lay, Jonathan and Martin, Sylvain and Mosnier, Arnaud and others},
  title = {Prediction of hospital readmission of multimorbid patients using machine learning models},
  journal = {PLoS One},
  year = {2022},
  volume = {17},
  number = {12},
  pages = {e0279433}
}

@article{qiao2022,
  author = {Qiao, Edmund M. and Deng, Jie and Kluger, Harvey M. and Yu, James B. and Gross, Cary P.},
  title = {Evaluating High-Dimensional Machine Learning Models to Predict Hospital Mortality Among Older Patients With Cancer},
  journal = {JCO Clinical Cancer Informatics},
  year = {2022},
  volume = {6},
  pages = {e2100186}
}

@article{davis2022,
  author = {Davis, Steve and Borah, Michael F. and Hansen, Justin and Malin, Bradley and Robinson, Jonathan and Turchin, Alexander},
  title = {Effective hospital readmission prediction models using machine-learned features},
  journal = {BMC Health Services Research},
  year = {2022},
  volume = {22},
  number = {1},
  pages = {1415}
}

@article{harerimana2021,
  author = {Harerimana, Gaspard and Kim, Jong Wook and Jang, Beakcheol},
  title = {A deep attention model to forecast the Length Of Stay and the in-hospital mortality right on admission from {ICD} codes and demographic data},
  journal = {Journal of Biomedical Informatics},
  year = {2021},
  volume = {118},
  pages = {103778}
}

@article{matsui2022,
  author = {Matsui, Hiroki and Yasunaga, Hideo and Fushimi, Kiyohide and Homma, Yuichi},
  title = {Development of Deep Learning Models for Predicting In-Hospital Mortality Using an Administrative Claims Database: Retrospective Cohort Study},
  journal = {JMIR Medical Informatics},
  year = {2022},
  volume = {10},
  number = {2},
  pages = {e27936}
}

@article{nguyen2017,
  author = {Nguyen, Phuoc and Tran, Truyen and Wickramasinghe, Nilmini and Venkatesh, Svetha},
  title = {{Deepr}: A Convolutional Net for Medical Records},
  journal = {IEEE Journal of Biomedical and Health Informatics},
  year = {2017},
  volume = {21},
  number = {1},
  pages = {22--30}
}

@misc{ahrq_elixhauser,
  author = {{Agency for Healthcare Research and Quality}},
  title = {Elixhauser Comorbidity Software Refined for {ICD-10-CM}},
  url = {https://www.hcup-us.ahrq.gov/toolssoftware/comorbidityicd10/comorbidity_icd10.jsp}
}

@article{fernando2019,
  author = {Fernando, Dulith T. and Berecki-Gisolf, Janneke and Newstead, Stuart and Ansari, Zul},
  title = {Effect of comorbidity on injury outcomes: a review of existing indices},
  journal = {Annals of Epidemiology},
  year = {2019},
  volume = {36},
  pages = {5--14}
}

@article{quan2005,
  author = {Quan, Hude and Sundararajan, Vijaya and Halfon, Patricia and others},
  title = {Coding algorithms for defining comorbidities in {ICD-9-CM} and {ICD-10} administrative data},
  journal = {Medical Care},
  year = {2005},
  volume = {43},
  number = {11},
  pages = {1130--1139}
}

@article{deyo1992,
  author = {Deyo, Richard A. and Cherkin, Daniel C. and Ciol, Marcia A.},
  title = {Adapting a clinical comorbidity index for use with {ICD-9-CM} administrative databases},
  journal = {Journal of Clinical Epidemiology},
  year = {1992},
  volume = {45},
  number = {6},
  pages = {613--619}
}

@article{quan2011,
  author = {Quan, Hude and Li, Bing and Couris, Chantal M. and others},
  title = {Updating and validating the {Charlson} comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries},
  journal = {American Journal of Epidemiology},
  year = {2011},
  volume = {173},
  number = {6},
  pages = {676--682}
}

@misc{stagg2006,
  author = {Stagg, Vince},
  title = {{CHARLSON}: {Stata} module to calculate {Charlson} index of comorbidity},
  year = {2006},
  publisher = {Boston College Department of Economics}
}

@inproceedings{zaheer2017,
  author = {Zaheer, Manzil and Kottur, Satwik and Ravanbakhsh, Siamak and Poczos, Barnabas and Salakhutdinov, Ruslan and Smola, Alexander},
  title = {Deep Sets},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2017},
  volume = {30}
}

@inproceedings{pozzolo2015,
  author = {Dal Pozzolo, Andrea and Caelen, Olivier and Johnson, Reid A. and Bontempi, Gianluca},
  title = {Calibrating Probability with Undersampling for Unbalanced Classification},
  booktitle = {2015 {IEEE} Symposium Series on Computational Intelligence},
  year = {2015},
  pages = {159--166}
}

@inproceedings{abadi2016,
  author = {Abadi, Mart{\'i}n and Barham, Paul and Chen, Jianmin and Chen, Zhifeng and Davis, Andy and Dean, Jeffrey and Devin, Matthieu and Ghemawat, Sanjay and Irving, Geoffrey and Isard, Michael and Kudlur, Manjunath and Levenberg, Josh and Monga, Rajat and Moore, Sherry and Murray, Derek G. and Steiner, Benoit and Tucker, Paul and Vasudevan, Vijay and Warden, Pete and Wicke, Martin and Yu, Yuan and Zheng, Xiaoqiang},
  title = {{TensorFlow}: A System for Large-Scale Machine Learning},
  booktitle = {Proceedings of the 12th {USENIX} Symposium on Operating Systems Design and Implementation},
  year = {2016},
  pages = {265--283}
}

@article{youden1950,
  author = {Youden, W. J.},
  title = {Index for rating diagnostic tests},
  journal = {Cancer},
  year = {1950},
  volume = {3},
  number = {1},
  pages = {32--35}
}

@article{delong1988,
  author = {DeLong, Elizabeth R. and DeLong, David M. and Clarke-Pearson, Daniel L.},
  title = {Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach},
  journal = {Biometrics},
  year = {1988},
  volume = {44},
  number = {3},
  pages = {837--845}
}

@inproceedings{sundararajan2017,
  author = {Sundararajan, Mukund and Taly, Ankur and Yan, Qiqi},
  title = {Axiomatic Attribution for Deep Networks},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  year = {2017},
  pages = {3319--3328}
}

@article{placido2023,
  author = {Placido, Davide and Yuan, Bo and Hjaltelin, Jessica X. and Zheng, Chunlei and Haue, Amalie D. and Chmura, Piotr J. and Yuan, Chen and Kim, Jihye and Umeton, Renato and Antell, Gregory and Chowdhury, Alexander and Franz, Alexandra and Brais, Lauren and Andrews, Elizabeth and Marks, Debora S. and Regev, Aviv and Ayandeh, Siamack and Brophy, Mary T. and Do, Nhan V. and Kraft, Peter and Wolpin, Brian M. and Rosenthal, Michael H. and Fillmore, Nathanael R. and Brunak, S{\o}ren and Sander, Chris},
  title = {A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories},
  journal = {Nature Medicine},
  year = {2023},
  volume = {29},
  number = {5},
  pages = {1113--1122}
}

@article{davis2020,
  author = {Davis, Sharon E. and Greevy, Robert A. and Fonnesbeck, Christopher and Lasko, Thomas A. and Walsh, Colin G. and Matheny, Michael E.},
  title = {Detection of calibration drift in clinical prediction models to inform model updating},
  journal = {Journal of Biomedical Informatics},
  year = {2020},
  volume = {112},
  pages = {103611}
}

@article{collins2024,
  author = {Collins, Gary S. and Dhiman, Paula and Ma, Jie and Schlussel, Michael M. and Archer, Lucinda and Van Calster, Ben and Harrell, Frank E. and Martin, Glen P. and Moons, Karel G. M. and van Smeden, Maarten and Sperrin, Matthew and Bullock, Garrett S. and Riley, Richard D.},
  title = {Evaluation of clinical prediction models (part 1): from development to external validation},
  journal = {BMJ},
  year = {2024},
  volume = {384},
  pages = {e074819}
}

@article{morgan2019,
  author = {Morgan, Daniel J. and Bame, Bill and Zimand, Paul and Dooley, Patricia and Thom, Kerri A. and Harris, Anthony D. and Bentzen, S{\o}ren and Ettinger, Walter and Garrett-Ray, Stacy D. and Tracy, J. Kathleen and Liang, Yuanyuan},
  title = {Assessment of Machine Learning vs Standard Prediction Rules for Predicting Hospital Readmissions},
  journal = {JAMA Network Open},
  year = {2019},
  volume = {2},
  number = {3},
  pages = {e190348}
}

@article{beam2018,
  author = {Beam, Andrew L. and Kohane, Isaac S.},
  title = {Big Data and Machine Learning in Health Care},
  journal = {JAMA},
  year = {2018},
  volume = {319},
  number = {13},
  pages = {1317--1318}
}

@article{rajkomar2018,
  author = {Rajkomar, Alvin and Oren, Eyal and Chen, Kai and Dai, Andrew M. and Hajaj, Nissan and Hardt, Michaela and Liu, Peter J. and Liu, Xiaobing and Marcus, Jake and Sun, Mimi and Sundberg, Patrik and Yee, Hector and Zhang, Kun and Zhang, Yi and Flores, Gerardo and Duggan, Gavin E. and Irvine, Jamie and Le, Quoc and Litsch, Kurt and Mossin, Alexander and Tansuwan, Justin and Wang, De and Wexler, James and Wilson, Jimbo and Ludwig, Dana and Volchenboum, Samuel L. and Chou, Katherine and Pearson, Michael and Madabushi, Srinivasan and Shah, Nigam H. and Butte, Atul J. and Howell, Michael D. and Cui, Claire and Corrado, Greg S. and Dean, Jeffrey},
  title = {Scalable and accurate deep learning with electronic health records},
  journal = {npj Digital Medicine},
  year = {2018},
  volume = {1},
  pages = {18}
}

@article{rudin2019,
  author = {Rudin, Cynthia},
  title = {Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead},
  journal = {Nature Machine Intelligence},
  year = {2019},
  volume = {1},
  number = {5},
  pages = {206--215}
}

@article{hansen2011,
  author = {Hansen, Luke O. and Young, Robert S. and Hinami, Keiki and Leung, Albert and Williams, Mark V.},
  title = {Interventions to reduce 30-day rehospitalization: a systematic review},
  journal = {Annals of Internal Medicine},
  year = {2011},
  volume = {155},
  number = {8},
  pages = {520--528}
}

@article{coleman2006,
  author = {Coleman, Eric A. and Parry, Carla and Chalmers, Sandra and Min, Sung-joon},
  title = {The care transitions intervention: results of a randomized controlled trial},
  journal = {Archives of Internal Medicine},
  year = {2006},
  volume = {166},
  number = {17},
  pages = {1822--1828}
}

@article{joynt2013,
  author = {Joynt, Karen E. and Jha, Ashish K.},
  title = {Characteristics of hospitals receiving penalties under the {Hospital Readmissions Reduction Program}},
  journal = {JAMA},
  year = {2013},
  volume = {309},
  number = {4},
  pages = {342--343}
}

@article{desai2016,
  author = {Desai, Nihar R. and Ross, Joseph S. and Kwon, Jin Yul and others},
  title = {Association Between Hospital Penalty Status Under the {Hospital Readmission Reduction Program} and Readmission Rates for Target and Nontarget Conditions},
  journal = {JAMA},
  year = {2016},
  volume = {316},
  number = {24},
  pages = {2647--2656}
}

@article{zuckerman2017,
  author = {Zuckerman, Rachael B. and Sheingold, Steven H. and Orav, E. John and Ruhter, Jordan and Epstein, Arnold M.},
  title = {Effect of a Hospital-wide Measure on the Readmissions Reduction Program},
  journal = {New England Journal of Medicine},
  year = {2017},
  volume = {377},
  number = {16},
  pages = {1551--1558}
}

@article{obermeyer2019,
  author = {Obermeyer, Ziad and Powers, Brian and Vogeli, Christine and Mullainathan, Sendhil},
  title = {Dissecting racial bias in an algorithm used to manage the health of populations},
  journal = {Science},
  year = {2019},
  volume = {366},
  number = {6464},
  pages = {447--453}
}

@article{kind2014,
  author = {Kind, Amy J. H. and Jencks, Stephen and Brock, Jim and others},
  title = {Neighborhood socioeconomic disadvantage and 30-day rehospitalization: a retrospective cohort study},
  journal = {Annals of Internal Medicine},
  year = {2014},
  volume = {161},
  number = {11},
  pages = {765--774}
}

@article{joynt2011,
  author = {Joynt, Karen E. and Orav, E. John and Jha, Ashish K.},
  title = {Thirty-day readmission rates for {Medicare} beneficiaries by race and site of care},
  journal = {JAMA},
  year = {2011},
  volume = {305},
  number = {7},
  pages = {675--681}
}

@article{collins2015,
  author = {Collins, Gary S. and Reitsma, Johannes B. and Altman, Douglas G. and Moons, Karel G. M.},
  title = {Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis ({TRIPOD}): the {TRIPOD} statement},
  journal = {Annals of Internal Medicine},
  year = {2015},
  volume = {162},
  number = {1},
  pages = {55--63}
}