keep · mimic-iv reproduction · story 2

The OMOP Disease graph, and the seven orphans we found.

A sixty-eight-thousand-node SNOMED Condition hierarchy rooted at concept 4274025 — the foundation of our reproduction of KEEP on MIMIC-IV. Built from Athena vocabularies in roughly nine seconds with three DuckDB queries; immediately broken in seven small but instructive ways.

Paper: Elhussein et al., CHIL 2025
Built by: build_omop_graph.py
i · anatomy

Counting things.

A standard SNOMED Condition is in the graph if it descends from Disease within five hierarchical hops in the OMOP CONCEPT_ANCESTOR table. Of the roughly 100k standard Conditions in Athena, sixty-eight thousand pass that filter.
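The membership rule above can be sketched as a single join between CONCEPT and CONCEPT_ANCESTOR. This is a minimal sketch, not the query from build_omop_graph.py: it uses Python's built-in sqlite3 as a stand-in for DuckDB so that it runs without extra dependencies, the concept ids 1 to 4 are invented toy rows, and the names NODE_QUERY and node_ids are illustrative. Only the column names follow the OMOP CDM.

```python
import sqlite3

DISEASE_ROOT = 4274025  # OMOP concept_id for Disease, as given in the text
MAX_HOPS = 5            # hop limit from the KEEP paper

con = sqlite3.connect(":memory:")  # stand-in for DuckDB (assumption)
con.executescript("""
CREATE TABLE concept (
    concept_id INTEGER, vocabulary_id TEXT,
    domain_id TEXT, standard_concept TEXT);
CREATE TABLE concept_ancestor (
    ancestor_concept_id INTEGER, descendant_concept_id INTEGER,
    min_levels_of_separation INTEGER);
""")

# Toy vocabulary rows; ids 1..4 are made up for illustration.
con.executemany("INSERT INTO concept VALUES (?, ?, ?, ?)", [
    (DISEASE_ROOT, "SNOMED", "Condition", "S"),
    (1, "SNOMED", "Condition", "S"),    # kept: 1 hop from Disease
    (2, "SNOMED", "Condition", "S"),    # dropped: 6 hops away
    (3, "SNOMED", "Observation", "S"),  # dropped: wrong domain
    (4, "SNOMED", "Condition", None),   # dropped: non-standard
])
con.executemany("INSERT INTO concept_ancestor VALUES (?, ?, ?)", [
    (DISEASE_ROOT, DISEASE_ROOT, 0),    # self-row keeps the root in the set
    (DISEASE_ROOT, 1, 1),
    (DISEASE_ROOT, 2, 6),
    (DISEASE_ROOT, 3, 2),
    (DISEASE_ROOT, 4, 1),
])

NODE_QUERY = """
SELECT c.concept_id
FROM concept AS c
JOIN concept_ancestor AS ca
  ON ca.descendant_concept_id = c.concept_id
WHERE ca.ancestor_concept_id = ?
  AND ca.min_levels_of_separation <= ?
  AND c.vocabulary_id = 'SNOMED'
  AND c.domain_id = 'Condition'
  AND c.standard_concept = 'S'
ORDER BY c.concept_id
"""

node_ids = [row[0] for row in con.execute(NODE_QUERY, (DISEASE_ROOT, MAX_HOPS))]
print(node_ids)
```

On the toy rows, only the root and concept 1 survive all four conditions.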

Nodes: standard SNOMED conditions
Edges: parent → child is-a
Leaves: ≈ 61% of the graph
Multi-parent: SNOMED is heavily polyhierarchical

Depth distribution

BFS from the root Disease. The graph fans out viciously fast (by depth 3 we are already past sixteen thousand concepts), then contracts as the hierarchy exhausts itself, with a long tail of nodes pushed past the paper's nominal max depth of 5 by the orphan rescue.
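A depth distribution of this kind can be computed with a plain breadth-first search. The sketch below assumes the graph is held as a dict mapping each parent to a list of its children; the function name bfs_depths and the toy graph are illustrative, not taken from the build script.

```python
from collections import Counter, deque


def bfs_depths(children, root):
    """Return the shortest hop count from root to every reachable node."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            # In a polyhierarchy a node may be reached several times;
            # BFS guarantees the first visit is along a shortest path.
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


# Toy polyhierarchy: D has two parents, A and B.
toy_children = {
    "Disease": ["A", "B"],
    "A": ["C", "D"],
    "B": ["D"],
    "D": ["E"],
}
toy_depths = bfs_depths(toy_children, "Disease")
histogram = Counter(toy_depths.values())  # depth -> number of nodes
print(sorted(histogram.items()))
```

Nodes that never appear in the returned dict are exactly the ones unreachable from the root.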

Why are there nodes past depth 5? The KEEP paper (§A.1.1) limits the graph to min_levels_of_separation ≤ 5 in CONCEPT_ANCESTOR, which is computed against the full SNOMED hierarchy. Once the graph is restricted to standard Conditions, the short path that satisfied that limit may run through a filtered-out concept, so some nodes can reach the root only along longer paths that stay inside the Condition subgraph. Their effective BFS depth in our subgraph is 6, 7, even 8. All twenty-nine of those are downstream of the orphan-rescue patch documented below.
ii · a walk through

Two hundred vertices, chosen by breadth.

A force-directed sample, taken by walking out from the root in BFS order and capping each node at eight outgoing edges so the deeper subtrees do not flood the layout. The full graph has 68,396 nodes; this is what 220 of them and the edges between them feel like.
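The sampling rule can be sketched as a budgeted BFS. This is a hedged illustration: the text does not say which eight children are kept when a node has more, so taking the first eight in sorted order is an assumption, and the name bfs_sample and its parameters are hypothetical.

```python
from collections import deque


def bfs_sample(children, root, max_nodes=220, max_out=8):
    """Walk out from root in BFS order, keeping at most max_out edges per node."""
    kept = [root]
    seen = {root}
    edges = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Assumption: the cap keeps the first max_out children in sorted order.
        for child in sorted(children.get(node, ()))[:max_out]:
            if child not in seen:
                if len(kept) >= max_nodes:
                    continue  # node budget spent: admit no new nodes
                seen.add(child)
                kept.append(child)
                queue.append(child)
            edges.append((node, child))  # both endpoints are in the sample
    return kept, edges


# Toy graph: a root with ten children, each with one child of its own.
toy = {0: list(range(1, 11))}
toy.update({i: [100 + i] for i in range(1, 11)})

sample_nodes, sample_edges = bfs_sample(toy, 0, max_nodes=6, max_out=3)
print(sample_nodes)
print(sample_edges)
```

Because the queue is first-in, first-out, shallow nodes fill the budget before deep ones, which is what keeps the deeper subtrees from flooding the layout.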

Interactive figure: a force-directed slice of the hierarchy. The legend encodes depth from Disease: 0 (root), 1 through 5, and 6+ (rescued).
iii · the bug

Forty-two nodes, cut adrift.

The naive recipe sounds airtight: the node set is the standard SNOMED Conditions descended from 4274025 within depth ≤ 5, and the edges are the direct parent–child pairs from the same CONCEPT_ANCESTOR table. It is not. It quietly leaves forty-two nodes unreachable from the root.

The KEEP paper (§A.1.1) does not mention this. The G2Lab reference implementation skips the issue by loading a pre-built ukbb_omop_tree_filtered.pickle instead of constructing the graph at all. We hit it on the very first run of build_omop_graph.py.

7 Primary orphans — standard SNOMED Conditions whose direct parent in the full hierarchy is not in our node set.
35 Downstream casualties — concepts that became unreachable as a side effect of their ancestor losing its only edge in.
42 Total unreachable from the root before the rescue patch is applied. The graph is still a DAG, just no longer a connected one.
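The split between primary orphans and downstream casualties can be sketched as a reachability check. The sketch assumes the edge list only contains pairs whose endpoints are both in the node set, so a primary orphan shows up as an unreachable node with no incoming edge at all. The function names and the toy graph are illustrative.

```python
def reachable_from(children, root):
    """Return the set of nodes reachable from root by following child edges."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def split_unreachable(nodes, edges, root):
    """Split unreachable nodes into primary orphans and downstream casualties."""
    children = {}
    has_parent = set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        has_parent.add(child)
    unreachable = set(nodes) - reachable_from(children, root)
    # Primary orphan: every direct parent was filtered out, so no edge comes in.
    primary = {n for n in unreachable if n not in has_parent}
    # Downstream casualty: it has a parent, but only an unreachable one.
    downstream = unreachable - primary
    return primary, downstream


toy_nodes = {"root", "a", "b", "orphan", "kid1", "kid2"}
toy_edges = [("root", "a"), ("a", "b"), ("orphan", "kid1"), ("kid1", "kid2")]
primary, downstream = split_unreachable(toy_nodes, toy_edges, "root")
print(sorted(primary), sorted(downstream))
```

On the real graph the two sets would be expected to have sizes 7 and 35, matching the counts above.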

Two reasons a parent gets filtered out

Every primary orphan we found is missing its incoming edge for one of two reasons:

"My direct parent is a Condition — just not one of Disease's descendants."

The first family of orphans has direct parents that are perfectly valid standard SNOMED Conditions, but they sit outside the Disease subtree, elsewhere under Clinical finding (441840). Because they are not descendants of Disease, the depth filter excludes them.

"My direct parent isn't a Condition at all — it's an Observation."

The second family has direct parents in non-Condition domains entirely — almost always Observation, capturing things like "Finding relating to drug misuse behavior". The domain filter excludes them, severing the only edge in.

In one case — Scarring alopecia — both happen at once.
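The two reasons can be told apart mechanically once each filtered-out parent's domain is known. The sketch below is an illustration under stated assumptions: the helper names, the reason labels, and the ids 101 and 102 are all hypothetical, and within_disease stands for the set of concept ids that pass the Disease depth filter.

```python
def why_filtered(parent, domain_of, within_disease):
    """Name the filter that removed a direct parent from the node set."""
    if domain_of[parent] != "Condition":
        return "non-Condition parent"
    if parent not in within_disease:
        return "Condition outside the Disease filter"
    return "kept"


def orphan_reasons(filtered_parents, domain_of, within_disease):
    """Collect the distinct reasons across each orphan's direct parents."""
    return {
        orphan: sorted({why_filtered(p, domain_of, within_disease) for p in parents})
        for orphan, parents in filtered_parents.items()
    }


# Hypothetical ids: 101 is a Condition outside the filter, 102 an Observation.
domain_of = {101: "Condition", 102: "Observation"}
within_disease = set()
filtered_parents = {
    "orphan_a": [101],
    "orphan_b": [102],
    "orphan_c": [101, 102],  # both reasons at once
}
reasons = orphan_reasons(filtered_parents, domain_of, within_disease)
print(reasons)
```

An orphan with more than one filtered-out parent can collect both labels, which is the situation described for Scarring alopecia.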

All seven, in detail

Each card shows the orphan, the direct parent(s) the build query filtered out (with the reason), and the rescue parent that build_omop_graph.py step 3a wired in to keep the graph connected. Cards are grouped by category.

How the rescue works

For each primary orphan we query CONCEPT_ANCESTOR for its closest in-node-set ancestor — smallest min_levels_of_separation > 0, ties broken deterministically by smaller ancestor_concept_id — and add a single edge from that ancestor to the orphan. All seven were rescued at depth 2 or 3; none had to fall back to the root. The total edge count goes from 152,340 to 152,347, the graph becomes connected again, and re-running the script produces byte-identical pickles because both the queries and the tie-break are deterministic.
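The selection rule described above can be sketched in a few lines. This is not the code of step 3a: it assumes the CONCEPT_ANCESTOR rows have already been loaded as (ancestor_concept_id, descendant_concept_id, min_levels_of_separation) tuples, and the name rescue_parent and the small ids are invented for illustration.

```python
DISEASE_ROOT = 4274025


def rescue_parent(orphan, ancestor_rows, node_set):
    """Pick the closest in-node-set ancestor, ties broken by smaller id."""
    candidates = [
        (sep, ancestor)
        for ancestor, descendant, sep in ancestor_rows
        if descendant == orphan and sep > 0 and ancestor in node_set
    ]
    if not candidates:
        return None
    # Tuples compare element by element: separation first, then concept id.
    return min(candidates)[1]


# Toy rows; ids 7, 10, 55 and 99 are made up.
ancestor_rows = [
    (99, 99, 0),            # self-row, excluded by sep > 0
    (55, 99, 1),            # direct parent, but not in the node set
    (10, 99, 2),
    (7, 99, 2),             # same distance as 10, smaller id wins
    (DISEASE_ROOT, 99, 3),  # the root is only a fallback
]
node_set = {DISEASE_ROOT, 7, 10, 99}
orphans = [99]

new_edges = sorted((rescue_parent(o, ancestor_rows, node_set), o) for o in orphans)
print(new_edges)
```

Sorting on the (separation, id) pair makes the choice independent of row order, which is the property the byte-identical rebuilds rely on.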