| Title: | Multisource Graph Synthesis with EHR Data |
|---|---|
| Description: | We develop Multi-source Graph Synthesis (MUGS), an algorithm designed to create embeddings for pediatric Electronic Health Record (EHR) codes by leveraging graphical information from three distinct sources: (1) pediatric EHR data, (2) EHR data from the general patient population, and (3) existing hierarchical medical ontology knowledge shared across different patient populations. See Li et al. (2024) <doi:10.1038/s41746-024-01320-4> for details. |
| Authors: | Mengyan Li [cre, aut], Thomas Charlon [ctb] (ORCID: 0000-0001-7497-0470), Xiaoou Li [aut], Tianxi Cai [aut], PARSE LTD [aut], CELEHS Team [aut] |
| Maintainer: | Mengyan Li <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-05-15 06:30:52 UTC |
| Source: | https://github.com/celehs/mugs |
This function estimates code effects using left and right embeddings from source and target sites.
CodeEff_Matrix(S.1, S.2, n1, n2, U.1, U.2, V.1, V.2, common_codes, zeta.int, lambda, p)
S.1 |
SPPMI from the source site. |
S.2 |
SPPMI from the target site. |
n1 |
The number of codes from the source site. |
n2 |
The number of codes from the target site. |
U.1 |
The left embeddings (left singular vectors times the square root of the singular values) from the source site. |
U.2 |
The left embeddings (left singular vectors times the square root of the singular values) from the target site. |
V.1 |
The right embeddings (right singular vectors times the square root of the singular values) from the source site. |
V.2 |
The right embeddings (right singular vectors times the square root of the singular values) from the target site. |
common_codes |
The list of overlapping codes. |
zeta.int |
The initial estimator for the code effects. |
lambda |
The tuning parameter that controls the intensity of penalization on the code effects. |
p |
The length of an embedding. |
A list with the following elements:
zeta |
The estimated code effects. |
dif_F |
The Frobenius norm difference between the updated and initial estimators. |
V.1.new |
Updated right embeddings for the source site. |
V.2.new |
Updated right embeddings for the target site. |
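The quantities named above can be illustrated with base R. The following is a minimal sketch, not the package's implementation: it builds embeddings as singular vectors times the square root of the singular values, and computes a Frobenius norm difference of the kind reported in dif_F. The toy sizes, the random matrix standing in for an SPPMI matrix, and the made-up update of 0.1 are all assumptions for illustration.

```r
# Toy sizes (assumptions for illustration only)
set.seed(1)
n <- 6
p <- 3

# A symmetric toy matrix standing in for an SPPMI matrix
A <- matrix(rnorm(n * n), n, n)
S <- (A + t(A)) / 2

# Embeddings: singular vectors scaled by the square root of the singular values
sv <- svd(S)
U <- sv$u[, 1:p] %*% diag(sqrt(sv$d[1:p]))
V <- sv$v[, 1:p] %*% diag(sqrt(sv$d[1:p]))

# U %*% t(V) is the best rank-p approximation of S
S_hat <- U %*% t(V)

# Frobenius norm of the change between an initial and an updated estimator
zeta_int <- matrix(0, n, p)   # hypothetical initial estimator
zeta_new <- zeta_int + 0.1    # hypothetical updated estimator
dif_F <- norm(zeta_new - zeta_int, type = "F")
```

A small dif_F indicates that an update changed the estimator very little, which is the usual basis for a convergence check.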
Function Used To Estimate Code-Site Effects In Parallel
CodeSiteEff_l2_par(S.1, S.2, n1, n2, U.1, U.2, V.1, V.2, delta.int, lambda.delta, p, common_codes, n.common, n.core)
S.1 |
SPPMI from the source site |
S.2 |
SPPMI from the target site |
n1 |
the number of codes from the source site |
n2 |
the number of codes from the target site |
U.1 |
the left embeddings (left singular vectors times the square root of the singular values) from the source site |
U.2 |
the left embeddings (left singular vectors times the square root of the singular values) from the target site |
V.1 |
the right embeddings (right singular vectors times the square root of the singular values) from the source site |
V.2 |
the right embeddings (right singular vectors times the square root of the singular values) from the target site |
delta.int |
the initial estimator for the code-site effect |
lambda.delta |
the tuning parameter that controls the intensity of penalization on the code-site effects |
p |
the length of an embedding |
common_codes |
the list of overlapping codes |
n.common |
the number of overlapping codes |
n.core |
the number of cores used for parallel computation |
The output for the estimation of code-site effects
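This reference does not state the exact form of the penalty. Assuming, as the `_l2` suffix suggests, an unsquared l2 (group-lasso-style) penalty on each code-site effect vector, the effect of lambda.delta can be sketched with the standard group soft-thresholding operator. The function name `group_soft_threshold` is hypothetical and is not part of the package.

```r
# Group soft-thresholding: the proximal operator of lambda * ||x||_2.
# ASSUMPTION: this illustrates an unsquared l2 penalty in general; it is not
# taken from the package source.
group_soft_threshold <- function(x, lambda) {
  nrm <- sqrt(sum(x^2))
  if (nrm <= lambda) {
    return(rep(0, length(x)))   # small effects are set exactly to zero
  }
  (1 - lambda / nrm) * x        # larger effects are shrunk toward zero
}

shrunk <- group_soft_threshold(c(3, 4), lambda = 1)      # norm 5 becomes norm 4
zeroed <- group_soft_threshold(c(0.3, 0.4), lambda = 1)  # norm 0.5 is below lambda
```

Under this reading, a larger lambda.delta sets more code-site effect vectors to zero, so fewer codes are treated as behaving differently across sites.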
Function used to generate input data (used only for simulations)
Generate SPPMIs, dummy matrices based on prior group structures, and code-code pairs for tuning and evaluation.
DataGen_rare_group(seed = NULL, p, n1, n2, n.common, n.group, sigma.eps.1, sigma.eps.2, ratio.delta, network.k, rho.beta, rho.U0, rho.delta, sigma.rare, n.rare, group.size)
seed |
for reproducibility |
p |
the length of an embedding |
n1 |
the number of codes in site 1 |
n2 |
the number of codes in site 2 |
n.common |
the number of overlapping codes |
n.group |
the number of groups |
sigma.eps.1 |
the sd of error in site 1 |
sigma.eps.2 |
the sd of error in site 2 |
ratio.delta |
the proportion of codes in each site that have site-specific effects applied to them |
network.k |
the number of distinct blocks within each site for which unique inter-code correlations are modeled |
rho.beta |
AR parameter for the group effects covariance matrix |
rho.U0 |
AR parameter for the code effects covariance matrix |
rho.delta |
AR parameter for the code-site effects covariance matrix |
sigma.rare |
the sd of error for rare codes (usually larger than sigma.eps.1 and sigma.eps.2) |
n.rare |
The number of rare codes |
group.size |
the size of each group |
Returns input data, SPPMIs, dummy matrices based on prior group structures and code-code pairs for tuning and evaluation
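The arguments rho.beta, rho.U0, and rho.delta are described as AR parameters of covariance matrices. A minimal sketch of one common reading, an AR(1) covariance with entry (i, j) equal to rho^|i - j|, is given below. Whether the simulation uses exactly this structure is an assumption, and the helper name `ar1_cov` is hypothetical.

```r
# AR(1) covariance matrix: entry (i, j) is rho^|i - j|
# ASSUMPTION: illustrative reading of "AR parameter", not the package's code.
ar1_cov <- function(p, rho) {
  rho^abs(outer(seq_len(p), seq_len(p), "-"))
}

Sigma <- ar1_cov(4, 0.5)

# Draw 5 vectors of length 4 with covariance Sigma via the Cholesky factor
set.seed(42)
Z <- matrix(rnorm(5 * 4), 5, 4)
X <- Z %*% chol(Sigma)   # chol() returns R with t(R) %*% R == Sigma
```

Values of rho close to 1 make neighbouring embedding coordinates strongly correlated, while rho = 0 gives independent coordinates.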
Download and Load Example Data from Zenodo
download_example_data(file, destdir = tempdir())
file |
Name of the .Rdata file to download (e.g., "S.1.Rdata"). |
destdir |
Directory to store the downloaded data. Defaults to a temporary directory. |
A list containing the loaded dataset.
Function Used For Tuning And Evaluation
evaluation.sim(pairs.rel, U, seed = NULL)
pairs.rel |
the known code-code pairs |
U |
the code embedding matrix |
seed |
Optional integer for reproducibility of sampling. |
The output of tuning and evaluation
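This reference does not say which metric evaluation.sim computes. A common way to score known code-code pairs against an embedding matrix is cosine similarity between the two rows of each pair. The sketch below shows that computation only; the helper `cosine_pairs` and the toy data are hypothetical.

```r
# Cosine similarity between the embedding rows of each code pair.
# ASSUMPTION: illustrative only; the metric used by evaluation.sim is not
# documented in this reference.
cosine_pairs <- function(U, pairs) {
  a <- U[pairs[, 1], , drop = FALSE]
  b <- U[pairs[, 2], , drop = FALSE]
  rowSums(a * b) / (sqrt(rowSums(a^2)) * sqrt(rowSums(b^2)))
}

U <- rbind(c(1, 0), c(2, 0), c(0, 3))   # three toy codes, embedding length 2
pairs <- rbind(c(1, 2), c(1, 3))        # row indices of the paired codes
sims <- cosine_pairs(U, pairs)
```

Related pairs are expected to score higher than unrelated ones; here codes 1 and 2 point in the same direction and codes 1 and 3 are orthogonal.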
Function For Getting Embedding From SVD
get_embed(mysvd, d = 2000, normalize = TRUE)
mysvd |
the (managed) SVD result, i.e. the SVD output with an added 'names' element |
d |
the dimension of the final embedding |
normalize |
whether the output embeddings are scaled to have l2 norm equal to 1 |
The embedding from SVD
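A minimal sketch of how an embedding can be derived from an SVD result is given below. It assumes the convention used elsewhere in this reference (left singular vectors times the square root of the singular values) and row-wise l2 normalization; the helper `embed_from_svd` is hypothetical and is not the package's get_embed.

```r
# Build a d-dimensional embedding from an svd() result.
# ASSUMPTION: left singular vectors scaled by sqrt(singular values), with
# optional row-wise l2 normalization; not the package's own code.
embed_from_svd <- function(sv, d, normalize = TRUE) {
  d <- min(d, length(sv$d))   # cannot keep more dimensions than exist
  E <- sv$u[, seq_len(d), drop = FALSE] %*%
    diag(sqrt(sv$d[seq_len(d)]), nrow = d)
  if (normalize) {
    E <- E / sqrt(rowSums(E^2))   # each row gets unit l2 norm
  }
  E
}

set.seed(7)
M <- matrix(rnorm(8 * 8), 8, 8)
E <- embed_from_svd(svd(M), d = 3)
```

With normalize = TRUE, the inner product of two rows equals their cosine similarity. A row that is entirely zero would produce NaN in this sketch and would need separate handling.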
Function Used To Estimate Group Effects In Parallel
GroupEff_par(S.MGB, S.BCH, n.MGB, n.BCH, U.MGB, U.BCH, V.MGB, V.BCH, X.MGB.group, X.BCH.group, n.group, name.list, beta.int, lambda = 0, p, n.core)
S.MGB |
SPPMI from the source site |
S.BCH |
SPPMI from the target site |
n.MGB |
the number of codes from the source site |
n.BCH |
the number of codes from the target site |
U.MGB |
the left embeddings (left singular vectors times the square root of the singular values) from the source site |
U.BCH |
the left embeddings (left singular vectors times the square root of the singular values) from the target site |
V.MGB |
the right embeddings (right singular vectors times the square root of the singular values) from the source site |
V.BCH |
the right embeddings (right singular vectors times the square root of the singular values) from the target site |
X.MGB.group |
the dummy matrix based on prior group structures at the source site |
X.BCH.group |
the dummy matrix based on prior group structures at the target site |
n.group |
the number of groups |
name.list |
the full list of code names from the source site and the target site with repeated names of overlapping codes |
beta.int |
the initial estimator for the group effects |
lambda |
the tuning parameter that controls the intensity of penalization on the group effects; by default it is set to 0 |
p |
the length of an embedding |
n.core |
the number of cores used for parallel computation |
The output of estimating group effects in parallel
Main function for MUGS algorithm
MUGS(TUNE = FALSE, Eva = TRUE, Lambda = c(10), Lambda.delta = c(1000), n.core = 4, tol = 1, seed = NULL, S.1 = NULL, S.2 = NULL, X.group.source = NULL, X.group.target = NULL, pairs.rel.CV = NULL, pairs.rel.EV = NULL, p = 100, n.group = 400, outdir = NULL)
TUNE |
Logical value indicating whether the function should tune parameters (TRUE) or use predefined parameters (FALSE). |
Eva |
Logical value indicating whether to perform evaluation (TRUE) or skip it (FALSE). |
Lambda |
The candidate values for the tuning parameter controlling the intensity of penalization on the code effects. |
Lambda.delta |
The candidate values for the tuning parameter controlling the intensity of penalization on the code-site effects. |
n.core |
Integer specifying the number of cores to use for parallel processing. |
tol |
Numeric value representing the tolerance level for convergence in the algorithm. |
seed |
Integer used to set the seed for random number generation, ensuring reproducibility. Set to NULL to disable. |
S.1 |
The SPPMI matrix from site 1. |
S.2 |
The SPPMI matrix from site 2. |
X.group.source |
The dummy matrix representing the group structure of codes at site 1. |
X.group.target |
The dummy matrix representing the group structure of codes at site 2. |
pairs.rel.CV |
Code-code pairs used for tuning via cross-validation. |
pairs.rel.EV |
Code-code pairs used for evaluation. |
p |
Integer indicating the length of embeddings. |
n.group |
The number of groups. |
outdir |
Optional directory to write output files. Defaults to a temporary directory. |
A list or saved files containing the embedding matrices, similarity matrices, and site-heterogeneous code analysis.
A data frame containing cross-validation pairs for relative comparisons.
pairs.rel.CV
A data frame with multiple columns:
Integer representing the column index of a pair.
Integer representing the row index of a pair.
Character string indicating the type of data (e.g., "train", "test").
A data frame containing evaluation pairs for relative comparisons.
pairs.rel.EV
A data frame with multiple columns:
Integer representing the column index of a pair.
Integer representing the row index of a pair.
Character string indicating the type of data (e.g., "validation").
A matrix containing SPPMI data from the source site. This dataset is used as input for analysis in the package.
S.1
A matrix with 2000 rows and 10 columns:
Unique identifiers for each row.
Numeric values representing SPPMI data.
A matrix containing SPPMI data from the target site. This dataset is used as input for analysis in the package.
S.2
A matrix with 2000 rows and 10 columns:
Unique identifiers for each row.
Numeric values representing SPPMI data.
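For readers unfamiliar with the input format of S.1 and S.2, the sketch below computes a shifted positive pointwise mutual information (SPPMI) matrix from a toy co-occurrence count matrix using the standard definition max(PMI - log(k), 0). The helper `sppmi`, the shift `k`, and the toy counts are assumptions for illustration; this reference does not describe how the bundled matrices were produced.

```r
# SPPMI from a co-occurrence count matrix C:
#   PMI(i, j) = log( C[i, j] * sum(C) / (rowSums(C)[i] * colSums(C)[j]) )
#   SPPMI     = max(PMI - log(k), 0)
# ASSUMPTION: standard textbook definition, not the package's preprocessing.
sppmi <- function(C, k = 1) {
  total <- sum(C)
  pmi <- log(C * total / outer(rowSums(C), colSums(C)))
  pmi[!is.finite(pmi)] <- 0    # zero counts give log(0) = -Inf; treat as 0
  pmax(pmi - log(k), 0)        # shift by log(k), then clip at zero
}

C <- matrix(c(10, 0, 0, 10), 2, 2)   # two codes that never co-occur
S_toy <- sppmi(C)
```

Entries are non-negative by construction, and pairs that co-occur no more often than expected by chance receive a value of 0.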
A matrix containing left embeddings from the source site. These embeddings are used for embedding-based computations.
U.1
A matrix with 2000 rows and 10 columns:
Unique identifiers for each row.
Numeric values representing embeddings.
A matrix containing left embeddings from the target site. These embeddings are used for embedding-based computations.
U.2
A matrix with 2000 rows and 10 columns:
Unique identifiers for each row.
Numeric values representing embeddings.
A matrix containing group structures at the source site. It represents binary group membership of entities at the source.
X.group.source
A matrix with 2000 rows and 50 columns:
Entities at the source site.
Binary values (0 or 1) indicating group membership.
A matrix containing group structures at the target site. It represents binary group membership of entities at the target.
X.group.target
A matrix with 2000 rows and 50 columns:
Entities at the target site.
Binary values (0 or 1) indicating group membership.
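A binary group-membership (dummy) matrix of the kind described for X.group.source and X.group.target can be built from a vector of group labels. The sketch below is illustrative only; the code names and group labels are made up.

```r
# Hypothetical codes and their prior group labels
codes  <- c("code_a", "code_b", "code_c", "code_d")
groups <- c("g1", "g2", "g1", "g3")

# One row per code, one column per group; entry is 1 if the code is in the group
group_levels <- sort(unique(groups))
X_group <- outer(groups, group_levels, "==") * 1
dimnames(X_group) <- list(codes, group_levels)
```

Each row contains a single 1 because every code belongs to exactly one group in this toy example.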