Title: | High-Throughput Phenotyping with EHR using a Common Automated Pipeline |
---|---|
Description: | Implement surrogate-assisted feature extraction (SAFE) and common machine learning approaches to train and validate phenotyping models. Background and details about the methods can be found at Zhang et al. (2019) <doi:10.1038/s41596-019-0227-6>, Yu et al. (2017) <doi:10.1093/jamia/ocw135>, and Liao et al. (2015) <doi:10.1136/bmj.h1885>. |
Authors: | Yichi Zhang [aut], Chuan Hong [aut], Tianxi Cai [aut], PARSE LTD [aut, cre] |
Maintainer: | PARSE LTD <[email protected]> |
License: | GPL-3 |
Version: | 1.2.2 |
Built: | 2025-01-03 05:00:55 UTC |
Source: | https://github.com/celehs/phecap |
The DESCRIPTION file:
Package: | PheCAP |
Type: | Package |
Title: | High-Throughput Phenotyping with EHR using a Common Automated Pipeline |
Version: | 1.2.2 |
Authors@R: | c( person("Yichi", "Zhang", role = "aut"), person("Chuan", "Hong", role = "aut"), person("Tianxi", "Cai", role = "aut"), person(family = "PARSE LTD", role = c("aut", "cre"), email = "[email protected]") ) |
Description: | Implement surrogate-assisted feature extraction (SAFE) and common machine learning approaches to train and validate phenotyping models. Background and details about the methods can be found at Zhang et al. (2019) <doi:10.1038/s41596-019-0227-6>, Yu et al. (2017) <doi:10.1093/jamia/ocw135>, and Liao et al. (2015) <doi:10.1136/bmj.h1885>. |
URL: | https://celehs.github.io/PheCAP/, https://github.com/celehs/PheCAP |
BugReports: | https://github.com/celehs/PheCAP/issues |
License: | GPL-3 |
Encoding: | UTF-8 |
ByteCompile: | yes |
Imports: | graphics, methods, stats, utils, glmnet, RMySQL |
Suggests: | ggplot2, e1071, randomForestSRC, xgboost, knitr, rmarkdown |
VignetteBuilder: | knitr |
Depends: | R (>= 3.3.0) |
RoxygenNote: | 7.1.1 |
LazyData: | true |
Config/pak/sysreqs: | libmysqlclient-dev |
Repository: | https://celehs.r-universe.dev |
RemoteUrl: | https://github.com/celehs/phecap |
RemoteRef: | HEAD |
RemoteSha: | 7c18179625c93f1058533aafedcae9480409da06 |
Author: | Yichi Zhang [aut], Chuan Hong [aut], Tianxi Cai [aut], PARSE LTD [aut, cre] |
Maintainer: | PARSE LTD <[email protected]> |
Index of help topics:
PheCAP-package                      High-Throughput Phenotyping with EHR using a Common Automated Pipeline
PhecapData                          Define or Read Datasets for Phenotyping
PhecapSurrogate                     Define a Surrogate Variable used in Surrogate-Assisted Feature Extraction (SAFE)
ehr_data                            A Synthetic EHR Dataset
phecap_generate_dictionary_file     Generate a Dictionary File for Note Parsing
phecap_perform_majority_voting      Perform Majority Voting on the CUIs from Multiple Knowledge Sources
phecap_plot_roc_curves              Plot ROC and Related Curves for Phenotyping Models
phecap_predict_phenotype            Predict Phenotype
phecap_run_feature_extraction       Run Surrogate-Assisted Feature Extraction (SAFE)
phecap_train_phenotyping_model      Train Phenotyping Model using the Training Labels
phecap_validate_phenotyping_model   Validate the Phenotyping Model using the Validation Labels
PheCAP provides a straightforward interface for phenotyping with electronic health records. Specify the data via PhecapData and define the surrogates using PhecapSurrogate. Next, run surrogate-assisted feature extraction (SAFE) by calling phecap_run_feature_extraction, then train and validate phenotyping models via phecap_train_phenotyping_model and phecap_validate_phenotyping_model. The predictive performance can be visualized using phecap_plot_roc_curves. The predicted phenotype is provided by phecap_predict_phenotype.
Yu, S., Chakrabortty, A., Liao, K. P., Cai, T., Ananthakrishnan, A. N., Gainer, V. S., ... & Cai, T. (2017). Surrogate-assisted feature extraction for high-throughput phenotyping. Journal of the American Medical Informatics Association, 24(e1), e143-e149.
Liao, K. P., Cai, T., Savova, G. K., Murphy, S. N., Karlson, E. W., Ananthakrishnan, A. N., ... & Churchill, S. (2015). Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ, 350, h1885.
# Simulate an EHR dataset
size <- 2000
latent <- rgamma(size, 0.3)
latent2 <- rgamma(size, 0.3)
ehr_data <- data.frame(
  ICD1 = rpois(size, 7 * (rgamma(size, 0.2) + latent) / 0.5),
  ICD2 = rpois(size, 6 * (rgamma(size, 0.8) + latent) / 1.1),
  ICD3 = rpois(size, 1 * rgamma(size, 0.5 + latent2) / 0.5),
  ICD4 = rpois(size, 2 * rgamma(size, 0.5) / 0.5),
  NLP1 = rpois(size, 8 * (rgamma(size, 0.2) + latent) / 0.6),
  NLP2 = rpois(size, 2 * (rgamma(size, 1.1) + latent) / 1.5),
  NLP3 = rpois(size, 5 * (rgamma(size, 0.1) + latent) / 0.5),
  NLP4 = rpois(size, 11 * rgamma(size, 1.9 + latent) / 1.9),
  NLP5 = rpois(size, 3 * rgamma(size, 0.5 + latent2) / 0.5),
  NLP6 = rpois(size, 2 * rgamma(size, 0.5) / 0.5),
  NLP7 = rpois(size, 1 * rgamma(size, 0.5) / 0.5),
  HU = rpois(size, 30 * rgamma(size, 0.1) / 0.1),
  label = NA)
ii <- sample.int(size, 400)
ehr_data[ii, "label"] <- with(
  ehr_data[ii, ], rbinom(400, 1, plogis(
    -5 + 1.5 * log1p(ICD1) + log1p(NLP1) +
      0.8 * log1p(NLP3) - 0.5 * log1p(HU))))

# Define features and labels used for phenotyping.
data <- PhecapData(ehr_data, "HU", "label", validation = 0.4)
data

# Specify the surrogate used for
# surrogate-assisted feature extraction (SAFE).
# The typical way is to specify a main ICD code, a main NLP CUI,
# as well as their combination.
# The default lower_cutoff is 1, and the default upper_cutoff is 10.
# In some cases one may want to define surrogate through lab test.
# Feel free to change the cutoffs based on domain knowledge.
surrogates <- list(
  PhecapSurrogate(
    variable_names = "ICD1",
    lower_cutoff = 1, upper_cutoff = 10),
  PhecapSurrogate(
    variable_names = "NLP1",
    lower_cutoff = 1, upper_cutoff = 10))

# Run surrogate-assisted feature extraction (SAFE)
# and show result.
feature_selected <- phecap_run_feature_extraction(
  data, surrogates, num_subsamples = 50, subsample_size = 200)
feature_selected

# Train phenotyping model and show the fitted model,
# with the AUC on the training set as well as random splits.
model <- phecap_train_phenotyping_model(
  data, surrogates, feature_selected, num_splits = 100)
model

# Validate phenotyping model using validation label,
# and show the AUC and ROC.
validation <- phecap_validate_phenotyping_model(data, model)
validation
phecap_plot_roc_curves(validation)

# Apply the model to all the patients to obtain predicted phenotype.
phenotype <- phecap_predict_phenotype(data, model)

# A more complicated example

# Load Data.
data(ehr_data)
data <- PhecapData(ehr_data, "healthcare_utilization", "label", 0.4)
data

# Specify the surrogate used for
# surrogate-assisted feature extraction (SAFE).
# The typical way is to specify a main ICD code, a main NLP CUI,
# as well as their combination.
# In some cases one may want to define surrogate through lab test.
# The default lower_cutoff is 1, and the default upper_cutoff is 10.
# Feel free to change the cutoffs based on domain knowledge.
surrogates <- list(
  PhecapSurrogate(
    variable_names = "main_ICD",
    lower_cutoff = 1, upper_cutoff = 10),
  PhecapSurrogate(
    variable_names = "main_NLP",
    lower_cutoff = 1, upper_cutoff = 10),
  PhecapSurrogate(
    variable_names = c("main_ICD", "main_NLP"),
    lower_cutoff = 1, upper_cutoff = 10))

# Run surrogate-assisted feature extraction (SAFE)
# and show result.
feature_selected <- phecap_run_feature_extraction(data, surrogates)
feature_selected

# Train phenotyping model and show the fitted model,
# with the AUC on the training set as well as random splits
model <- phecap_train_phenotyping_model(data, surrogates, feature_selected)
model

# Validate phenotyping model using validation label,
# and show the AUC and ROC
validation <- phecap_validate_phenotyping_model(data, model)
validation
phecap_plot_roc_curves(validation)

# Apply the model to all the patients to obtain predicted phenotype.
phenotype <- phecap_predict_phenotype(data, model)
A synthetic sample dataset for EHR phenotyping. It contains counts of ICD codes, counts of NLP mentions, and healthcare utilization (HU) features for all observations. It also contains the accurate phenotypes for 181 observations.
data(ehr_data)
A data.frame with 10000 observations of 588 variables.
Given a list of CUIs, connect to the UMLS database stored in MySQL, extract CUIs and associated terms, and write a dictionary file for use in note parsing.
phecap_generate_dictionary_file(
  cui_list, dict_file,
  user = "username", password = "password",
  host = "localhost", dbname = "umls", ...)
cui_list |
a character vector consisting of CUIs of interest. |
dict_file |
a character scalar for the path to the dictionary file that will be generated. |
user |
a character scalar for the username for database connection; passed to |
password |
a character scalar for the password for database connection; passed to |
host |
a character scalar for the host (or URL) for database connection; passed to |
dbname |
a character scalar for the database name for database connection; passed to |
... |
Other arguments passed to |
The dictionary will be written to the location given by dict_file.
Return the dictionary invisibly.
Read parsed knowledge sources and identify CUIs. Generate a list of CUIs that appear in at least half of the sources.
phecap_perform_majority_voting(input_folder)
input_folder |
a character scalar for the path to the folder that contains the parsed knowledge sources |
A character vector consisting of CUIs that pass the majority voting criterion.
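The voting rule described above can be illustrated in base R. This is only a sketch of the "at least half of the sources" criterion, not the package's implementation: the CUI strings and source names below are made up, and reading and parsing the files in input_folder is left out.

```r
# Hypothetical parsed sources: each element holds the CUIs found in one source.
# The identifiers are illustrative placeholders, not real UMLS concepts.
sources <- list(
  source_a = c("C0000001", "C0000002", "C0000003"),
  source_b = c("C0000001", "C0000003"),
  source_c = c("C0000001", "C0000004"),
  source_d = c("C0000002", "C0000001")
)

# Count in how many sources each CUI appears
# (a CUI repeated within one source counts once).
cui_counts <- table(unlist(lapply(sources, unique)))

# Keep the CUIs that appear in at least half of the sources.
keep <- as.vector(cui_counts) >= length(sources) / 2
voted <- sort(names(cui_counts)[keep])
voted
```

With four sources the threshold is two, so a CUI seen in a single source is dropped.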
Plot ROC-like curves to illustrate phenotyping accuracy.
phecap_plot_roc_curves(
  x, axis_x = "1 - spec", axis_y = "sen",
  what = c("training", "random-splits", "validation"),
  ggplot = TRUE, ...)
x |
either a single object of class PhecapModel or PhecapValidation
(returned from |
axis_x |
an expression that leads to the |
axis_y |
an expression that leads to the |
what |
The curves to be included in the figure. |
ggplot |
if TRUE and ggplot2 is installed, ggplot will be used for the figure. Otherwise, the base R graphics functions will be used. |
... |
arguments to be ignored. |
See PheCAP-package for code examples.
Compute predicted probability of having the phenotype for each patient in the dataset.
phecap_predict_phenotype(data, model)
data |
an object of class |
model |
an object of class |
A data.frame with two columns:
patient_index |
patient identifier |
prediction |
predicted phenotype |
See PheCAP-package for code examples.
Run surrogate-assisted feature extraction (SAFE) using unlabeled data and subsampling.
phecap_run_feature_extraction(
  data, surrogates,
  subsample_size = 1000L, num_subsamples = 200L,
  dropout_proportion = 0, frequency_cutoff = 0.5,
  start_seed = 45600L, verbose = 0L)
data |
An object of class PhecapData, obtained by calling PhecapData(...) |
surrogates |
A list of objects of class PhecapSurrogate, obtained by something like list(PhecapSurrogate(...), PhecapSurrogate(...)) |
subsample_size |
An integer scalar giving the size of each subsample |
num_subsamples |
The number of subsamples drawn for each surrogate |
dropout_proportion |
A scalar between 0 and 1. If it is positive, for each predictor a random subset of observations will be set to zero |
frequency_cutoff |
A scalar between 0 and 1. Variables selected in at least this proportion of the subsamples are the variables finally selected |
start_seed |
in the i-th subsample, the seed is set to start_seed + i |
verbose |
print progress every |
In this unlabeled setting, the extremes of each surrogate are used to define cases and controls. The variables selected are those selected in at least half (or the proportion specified) of the subsamples.
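The final selection step can be pictured with a small base-R sketch. It assumes a hypothetical logical matrix recording which features were picked in each subsample; how the package fits each subsample is not shown here, and the feature names are illustrative.

```r
# Hypothetical selection indicators: one row per subsample, one column per
# feature. TRUE means the feature was selected in that subsample.
selection <- rbind(
  c(ICD1 = TRUE, NLP1 = TRUE,  NLP2 = FALSE, NLP3 = TRUE),
  c(ICD1 = TRUE, NLP1 = FALSE, NLP2 = FALSE, NLP3 = TRUE),
  c(ICD1 = TRUE, NLP1 = TRUE,  NLP2 = FALSE, NLP3 = FALSE),
  c(ICD1 = TRUE, NLP1 = FALSE, NLP2 = TRUE,  NLP3 = TRUE)
)

frequency_cutoff <- 0.5

# Proportion of subsamples in which each feature was selected.
frequency <- colMeans(selection)

# Features selected in at least frequency_cutoff of the subsamples.
selected <- names(frequency)[frequency >= frequency_cutoff]
selected
```

Raising frequency_cutoff makes the final feature set smaller and more stable across subsamples; lowering it admits features that were picked less consistently.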
An object of class PhecapFeatureExtraction, with components
selected |
the names of selected features |
frequency |
the proportion of being selected for each feature |
See PheCAP-package for code examples.
Train the phenotyping model on the training dataset, and evaluate its performance via random splits of the training dataset.
phecap_train_phenotyping_model(
  data, surrogates, feature_selected,
  method = "lasso_bic",
  train_percent = 0.7, num_splits = 200L,
  start_seed = 78900L, verbose = 0L)
data |
an object of class |
surrogates |
a list of objects of class |
feature_selected |
a character vector of the features that should be included in the model,
probably returned by |
method |
Either a character vector or a list of two components. If a character vector is used, possible entries are given below. When at least two methods are specified, the predicted probability is the simple average of the predicted probabilities from each method.
If a list is used, it should contain two named components as follows.
|
train_percent |
The percentage (between 0 and 1) of labels that are used for model training during random splits |
num_splits |
The number of random splits. |
start_seed |
in the i-th split, the seed is set to start_seed + i. |
verbose |
print progress every verbose splits if verbose is positive, or remain quiet if verbose is zero |
An object of class PhecapModel, with components
coefficients |
the fitted object |
method |
the method used for model training |
feature_selected |
the features selected by SAFE |
train_roc |
ROC on training dataset |
train_auc |
AUC on training dataset |
split_roc |
average ROC on random splits of training dataset |
split_auc |
average AUC on random splits of training dataset |
fit_function |
the function used for fitting |
predict_function |
the function used for prediction |
See PheCAP-package for code examples.
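The random-split evaluation controlled by train_percent, num_splits and start_seed can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: a plain logistic regression (glm) stands in for the default "lasso_bic" method, the data are simulated, and the auc helper is a hypothetical rank-based function written for this example.

```r
# Simulated labeled data (hypothetical; not the package's ehr_data).
set.seed(1)
n <- 300
x1 <- rpois(n, 3)
x2 <- rpois(n, 2)
y <- rbinom(n, 1, plogis(-2 + 0.8 * log1p(x1) + 0.5 * log1p(x2)))
labeled <- data.frame(y = y, x1 = log1p(x1), x2 = log1p(x2))

# Rank-based (Mann-Whitney) AUC; tied scores receive average ranks.
auc <- function(score, truth) {
  r <- rank(score)
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

train_percent <- 0.7
num_splits <- 20L
start_seed <- 78900L

split_auc <- vapply(seq_len(num_splits), function(i) {
  set.seed(start_seed + i)  # in the i-th split the seed is start_seed + i
  idx <- sample.int(nrow(labeled), floor(train_percent * nrow(labeled)))
  # Stand-in model: ordinary logistic regression instead of a penalized fit.
  fit <- glm(y ~ x1 + x2, family = binomial(), data = labeled[idx, ])
  pred <- predict(fit, newdata = labeled[-idx, ], type = "response")
  auc(pred, labeled$y[-idx])
}, numeric(1))

# Average held-out AUC over the random splits.
mean(split_auc)
```

Because each split resets the seed to start_seed + i, rerunning the loop reproduces the same partitions, which makes the averaged AUC repeatable.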
Apply the trained model to all patients in the validation dataset, and measure the prediction accuracy via ROC and AUC.
phecap_validate_phenotyping_model(data, model)
data |
an object of class |
model |
an object of class |
An object of class PhecapValidation, with components
method |
the method used for model training |
train_roc |
ROC on training dataset |
train_auc |
AUC on training dataset |
split_roc |
average ROC on random splits of training dataset |
split_auc |
average AUC on random splits of training dataset |
valid_roc |
ROC on validation dataset |
valid_auc |
AUC on validation dataset |
See PheCAP-package for code examples.
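For readers unfamiliar with how an ROC curve and AUC are obtained from predicted probabilities, here is a small base-R sketch. The predictions and labels are made up, and the column names sen and fpr are chosen for this example only; the layout of the package's valid_roc component is not documented here and may differ.

```r
# Hypothetical predicted probabilities and true labels for validation patients.
prediction <- c(0.92, 0.81, 0.67, 0.55, 0.43, 0.38, 0.21, 0.10)
truth      <- c(1,    1,    0,    1,    0,    1,    0,    0)

# Sweep cutoffs from high to low; a patient is called positive when
# prediction >= cutoff. Inf gives the (0, 0) corner of the curve.
cutoffs <- c(Inf, sort(unique(prediction), decreasing = TRUE))
roc <- data.frame(
  cutoff = cutoffs,
  sen = vapply(cutoffs, function(k) mean(prediction[truth == 1] >= k), numeric(1)),
  fpr = vapply(cutoffs, function(k) mean(prediction[truth == 0] >= k), numeric(1))
)

# Area under the ROC curve by the trapezoidal rule.
auc <- sum(diff(roc$fpr) * (head(roc$sen, -1) + tail(roc$sen, -1)) / 2)
auc
```

Here fpr is the false positive rate, that is, 1 minus specificity, which matches the default "1 - spec" horizontal axis of phecap_plot_roc_curves.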
Specify the data to be used for phenotyping.
PhecapData(
  data, hu_feature, label, validation,
  patient_id = NULL, subject_weight = NULL,
  seed = 12300L, feature_transformation = log1p)
data |
A data.frame consisting of all the variables needed for phenotyping, or a character scalar of the path to the data, or a list consisting of either character scalar or data.frame. If a list is given, patient_id cannot be NULL. All the datasets in the list will be joined into a single dataset according to the columns specified by patient_id. |
hu_feature |
A character scalar or vector specifying the names of one or more healthcare utilization (HU) variables. These variables are always included in the phenotyping model. |
label |
A character scalar of the column name that gives the phenotype status (1 or TRUE: present, 0 or FALSE: absent). If label is not ready yet, just put a column filled with NA in data. In such cases only the feature extraction step can be done. |
validation |
A character scalar, a real number strictly between 0 and 1, or an integer not less than 2. If a character scalar is used, it is treated as the column name in the data that specifies whether this observation belongs to the validation samples (1 or TRUE: validation, 0 or FALSE: training). If a real number strictly between 0 and 1 is used, it is treated as the proportion of the validation samples. The actual validation samples will be drawn from all labeled samples. If an integer not less than 2 is used, it is treated as the size of the validation samples. The actual validation samples will be drawn from all labeled samples. |
patient_id |
A character vector of the column names, if any, that uniquely identify each patient. Such variables must appear in the data. patient_id can be NULL if such fields are not contained in the data. |
subject_weight |
An optional numeric vector of weights for observations. |
seed |
If validation samples need to be drawn from all labeled samples, seed specifies the random seed for sampling. |
feature_transformation |
A function that will be applied to all the features.
Since count data are typically right-skewed,
by default |
An object of class PhecapData.
See PheCAP-package for code examples.
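Two of the arguments above, validation given as a proportion and feature_transformation, can be illustrated with base R alone. The sketch below is an assumption-laden illustration rather than the package's internals: the label vector is simulated, and rounding the validation size with round() is a choice made for this example.

```r
# Hypothetical label column: NA marks unlabeled patients.
set.seed(12300)
label <- rep(NA_integer_, 1000)
label[sample.int(1000, 200)] <- rbinom(200, 1, 0.3)

# validation = 0.4: draw 40% of the labeled patients as the validation set;
# the remaining labeled patients form the training set.
validation <- 0.4
labeled_index <- which(!is.na(label))
validation_index <- sample(
  labeled_index, round(validation * length(labeled_index)))
training_index <- setdiff(labeled_index, validation_index)
c(training = length(training_index), validation = length(validation_index))

# log1p(x) = log(1 + x) compresses the long right tail of count features
# while keeping a count of zero at zero.
counts <- c(0, 1, 5, 50, 500)
log1p(counts)
```

Unlabeled patients are in neither set; they are still available to the feature extraction step, which does not use labels.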
Define a surrogate variable from existing features, and specify associated lower and upper cutoffs.
PhecapSurrogate(variable_names, lower_cutoff = 1L, upper_cutoff = 10L)
variable_names |
a character scalar or vector consisting of variable names. If a vector is given, the value of the surrogate is defined as the sum of the values of each variable. |
lower_cutoff |
a numeric scalar. If the surrogate value of a patient is less than or equal to this cutoff, then this patient is treated as a control in SAFE. |
upper_cutoff |
a numeric scalar. If the surrogate value of a patient is greater than or equal to this cutoff, then this patient is treated as a case in SAFE. |
This function only stores the definition. No calculation is done.
An object of class PhecapSurrogate.
See PheCAP-package for code examples.
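How a surrogate definition turns into cases and controls can be sketched in base R. This follows the argument descriptions above (sum of the named variables, control at or below lower_cutoff, case at or above upper_cutoff) but is only an illustration: the counts are made up, and the sketch assumes the cutoffs are compared against raw counts, which this page does not state.

```r
# Hypothetical counts for six patients.
features <- data.frame(
  main_ICD = c(0, 1, 3, 12, 0, 6),
  main_NLP = c(0, 0, 2, 9, 1, 5)
)

lower_cutoff <- 1
upper_cutoff <- 10

# A surrogate built from several variables is the sum of their values.
surrogate_value <- rowSums(features[, c("main_ICD", "main_NLP")])

# Extremes define controls and cases; patients in between stay unassigned (NA).
status <- ifelse(surrogate_value <= lower_cutoff, "control",
                 ifelse(surrogate_value >= upper_cutoff, "case", NA))
data.frame(features, surrogate_value, status)
```

Widening the gap between the two cutoffs gives cleaner cases and controls at the cost of leaving more patients unassigned.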