## EVALUATION OF NOVEL BIOMARKERS FOR CORONARY ARTERY DISEASE AMONG SYMPTOMATIC PATIENTS: STATISTICAL METHODOLOGY AND APPLICATION (2011)

### BibTeX

@MISC{Lans11evaluationof,

author = {Daniel Lans},

title = {EVALUATION OF NOVEL BIOMARKERS FOR CORONARY ARTERY DISEASE AMONG SYMPTOMATIC PATIENTS: STATISTICAL METHODOLOGY AND APPLICATION},

year = {2011}

}

### Abstract

Daniel Lans, M.S., University of Pittsburgh, 2013

Proteomics has led to the discovery of several biomarkers within an individual's bloodstream that can be used in the diagnostic process for disease. Identification of novel biomarkers has a significant impact in the area of public health, with the potential to replace existing diagnostic methods that are complicated, costly, and that pose considerable risk to the patient. Cardiac catheterization, the current diagnostic method for coronary artery disease, is one such invasive procedure. An over-abundance of negative test results raises the question of whether exposing all symptomatic patients to the procedure is in a physician's best interest. A statistical analysis involving multivariate logistic regression and evaluation of predictive models identified a panel of biomarkers that can be used to distinguish patients with coronary artery disease from those with "normal" coronary arteries. This panel was used in conjunction with common clinical risk factors for heart disease to examine the added predictive power of the multi-marker panel when combined with clinical characteristics. The four markers in the panel (OPN, IL1β, Apo-B100, and Fibrinogen) were found to be statistically significant predictors of coronary artery disease in a predictive logistic model adjusting for clinical risk factors, diabetes status and smoking status. The ability to identify patients that did not have clinically relevant coronary disease based on currently used clinical risk factors increased greatly, from zero to approximately thirty percent of the patients, with the inclusion of the biomarker panel.
The use of a blood screening test for the diagnosis of coronary artery disease among symptomatic patients can limit the number of unnecessary cardiac catheterizations, reducing healthcare costs and patient risks associated with the invasive nature of the procedure. However, with such a test, there may be some discrimination error present, and the cost of misdiagnosing a patient with clinically relevant coronary artery disease needs to be weighed against the benefits of the test.

Much gratitude goes out to my committee members: to Bill LaFramboise, whom I met through playing handball, who provided me with such a meaningful and thought-provoking public health problem to research that will undoubtedly lead to future success; to Dr. Gary Marsh for his outstanding advisement throughout my graduate program and essential role in helping me succeed as a biostatistician; and to Dr. Andriy Bandos, whose expertise, without question, was indispensable throughout the process of this thesis, and furthered my knowledge on the subject matter. Last, but not least, I would like to thank my friends and classmates at the University of Pittsburgh Graduate School of Public Health, whose collaboration and camaraderie were essential to my success as a student. I could not have had as valuable an experience without the support of these individuals.

#### INTRODUCTION

Biological markers, more commonly referred to as "biomarkers," are observable measurements derived from a patient that can be used to describe certain biological developments, including disease status, risk, or prognosis for that patient. According to an NIH working group, a biomarker is defined as "a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention" [1]. These biomarkers can be classified into separate categories based on their clinical properties.
For the purpose of this paper, the term biomarkers will be used to denote biological components that indicate the disease status of an individual; they are disease biomarkers with diagnostic properties. More specifically, this paper is interested in circulating biomarkers ascertained from advanced proteomics methods. Previously, biomarkers were commonly simple physiological measurements, such as one's blood pressure or heart rate, but they have now evolved to include complex imaging techniques and multi-marker genomic/proteomic panels. Novel discoveries in biomarker research have a significant impact in the area of public health, providing alternative diagnostic methods for currently used invasive procedures, thus reducing the existing medical complications and economic burden of such procedures.

#### CORONARY ARTERY DISEASE

The application of biomarker research related to coronary artery disease (CAD) is the primary focus of this thesis. In the United States, CAD is the leading cause of mortality, accounting for about one of every six deaths. In 2009, 386,324 deaths due to CAD were recorded.

The current diagnostic method for CAD involves invasive coronary angiography, where medical imaging is used to detect a dye injected into the arteries by way of cardiac catheterization. This involves the insertion of a catheter, a thin and flexible tube, through a brachial or femoral artery and up to the aorta and chambers of the heart, where the dye is then released into the bloodstream. Furthermore, a patient undergoing the catheterization procedure is exposed to localized x-ray radiation for an extended period of time, increasing the risk of cancer and other genetic effects. The alarming rate of CAD has led to an increase in the number of cardiac catheterizations performed in hospitals, thus increasing the incidence of these complications.
Almost half of the patients referred for catheterization are found to have insignificant coronary lesions and are unnecessarily exposed to procedural complications.

#### Symptomatic Patients

Patients referred for catheterization arrive at the emergency room (ER) or heart clinic showing symptoms of CAD. This cohort excludes patients who have experienced a cardiac event, such as myocardial infarction, who skip the ER and are immediately sent for percutaneous intervention. For the patients received in the ER or heart clinic, an assessment of the individual is performed to determine the pretest probability of having CAD.

#### Biomarkers and Coronary Artery Disease

Cardiovascular disease is often accompanied by inflammation and plaque instability, followed by thrombosis within the arterial regions of the heart. Resulting ischemia may be followed by remodeling of the heart's ventricles. Investigation into the biological pathways for atherosclerosis involving inflammation, plaque instability, thrombosis, and remodeling of the extracellular matrix has identified several biomarkers associated with acute coronary syndromes.

Identifying biomarkers in the serum of symptomatic patients could lead to the development of a clinical assay to use as a diagnostic method for those patients with a high pretest probability and likelihood for CAD. Instead of referring patients for catheterization based on a clinical assessment, stress test, and/or EKG, a less costly blood assay can be performed to filter out symptomatic patients who would otherwise be diagnosed negative for CAD. Data from a clinical study are analyzed later in this paper to demonstrate the effectiveness of using serum protein profiles and clinical characteristics as biomarkers for clinically relevant CAD.

#### LITERATURE REVIEW

Several methods for assessing biomarkers are currently used, and there is some debate over the best way to measure the predictive power of new biomarkers.
Logistic regression is a common classification technique that is generally employed for problems involving biomarkers, while receiver operating characteristic (ROC) curves have been used to evaluate the predictive ability of these new biomarkers. In this literature review, these common statistical methods for classification of disease and the evaluation of biomarkers are covered.

#### STATISTICAL METHODS FOR PREDICTION

#### Odds ratios

Before delving into any of the more advanced statistical methods, it is important to grasp the concept of the odds ratio and how it is used in clinical interpretation. In biomarker experiments, it is often desired to know the probability of an event, or the probability a patient is diagnosed with disease. Odds can then be defined as the ratio of the probability the event will occur to the probability the event will not occur:

$$\text{odds} = \frac{P(\text{event})}{P(\text{no event})}$$

If $p$ equals the probability of disease for a patient, $1 - p$ equals the probability the patient does not have disease, and the above equation can be reformulated as

$$\text{odds} = \frac{p}{1 - p}$$

For example, if a clinical test using biomarkers determined a patient to have a 60% risk of CAD, the odds this patient actually has the disease would be 60% / 40%, or 1.5. This means the patient is 50% more likely to be diagnosed with disease than disease-free by a gold standard assessment, i.e. coronary angiography. If the odds of disease equal 1, the patient has the same chance of being diagnosed positive or negative, and odds less than 1 mean the patient has a lesser chance of being diagnosed with disease according to the angiographic test.

The odds ratio compares the odds of an event occurring between two patients:

$$\text{OR} = \frac{\text{odds}_1}{\text{odds}_2}$$

If the odds of CAD for patient 1 were 1.5 and the odds of disease for patient 2 were 1.2, the odds ratio would be 1.5 / 1.2, or 1.25. This means patient 1 has 25% higher odds of a positive result from coronary angiography than patient 2.
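
The worked numbers in this section (a 60% risk giving odds of 1.5, and odds of 1.5 versus 1.2 giving an odds ratio of 1.25) can be checked with a few lines of Python. This is a minimal sketch; the function names `odds`, `odds_ratio`, and `probability_from_odds` are illustrative and do not come from the thesis.

```python
def odds(p: float) -> float:
    """Return the odds p / (1 - p) for an event with probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def odds_ratio(odds_1: float, odds_2: float) -> float:
    """Return the ratio of two odds (patient 1 relative to patient 2)."""
    if odds_2 <= 0.0:
        raise ValueError("odds_2 must be positive")
    return odds_1 / odds_2


def probability_from_odds(o: float) -> float:
    """Invert the odds back to a probability: p = odds / (1 + odds)."""
    if o < 0.0:
        raise ValueError("odds must be non-negative")
    return o / (1.0 + o)


if __name__ == "__main__":
    # A 60% predicted risk of CAD corresponds to odds of 0.60 / 0.40.
    print(f"odds at p = 0.60: {odds(0.60):.2f}")
    # Odds of 1.5 versus 1.2 give an odds ratio of 1.25.
    print(f"odds ratio: {odds_ratio(1.5, 1.2):.2f}")
    # Converting odds of 1.5 back recovers the 60% probability.
    print(f"probability at odds 1.5: {probability_from_odds(1.5):.2f}")
```

Converting both odds back to probabilities gives 0.60 for patient 1 and about 0.545 for patient 2, which shows that an odds ratio of 1.25 describes a ratio of odds rather than a ratio of probabilities.
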
This concept of odds ratios carries over to logistic regression, and it will be shown how predicted probabilities of disease for patients can be derived from odds ratios.

#### Binary Logistic Regression

Binary logistic regression is a common statistical method for predicting the classification of subjects according to a dichotomous outcome. Many times, in health sciences, the goal is to differentiate those with and without a specific disease. Logistic regression can model the probability of disease, or any categorical outcome, and how the addition or subtraction of predictor variables affects that probability. The regression model can then be written out as

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$

Notice that the left side of the equation is the natural log of the odds specified in section 2.1.1. Exponentiating both sides of the regression model then gives odds ratio estimates, $e^{\beta_j}$, for the predictors. With some simple algebra, the regression model can be rearranged into the logistic function (the inverse of the logit) to calculate the estimated probability of disease:

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}$$

Coefficients for main effects in the logistic regression model are generated through maximum likelihood estimation (MLE).

[Figure: Graph of the logit function]

When combining several markers for prediction, which is becoming more and more popular with proteomic and genomic technologies, logistic regression serves as a useful tool for finding the best set of markers to use as a diagnostic tool. As previously discussed, the main assumption of logistic regression is that predictors within the model hold a linear relationship with the log-odds of the outcome. This is usually straightforward when dealing with categorical or ordinal predictors, but presents some difficulties when predictors are in a continuous form.

#### Continuous and Categorical Predictors

Protein biomarkers are usually reported on a continuous scale to reflect the concentration of the protein in a subject's serum.
While the assumption of normality does not necessarily need to hold for variables used in logistic regression, if a predictor is normally distributed for both levels of the outcome, the logistic regression model will be better at describing a linear relationship between the predictor and the outcome. If the assumption of linearity in the log-odds is violated, the predictive model will produce inaccurate estimates for the odds ratios. Dichotomization or categorization of continuous predictors is commonly used in exploratory stages to fit logistic regression models when the linear relationship is questionable.

Categorization of variables into two or more categories is often done in medical research as a way to simplify the interpretation of odds ratios, creating regression models with step functions. Factoring by tertiles, quartiles, or quintiles is commonly seen in proteomic analysis when clinically relevant thresholds are not available.

Moving forward with the factored continuous variables may present complications for clinical interpretation. First, the cutpoints used to factor continuous variables need to be explicitly defined when translating results into other research. Second, categorization of these variables discards information that may be relevant to the analysis. It is improbable that a subject's risk for disease will suddenly increase when one of the thresholds is crossed. If a linear assumption is validated, continuous variables will provide more powerful statistical results than their factored counterparts. Therefore, categorization of continuous variables is valid in an exploratory process, but final analysis should be conducted on the continuous form of the data. If the data are truly expected to be non-linear with respect to the log-odds of the outcome, more advanced modeling techniques can be used to address the situation.
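
Factoring a continuous marker by quartiles, as described in this section, can be sketched with the Python standard library. This is only an illustration under stated assumptions: the helper names `quartile_cutpoints` and `categorize` and the concentration values are hypothetical, and the cutpoints use the default "exclusive" method of `statistics.quantiles`, which is not necessarily the convention used in the thesis.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_cutpoints(values):
    """Return the three cutpoints that split the values into quartiles."""
    return quantiles(values, n=4)


def categorize(value, cutpoints):
    """Return the category (1 = lowest) that a value falls into.

    A value equal to a cutpoint is assigned to the lower category.
    """
    return bisect_left(cutpoints, value) + 1


if __name__ == "__main__":
    # Hypothetical serum concentrations, for illustration only.
    concentrations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    cuts = quartile_cutpoints(concentrations)
    print("cutpoints:", cuts)
    for value in (4.6, 6.7):
        # Two clearly different concentrations can share one category.
        print(value, "-> quartile", categorize(value, cuts))
```

In this sketch the values 4.6 and 6.7 receive the same category, which is the loss of information the text warns about; it also shows why the cutpoints must be reported explicitly for results to be reproduced elsewhere.
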
#### Fractional Polynomials

The idea of fractional polynomials in regression is discussed in detail by Royston and Altman. Fractional polynomials provide a flexible and more practical approach to modeling continuous covariates in an appropriate functional form, as opposed to categorization of these covariates, which may present several disadvantages and a significant loss of information. Provided a nonlinear relationship exists between the dependent and independent variables, fitting a logistic regression model with fractional polynomials will produce more accurate odds ratios for covariates within the model.

#### EVALUATION OF BIOMARKERS

One of the main uses of biomarkers is to make a diagnosis more reliable, more rapid, and less expensive than existing methods.

#### Multiple Comparisons

Valid biomarkers should have a greater presence in the affected individuals than in the unaffected.

#### Sensitivity and Specificity

Sensitivity and specificity are common statistical measures used to assess the diagnostic accuracy of a biomarker.

Figure 4. Sensitivity and specificity calculations for a diagnostic test

In most cases, it is desired to find a threshold that will maximize both sensitivity and specificity. For diagnostic tests, this will provide the most accurate results for discrimination between patients with and without disease. Sometimes, it is preferable to favor higher levels of sensitivity if the benefit of identifying true positives greatly outweighs the cost of false positives. Such is the case in biomarker analysis for CAD. The cost of misdiagnosing patients with clinically relevant CAD is too great, while misdiagnosing a symptomatic patient without CAD will only expose them to a cardiac catheterization procedure. Valid diagnostic tests should maintain very high levels of sensitivity. In order to characterize measures of sensitivity and specificity, receiver operating characteristic curves are usually generated.
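
The trade-off described here, holding sensitivity very high and accepting whatever specificity remains, can be sketched in Python. This is a minimal illustration, not the procedure used in the thesis: the function names, the rule "score at or above the threshold is test-positive", and the example scores are all assumptions made for the sketch.

```python
def sensitivity_specificity(scores, has_disease, threshold):
    """Return (sensitivity, specificity) when score >= threshold is test-positive."""
    tp = fn = tn = fp = 0
    for score, diseased in zip(scores, has_disease):
        positive = score >= threshold
        if diseased and positive:
            tp += 1   # diseased patient correctly flagged
        elif diseased:
            fn += 1   # diseased patient missed
        elif positive:
            fp += 1   # disease-free patient flagged
        else:
            tn += 1   # disease-free patient correctly cleared
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one disease-free patient")
    return tp / (tp + fn), tn / (tn + fp)


def highest_threshold_with_sensitivity(scores, has_disease, min_sensitivity):
    """Return (threshold, sensitivity, specificity) for the largest observed
    score that still achieves the required sensitivity.

    Lowering the threshold can only raise sensitivity and lower specificity,
    so the first qualifying threshold, scanning downward, keeps the most
    specificity.
    """
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, has_disease, threshold)
        if sens >= min_sensitivity:
            return threshold, sens, spec
    raise ValueError("no observed threshold reaches the requested sensitivity")


if __name__ == "__main__":
    # Hypothetical predicted risks and angiography results, for illustration only.
    scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    has_disease = [False, False, False, True, False, True, True, True]
    print(sensitivity_specificity(scores, has_disease, 0.5))
    print(highest_threshold_with_sensitivity(scores, has_disease, 1.0))
```

With these made-up data, a threshold of 0.5 misses one diseased patient, while requiring 100% sensitivity moves the threshold down to 0.4 and still clears three of the four disease-free patients.
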
#### Receiver Operating Characteristic Curves

The Receiver Operating Characteristic (ROC) curve is a way to visualize and gauge the performance of a set of classifiers. When generating ROC curves for logistic models with several predictors, retrospective calculations of the ROC curve tend to give inflated assessments of score performance.

K-fold cross-validation is an internal validation method for estimating the prediction error. In this process, the data are split into k blocks. A predictive model is generated from k - 1 of the blocks and used to score the remaining kth block.

External validation is one of the best ways to determine how well a logistic model can perform in clinical applications (2/3 of the data for the training set, 1/3 for the validation set).

#### CLINICAL APPLICATION

The data set examined in this thesis originates from a study by LaFramboise et al., focusing on the identification of circulating proteins for the diagnosis of coronary artery disease. The proteomics analysis was conducted in two stages. In stage 1, 239 samples (138 with CAD and 101 with normal coronary arteries) were assayed for 24 proteins. A scoring algorithm was generated from these 239 samples to measure the predictive ability of the proteins. This scoring algorithm was developed with a Monte Carlo optimization technique using a Metropolis algorithm.

In stage 2, assays were run on 120 additional samples (71 with CAD and 49 with normal coronary arteries) for validation of the algorithm, but for economic reasons, the researchers did not assay for proteins that the scoring algorithm found to be poor predictors of disease. Therefore, patient samples in this stage were assayed for only 11 of the proteins in the study. The addition of the 120 samples in stage 2 of the study was intended to externally validate the predictive ability of the scoring algorithm, comparable to the external validation process mentioned in section 2.
Through statistical processes, it was determined that the 11 proteins in the stage 2 data set were sufficient for this analysis. Clinical characteristics for these subjects were obtained retrospectively, so some data are missing where clinical information could not be determined for a patient. No clinical characteristics were made available for the validation data set.