Predictive modeling
Predictive modeling with clinical and molecular data
Establishing statistical models to predict patient prognosis and/or to investigate the (individual) treatment effect based on clinical and molecular information constitutes the core of translational cancer research. It is the prerequisite for identifying biologically distinct tumor subgroups and allows treatment decisions to be tailored to a patient's specific needs. Prognostic biomarkers group patients according to their overall prognosis, e.g., with respect to overall survival. In contrast, predictive biomarkers indicate how the treatment effect varies between patients and thus form the basis for personalized medicine. Whenever a study fails to show a global treatment effect, one hopes to find a treatment effect in at least one subgroup of patients. Tree-based methods are appealing for this purpose, as they do not require an exhaustive search over all possible subgroups. Moreover, they can easily be applied to high-dimensional settings when used in conjunction with ensemble methods, in particular random forests. We extended a model-based recursive partitioning method for subgroup analyses to specifically identify predictive biomarkers by reparametrizing the base model. We tailored our method for application in the randomized clinical trial setting and recently adjusted it for use in observational data analysis as well.
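To make the notion of a predictive split concrete, the following is a minimal sketch and not the model-based recursive partitioning method described above. It assumes a randomized two-arm trial with a continuous outcome and a single candidate biomarker, and it simply picks the cut point at which the estimated treatment effect (difference in mean outcome, treated minus control) differs most between the two child nodes. The function names, the `min_arm_size` parameter, and the toy data are hypothetical.

```python
from statistics import mean


def treatment_effect(rows):
    """Difference in mean outcome, treated minus control, within one node."""
    treated = [y for (_, arm, y) in rows if arm == 1]
    control = [y for (_, arm, y) in rows if arm == 0]
    return mean(treated) - mean(control)


def best_predictive_split(rows, min_arm_size=2):
    """Search cut points of one biomarker for the largest treatment-effect gap.

    rows: list of (biomarker_value, arm, outcome) with arm 0 = control, 1 = treated.
    Returns (cutpoint, effect_low, effect_high), or None if no split is admissible.
    """
    values = sorted({x for (x, _, _) in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [r for r in rows if r[0] <= cut]
        right = [r for r in rows if r[0] > cut]
        # Each child node needs enough patients in both arms to estimate an effect.
        admissible = all(
            sum(1 for r in node if r[1] == arm) >= min_arm_size
            for node in (left, right)
            for arm in (0, 1)
        )
        if not admissible:
            continue
        e_left, e_right = treatment_effect(left), treatment_effect(right)
        gap = abs(e_right - e_left)
        if best is None or gap > best[0]:
            best = (gap, cut, e_left, e_right)
    return None if best is None else best[1:]


# Toy data (assumption): treatment only helps patients with biomarker > 4.
rows = []
for x in range(1, 9):
    rows.append((x, 0, 1.0))                    # control arm
    rows.append((x, 1, 3.0 if x > 4 else 1.0))  # treated arm

print(best_predictive_split(rows))  # (4.5, 0.0, 2.0)
```

A prognostic biomarker would shift the outcome in both arms alike and leave the gap between the child-node treatment effects near zero, which is why the split criterion here looks only at the difference in effects. A real analysis would replace the raw gap with a formal test and guard against the selection bias of searching over cut points.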
Regularization approaches are an alternative to ensemble methods for coping with high-dimensional data. For modeling survival endpoints, we have extensively investigated regularization approaches based on penalized partial likelihood maximization and developed recommendations for their use. Regression modeling in very high covariate dimensions often requires reducing the number of covariates through preliminary screening. Many variable screening methods are now available, but there is a lack of guidance on how to select an appropriate method in practice. Specifically for survival analysis, we provided an overview of marginal variable screening methods and made recommendations for their application.
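As an illustration of what marginal screening for a survival endpoint can look like, the sketch below ranks each covariate on its own by how far its concordance index (Harrell's C) with the observed survival times lies from 0.5, and keeps the top-ranked covariates. The choice of the concordance index as the marginal statistic is an assumption made for this example and is not necessarily one of the methods recommended in the overview mentioned above; the function names and toy data are hypothetical.

```python
def concordance(times, events, scores):
    """Harrell's C for one covariate, treating higher scores as higher risk.

    A pair (i, j) is usable if patient i had an observed event before time j.
    Ties in the score count as half-concordant; tied times are ignored.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / usable if usable else 0.5


def marginal_screen(times, events, covariates, keep):
    """Rank covariates by |C - 0.5| and return the names of the `keep` strongest."""
    strength = {
        name: abs(concordance(times, events, values) - 0.5)
        for name, values in covariates.items()
    }
    return sorted(strength, key=strength.get, reverse=True)[:keep]


# Toy data (assumption): six patients, event indicator 1 = event, 0 = censored.
times = [2, 4, 6, 8, 10, 12]
events = [1, 1, 0, 1, 1, 0]
covariates = {
    "gene_a": [6, 5, 4, 3, 2, 1],  # high values go with early events
    "gene_b": [1, 2, 3, 4, 5, 6],  # high values go with late events
    "gene_c": [3, 1, 3, 2, 3, 1],  # no clear relation to survival
}

print(marginal_screen(times, events, covariates, keep=2))
```

Using the distance from 0.5 rather than C itself keeps protective covariates (C well below 0.5) in the ranking. Because each covariate is evaluated in isolation, a screen of this kind can miss covariates that matter only jointly with others, which is one reason guidance on choosing a screening method is needed.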
Not only the covariate space but also the survival endpoint itself can be complex. Common prediction models predominantly use composite endpoints such as event-free survival (EFS) or relapse-free survival (RFS). However, time-to-first-event endpoints do not incorporate important aspects of the individual course of the disease. For modeling competing risks data in higher dimensions, we provided a penalized cause-specific hazards approach. The idea is to link the independently penalized cause-specific hazards models by choosing the combination of tuning parameters that yields the best prediction of the incidence of the event of interest at a fixed time point. A multi-state model is required to capture pathogenic disease processes and underlying etiologies more accurately. To decompose EFS and/or RFS accordingly while taking high-dimensional molecular factors into account, we are currently extending model selection and model reduction methods based on stratified reparametrization to combine homogeneous effects for different transitions and data-driven covariate selection using regularization methods.
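The reason the tuning parameters of the cause-specific models are chosen jointly in the competing risks approach described above is that the incidence of the event of interest depends on the hazards of all causes, not only its own. The sketch below shows this with a plain nonparametric (Aalen-Johansen type) estimate of the cumulative incidence at a fixed time point, without covariates or penalization. It is meant as an illustration of the target quantity only; the function name, the coding of causes, and the toy data are assumptions.

```python
def cumulative_incidence(times, causes, cause_of_interest, horizon):
    """Estimate P(event of the given cause occurs by `horizon`).

    causes: 0 = censored, positive integers = event types.
    At each event time t, the increment is
        (event-free survival just before t) * (cause-specific hazard at t),
    where the event-free survival is driven by events of ALL causes.
    """
    event_times = sorted(
        {t for t, c in zip(times, causes) if c != 0 and t <= horizon}
    )
    survival = 1.0  # probability of no event of any cause so far
    incidence = 0.0
    for t in event_times:
        at_risk = sum(1 for s in times if s >= t)
        d_interest = sum(
            1 for s, c in zip(times, causes) if s == t and c == cause_of_interest
        )
        d_any = sum(1 for s, c in zip(times, causes) if s == t and c != 0)
        incidence += survival * d_interest / at_risk
        survival *= 1.0 - d_any / at_risk
    return incidence


# Toy data (assumption): cause 1 is the event of interest, cause 2 competes.
times = [1, 2, 3, 4, 5, 6]
causes = [1, 2, 0, 1, 2, 1]

proper = cumulative_incidence(times, causes, cause_of_interest=1, horizon=4)

# Treating competing events as censored ignores the competing hazard.
naive_causes = [c if c == 1 else 0 for c in causes]
naive = cumulative_incidence(times, naive_causes, cause_of_interest=1, horizon=4)

print(round(proper, 4), round(naive, 4))  # 0.3889 0.4444
```

In the toy data, ignoring the competing event raises the estimated incidence from 7/18 to 8/18. In a penalized setting, the same dependence means that the amount of shrinkage applied to the competing-cause model changes the predicted incidence of the event of interest, so the prediction criterion is evaluated over combinations of tuning parameters rather than for each model separately.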
We applied predictive modeling approaches in a multitude of situations. For instance, to identify patients with chronic lymphocytic leukemia who particularly benefit from chemoimmunotherapy with fludarabine, cyclophosphamide, and rituximab (FCR), we used integrative penalized Cox regression models combining established prognostic factors and gene expression profiles from a phase III clinical trial comparing first-line treatment with FC or FCR. In research accompanying clinical trials on acute myeloid leukemia (AML), we characterized the mutation landscape of AML patients. Targeted sequencing data were evaluated with various statistical approaches to reconstruct the temporal order of mutational evolution. A hierarchical Dirichlet process extracted possible biological subtypes of AML, and random survival forests were fitted to evaluate the impact of clinical and genetic features. As the data sets we use to build prediction models often involve molecular data, appropriate statistical analysis methods for molecular data from various sources are needed. We evaluated statistical methods for the accurate detection of methylated and hydroxymethylated CpGs using methylation arrays and investigated the use of Ago-RIP-Seq experiments for the identification of microRNA targets. In cooperation with the Section of Allogeneic Stem Cell Transplantation at Heidelberg University Hospital, we investigate the usefulness of EASIX as a prognostic and predictive biomarker for several diseases and endpoints. For instance, we illustrate the prognostic and predictive value of EASIX for time-to-sepsis, the effectiveness of statin-based prophylaxis for non-relapse mortality in different EASIX subgroups, and the prognostic value of EASIX for severe complications after CAR-T cell therapy.
- Benner, A., Zucknick, M., Hielscher, T., Ittrich, C. and Mansmann, U. High-dimensional Cox models: The choice of penalty as part of the model building process. Biom. J., 52: 50-69 (2010)
- Bloehdorn J. et al. Integrative prognostic models predict long-term survival after immunochemotherapy in chronic lymphocytic leukemia patients. Haematologica 107(3): 615 (2022)
- Edelmann D. et al. Marginal variable screening for survival endpoints. Biom. J. 62: 610-626 (2020)
- Krzykalla J. et al. Exploratory identification of predictive biomarkers in randomized trials with normal endpoints. Stat. Med. 39: 923-939 (2020)
- Krzykalla J. et al. Tree-based exploratory identification of predictive biomarkers in observational data. arXiv preprint arXiv:2212.08460 (2022)
- Luft T. et al. EASIX in patients with acute graft-versus-host disease: a retrospective cohort analysis. Lancet Haematol. 4(9): e414-e423 (2017). doi: 10.1016/S2352-3026(17)30108-4
- Saadati M., Benner A. Statistical challenges of high-dimensional methylation data. Stat. Med. 33(30): 5347-5357 (2014)
- Saadati M. et al. Prediction accuracy and variable selection for penalized cause-specific hazards models. Biom. J. 60: 288-306 (2018)
- Slynko A., Benner A. Statistical methods for classification of 5hmC levels based on the Illumina Infinium HumanMethylation450 (450k) array data, under the paired bisulfite (BS) and oxidative bisulfite (oxBS) treatment. PLoS One. 14(6) (2019)
- Tichy D. et al. Experimental design and data analysis of Ago-RIP-Seq experiments for the identification of microRNA targets. Brief. Bioinform. 19: 918-929 (2018)