Institute of Bioinformatics · University Medicine Greifswald · Ernst-Moritz-Arndt University of Greifswald

Molecular profiles of cancerous tissue can be obtained, for example, using DNA microarrays, an experimental platform that measures activity profiles for a large number of genes simultaneously. On a glass or plastic surface, messenger RNA concentrations are measured for different genes in a matrix pattern. Each spot in this matrix corresponds to a particular gene, as shown in the image, and the color indicates the level of activity of this gene in the tissue sample under consideration. Platforms such as array-CGH can be used to identify genomic alterations, such as gains or losses of individual genes, chromosome arms or entire chromosomes. More recently, next-generation sequencing platforms have enabled the identification and quantification of cellular DNA at an unprecedented resolution.

Such data have been shown to correlate with therapy response and survival in many cancers. These data can therefore assist the clinician in evaluating treatment options, and genes found to correlate with survival may hint at novel targets for drug design. However, modern platforms measure thousands to tens of thousands of mRNA levels or genomic alterations simultaneously. Such massive amounts of data cannot be analyzed manually, and sophisticated computational tools are needed for this purpose.

In our work, in collaboration with clinical partners, we are developing statistical tools and algorithms for the analysis of gene expression and genomic data, and for the prediction of treatment outcome and patient survival from these molecular data. Using algorithms from the machine learning field, we program computers to learn characteristic prognostic patterns from patient data, and to use these patterns to make predictions of outcome and therapy response for newly diagnosed patients.

A difficulty in the analysis of these data stems from the large number of genes measured compared with the small number of patients typically available in clinical studies. Furthermore, clinical studies typically run over a predetermined period, and follow-up time for patients is sometimes relatively short. For a proportion of the patients in a study, it is therefore not known whether the patient ultimately relapses or is cured; such observations are called "censored data". This poses a problem for training an automatic predictor, since it is unclear whether such a patient should be assigned a "good" or a "bad" outcome. On the other hand, excluding such patients from classifier training may lead to systematic bias.
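
To make the bias from excluding censored patients concrete, the following minimal sketch compares two estimates of the fraction of patients surviving beyond a given time. It uses the standard Kaplan-Meier estimator purely as an illustration; it is not the predictive method described on this page, and the toy cohort, function names and variable names are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient (here: months)
    events : True if relapse/death was observed, False if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    survival, curve = 1.0, []
    for t in sorted(deaths):
        # Censored patients count as "at risk" for as long as they were followed.
        at_risk = sum(1 for u in times if u >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


def naive_survival(times, events, t):
    """Fraction surviving beyond t when censored patients are simply dropped."""
    observed = [u for u, e in zip(times, events) if e]
    return sum(1 for u in observed if u > t) / len(observed)


# Hypothetical toy cohort: follow-up in months and event indicator.
times = [6, 12, 12, 20, 30, 30]
events = [True, True, False, True, False, False]

curve = kaplan_meier(times, events)
km_at_20 = curve[-1][1]
naive_at_20 = naive_survival(times, events, 20)
print(f"Kaplan-Meier S(20) = {km_at_20:.3f}, naive S(20) = {naive_at_20:.3f}")
```

On this toy cohort the Kaplan-Meier estimate of survival beyond 20 months is 4/9 (about 0.44), whereas dropping the three censored patients yields 0, because only patients with an observed event remain.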

To counter these difficulties, we use specialized regression models for censored data, which are strongly regularized to cope with the high dimensionality of the data and the comparatively small number of samples. The methods we employ predict actual survival times or times to relapse for each individual patient, in years and months. This has the advantage that a censored patient, that is, a patient who has survived longer than a certain time with no event, contributes exactly this partial information to the training of the model. There is no need to manually classify patients as low-risk or high-risk for training.
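
How a censored patient contributes "just this partial information" can be sketched with a likelihood function. The example below assumes a simple exponential survival model, chosen only for brevity; the page does not specify the model form, and all names and data are hypothetical. A patient with an observed event contributes the log density of the event time, while a censored patient contributes only the log probability of surviving beyond the last follow-up.

```python
import math


def censored_log_likelihood(weights, patients):
    """Log-likelihood of an exponential survival model with right-censoring.

    patients : list of (features, time, event_observed) tuples
    The hazard rate of a patient is exp(weights . features) (an assumption).
    """
    total = 0.0
    for features, time, observed in patients:
        log_rate = sum(w * x for w, x in zip(weights, features))
        rate = math.exp(log_rate)
        if observed:
            # Event seen at `time`: log density = log(rate) - rate * time
            total += log_rate - rate * time
        else:
            # Censored at `time`: log P(T > time) = -rate * time
            total += -rate * time
    return total


# Hypothetical cohort with a single constant feature (intercept only).
cohort = [
    ([1.0], 6.0, True),
    ([1.0], 12.0, True),
    ([1.0], 12.0, False),
    ([1.0], 20.0, True),
    ([1.0], 30.0, False),
    ([1.0], 30.0, False),
]

# For this intercept-only model the maximum-likelihood rate is
# (number of events) / (total follow-up time) = 3 / 110.
best_log_rate = math.log(3 / 110)
print(censored_log_likelihood([best_log_rate], cohort))
```

With gene expression values as features, the same function would score a full weight vector; no patient has to be labeled as low-risk or high-risk beforehand.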

To deal with the high dimensionality (large number of genes), we employ a Bayesian statistical framework and place a prior distribution on the regression parameters that drives the learning algorithm toward sparse solutions, that is, solutions in which most of the genes have no or only minimal influence on the survival prediction. This helps to avoid overfitting, where the learning algorithm tunes to noise in the data instead of true biological signal.
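
The effect of a sparsity-inducing prior can be illustrated on single regression weights. The sketch below assumes a Laplace prior as a stand-in, since the exact prior is not named here, and compares it with a Gaussian prior. For one weight with a quadratic data term, the posterior mode under the Laplace prior is the soft-thresholding rule, which sets weak weights to exactly zero; the Gaussian prior only shrinks them. The input values and the prior strength are hypothetical.

```python
def map_laplace(z, strength):
    """Posterior mode of one weight under a Laplace prior.

    Minimizes 0.5 * (w - z)**2 + strength * abs(w) (soft-thresholding).
    """
    if z > strength:
        return z - strength
    if z < -strength:
        return z + strength
    return 0.0


def map_gaussian(z, strength):
    """Posterior mode under a Gaussian prior.

    Minimizes 0.5 * (w - z)**2 + 0.5 * strength * w**2.
    """
    return z / (1.0 + strength)


# Hypothetical unregularized weight estimates for five genes.
unregularized = [2.5, -0.3, 0.1, -1.8, 0.05]

sparse = [map_laplace(z, 0.5) for z in unregularized]
dense = [map_gaussian(z, 0.5) for z in unregularized]

print("Laplace prior :", sparse)
print("Gaussian prior:", dense)
```

Under the Laplace prior, the three weak weights become exactly zero and only the two strong genes remain in the model, while the Gaussian prior keeps all five weights non-zero.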

The Bayesian posterior distribution can then be evaluated to obtain relevances for individual weights, and to make predictions for new patients. This can be done using maximization algorithms (gradient descent, simulated annealing, genetic algorithms and others), or Markov Chain Monte Carlo approaches. While the former class of approaches is computationally more efficient, the Markov Chain Monte Carlo method allows the analysis of the full distribution over model parameters and survival times.
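
The difference between the two classes of approaches can be sketched on a one-parameter toy posterior. The example assumes an exponential survival model with a broad Gaussian prior on the log hazard rate, with hypothetical summary data (3 events in 110 patient-months); it is an illustration only and not the implementation used in our work. Gradient ascent returns a single point, the posterior mode, whereas a random-walk Metropolis sampler returns samples from which means and credible intervals can be read off.

```python
import math
import random

EVENTS, TOTAL_TIME = 3, 110.0  # hypothetical: 3 relapses in 110 patient-months


def log_posterior(theta):
    """Unnormalized log posterior of theta = log(hazard rate).

    Censored exponential likelihood plus a N(0, 10^2) prior (assumptions).
    """
    return EVENTS * theta - math.exp(theta) * TOTAL_TIME - theta ** 2 / 200.0


def gradient(theta):
    """Derivative of log_posterior with respect to theta."""
    return EVENTS - math.exp(theta) * TOTAL_TIME - theta / 100.0


def map_estimate(theta=-5.0, step=0.1, iterations=2000):
    """Posterior mode by plain gradient ascent."""
    for _ in range(iterations):
        theta += step * gradient(theta)
    return theta


def metropolis(n_samples=20000, proposal_sd=0.5, seed=0):
    """Random-walk Metropolis sampler for the same posterior."""
    rng = random.Random(seed)
    theta = -3.0
    current = log_posterior(theta)
    samples = []
    for _ in range(n_samples):
        candidate = theta + rng.gauss(0.0, proposal_sd)
        candidate_lp = log_posterior(candidate)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate_lp - current)):
            theta, current = candidate, candidate_lp
        samples.append(theta)
    return samples


theta_map = map_estimate()
samples = sorted(metropolis()[2000:])  # discard burn-in, sort for quantiles
posterior_mean = sum(samples) / len(samples)
lower = samples[int(0.025 * len(samples))]
upper = samples[int(0.975 * len(samples))]

print(f"MAP estimate of log rate : {theta_map:.3f}")
print(f"Posterior mean (MCMC)    : {posterior_mean:.3f}")
print(f"95% credible interval    : [{lower:.3f}, {upper:.3f}]")
```

The maximization needs far fewer evaluations of the posterior, but only the sampler exposes the skew and spread of the distribution, which in a survival model translates into uncertainty about the predicted survival time.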

A current focus of our work is on the integration of different data types in such predictive models, for example, the prediction based on genomic and gene expression profiles, and including further biological knowledge through adequately chosen prior distributions.

- Dr. Benedikt Brors, Prof. Dr. R. Eils, German Cancer Research Center
- Dr. Andre Oberthuer, Dr. Matthias Fischer, Prof. Dr. F. Berthold, University Clinic Cologne
- Dr. Alexander Schramm, Prof. Dr. A. Eggert, University Clinic Essen
- Dr. Frank Westermann, German Cancer Research Center
- Dr. Thomas Zander, Prof. Dr. Joachim Schultze, University Clinic Cologne
- Prof. Dr. Rainer Schrader, University of Cologne

- J.-M. Löhr, R. Faissner, D. Koczan, P. Bewerunge, C. Bassi, B. Brors, R. Eils, L. Frulloni, A. Funk, W. Halangk, R. Jesenofsky, **L. Kaderali**, J. Kleef, B. Krüger, M. M. Lerch, R. Lösel, M. Magnani, M. Neumaier, S. Nittka, M. Sahin-Toth, J. Sänger, S. Serafini, M. Schnölzer, H.-J. Thierse, S. Wandschneider, G. Zamboni, G. Klöppel (2010). *Autoantibodies Against the Exocrine Pancreas in Autoimmune Pancreatitis: Gene and Protein Expression Profiling and Immunoassays Identify Pancreatic Enzymes as a Major Target of the Inflammatory Process.* Am. J. Gastroenterology 105(9):2060-71.
- J. H. Schulte, B. Schowe, P. Mestdagh, **L. Kaderali**, **P. Kalaghatgi**, S. Schlierf, B. Brockmeyer, K. Pajtler, T. Thor, K. de Preter, F. Speleman, K. Morik, A. Eggert, J. Vandesompele, A. Schramm (2010). *Accurate Prediction of Neuroblastoma Outcome based on miRNA Expression Profiles.* International Journal of Cancer 127(10):2374-85.
- A. Oberthuer, B. Hero, F. Berthold, D. Juraeva, A. Faldum, Y. Kahlert, S. Asgharzadeh, R. Seeger, P. Scaruffi, G. P. Tonini, I. Janoueix-Lerosey, O. Delattre, G. Schleiermacher, J. Vandesompele, J. Vermeulen, F. Speleman, R. Noguera, M. Piqueras, J. Benard, A. Valent, S. Avigad, I. Yaniv, A. Weber, H. Christiansen, R. Grundy, K. Schardt, M. Schwab, R. Eils, P. Warnat, **L. Kaderali**, T. Simon, B. DeCarolis, J. Theissen, F. Westermann, B. Brors, M. Fischer (2010). *The Prognostic Impact of Gene Expression-Based Classification for Low and Intermediate Risk Neuroblastoma.* Journal of Clinical Oncology 28(21):3506-15.
- A. Schramm, I. Mierswa, **L. Kaderali**, K. Morik, A. Eggert, J. H. Schulte (2009). *Reanalysis of Neuroblastoma Expression Profiling Data using improved Methodology and extended Follow-Up increases Validity of Outcome Prediction.* Cancer Letters, in press, doi:10.1016/j.canlet.2009.02.052.
- A. Oberthür*, **L. Kaderali***, Y. Kahlert, B. Hero, F. Westermann, F. Berthold, B. Brors, R. Eils, M. Fischer (2008). *Sub-classification and individual survival time prediction from gene-expression data of neuroblastoma patients using CASPAR.* Clinical Cancer Research 14(20):6590-6601.
- M. Zapatka, **L. Kaderali** (2007). *Statistische Analyse von Daten klinischer Studien.* GynSpectrum 3, 18-19.
- **B. Heinzel** (2008). *Ein Markov-Chain-Monte-Carlo Ansatz zur Klassifikation und Überlebenszeitvorhersage von Krebspatienten auf Basis von Genexpressionsdaten.* Diploma Thesis, Applied University Weihenstephan, 2008.
- **L. Kaderali** (2007). *Individualized Predictions of Survival Time Distributions from Gene Expression Data using a Bayesian MCMC Approach.* S. Hochreiter and R. Wagner (Eds.): BIRD 2007, LNBI Lecture Notes in Bioinformatics 4414, 77-89.
- A. Schramm, J. Vandesompele, J. H. Schulte, S. Dreesmann, **L. Kaderali**, B. Brors, R. Eils, F. Speleman, A. Eggert (2007). *Translating Expression Profiling into a Clinically Feasible Test to Predict Neuroblastoma Outcome.* Clinical Cancer Research 13(5), 1459-1465.
- **L. Kaderali**, T. Zander, U. Faigle, J. Wolf, J. L. Schultze, R. Schrader (2006). *CASPAR: A Hierarchical Bayesian Approach to predict Survival Times in Cancer from Gene Expression Data.* Bioinformatics 22, 1495-1502.