David Gnabasik

This is my first year as a Ph.D. candidate at CU-Boulder after receiving my master's in computer science and a certificate in computational biology at CU-Denver. I enjoy playing competitive soccer and the cello when not snowboarding with my two sons, Ryan and Francisco.

I'm fortunate to be currently (Dec.07) working with three different people:
  1. Dr. Steve Mark of the Biostatistics department of the Health Sciences Center and I are analyzing SEER cancer data under a new logistic multinomial model.
  2. Dr. Gita Alaghband of CU-Denver (CS) and I have recently submitted a proposal to the Dept. of Education for a computer-assisted program for discovering scientific and mathematical concepts, called "The Scientific Assistant". I also assist her in the CU REACH program – Recruiting Engineers to Succeed.
  3. Dr. Mark Duncan of the Duncan Proteomics Lab at the University of Colorado Anschutz Medical Campus (Fitzsimons) and I have just submitted a proposal to the NIH entitled "Statistical Management of Proteomic Biomarker Discovery and Analysis (SMPBDA)".


dGnabasik questionnaire
dGnabasik colloquia review 1
dGnabasik colloquia review 2
David Gnabasik HW8.rtf



Review: Statistical Analysis of Proteomic Mass Spectrometry Data
By Kelly Handley, University of Nottingham, thesis, statistics department


When searching for disease biomarkers in proteomic data, the aim is to find a small subset of the available mass/charge (m/z) values that correctly classifies disease state. The thesis attempts to use mass spectra to differentiate between drug-treated breast cancer cell lines and non-treated controls. Handley uses four methods to reduce the dimensionality of the datasets and to make the problem tractable:
• create an algorithm to identify spectra peaks;
• use a variety of deterministic classification methods to classify new spectra;
• estimate parameters (peak locations, heights, variances) by fitting a Gaussian multi-level parametric model using a Bayesian Markov Chain Monte Carlo (MCMC) algorithm;
• use the averaged parameter estimates to identify important m/z values.
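
To make the third step concrete, here is a minimal sketch of the kind of mean function a Gaussian peak model implies: each peak contributes a bell curve defined by its location, height and variance. This is an illustration only, not Handley's actual model; the function name `peak_model`, the flat `baseline` term and all parameter values are assumptions of mine.

```python
import math

def peak_model(mz, peaks, baseline=0.0):
    """Expected intensity at one m/z value under a sum-of-Gaussian-peaks model.

    peaks is a list of (location, height, variance) tuples. The flat baseline
    is an assumption made for this sketch.
    """
    total = baseline
    for location, height, variance in peaks:
        total += height * math.exp(-((mz - location) ** 2) / (2.0 * variance))
    return total

# Hypothetical parameters: two peaks, at 2500 Da and 4000 Da.
peaks = [(2500.0, 10.0, 25.0), (4000.0, 4.0, 100.0)]

# Evaluate the model on a grid of m/z values from 2000 to 5000 in steps of 10.
spectrum = [peak_model(mz, peaks) for mz in range(2000, 5001, 10)]
```

An MCMC fit would treat the tuples in `peaks` as unknowns and sample them given the observed intensities; the sketch only shows what is being fitted.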

Handley recognizes four significant objections to this type of analysis:
• the very low reproducibility of experimental results;
• the exclusion of m/z values below 2,000 Daltons as noise;
• the effects of improper and inconsistent sample handling;
• the unavoidable excess of m/z values over the number of experimental spectra, which invites overfitting.

Both Principal Components Analysis (in which the data are projected linearly onto lower-dimensional subspaces in which they show maximal variation) and Independent Components Analysis (in which the observed random data are expressed as linear combinations of independent components) reduce the dimensionality of the spectral data. The reduced data can then be subjected to various classification methods such as discriminant analysis, support vector machines and k-nearest-neighbor classification, all of which are subject to severe constraints. The use of MCMC methods enables a parametric model to be fitted to the data.
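
The projection idea behind PCA can be sketched in a few lines. The example below finds only the first principal component, using power iteration on the sample covariance matrix, and is not taken from the thesis. The toy "spectra" (three intensities each) and the function names are assumptions for illustration; a real analysis would use a linear algebra library.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction of maximal variance, column means).

    Uses power iteration from an all-ones start vector, which would fail if
    that vector were exactly orthogonal to the leading direction.
    """
    n = len(rows)
    dim = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in rows]

    # Sample covariance matrix (dim x dim).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]

    # Repeatedly multiply by the covariance matrix and renormalize.
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # the data have no variance at all
        vec = [x / norm for x in nxt]
    return vec, means

def project(row, component, means):
    """Score of one spectrum along the component (a single number)."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))

# Toy data: the second intensity is always twice the first, the third is constant.
rows = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
component, means = first_principal_component(rows)
scores = [project(row, component, means) for row in rows]
```

Here the three intensities per spectrum collapse to one score each with no loss of information, because all the variation lies along a single direction. Real spectra will not be this tidy.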

The modeling problem is how to model the intensities at each m/z value within a particular spectrum. Spectra can be classified into groups using differing peak location and height information. The peak-finding algorithm does a good job of picking out peak locations for the breast cancer data set on day one, but degrades thereafter. Overall, several of the classifications provided good results over both datasets within specific mass (Dalton) ranges, and some m/z values were identified which exhibited significant differences in protein expression between groups.
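
For readers unfamiliar with peak finding, the simplest possible version is sketched below: a point counts as a peak if it is a strict local maximum above a height threshold. This is a deliberately naive stand-in, not Handley's algorithm, and the function name, threshold and toy values are assumptions.

```python
def find_peaks(mz_values, intensities, min_height=0.0):
    """Return (m/z, intensity) pairs that are strict local maxima above min_height."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        here = intensities[i]
        is_local_max = here > intensities[i - 1] and here > intensities[i + 1]
        if is_local_max and here > min_height:
            peaks.append((mz_values[i], here))
    return peaks

# Toy spectrum with one clear peak and one small bump.
mz_values = [2000, 2010, 2020, 2030, 2040, 2050, 2060]
intensities = [1.0, 3.0, 9.0, 4.0, 2.0, 2.5, 1.0]

all_peaks = find_peaks(mz_values, intensities, min_height=2.0)
tall_peaks = find_peaks(mz_values, intensities, min_height=3.0)
```

Raising `min_height` drops the small bump at 2050, which shows how sensitive the resulting peak list is to a single tuning choice.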

However, I object to many of the computational and mathematical simplifications made by Handley, such as using a splitting algorithm to separate m/z values into distinct sections, restricting how far a peak is allowed to have an influence, using different parameter values for each section, and assuming that valuable information only exists at and around peaks. These shortcuts abstract away much of the biological significance of the data, since small changes in certain protein levels are biologically important. The absence of a protein in the result set does not mean it is not expressed or present. Furthermore, I doubt that correct inference can be made for differences between groups when it is based on only 6 observations per group.
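
A back-of-the-envelope calculation shows why 6 observations per group is worrying. Assume a single m/z value carries no real group difference and takes continuous values with no ties. Then every ordering of the 12 observations is equally likely, and the chance that the value separates the two groups perfectly is 2 / C(12, 6). The number of m/z values used below (10,000) is a hypothetical figure, not one taken from the thesis.

```python
from math import comb

def chance_of_perfect_separation(n_a, n_b):
    """Probability that one irrelevant feature perfectly separates two groups.

    Assumes no ties and no real group difference, so each of the
    comb(n_a + n_b, n_a) assignments of group labels to ranks is equally
    likely and exactly 2 of them separate the groups completely.
    """
    return 2 / comb(n_a + n_b, n_a)

p = chance_of_perfect_separation(6, 6)   # 2 / 924
n_mz = 10_000                            # hypothetical number of m/z values
expected_false_hits = n_mz * p           # expected count of spurious separators
```

Under these assumptions, roughly 20 m/z values out of 10,000 would separate the groups perfectly by chance alone. Apparent discriminators found at this sample size therefore need independent validation.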

But applying multilevel modeling to proteomic MS data is certainly necessary. Multilevel modeling is a useful technique to apply when the data have an obvious hierarchical structure, such as analyzing m/z values within spectra. Also, using mixed effect models means that the data are described using a much smaller number of parameters than there are original data points. This is all the more necessary considering that the dynamic range of protein abundance is greater than the dynamic range of any analytical tool by at least 10 orders of magnitude, and that proteomics requires multiple technologies – there is no optimal, standardized technology. Conclusions from proteomic studies must always be taken in context of the limitations of the approach and the complexity of the system.