Optimization

Optimization is a general framework to model many real-world and theoretical problems. In the case that the objective function is convex over the region of interest, any local minimum will also be a global minimum, and there exist robust, fast numerical techniques for optimizing twice differentiable convex functions. In the more general case, however, when the convexity and differentiability of the objective function can no longer be guaranteed, there do not exist general effecient and effective numerical techniques to find global minimums. In this case, a heuristic technique must be used.

The simulated annealing strategy, the genetic algorithms and evolutionary strategies, as well as other techniques motivated by observations of natural phenomena, have demonstrated favorite performance in dealing with many real-world complicated optimization problems. These strategies can even be more efficient when integrated with powerful distributed and parallel computing techniques.

The theoretical part of this research would eventually aim at designing parameter free frameworks and distrubuted parallel diagrams for these general heuristic optimization techniques. Starting with analyzing the influence of parameters on the performance of these methods on benchmark numerical and combinatorial optimization problems, we would eliminate irrelevant parameters and significantly reduce the number of parameters used in these methods. Starting with analyzing how these methods can escape from local minimums, we could design effetive parallel diagrams which can enable these methods to more efficiently utilize computing resources.

The application part of this research would involve the solution of many scientific and engineering problems which heavily rely on optimizing certain objective functions. Particularly, we well focus on formulating computational biology and computational finance problems to optimization problems and solving them using the developed high performance and easy to use optimization framework.

Statistical learning

Kernel machines have been getting more and more wide usage in computational biology and bioinformatics, while the feature selection problem is of the central importance in most of these applications. Bayesian approaches with kernel machines have been successfully developed to integrate the feature selection procedure into the kernel machines. Nevertheless the support vector machine is not computational economy due to its intrinsic quadratic programming characteristic, whereas the requirement of integrating out irrelevant parameters in Bayesian approaches often makes the methods very complicated and suffering from the lack of close forms for most functions, although various approximation methods have been proposed.

The random forests, as a typical ensemble predictive learning method, have gained the application focus in recent years, due to the computational economy intrinsic of tree-based methods and the integrated random feature selection scheme. Meanwhile, other statistical learning methods such as the boosting have also gained focus in various applications.

In the theoretical part of this research, we would focus on developing new statistical learning diagrams which can integrate feature selection procedure and are efficient in computation. As an example, a Bayesian sparse logistic rule fitting approach has been developed and demonstrated preferred properties, motivated by the understanding that each tree in a random forest can be thought of as composed of many non-overlap rules, and consequently fitting many simple rules to the problem of interest can possibly achieve both excellent prediction accuracy and preferred interpretability.

In the application part of this research, we would focus on customizing exsiting statistical learning methods to solve computational biology and bioinformatics problems. For example, for the random forests, a sample space splitting technique had been adopted and the results showed significant improvement in the prediction performance in terms of the area under the receiver operating characteristic curve (AUC) and the balanced error rate (BER).

Identification of genetic variants causing complex diseases

The identification of causal genes and inherited genetic variation is the primary step towards the prevention, diagnosis and treatment of complex diseases. Family-based linkage analysis and population-based association studies have been the two major categories demonstrated remarkable successes in this direction but both have some disadvantages. The emergence of powerful yet low-cost sequencing techniques have making the sequencing of a large number of individuals more and more feasible and therefore putting forward ever-greater challenges to these classical statistical learning methods. For example, linkage analysis works well for rare Mendelian diseases but has limited power when a single locus fails to explain a significant fraction of a disease, while association studies suffer from serious multiple testing problem. Indeed, the demand of large amount of data has translated into a need for sophisticated tools integrating biophysical and biochemical knowledge with statistical methods for effectively identifying genetic variants causing complex diseases.

The theoretical part of this research involves understanding how genetic variants could affect transcription of DNA, translation of mRNA, protein structures, and protein post translational modification, exploring the biophysical and biochemical knowledge that can explain a significant part of current disease-causing genetic variants, and developing new statistical learning diagrams which can efficiently handle categorical and numerical characteristics derived from physiochemical properties of biopolymer sequences.

The application part of this research includes automatical collection of a vast number of genetic variants from various databases, integration of these data into a single database, development of programs for automatical calculation of categorical and numerical properties for the collected genetic variants (based on the theoretical results), and implementation of stand-alone and web-based user-friendly interactive applications for predicting the probability of suffering from a certain disease for individuals whose whole-genome information is available.

Network Biology

Biological functions usually raise from combined effects of multiple genes or gene products and their interactions with environmental factors. These interactive effects are encoded in transcriptional regulatory networks, protein-protein interaction networks, as well as all kinds of functional pathways. Studies of biological interaction networks are therefore crucial for the understanding of biological systems.

The first step in this research includes the obtain of biological interaction networks from literature and from high-throughput data. In order to obtain networks from literature, data mining in text documents for genes and gene products would be necessary, and basic natural language processing techniques would be helpful. In order to recover networks from high-throughput data, reliability of the data need to be analyzed and noises in the data need to be eliminated.

With networks obtained from literature or from high-throughput data, the theoretical part of this research involves statistical analysis of global and local characteristics of the networks. As for global statistical properties, the power-low distribution of node degrees in these networks and the small world effect existing in these networks need to be analyzed. As for local statistical properties, network motifs need to be identified. For example, with the existence of intrinsic and/or experimental uncertainties in the networks, network motifs would also be stochastic and new methods for identifying network motifs need to be explored. Finite mixture models, Bayesian models, expectation-maximization (EM) algorithms, and Markov Chain Monte Carlo strategies would be neccessary in order to identify stochastic network motifs.

The application part of this research involves exploring the meaning of the recovered biological networks and applying the recovered networks to solve problems in computational biology. For example, identification of pathways determing certain biological functions would be a direct application of the recovered networks. In many times, supporting evidences from other data sources such as gene expression would also be neccessary in this step and in this case the recovery of biological networks and the analysis of recovered networks can be thought of as the first step towards systems biology.

Systems Biology

A biological system is composed of many sub-systems, and systems biology studies dynamic interactions of these sub-systems and how these interactions give rise to the function and behavior of that system.

Data integration would be the first step towards systems biology. Biological data often come from highly heterogeneous sources, including categorical values such as the annotations from various database and literature, continuous vector values from gene expression, complicated tree structured values from gene ontology, as well as network structured values from pathway databases. These data must be converted to a single type of data to indicate quantitative status of biological sub-systems. For this objective, multiple methods — the Pearson correlation coefficient, the Chi-square statistics, and many others — need to be adopted, and a finally data integrating scheme — the order statistics, the naive Bayesian approach, or other statistical inference and learning methods — need to be used.

With the data from heterogeneous sources being integrated to indicate status of biological sub-systems, the theoretical part of this research involves the development of mechanistic models — the reconstruction of dynamic systems from the quantitative properties and status of the sub-systems. Differential equations would be the major tool for this purpose; control theory would have its position in this study; measurement techniques and detection principles would also have their contributions.

With theoretical study results, more understanding regarding the running machenism of biological systems could be obtained. As a result, the application part of this study would include as simple as in silicon simulation of single cell systems, and as complicated as cancer treatments and general health care topics for human. For example, the development of novel therapeutics and the introduction of personalized treatments.