Functional classification of genes is a challenging problem in functional genomics for several reasons. First, each gene participates in multiple biological activities. Second, genes are classified according to a hierarchical classification scheme that represents the relationships between gene functions. Third, various biomolecular data sources, such as gene expression data and protein-protein interaction data, can be used to assign biological functions to genes.
To address these issues, this thesis proposes new algorithms for hierarchical multi-label classification, a variant of conventional classification in which instances can belong to several labels that are in turn organized in a hierarchy. The purpose of this thesis is threefold. First, we propose HML-Boosting, a Hierarchical Multi-Label classification algorithm using Boosting classifiers, for the hierarchical multi-label classification problem in the context of gene function prediction. Second, we propose HiBLADE (Hierarchical multi-label Boosting with LAbel DEpendency), a novel algorithm that exploits not only the pre-established hierarchical taxonomy of the classes but also the hidden correlation among the classes, thereby improving the quality of the predictions. The primary objective of HiBLADE is to find and share a number of base models across the correlated labels. It differs from conventional algorithms in two ways. First, it predicts multiple labels at the same time while maintaining the hierarchy constraint. Second, each classifier is built based on the label under study and its most similar sibling. Experimental results on several real-world biomolecular datasets show that the proposed method can improve the performance of hierarchical multi-label classification.
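As a small illustration of the hierarchy constraint referred to above, the following sketch keeps a label only if its own score and the scores of all its ancestors pass a threshold. It is not the HML-Boosting or HiBLADE procedure; the function name `enforce_hierarchy`, the threshold of 0.5, and the toy labels and scores are assumptions made for this example only.

```python
from typing import Dict, Optional, Set


def enforce_hierarchy(scores: Dict[str, float],
                      parent: Dict[str, Optional[str]],
                      threshold: float = 0.5) -> Set[str]:
    """Return the labels whose score, and every ancestor's score, reach the threshold."""
    memo: Dict[str, bool] = {}

    def is_positive(label: str) -> bool:
        # A label is positive only if it passes the threshold itself
        # and its parent (if any) is positive as well.
        if label not in memo:
            p = parent[label]
            memo[label] = scores[label] >= threshold and (p is None or is_positive(p))
        return memo[label]

    return {label for label in scores if is_positive(label)}


# Hypothetical two-branch hierarchy; roots have parent None.
parent = {"01": None, "01.01": "01", "01.01.03": "01.01", "02": None, "02.01": "02"}
# Hypothetical per-label scores from independently trained classifiers.
scores = {"01": 0.9, "01.01": 0.4, "01.01.03": 0.8, "02": 0.7, "02.01": 0.6}

predicted = enforce_hierarchy(scores, parent)
print(sorted(predicted))  # ['01', '02', '02.01']
```

Label "01.01.03" is rejected despite its high score because its parent "01.01" falls below the threshold, which is exactly the kind of inconsistency a flat multi-label classifier would let through.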
More important, however, is the third part, which focuses on integrating multiple heterogeneous data sources to improve hierarchical multi-label classification. We explore this integration for genome-wide gene function prediction with a novel Hierarchical Bayesian iNtegration algorithm, HiBiN, a general framework that uses Bayesian reasoning to integrate heterogeneous data sources for accurate gene function prediction. The system uses posterior probabilities to assign class memberships to samples from multiple data sources while maintaining the hierarchical constraint. We demonstrate that integrating the diverse datasets significantly improves classification quality for gene function prediction in terms of several measures, compared with the baseline methods: single-source prediction and fused-flat models. Moreover, the system has been extended with a weighting scheme that controls the contribution of each data source according to its relevance to the label under study. The results show that the new weighting scheme compares favorably with the other approach along various performance criteria.
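To make the idea of weighted Bayesian integration concrete, the sketch below combines per-source posterior probabilities for a single label in log-odds space. It is a generic naive-Bayes-style combination and not the HiBiN algorithm itself: it assumes the data sources are conditionally independent given the label, and the function name `combine_posteriors`, the prior, and the example weights are hypothetical.

```python
import math
from typing import Optional, Sequence


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def combine_posteriors(posteriors: Sequence[float],
                       prior: float,
                       weights: Optional[Sequence[float]] = None) -> float:
    """Fuse per-source posteriors for one label into a single posterior.

    Each source contributes its evidence (its log-odds minus the prior
    log-odds), scaled by a weight; a weight of 0 ignores the source.
    All probabilities must lie strictly between 0 and 1.
    """
    if weights is None:
        weights = [1.0] * len(posteriors)
    log_odds = logit(prior) + sum(
        w * (logit(p) - logit(prior)) for p, w in zip(posteriors, weights)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two hypothetical sources (e.g. expression and interaction data), uniform prior.
equal = combine_posteriors([0.8, 0.6], prior=0.5)
first_only = combine_posteriors([0.8, 0.6], prior=0.5, weights=[1.0, 0.0])
print(round(equal, 3), round(first_only, 3))  # 0.857 0.8
```

With equal weights the two sources reinforce each other and the fused posterior exceeds either input; setting the second weight to zero recovers the first source's posterior, which shows how a relevance weight can switch a source's contribution off for a given label.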