Neveen Mohammad Hijazi

PhD Abstract

Feature selection is an essential pre-processing step in machine learning: it makes large datasets tractable and increases the predictive performance of the learning model. Its main goal is to select a subset of features that carry high predictive information while eliminating irrelevant and redundant features that carry little or none, especially in large datasets.

Recently, the dimensionality of datasets used in machine learning applications has grown significantly, giving rise to the term "high-dimensional data". High-dimensional data can degrade the overall performance of learning models because they often contain irrelevant, redundant, and noisy features.

Ensemble learning has emerged as a useful machine learning technique based on the idea of combining the outputs of multiple models instead of relying on a single model; the resulting variety among models, known as "diversity", usually enhances performance. Ensemble feature selection follows the same idea: multiple feature subsets are combined to select an optimal subset of features. On the other hand, learning methods struggle with the curse of dimensionality, which degrades performance and increases running time exponentially. To overcome these issues, we propose a Parallel Evolutionary algorithm for Ensemble Feature Selection (PE-EFS) with two variants: a homogeneous and a heterogeneous approach.

The proposed approaches consist of four phases: a distribution phase, a parallel ensemble feature selection phase, a combining and aggregation phase, and a testing phase. In the first phase, the data are split using sampling with replacement and distributed across multiple cores for parallel execution. In the second phase, ensemble feature selection is performed on the cores produced in the first phase, using a wrapper feature selection method built on three well-regarded metaheuristic algorithms: the Genetic Algorithm (GA), the Particle Swarm Optimizer (PSO), and the Grey Wolf Optimizer (GWO). The homogeneous approach runs the same metaheuristic on every core, while the heterogeneous approach runs three different metaheuristics. In the third phase, the results from the cores are gathered and combined by majority voting. In the last phase, the dimensionality of the data is reduced according to the final feature subset, which, in turn, is used to predict the testing data.
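The distribution and combining phases described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the bootstrap indexing, the number of cores, and the per-core feature masks are all hypothetical, and the metaheuristic wrapper search on each core is elided (represented here only by pre-computed boolean masks).

```python
import random

def bootstrap_samples(n_rows, n_cores, rng):
    # Phase 1 (distribution): sampling with replacement produces one
    # bootstrap index set per core; each core trains on its own sample.
    return [[rng.randrange(n_rows) for _ in range(n_rows)]
            for _ in range(n_cores)]

def majority_vote(masks):
    # Phase 3 (combining): a feature enters the final subset only if
    # more than half of the core-level subsets selected it.
    n_cores = len(masks)
    n_features = len(masks[0])
    return [sum(m[j] for m in masks) > n_cores / 2
            for j in range(n_features)]

# Toy run: 6 data rows split across 3 cores.
rng = random.Random(0)
index_sets = bootstrap_samples(6, 3, rng)

# Hypothetical outputs of the per-core wrapper searches
# (e.g. GA, PSO, GWO in the heterogeneous variant): one
# boolean mask over 5 candidate features per core.
masks = [
    [True,  True,  False, False, True],
    [True,  False, True,  False, True],
    [True,  True,  False, False, False],
]
final_subset = majority_vote(masks)
# Features 0, 1, and 4 receive at least 2 of 3 votes and survive;
# phase 4 would then keep only these columns for testing.
```

Majority voting is the simplest aggregation rule consistent with the description; a real implementation could also weight each core's vote by its validation accuracy.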

Three implementations of the proposed approaches are presented: a sequential approach running on the Central Processing Unit (CPU), a parallel approach running on a multi-core CPU (P.CPU), and a parallel approach running on a multi-core CPU with Graphics Processing Units (P.GPU). During the evaluation phase, the three versions of our proposed approaches (CPU, P.CPU, and P.GPU) were tested on twenty-one large datasets from various application domains and of varying complexity. The results show that the proposed parallel approach improves performance in terms of both prediction results and running time. Furthermore, a comparison of the two approaches revealed that the heterogeneous approach has better predictive power than the homogeneous one, performing best on 90.47% of the datasets. Moreover, increasing the number of feature selectors in the ensemble improved performance on 85.71% of the datasets.

