Memetic Algorithm For Feature/Gene Selection

Introduction

Feature selection has become the focus of many research areas in recent years. With the rapid advance of computer and database technologies, datasets with hundreds or thousands of variables or features are now ubiquitous in pattern recognition, data mining, and machine learning. Processing such huge datasets is challenging because traditional machine learning techniques usually work well only on small datasets. Feature selection addresses this problem by removing irrelevant, redundant, or noisy features. It improves the performance of learning algorithms, reduces the computational cost, and provides a better understanding of the data.

Feature selection on microarray data has attracted increasing interest in many academic communities and industries over the last decade. The breakthrough in microarray technology promises new insight into the mechanisms of living systems by providing a way to simultaneously measure the activities and interactions of thousands of genes. The main difficulties in analyzing microarray data lie in its inherently noisy and high-dimensional nature. Microarray data is characterized by thousands of genes but only a small number of samples available for analysis. This makes learning from microarray data an arduous task under the curse of dimensionality. Furthermore, microarray data often contains many irrelevant and redundant features, which affect the speed and accuracy of most learning algorithms. Hence, feature selection is widely used to address these problems in microarray data analysis, where it is also known as gene selection.

We propose novel feature selection methods for gene selection using the concept of the memetic algorithm (MA). In particular, these new feature selection methods are synergies of filter and genetic algorithm (GA) wrapper methods within a memetic framework. Taking advantage of both filter and wrapper methods, they search the feature space much more efficiently and converge to small, robust feature subsets with high prediction accuracy.
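To make the overall structure concrete, the following is a minimal sketch (not the released code) of such a memetic feature-selection loop: a GA evolves binary feature masks, and each offspring is refined by a filter-guided local search before survival selection. The helper names evaluate_fitness and local_search are placeholders for a wrapper fitness function (e.g., cross-validated classifier accuracy) and a filter-based refinement step.

    import numpy as np

    def memetic_feature_selection(X, y, evaluate_fitness, local_search,
                                  pop_size=50, n_generations=100,
                                  p_cross=0.6, p_mut=0.01, seed=0):
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]
        pop = rng.integers(0, 2, size=(pop_size, n_features))  # binary feature masks
        fit = np.array([evaluate_fitness(m, X, y) for m in pop])

        for _ in range(n_generations):
            # Binary tournament selection of parents.
            a, b = rng.integers(0, pop_size, size=(2, pop_size))
            parents = pop[np.where(fit[a] >= fit[b], a, b)]

            # Uniform crossover and bit-flip mutation.
            children = parents.copy()
            for i in range(0, pop_size - 1, 2):
                if rng.random() < p_cross:
                    swap = rng.random(n_features) < 0.5
                    children[i, swap] = parents[i + 1, swap]
                    children[i + 1, swap] = parents[i, swap]
            children ^= (rng.random(children.shape) < p_mut).astype(children.dtype)

            # Memetic step: filter-guided local refinement of each offspring.
            children = np.array([local_search(c, X, y) for c in children])
            child_fit = np.array([evaluate_fitness(c, X, y) for c in children])

            # Elitist replacement: keep the best pop_size individuals overall.
            merged = np.vstack([pop, children])
            merged_fit = np.concatenate([fit, child_fit])
            keep = np.argsort(merged_fit)[::-1][:pop_size]
            pop, fit = merged[keep], merged_fit[keep]

        return pop[np.argmax(fit)]  # best feature mask found

The two methods described below differ mainly in the local_search step plugged into this loop.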

In particular, we propose a wrapper-filter feature selection algorithm (WFFSA) [1] using a memetic framework. A local search based on filter ranking is introduced to fine-tune the population of GA solutions by adding or deleting features according to univariate feature ranking information. Empirical studies of WFFSA on several commonly used UCI and microarray datasets indicate that it outperforms several recently reported feature selection methods in the literature in terms of classification accuracy, selected feature size, and search efficiency.
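For illustration, the sketch below shows a local search of this kind, in the spirit of WFFSA's add/delete moves but not necessarily identical to the operators in [1]: a mutual-information ranking guides which feature to add or drop, and a move is kept only if the wrapper fitness does not degrade.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def ranking_local_search(mask, X, y, evaluate_fitness, n_moves=3):
        # Univariate filter scores; any ranking criterion could be substituted.
        rank_score = mutual_info_classif(X, y, random_state=0)
        best, best_fit = mask.copy(), evaluate_fitness(mask, X, y)

        for _ in range(n_moves):
            excluded = np.flatnonzero(best == 0)
            selected = np.flatnonzero(best == 1)

            # Add: try switching on the highest-ranked excluded feature.
            if excluded.size:
                cand = best.copy()
                cand[excluded[np.argmax(rank_score[excluded])]] = 1
                f = evaluate_fitness(cand, X, y)
                if f >= best_fit:
                    best, best_fit = cand, f

            # Del: try switching off the lowest-ranked selected feature.
            if selected.size > 1:
                cand = best.copy()
                cand[selected[np.argmin(rank_score[selected])]] = 0
                f = evaluate_fitness(cand, X, y)
                if f >= best_fit:
                    best, best_fit = cand, f

        return best

Such a function can be plugged into the framework sketch above as local_search, e.g., via functools.partial(ranking_local_search, evaluate_fitness=my_fitness).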

An alternative to univariate ranking methods is the use of the Markov blanket. The Markov blanket filter is a cross-entropy-based technique capable of identifying both redundant and irrelevant features. Using this cross-entropy-based method, we propose an alternative MA feature selection method, the Markov blanket-embedded GA (MBEGA) [2], which hybridizes the Markov blanket filter with a GA wrapper. Empirical studies on microarray data suggest that MBEGA is more effective and efficient in eliminating irrelevant and redundant genes. A detailed comparison with filter, wrapper, and standard GA methods shows that MBEGA gives the best compromise among all four evaluation criteria, i.e., classification accuracy, number of selected genes, computational cost, and robustness.
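As a hedged illustration, the sketch below shows one common way to approximate a Markov blanket redundancy check on discretized features, here using symmetrical uncertainty; the exact cross-entropy criterion used in [2] may differ. A selected feature is dropped when another selected feature is at least as relevant to the class and subsumes it.

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def symmetrical_uncertainty(a, b):
        # SU(a, b) = 2 * I(a; b) / (H(a) + H(b)); mutual_info_score(a, a) equals H(a).
        h_a = mutual_info_score(a, a)
        h_b = mutual_info_score(b, b)
        return 2.0 * mutual_info_score(a, b) / (h_a + h_b + 1e-12)

    def remove_redundant(mask, X_disc, y):
        # Drop selected features covered by an approximate Markov blanket, i.e. by
        # another selected feature that is at least as relevant to the class and
        # at least as strongly correlated with the feature itself.
        selected = list(np.flatnonzero(mask == 1))
        su_class = {i: symmetrical_uncertainty(X_disc[:, i], y) for i in selected}

        for i in sorted(selected, key=lambda k: su_class[k]):  # weakest features first
            for j in selected:
                if j == i:
                    continue
                su_ij = symmetrical_uncertainty(X_disc[:, i], X_disc[:, j])
                if su_class[j] >= su_class[i] and su_ij >= su_class[i]:
                    selected.remove(i)
                    break

        new_mask = np.zeros_like(mask)
        new_mask[selected] = 1
        return new_mask

Here X_disc is assumed to be a discretized copy of the expression matrix (e.g., after binning), since the entropy-based scores require discrete variables.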

Software/Source Code Download

MAFS (WFFSA and MBEGA)

For any enquiries, please email me at zhuzx@szu.edu.cn. More contact information can be found on my homepage.

References

  1. Zexuan Zhu, Y. S. Ong, and M. Dash, “Wrapper-Filter Feature Selection Algorithm Using a Memetic Framework”, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol. 37, No. 1, pp. 70-76, 2007.
  2. Zexuan Zhu, Y. S. Ong, and M. Dash, “Markov Blanket-Embedded Genetic Algorithm for Gene Selection”, Pattern Recognition, Vol. 40, No. 11, pp. 3236-3248, 2007.
  3. Zexuan Zhu and Y. S. Ong, “Memetic Algorithms for Feature Selection on Microarray Data”, 4th International Symposium on Neural Networks (ISNN 2007), Nanjing, China, June 3-7, 2007.