I Introduction
High dimensional data can pose a significant challenge to
learning methods due to the curse of dimensionality [1].
Feature selection is a prominent dimensionality reduction technique that selects a small subset of features
based on certain relevancy criteria. Apart from reducing data dimensionality,
feature selection provides insights into the data, prevents overfitting and reduces computational
costs for learning, which ultimately results in better learned models.
Depending on whether label information is available, feature selection can be classified into two categories:
supervised and unsupervised. Supervised feature selection procedures are broadly classified into three groups: wrapper, filter,
and embedded methods [2].
Wrapper procedures select features for a specific learning model.
Filter methods on the other hand are classifier agnostic.
Feature selection and model learning are treated as two separate steps. These procedures rely on statistical
characteristics of the data such as correlation, distance and information, to select the most important features.
Embedded procedures incorporate feature selection as part of the learning model, as seen in neural nets.
We focus on model-independent filter procedures for feature selection because of their classifier independence, simplicity and computational efficiency [3]. Specifically, we consider Mutual Information (MI) based criteria for feature selection. MI is a probabilistic measure that captures the ‘correlation’ between random variables (see Figure 1). Whereas standard correlation captures only linear relationships between variables, MI can capture nonlinear dependencies between variables [4].
Since our aim is to develop better classifier models using feature selection, we select the subset of features that together have the highest MI with the class variable. Estimating the MI of a subset of features requires the estimation of high dimensional joint probability distributions, which in turn necessitates exponentially large amounts of data. We circumvent this hurdle by invoking the assumption of conditional independence between the features. This reduces the MI estimation problem to a Conditional Mutual Information (CMI) estimation problem involving three variables at a time. However, selecting the subset of features with the highest CMI turns out to be a very hard combinatorial problem. We model feature selection as a Binary Quadratic Problem (BQP), which is NP-hard. We introduce approximations to solve the BQP, taking inspiration from solutions to other related NP-hard problems such as Sparse PCA and Densest Subgraph. We also compare our procedure with less optimal, but computationally more efficient, iterative procedures for feature selection, and evaluate the selected features using classification accuracies across multiple datasets.
Motivation: The theoretical underpinnings of using CMI for feature selection have not been clearly outlined in the literature. Moreover, current global feature selection methods such as Semidefinite Programming (SDP) do not scale well. We therefore aim to provide better insight into the approximations and assumptions involved in CMI based feature selection, and to introduce efficient approximations, drawn from related problems, to the NP-hard problem of feature selection.
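As a quick numerical illustration of the claim that MI captures nonlinear dependence where correlation fails, consider y = x² (synthetic data and a simple plug-in histogram MI estimator; this estimator is an illustration, not the one used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50_000)
y = x ** 2                       # deterministic, but not linear, dependence

# Pearson correlation is ~0: it sees no *linear* relationship
corr = np.corrcoef(x, y)[0, 1]

# A coarse plug-in MI estimate from a 2-D histogram detects the dependence
pxy, _, _ = np.histogram2d(x, y, bins=20)
pxy /= pxy.sum()
px = pxy.sum(axis=1, keepdims=True)   # marginal of x
py = pxy.sum(axis=0, keepdims=True)   # marginal of y
nz = pxy > 0
mi_xy = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
```

Here `corr` comes out near 0 while `mi_xy` is well above 0 (in nats), which is why MI-based relevancy criteria can pick up features that a correlation filter would discard.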
Contributions: We demonstrate that the class posterior distribution can be approximated by selecting a subset of variables with the highest MI with the class variable. We prove that the MI between a subset of variables and the class random variable reduces to CMI between pairs of variables under the assumptions of conditional and class-conditional independence. We model the NP-hard problem of feature selection as a Binary Quadratic Problem (BQP) and demonstrate our feature selection method across multiple datasets.
The rest of our paper is organized as follows. In Sec. II, we outline the problem of feature selection and formulate the BQP and its approximate solutions. Sec. III compares our work with existing feature selection procedures. We conclude with experiments in Sec. IV followed by discussion in Sec. V.
II Feature Selection using Conditional Mutual Information
We consider the standard classification setting where we are presented with i.i.d data . Each data point , is regarded as an instance of a set of continuous random variables . We interchangeably use the terms features or variables to denote or its subsets. The dependent class label
, is considered to be an instance of a discrete random variable
, that takes values in . We define to be a subset of feature indices, where and , the subset of features indexed by . Likewise, is the subset of left over indices, i.e. and is the subset of leftover features indexed by , i.e. . We consider, to be the posterior probability on the dataset
. We now state our problem.
Problem II.1
To estimate the optimal subset , of feature indices, such that is a good approximation to .
Theorem II.1
, if and only if Mutual Information
Joint probability can be factored as
(1) 
When viewed as elements in a Bayesian Network, the random variables can be thought to form a Markov Chain
where is conditionally independent of given . Therefore, the CMI . With , the necessary condition is proved. The statements in the proof are also true in the reverse order. Theorem II.1 provides the justification for choosing an MI based approach to determining the optimal . depends only on and is a constant that is shared between and . It is very likely that is a nonzero value for every except for the most trivial case with . For a fixed value of , we therefore try to estimate the optimal subset that maximizes .
II-A The Binary Quadratic Problem
Given a subset of features , in order to evaluate , we need to estimate the joint probability distribution
, which is often intractable. To simplify the estimation of the joint distribution, we introduce the assumption of conditional independence between features in the spirit of Naïve Bayes and
[5]. We first introduce a new term as the subset without the index , also denoted as . We now outline the assumption that will simplify the joint distribution.
Assumption II.1
For a set of selected features , the features are conditionally independent and class-conditionally independent given , i.e. and .
Theorem II.2
If Assumption II.1 is true, and then,
The MI between and is given by:
(2) 
In the above derivation, we have first applied the MI chain rule. We have then applied the MI rule and expressed the two trailing MI terms in terms of entropy. Applying Assumption II.1 to (2), the high order entropy terms can be replaced with summations and reduced further using , to yield,
(3)
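The symbols in this derivation were lost in extraction; under an assumed (hypothetical) notation, with $X_S$ the selected features, $C$ the class variable, and $i \in S$ a fixed index, the reduction has the following shape:

```latex
\begin{align*}
I(X_S; C) &= I(X_i; C) + I(X_{S\setminus i}; C \mid X_i) \\
          &= I(X_i; C) + H(X_{S\setminus i} \mid X_i) - H(X_{S\setminus i} \mid X_i, C) \\
          &= I(X_i; C) + \sum_{j \in S\setminus i}\big[H(X_j \mid X_i) - H(X_j \mid X_i, C)\big] \\
          &= I(X_i; C) + \sum_{j \in S\setminus i} I(X_j; C \mid X_i),
\end{align*}
```

where the first step is the MI chain rule and the third step uses the conditional and class-conditional independence of Assumption II.1 to split the joint entropies into sums over individual features.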
Since we would like to select that maximizes (3), we can formulate the global feature selection problem as,
(4) 
This is equivalent to the constrained Binary Quadratic problem,
(BQP) 
where, is a nonnegative matrix with = and = and the nonzero indices of the solution , constitute .
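The exact entries of the matrix were lost in extraction; a construction consistent with the derivation above (and with the related formulation in [10]) sets the diagonal to the relevance I(X_i;C) and the off-diagonals to a symmetrized pairwise CMI. A sketch under that assumption, for discretized (integer-coded) features:

```python
import numpy as np

def entropy(*cols):
    """Plug-in joint Shannon entropy (in nats) of integer-coded columns."""
    joint = np.stack(cols, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mi(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def cmi(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)

def build_Q(X, c):
    """Nonnegative symmetric matrix: relevance on the diagonal, symmetrized
    pairwise CMI off the diagonal (an assumed construction; see lead-in)."""
    n, d = X.shape
    Q = np.zeros((d, d))
    for i in range(d):
        Q[i, i] = mi(X[:, i], c)
        for j in range(i + 1, d):
            Q[i, j] = Q[j, i] = 0.5 * (cmi(X[:, i], c, X[:, j])
                                       + cmi(X[:, j], c, X[:, i]))
    return Q
```

The plug-in estimates involve at most three variables at a time, which is exactly what makes the conditional-independence reduction attractive over estimating the full joint distribution.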
II-B Solving the Binary Quadratic Problem
We aim to solve the BQP, where is a symmetric and possibly indefinite matrix with nonnegative elements, i.e., for all . Although this problem is well defined, it is highly nonconvex due to the nonconvex constraint. It is also known that this binary quadratic problem is NP-hard [6], so finding the optimal value is difficult in practice. We therefore aim to find an approximate solution to the BQP. Approximation methods such as linear relaxation, spectral relaxation, semidefinite programming, the truncated power method and low rank bilinear approximation have been applied to solve this family of NP-hard problems. We will now briefly review some of these relaxation methods.
II-B1 Linear Relaxation
By introducing a new variable , we are able to linearize the quadratic term. With , we formulate the following:
(LP1)  
LP1 is equivalent to the BQP and is therefore also NP-hard. LP1 is simplified by relaxing . One of the optimality conditions is then given by, . This relaxation reduces LP1 to LP2, which is given by,
(LP2)  
Since , the maximum value for Linear is equivalent to the largest column (or row) sum of . The solution to Linear guarantees a tight lower bound to the BQP (see appendix). In our work, we use the solution from Linear to initialize the input to other algorithms.
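How the Linear solution seeds the other algorithms is sketched below under a simple reading: since the relaxed objective is governed by column sums of the nonnegative Q, a natural k-sparse warm start is the indicator of the k largest column sums (the function name and the top-k reading are our assumptions):

```python
import numpy as np

def linear_init(Q, k):
    """k-sparse 0/1 warm start from the Linear relaxation:
    select the k columns of Q with the largest column sums."""
    scores = Q.sum(axis=0)                # column sums of the nonnegative Q
    x0 = np.zeros(Q.shape[0])
    x0[np.argsort(scores)[-k:]] = 1.0     # indicator of the top-k columns
    return x0
```

This costs only O(d²) time, which is why it is cheap enough to use as an initializer for the iterative methods below.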
II-B2 Truncated Power Method
The truncated power (TPower) method aims to find the largest sparse eigenvector. Given a positive semidefinite matrix , the largest sparse eigenvalue can be defined as follows [7]:
(5)
Matrix is required to be positive semidefinite, but the TPower method can be extended to deal with general symmetric matrices by setting where such that . The truncated power method proceeds as follows: starting from an initial sparse vector , at each iteration we multiply the vector by and then truncate the entries of to zero, setting the largest entries to 1. TPower can benefit from a good starting point; we use the solution from Linear as the initial sparse vector .
II-B3 Low Rank Bilinear Approximation
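A minimal sketch of the truncated power iteration just described (Sec. II-B2), with the binary truncation used here for feature selection; the diagonal shift that makes a general symmetric Q positive semidefinite is computed from its smallest eigenvalue, and the parameter names are ours:

```python
import numpy as np

def truncated_power(Q, k, x0, max_iter=100, tol=1e-8):
    """Truncated power iteration: multiply by (shifted) Q, keep only the
    k largest entries, set them to 1 and the rest to 0; repeat to a
    fixed point. x0 is a k-sparse 0/1 starting vector."""
    # Shift Q so it is positive semidefinite: Q_hat = Q + s*I with s >= 0.
    s = max(0.0, -float(np.linalg.eigvalsh(Q).min()))
    Qh = Q + s * np.eye(Q.shape[0])
    x = x0.copy()
    for _ in range(max_iter):
        y = Qh @ x
        x_new = np.zeros_like(x)
        x_new[np.argsort(y)[-k:]] = 1.0   # truncate to the k largest entries
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

The classical TPower of [7] renormalizes to unit norm at each step; the 0/1 truncation above follows the binary variant described in the text.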
The low rank bilinear approximation procedure has been applied to solve Sparse PCA [8] and the Densest Subgraph problem [9]. It approximates the solution to a BQP by applying a bilinear relaxation, which is given as,
(6)  
provides a good approximation (a 2-approximation [9]) to the BQP and can be solved in polynomial time using a rank approximation of . The authors in [9] have developed the Spannogram algorithm to estimate a candidate set (termed the Spannogram ) of vector pairs with features. One of the vectors from the pair that maximizes is the bilinear approximate solution to the BQP.
II-B4 Related BQP Approximation Methods
For the sake of completeness we briefly mention two other techniques that have been applied to approximate the BQP
in the domain of feature selection.
The authors in [10] propose two relaxation techniques: (1) Spectral relaxation and
(2) Semidefinite Programming relaxation. In Spectral relaxation, the constraint on
the values of is relaxed to continuous values.
The values of , being positive, can be interpreted as feature weights.
The solution to Spectral has been shown to
be the largest eigenvector of [10].
Semidefinite Programming relaxation (SDP)
is a closer approximation to the BQP than Spectral.
The BQP is approximated by a trace maximization problem using semidefinite relaxation.
The approximate solution , to the BQP, is obtained from the SDP solution through random projection rounding based on
Cholesky decomposition [11]. For our experiments, we apply random rounding with 100 projections.
For further details, please refer to [10].
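For concreteness, here is one reading of the random-projection rounding step (in the Goemans-Williamson style of [11]); the SDP solution matrix is assumed to be given, e.g. by CVX, and the scoring and top-k selection details are our assumptions:

```python
import numpy as np

def random_rounding(X_sdp, k, Q, n_proj=100, seed=0):
    """Round a PSD SDP solution matrix to a binary k-sparse vector:
    factor X = V V^T (Cholesky), draw Gaussian projections g, score the
    coordinates by V @ g, and keep the candidate with the best objective."""
    rng = np.random.default_rng(seed)
    # Small ridge so the Cholesky factorization succeeds numerically.
    V = np.linalg.cholesky(X_sdp + 1e-9 * np.eye(X_sdp.shape[0]))
    best_x, best_val = None, -np.inf
    for _ in range(n_proj):
        scores = V @ rng.standard_normal(V.shape[1])
        x = np.zeros(X_sdp.shape[0])
        x[np.argsort(scores)[-k:]] = 1.0      # k-sparse rounding of scores
        val = x @ Q @ x                       # evaluate the BQP objective
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```

Keeping the best of many projections is what makes the rounding robust to the sign ambiguity of each individual Gaussian draw.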
III Existing Methods for MI Based Feature Selection
In this section we will discuss existing filter based Mutual Information measures for feature selection. We categorize these procedures as Greedy (iterative) Selectors and Global Selectors. We limit our discussion to the best feature selectors we found in our studies. For a broader perspective on feature selection procedures, we suggest the works of Tang et al. [12], Guyon and Elisseeff [3] and Duch [13].
III-A Greedy Feature Selectors
Greedy selectors usually begin with an empty set , and iteratively add to it the most important feature indices, until a fixed number of feature indices are selected or a stopping criterion is reached. MI between the random variables (features and label) provides the ranking for the features. The most basic form of the scoring function is Maximum Relevance (MaxRel) [14], where the score is simply the MI between the feature and the class variable. To account for the redundancy between features, Peng et al. [15], introduced the Maximum Relevance Minimum Redundancy (MRMR) criterion, which selects features with maximum relevance to the label and minimum redundancy between each other. A greedy procedure very closely related to our technique is the Joint Mutual Information (JMI), that was developed by Yang and Moody [16], and later by Meyer et al. [17]. In our experiments (Sec. IV), we evaluate the features selected by these iterative procedures across multiple datasets.
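As a sketch of how such greedy selectors iterate, consider an illustrative MRMR-style forward loop over precomputed relevance I(X_i;C) and redundancy I(X_i;X_j) scores (array and function names are ours):

```python
import numpy as np

def greedy_mrmr(relevance, redundancy, k):
    """Forward selection: start from the single most relevant feature,
    then repeatedly add the feature with the best relevance-minus-
    mean-redundancy score against the already selected set."""
    selected = [int(np.argmax(relevance))]
    candidates = set(range(len(relevance))) - set(selected)
    while len(selected) < k and candidates:
        score = lambda i: relevance[i] - np.mean(redundancy[i, selected])
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

MaxRel corresponds to dropping the redundancy term entirely; JMI replaces the score with a joint-MI criterion. All of these are greedy in the same sense: each step is locally optimal given the features already chosen, which is exactly what the global BQP formulation tries to avoid.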
III-B Global Feature Selectors
There is limited work on MI based global feature selection. In [4], Rodriguez-Lujan et al. introduced Quadratic Programming Feature Selection (QPFS). This method can be viewed as a global alternative to MRMR. The second global technique, proposed by Nguyen et al. [10], is related to our method presented in (BQP). Nguyen et al. model a global feature selection problem based on the CMI matrix , and apply the Spectral and SDP methods to approximate the solution.
IV Experiments
We consider two factors to evaluate the feature selection algorithms discussed so far: time complexity and classification accuracy. For the global methods, since we are approximating the solution, we also consider the tightness of the approximation. We conducted our experiments using MATLAB on an Intel Core i7 2.3 GHz processor with 16 GB of memory.
IV-A Feature Selectors: a test of scalability
Greedy feature selectors have time complexities of the order , which is negligible compared to the time complexities of the global feature selectors. Table I lists the time complexities for the global algorithms. To study time complexities, we conduct multiple experiments , where we simulate a CMI matrix by a random positive symmetric matrix of size , and select features. The time complexity for experiment is the average time of convergence over 10 runs. We use the same set of random matrices for each of the algorithms in the experiment. Figure (2) depicts the convergence times for the different experiments. The Linear algorithm is the most efficient, followed closely by the Spectral and TPower methods. We used the CVX [18] implementation with the SDPT3 solver [19] for all our SDP experiments. The SDP solver has a huge memory footprint, and with matrix sizes we run into ‘Out of Memory’ errors. For the LowRank method, we used the following parameters for all our experiments. Please refer to [9] for more details on the LowRank method.
Linear  Spectral  SDP  LowRank^{*}  TPower^{**} 
^{*} is approximation rank, ^{**} is number of iterations 
IV-B BQP Methods: a test of approximation
For the next set of experiments, we evaluated the degree of approximation for the global algorithms. Since we do not know the optimal solution of the BQP, we compared the methods on their relative objective values. We estimated the binary feature vector after applying each of the methods and then evaluated the objective value . We evaluated the percentage difference of every algorithm’s objective value from the Linear method’s objective. Similar to the experimental evaluation used for time complexity, we generated random data for each experiment and averaged the values over 10 random runs. Figure (3) presents the results of the experiment. TPower and LowRank displayed the largest percentage increase over the Linear objective. Since TPower and LowRank approximate the BQP more closely than the other methods, we expect them to be better feature selectors than Linear, Spectral and SDP. TPower is also very efficient in terms of execution time, and would be the ideal feature selector when considering both speed and accuracy.
Data  #Features  #Samples  #Classes  Error(%)  Ref.

Arrhythmia  258  420  2  31.1  [20] 
Colon  2000  62  2  37.0  [21] 
Gisette  4995  6000  2  2.5  [20] 
Leukemia  7070  72  2  26.4  [21] 
Lung  325  73  7  9.6  [21] 
Lymphoma  4026  96  9  81.3  [21] 
Madelon  500  2000  2  45.5  [20] 
MultiFeat  649  2000  10  1.5  [20] 
Musk2  166  6598  2  4.6  [20] 
OptDigits  62  3823  10  3.3  [20] 
Promoter  57  106  2  26.0  [20] 
Spambase  57  4601  2  7.5  [20] 
Waveform  21  5000  3  13.1  [20] 
IV-C Feature Selectors: a test of classification error
In this section we compare TPower and LowRank with the other algorithms in terms of classification accuracy. For our experiments, we chose 13 publicly available datasets that are widely used to study MI based feature selection, as in [4, 5, 10, 15]. The details of the datasets are captured in Table II. We performed feature selection for a set of values and estimated the classifier performance across all values of . Starting at , we incremented in steps of till or , whichever is smaller. We evaluated the classifier performance using Leave-One-Out cross validation (if ) or 10-fold cross validation and obtained the cross validation errors (%) for each fold.

Since the average error across all values of is not a good measure of classifier performance, we applied the paired t-test, as also mentioned in [4, 10, 22], across the cross validation folds. For a fixed dataset and a fixed value of , to compare TPower with, say, MaxRel, we applied the one-sided paired t-test at 5% significance over the errors (%) of the cross validation folds for the two algorithms. We set the performance of TPower vs. MaxRel to win = , tie = and loss = , based on the largest number of t-test decisions across all the values.

Along the lines of earlier studies in feature selection, we used a linear SVM as the classifier. To estimate the CMI, we need to discretize the features. We believe the role of discretization is not unduly critical as long as it is consistent across all the experiments. We discretized the features using the Class Attribute Interdependence Maximization (CAIM) algorithm developed by Kurgan and Cios [23]. Feature selection was performed on the discretized data, but classification (after feature selection) was performed in the original feature space.

Using the above procedure, we compared the performance of TPower and LowRank with all the other algorithms. Tables III and IV display the results of the experiment. The values in Table III (likewise IV) correspond to the difference in average classification error (%) between TPower (likewise LowRank) and all the other algorithms. From the results in these tables, we find that TPower and LowRank do well on most of the datasets against all the algorithms. When compared against each other, LowRank does better than TPower. For the sake of brevity we have not displayed the comparisons between other pairs of algorithms. The win/tie/loss numbers by themselves do not provide a complete picture of the comparison; the difference in the average error also needs to be taken into account to assess performance. A large percentage of negative values in the columns, and their magnitudes, indicate the low classification error values for TPower and LowRank.
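The win/tie/loss decision for a single (dataset, k) pair can be sketched as follows, implementing the one-sided paired t-test directly with numpy; the critical value 2.132 is the standard t-table entry for 4 degrees of freedom (5 folds) at 5% one-sided significance, and the helper name and fold layout are our assumptions:

```python
import numpy as np

def win_tie_loss(err_a, err_b, t_crit=2.132):
    """One-sided paired t-test over per-fold CV errors (%).
    t_crit is the one-sided critical value for n-1 degrees of freedom
    (2.132 for 5 folds at 5%). Returns +1 if method A's error is
    significantly lower than B's, -1 if significantly higher, 0 for a tie."""
    d = np.asarray(err_a, dtype=float) - np.asarray(err_b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # paired t-statistic
    if t < -t_crit:
        return 1    # A's errors are significantly lower: win
    if t > t_crit:
        return -1   # A's errors are significantly higher: loss
    return 0
```

Pairing the folds (rather than comparing the two mean errors directly) removes the fold-to-fold variance that both algorithms share, which is what makes the test sensitive enough for small numbers of folds.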
Figure (4) displays the average classification error (%) trends for varying values of for datasets. Figures (3(a)) and (3(d)), for the Colon dataset, suggest that the addition of more features does not necessarily reduce classification error. Classification error trends also help us cross-validate the best value of for a dataset. For a given dataset, the error trends of the global and greedy procedures follow a similar pattern, which perhaps indicates that nearly identical features are being selected by both types of methods. We also note that for huge datasets with large values of , greedy methods may not be a bad choice for feature selection.
Data  MaxRel  MRMR  JMI  QPFS  Spectral  SDP  LowRank

Arrythmia  0.37 1.4  0.32 1.0  0.02 1.0  0.20 1.8  0.08 1.1  0.18 1.0  0.05 0.8 
Colon  7.28 4.6  4.42 4.2  2.47 3.8  6.70 4.6  0.60 2.8  N/A  4.03 4.5 
Gisette  1.32 0.6  0.00 0.7  1.12 0.6  1.38 0.7  1.26 0.6  N/A  0.33 0.6 
Leukemia  0.11 1.4  1.40 1.6  1.59 1.8  0.41 1.1  0.03 0.6  N/A  1.49 1.3 
Lung  9.43 4.1  2.52 4.2  3.83 4.2  0.60 2.8  0.88 2.2  0.88 2.1  1.59 2.4 
Lymphoma  2.76 4.8  3.35 4.7  2.93 5.3  4.99 3.3  1.86 2.5  N/A  3.29 4.2 
Madelon  0.32 0.5  0.80 0.9  0.01 0.4  0.22 0.7  0.01 0.6  0.15 0.6  0.11 0.4 
MultiFeatures  0.02 0.3  0.24 0.3  0.17 0.3  0.42 0.3  0.10 0.3  0.11 0.3  0.01 0.3 
Musk2  0.45 0.6  0.22 0.7  0.18 0.5  0.31 0.6  0.06 0.4  0.03 0.5  0.05 0.4 
OptDigits  0.19 0.5  0.01 0.6  0.16 0.6  0.65 1.0  0.03 0.3  2.53 13.0  0.08 0.4 
Promoter  0.73 3.0  0.04 3.2  0.19 3.0  1.29 3.8  0.48 2.8  0.56 2.9  0.17 3.1 
Spambase  0.34 0.3  0.06 0.2  0.23 0.3  0.03 0.4  0.09 0.3  0.10 0.3  0.02 0.1 
Waveform  0.13 0.3  0.06 0.3  0.01 0.0  0.04 0.2  0.04 0.1  0.00 0.2  0.01 0.2 
#//:  10/1/2  5/1/7  6/1/6  7/2/4  10/2/1  7/1/1  3/2/8 
Data  MaxRel  MRMR  JMI  QPFS  Spectral  SDP  TPower 

Arrythmia  0.32 1.3  0.36 1.0  0.07 1.0  0.25 1.7  0.03 1.1  0.13 1.0  0.05 0.8 
Colon  11.31 4.7  8.45 4.3  6.50 3.9  10.73 5.3  4.63 5.0  N/A  4.03 4.5 
Gisette  1.65 0.5  0.32 0.5  1.44 0.7  1.70 0.6  1.58 0.6  N/A  0.33 0.6 
Leukemia  1.39 1.4  0.09 1.5  0.09 1.7  1.09 1.3  1.52 1.2  N/A  1.49 1.3 
Lung  7.83 4.1  0.92 4.5  2.23 4.5  2.19 3.7  0.71 2.5  0.71 2.3  1.59 2.4 
Lymphoma  6.06 3.5  0.06 2.1  0.36 2.1  1.70 2.5  5.15 3.8  N/A  3.29 4.2 
Madelon  0.43 0.5  0.91 0.8  0.12 0.5  0.11 0.7  0.10 0.6  0.26 0.5  0.11 0.4 
MultiFeatures  0.01 0.4  0.23 0.3  0.16 0.4  0.43 0.3  0.10 0.4  0.10 0.4  0.01 0.3 
Musk2  0.50 0.4  0.27 0.7  0.23 0.5  0.36 0.5  0.02 0.3  0.02 0.4  0.05 0.4 
OptDigits  0.26 0.6  0.09 0.4  0.08 0.4  0.72 1.2  0.04 0.4  2.61 13.1  0.08 0.4 
Promoter  0.90 3.1  0.13 2.7  0.35 2.9  1.13 3.7  0.31 2.8  0.40 2.5  0.17 3.1 
Spambase  0.36 0.3  0.04 0.2  0.24 0.3  0.02 0.4  0.11 0.3  0.12 0.3  0.02 0.1 
Waveform  0.12 0.4  0.07 0.1  0.00 0.2  0.05 0.1  0.05 0.1  0.02 0.1  0.01 0.2 
#//:  9/0/4  6/1/6  6/2/5  8/2/3  5/4/4  3/2/4  8/2/3 
V Discussion and Conclusions
Feature selection is an NP-hard problem, and newer methods to approximate the solution will help drive research in this area. We have demonstrated that current methods applying MI and CMI for feature selection inherently assume conditional independence between features. The conditional independence assumption limits the comparison to only 2 or 3 features at a time; there is a need for better measures to approximate the importance of a group of selected features. To estimate the probability distributions, we had to discretize the features; we have not studied the effect of discretization in this work. Progress along these fronts will provide directions for improving MI based feature selection. In conclusion, both TPower and LowRank perform better than existing global and iterative techniques across most of the datasets. While LowRank slightly outperforms TPower, it does not compare well in terms of running time.
VI Acknowledgments
This material is based upon work supported by the National Science Foundation (NSF) under Grant No:1116360. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
We express our thanks to Nguyen X. Vinh [10], for providing his implementation for comparison.
References
 [1] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning. Springer, 2009, vol. 2, no. 1.
 [2] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” AI, vol. 97, no. 1, pp. 273–324, 1997.
 [3] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” JMLR, vol. 3, pp. 1157–1182, 2003.
 [4] I. RodriguezLujan, R. Huerta, C. Elkan, and C. S. Cruz, “Quadratic programming feature selection,” JMLR, vol. 11, pp. 1491–1516, 2010.
 [5] G. Brown, A. Pocock, M.J. Zhao, and M. Luján, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” JMLR, vol. 13, no. 1, pp. 27–66, 2012.
 [6] M. R. Garey and D. S. Johnson, Computers and Intractability; A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.

 [7] X.-T. Yuan and T. Zhang, “Truncated power method for sparse eigenvalue problems,” JMLR, vol. 14, no. 1, pp. 899–925, 2013.
 [8] D. S. Papailiopoulos, A. G. Dimakis, and S. Korokythakis, “Sparse PCA through low-rank approximations,” arXiv:1303.0551, 2013.
 [9] D. Papailiopoulos, I. Mitliagkas, A. Dimakis, and C. Caramanis, “Finding dense subgraphs via low-rank bilinear optimization,” in ICML, 2014, pp. 1890–1898.
 [10] X. V. Nguyen, J. Chan, S. Romano, and J. Bailey, “Effective global approaches for mutual information based feature selection,” in 20th ACM SIGKDD. ACM, 2014, pp. 512–521.
 [11] M. X. Goemans and D. P. Williamson, “Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming,” Journal of the ACM (JACM), vol. 42, no. 6, pp. 1115–1145, 1995.
 [12] J. Tang, S. Alelyani, and H. Liu, “Feature selection for classification: A review,” Data Classification: Algorithms and Applications, p. 37, 2014.
 [13] W. Duch, “Filter methods,” in Feature Extraction. Springer, 2006, pp. 89–117.
 [14] D. D. Lewis, “Feature selection and feature extraction for text categorization,” in Proc. of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 212–217.
 [15] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE PAMI, vol. 27, no. 8, pp. 1226–1238, 2005.

 [16] H. H. Yang and J. Moody, “Data visualization and feature selection: New algorithms for nongaussian data,” NIPS, vol. 12, 1999.
 [17] P. E. Meyer, C. Schretter, and G. Bontempi, “Information-theoretic feature selection in microarray data using variable complementarity,” IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 3, pp. 261–274, 2008.
 [18] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” http://cvxr.com/cvx, Mar. 2014.
 [19] K.-C. Toh, M. J. Todd, and R. H. Tütüncü, “SDPT3 - a MATLAB software package for semidefinite programming, version 1.3,” Optimization Methods and Software, vol. 11, no. 1–4, pp. 545–581, 1999.
 [20] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
 [21] C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data,” Journal of bioinformatics and computational biology, vol. 3, no. 02, pp. 185–205, 2005.
 [22] G. Herman, B. Zhang, Y. Wang, G. Ye, and F. Chen, “Mutual informationbased method for selecting informative feature sets,” Pattern Recognition, vol. 46, no. 12, pp. 3315–3327, 2013.
 [23] L. A. Kurgan and K. J. Cios, “CAIM discretization algorithm,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 2, pp. 145–153, 2004.
Appendix A LP2 Provides a Good Lower Bound for BQP
We would like to study the goodness of approximation to BQP provided by the solution to LP2. We define an equivalent problem in terms of BQP.
Proposition A.1
is equivalent to BQP.
is the sum of all elements in . Since , BQP . Therefore, . Since is a constant for a matrix, under the same set of constraints,
We define some new quantities for the derivation of the bound. Let be the solution of BQP and let be the solution of LP2. Since , we can expand in terms of any binary vector . Specifically, we define in terms of ,
Definition A.1
(7)  
(8)  
(9)  
(10) 
Lemma A.1
From (7) we have,
Let denote the maximum value of BQP and let denote the maximum value of LP2. If is the solution of BQP and is the solution of LP2, we have the following result:
Lemma A.2
We are now ready to state the bound for LP2.
Theorem A.1
(11) 