Module 10 — Attribute Selection¶

10.1 Selecting Meaningful Attributes ⬤ ¶

Consider the example of stock market data. There are millions of data points one can use as attributes. For the sake of argument, let's say there are $10^6$ attributes that are potentially available to us. We can imagine that within these $10^6$ attributes, there exist $10$ which, if selected, make the data points linearly separable. In that case, we could use a perceptron to completely predict days when the market is up or down, and try to become rich.

This is of course a fictional example, but not totally crazy. Many hedge funds and finance companies essentially work on detecting such 'market inefficiencies', where a small set of attributes expose an opportunity to make money.

Going back to our fictional example, we could attempt to find those 10 ideal attributes, by trying all possibilities. That is, for each subset of 10 attributes, we could restrict our data to those 10 attributes only, and then try to fit a perceptron. And we could keep trying until we found those 10 "golden" attributes. But if we have $10^6$ attributes, that would require checking roughly $\left(10^6\right)^{10} = 10^{60}$ possibilities, which is an extremely large number (much more than the total number of atoms making up planet Earth!).

While we can use our human intuition and understanding in trying to find some good attributes, we would also like to have some automated way (that is, an algorithm) for doing that.

Selecting a small number of important features can also help with computational efficiency. Fewer features require less computation from smaller models, and therefore can more easily deployed in smaller devices.

We have already seen that $l_1$-regularization can be used to select attributes. We will now describe some additional algorithms.

10.2 Important Attributes via Random Forests ⬤ ¶

When training a single decision tree, each attribute $i$ is used one or more times to split the dataset or its subsets. Each time attribute $i$ is used for a branching decision, we can measure the information gain for that decision. We can then compute a weighted information gain contributed by $i$ over the entire decision tree.

In a random forest we will train $N$ different decision trees, and thus we can compute an information gain $g_{i,j}$ for attribute $i$ in each decision tree $j$. Then, the overall importance for attribute $i$ can be computed as follows:

$$ G_i = \sum_{j=1}^N g_{i,j}\, . $$

Finally we can compute a normalized importance $$ \hat{G}_i = G_i\bigg/\sum_i G_i\, . $$

The computation of feature importances adds very little extra computation to the training of the random forest, and as such they can be computed efficiently.

On the other hand, possible issues include:

Correlated features. Datasets very often contain groups of highly correlated features. While the importance of a single feature may be high in isolation (i.e. in the absence of other features), that feature may end up getting assigned a lower importance when considered together with other features correlated with it. We can think of this intuitively as a dilution of the importance score through a sharing of 'credit' among the correlated features.
High Cardinality Features. Features with many possible categorical values tend to receive higher importance, simply due to their higher ability to partition the dataset.

10.3 Random Forests Notebook ⬤ ¶

Here is the code notebook.

10.4 Greedy Attribute Elimination ⬤ ¶

Suppose we have a dataset $X$ with $d$ features. We can ask the question: if we had to remove only one feature, what would that be? Which choice would be safest?

One natural way of answering the question is to train $d$ different models $m_{1},\ldots,m_d$, where $m_i$ is trained on the data $X$, but with feature $i$ dropped from $X$. Then, if $m_j$ is the model with highest reference score (e.g. accuracy) among the $d$ models, that would imply that attribute $j$ is not so important.

After removing attribute from $j$, we get a new dataset $X'$, and the process can continue in exactly the same way with $X'$.

By doing this a number of $k$ times, we will have eliminated $k$ attributes. Because in each iteration we 'greedily' pick the feature that currently appears to be not so important, the whole process is known as greedy attribute elimination.

This process can be applied with any model (not just random forests), but one problem with it is that it requires the training of many models. Eliminating $k$ attributes means that we would need to train $O(dk)$ models.

10.5 Greedy Feature Elimination Notebook ⬤ ¶

Here is the code notebook.

10.6. Permutation-Based Feature Importance ⬤ ¶

Permutation-based feature selection is a method that is not prone to the issues of random forests. It requires less computation than greedy elimination because it trains only one model $m$, and after $k$ different rounds of validation (on different validation sets), outputs an importance measure for each attribute.

The main idea is that in the $i^{th}$ round of validation, we take the validation set $X_{\mathit{valid}}$, shuffle its $i^{th}$ column and then report the validation score of model $m$ on it. For higher robustness, we can repeat the same procedures with other random shuffles and report an average validation.

In general, attributes whose permutation results in lower validation, are considered more important.

Note: Permuting a coordinate does not affect certain general statistics for the attribute (such as mean, variance, and other moments). Unlike greedy elimination, shuffling a column allows reuse of the same learning model.

10.7 Permutation-Based Feature Importance Pseudocode ⬤ ¶

Permutation Feature Importance

Input:
    fitted model $m$
    dataset $D$
    number of repetitions $K$
    scoring function $\text{score}(\cdot, \cdot)$

Compute reference score:
    $s \leftarrow \text{score}(m, D)$

For each feature $f_j$, corresponding to column $j$ of $D$:

    For $k = 1, \dots, K$:

        Construct $\tilde{D}_{k,j}$ by randomly shuffling column $j$ of $D$
        while leaving all other columns unchanged.

        $s_{k,j} \leftarrow \text{score}(m, \tilde{D}_{k,j})$

    $i_j \leftarrow s - \frac{1}{K} \sum_{k=1}^K s_{k,j}$

Output:
    feature importances $i_1, \dots, i_d$

For how to use this algorithm, see the scikit-learn manual.