
Population-based guiding for evolutionary neural architecture search



Tabular NAS benchmarks, such as NAS-Bench-101 and NATS-Bench, provide a standardized framework that enhances reproducibility, accessibility, and comparability across different NAS methods. The primary advantage of using tabular benchmarks is the efficient use of precomputed data, which significantly reduces the time required for architecture evaluation. This allows researchers to focus on algorithm development and makes it easier to perform statistically significant comparisons of average run performances. Additionally, these benchmarks ensure robust testing, promoting reproducibility and reliability. In all experiments, 1000 runs were averaged in accordance with the guidelines proposed in . NAS-Bench-101 consists of 423,624 neural networks trained on CIFAR-10 for 108 epochs, using 1×1 and 3×3 convolutions and 3×3 max pooling in a cell-based search space. The topology search space of NATS-Bench includes 15,625 possible architectures using 5 operations: zeroize, skip connection, 1×1 and 3×3 convolutions, and 3×3 average pooling. It evaluates networks on CIFAR-10, CIFAR-100, and ImageNet16-120, providing a controlled comparison setting for NAS methods.
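To illustrate how a tabular benchmark replaces training with a lookup, the sketch below queries precomputed metrics for a single NAS-Bench-101 cell. It assumes the official NAS-Bench-101 Python API and a locally downloaded record file; the file name, operation names, and result fields follow our reading of that public API and are assumptions rather than part of this work.

```python
import numpy as np
from nasbench import api  # official NAS-Bench-101 API (assumed installed)

# Load the precomputed tabular data (path and file name are assumptions).
nasbench = api.NASBench('nasbench_only108.tfrecord')

# A 7-node cell: upper-triangular adjacency matrix plus one operation per node.
matrix = np.array([[0, 1, 1, 1, 0, 1, 0],
                   [0, 0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 0, 0]])
ops = ['input', 'conv1x1-bn-relu', 'conv3x3-bn-relu', 'conv3x3-bn-relu',
       'conv3x3-bn-relu', 'maxpool3x3', 'output']

cell = api.ModelSpec(matrix=matrix, ops=ops)
if nasbench.is_valid(cell):
    # A dictionary lookup replaces 108 epochs of training.
    stats = nasbench.query(cell)
    print(stats['validation_accuracy'], stats['training_time'])
```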

Encoding

For NAS-Bench-101, architectures were encoded using a flattened adjacency matrix and one-hot encoded operations. To ensure consistency, we used categorical encoding, representing each adjacency entry with two bits and operations with a one-hot vector. A visualization is shown in Fig. 4.
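As a concrete illustration of this encoding, the following sketch flattens the upper triangle of the adjacency matrix into two-bit one-hot entries and appends a one-hot vector per intermediate node; the helper name and the fixed operation list are illustrative choices, not the original implementation.

```python
import numpy as np

# The three operation choices available for intermediate nodes in NAS-Bench-101.
OPS = ['conv1x1-bn-relu', 'conv3x3-bn-relu', 'maxpool3x3']

def encode_cell(matrix: np.ndarray, ops: list) -> np.ndarray:
    """Categorical encoding sketch: each upper-triangular adjacency entry becomes
    a two-bit one-hot vector, each intermediate node's operation a one-hot vector."""
    n = matrix.shape[0]
    parts = []
    # Adjacency part: entries above the diagonal, one-hot over {absent, present}.
    for i in range(n):
        for j in range(i + 1, n):
            parts.append(np.eye(2)[int(matrix[i, j])])
    # Operation part: skip the fixed 'input' (first) and 'output' (last) nodes.
    for op in ops[1:-1]:
        parts.append(np.eye(len(OPS))[OPS.index(op)])
    return np.concatenate(parts)
```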

However, a closer investigation of the available operations revealed an operation bias: biasing mutation toward a specific operation leads to different performance outcomes. As shown in Fig. 5, favoring 3×3 convolutions in mutations improves algorithmic performance but poses a risk of overfitting on the benchmark.

For NATS-Bench (topology search space), the encoding differs: each of the 6 possible edges in the graph can represent one of 5 possible operations, yielding a compact categorical representation with one entry per edge. To support seamless switching between different benchmarks and their specific encodings, we explored a unified encoding approach, which we describe in Section Benchmark unification.
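A minimal sketch of this edge-based encoding is shown below; the operation names and the architecture-string format follow the public NATS-Bench (NAS-Bench-201) convention and should be treated as assumptions.

```python
# The five operations of the NATS-Bench topology search space.
NATS_OPS = ['none', 'skip_connect', 'nor_conv_1x1', 'nor_conv_3x3', 'avg_pool_3x3']

def decode_to_arch_str(x):
    """Map a 6-entry categorical vector (values 0..4, one per edge) to a
    NATS-Bench architecture string.
    Edge order: (1<-0), (2<-0), (2<-1), (3<-0), (3<-1), (3<-2)."""
    o = [NATS_OPS[i] for i in x]
    return '|{}~0|+|{}~0|{}~1|+|{}~0|{}~1|{}~2|'.format(*o)

# Example: an architecture consisting solely of skip connections.
print(decode_to_arch_str([1, 1, 1, 1, 1, 1]))
```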

Shadowed nodes

Node shadowing is a phenomenon that occurs when neural network architectures are encoded using a binary adjacency matrix for connections and operations are associated with nodes rather than edges. A node is shadowed when it lacks a direct or indirect connection to the input or to the output of the network. As a result, the operation assigned to that node becomes irrelevant because the node does not affect the overall computation; when a node is shadowed, its associated operation is thus also shadowed. Although these nodes are effectively pruned and do not directly impact the model's performance, it is still possible to mutate the shadowed operations. While temporarily shadowed, such nodes may become active again if subsequent mutations or crossovers reconnect them, thereby influencing the model's behavior. This highlights the necessity of implementing a pruning mechanism to identify shadowed nodes and further investigate their effects. A distinction must be made between incoming and outgoing shadowing, leading to a duality in the problem.

Incoming shadowing means that no preceding node connects to the considered node (direct shadowing), as shown in Fig. 6. Conversely, outgoing shadowing means that the node connects neither to any succeeding node nor to the output node, as shown in Fig. 7.

The upper-triangular adjacency matrix encoding for the directed graph maps node connections from the origin node (row) to the destination node (column). Detecting incoming shadowing is as simple as checking whether the column of the considered node consists entirely of zeros, indicating that no other node connects to it. Similarly, outgoing shadowing is identified by checking whether the corresponding row consists entirely of zeros, indicating no outgoing connections. It is important to note that an all-zero column or row is sufficient to prove shadowing, but not necessary, as nodes can also be shadowed indirectly when they connect only to nodes that are themselves shadowed. This is referred to as indirect shadowing, as shown in Fig. 8.

Due to the nature of the DAG, this issue can be resolved iteratively. Starting from the input (output) node for incoming (outgoing) shadowing, the algorithm traverses the node order upward (downward), checking the columns (rows) for zeros and ignoring non-zero entries that originate from nodes already identified as shadowed.
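The sketch below implements this iterative check for an upper-triangular 0/1 adjacency matrix; the function name and the convention that node 0 is the input and node n-1 the output are our own illustrative choices.

```python
import numpy as np

def shadowed_nodes(matrix: np.ndarray) -> set:
    """Return the indices of nodes shadowed on the incoming or outgoing side.

    A node is incoming-shadowed if it cannot be reached from the input (node 0)
    and outgoing-shadowed if it cannot reach the output (node n-1)."""
    n = matrix.shape[0]
    incoming, outgoing = set(), set()

    # Incoming pass: traverse nodes in increasing order; an incoming edge only
    # counts if its origin node is itself reachable from the input.
    for j in range(1, n):
        if not any(matrix[i, j] and i not in incoming for i in range(j)):
            incoming.add(j)

    # Outgoing pass: traverse nodes in decreasing order; an outgoing edge only
    # counts if its destination node can itself reach the output.
    for i in range(n - 2, -1, -1):
        if not any(matrix[i, j] and j not in outgoing for j in range(i + 1, n)):
            outgoing.add(i)

    # The fixed input and output nodes are never reported as shadowed.
    return (incoming | outgoing) - {0, n - 1}
```

The mutation operator can then consult this set to decide whether a sampled index belongs to a shadowed node.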

For an empirical evaluation of occurrences, we counted the number of shadowed nodes averaged over 1000 runs, yielding 431 mutation processes, with the first node being shadowed 206 times; in 14 instances, it was the only operation mutated. On average, 2 out of the 5 nodes were shadowed per mutation. Opting not to mutate shadowed nodes, thereby avoiding shadowed operations, can potentially yield advantageous outcomes that could be exploited.

Benchmark unification

A simplified example illustrates the unification of the two representations. The operation-nodes representation, using an adjacency matrix and an operation list as in NAS-Bench-101, treats nodes as operations and edges as connections. The operation-edges representation treats the operations, including skip connections and zeroizations, as edges, as in NATS-Bench. The simplified operation-nodes example uses 2 nodes that can each represent one of three possible operations, as shown in the left panel of Fig. 9. This brings the number of adjacency matrix configurations to 2^6 = 64 (each of the 6 possible edges can take the value 0 or 1) and the number of operation assignments to 3^2 = 9 (each of the 2 nodes can be op1, op2, or op3), resulting in 64 × 9 = 576 combinations, not counting graph isomorphisms. Figure 9 shows a unification approach for both representations. To map the operation-nodes representation (left) to an operation-edges representation (middle), two nodes are also required, as one node would lead to two operations for the same edge (right), which is invalid for this representation.

The valid operation-edges representation with two nodes contains 6 possible edges with 5 possible operations each (the 3 original operations plus skip connection and zeroization), resulting in a significantly larger space of possible combinations (7776 vs 576) and therefore in combinations that are not present in the original operation-nodes representation. The resulting representation mapping is not a surjective function (injectivity would additionally require careful treatment of graph isomorphisms), whereas a true unification of both representations would require a bijective mapping between them. Applying the operation-edges representation to both datasets would create new architecture configurations that are not included in the original NAS-Bench-101 dataset and cannot be evaluated. Conversely, applying an operation-nodes representation would significantly reduce the number of architectures available from the NATS-Bench dataset, as many architectures cannot be transferred to this representation.

Guiding mutation based on a priori known best models

We investigate how guiding mutation by sampling indices from different groups of architectures, grouped by their performance, affects the search process. Using a pre-evaluated architecture benchmark such as NAS-Bench-101, we can sort all architectures by their performance and then use different groups of averaged architectures to form a mutation index probability vector. We use averaged groups of the top 1, 10, 100, and 1000 models, selected first from the entire dataset and then from the dataset after excluding the top 50%. It is important to note that this experiment is intended to investigate the influence of guiding the search when the solutions (i.e., the top models) are already known, which makes the algorithm itself ineffective in practice.
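The sketch below shows one way such a mutation index probability vector can be formed from a group of encoded architectures; the exact definition used here is not reproduced, so the assumption that probs1 is the normalized per-index frequency of ones (and probs0 its complement) is ours.

```python
import numpy as np

def mutation_index_probs(group_encodings: np.ndarray):
    """Build mutation-index probability vectors from a group of binary encodings.

    Assumption for this sketch: probs1 is the normalized per-index frequency of
    ones across the group, and probs0 the normalized frequency of zeros."""
    freq1 = group_encodings.mean(axis=0)   # fraction of ones at each encoding index
    probs1 = freq1 / freq1.sum()           # normalize to a sampling distribution
    freq0 = 1.0 - freq1
    probs0 = freq0 / freq0.sum()
    return probs0, probs1

# Usage (illustrative): top_k is an array of shape (k, encoding_dim) holding the
# encodings of the k best architectures from the benchmark.
# probs0, probs1 = mutation_index_probs(top_k)
```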

Benchmarking PBG on NAS-Bench-101

In contrast to the guided variants above, which require a priori knowledge of the best models, we iteratively use the most recently evaluated population (i.e., the current generation) to guide our mutation. We used greedy selection, random-point crossover, and guided mutation, applying one mutation index for the adjacency matrix and one for the operations per individual. We evaluated both variants of guided mutation, sampling mutation indices from either probs1 or probs0. For the population size (psize), we chose 50 individuals, as a comparison showed relatively similar results regardless of population size. Table 1 compares algorithms with different population sizes under a time budget of 1e7 seconds. Note that these results are based on unoptimized raw data, so the performance indicated in this table may not be fully representative and could be misleading.
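A minimal sketch of one such guided mutation step is given below; it follows the description of one adjacency index and one operation index per individual, but the flat binary encoding, the bit-flip, and all names are illustrative assumptions (a complete implementation would also repair one-hot operation blocks after the flip).

```python
import numpy as np

rng = np.random.default_rng()

def guided_mutation(encoding: np.ndarray, probs: np.ndarray,
                    adj_slice: slice, op_slice: slice) -> np.ndarray:
    """Mutate one adjacency index and one operation index, each sampled from the
    guiding probability vector `probs` (e.g. probs0 or probs1)."""
    child = encoding.copy()
    for part in (adj_slice, op_slice):
        indices = np.arange(part.start, part.stop)
        p = probs[part] / probs[part].sum()   # renormalize within this part
        idx = rng.choice(indices, p=p)        # guided choice of the mutation index
        child[idx] = 1 - child[idx]           # flip the selected bit
    return child
```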

For the regularized evolution algorithm, a mutation rate of 0.72 was empirically found to be optimal. As a state-of-the-art baseline, regularized evolution provides a competitive reference point and shows robust performance across benchmarks. While newer methods claim state-of-the-art performance on benchmarks such as , we had difficulties reproducing these results, as their experiments lack sufficiently large sample sizes. A more thorough comparison against newer methods should be part of future work.
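For reference, the following sketch outlines regularized (aging) evolution as introduced by Real et al.; the callbacks sample_random, mutate, and evaluate are placeholders (the mutation rate of 0.72 mentioned above would live inside mutate), and the budget handling is simplified.

```python
import random
from collections import deque

def regularized_evolution(sample_random, mutate, evaluate,
                          population_size=50, tournament_size=10, budget=1000):
    """Aging evolution: tournament-select a parent, mutate it, age out the oldest."""
    population, history = deque(), []
    # Initialize the population with random architectures.
    while len(population) < population_size:
        arch = sample_random()
        individual = (evaluate(arch), arch)
        population.append(individual)
        history.append(individual)
    # Evolve until the evaluation budget is exhausted.
    while len(history) < budget:
        tournament = random.sample(list(population), tournament_size)
        parent = max(tournament, key=lambda ind: ind[0])
        child_arch = mutate(parent[1])
        child = (evaluate(child_arch), child_arch)
        population.append(child)
        population.popleft()          # aging: discard the oldest individual
        history.append(child)
    return max(history, key=lambda ind: ind[0])
```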

Generalization test on NATS-Bench

In order to test the generalization ability of PBG, it was evaluated on the topology search space of NATS-Bench. As this benchmark uses a unified encoding that combines edges and operations, the guided mutation approach used the probs0 vector (PBG-0 variant) to sample the mutation index. This allowed us to test guided mutation in its pure form, avoiding the benchmark-specific encoding issues described in Section Encoding. We tested our algorithm against the best-performing algorithms reported in the NATS-Bench paper, regularized evolution and REINFORCE, as implemented there.

Ablation studies

In order to differentiate the individual contributions of the two proposed methods, greedy selection and guided mutation, we performed ablation studies. Greedy selection was combined with guided mutation, with random mutation that mutates every index with a probability given by a mutation rate of 0.5, and with swap mutation that interchanges two operations (see Section Greedy selection works best with guided mutation). Guided mutation was combined with greedy selection, random parent-pair selection, and tournament selection with a tournament size of 10 (see Section Guided mutation benefits from greedy selection). We kept the random-point crossover operation throughout all experiments. For the NAS-Bench-101-specific encoding (see Section Encoding), at the cost of some generality, we further studied the impact of separately mutating the adjacency matrix and the operations by independently sampling a mutation index for each category and combining different ways of sampling from probs1 or probs0 (see Section Exploitative guided mutation on operations drives performance gains). We evaluated PBG's performance both for identical settings (00 and 11) and for mixed settings with probs0 for the adjacency matrix and probs1 for operations (01), or vice versa (10). This mixed approach should be considered with caution, as the benchmark-specific operation biases described in Section Encoding result in a loss of generality for other benchmarks.
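For concreteness, minimal sketches of the ablation baselines (random mutation with rate 0.5, swap mutation, and tournament selection with size 10) are shown below; these are generic textbook operators written by us, not the original implementation.

```python
import random

def random_mutation(encoding, rate=0.5):
    """Random mutation: flip each binary index independently with probability `rate`."""
    return [1 - bit if random.random() < rate else bit for bit in encoding]

def swap_mutation(ops):
    """Swap mutation: interchange the operations of two randomly chosen nodes."""
    child = list(ops)
    i, j = random.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def tournament_selection(population, fitness, tournament_size=10):
    """Tournament selection: return the fittest of `tournament_size` random individuals."""
    contenders = random.sample(population, tournament_size)
    return max(contenders, key=fitness)
```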
