Abstract
Despite their better convergence properties compared to first-order optimizers, second-order optimizers for deep learning have been less popular due to their significant computational costs. The primary efficiency bottleneck in such optimizers is matrix inverse calculations in the preconditioning step, which are expensive to compute on GPUs. In this paper, we introduce Jorge, a second-order optimizer that promises the best of both worlds -- rapid convergence benefits of second-order methods, and high computational efficiency typical of first-order methods. We address the primary computational bottleneck of computing matrix inverses by completely eliminating them using an approximation of the preconditioner computation. This makes Jorge extremely efficient on GPUs in terms of wall-clock time. Further, we describe an approach to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, thereby significantly minimizing tuning efforts. Our empirical evaluations demonstrate the distinct advantages of using Jorge, outperforming state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple deep learning models, both in terms of sample efficiency and wall-clock time.
Demystifying the Myths and Legends of Nonconvex Convergence of SGD
Authors: Authors: Aritra Dutta, El Houcine Bergou, Soumia Boucherouite, Nicklas Werge, Melih Kandemir, Xin Li
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems with nonconvex objective functions. Although the convergence of SGDs in the (strongly) convex case is well-understood, their convergence for nonconvex functions stands on weak mathematical foundations. Most existing studies on the nonconvex convergence of SGD show the complexity results based on either the minimum of the expected gradient norm or the functional sub-optimality gap (for functions with extra structural property) by searching the entire range of iterates. Hence the last iterations of SGDs do not necessarily maintain the same complexity guarantee. This paper shows that an $\epsilon$-stationary point exists in the final iterates of SGDs, given a large enough total iteration budget, $T$, not just anywhere in the entire range of iterates -- a much stronger result than the existing one. Additionally, our analyses allow us to measure the density of the $\epsilon$-stationary points in the final iterates of SGD, and we recover the classical $O(\frac{1}{\sqrt{T}})$ asymptotic rate under various existing assumptions on the objective function and the bounds on the stochastic gradient. As a result of our analyses, we addressed certain myths and legends related to the nonconvex convergence of SGD and posed some thought-provoking questions that could set new directions for research.
Keyword: optimization
Balancing exploration and exploitation phases in whale optimization algorithm: an insightful and empirical analysis
Authors: Authors: Aram M. Ahmed, Tarik A. Rashid, Bryar A. Hassan, Jaffer Majidpour, Kaniaw A. Noori, Chnoor Maheadeen Rahman, Mohmad Hussein Abdalla, Shko M. Qader, Noor Tayfor, Naufel B Mohammed
Subjects: Neural and Evolutionary Computing (cs.NE)
Abstract
Agents of any metaheuristic algorithms are moving in two modes, namely exploration and exploitation. Obtaining robust results in any algorithm is strongly dependent on how to balance between these two modes. Whale optimization algorithm as a robust and well recognized metaheuristic algorithm in the literature, has proposed a novel scheme to achieve this balance. It has also shown superior results on a wide range of applications. Moreover, in the previous chapter, an equitable and fair performance evaluation of the algorithm was provided. However, to this point, only comparison of the final results is considered, which does not explain how these results are obtained. Therefore, this chapter attempts to empirically analyze the WOA algorithm in terms of the local and global search capabilities i.e. the ratio of exploration and exploitation phases. To achieve this objective, the dimension-wise diversity measurement is employed, which, at various stages of the optimization process, statistically evaluates the population's convergence and diversity.
Adjoint-based inversion of frictional parameters in earthquake modeling
Authors: Authors: Vidar Stiernström, Martin Almquist, Eric M. Dunham
Abstract
We present an adjoint-based optimization method to invert for frictional parameters used in earthquake modeling. The forward problem is linear elasticity with nonlinear rate-and-state frictional faults. The misfit functional quantifies the difference between simulated and measured particle displacements or velocities at receiver locations. The misfit may include windowing or filtering operators. We derive the corresponding adjoint problem, which is linear elasticity with linear rate-and-state friction. The gradient of the misfit is efficiently computed by convolving forward and adjoint variables on the fault. The method thus extends the framework of full-waveform inversion to include frictional faults with rate-and-state friction. In addition, we present a space-time dual-consistent discretization of a dynamic rupture problem with a rough fault in antiplane shear, using high-order accurate summation-by-parts finite differences in combination with explicit Runge-Kutta time integration. The dual consistency of the discretization ensures that the discrete adjoint-based gradient is the exact gradient of the discrete cost functional as well as a consistent approximation of the continuous gradient. Our theoretical results are corroborated by inversions with synthetic data.
Asynchronous Distributed Smoothing and Mapping via On-Manifold Consensus ADMM
Authors: Authors: Daniel McGann, Kyle Lassak, Michael Kaess
Abstract
In this paper we present a fully distributed, asynchronous, and general purpose optimization algorithm for Consensus Simultaneous Localization and Mapping (CSLAM). Multi-robot teams require that agents have timely and accurate solutions to their state as well as the states of the other robots in the team. To optimize this solution we develop a CSLAM back-end based on Consensus ADMM called MESA (Manifold, Edge-based, Separable ADMM). MESA is fully distributed to tolerate failures of individual robots, asynchronous to tolerate practical network conditions, and general purpose to handle any CSLAM problem formulation. We demonstrate that MESA exhibits superior convergence rates and accuracy compare to existing state-of-the art CSLAM back-end optimizers.
New Environment Adaptation with Few Shots for OFDM Receiver and mmWave Beamforming
Authors: Authors: Ouya Wang, Shenglong Zhou, Geoffrey Ye Li
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Few-shot learning (FSL) enables adaptation to new tasks with only limited training data. In wireless communications, channel environments can vary drastically; therefore, FSL techniques can quickly adjust transceiver accordingly. In this paper, we develop two FSL frameworks that fit in wireless transceiver design. Both frameworks are base on optimization programs that can be solved by well-known algorithms like the inexact alternating direction method of multipliers (iADMM) and the inexact alternating direction method (iADM). As examples, we demonstrate how the proposed two FSL frameworks are used for the OFDM receiver and beamforming (BF) for the millimeter wave (mmWave) system. The numerical experiments confirm their desirable performance in both applications compared to other popular approaches, such as transfer learning (TL) and model-agnostic meta-learning.
Quantum Computing for MIMO Beam Selection Problem: Model and Optical Experimental Solution
Abstract
Massive multiple-input multiple-output (MIMO) has gained widespread popularity in recent years due to its ability to increase data rates, improve signal quality, and provide better coverage in challenging environments. In this paper, we investigate the MIMO beam selection (MBS) problem, which is proven to be NP-hard and computationally intractable. To deal with this problem, quantum computing that can provide faster and more efficient solutions to large-scale combinatorial optimization is considered. MBS is formulated in a quadratic unbounded binary optimization form and solved with Coherent Ising Machine (CIM) physical machine. We compare the performance of our solution with two classic heuristics, simulated annealing and Tabu search. The results demonstrate an average performance improvement by a factor of 261.23 and 20.6, respectively, which shows that CIM-based solution performs significantly better in terms of selecting the optimal subset of beams. This work shows great promise for practical 5G operation and promotes the application of quantum computing in solving computationally hard problems in communication.
Low rank approximation method for perturbed linear systems with applications to elliptic type stochastic PDEs
Authors: Authors: Yujun Zhu, Ju Ming, Jie Zhu, Zhongming Wang
Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP); Optimization and Control (math.OC)
Abstract
In this paper, we propose a low rank approximation method for efficiently solving stochastic partial differential equations. Specifically, our method utilizes a novel low rank approximation of the stiffness matrices, which can significantly reduce the computational load and storage requirements associated with matrix inversion without losing accuracy. To demonstrate the versatility and applicability of our method, we apply it to address two crucial uncertainty quantification problems: stochastic elliptic equations and optimal control problems governed by stochastic elliptic PDE constraints. Based on varying dimension reduction ratios, our algorithm exhibits the capability to yield a high precision numerical solution for stochastic partial differential equations, or provides a rough representation of the exact solutions as a pre-processing phase. Meanwhile, our algorithm for solving stochastic optimal control problems allows a diverse range of gradient-based unconstrained optimization methods, rendering it particularly appealing for computationally intensive large-scale problems. Numerical experiments are conducted and the results provide strong validation of the feasibility and effectiveness of our algorithm.
Abstract
Reconfigurable intelligent surface (RIS) emerges as an efficient and promising technology for the next wireless generation networks and has attracted a lot of attention owing to the capability of extending wireless coverage by reflecting signals toward targeted receivers. In this paper, we consider a RIS-assisted high-speed train (HST) communication system to enhance wireless coverage and improve coverage probability. First, coverage performance of the downlink single-input-single-output system is investigated, and the closed-form expression of coverage probability is derived. Moreover, travel distance maximization problem is formulated to facilitate RIS discrete phase design and RIS placement optimization, which is subject to coverage probability constraint. Simulation results validate that better coverage performance and higher travel distance can be achieved with deployment of RIS. The impacts of some key system parameters including transmission power, signal-to-noise ratio threshold, number of RIS elements, number of RIS quantization bits, horizontal distance between base station and RIS, and speed of HST on system performance are investigated. In addition, it is found that RIS can well improve coverage probability with limited power consumption for HST communications.
Auto Search Indexer for End-to-End Document Retrieval
Abstract
Generative retrieval, which is a new advanced paradigm for document retrieval, has recently attracted research interests, since it encodes all documents into the model and directly generates the retrieved documents. However, its power is still underutilized since it heavily relies on the "preprocessed" document identifiers (docids), thus limiting its retrieval performance and ability to retrieve new documents. In this paper, we propose a novel fully end-to-end retrieval paradigm. It can not only end-to-end learn the best docids for existing and new documents automatically via a semantic indexing module, but also perform end-to-end document retrieval via an encoder-decoder-based generative model, namely Auto Search Indexer (ASI). Besides, we design a reparameterization mechanism to combine the above two modules into a joint optimization framework. Extensive experimental results demonstrate the superiority of our model over advanced baselines on both public and industrial datasets and also verify the ability to deal with new documents.
Parallel Bayesian Optimization Using Satisficing Thompson Sampling for Time-Sensitive Black-Box Optimization
Authors: Authors: Xiaobin Song, Benben Jiang
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract
Bayesian optimization (BO) is widely used for black-box optimization problems, and have been shown to perform well in various real-world tasks. However, most of the existing BO methods aim to learn the optimal solution, which may become infeasible when the parameter space is extremely large or the problem is time-sensitive. In these contexts, switching to a satisficing solution that requires less information can result in better performance. In this work, we focus on time-sensitive black-box optimization problems and propose satisficing Thompson sampling-based parallel Bayesian optimization (STS-PBO) approaches, including synchronous and asynchronous versions. We shift the target from an optimal solution to a satisficing solution that is easier to learn. The rate-distortion theory is introduced to construct a loss function that balances the amount of information that needs to be learned with sub-optimality, and the Blahut-Arimoto algorithm is adopted to compute the target solution that reaches the minimum information rate under the distortion limit at each step. Both discounted and undiscounted Bayesian cumulative regret bounds are theoretically derived for the proposed STS-PBO approaches. The effectiveness of the proposed methods is demonstrated on a fast-charging design problem of Lithium-ion batteries. The results are accordant with theoretical analyses, and show that our STS-PBO methods outperform both sequential counterparts and parallel BO with traditional Thompson sampling in both synchronous and asynchronous settings.
Solving Expensive Optimization Problems in Dynamic Environments with Meta-learning
Authors: Authors: Huan Zhang, Jinliang Ding, Liang Feng, Kay Chen Tan, Ke Li
Subjects: Neural and Evolutionary Computing (cs.NE)
Abstract
Dynamic environments pose great challenges for expensive optimization problems, as the objective functions of these problems change over time and thus require remarkable computational resources to track the optimal solutions. Although data-driven evolutionary optimization and Bayesian optimization (BO) approaches have shown promise in solving expensive optimization problems in static environments, the attempts to develop such approaches in dynamic environments remain rarely unexplored. In this paper, we propose a simple yet effective meta-learning-based optimization framework for solving expensive dynamic optimization problems. This framework is flexible, allowing any off-the-shelf continuously differentiable surrogate model to be used in a plug-in manner, either in data-driven evolutionary optimization or BO approaches. In particular, the framework consists of two unique components: 1) the meta-learning component, in which a gradient-based meta-learning approach is adopted to learn experience (effective model parameters) across different dynamics along the optimization process. 2) the adaptation component, where the learned experience (model parameters) is used as the initial parameters for fast adaptation in the dynamic environment based on few shot samples. By doing so, the optimization process is able to quickly initiate the search in a new environment within a strictly restricted computational budget. Experiments demonstrate the effectiveness of the proposed algorithm framework compared to several state-of-the-art algorithms on common benchmark test problems under different dynamic characteristics.
Large Language Model for Multi-objective Evolutionary Optimization
Abstract
Multiobjective evolutionary algorithms (MOEAs) are major methods for solving multiobjective optimization problems (MOPs). Many MOEAs have been proposed in the past decades, of which the operators need carefully handcrafted design with domain knowledge. Recently, some attempts have been made to replace the manually designed operators in MOEAs with learning-based operators (e.g., neural network models). However, much effort is still required for designing and training such models, and the learned operators might not generalize well to solve new problems. To tackle the above challenges, this work investigates a novel approach that leverages the powerful large language model (LLM) to design MOEA operators. With proper prompt engineering, we successfully let a general LLM serve as a black-box search operator for decomposition-based MOEA (MOEA/D) in a zero-shot manner. In addition, by learning from the LLM behavior, we further design an explicit white-box operator with randomness and propose a new version of decomposition-based MOEA, termed MOEA/D-LO. Experimental studies on different test benchmarks show that our proposed method can achieve competitive performance with widely used MOEAs. It is also promising to see the operator only learned from a few instances can have robust generalization performance on unseen problems with quite different patterns and settings. The results reveal the potential benefits of using pre-trained LLMs in the design of MOEAs.
Explanation-Based Training with Differentiable Insertion/Deletion Metric-Aware Regularizers
Abstract
The quality of explanations for the predictions of complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how correctly the explanations reflect the predictor's behavior. To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both insertion and deletion scores of the explanations while keeping their predictive accuracy. Since the original insertion and deletion metrics are indifferentiable with respect to the explanations and directly unavailable for gradient-based optimization, we extend the metrics to be differentiable and use them to formalize insertion and deletion metric-based regularizers. The experimental results on image and tabular datasets show that the deep neural networks-based predictors fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easy-to-interpret explanations while keeping high predictive accuracy.
GMEM: Generalized Memory Management for Peripheral Devices
Authors: Authors: Weixi Zhu, Alan L. Cox, Scott Rixner
Abstract
This paper presents GMEM, generalized memory management, for peripheral devices. GMEM provides OS support for centralized memory management of both CPU and devices. GMEM provides a high-level interface that decouples MMU-specific functions. Device drivers can thus attach themselves to a process's address space and let the OS take charge of their memory management. This eliminates the need for device drivers to "reinvent the wheel" and allows them to benefit from general memory optimizations integrated by GMEM. Furthermore, GMEM internally coordinates all attached devices within each virtual address space. This drastically improves user-level programmability, since programmers can use a single address space within their program, even when operating across the CPU and multiple devices. A case study on device drivers demonstrates these benefits. A GMEM-based IOMMU driver eliminates around seven hundred lines of code and obtains 54% higher network receive throughput utilizing 32% less CPU compared to the state-of-the-art. In addition, the GMEM-based driver of a simulated GPU takes less than 70 lines of code, excluding its MMU functions.
Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark
Abstract
Artificial intelligence (AI) systems possess significant potential to drive societal progress. However, their deployment often faces obstacles due to substantial safety concerns. Safe reinforcement learning (SafeRL) emerges as a solution to optimize policies while simultaneously adhering to multiple constraints, thereby addressing the challenge of integrating reinforcement learning in safety-critical scenarios. In this paper, we present an environment suite called Safety-Gymnasium, which encompasses safety-critical tasks in both single and multi-agent scenarios, accepting vector and vision-only input. Additionally, we offer a library of algorithms named Safe Policy Optimization (SafePO), comprising 16 state-of-the-art SafeRL algorithms. This comprehensive library can serve as a validation tool for the research community. By introducing this benchmark, we aim to facilitate the evaluation and comparison of safety performance, thus fostering the development of reinforcement learning for safer, more reliable, and responsible real-world applications. The website of this project can be accessed at https://sites.google.com/view/safety-gymnasium.
Quantum computing through the lens of control: A tutorial introduction
Authors: Authors: Julian Berberich, Daniel Fink
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC); Quantum Physics (quant-ph)
Abstract
Quantum computing is a fascinating interdisciplinary research field that promises to revolutionize computing by efficiently solving previously intractable problems. Recent years have seen tremendous progress on both the experimental realization of quantum computing devices as well as the development and implementation of quantum algorithms. Yet, realizing computational advantages of quantum computers in practice remains a widely open problem due to numerous fundamental challenges. Interestingly, many of these challenges are connected to performance, robustness, scalability, optimization, or feedback, all of which are central concepts in control theory. This paper provides a tutorial introduction to quantum computing from the perspective of control theory. We introduce the mathematical framework of quantum algorithms ranging from basic elements including quantum bits and quantum gates to more advanced concepts such as variational quantum algorithms and quantum errors. The tutorial only requires basic knowledge of linear algebra and, in particular, no prior exposure to quantum physics. Our main goal is to equip readers with the mathematical basics required to understand and possibly solve (control-related) problems in quantum computing. In particular, beyond the tutorial introduction, we provide a list of research challenges in the field of quantum computing and discuss their connections to control.
How a student becomes a teacher: learning and forgetting through Spectral methods
Authors: Authors: Lorenzo Giambagli, Lorenzo Buffoni, Lorenzo Chicchi, Duccio Fanelli
Abstract
In theoretical ML, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. The above scheme proves particularly relevant when the student network is overparameterized as compared to the teacher network. Under these operating conditions, it is tempting to speculate that the student ability to handle the given task could be eventually stored in a sub-portion of the whole network. This latter should be to some extent reminiscent of the frozen teacher structure, according to suitable metrics, while being approximately invariant across different architectures of the student candidate network. Unfortunately, state-of-the-art conventional learning techniques could not help in identifying the existence of such an invariant subnetwork, due to the inherent degree of non-convexity that characterizes the examined problem. In this work, we take a leap forward by proposing a radically different optimization scheme which builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors with negligible increase in terms of computational and complexity load, as compared to standard training algorithms. Working in this framework, we could isolate a stable student substructure, that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When pruning unimportant nodes of the trained student, as follows a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits.
An Improved Metarounding Algorithm via Frank-Wolfe
Abstract
Metarounding is an approach to convert an approximation algorithm for linear optimization over some combinatorial classes to an online linear optimization algorithm for the same class. We propose a new metarounding algorithm under a natural assumption that a relax-based approximation algorithm exists for the combinatorial class. Our algorithm is much more efficient in both theoretical and practical aspects.
Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic
Authors: Authors: Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhenisbek Assylbekov
Abstract
Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either a frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful.
Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining
Abstract
Extensive system scales (i.e. thousands of GPU/TPUs) and prolonged training periods (i.e. months of pretraining) significantly escalate the probability of failures when training large language models (LLMs). Thus, efficient and reliable fault-tolerance methods are in urgent need. Checkpointing is the primary fault-tolerance method to periodically save parameter snapshots from GPU memory to disks via CPU memory. In this paper, we identify the frequency of existing checkpoint-based fault-tolerance being significantly limited by the storage I/O overheads, which results in hefty re-training costs on restarting from the nearest checkpoint. In response to this gap, we introduce an in-memory fault-tolerance framework for large-scale LLM pretraining. The framework boosts the efficiency and reliability of fault tolerance from three aspects: (1) Reduced Data Transfer and I/O: By asynchronously caching parameters, i.e., sharded model parameters, optimizer states, and RNG states, to CPU volatile memory, Our framework significantly reduces communication costs and bypasses checkpoint I/O. (2) Enhanced System Reliability: Our framework enhances parameter protection with a two-layer hierarchy: snapshot management processes (SMPs) safeguard against software failures, together with Erasure Coding (EC) protecting against node failures. This double-layered protection greatly improves the survival probability of the parameters compared to existing checkpointing methods. (3) Improved Snapshotting Frequency: Our framework achieves more frequent snapshotting compared with asynchronous checkpointing optimizations under the same saving time budget, which improves the fault tolerance efficiency. Empirical results demonstrate that Our framework minimizes the overhead of fault tolerance of LLM pretraining by effectively leveraging redundant CPU resources.
On the Optimization and Generalization of Multi-head Attention
Abstract
The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Besides, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect the analysis can be extended to various data-model and architecture variations.
Generating Robust Adversarial Examples against Online Social Networks (OSNs)
Abstract
Online Social Networks (OSNs) have blossomed into prevailing transmission channels for images in the modern era. Adversarial examples (AEs) deliberately designed to mislead deep neural networks (DNNs) are found to be fragile against the inevitable lossy operations conducted by OSNs. As a result, the AEs would lose their attack capabilities after being transmitted over OSNs. In this work, we aim to design a new framework for generating robust AEs that can survive the OSN transmission; namely, the AEs before and after the OSN transmission both possess strong attack capabilities. To this end, we first propose a differentiable network termed SImulated OSN (SIO) to simulate the various operations conducted by an OSN. Specifically, the SIO network consists of two modules: 1) a differentiable JPEG layer for approximating the ubiquitous JPEG compression and 2) an encoder-decoder subnetwork for mimicking the remaining operations. Based upon the SIO network, we then formulate an optimization framework to generate robust AEs by enforcing model outputs with and without passing through the SIO to be both misled. Extensive experiments conducted over Facebook, WeChat and QQ demonstrate that our attack methods produce more robust AEs than existing approaches, especially under small distortion constraints; the performance gain in terms of Attack Success Rate (ASR) could be more than 60%. Furthermore, we build a public dataset containing more than 10,000 pairs of AEs processed by Facebook, WeChat or QQ, facilitating future research in the robust AEs generation. The dataset and code are available at https://github.com/csjunjun/RobustOSNAttack.git.
Capacity Limitation and Optimization Strategy for Flexible Point-to-Multi-Point Optical Networks
Authors: Authors: Ji Zhou, Haide Wang, Liangchuan Li, Weiping Liu, Changyuan Yu, Zhaohui Li
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
Point-to-multi-point (PtMP) optical networks become the main solutions for network-edge applications such as passive optical networks and radio access networks. Entropy-loading digital subcarrier multiplexing (DSCM) is the core technology to achieve low latency and approach high capacity for flexible PtMP optical networks. However, the high peak-to-average power ratio of the entropy-loading DSCM signal limits the power budget and restricts the capacity, which can be reduced effectively by clipping operation. In this paper, we derive the theoretical capacity limitation of the flexible PtMP optical networks based on the entropy-loading DSCM signal. Meanwhile, an optimal clipping ratio for the clipping operation is acquired to approach the highest capacity limitation. Based on an accurate clipping-noise model under the optimal clipping ratio, we establish a three-dimensional look-up table for bit-error ratio, spectral efficiency, and link loss. Based on the three-dimensional look-up table, an optimization strategy is proposed to acquire optimal spectral efficiencies for achieving a higher capacity of the flexible PtMP optical networks.
Curvature Aligned Simplex Gradient: Principled Sample Set Construction For Numerical Differentiation
Authors: Authors: Daniel Lengyel, Panos Parpas, Nikolas Kantas, Nicholas R. Jennings
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract
The simplex gradient, a popular numerical differentiation method due to its flexibility, lacks a principled method by which to construct the sample set, specifically the location of function evaluations. Such evaluations, especially from real-world systems, are often noisy and expensive to obtain, making it essential that each evaluation is carefully chosen to reduce cost and increase accuracy. This paper introduces the curvature aligned simplex gradient (CASG), which provably selects the optimal sample set under a mean squared error objective. As CASG requires function-dependent information often not available in practice, we additionally introduce a framework which exploits a history of function evaluations often present in practical applications. Our numerical results, focusing on applications in sensitivity analysis and derivative free optimization, show that our methodology significantly outperforms or matches the performance of the benchmark gradient estimator given by forward differences (FD) which is given exact function-dependent information that is not available in practice. Furthermore, our methodology is comparable to the performance of central differences (CD) that requires twice the number of function evaluations.
Learn from the Past: A Proxy based Adversarial Defense Framework to Boost Robustness
Abstract
In light of the vulnerability of deep learning models to adversarial samples and the ensuing security issues, a range of methods, including Adversarial Training (AT) as a prominent representative, aimed at enhancing model robustness against various adversarial attacks, have seen rapid development. However, existing methods essentially assist the current state of target model to defend against parameter-oriented adversarial attacks with explicit or implicit computation burdens, which also suffers from unstable convergence behavior due to inconsistency of optimization trajectories. Diverging from previous work, this paper reconsiders the update rule of target model and corresponding deficiency to defend based on its current state. By introducing the historical state of the target model as a proxy, which is endowed with much prior information for defense, we formulate a two-stage update rule, resulting in a general adversarial defense framework, which we refer to as `LAST' ({\bf L}earn from the P{\bf ast}). Besides, we devise a Self Distillation (SD) based defense objective to constrain the update process of the proxy model without the introduction of larger teacher models. Experimentally, we demonstrate consistent and significant performance enhancements by refining a series of single-step and multi-step AT methods (e.g., up to $\bf 9.2\%$ and $\bf 20.5\%$ improvement of Robust Accuracy (RA) on CIFAR10 and CIFAR100 datasets, respectively) across various datasets, backbones and attack modalities, and validate its ability to enhance training stability and ameliorate catastrophic overfitting issues meanwhile.
Discretize Relaxed Solution of Spectral Clustering via a Non-Heuristic Algorithm
Abstract
Spectral clustering and its extensions usually consist of two steps: (1) constructing a graph and computing the relaxed solution; (2) discretizing relaxed solutions. Although the former has been extensively investigated, the discretization techniques are mainly heuristic methods, e.g., k-means, spectral rotation. Unfortunately, the goal of the existing methods is not to find a discrete solution that minimizes the original objective. In other words, the primary drawback is the neglect of the original objective when computing the discrete solution. Inspired by the first-order optimization algorithms, we propose to develop a first-order term to bridge the original problem and discretization algorithm, which is the first non-heuristic to the best of our knowledge. Since the non-heuristic method is aware of the original graph cut problem, the final discrete solution is more reliable and achieves the preferable loss value. We also theoretically show that the continuous optimum is beneficial to discretization algorithms though simply finding its closest discrete solution is an existing heuristic algorithm which is also unreliable. Sufficient experiments significantly show the superiority of our method.
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Authors: Authors: Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
Abstract
With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning
Authors: Authors: Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Prompt-based learning has been an effective paradigm for large pretrained language models (LLM), enabling few-shot or even zero-shot learning. Black-box prompt search has received growing interest recently for its distinctive properties of gradient-free optimization, proven particularly useful and powerful for model-as-a-service usage. However, the discrete nature and the complexity of combinatorial optimization hinder the efficiency of modern black-box approaches. Despite extensive research on search algorithms, the crucial aspect of search space design and optimization has been largely overlooked. In this paper, we first conduct a sensitivity analysis by prompting LLM, revealing that only a small number of tokens exert a disproportionate amount of influence on LLM predictions. Leveraging this insight, we propose the Clustering and Pruning for Efficient Black-box Prompt Search (ClaPS), a simple black-box search method that first clusters and prunes the search space to focus exclusively on influential prompt tokens. By employing even simple search methods within the pruned search space, ClaPS achieves state-of-the-art performance across various tasks and LLMs, surpassing the performance of complex approaches while significantly reducing search costs. Our findings underscore the critical role of search space design and optimization in enhancing both the usefulness and the efficiency of black-box prompt-based learning.
Neural Degradation Representation Learning for All-In-One Image Restoration
Abstract
Existing methods have demonstrated effective performance on a single degradation type. In practical applications, however, the degradation is often unknown, and the mismatch between the model and the degradation will result in a severe performance drop. In this paper, we propose an all-in-one image restoration network that tackles multiple degradations. Due to the heterogeneous nature of different types of degradations, it is difficult to process multiple degradations in a single network. To this end, we propose to learn a neural degradation representation (NDR) that captures the underlying characteristics of various degradations. The learned NDR decomposes different types of degradations adaptively, similar to a neural dictionary that represents basic degradation components. Subsequently, we develop a degradation query module and a degradation injection module to effectively recognize and utilize the specific degradation based on NDR, enabling the all-in-one restoration ability for multiple degradations. Moreover, we propose a bidirectional optimization strategy to effectively drive NDR to learn the degradation representation by optimizing the degradation and restoration processes alternately. Comprehensive experiments on representative types of degradations (including noise, haze, rain, and downsampling) demonstrate the effectiveness and generalization capability of our method.
An Enumerative Perspective on Connectivity
Authors: Authors: Shyan Akmal
Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM)
Abstract
Connectivity (or equivalently, unweighted maximum flow) is an important measure in graph theory and combinatorial optimization. Given a graph $G$ with vertices $s$ and $t$, the connectivity $\lambda(s,t)$ from $s$ to $t$ is defined to be the maximum number of edge-disjoint paths from $s$ to $t$ in $G$. Much research has gone into designing fast algorithms for computing connectivities in graphs. Previous work showed that it is possible to compute connectivities for all pairs of vertices in directed graphs with $m$ edges in $\tilde{O}(m^\omega)$ time [Chueng, Lau, and Leung, FOCS 2011], where $\omega \in [2,2.3716)$ is the exponent of matrix multiplication. For the related problem of computing "small connectivities," it was recently shown that for any positive integer $k$, we can compute $\min(k,\lambda(s,t))$ for all pairs of vertices $(s,t)$ in a directed graph with $n$ nodes in $\tilde{O}((kn)^\omega)$ time [Akmal and Jin, ICALP 2023]. In this paper, we present an alternate exposition of these $\tilde{O}(m^\omega)$ and $\tilde{O}((kn)^\omega)$ time algorithms, with simpler proofs of correctness. Earlier proofs were somewhat indirect, introducing an elegant but ad hoc "flow vector framework" for showing correctness of these algorithms. In contrast, we observe that these algorithms for computing exact and small connectivity values can be interpreted as testing whether certain generating functions enumerating families of edge-disjoint paths are nonzero. This new perspective yields more transparent proofs, and ties the approach for these problems more closely to the literature surrounding algebraic graph algorithms.
Eureka: Human-Level Reward Design via Coding Large Language Models
Abstract
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed.
End-to-End Delay Minimization based on Joint Optimization of DNN Partitioning and Resource Allocation for Cooperative Edge Inference
Abstract
Cooperative inference in Mobile Edge Computing (MEC), achieved by deploying partitioned Deep Neural Network (DNN) models between resource-constrained user equipments (UEs) and edge servers (ESs), has emerged as a promising paradigm. Firstly, we consider scenarios of continuous Artificial Intelligence (AI) task arrivals, like the object detection for video streams, and utilize a serial queuing model for the accurate evaluation of End-to-End (E2E) delay in cooperative edge inference. Secondly, to enhance the long-term performance of inference systems, we formulate a multi-slot stochastic E2E delay optimization problem that jointly considers model partitioning and multi-dimensional resource allocation. Finally, to solve this problem, we introduce a Lyapunov-guided Multi-Dimensional Optimization algorithm (LyMDO) that decouples the original problem into per-slot deterministic problems, where Deep Reinforcement Learning (DRL) and convex optimization are used for joint optimization of partitioning decisions and complementary resource allocation. Simulation results show that our approach effectively improves E2E delay while balancing long-term resource constraints.
Demystifying the Myths and Legends of Nonconvex Convergence of SGD
Authors: Authors: Aritra Dutta, El Houcine Bergou, Soumia Boucherouite, Nicklas Werge, Melih Kandemir, Xin Li
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems with nonconvex objective functions. Although the convergence of SGDs in the (strongly) convex case is well-understood, their convergence for nonconvex functions stands on weak mathematical foundations. Most existing studies on the nonconvex convergence of SGD show the complexity results based on either the minimum of the expected gradient norm or the functional sub-optimality gap (for functions with extra structural property) by searching the entire range of iterates. Hence the last iterations of SGDs do not necessarily maintain the same complexity guarantee. This paper shows that an $\epsilon$-stationary point exists in the final iterates of SGDs, given a large enough total iteration budget, $T$, not just anywhere in the entire range of iterates -- a much stronger result than the existing one. Additionally, our analyses allow us to measure the density of the $\epsilon$-stationary points in the final iterates of SGD, and we recover the classical $O(\frac{1}{\sqrt{T}})$ asymptotic rate under various existing assumptions on the objective function and the bounds on the stochastic gradient. As a result of our analyses, we addressed certain myths and legends related to the nonconvex convergence of SGD and posed some thought-provoking questions that could set new directions for research.
Keyword: adam
Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization
Abstract
Despite their better convergence properties compared to first-order optimizers, second-order optimizers for deep learning have been less popular due to their significant computational costs. The primary efficiency bottleneck in such optimizers is matrix inverse calculations in the preconditioning step, which are expensive to compute on GPUs. In this paper, we introduce Jorge, a second-order optimizer that promises the best of both worlds -- rapid convergence benefits of second-order methods, and high computational efficiency typical of first-order methods. We address the primary computational bottleneck of computing matrix inverses by completely eliminating them using an approximation of the preconditioner computation. This makes Jorge extremely efficient on GPUs in terms of wall-clock time. Further, we describe an approach to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, thereby significantly minimizing tuning efforts. Our empirical evaluations demonstrate the distinct advantages of using Jorge, outperforming state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple deep learning models, both in terms of sample efficiency and wall-clock time.
Stochastic Average Gradient : A Simple Empirical Investigation
Authors: Authors: Pascal Junior Tikeng Notsawo
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract
Despite the recent growth of theoretical studies and empirical successes of neural networks, gradient backpropagation is still the most widely used algorithm for training such networks. On the one hand, we have deterministic or full gradient (FG) approaches that have a cost proportional to the amount of training data used but have a linear convergence rate, and on the other hand, stochastic gradient (SG) methods that have a cost independent of the size of the dataset, but have a less optimal convergence rate than the determinist approaches. To combine the cost of the stochastic approach with the convergence rate of the deterministic approach, a stochastic average gradient (SAG) has been proposed. SAG is a method for optimizing the sum of a finite number of smooth convex functions. Like SG methods, the SAG method's iteration cost is independent of the number of terms in the sum. In this work, we propose to compare SAG to some standard optimizers used in machine learning. SAG converges faster than other optimizers on simple toy problems and performs better than many other optimizers on simple machine learning problems. We also propose a combination of SAG with the momentum algorithm and Adam. These combinations allow empirically higher speed and obtain better performance than the other methods, especially when the landscape of the function to optimize presents obstacles or is ill-conditioned.
Keyword: gradient
Adjoint-based inversion of frictional parameters in earthquake modeling
Authors: Authors: Vidar Stiernström, Martin Almquist, Eric M. Dunham
Abstract
We present an adjoint-based optimization method to invert for frictional parameters used in earthquake modeling. The forward problem is linear elasticity with nonlinear rate-and-state frictional faults. The misfit functional quantifies the difference between simulated and measured particle displacements or velocities at receiver locations. The misfit may include windowing or filtering operators. We derive the corresponding adjoint problem, which is linear elasticity with linear rate-and-state friction. The gradient of the misfit is efficiently computed by convolving forward and adjoint variables on the fault. The method thus extends the framework of full-waveform inversion to include frictional faults with rate-and-state friction. In addition, we present a space-time dual-consistent discretization of a dynamic rupture problem with a rough fault in antiplane shear, using high-order accurate summation-by-parts finite differences in combination with explicit Runge-Kutta time integration. The dual consistency of the discretization ensures that the discrete adjoint-based gradient is the exact gradient of the discrete cost functional as well as a consistent approximation of the continuous gradient. Our theoretical results are corroborated by inversions with synthetic data.
Provable Guarantees for Neural Networks via Gradient Feature Learning
Abstract
Neural networks have achieved remarkable empirical performance, while the current theoretical analysis is not adequate for understanding their success, e.g., the Neural Tangent Kernel approach fails to capture their key feature learning ability, while recent analyses on feature learning are typically problem-specific. This work proposes a unified analysis framework for two-layer networks trained by gradient descent. The framework is centered around the principle of feature learning from gradients, and its effectiveness is demonstrated by applications in several prototypical problems, such as mixtures of Gaussians and parity functions. The framework also sheds light on interesting network learning phenomena such as feature learning beyond kernels and the lottery ticket hypothesis.
Low rank approximation method for perturbed linear systems with applications to elliptic type stochastic PDEs
Authors: Authors: Yujun Zhu, Ju Ming, Jie Zhu, Zhongming Wang
Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP); Optimization and Control (math.OC)
Abstract
In this paper, we propose a low rank approximation method for efficiently solving stochastic partial differential equations. Specifically, our method utilizes a novel low rank approximation of the stiffness matrices, which can significantly reduce the computational load and storage requirements associated with matrix inversion without losing accuracy. To demonstrate the versatility and applicability of our method, we apply it to address two crucial uncertainty quantification problems: stochastic elliptic equations and optimal control problems governed by stochastic elliptic PDE constraints. Based on varying dimension reduction ratios, our algorithm exhibits the capability to yield a high precision numerical solution for stochastic partial differential equations, or provides a rough representation of the exact solutions as a pre-processing phase. Meanwhile, our algorithm for solving stochastic optimal control problems allows a diverse range of gradient-based unconstrained optimization methods, rendering it particularly appealing for computationally intensive large-scale problems. Numerical experiments are conducted and the results provide strong validation of the feasibility and effectiveness of our algorithm.
Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping
Authors: Authors: Zijie Pan, Jiachen Lu, Xiatian Zhu, Li Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
High-resolution 3D object generation remains a challenging task primarily due to the limited availability of comprehensive annotated training data. Recent advancements have aimed to overcome this constraint by harnessing image generative models, pretrained on extensive curated web datasets, using knowledge transfer techniques like Score Distillation Sampling (SDS). Efficiently addressing the requirements of high-resolution rendering often necessitates the adoption of latent representation-based models, such as the Latent Diffusion Model (LDM). In this framework, a significant challenge arises: To compute gradients for individual image pixels, it is necessary to backpropagate gradients from the designated latent space through the frozen components of the image model, such as the VAE encoder used within LDM. However, this gradient propagation pathway has never been optimized, remaining uncontrolled during training. We find that the unregulated gradients adversely affect the 3D model's capacity in acquiring texture-related information from the image generative model, leading to poor quality appearance synthesis. To address this overarching challenge, we propose an innovative operation termed Pixel-wise Gradient Clipping (PGC) designed for seamless integration into existing 3D generative models, thereby enhancing their synthesis quality. Specifically, we control the magnitude of stochastic gradients by clipping the pixel-wise gradients efficiently, while preserving crucial texture-related gradient directions. Despite this simplicity and minimal extra cost, extensive experiments demonstrate the efficacy of our PGC in enhancing the performance of existing 3D generative models for high-resolution object rendering.
Solving Expensive Optimization Problems in Dynamic Environments with Meta-learning
Authors: Authors: Huan Zhang, Jinliang Ding, Liang Feng, Kay Chen Tan, Ke Li
Subjects: Neural and Evolutionary Computing (cs.NE)
Abstract
Dynamic environments pose great challenges for expensive optimization problems, as the objective functions of these problems change over time and thus require remarkable computational resources to track the optimal solutions. Although data-driven evolutionary optimization and Bayesian optimization (BO) approaches have shown promise in solving expensive optimization problems in static environments, the attempts to develop such approaches in dynamic environments remain rarely unexplored. In this paper, we propose a simple yet effective meta-learning-based optimization framework for solving expensive dynamic optimization problems. This framework is flexible, allowing any off-the-shelf continuously differentiable surrogate model to be used in a plug-in manner, either in data-driven evolutionary optimization or BO approaches. In particular, the framework consists of two unique components: 1) the meta-learning component, in which a gradient-based meta-learning approach is adopted to learn experience (effective model parameters) across different dynamics along the optimization process. 2) the adaptation component, where the learned experience (model parameters) is used as the initial parameters for fast adaptation in the dynamic environment based on few shot samples. By doing so, the optimization process is able to quickly initiate the search in a new environment within a strictly restricted computational budget. Experiments demonstrate the effectiveness of the proposed algorithm framework compared to several state-of-the-art algorithms on common benchmark test problems under different dynamic characteristics.
Multilevel Picard algorithm for general semilinear parabolic PDEs with gradient-dependent nonlinearities
Authors: Authors: Ariel Neufeld, Sizhou Wu
Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP); Probability (math.PR)
Abstract
In this paper we introduce a multilevel Picard approximation algorithm for general semilinear parabolic PDEs with gradient-dependent nonlinearities whose coefficient functions do not need to be constant. We also provide a full convergence and complexity analysis of our algorithm. To obtain our main results, we consider a particular stochastic fixed-point equation (SFPE) motivated by the Feynman-Kac representation and the Bismut-Elworthy-Li formula. We show that the PDE under consideration has a unique viscosity solution which coincides with the first component of the unique solution of the stochastic fixed-point equation. Moreover, if the PDE admits a strong solution, then the gradient of the unique solution of the PDE coincides with the second component of the unique solution of the stochastic fixed-point equation.
Explanation-Based Training with Differentiable Insertion/Deletion Metric-Aware Regularizers
Abstract
The quality of explanations for the predictions of complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how correctly the explanations reflect the predictor's behavior. To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both insertion and deletion scores of the explanations while keeping their predictive accuracy. Since the original insertion and deletion metrics are indifferentiable with respect to the explanations and directly unavailable for gradient-based optimization, we extend the metrics to be differentiable and use them to formalize insertion and deletion metric-based regularizers. The experimental results on image and tabular datasets show that the deep neural networks-based predictors fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easy-to-interpret explanations while keeping high predictive accuracy.
How a student becomes a teacher: learning and forgetting through Spectral methods
Authors: Authors: Lorenzo Giambagli, Lorenzo Buffoni, Lorenzo Chicchi, Duccio Fanelli
Abstract
In theoretical ML, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. The above scheme proves particularly relevant when the student network is overparameterized as compared to the teacher network. Under these operating conditions, it is tempting to speculate that the student ability to handle the given task could be eventually stored in a sub-portion of the whole network. This latter should be to some extent reminiscent of the frozen teacher structure, according to suitable metrics, while being approximately invariant across different architectures of the student candidate network. Unfortunately, state-of-the-art conventional learning techniques could not help in identifying the existence of such an invariant subnetwork, due to the inherent degree of non-convexity that characterizes the examined problem. In this work, we take a leap forward by proposing a radically different optimization scheme which builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors with negligible increase in terms of computational and complexity load, as compared to standard training algorithms. Working in this framework, we could isolate a stable student substructure, that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When pruning unimportant nodes of the trained student, as follows a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits.
Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic
Authors: Authors: Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhenisbek Assylbekov
Abstract
Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either a frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful.
Neural networks for insurance pricing with frequency and severity data: a benchmark study from data preprocessing to technical tariff
Abstract
Insurers usually turn to generalized linear models for modelling claim frequency and severity data. Due to their success in other fields, machine learning techniques are gaining popularity within the actuarial toolbox. Our paper contributes to the literature on frequency-severity insurance pricing with machine learning via deep learning structures. We present a benchmark study on four insurance data sets with frequency and severity targets in the presence of multiple types of input features. We compare in detail the performance of: a generalized linear model on binned input data, a gradient-boosted tree model, a feed-forward neural network (FFNN), and the combined actuarial neural network (CANN). Our CANNs combine a baseline prediction established with a GLM and GBM, respectively, with a neural network correction. We explain the data preprocessing steps with specific focus on the multiple types of input features typically present in tabular insurance data sets, such as postal codes, numeric and categorical covariates. Autoencoders are used to embed the categorical variables into the neural network and we explore their potential advantages in a frequency-severity setting. Finally, we construct global surrogate models for the neural nets' frequency and severity models. These surrogates enable the translation of the essential insights captured by the FFNNs or CANNs to GLMs. As such, a technical tariff table results that can easily be deployed in practice.
On the Optimization and Generalization of Multi-head Attention
Abstract
The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Besides, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect the analysis can be extended to various data-model and architecture variations.
Discrete-to-Continuum Rates of Convergence for $p$-Laplacian Regularization
Authors: Authors: Adrien Weihs, Jalal Fadili, Matthew Thorpe
Abstract
Higher-order regularization problem formulations are popular frameworks used in machine learning, inverse problems and image/signal processing. In this paper, we consider the computational problem of finding the minimizer of the Sobolev $\mathrm{W}^{1,p}$ semi-norm with a data-fidelity term. We propose a discretization procedure and prove convergence rates between our numerical solution and the target function. Our approach consists of discretizing an appropriate gradient flow problem in space and time. The space discretization is a nonlocal approximation of the p-Laplacian operator and our rates directly depend on the localization parameter $\epsilon_n$ and the time mesh-size $\tau_n$. We precisely characterize the asymptotic behaviour of $\epsilon_n$ and $\tau_n$ in order to ensure convergence to the considered minimizer. Finally, we apply our results to the setting of random graph models.
Representation Learning via Consistent Assignment of Views over Random Partitions
Abstract
We present Consistent Assignment of Views over Random Partitions (CARP), a self-supervised clustering method for representation learning of visual features. CARP learns prototypes in an end-to-end online fashion using gradient descent without additional non-differentiable modules to solve the cluster assignment problem. CARP optimizes a new pretext task based on random partitions of prototypes that regularizes the model and enforces consistency between views' assignments. Additionally, our method improves training stability and prevents collapsed solutions in joint-embedding training. Through an extensive evaluation, we demonstrate that CARP's representations are suitable for learning downstream tasks. We evaluate CARP's representations capabilities in 17 datasets across many standard protocols, including linear evaluation, few-shot classification, k-NN, k-means, image retrieval, and copy detection. We compare CARP performance to 11 existing self-supervised methods. We extensively ablate our method and demonstrate that our proposed random partition pretext task improves the quality of the learned representations by devising multiple random classification tasks. In transfer learning tasks, CARP achieves the best performance on average against many SSL methods trained for a longer time.
Curvature Aligned Simplex Gradient: Principled Sample Set Construction For Numerical Differentiation
Authors: Authors: Daniel Lengyel, Panos Parpas, Nikolas Kantas, Nicholas R. Jennings
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract
The simplex gradient, a popular numerical differentiation method due to its flexibility, lacks a principled method by which to construct the sample set, specifically the location of function evaluations. Such evaluations, especially from real-world systems, are often noisy and expensive to obtain, making it essential that each evaluation is carefully chosen to reduce cost and increase accuracy. This paper introduces the curvature aligned simplex gradient (CASG), which provably selects the optimal sample set under a mean squared error objective. As CASG requires function-dependent information often not available in practice, we additionally introduce a framework which exploits a history of function evaluations often present in practical applications. Our numerical results, focusing on applications in sensitivity analysis and derivative free optimization, show that our methodology significantly outperforms or matches the performance of the benchmark gradient estimator given by forward differences (FD) which is given exact function-dependent information that is not available in practice. Furthermore, our methodology is comparable to the performance of central differences (CD) that requires twice the number of function evaluations.
Stochastic Average Gradient : A Simple Empirical Investigation
Authors: Authors: Pascal Junior Tikeng Notsawo
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract
Despite the recent growth of theoretical studies and empirical successes of neural networks, gradient backpropagation is still the most widely used algorithm for training such networks. On the one hand, we have deterministic or full gradient (FG) approaches that have a cost proportional to the amount of training data used but have a linear convergence rate, and on the other hand, stochastic gradient (SG) methods that have a cost independent of the size of the dataset, but have a less optimal convergence rate than the determinist approaches. To combine the cost of the stochastic approach with the convergence rate of the deterministic approach, a stochastic average gradient (SAG) has been proposed. SAG is a method for optimizing the sum of a finite number of smooth convex functions. Like SG methods, the SAG method's iteration cost is independent of the number of terms in the sum. In this work, we propose to compare SAG to some standard optimizers used in machine learning. SAG converges faster than other optimizers on simple toy problems and performs better than many other optimizers on simple machine learning problems. We also propose a combination of SAG with the momentum algorithm and Adam. These combinations allow empirically higher speed and obtain better performance than the other methods, especially when the landscape of the function to optimize presents obstacles or is ill-conditioned.
Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning
Authors: Authors: Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Prompt-based learning has been an effective paradigm for large pretrained language models (LLM), enabling few-shot or even zero-shot learning. Black-box prompt search has received growing interest recently for its distinctive properties of gradient-free optimization, proven particularly useful and powerful for model-as-a-service usage. However, the discrete nature and the complexity of combinatorial optimization hinder the efficiency of modern black-box approaches. Despite extensive research on search algorithms, the crucial aspect of search space design and optimization has been largely overlooked. In this paper, we first conduct a sensitivity analysis by prompting LLM, revealing that only a small number of tokens exert a disproportionate amount of influence on LLM predictions. Leveraging this insight, we propose the Clustering and Pruning for Efficient Black-box Prompt Search (ClaPS), a simple black-box search method that first clusters and prunes the search space to focus exclusively on influential prompt tokens. By employing even simple search methods within the pruned search space, ClaPS achieves state-of-the-art performance across various tasks and LLMs, surpassing the performance of complex approaches while significantly reducing search costs. Our findings underscore the critical role of search space design and optimization in enhancing both the usefulness and the efficiency of black-box prompt-based learning.
Model Merging by Uncertainty-Based Gradient Matching
Authors: Authors: Nico Daheim, Thomas Möllenhoff, Edoardo Maria Ponti, Iryna Gurevych, Mohammad Emtiyaz Khan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract
Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters.
Eureka: Human-Level Reward Design via Coding Large Language Models
Abstract
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed.
Demystifying the Myths and Legends of Nonconvex Convergence of SGD
Authors: Authors: Aritra Dutta, El Houcine Bergou, Soumia Boucherouite, Nicklas Werge, Melih Kandemir, Xin Li
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems with nonconvex objective functions. Although the convergence of SGDs in the (strongly) convex case is well-understood, their convergence for nonconvex functions stands on weak mathematical foundations. Most existing studies on the nonconvex convergence of SGD show the complexity results based on either the minimum of the expected gradient norm or the functional sub-optimality gap (for functions with extra structural property) by searching the entire range of iterates. Hence the last iterations of SGDs do not necessarily maintain the same complexity guarantee. This paper shows that an $\epsilon$-stationary point exists in the final iterates of SGDs, given a large enough total iteration budget, $T$, not just anywhere in the entire range of iterates -- a much stronger result than the existing one. Additionally, our analyses allow us to measure the density of the $\epsilon$-stationary points in the final iterates of SGD, and we recover the classical $O(\frac{1}{\sqrt{T}})$ asymptotic rate under various existing assumptions on the objective function and the bounds on the stochastic gradient. As a result of our analyses, we addressed certain myths and legends related to the nonconvex convergence of SGD and posed some thought-provoking questions that could set new directions for research.
Variational Inference for SDEs Driven by Fractional Noise
Abstract
We present a novel variational framework for performing inference in (neural) stochastic differential equations (SDEs) driven by Markov-approximate fractional Brownian motion (fBM). SDEs offer a versatile tool for modeling real-world continuous-time dynamic systems with inherent noise and randomness. Combining SDEs with the powerful inference capabilities of variational methods, enables the learning of representative function distributions through stochastic gradient descent. However, conventional SDEs typically assume the underlying noise to follow a Brownian motion (BM), which hinders their ability to capture long-term dependencies. In contrast, fractional Brownian motion (fBM) extends BM to encompass non-Markovian dynamics, but existing methods for inferring fBM parameters are either computationally demanding or statistically inefficient. In this paper, building upon the Markov approximation of fBM, we derive the evidence lower bound essential for efficient variational inference of posterior path measures, drawing from the well-established field of stochastic analysis. Additionally, we provide a closed-form expression to determine optimal approximation coefficients. Furthermore, we propose the use of neural networks to learn the drift, diffusion and control terms within our variational posterior, leading to the variational training of neural-SDEs. In this framework, we also optimize the Hurst index, governing the nature of our fractional noise. Beyond validation on synthetic data, we contribute a novel architecture for variational latent video prediction,-an approach that, to the best of our knowledge, enables the first variational neural-SDE application to video perception.
Keyword: sgd
Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization
Demystifying the Myths and Legends of Nonconvex Convergence of SGD
Keyword: optimization
Balancing exploration and exploitation phases in whale optimization algorithm: an insightful and empirical analysis
Adjoint-based inversion of frictional parameters in earthquake modeling
Asynchronous Distributed Smoothing and Mapping via On-Manifold Consensus ADMM
New Environment Adaptation with Few Shots for OFDM Receiver and mmWave Beamforming
Quantum Computing for MIMO Beam Selection Problem: Model and Optical Experimental Solution
Low rank approximation method for perturbed linear systems with applications to elliptic type stochastic PDEs
Reconfigurable Intelligent Surface Assisted High-Speed Train Communications: Coverage Performance Analysis and Placement Optimization
Auto Search Indexer for End-to-End Document Retrieval
Parallel Bayesian Optimization Using Satisficing Thompson Sampling for Time-Sensitive Black-Box Optimization
Solving Expensive Optimization Problems in Dynamic Environments with Meta-learning
Large Language Model for Multi-objective Evolutionary Optimization
Explanation-Based Training with Differentiable Insertion/Deletion Metric-Aware Regularizers
GMEM: Generalized Memory Management for Peripheral Devices
Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark
Quantum computing through the lens of control: A tutorial introduction
How a student becomes a teacher: learning and forgetting through Spectral methods
An Improved Metarounding Algorithm via Frank-Wolfe
Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic
Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining
On the Optimization and Generalization of Multi-head Attention
Generating Robust Adversarial Examples against Online Social Networks (OSNs)
Capacity Limitation and Optimization Strategy for Flexible Point-to-Multi-Point Optical Networks
Curvature Aligned Simplex Gradient: Principled Sample Set Construction For Numerical Differentiation
Learn from the Past: A Proxy based Adversarial Defense Framework to Boost Robustness
Discretize Relaxed Solution of Spectral Clustering via a Non-Heuristic Algorithm
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning
Neural Degradation Representation Learning for All-In-One Image Restoration
An Enumerative Perspective on Connectivity
Eureka: Human-Level Reward Design via Coding Large Language Models
End-to-End Delay Minimization based on Joint Optimization of DNN Partitioning and Resource Allocation for Cooperative Edge Inference
Demystifying the Myths and Legends of Nonconvex Convergence of SGD
Keyword: adam
Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization
Stochastic Average Gradient : A Simple Empirical Investigation
Keyword: gradient
Adjoint-based inversion of frictional parameters in earthquake modeling
Provable Guarantees for Neural Networks via Gradient Feature Learning
Low rank approximation method for perturbed linear systems with applications to elliptic type stochastic PDEs
Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping
Solving Expensive Optimization Problems in Dynamic Environments with Meta-learning
Multilevel Picard algorithm for general semilinear parabolic PDEs with gradient-dependent nonlinearities
Explanation-Based Training with Differentiable Insertion/Deletion Metric-Aware Regularizers
How a student becomes a teacher: learning and forgetting through Spectral methods
Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic
Neural networks for insurance pricing with frequency and severity data: a benchmark study from data preprocessing to technical tariff
On the Optimization and Generalization of Multi-head Attention
Discrete-to-Continuum Rates of Convergence for $p$-Laplacian Regularization
Representation Learning via Consistent Assignment of Views over Random Partitions
Curvature Aligned Simplex Gradient: Principled Sample Set Construction For Numerical Differentiation
Stochastic Average Gradient : A Simple Empirical Investigation
Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning
Model Merging by Uncertainty-Based Gradient Matching
Eureka: Human-Level Reward Design via Coding Large Language Models
Demystifying the Myths and Legends of Nonconvex Convergence of SGD
Variational Inference for SDEs Driven by Fractional Noise
Keyword: super-resolution
There is no result