New submissions for Mon, 27 Nov 23

Keyword: sgd

A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer

Authors: Chenyang Li, Zhao Song, Weixin Wang, Chiwun Yang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2311.13624
Pdf link: https://arxiv.org/pdf/2311.13624
Abstract The Deep Leakage from Gradient (DLG) attack has emerged as a prevalent and highly effective method for extracting sensitive training data by inspecting exchanged gradients. This approach poses a substantial threat to the privacy of individuals and organizations alike. This research presents a comprehensive analysis of the gradient leakage method when applied specifically to transformer-based models. Through meticulous examination, we showcase the capability to accurately recover data solely from gradients and rigorously investigate the conditions under which gradient attacks can be executed, providing compelling evidence. Furthermore, we reevaluate the approach of introducing additional noise on gradients as a protective measure against gradient attacks. To address this, we outline a theoretical proof that analyzes the associated privacy costs within the framework of differential privacy. Additionally, we affirm the convergence of the Stochastic Gradient Descent (SGD) algorithm under perturbed gradients. The primary objective of this study is to augment the understanding of gradient leakage attack and defense strategies while actively contributing to the development of privacy-preserving techniques specifically tailored for transformer-based models. By shedding light on the vulnerabilities and countermeasures associated with gradient leakage, this research aims to foster advancements in safeguarding sensitive data and upholding privacy in the context of transformer-based models.
Sample as You Infer: Predictive Coding With Langevin Dynamics
Authors: Umais Zahid, Qinghai Guo, Zafeirios Fountas
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2311.13664
Pdf link: https://arxiv.org/pdf/2311.13664
Abstract We present a novel algorithm for parameter learning in generic deep generative models that builds upon the predictive coding (PC) framework of computational neuroscience. Our approach modifies the standard PC algorithm to bring performance on par with or exceeding that obtained from standard variational auto-encoder (VAE) training. By injecting Gaussian noise into the PC inference procedure we re-envision it as an overdamped Langevin sampling, which facilitates optimisation with respect to a tight evidence lower bound (ELBO). We improve the resultant encoder-free training method by incorporating an encoder network to provide an amortised warm-start to our Langevin sampling and test three different objectives for doing so. Finally, to increase robustness to the sampling step size and reduce sensitivity to curvature, we validate a lightweight and easily computable form of preconditioning, inspired by Riemann Manifold Langevin and adaptive optimizers from the SGD literature. We compare against VAEs by training like-for-like generative models using our technique against those trained with standard reparameterisation-trick-based ELBOs. We observe that our method outperforms or matches performance across a number of metrics, including sample quality, while converging in a fraction of the number of SGD training iterations.
DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release
Authors: Jie Fu, Qingqing Ye, Haibo Hu, Zhili Chen, Lulu Wang, Kuncan Wang, Ran Xun
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2311.14056
Pdf link: https://arxiv.org/pdf/2311.14056
Abstract Machine learning models are known to memorize private data to reduce their training loss, which can be inadvertently exploited by privacy attacks such as model inversion and membership inference. To protect against these attacks, differential privacy (DP) has become the de facto standard for privacy-preserving machine learning, particularly those popular training algorithms using stochastic gradient descent, such as DPSGD. Nonetheless, DPSGD still suffers from severe utility loss due to its slow convergence. This is partially caused by the random sampling, which brings bias and variance to the gradient, and partially by the Gaussian noise, which leads to fluctuation of gradient updates. Our key idea to address these issues is to apply selective updates to the model training, while discarding those useless or even harmful updates. Motivated by this, this paper proposes DPSUR, a Differentially Private training framework based on Selective Updates and Release, where the gradient from each iteration is evaluated based on a validation test, and only those updates leading to convergence are applied to the model. As such, DPSUR ensures the training in the right direction and thus can achieve faster convergence than DPSGD. The main challenges lie in two aspects -- privacy concerns arising from gradient evaluation, and gradient selection strategy for model update. To address the challenges, DPSUR introduces a clipping strategy for update randomization and a threshold mechanism for gradient selection. Experiments conducted on MNIST, FMNIST, CIFAR-10, and IMDB datasets show that DPSUR significantly outperforms previous works in terms of convergence speed and model utility.
Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation
Authors: Markus Gross, Arne P. Raulf, Christoph Räth
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
Arxiv link: https://arxiv.org/abs/2311.14120
Pdf link: https://arxiv.org/pdf/2311.14120
Abstract We investigate the stationary (late-time) training regime of single- and two-layer linear neural networks within the continuum limit of stochastic gradient descent (SGD) for synthetic Gaussian data. In the case of a single-layer network in the weakly oversampled regime, the spectrum of the noise covariance matrix deviates notably from the Hessian, which can be attributed to the broken detailed balance of SGD dynamics. The weight fluctuations are in this case generally anisotropic, but experience an isotropic loss. For a two-layer network, we obtain the stochastic dynamics of the weights in each layer and analyze the associated stationary covariances. We identify the inter-layer coupling as a new source of anisotropy for the weight fluctuations. In contrast to the single-layer case, the weight fluctuations experience an anisotropic loss, the flatness of which is inversely related to the fluctuation variance. We thereby provide an analytical derivation of the recently observed inverse variance-flatness relation in a deep linear network model.
Risk Bounds of Accelerated SGD for Overparameterized Linear Regression
Authors: Xuheng Li, Yihe Deng, Jingfeng Wu, Dongruo Zhou, Quanquan Gu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.14222
Pdf link: https://arxiv.org/pdf/2311.14222
Abstract Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.
DP-NMT: Scalable Differentially-Private Machine Translation
Authors: Timour Igamberdiev, Doan Nam Long Vu, Felix Künnecke, Zhuo Yu, Jannik Holmer, Ivan Habernal
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2311.14465
Pdf link: https://arxiv.org/pdf/2311.14465
Abstract Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
Efficient Gradient Estimation via Adaptive Sampling and Importance Sampling
Authors: Corentin Salaün, Xingchang Huang, Iliyan Georgiev, Niloy J. Mitra, Gurprit Singh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14468
Pdf link: https://arxiv.org/pdf/2311.14468
Abstract Machine learning problems rely heavily on stochastic gradient descent (SGD) for optimization. The effectiveness of SGD is contingent upon accurately estimating gradients from a mini-batch of data samples. Instead of the commonly used uniform sampling, adaptive or importance sampling reduces noise in gradient estimation by forming mini-batches that prioritize crucial data points. Previous research has suggested that data points should be selected with probabilities proportional to their gradient norm. Nevertheless, existing algorithms have struggled to efficiently integrate importance sampling into machine learning frameworks. In this work, we make two contributions. First, we present an algorithm that can incorporate existing importance functions into our framework. Second, we propose a simplified importance function that relies solely on the loss gradient of the output layer. By leveraging our proposed gradient estimation techniques, we observe improved convergence in classification and regression tasks with minimal computational overhead. We validate the effectiveness of our adaptive and importance-sampling approach on image and point-cloud datasets.
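The idea of selecting data points with probability proportional to their gradient norm can be illustrated with a minimal sketch. This is not the authors' algorithm or their simplified importance function; it only shows the generic importance-sampled mini-batch estimator, in which each sampled gradient is reweighted by 1 / (N * p_i) so the estimate of the mean gradient stays unbiased. The function names and the plain-list gradient representation are assumptions made for illustration.

```python
import math
import random


def grad_norm(grad):
    """Euclidean norm of one per-sample gradient (a list of floats)."""
    return math.sqrt(sum(x * x for x in grad))


def sampling_probabilities(per_sample_grads):
    """Probabilities proportional to gradient norm; uniform if all norms are zero."""
    norms = [grad_norm(g) for g in per_sample_grads]
    total = sum(norms)
    n = len(norms)
    if total == 0.0:
        return [1.0 / n] * n
    return [v / total for v in norms]


def importance_sampled_gradient(per_sample_grads, batch_size, rng):
    """Estimate the mean gradient from a non-uniformly sampled mini-batch.

    Each drawn gradient is weighted by 1 / (N * p_i * batch_size), which
    makes the estimator unbiased for the mean over all N samples.
    """
    n = len(per_sample_grads)
    probs = sampling_probabilities(per_sample_grads)
    # Samples with zero probability are never drawn, so no division by zero.
    indices = rng.choices(range(n), weights=probs, k=batch_size)
    dim = len(per_sample_grads[0])
    estimate = [0.0] * dim
    for i in indices:
        weight = 1.0 / (n * probs[i] * batch_size)
        for d in range(dim):
            estimate[d] += weight * per_sample_grads[i][d]
    return estimate


if __name__ == "__main__":
    grads = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
    print(importance_sampled_gradient(grads, batch_size=4, rng=random.Random(0)))
```

In practice the per-sample gradient norms are not available for free, which is the efficiency problem the abstract refers to; this sketch simply assumes they are given.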
Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach
Authors: Xinwei Zhang, Zhiqi Bu, Zhiwei Steven Wu, Mingyi Hong
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2311.14632
Pdf link: https://arxiv.org/pdf/2311.14632
Abstract Differentially Private Stochastic Gradient Descent with gradient clipping (DPSGD-GC) is a powerful tool for training deep learning models using sensitive data, providing both a solid theoretical privacy guarantee and high efficiency. However, using DPSGD-GC to ensure Differential Privacy (DP) comes at the cost of model performance degradation due to DP noise injection and gradient clipping. Existing research has extensively analyzed the theoretical convergence of DPSGD-GC, and has shown that it only converges when using large clipping thresholds that are dependent on problem-specific parameters. Unfortunately, these parameters are often unknown in practice, making it hard to choose the optimal clipping threshold. Therefore, in practice, DPSGD-GC suffers from degraded performance due to the constant bias introduced by the clipping. In our work, we propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC, which not only offers a diminishing utility bound without inducing a constant clipping bias, but more importantly, it allows for an arbitrary choice of clipping threshold that is independent of the problem. We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on Rényi DP. Additionally, we demonstrate that under mild conditions, our algorithm can achieve nearly the same utility bound as DPSGD without gradient clipping. Our empirical results on CIFAR-10/100 and E2E datasets show that the proposed algorithm achieves higher accuracies than DPSGD while maintaining the same level of DP guarantee.
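For readers unfamiliar with the baseline being criticised, the following is a minimal sketch of one DPSGD-GC update: clip each per-sample gradient to a norm threshold, sum, add Gaussian noise scaled to that threshold, then average and step. It is the commonly described baseline, not the error-feedback algorithm proposed in the paper, and it performs no privacy accounting. All names (`clip_to_norm`, `dpsgd_gc_step`, `noise_multiplier`) are illustrative assumptions.

```python
import math
import random


def clip_to_norm(grad, clip_threshold):
    """Scale grad down so its Euclidean norm is at most clip_threshold."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, clip_threshold / norm) if norm > 0.0 else 1.0
    return [x * scale for x in grad]


def dpsgd_gc_step(params, per_sample_grads, lr, clip_threshold, noise_multiplier, rng):
    """One gradient step with per-sample clipping and Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_threshold,
    added to the sum of clipped gradients before averaging.
    """
    batch = len(per_sample_grads)
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_sample_grads:
        clipped = clip_to_norm(grad, clip_threshold)
        for d in range(dim):
            summed[d] += clipped[d]
    sigma = noise_multiplier * clip_threshold
    noisy_mean = [(summed[d] + rng.gauss(0.0, sigma)) / batch for d in range(dim)]
    return [params[d] - lr * noisy_mean[d] for d in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(dpsgd_gc_step([0.0, 0.0], grads, lr=0.1, clip_threshold=1.0,
                        noise_multiplier=1.0, rng=rng))
```

The clipping bias discussed in the abstract is visible here: the first gradient `[3.0, 4.0]` is shrunk to `[0.6, 0.8]`, so the averaged update no longer points along the true mean gradient even when the noise is zero.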
Keyword: optimization

BackboneLearn: A Library for Scaling Mixed-Integer Optimization-Based Machine Learning
Authors: Vassilis Digalakis Jr, Christos Ziakas
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.13695
Pdf link: https://arxiv.org/pdf/2311.13695
Abstract We present BackboneLearn: an open-source software package and framework for scaling mixed-integer optimization (MIO) problems with indicator variables to high-dimensional problems. This optimization paradigm can naturally be used to formulate fundamental problems in interpretable supervised learning (e.g., sparse regression and decision trees), in unsupervised learning (e.g., clustering), and beyond; BackboneLearn solves the aforementioned problems faster than exact methods and with higher accuracy than commonly used heuristics. The package is built in Python and is user-friendly and easily extensible: users can directly implement a backbone algorithm for their MIO problem at hand. The source code of BackboneLearn is available on GitHub (link: https://github.com/chziakas/backbone_learn).
Nova$^+$: Generative Language Models for Binaries
Authors: Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, Xiangyu Zhang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2311.13721
Pdf link: https://arxiv.org/pdf/2311.13721
Abstract Generative large language models (LLMs) pre-trained on code have shown impressive effectiveness in code generation, program repair, and document analysis. However, existing generative LLMs focus on source code and are not specialized for binaries. There are three main challenges for LLMs to model and learn binary code: hexadecimal values, complex global dependencies, and compiler optimization levels. To bring the benefit of LLMs to the binary domain, we develop Nova and Nova$^+$, which are LLMs pre-trained on binary corpora. Nova is pre-trained with the standard language modeling task, showing significantly better capability on five benchmarks for three downstream tasks: binary code similarity detection (BCSD), binary code translation (BCT), and binary code recovery (BCR), over GPT-3.5 and other existing techniques. We build Nova$^+$ to further boost Nova using two new pre-training tasks, i.e., optimization generation and optimization level prediction, which are designed to learn binary optimization and align equivalent binaries. Nova$^+$ shows overall the best performance for all three downstream tasks on five benchmarks, demonstrating the contributions of the new pre-training tasks.
Federated Learning Assisted Distributed Energy Optimization
Authors: Yuhan Du, Nuno Mendes, Simin Rasouli, Javad Mohammadi, Pedro Moura
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2311.13785
Pdf link: https://arxiv.org/pdf/2311.13785
Abstract The increased penetration of distributed energy resources and the adoption of sensing and control technologies are driving the transition from our current centralized electric grid to a distributed system controlled by multiple entities (agents). The Transactive Energy Community (TEC) serves as an established example of this transition. Distributed energy management approaches can effectively address the scalability, resilience, and privacy requirements of the evolving grid. In this context, the accuracy of agents' estimations becomes crucial for the performance of distributed and multi-agent decision-making paradigms. This paper specifically focuses on integrating Federated Learning (FL) with the multi-agent energy management procedure. FL is utilized to forecast agents' local energy generation and demand, aiming to accelerate the convergence of the distributed decision-making process. To enhance energy aggregation in TECs, we propose an FL-assisted distributed Consensus + Innovations approach. The results demonstrate that employing FL significantly reduces errors in predicting net power demand. The improved forecast accuracy, in turn, introduces less error in the distributed optimization process, thereby enhancing its convergence behavior.
Posterior Distillation Sampling
Authors: Juil Koo, Chanho Park, Minhyuk Sung
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2311.13831
Pdf link: https://arxiv.org/pdf/2311.13831
Abstract We introduce Posterior Distillation Sampling (PDS), a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods, which leverage the powerful 2D prior of diffusion models to handle various parametric images, have mainly focused on generation. Unlike generation, editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space, we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target, enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source's identity. We demonstrate that this optimization resembles running a generative process with the target attribute, but aligning this process with the trajectory of the source's generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces.
Exact Combinatorial Optimization with Temporo-Attentional Graph Neural Networks
Authors: Mehdi Seyfi, Amin Banitalebi-Dehkordi, Zirui Zhou, Yong Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS)
Arxiv link: https://arxiv.org/abs/2311.13843
Pdf link: https://arxiv.org/pdf/2311.13843
Abstract Combinatorial optimization finds an optimal solution within a discrete set of variables and constraints. The field has seen tremendous progress both in research and industry. With the success of deep learning in the past decade, a recent trend in combinatorial optimization has been to improve state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning (ML) models. In this paper, we investigate two essential aspects of machine learning algorithms for combinatorial optimization: temporal characteristics and attention. We argue that for the task of variable selection in the branch-and-bound (B&B) algorithm, incorporating the temporal information as well as the bipartite graph attention improves the solver's performance. We support our claims with intuitions and numerical results over several standard datasets used in the literature and competitions. Code is available at: https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=047c6cf2-8463-40d7-b92f-7b2ca998e935
A Deep Reinforcement Learning Approach for Improving Age of Information in Mission-Critical IoT
Authors: Hossam Farag, Mikael Gidlund, Cedomir Stefanovic
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2311.13861
Pdf link: https://arxiv.org/pdf/2311.13861
Abstract The emerging mission-critical Internet of Things (IoT) plays a vital role in remote healthcare, haptic interaction, and industrial automation, where timely delivery of status updates is crucial. The Age of Information (AoI) is an effective metric to capture and evaluate information freshness at the destination. A system design based solely on the optimization of the average AoI might not be adequate to capture the requirements of mission-critical applications, since averaging eliminates the effects of extreme events. In this paper, we introduce a Deep Reinforcement Learning (DRL)-based algorithm to improve AoI in mission-critical IoT applications. The objective is to minimize an AoI-based metric consisting of the weighted sum of the average AoI and the probability of exceeding an AoI threshold. We utilize the actor-critic method to train the algorithm to obtain an optimized scheduling policy that solves the formulated problem. The performance of our proposed method is evaluated in a simulated setup and the results show a significant improvement in terms of the average AoI and the AoI violation probability compared to related work.
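The objective described above, a weighted sum of the average AoI and the probability of exceeding a threshold, can be sketched as an empirical quantity computed from a trace of AoI values. The abstract does not give the exact weighting, so the convex combination `weight * average + (1 - weight) * violation_probability` and the name `aoi_objective` are assumptions for illustration only.

```python
def aoi_objective(aoi_trace, threshold, weight):
    """Empirical AoI cost over a trace of per-slot AoI values.

    weight in [0, 1] trades off the average AoI against the fraction of
    slots in which the AoI exceeds the threshold (hypothetical weighting).
    """
    if not aoi_trace:
        raise ValueError("aoi_trace must not be empty")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    average_aoi = sum(aoi_trace) / len(aoi_trace)
    violation_probability = sum(1 for a in aoi_trace if a > threshold) / len(aoi_trace)
    return weight * average_aoi + (1.0 - weight) * violation_probability


if __name__ == "__main__":
    # Two traces with the same average AoI but different tail behaviour.
    steady = [3, 3, 3, 3]
    bursty = [1, 1, 1, 9]
    print(aoi_objective(steady, threshold=5, weight=0.5))
    print(aoi_objective(bursty, threshold=5, weight=0.5))
```

The two traces in the usage example share the same mean, so a design based on the average alone cannot distinguish them; the violation term is what penalises the bursty one.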
Locally Optimal Descent for Dynamic Stepsize Scheduling
Authors: Gilad Yehudai, Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, Mariano Schain
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.13877
Pdf link: https://arxiv.org/pdf/2311.13877
Abstract We introduce a novel dynamic learning-rate scheduling scheme grounded in theory with the goal of simplifying the manual and time-consuming tuning of schedules in practice. Our approach is based on estimating the locally-optimal stepsize, guaranteeing maximal descent in the direction of the stochastic gradient of the current step. We first establish theoretical convergence bounds for our method within the context of smooth non-convex stochastic optimization, matching state-of-the-art bounds while only assuming knowledge of the smoothness parameter. We then present a practical implementation of our algorithm and conduct systematic experiments across diverse datasets and optimization algorithms, comparing our scheme with existing state-of-the-art learning-rate schedulers. Our findings indicate that our method needs minimal tuning when compared to existing approaches, removing the need for auxiliary manual schedules and warm-up phases and achieving comparable performance with drastically reduced parameter tuning.
Unsupervised Learning for Topological Classification of Transportation Networks
Authors: Sina Sabzekar, Mohammad Reza Valipour Malakshah, Zahra Amini
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.13887
Pdf link: https://arxiv.org/pdf/2311.13887
Abstract With increasing urbanization, transportation plays an increasingly critical role in city development. The number of studies on modeling, optimization, simulation, and data analysis of transportation systems is on the rise. Many of these studies utilize transportation test networks to represent real-world transportation systems in urban areas, examining the efficacy of their proposed approaches. Each of these networks exhibits unique characteristics in their topology, making their applications distinct for various study objectives. Despite their widespread use in research, there is a lack of comprehensive study addressing the classification of these networks based on their topological characteristics. This study aims to fill this gap by employing unsupervised learning methods, particularly clustering. We present a comprehensive framework for evaluating various topological network characteristics. Additionally, we employ two dimensionality reduction techniques, namely Principal Component Analysis (PCA) and Isometric Feature Mapping (ISOMAP), to reduce overlaps of highly correlated features and enhance the interpretability of the subsequent classification results. We then utilize two clustering algorithms, K-means and HDBSCAN, to classify 14 transportation networks. The PCA method, followed by the K-means clustering approach, outperforms other alternatives with a Silhouette score of $0.510$, enabling the classification of transportation networks into five clusters. We also provide a detailed discussion on the resulting classification.
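The Silhouette score used above to compare clustering pipelines can be sketched in a few lines. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster; the point's score is `(b - a) / max(a, b)`, and the reported value is the mean over all points. This is a plain-Python illustration with Euclidean distance, not the authors' code; points in singleton clusters are assigned a score of 0, which is a common convention assumed here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient for points (tuples of floats) and cluster labels."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton cluster contributes a score of 0
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        denom = max(a, b)
        if denom > 0.0:
            total += (b - a) / denom
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette_score(pts, [0, 0, 1, 1]))  # well separated: close to 1
    print(silhouette_score(pts, [0, 1, 0, 1]))  # poor assignment: negative
```

Scores range from -1 to 1, so the reported $0.510$ indicates moderately well-separated clusters rather than a sharp partition.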
Parameter Exchange for Robust Dynamic Domain Generalization
Authors: Luojun Lin, Zhifeng Shen, Zhishu Sun, Yuanlong Yu, Lei Zhang, Weijie Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2311.13928
Pdf link: https://arxiv.org/pdf/2311.13928
Abstract Agnostic domain shift is the main reason for model degradation on unknown target domains, which creates an urgent need to develop Domain Generalization (DG). Recent advances in DG use dynamic networks to achieve training-free adaptation on the unknown target domains, termed Dynamic Domain Generalization (DDG), which compensates for the lack of self-adaptability in static models with fixed weights. The parameters of dynamic networks can be decoupled into a static and a dynamic component, which are designed to learn domain-invariant and domain-specific features, respectively. Building on existing work, we try to push the limits of DDG by disentangling the static and dynamic components more thoroughly from an optimization perspective. Our main consideration is that we can enable the static component to learn domain-invariant features more comprehensively by augmenting the domain-specific information. As a result, the more comprehensive domain-invariant features learned by the static component can then force the dynamic component to focus more on learning adaptive domain-specific features. To this end, we propose a simple yet effective Parameter Exchange (PE) method to perturb the combination between the static and dynamic components. We optimize the model using the gradients from both the perturbed and non-perturbed feed-forward passes jointly to implicitly achieve the aforementioned disentanglement. In this way, the two components can be optimized in a mutually beneficial manner, which can resist the agnostic domain shifts and improve the self-adaptability on the unknown target domain. Extensive experiments show that PE can be easily plugged into existing dynamic networks to improve their generalization ability without bells and whistles.
Efficient Trigger Word Insertion
Authors: Yueqi Zeng, Ziqiang Li, Pengfei Xia, Lei Liu, Bin Li
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2311.13957
Pdf link: https://arxiv.org/pdf/2311.13957
Abstract With the boom in the natural language processing (NLP) field in recent years, backdoor attacks pose immense threats against deep neural network models. However, previous works hardly consider the effect of the poisoning rate. In this paper, our main objective is to reduce the number of poisoned samples while still achieving a satisfactory Attack Success Rate (ASR) in text backdoor attacks. To accomplish this, we propose an efficient trigger word insertion strategy in terms of trigger word optimization and poisoned sample selection. Extensive experiments on different datasets and models demonstrate that our proposed method can significantly improve attack effectiveness in text classification tasks. Remarkably, our approach achieves an ASR of over 90% with only 10 poisoned samples in the dirty-label setting and requires merely 1.5% of the training data in the clean-label setting.
An optimal first-order Taylor-like formula with a minimized remainder
Authors: Joël Chaskalovic, Franck Assous
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2311.13989
Pdf link: https://arxiv.org/pdf/2311.13989
Abstract In this paper, we derive an optimal first-order Taylor-like formula. In a seminal paper [14], we introduced a new first-order Taylor-like formula that yields a reduced remainder compared to the classical Taylor's formula. Here, we relax the assumption of equally spaced points in our formula. Instead, we consider a sequence of unknown points and a sequence of unknown weights. Then, we solve an optimization problem to determine the best distribution of points and weights that ensures that the remainder is as minimal as possible.
Direct Preference-Based Evolutionary Multi-Objective Optimization with Dueling Bandit
Authors: Tian Huang, Ke Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2311.14003
Pdf link: https://arxiv.org/pdf/2311.14003
Abstract Optimization problems find widespread use in both single-objective and multi-objective scenarios. In practical applications, users aspire for solutions that converge to the region of interest (ROI) along the Pareto front (PF). While the conventional approach involves approximating a fitness function or an objective function to reflect user preferences, this paper explores an alternative avenue. Specifically, we aim to discover a method that sidesteps the need for calculating the fitness function, relying solely on human feedback. Our proposed approach entails conducting direct preference learning facilitated by an active dueling bandit algorithm. The experimental phase is structured into three sessions. Firstly, we assess the performance of our active dueling bandit algorithm. Secondly, we implement our proposed method within the context of Multi-objective Evolutionary Algorithms (MOEAs). Finally, we deploy our method in a practical problem, specifically in protein structure prediction (PSP). This research presents a novel interactive preference-based MOEA framework that not only addresses the limitations of traditional techniques but also unveils new possibilities for optimization problems.
On the Hyperparameter Landscapes of Machine Learning Algorithms
Authors: Mingyu Huang, Ke Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14014
Pdf link: https://arxiv.org/pdf/2311.14014
Abstract Despite the recent success in a plethora of hyperparameter optimization (HPO) methods for machine learning (ML) models, the intricate interplay between model hyperparameters (HPs) and predictive losses (a.k.a. fitness), which is a key prerequisite for understanding HPO, remains notably underexplored in our community. This results in limited explainability in the HPO process, rendering a lack of human trust and difficulties in pinpointing algorithm bottlenecks. In this paper, we aim to shed light on this black box by conducting large-scale fitness landscape analysis (FLA) on 1,500 HP loss landscapes of 6 ML models with more than 11 model configurations, across 67 datasets and different levels of fidelities. We reveal the first unified, comprehensive portrait of their topographies in terms of smoothness, neutrality and modality. We also show that such properties are highly transferable across datasets and fidelities, providing fundamental evidence for the success of multi-fidelity and transfer learning methods. These findings are made possible by developing a dedicated FLA framework that incorporates a combination of visual and quantitative measures. We further demonstrate the potential of this framework by analyzing the NAS-Bench-101 landscape, and we believe it is able to facilitate fundamental understanding of a broader range of AutoML tasks.
SySMOL: A Hardware-software Co-design Framework for Ultra-Low and Fine-Grained Mixed-Precision Neural Networks
Authors: Authors: Cyrus Zhou, Vaughn Richard, Pedro Savarese, Zachary Hassman, Michael Maire, Michael DiBrino, Yanjing Li
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2311.14114
Pdf link: https://arxiv.org/pdf/2311.14114
Abstract Recent advancements in quantization and mixed-precision techniques offer significant promise for improving the run-time and energy efficiency of neural networks. In this work, we further showed that neural networks, wherein individual parameters or activations can take on different precisions ranging between 1 and 4 bits, can achieve accuracies comparable to or exceeding the full-precision counterparts. However, the deployment of such networks poses numerous challenges, stemming from the necessity to manage and control the compute/communication/storage requirements associated with these extremely fine-grained mixed precisions for each piece of data. There is a lack of existing efficient hardware and system-level support tailored to these unique and challenging requirements. Our research introduces the first novel holistic hardware-software co-design approach for these networks, which enables a continuous feedback loop between hardware design, training, and inference to facilitate systematic design exploration. As a proof-of-concept, we illustrate this co-design approach by designing new, configurable CPU SIMD architectures tailored for these networks, tightly integrating the architecture with new system-aware training and inference techniques. We perform systematic design space exploration using this framework to analyze various tradeoffs. The design for mixed-precision networks that achieves optimized tradeoffs corresponds to an architecture that supports 1, 2, and 4-bit fixed-point operations with four configurable precision patterns, when coupled with system-aware training and inference optimization -- networks trained for this design achieve accuracies that closely match full-precision accuracies, while compressing and improving run-time efficiency of the neural networks drastically by 10-20x, compared to full-precision networks.
Variational Annealing on Graphs for Combinatorial Optimization
Authors: Authors: Sebastian Sanokowski, Wilhelm Berghammer, Sepp Hochreiter, Sebastian Lehner
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.14156
Pdf link: https://arxiv.org/pdf/2311.14156
Abstract Several recent unsupervised learning methods use probabilistic approaches to solve combinatorial optimization (CO) problems based on the assumption of statistically independent solution variables. We demonstrate that this assumption imposes performance limitations in particular on difficult problem instances. Our results corroborate that an autoregressive approach which captures statistical dependencies among solution variables yields superior performance on many popular CO problems. We introduce subgraph tokenization in which the configuration of a set of solution variables is represented by a single token. This tokenization technique alleviates the drawback of the long sequential sampling procedure which is inherent to autoregressive methods without sacrificing expressivity. Importantly, we theoretically motivate an annealed entropy regularization and show empirically that it is essential for efficient and stable learning.
TCuPGAN: A novel framework developed for optimizing human-machine interactions in citizen science
Authors: Authors: Ramanakumar Sankar, Kameswara Mantha, Lucy Fortson, Helen Spiers, Thomas Pengo, Douglas Mashek, Myat Mo, Mark Sanders, Trace Christensen, Jeffrey Salisbury, Laura Trouille
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14177
Pdf link: https://arxiv.org/pdf/2311.14177
Abstract In the era of big data in scientific research, there is a necessity to leverage techniques which reduce human effort in labeling and categorizing large datasets by involving sophisticated machine tools. To combat this problem, we present a novel, general purpose model for 3D segmentation that leverages patch-wise adversariality and Long Short-Term Memory to encode sequential information. Using this model alongside citizen science projects which use 3D datasets (image cubes) on the Zooniverse platforms, we propose an iterative human-machine optimization framework where only a fraction of the 2D slices from these cubes are seen by the volunteers. We leverage the patch-wise discriminator in our model to provide an estimate of which slices within these image cubes have poorly generalized feature representations, and correspondingly poor machine performance. These images with corresponding machine proposals would be presented to volunteers on Zooniverse for correction, leading to a drastic reduction in the volunteer effort on citizen science projects. We trained our model on ~2300 liver tissue 3D electron micrographs. Lipid droplets were segmented within these images through human annotation via the `Etch A Cell - Fat Checker' citizen science project, hosted on the Zooniverse platform. In this work, we demonstrate this framework and the selection methodology which resulted in a measured reduction in volunteer effort by more than 60%. We envision this type of joint human-machine partnership will be of great use on future Zooniverse projects.
Risk Bounds of Accelerated SGD for Overparameterized Linear Regression
Authors: Authors: Xuheng Li, Yihe Deng, Jingfeng Wu, Dongruo Zhou, Quanquan Gu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.14222
Pdf link: https://arxiv.org/pdf/2311.14222
Abstract Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.
Formulations to select assets for constructing sparse index tracking portfolios
Authors: Authors: Yutaka Sakurai, Daiki Wakabayashi, Fumio Ishizaki
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2311.14228
Pdf link: https://arxiv.org/pdf/2311.14228
Abstract In this paper, we study asset selection methods to construct a sparse index tracking portfolio. Because of its advantages over a full replication portfolio, the sparse index tracking portfolio has attracted significant attention in the field of finance and investment management. We propose useful formulations to select assets for a sparse index tracking portfolio. Our formulations are described as combinatorial optimization problems, and they can yield various asset selection methods, including some existing methods, by adjusting the values of parameters. As a result, the proposed formulations can provide a well-balanced asset selection to create successful sparse index tracking portfolios. We also provide numerical examples to compare the tracking performance of the resulting sparse index tracking portfolios.
Constant-Time Wasmtime, for Real This Time: End-to-End Verified Zero-Overhead Constant-Time Programming for the Web and Beyond
Authors: Authors: Garrett Gu, Hovav Shacham
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2311.14246
Pdf link: https://arxiv.org/pdf/2311.14246
Abstract We claim that existing techniques and tools for generating and verifying constant-time code are incomplete, since they rely on assumptions that compiler optimization passes do not break constant-timeness or that certain operations execute in constant time on the hardware. We present the first end-to-end constant-time-aware compilation process that preserves constant-time semantics at every step from a high-level language down to microarchitectural guarantees, provided by the forthcoming ARM PSTATE.DIT feature. First, we present a new compiler-verifier suite based on the JIT-style runtime Wasmtime, modified to compile ct-wasm, a preexisting type-safe constant-time extension of WebAssembly, into ARM machine code while maintaining the constant-time property throughout all optimization passes. The resulting machine code is then fed into an automated verifier that requires no human intervention and uses static dataflow analysis in Ghidra to check the constant-timeness of the output. Our verifier leverages characteristics unique to ct-wasm-generated code in order to speed up verification while preserving both soundness and wide applicability. We also consider the resistance of our compilation and verification against speculative timing leakages such as Spectre. Finally, in order to expose ct-Wasmtime at a high level, we present a port of FaCT, a preexisting constant-time-aware DSL, to target ct-wasm.
Segmentation-Based Parametric Painting
Authors: Authors: Manuel Ladron de Guevara, Matthew Fisher, Aaron Hertzmann
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14271
Pdf link: https://arxiv.org/pdf/2311.14271
Abstract We introduce a novel image-to-painting method that facilitates the creation of large-scale, high-fidelity paintings with human-like quality and stylistic variation. To process large images and gain control over the painting process, we introduce a segmentation-based painting process and a dynamic attention map approach inspired by human painting strategies, allowing optimization of brush strokes to proceed in batches over different image regions, thereby capturing both large-scale structure and fine details, while also allowing stylistic control over detail. Our optimized batch processing and patch-based loss framework enable efficient handling of large canvases, ensuring our painted outputs are both aesthetically compelling and functionally superior as compared to previous methods, as confirmed by rigorous evaluations. Code available at: https://github.com/manuelladron/semantic_based_painting.git
Stable Cluster Discrimination for Deep Clustering
Authors: Authors: Qi Qian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2311.14310
Pdf link: https://arxiv.org/pdf/2311.14310
Abstract Deep clustering can optimize representations of instances (i.e., representation learning) and explore the inherent data distribution (i.e., clustering) simultaneously, which demonstrates a superior performance over conventional clustering methods with given features. However, the coupled objective implies a trivial solution that all instances collapse to the uniform features. To tackle the challenge, a two-stage training strategy is developed for decoupling, where it introduces an additional pre-training stage for representation learning and then fine-tunes the obtained model for clustering. Meanwhile, one-stage methods are developed mainly for representation learning rather than clustering, where various constraints for cluster assignments are designed to avoid collapsing explicitly. Despite the success of these methods, an appropriate learning objective tailored for deep clustering has not been investigated sufficiently. In this work, we first show that the prevalent discrimination task in supervised learning is unstable for one-stage clustering due to the lack of ground-truth labels and positive instances for certain clusters in each mini-batch. To mitigate the issue, a novel stable cluster discrimination (SeCu) task is proposed and a new hardness-aware clustering criterion can be obtained accordingly. Moreover, a global entropy constraint for cluster assignments is studied with efficient optimization. Extensive experiments are conducted on benchmark data sets and ImageNet. SeCu achieves state-of-the-art performance on all of them, which demonstrates the effectiveness of one-stage deep clustering. Code is available at https://github.com/idstcv/SeCu.
BHGNN-RT: Network embedding for directed heterogeneous graphs
Authors: Authors: Xiyang Sun, Fumiyasu Komaki
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14404
Pdf link: https://arxiv.org/pdf/2311.14404
Abstract Networks are one of the most valuable data structures for modeling problems in the real world. However, the most recent node embedding strategies have focused on undirected graphs, with limited attention to directed graphs, especially directed heterogeneous graphs. In this study, we first investigated the network properties of directed heterogeneous graphs. Based on network analysis, we proposed an embedding method, a bidirectional heterogeneous graph neural network with random teleport (BHGNN-RT), for directed heterogeneous graphs, that leverages bidirectional message-passing process and network heterogeneity. With the optimization of teleport proportion, BHGNN-RT is beneficial to overcome the over-smoothing problem. Extensive experiments on various datasets were conducted to verify the efficacy and efficiency of BHGNN-RT. Furthermore, we investigated the effects of message components, model layer, and teleport proportion on model performance. The performance comparison with all other baselines illustrates that BHGNN-RT achieves state-of-the-art performance, outperforming the benchmark methods in both node classification and unsupervised clustering tasks.
Receding Horizon Optimization with PPUM: An Approach for Autonomous Robot Path Planning in Uncertain Environments
Authors: Authors: Zijian Ge, Jingjing Jiang, Matthew Coombes, Liang Sun
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2311.14411
Pdf link: https://arxiv.org/pdf/2311.14411
Abstract The ability to understand spatial-temporal patterns for crowds of people is crucial for achieving long-term autonomy of mobile robots deployed in human environments. However, traditional historical data-driven memory models are inadequate for handling anomalies, resulting in poor reasoning by the robot in estimating the crowd spatial distribution. In this article, a Receding Horizon Optimization (RHO) formulation is proposed that incorporates a Probability-related Partially Updated Memory (PPUM) for robot path planning in crowded environments with uncertainties. The PPUM acts as a memory layer that combines real-time sensor observations with historical knowledge using a weighted evidence fusion theory to improve the robot's adaptivity to dynamic environments. RHO then utilizes the PPUM as informed knowledge to generate a path that minimizes the likelihood of encountering dense crowds while reducing the cost of local motion planning. The proposed approach provides an innovative solution to the problem of a robot's long-term safe interaction with humans in uncertain crowded environments. In simulation, the results demonstrate the superior performance of our approach compared to benchmark methods in terms of crowd distribution estimation accuracy, adaptability to anomalies and path planning efficiency.
Efficient Gradient Estimation via Adaptive Sampling and Importance Sampling
Authors: Authors: Corentin Salaün, Xingchang Huang, Iliyan Georgiev, Niloy J. Mitra, Gurprit Singh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14468
Pdf link: https://arxiv.org/pdf/2311.14468
Abstract Machine learning problems rely heavily on stochastic gradient descent (SGD) for optimization. The effectiveness of SGD is contingent upon accurately estimating gradients from a mini-batch of data samples. Instead of the commonly used uniform sampling, adaptive or importance sampling reduces noise in gradient estimation by forming mini-batches that prioritize crucial data points. Previous research has suggested that data points should be selected with probabilities proportional to their gradient norm. Nevertheless, existing algorithms have struggled to efficiently integrate importance sampling into machine learning frameworks. In this work, we make two contributions. First, we present an algorithm that can incorporate existing importance functions into our framework. Second, we propose a simplified importance function that relies solely on the loss gradient of the output layer. By leveraging our proposed gradient estimation techniques, we observe improved convergence in classification and regression tasks with minimal computational overhead. We validate the effectiveness of our adaptive and importance-sampling approach on image and point-cloud datasets.
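As a rough illustration of the gradient-norm rule mentioned in this abstract (not the authors' algorithm), the sketch below draws a mini-batch with probabilities proportional to per-sample gradient magnitudes for a one-parameter least-squares model, then reweights each draw by 1/(N p_i) so that the estimate of the mean gradient stays unbiased. The model, the function names, and the toy data are all assumptions made for this example.

```python
import random

def per_sample_gradients(w, xs, ys):
    """Gradient of 0.5 * (w * x - y)**2 with respect to w, for each sample."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]

def importance_probabilities(grads, eps=1e-12):
    """Sampling probabilities proportional to per-sample gradient magnitude."""
    norms = [abs(g) + eps for g in grads]  # eps keeps every probability positive
    total = sum(norms)
    return [v / total for v in norms]

def importance_sampled_gradient(grads, probs, batch_size, rng):
    """Unbiased estimate of the mean gradient from an importance-sampled batch."""
    n = len(grads)
    idx = rng.choices(range(n), weights=probs, k=batch_size)
    # Dividing by n * p_i cancels the sampling bias introduced by probs.
    return sum(grads[i] / (n * probs[i]) for i in idx) / batch_size

# Toy data (hypothetical): y is roughly 2 * x with a few large-magnitude inputs.
xs = [0.1, 0.2, 0.3, 5.0, -4.0, 0.05]
ys = [0.2, 0.4, 0.7, 10.0, -8.1, 0.1]
grads = per_sample_gradients(0.0, xs, ys)
probs = importance_probabilities(grads)
estimate = importance_sampled_gradient(grads, probs, batch_size=4, rng=random.Random(0))
```

The reweighting step is what separates importance sampling from simply training more often on hard examples: without it, the expected update would no longer match the full-batch gradient.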
MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-to-3D Generation
Authors: Authors: Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2311.14494
Pdf link: https://arxiv.org/pdf/2311.14494
Abstract We introduce MVControl, a novel neural network architecture that enhances existing pre-trained multi-view 2D diffusion models by incorporating additional input conditions, e.g. edge maps. Our approach enables the generation of controllable multi-view images and view-consistent 3D content. To achieve controllable multi-view image generation, we leverage MVDream as our base model, and train a new neural network module as additional plugin for end-to-end task-specific condition learning. To precisely control the shapes and views of generated images, we innovatively propose a new conditioning mechanism that predicts an embedding encapsulating the input spatial and view conditions, which is then injected to the network globally. Once MVControl is trained, score-distillation (SDS) loss based optimization can be performed to generate 3D content, in which process we propose to use a hybrid diffusion prior. The hybrid prior relies on a pre-trained Stable-Diffusion network and our trained MVControl for additional guidance. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content.
StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization
Authors: Authors: Shida Wang, Qianxiao Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Dynamical Systems (math.DS)
Arxiv link: https://arxiv.org/abs/2311.14495
Pdf link: https://arxiv.org/pdf/2311.14495
Abstract In this paper, we investigate the long-term memory learning capabilities of state-space models (SSMs) from the perspective of parameterization. We prove that state-space models without any reparameterization exhibit a memory limitation similar to that of traditional RNNs: the target relationships that can be stably approximated by state-space models must have an exponential decaying memory. Our analysis identifies this "curse of memory" as a result of the recurrent weights converging to a stability boundary, suggesting that a reparameterization technique can be effective. To this end, we introduce a class of reparameterization techniques for SSMs that effectively lift its memory limitations. Besides improving approximation capabilities, we further illustrate that a principled choice of reparameterization scheme can also enhance optimization stability. We validate our findings using synthetic datasets and language models.
Pitfalls of Projection: A study of Newton-type solvers for incremental potentials
Authors: Authors: Andreas Longva (1), Fabian Löschner (1), José Antonio Fernández-Fernández (1), Egor Larionov (Meta Reality Labs), Uri M. Ascher (University of British Columbia), Jan Bender (1) ((1) RWTH Aachen University)
Subjects: Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2311.14526
Pdf link: https://arxiv.org/pdf/2311.14526
Abstract Nonlinear systems arising from time integrators like Backward Euler can sometimes be reformulated as optimization problems, known as incremental potentials. We show through a comprehensive experimental analysis that the widely used Projected Newton method, which relies on unconditional semidefinite projection of Hessian contributions, typically exhibits a reduced convergence rate compared to classical Newton's method. We demonstrate how factors like resolution, element order, projection method, material model and boundary handling impact convergence of Projected Newton and Newton. Drawing on these findings, we propose the hybrid method Project-on-Demand Newton, which projects only conditionally, and show that it enjoys both the robustness of Projected Newton and convergence rate of Newton. We additionally introduce Kinetic Newton, a regularization-based method that takes advantage of the structure of incremental potentials and avoids projection altogether. We compare the four solvers on hyperelasticity and contact problems. We also present a nuanced discussion of convergence criteria, and propose a new acceleration-based criterion that avoids problems associated with existing residual norm criteria and is easier to interpret. We finally address a fundamental limitation of the Armijo backtracking line search that occasionally blocks convergence, especially for stiff problems. We propose a novel parameter-free, robust line search technique to eliminate this issue.
Electric Vehicles coordination for grid balancing using multi-objective Harris Hawks Optimization
Authors: Authors: Cristina Bianca Pop, Tudor Cioara, Viorica Chifu, Ionut Anghel, Francesco Bellesini
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2311.14563
Pdf link: https://arxiv.org/pdf/2311.14563
Abstract The rise of renewables coincides with the shift towards Electric Vehicles (EVs), posing technical and operational challenges for the energy balance of the local grid. Nowadays, the energy grid cannot deal with a spike in EV usage, leading to a need for more coordinated and grid-aware EV charging and discharging strategies. However, coordinating power flow from multiple EVs into the grid requires sophisticated algorithms and load-balancing strategies as the complexity increases with more control variables and EVs, necessitating large optimization and decision search spaces. In this paper, we propose an EV fleet coordination model for the day ahead aiming to ensure a reliable energy supply and maintain a stable local grid, by utilizing EVs to store surplus energy and discharge it during periods of energy deficit. The optimization problem is addressed using Harris Hawks Optimization (HHO) considering criteria related to energy grid balancing, time usage preference, and the location of EV drivers. The EV schedules, associated with the position of individuals from the population, are adjusted through exploration and exploitation operations, and their technical and operational feasibility is ensured, while the rabbit individual is updated with a non-dominated EV schedule selected per iteration using a roulette wheel algorithm. The solution is evaluated within the framework of an e-mobility service in Terni city. The results indicate that coordinated charging and discharging of EVs not only meet balancing service requirements but also align with user preferences with minimal deviations.
A Survey and Analysis of Evolutionary Operators for Permutations
Authors: Authors: Vincent A. Cicirello
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Arxiv link: https://arxiv.org/abs/2311.14595
Pdf link: https://arxiv.org/pdf/2311.14595
Abstract There are many combinatorial optimization problems whose solutions are best represented by permutations. The classic traveling salesperson seeks an optimal ordering over a set of cities. Scheduling problems often seek optimal orderings of tasks or activities. Although some evolutionary approaches to such problems utilize the bit strings of a genetic algorithm, it is more common to directly represent solutions with permutations. Evolving permutations directly requires specialized evolutionary operators. Over the years, many crossover and mutation operators have been developed for solving permutation problems with evolutionary algorithms. In this paper, we survey the breadth of evolutionary operators for permutations. We implemented all of these in Chips-n-Salsa, an open source Java library for evolutionary computation. Finally, we empirically analyze the crossover operators on artificial fitness landscapes isolating different permutation features.
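To make concrete why permutations need specialized operators, here is a minimal Python sketch of two well-known operator styles: a swap mutation and a simplified order-style crossover. Both are chosen for illustration only; the abstract does not name the operators it surveys, and this is not code from Chips-n-Salsa (which is a Java library). All function names are hypothetical.

```python
import random

def swap_mutation(perm, rng):
    """Return a copy of perm with two randomly chosen positions exchanged."""
    child = list(perm)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def order_crossover(parent1, parent2, rng):
    """Keep a slice of parent1 in place; fill the rest in parent2's relative order."""
    n = len(parent1)
    lo, hi = sorted(rng.sample(range(n + 1), 2))
    segment = parent1[lo:hi]
    taken = set(segment)
    # Elements outside the slice, in the order they appear in parent2.
    fill = [gene for gene in parent2 if gene not in taken]
    return fill[:lo] + segment + fill[lo:]

rng = random.Random(42)
p1 = [0, 1, 2, 3, 4, 5, 6, 7]
p2 = [7, 3, 5, 1, 0, 6, 2, 4]
mutant = swap_mutation(p1, rng)
child = order_crossover(p1, p2, rng)
```

A bit-string crossover that splices two parents at a cut point would generally duplicate some elements and drop others; both operators above always return a valid permutation.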
Target-driven splitting SPH optimization of thermal conductivity distribution
Authors: Authors: Bo Zhang, Chi Zhang, Xiangyu Hu
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2311.14598
Pdf link: https://arxiv.org/pdf/2311.14598
Abstract Efficiently enhancing heat conduction through optimized distribution of a limited quantity of high thermal conductivity material is paramount in cooling electronic devices and numerous other applications. This paper introduces a target-driven all-at-once approach for PDE-constrained optimization and derives a splitting smoothed particle hydrodynamics (SPH) method for optimizing the distribution of thermal conductivity in heat conduction problems. In this method, the optimization iteration of the system is split into several easily addressed steps. A targeting step is employed to progressively enforce the direct target, which potentially leads to increased PDE residuals. Then, these residuals are recovered through an evolution step of the design variable. After this, a PDE solution step is carried out to further decrease the PDE residuals, and the system is ready for the next iteration. Unlike the simulation-based approaches, the present method does not rely on the adjoint state equation and converged state variable field in each iteration, and the optimization process is significantly simplified and accelerated. With the utilization of an implicit SPH splitting operator and a general numerical regularization formulation, the information propagation is further accelerated and the numerical stability is greatly enhanced. Typical examples of heat conduction optimization demonstrate that the current method yields optimal results comparable to previous methods and exhibits considerable computational efficiency. Moreover, the optimal results feature more moderate extreme values, which offers distinct advantages for the easier selection of appropriate material with high thermal conductivity.
Evolution of Neural Architectures for Financial Forecasting: A Note on Data Incompatibility during Crisis Periods
Authors: Authors: Faizal Hafiz, Jan Broekaert, Akshya Swain
Subjects: Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2311.14604
Pdf link: https://arxiv.org/pdf/2311.14604
Abstract This note focuses on the optimization of neural architectures for stock index movement forecasting following a major market disruption or crisis. Given that such crises may introduce a shift in market dynamics, this study aims to investigate whether the training data from market dynamics prior to the crisis are compatible with the data during the crisis period. To this end, two distinct learning environments are designed to evaluate and reconcile the effects of possibly different market dynamics. These environments differ principally based on the role assigned to the pre-crisis data. In both environments, a set of non-dominated architectures are identified to satisfy the multi-criteria co-evolution problem, which simultaneously addresses the selection issues related to features and hidden layer topology. To test the hypothesis of pre-crisis data incompatibility, the day-ahead movement prediction of the NASDAQ index is considered during two recent and major market disruptions; the 2008 financial crisis and the COVID-19 pandemic. The results of a detailed comparative evaluation convincingly support the incompatibility hypothesis and highlight the need to select re-training windows carefully.
Received Signal and Channel Parameter Estimation in Molecular Communications
Authors: Authors: O. Tansel Baydas, Ozgur B. Akan
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2311.14621
Pdf link: https://arxiv.org/pdf/2311.14621
Abstract Molecular communication (MC) is a paradigm that employs molecules as information transmitters, hence requiring unconventional transceivers and detection techniques for the Internet of Bio-Nano Things (IoBNT). In this study, we provide a novel MC model that incorporates a spherical transmitter and receiver with partial absorption. This model offers a more realistic representation than receiver architectures in the literature, e.g. passive or entirely absorbing configurations. An optimization-based technique utilizing particle swarm optimization (PSO) is employed to accurately estimate the cumulative number of molecules received. This technique yields nearly constant correction parameters and demonstrates a significant improvement of 5 times in terms of root mean square error (RMSE). The estimated channel model provides an approximate analytical impulse response; hence, it is used for estimating channel parameters such as distance, diffusion coefficient, or a combination of both. We apply iterative maximum likelihood estimation (MLE) for the parameter estimation, which gives consistent errors compared to the estimated Cramer-Rao Lower Bound (CRLB).
A General Framework for User-Guided Bayesian Optimization
Authors: Authors: Carl Hvarfner, Frank Hutter, Luigi Nardi
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.14645
Pdf link: https://arxiv.org/pdf/2311.14645
Abstract The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO's ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading.
Keyword: adam

There is no result

Keyword: gradient

A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer
Authors: Authors: Chenyang Li, Zhao Song, Weixin Wang, Chiwun Yang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2311.13624
Pdf link: https://arxiv.org/pdf/2311.13624
Abstract The Deep Leakage from Gradient (DLG) attack has emerged as a prevalent and highly effective method for extracting sensitive training data by inspecting exchanged gradients. This approach poses a substantial threat to the privacy of individuals and organizations alike. This research presents a comprehensive analysis of the gradient leakage method when applied specifically to transformer-based models. Through meticulous examination, we showcase the capability to accurately recover data solely from gradients and rigorously investigate the conditions under which gradient attacks can be executed, providing compelling evidence. Furthermore, we reevaluate the approach of introducing additional noise on gradients as a protective measure against gradient attacks. To address this, we outline a theoretical proof that analyzes the associated privacy costs within the framework of differential privacy. Additionally, we affirm the convergence of the Stochastic Gradient Descent (SGD) algorithm under perturbed gradients. The primary objective of this study is to augment the understanding of gradient leakage attack and defense strategies while actively contributing to the development of privacy-preserving techniques specifically tailored for transformer-based models. By shedding light on the vulnerabilities and countermeasures associated with gradient leakage, this research aims to foster advancements in safeguarding sensitive data and upholding privacy in the context of transformer-based models.
A Joint Gradient and Loss Based Clustered Federated Learning Design
Authors: Authors: Licheng Lin, Mingzhe Chen, Zhaohui Yang, Yusen Wu, Yuchen Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.13665
Pdf link: https://arxiv.org/pdf/2311.13665
Abstract In this paper, a novel clustered FL framework is proposed that enables distributed edge devices with non-IID data to independently form several clusters in a distributed manner and implement FL training within each cluster. In particular, our designed clustered FL algorithm must overcome two challenges associated with FL training. First, the server has limited FL training information (i.e., the parameter server can only obtain the FL model information of each device) and limited computational power for finding the differences among a large number of devices. Second, each device does not have the data information of other devices for device clustering and can only use global FL model parameters received from the server and its own data information to determine its cluster identity, which increases the difficulty of device clustering. To overcome these two challenges, we propose a joint gradient and loss based distributed clustering method in which each device determines its cluster identity considering the gradient similarity and training loss. The proposed clustering method considers not only how a local FL model of one device contributes to each cluster but also the direction of gradient descent, thus improving clustering speed. By delegating clustering decisions to edge devices, each device can fully leverage its private data information to determine its own cluster identity, thereby reducing clustering overhead and improving overall clustering performance. Simulation results demonstrate that our proposed clustered FL algorithm can reduce clustering iterations by up to 99% compared to the existing baseline.
Single-Shot Plug-and-Play Methods for Inverse Problems
Authors: Authors: Yanqi Cheng, Lipei Zhang, Zhenda Shen, Shujun Wang, Lequan Yu, Raymond H. Chan, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2311.13682
Pdf link: https://arxiv.org/pdf/2311.13682
Abstract The utilisation of Plug-and-Play (PnP) priors in inverse problems has become increasingly prominent in recent years. This preference is based on the mathematical equivalence between the general proximal operator and the regularised denoiser, facilitating the adaptation of various off-the-shelf denoiser priors to a wide range of inverse problems. However, existing PnP models predominantly rely on pre-trained denoisers using large datasets. In this work, we introduce Single-Shot PnP methods (SS-PnP), shifting the focus to solving inverse problems with minimal data. First, we integrate Single-Shot proximal denoisers into iterative methods, enabling training with single instances. Second, we propose implicit neural priors based on a novel function that preserves relevant frequencies to capture fine details while avoiding the issue of vanishing gradients. We demonstrate, through extensive numerical and visual experiments, that our method leads to better approximations.
OASIS: Offsetting Active Reconstruction Attacks in Federated Learning
Authors: Authors: Tre' R. Jeter, Truc Nguyen, Raed Alharbi, My T. Thai
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2311.13739
Pdf link: https://arxiv.org/pdf/2311.13739
Abstract Federated Learning (FL) has garnered significant attention for its potential to protect user privacy while enhancing model training efficiency. However, recent research has demonstrated that FL protocols can be easily compromised by active reconstruction attacks executed by dishonest servers. These attacks involve the malicious modification of global model parameters, allowing the server to obtain a verbatim copy of users' private data by inverting their gradient updates. Tackling this class of attack remains a crucial challenge due to the strong threat model. In this paper, we propose OASIS, a defense mechanism based on image augmentation that effectively counteracts active reconstruction attacks while preserving model performance. We first uncover the core principle of gradient inversion that enables these attacks and theoretically identify the main conditions by which the defense can be robust regardless of the attack strategies. We then construct OASIS with image augmentation showing that it can undermine the attack principle. Comprehensive evaluations demonstrate the efficacy of OASIS highlighting its feasibility as a solution.
Learning Hierarchical Polynomials with Three-Layer Neural Networks
Authors: Authors: Zihao Wang, Eshaan Nichani, Jason D. Lee
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.13774
Pdf link: https://arxiv.org/pdf/2311.13774
Abstract We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k=1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde \Theta(d^{kq})$ samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde{\mathcal{O}}(d^2)$, which is an improvement over prior work requiring a sample size of $\widetilde\Theta(d^4)$. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature $p$ with $\widetilde{\mathcal{O}}(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
Max-Min SINR Analysis of STAR-RIS Assisted Massive MIMO Systems with Hardware Impairments
Authors: Authors: Anastasios Papazafeiropoulos, Pandelis Kourtessis, Symeon Chatzinotas
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2311.13875
Pdf link: https://arxiv.org/pdf/2311.13875
Abstract Reconfigurable intelligent surface (RIS) has emerged as a cost-effective solution to improve wireless communication performance through just passive reflection. Recently, the concept of simultaneously transmitting and reflecting RIS (STAR-RIS) has appeared but the study of minimum signal-to-interference-plus-noise ratio (SINR) and the impact of hardware impairments (HWIs) remain open. In addition to previous works on STAR-RIS, we consider a massive multiple-input multiple-output (mMIMO) base station (BS) serving multiple user equipments (UEs) at both sides of the RIS. Specifically, in this work, focusing on the downlink of a single cell, we derive the minimum SINR obtained by the optimal linear precoder (OLP) with HWIs in closed form. The OLP maximises the minimum SINR subject to a given power constraint for any given passive beamforming matrix (PBM). Next, we obtain deterministic equivalents (DEs) for the OLP and the minimum SINR, which are then used to optimise the PBM. Notably, based on the DEs and statistical channel state information (CSI), we optimise simultaneously the amplitude and phase shift by using a projected gradient ascent algorithm (PGAM) for both energy splitting (ES) and mode switching (MS) STAR-RIS operation protocols with reduced feedback, which is crucial for STAR-RIS systems because they include twice the number of variables compared to reflecting-only RIS. Simulations verify the analytical results, shed light on the impact of HWIs, and demonstrate the better performance of STAR-RIS compared to conventional RIS.
Locally Optimal Descent for Dynamic Stepsize Scheduling
Authors: Authors: Gilad Yehudai, Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, Mariano Schain
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.13877
Pdf link: https://arxiv.org/pdf/2311.13877
Abstract We introduce a novel dynamic learning-rate scheduling scheme grounded in theory with the goal of simplifying the manual and time-consuming tuning of schedules in practice. Our approach is based on estimating the locally-optimal stepsize, guaranteeing maximal descent in the direction of the stochastic gradient of the current step. We first establish theoretical convergence bounds for our method within the context of smooth non-convex stochastic optimization, matching state-of-the-art bounds while only assuming knowledge of the smoothness parameter. We then present a practical implementation of our algorithm and conduct systematic experiments across diverse datasets and optimization algorithms, comparing our scheme with existing state-of-the-art learning-rate schedulers. Our findings indicate that our method needs minimal tuning when compared to existing approaches, removing the need for auxiliary manual schedules and warm-up phases and achieving comparable performance with drastically reduced parameter tuning.
Leveraging Optimal Transport via Projections on Subspaces for Machine Learning Applications
Authors: Authors: Clément Bonet
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.13883
Pdf link: https://arxiv.org/pdf/2311.13883
Abstract Optimal Transport has received much attention in Machine Learning as it allows comparing probability distributions by exploiting the geometry of the underlying space. However, in its original formulation, solving this problem suffers from a significant computational burden. Thus, a meaningful line of work consists in proposing alternatives to reduce this burden while still enjoying its properties. In this thesis, we focus on alternatives which use projections on subspaces. The main such alternative is the Sliced-Wasserstein distance, which we first propose to extend to Riemannian manifolds in order to use it in Machine Learning applications for which using such spaces has been shown to be beneficial in recent years. We also study sliced distances between positive measures in the so-called unbalanced OT problem. Back to the original Euclidean Sliced-Wasserstein distance between probability measures, we study the dynamics of gradient flows when endowing the space with this distance in place of the usual Wasserstein distance. Then, we investigate the use of the Busemann function, a generalization of the inner product in metric spaces, in the space of probability measures. Finally, we extend the subspace detour approach to incomparable spaces using the Gromov-Wasserstein distance.
Parameter Exchange for Robust Dynamic Domain Generalization
Authors: Authors: Luojun Lin, Zhifeng Shen, Zhishu Sun, Yuanlong Yu, Lei Zhang, Weijie Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2311.13928
Pdf link: https://arxiv.org/pdf/2311.13928
Abstract Agnostic domain shift is the main reason for model degradation on unknown target domains, which brings an urgent need to develop Domain Generalization (DG). Recent advances in DG use dynamic networks to achieve training-free adaptation on unknown target domains, termed Dynamic Domain Generalization (DDG), which compensates for the lack of self-adaptability in static models with fixed weights. The parameters of dynamic networks can be decoupled into a static and a dynamic component, which are designed to learn domain-invariant and domain-specific features, respectively. Building on existing work, we try to push the limits of DDG by disentangling the static and dynamic components more thoroughly from an optimization perspective. Our main consideration is that we can enable the static component to learn domain-invariant features more comprehensively by augmenting the domain-specific information. As a result, the more comprehensive domain-invariant features learned by the static component can then force the dynamic component to focus more on learning adaptive domain-specific features. To this end, we propose a simple yet effective Parameter Exchange (PE) method to perturb the combination between the static and dynamic components. We optimize the model using the gradients from both the perturbed and non-perturbed feed-forward passes jointly to implicitly achieve the aforementioned disentanglement. In this way, the two components can be optimized in a mutually beneficial manner, which can resist agnostic domain shifts and improve the self-adaptability on the unknown target domain. Extensive experiments show that PE can be easily plugged into existing dynamic networks to improve their generalization ability without bells and whistles.
Unconstrained learning of networked nonlinear systems via free parametrization of stable interconnected operators
Authors: Authors: Leonardo Massai, Danilo Saccani, Luca Furieri, Giancarlo Ferrari-Trecate
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2311.13967
Pdf link: https://arxiv.org/pdf/2311.13967
Abstract This paper characterizes a new parametrization of nonlinear networked incrementally $L_2$-bounded operators in discrete time. The distinctive novelty is that our parametrization is free: that is, a sparse large-scale operator with bounded incremental $L_2$ gain is obtained for any choice of the real values of our parameters. This property allows one to freely search over optimal parameters via unconstrained gradient descent, enabling direct applications in large-scale optimal control and system identification. Further, we can embed prior knowledge about the interconnection topology and stability properties of the system directly into the large-scale distributed operator we design. Our approach is extremely general in that it can seamlessly encapsulate and interconnect state-of-the-art Neural Network (NN) parametrizations of stable dynamical systems. To demonstrate the effectiveness of this approach, we provide a simulation example showcasing the identification of a networked nonlinear system. The results underscore the superiority of our free parametrizations over standard NN-based identification methods where priors over the system topology and local stability properties are not enforced.
When Side-Channel Attacks Break the Black-Box Property of Embedded Artificial Intelligence
Authors: Authors: Benoit Coqueret, Mathieu Carbone, Olivier Sentieys, Gabriel Zaid
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2311.14005
Pdf link: https://arxiv.org/pdf/2311.14005
Abstract Artificial intelligence, and specifically deep neural networks (DNNs), has rapidly emerged in the past decade as the standard for several tasks from specific advertising to object detection. The performance offered has led DNN algorithms to become a part of critical embedded systems, requiring both efficiency and reliability. In particular, DNNs are subject to malicious examples designed to fool the network while being undetectable to the human observer: the adversarial examples. While previous studies propose frameworks to implement such attacks in black box settings, those often rely on the hypothesis that the attacker has access to the logits of the neural network, breaking the assumption of the traditional black box. In this paper, we investigate a real black box scenario where the attacker has no access to the logits. In particular, we propose an architecture-agnostic attack which overcomes this constraint by extracting the logits. Our method combines hardware and software attacks, by performing a side-channel attack that exploits electromagnetic leakages to extract the logits for a given input, allowing an attacker to estimate the gradients and produce state-of-the-art adversarial examples to fool the targeted neural network. Through this example of adversarial attack, we demonstrate the effectiveness of logits extraction using side-channel as a first step for more general attack frameworks requiring either the logits or the confidence scores.
Understanding the Vulnerability of CLIP to Image Compression
Authors: Authors: Cangxiong Chen, Vinay P. Namboodiri, Julian Padget
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14029
Pdf link: https://arxiv.org/pdf/2311.14029
Abstract CLIP is a widely used foundational vision-language model that is used for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to changes in image quality under compression. This surprising result is further analysed using an attribution method, Integrated Gradients. Using this attribution method, we are able to better understand, both quantitatively and qualitatively, how compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis to understand this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models.
DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release
Authors: Authors: Jie Fu, Qingqing Ye, Haibo Hu, Zhili Chen, Lulu Wang, Kuncan Wang, Ran Xun
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2311.14056
Pdf link: https://arxiv.org/pdf/2311.14056
Abstract Machine learning models are known to memorize private data to reduce their training loss, which can be inadvertently exploited by privacy attacks such as model inversion and membership inference. To protect against these attacks, differential privacy (DP) has become the de facto standard for privacy-preserving machine learning, particularly those popular training algorithms using stochastic gradient descent, such as DPSGD. Nonetheless, DPSGD still suffers from severe utility loss due to its slow convergence. This is partially caused by the random sampling, which brings bias and variance to the gradient, and partially by the Gaussian noise, which leads to fluctuation of gradient updates. Our key idea to address these issues is to apply selective updates to the model training, while discarding those useless or even harmful updates. Motivated by this, this paper proposes DPSUR, a Differentially Private training framework based on Selective Updates and Release, where the gradient from each iteration is evaluated based on a validation test, and only those updates leading to convergence are applied to the model. As such, DPSUR ensures the training in the right direction and thus can achieve faster convergence than DPSGD. The main challenges lie in two aspects -- privacy concerns arising from gradient evaluation, and gradient selection strategy for model update. To address the challenges, DPSUR introduces a clipping strategy for update randomization and a threshold mechanism for gradient selection. Experiments conducted on MNIST, FMNIST, CIFAR-10, and IMDB datasets show that DPSUR significantly outperforms previous works in terms of convergence speed and model utility.
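For readers unfamiliar with the DPSGD baseline mentioned in the abstract above, the following is a minimal sketch of one generic DP-SGD update: clip each per-sample gradient to a maximum L2 norm, sum, add Gaussian noise scaled to that norm, then average. It does not implement DPSUR's selective update or release steps. The function names `clip_by_l2_norm` and `dp_sgd_step`, and the parameters `max_norm` and `noise_multiplier`, are illustrative assumptions, not taken from the paper.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm):
    """Scale grad down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_sample_grads, lr, max_norm, noise_multiplier, rng):
    """One generic DP-SGD update (hypothetical helper, not DPSUR itself)."""
    # 1. Bound each sample's influence by clipping its gradient.
    clipped = [clip_by_l2_norm(g, max_norm) for g in per_sample_grads]
    # 2. Sum the clipped gradients coordinate-wise.
    summed = [sum(g[i] for g in clipped) for i in range(len(weights))]
    # 3. Add Gaussian noise whose scale is tied to the clipping norm,
    #    then average over the batch.
    sigma = noise_multiplier * max_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / len(clipped) for s in summed]
    # 4. Ordinary gradient descent step on the noisy average.
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.1, -0.2]]
    print(dp_sgd_step([0.0, 0.0], grads, lr=0.1, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

The noise in step 3 is what the abstract identifies as a source of fluctuation in the gradient updates; setting `noise_multiplier` to 0 recovers plain clipped SGD with no privacy guarantee.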
Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation
Authors: Authors: Markus Gross, Arne P. Raulf, Christoph Räth
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
Arxiv link: https://arxiv.org/abs/2311.14120
Pdf link: https://arxiv.org/pdf/2311.14120
Abstract We investigate the stationary (late-time) training regime of single- and two-layer linear neural networks within the continuum limit of stochastic gradient descent (SGD) for synthetic Gaussian data. In the case of a single-layer network in the weakly oversampled regime, the spectrum of the noise covariance matrix deviates notably from the Hessian, which can be attributed to the broken detailed balance of SGD dynamics. The weight fluctuations are in this case generally anisotropic, but experience an isotropic loss. For a two-layer network, we obtain the stochastic dynamics of the weights in each layer and analyze the associated stationary covariances. We identify the inter-layer coupling as a new source of anisotropy for the weight fluctuations. In contrast to the single-layer case, the weight fluctuations experience an anisotropic loss, the flatness of which is inversely related to the fluctuation variance. We thereby provide an analytical derivation of the recently observed inverse variance-flatness relation in a deep linear network model.
Byzantine Robustness and Partial Participation Can Be Achieved Simultaneously: Just Clip Gradient Differences
Authors: Authors: Grigory Malinovsky, Peter Richtárik, Samuel Horváth, Eduard Gorbunov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2311.14127
Pdf link: https://arxiv.org/pdf/2311.14127
Abstract Distributed learning has emerged as a leading paradigm for training large machine learning models. However, in real-world scenarios, participants may be unreliable or malicious, posing a significant challenge to the integrity and accuracy of the trained models. Byzantine fault tolerance mechanisms have been proposed to address these issues, but they often assume full participation from all clients, which is not always practical due to the unavailability of some clients or communication constraints. In our work, we propose the first distributed method with client sampling and provable tolerance to Byzantine workers. The key idea behind the developed method is the use of gradient clipping to control stochastic gradient differences in recursive variance reduction. This allows us to bound the potential harm caused by Byzantine workers, even during iterations when all sampled clients are Byzantine. Furthermore, we incorporate communication compression into the method to enhance communication efficiency. Under quite general assumptions, we prove convergence rates for the proposed method that match the existing state-of-the-art (SOTA) theoretical results.
Machine Learning For An Explainable Cost Prediction of Medical Insurance
Authors: Authors: Ugochukwu Orji, Elochukwu Ukwandu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2311.14139
Pdf link: https://arxiv.org/pdf/2311.14139
Abstract Predictive modeling in healthcare continues to be an active actuarial research topic as more insurance companies aim to maximize the potential of Machine Learning approaches to increase their productivity and efficiency. In this paper, the authors deployed three regression-based ensemble ML models that combine variations of decision trees through Extreme Gradient Boosting, Gradient-boosting Machine, and Random Forest methods in predicting medical insurance costs. Explainable Artificial Intelligence methods SHapley Additive exPlanations and Individual Conditional Expectation plots were deployed to discover and explain the key determinant factors that influence medical insurance premium prices in the dataset. The dataset used comprised 986 records and is publicly available in the KAGGLE repository. The models were evaluated using four performance evaluation metrics, including R-squared, Mean Absolute Error, Root Mean Squared Error, and Mean Absolute Percentage Error. The results show that all models produced impressive outcomes; however, the XGBoost model achieved a better overall performance although it also expended more computational resources, while the RF model recorded a lesser prediction error and consumed far fewer computing resources than the XGBoost model. Furthermore, we compared the outcome of both XAI methods in identifying the key determinant features that influenced the PremiumPrices for each model. Whereas both XAI methods produced similar outcomes, we found that the ICE plots showed the interactions between each variable in more detail than the SHAP analysis, which seemed to be more high-level. It is the aim of the authors that the contributions of this study will help policymakers, insurers, and potential medical insurance buyers in their decision-making process for selecting the right policies that meet their specific needs.
Gradient-based bilevel optimization for multi-penalty Ridge regression through matrix differential calculus
Authors: Authors: Gabriele Maroni, Loris Cannelli, Dario Piga
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.14182
Pdf link: https://arxiv.org/pdf/2311.14182
Abstract Common regularization algorithms for linear regression, such as LASSO and Ridge regression, rely on a regularization hyperparameter that balances the tradeoff between minimizing the fitting error and the norm of the learned model coefficients. As this hyperparameter is scalar, it can be easily selected via random or grid search optimizing a cross-validation criterion. However, using a scalar hyperparameter limits the algorithm's flexibility and potential for better generalization. In this paper, we address the problem of linear regression with l2-regularization, where a different regularization hyperparameter is associated with each input variable. We optimize these hyperparameters using a gradient-based approach, wherein the gradient of a cross-validation criterion with respect to the regularization hyperparameters is computed analytically through matrix differential calculus. Additionally, we introduce two strategies tailored for sparse model learning problems aiming at reducing the risk of overfitting to the validation data. Numerical examples demonstrate that our multi-hyperparameter regularization approach outperforms LASSO, Ridge, and Elastic Net regression. Moreover, the analytical computation of the gradient proves to be more efficient in terms of computational time compared to automatic differentiation, especially when handling a large number of input variables. Application to the identification of over-parameterized Linear Parameter-Varying models is also presented.
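As a brief illustration of the multi-penalty setting described in the abstract above, the standard closed form for Ridge regression with one penalty per input variable, and the resulting derivative of the coefficients with respect to a single penalty, can be written as follows. This is the textbook formulation under the assumption that $X^\top X + \Lambda$ is invertible (for example, all $\lambda_j > 0$); the symbols $X$, $y$, $w$, $\Lambda$, and $e_j$ are notation chosen here and may differ from the paper's.

```latex
\hat{w}(\lambda)
  = \arg\min_{w}\; \lVert y - Xw \rVert_2^2 + \sum_{j=1}^{d} \lambda_j w_j^2
  = \left(X^\top X + \Lambda\right)^{-1} X^\top y,
\qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d),

\frac{\partial \hat{w}}{\partial \lambda_j}
  = -\left(X^\top X + \Lambda\right)^{-1} e_j e_j^\top \hat{w}
  = -\hat{w}_j \left(X^\top X + \Lambda\right)^{-1} e_j .
```

Here $e_j$ is the $j$-th standard basis vector. The second line follows from $\mathrm{d}(A^{-1}) = -A^{-1}\,\mathrm{d}A\,A^{-1}$ with $A = X^\top X + \Lambda$, and is the kind of analytical expression that a chain rule through a validation criterion would build on.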
Learning to Solve Inverse Problems for Perceptual Sound Matching
Authors: Authors: Han Han, Vincent Lostanlen, Mathieu Lagrange
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2311.14213
Pdf link: https://arxiv.org/pdf/2311.14213
Abstract Perceptual sound matching (PSM) aims to find the input parameters to a synthesizer so as to best imitate an audio target. Deep learning for PSM optimizes a neural network to analyze and reconstruct prerecorded samples. In this context, our article addresses the problem of designing a suitable loss function when the training set is generated by a differentiable synthesizer. Our main contribution is perceptual-neural-physical loss (PNP), which aims at addressing a tradeoff between perceptual relevance and computational efficiency. The key idea behind PNP is to linearize the effect of synthesis parameters upon auditory features in the vicinity of each training sample. The linearization procedure is massively parallelizable, can be precomputed, and offers a 100-fold speedup during gradient descent compared to differentiable digital signal processing (DDSP). We demonstrate PNP on two datasets of nonstationary sounds: an AM/FM arpeggiator and a physical model of rectangular membranes. We show that PNP is able to accelerate DDSP with joint time-frequency scattering transform (JTFS) as auditory feature, while preserving its perceptual fidelity. Additionally, we evaluate the impact of other design choices in PSM: parameter rescaling, pretraining, auditory representation, and gradient clipping. We report state-of-the-art results on both datasets and find that PNP-accelerated JTFS has greater influence on PSM performance than any other design choice.
Risk Bounds of Accelerated SGD for Overparameterized Linear Regression
Authors: Authors: Xuheng Li, Yihe Deng, Jingfeng Wu, Dongruo Zhou, Quanquan Gu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2311.14222
Pdf link: https://arxiv.org/pdf/2311.14222
Abstract Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.
CRISP: Hybrid Structured Sparsity for Class-aware Model Pruning
Authors: Shivam Aggarwal, Kuluhan Binici, Tulika Mitra
Subjects: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14272
Pdf link: https://arxiv.org/pdf/2311.14272
Abstract Machine learning pipelines for classification tasks often train a universal model to achieve accuracy across a broad range of classes. However, a typical user encounters only a limited selection of classes regularly. This disparity provides an opportunity to enhance computational efficiency by tailoring models to focus on user-specific classes. Existing works rely on unstructured pruning, which introduces randomly distributed non-zero values in the model, making it unsuitable for hardware acceleration. Alternatively, some approaches employ structured pruning, such as channel pruning, but these tend to provide only minimal compression and may lead to reduced model accuracy. In this work, we propose CRISP, a novel pruning framework leveraging a hybrid structured sparsity pattern that combines both fine-grained N:M structured sparsity and coarse-grained block sparsity. Our pruning strategy is guided by a gradient-based class-aware saliency score, allowing us to retain weights crucial for user-specific classes. CRISP achieves high accuracy with minimal memory consumption for popular models like ResNet-50, VGG-16, and MobileNetV2 on ImageNet and CIFAR-100 datasets. Moreover, CRISP delivers up to 14$\times$ reduction in latency and energy consumption compared to existing pruning methods while maintaining comparable accuracy. Our code is available at https://github.com/shivmgg/CRISP/.
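As a rough illustration of the fine-grained N:M pattern mentioned in the abstract above, the sketch below keeps the N highest-scoring weights in every consecutive group of M and zeroes the rest. It is not the CRISP implementation: the function name `nm_prune` is hypothetical, and weight magnitude is used as a stand-in score where the paper uses a gradient-based class-aware saliency score.

```python
def nm_prune(weights, n=2, m=4, scores=None):
    """Keep the n highest-scoring entries in each consecutive group of m.

    `scores` defaults to weight magnitude, which is an assumption made
    for this sketch; any per-weight saliency score can be passed instead.
    """
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    if scores is None:
        scores = [abs(w) for w in weights]
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        group_scores = scores[start:start + m]
        # Positions (within the group) of the n largest scores.
        order = sorted(range(m), key=lambda i: group_scores[i], reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned_row = nm_prune(row)  # 2:4 sparsity: two nonzeros per group of four
print(pruned_row)
```

Because every group of M holds exactly N nonzeros, the nonzero positions can be stored in a fixed-size index per group, which is what makes this pattern friendlier to hardware than unstructured pruning.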
Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling
Authors: Mingze Wang, Zeping Min, Lei Wu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2311.14387
Pdf link: https://arxiv.org/pdf/2311.14387
Abstract In this work, we investigate the margin-maximization bias exhibited by gradient-based algorithms in classifying linearly separable data. We present an in-depth analysis of the specific properties of the velocity field associated with (normalized) gradients, focusing on their role in margin maximization. Inspired by this analysis, we propose a novel algorithm called Progressive Rescaling Gradient Descent (PRGD) and show that PRGD can maximize the margin at an {\em exponential rate}. This stands in stark contrast to all existing algorithms, which maximize the margin at a slow {\em polynomial rate}. Specifically, we identify mild conditions on data distribution under which existing algorithms such as gradient descent (GD) and normalized gradient descent (NGD) {\em provably fail} in maximizing the margin efficiently. To validate our theoretical findings, we present both synthetic and real-world experiments. Notably, PRGD also shows promise in enhancing the generalization performance when applied to linearly non-separable datasets and deep neural networks.
Directly Attention Loss Adjusted Prioritized Experience Replay
Authors: Zhuoying Chen, Huiping Li, Zhaoxu Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2311.14390
Pdf link: https://arxiv.org/pdf/2311.14390
Abstract Prioritized Experience Replay (PER) enables the model to learn more about relatively important samples by artificially changing their accessed frequencies. However, this non-uniform sampling method shifts the state-action distribution that is originally used to estimate Q-value functions, which introduces estimation deviation. In this article, a novel off-policy reinforcement learning training framework called Directly Attention Loss Adjusted Prioritized Experience Replay (DALAP) is proposed, which can directly quantify the extent of the distribution shift through a Parallel Self-Attention network, so as to accurately compensate for the error. In addition, a Priority-Encouragement mechanism is designed to optimize the sample screening criterion and further improve the training efficiency. To verify the effectiveness and generality of DALAP, we integrate it with value-function-based, policy-gradient-based, and multi-agent reinforcement learning algorithms, respectively. Multiple groups of comparative experiments show that DALAP both improves the convergence rate and reduces the training variance.
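For context on the distribution shift the abstract above refers to, here is a minimal sketch of standard proportional PER, not of DALAP itself. Transitions are drawn with probability proportional to priority^alpha, and importance weights (N * P(i))^(-beta) partially correct the resulting bias. The function names and the values of `alpha` and `beta` are assumptions chosen for illustration.

```python
import random


def per_probabilities(priorities, alpha=0.6):
    """Sampling probability P(i) proportional to priority_i ** alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """Weights (N * P(i)) ** -beta, normalized so the largest weight is 1."""
    n = len(probs)
    raw = [(n * p) ** -beta for p in probs]
    top = max(raw)
    return [w / top for w in raw]


priorities = [0.1, 1.0, 2.0, 0.5]   # hypothetical per-transition priorities
probs = per_probabilities(priorities)
weights = importance_weights(probs)

# Draw a mini-batch of indices; frequently sampled transitions get
# smaller loss weights to compensate for being over-represented.
rng = random.Random(0)
batch = rng.choices(range(len(priorities)), weights=probs, k=2)
print(batch, [weights[i] for i in batch])
```

With `beta` below 1 the correction is only partial, which is one way to see why a residual estimation deviation remains under this scheme.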
DP-NMT: Scalable Differentially-Private Machine Translation
Authors: Timour Igamberdiev, Doan Nam Long Vu, Felix Künnecke, Zhuo Yu, Jannik Holmer, Ivan Habernal
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2311.14465
Pdf link: https://arxiv.org/pdf/2311.14465
Abstract Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
Efficient Gradient Estimation via Adaptive Sampling and Importance Sampling
Authors: Corentin Salaün, Xingchang Huang, Iliyan Georgiev, Niloy J. Mitra, Gurprit Singh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2311.14468
Pdf link: https://arxiv.org/pdf/2311.14468
Abstract Machine learning problems rely heavily on stochastic gradient descent (SGD) for optimization. The effectiveness of SGD is contingent upon accurately estimating gradients from a mini-batch of data samples. Instead of the commonly used uniform sampling, adaptive or importance sampling reduces noise in gradient estimation by forming mini-batches that prioritize crucial data points. Previous research has suggested that data points should be selected with probabilities proportional to their gradient norm. Nevertheless, existing algorithms have struggled to efficiently integrate importance sampling into machine learning frameworks. In this work, we make two contributions. First, we present an algorithm that can incorporate existing importance functions into our framework. Second, we propose a simplified importance function that relies solely on the loss gradient of the output layer. By leveraging our proposed gradient estimation techniques, we observe improved convergence in classification and regression tasks with minimal computational overhead. We validate the effectiveness of our adaptive and importance-sampling approach on image and point-cloud datasets.
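The idea of selecting data points with probability proportional to their gradient norm can be sketched as follows. This is a toy illustration with scalar per-sample gradients, not the authors' algorithm: each sampled gradient is divided by N * p_i so that the estimator stays unbiased for the mean gradient. The names `sampling_probs` and `weighted_gradient` are hypothetical, and all gradient norms are assumed to be strictly positive.

```python
def sampling_probs(grad_norms):
    """Probability of picking sample i, proportional to its gradient norm."""
    total = sum(grad_norms)
    return [g / total for g in grad_norms]


def weighted_gradient(grads, probs, idx):
    """Single-sample estimate of the mean gradient, reweighted by 1 / (N * p_i)."""
    n = len(grads)
    return grads[idx] / (n * probs[idx])


grads = [0.1, -0.2, 4.0, 0.5]   # toy scalar per-sample gradients
probs = sampling_probs([abs(g) for g in grads])

# Expectation of the estimator over the sampling distribution
# equals the plain mean gradient (up to floating-point rounding).
expected = sum(p * weighted_gradient(grads, probs, i) for i, p in enumerate(probs))
mean_grad = sum(grads) / len(grads)
print(expected, mean_grad)
```

The reweighting is what keeps the estimate unbiased; the sampling distribution only changes its variance.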
Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach
Authors: Xinwei Zhang, Zhiqi Bu, Zhiwei Steven Wu, Mingyi Hong
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2311.14632
Pdf link: https://arxiv.org/pdf/2311.14632
Abstract Differentially Private Stochastic Gradient Descent with gradient clipping (DPSGD-GC) is a powerful tool for training deep learning models using sensitive data, providing both a solid theoretical privacy guarantee and high efficiency. However, using DPSGD-GC to ensure Differential Privacy (DP) comes at the cost of model performance degradation due to DP noise injection and gradient clipping. Existing research has extensively analyzed the theoretical convergence of DPSGD-GC, and has shown that it only converges when using large clipping thresholds that are dependent on problem-specific parameters. Unfortunately, these parameters are often unknown in practice, making it hard to choose the optimal clipping threshold. Therefore, in practice, DPSGD-GC suffers from degraded performance due to the {\it constant} bias introduced by the clipping. In our work, we propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC, which not only offers a diminishing utility bound without inducing a constant clipping bias, but more importantly, it allows for an arbitrary choice of clipping threshold that is independent of the problem. We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on R{\'e}nyi DP. Additionally, we demonstrate that under mild conditions, our algorithm can achieve nearly the same utility bound as DPSGD without gradient clipping. Our empirical results on Cifar-10/100 and E2E datasets show that the proposed algorithm achieves higher accuracies than DPSGD while maintaining the same level of DP guarantee.
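To make the clipping bias mentioned above concrete, the following sketch shows one gradient-aggregation step of textbook DP-SGD with per-sample clipping and Gaussian noise. It is not the error-feedback algorithm proposed in the paper. The helper names, the clipping threshold `c`, and the noise multiplier `sigma` are assumptions for illustration, and `sigma` is set to 0 in the example only so that the effect of clipping can be seen in isolation.

```python
import math
import random


def clip(grad, c):
    """Scale a per-sample gradient so its L2 norm is at most c."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, c, sigma, rng):
    """Clip each gradient, sum, add N(0, (sigma * c)^2) noise per coordinate, average."""
    clipped = [clip(g, c) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    noisy = [s + rng.gauss(0.0, sigma * c) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]   # toy per-sample gradients
true_mean = [sum(g[j] for g in grads) / len(grads) for j in range(2)]

# sigma=0 disables the noise so only the clipping effect is visible.
clipped_mean = dp_sgd_gradient(grads, c=1.0, sigma=0.0, rng=random.Random(0))
print(true_mean, clipped_mean)
```

Even with no noise, the clipped average differs from the true mean gradient in both magnitude and direction, and this gap does not shrink as training proceeds when the threshold is too small.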
Convergence Analysis for Learning Orthonormal Deep Linear Neural Networks
Authors: Zhen Qin, Xuwei Tan, Zhihui Zhu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2311.14658
Pdf link: https://arxiv.org/pdf/2311.14658
Abstract Enforcing an orthonormal or isometric property for the weight matrices has been shown to enhance the training of deep neural networks by mitigating gradient exploding/vanishing and increasing the robustness of the learned networks. However, despite its practical performance, the theoretical analysis of orthonormality in neural networks is still lacking; for example, how orthonormality affects the convergence of the training process. In this letter, we aim to bridge this gap by providing convergence analysis for training orthonormal deep linear neural networks. Specifically, we show that Riemannian gradient descent with an appropriate initialization converges at a linear rate for training orthonormal deep linear neural networks with a class of loss functions. Unlike existing works that enforce orthonormal weight matrices for all the layers, our approach excludes this requirement for one layer, which is crucial to establish the convergence guarantee. Our results shed light on how increasing the number of hidden layers can impact the convergence speed. Experimental results validate our theoretical analysis.
Keyword: super-resolution

Image Super-Resolution with Text Prompt Diffusion
Authors: Zheng Chen, Yulun Zhang, Jinjin Gu, Xin Yuan, Linghe Kong, Guihai Chen, Xiaokang Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2311.14282
Pdf link: https://arxiv.org/pdf/2311.14282
Abstract Image super-resolution (SR) methods typically model degradation to improve reconstruction accuracy in complex and unknown degradation scenarios. However, extracting degradation information from low-resolution images is challenging, which limits the model performance. To boost image SR performance, one feasible approach is to introduce additional priors. Inspired by advancements in multi-modal methods and text prompt image processing, we introduce text prompts to image SR to provide degradation priors. Specifically, we first design a text-image generation pipeline to integrate text into the SR dataset through the text degradation representation and degradation model. The text representation uses a binning-based discretization to describe the degradation abstractly. This representation method can also maintain the flexibility of language. Meanwhile, we propose PromptSR to realize text prompt SR. PromptSR employs the diffusion model and the pre-trained language model (e.g., T5 and CLIP). We train the model on the generated text-image dataset. Extensive experiments indicate that introducing text prompts into image SR yields excellent results on both synthetic and real-world images. Code: https://github.com/zhengchen1999/PromptSR.

New submissions for Mon, 27 Nov 23 #651

Keyword: sgd

A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer

Sample as You Infer: Predictive Coding With Langevin Dynamics

DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release

Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

DP-NMT: Scalable Differentially-Private Machine Translation

Efficient Gradient Estimation via Adaptive Sampling and Importance Sampling

Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach

Keyword: optimization

BackboneLearn: A Library for Scaling Mixed-Integer Optimization-Based Machine Learning

Nova$^+$: Generative Language Models for Binaries

Federated Learning Assisted Distributed Energy Optimization

Posterior Distillation Sampling

Exact Combinatorial Optimization with Temporo-Attentional Graph Neural Networks

A Deep Reinforcement Learning Approach for Improving Age of Information in Mission-Critical IoT

Locally Optimal Descent for Dynamic Stepsize Scheduling

Unsupervised Learning for Topological Classification of Transportation Networks

Parameter Exchange for Robust Dynamic Domain Generalization

Efficient Trigger Word Insertion

An optimal first-order Taylor-like formula with a minimized remainder

Direct Preference-Based Evolutionary Multi-Objective Optimization with Dueling Bandit

On the Hyperparameter Landscapes of Machine Learning Algorithms

SySMOL: A Hardware-software Co-design Framework for Ultra-Low and Fine-Grained Mixed-Precision Neural Networks

Variational Annealing on Graphs for Combinatorial Optimization

TCuPGAN: A novel framework developed for optimizing human-machine interactions in citizen science

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

Formulations to select assets for constructing sparse index tracking portfolios

Constant-Time Wasmtime, for Real This Time: End-to-End Verified Zero-Overhead Constant-Time Programming for the Web and Beyond

Segmentation-Based Parametric Painting

Stable Cluster Discrimination for Deep Clustering

BHGNN-RT: Network embedding for directed heterogeneous graphs

Receding Horizon Optimization with PPUM: An Approach for Autonomous Robot Path Planning in Uncertain Environments

Efficient Gradient Estimation via Adaptive Sampling and Importance Sampling

MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-to-3D Generation

StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization

Pitfalls of Projection: A study of Newton-type solvers for incremental potentials

Electric Vehicles coordination for grid balancing using multi-objective Harris Hawks Optimization

A Survey and Analysis of Evolutionary Operators for Permutations

Target-driven splitting SPH optimization of thermal conductivity distribution

Evolution of Neural Architectures for Financial Forecasting: A Note on Data Incompatibility during Crisis Periods

Received Signal and Channel Parameter Estimation in Molecular Communications

A General Framework for User-Guided Bayesian Optimization

Keyword: adam

Keyword: gradient

A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer

A Joint Gradient and Loss Based Clustered Federated Learning Design

Single-Shot Plug-and-Play Methods for Inverse Problems

OASIS: Offsetting Active Reconstruction Attacks in Federated Learning

Learning Hierarchical Polynomials with Three-Layer Neural Networks

Max-Min SINR Analysis of STAR-RIS Assisted Massive MIMO Systems with Hardware Impairments

Locally Optimal Descent for Dynamic Stepsize Scheduling

Leveraging Optimal Transport via Projections on Subspaces for Machine Learning Applications

Parameter Exchange for Robust Dynamic Domain Generalization

Unconstrained learning of networked nonlinear systems via free parametrization of stable interconnected operators

When Side-Channel Attacks Break the Black-Box Property of Embedded Artificial Intelligence

Understanding the Vulnerability of CLIP to Image Compression

DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release

Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation

Byzantine Robustness and Partial Participation Can Be Achieved Simultaneously: Just Clip Gradient Differences

Machine Learning For An Explainable Cost Prediction of Medical Insurance

Gradient-based bilevel optimization for multi-penalty Ridge regression through matrix differential calculus

Learning to Solve Inverse Problems for Perceptual Sound Matching

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

CRISP: Hybrid Structured Sparsity for Class-aware Model Pruning

Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling

Directly Attention Loss Adjusted Prioritized Experience Replay

DP-NMT: Scalable Differentially-Private Machine Translation

Efficient Gradient Estimation via Adaptive Sampling and Importance Sampling

Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach

Convergence Analysis for Learning Orthonormal Deep Linear Neural Networks

Keyword: super-resolution

Image Super-Resolution with Text Prompt Diffusion