New submissions for Fri, 4 Nov 22

Keyword: sgd

A Convergence Theory for Federated Average: Beyond Smoothness

Authors: Authors: Xiaoxiao Li, Zhao Song, Runzhou Tao, Guangyi Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2211.01588
Pdf link: https://arxiv.org/pdf/2211.01588
Abstract Federated learning enables a large amount of edge computing devices to learn a model without data sharing jointly. As a leading algorithm in this setting, Federated Average FedAvg, which runs Stochastic Gradient Descent (SGD) in parallel on local devices and averages the sequences only once in a while, have been widely used due to their simplicity and low communication cost. However, despite recent research efforts, it lacks theoretical analysis under assumptions beyond smoothness. In this paper, we analyze the convergence of FedAvg. Different from the existing work, we relax the assumption of strong smoothness. More specifically, we assume the semi-smoothness and semi-Lipschitz properties for the loss function, which have an additional first-order term in assumption definitions. In addition, we also assume bound on the gradient, which is weaker than the commonly used bounded gradient assumption in the convergence analysis scheme. As a solution, this paper provides a theoretical convergence study on Federated Learning.
Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization
Authors: Authors: Feihu Huang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2211.01883
Pdf link: https://arxiv.org/pdf/2211.01883
Abstract Composition optimization recently appears in many machine learning applications such as meta learning and reinforcement learning. Recently many composition optimization algorithms have been proposed and studied, however, few adaptive algorithm considers the composition optimization under the distributed setting. Meanwhile, the existing distributed composition optimization methods still suffer from high sample and communication complexities. In the paper, thus, we develop a class of faster momentum-based federated compositional gradient descent algorithms (i.e., MFCGD and AdaMFCGD) to solve the nonconvex distributed composition problems, which builds on the momentum-based variance reduced and local-SGD techniques. In particular, our adaptive algorithm (i.e., AdaMFCGD) uses a unified adaptive matrix to flexibly incorporate various adaptive learning rates. Moreover, we provide a solid theoretical analysis for our algorithms under non-i.i.d. setting, and prove our algorithms obtain a lower sample and communication complexities simultaneously than the existing federated compositional algorithms. Specifically, our algorithms obtain lower sample complexity of $\tilde{O}(\epsilon^{-3})$ with lower communication complexity of $\tilde{O}(\epsilon^{-2})$ in finding an $\epsilon$-stationary point. We conduct the experiments on robust federated learning and distributed meta learning tasks to demonstrate efficiency of our algorithms.
Single SMPC Invocation DPHelmet: Differentially Private Distributed Learning on a Large Scale
Authors: Authors: Moritz Kirschte, Sebastian Meiser, Saman Ardalan, Esfandiar Mohammadi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2211.02003
Pdf link: https://arxiv.org/pdf/2211.02003
Abstract Distributing machine learning predictors enables the collection of large-scale datasets while leaving sensitive raw data at trustworthy sites. We show that locally training support vector machines (SVMs) and computing their averages leads to a learning technique that is scalable to a large number of users, satisfies differential privacy, and is applicable to non-trivial tasks, such as CIFAR-10. For a large number of participants, communication cost is one of the main challenges. We achieve a low communication cost by requiring only a single invocation of an efficient secure multiparty summation protocol. By relying on state-of-the-art feature extractors (SimCLR), we are able to utilize differentially private convex learners for non-trivial tasks such as CIFAR-10. Our experimental results illustrate that for $1{,}000$ users with $50$ data points each, our scheme outperforms state-of-the-art scalable distributed learning methods (differentially private federated learning, short DP-FL) while requiring around $500$ times fewer communication costs: For CIFAR-10, we achieve a classification accuracy of $79.7\,\%$ for an $\varepsilon = 0.59$ while DP-FL achieves $57.6\,\%$. More generally, we prove learnability properties for the average of such locally trained models: convergence and uniform stability. By only requiring strongly convex, smooth, and Lipschitz-continuous objective functions, locally trained via stochastic gradient descent (SGD), we achieve a strong utility-privacy tradeoff.
Keyword: optimization

Speeding up NAS with Adaptive Subset Selection
Authors: Authors: Vishak Prasad C, Colin White, Paarth Jain, Sibasis Nayak, Ganesh Ramakrishnan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2211.01454
Pdf link: https://arxiv.org/pdf/2211.01454
Abstract A majority of recent developments in neural architecture search (NAS) have been aimed at decreasing the computational cost of various techniques without affecting their final performance. Towards this goal, several low-fidelity and performance prediction methods have been considered, including those that train only on subsets of the training data. In this work, we present an adaptive subset selection approach to NAS and present it as complementary to state-of-the-art NAS approaches. We uncover a natural connection between one-shot NAS algorithms and adaptive subset selection and devise an algorithm that makes use of state-of-the-art techniques from both areas. We use these techniques to substantially reduce the runtime of DARTS-PT (a leading one-shot NAS algorithm), as well as BOHB and DEHB (leading multifidelity optimization algorithms), without sacrificing accuracy. Our results are consistent across multiple datasets, and towards full reproducibility, we release our code at https: //anonymous.4open.science/r/SubsetSelection NAS-B132.
PI is back! Switching Acquisition Functions in Bayesian Optimization
Authors: Authors: Carolin Benjamins, Elena Raponi, Anja Jankovic, Koen van der Blom, Maria Laura Santoni, Marius Lindauer, Carola Doerr
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2211.01455
Pdf link: https://arxiv.org/pdf/2211.01455
Abstract Bayesian Optimization (BO) is a powerful, sample-efficient technique to optimize expensive-to-evaluate functions. Each of the BO components, such as the surrogate model, the acquisition function (AF), or the initial design, is subject to a wide range of design choices. Selecting the right components for a given optimization task is a challenging task, which can have significant impact on the quality of the obtained results. In this work, we initiate the analysis of which AF to favor for which optimization scenarios. To this end, we benchmark SMAC3 using Expected Improvement (EI) and Probability of Improvement (PI) as acquisition functions on the 24 BBOB functions of the COCO environment. We compare their results with those of schedules switching between AFs. One schedule aims to use EI's explorative behavior in the early optimization steps, and then switches to PI for a better exploitation in the final steps. We also compare this to a random schedule and round-robin selection of EI and PI. We observe that dynamic schedules oftentimes outperform any single static one. Our results suggest that a schedule that allocates the first 25 % of the optimization budget to EI and the last 75 % to PI is a reliable default. However, we also observe considerable performance differences for the 24 functions, suggesting that a per-instance allocation, possibly learned on the fly, could offer significant improvement over the state-of-the-art BO designs.
Regression Compatible Listwise Objectives for Calibrated Ranking
Authors: Authors: Aijun Bai, Rolf Jagerman, Zhen Qin, Pratyush Kar, Bing-Rong Lin, Xuanhui Wang, Michael Bendersky, Marc Najork
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2211.01494
Pdf link: https://arxiv.org/pdf/2211.01494
Abstract As Learning-to-Rank (LTR) approaches primarily seek to improve ranking quality, their output scores are not scale-calibrated by design -- for example, adding a constant to the score of each item on the list will not affect the list ordering. This fundamentally limits LTR usage in score-sensitive applications. Though a simple multi-objective approach that combines a regression and a ranking objective can effectively learn scale-calibrated scores, we argue that the two objectives can be inherently conflicting, which makes the trade-off far from ideal for both of them. In this paper, we propose a novel regression compatible ranking (RCR) approach to achieve a better trade-off. The advantage of the proposed approach is that the regression and ranking components are well aligned which brings new opportunities for harmonious regression and ranking. Theoretically, we show that the two components share the same minimizer at global minima while the regression component ensures scale calibration. Empirically, we show that the proposed approach performs well on both regression and ranking metrics on several public LTR datasets, and significantly improves the Pareto frontiers in the context of multi-objective optimization. Furthermore, we evaluated the proposed approach on YouTube Search and found that it not only improved the ranking quality of the production pCTR model, but also brought gains to the click prediction accuracy.
Max Markov Chain
Authors: Authors: Yu Zhang, Mitchell Bucklew
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2211.01496
Pdf link: https://arxiv.org/pdf/2211.01496
Abstract In this paper, we introduce Max Markov Chain (MMC), a novel representation for a useful subset of High-order Markov Chains (HMCs) with sparse correlations among the states. MMC is parsimony while retaining the expressiveness of HMCs. Even though parameter optimization is generally intractable as with HMC approximate models, it has an analytical solution, better sample efficiency, and the desired spatial and computational advantages over HMCs and approximate HMCs. Simultaneously, efficient approximate solutions exist for this type of chains as we show empirically, which allow MMCs to scale to large domains where HMCs and approximate HMCs would struggle to perform. We compare MMC with HMC, first-order Markov chain, and an approximate HMC model in synthetic domains with various data types to demonstrate that MMC is a valuable alternative for modeling stochastic processes and has many potential applications.
On the Safety of Interpretable Machine Learning: A Maximum Deviation Approach
Authors: Authors: Dennis Wei, Rahul Nair, Amit Dhurandhar, Kush R. Varshney, Elizabeth M. Daly, Moninder Singh
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2211.01498
Pdf link: https://arxiv.org/pdf/2211.01498
Abstract Interpretable and explainable machine learning has seen a recent surge of interest. We focus on safety as a key motivation behind the surge and make the relationship between interpretability and safety more quantitative. Toward assessing safety, we introduce the concept of maximum deviation via an optimization problem to find the largest deviation of a supervised learning model from a reference model regarded as safe. We then show how interpretability facilitates this safety assessment. For models including decision trees, generalized linear and additive models, the maximum deviation can be computed exactly and efficiently. For tree ensembles, which are not regarded as interpretable, discrete optimization techniques can still provide informative bounds. For a broader class of piecewise Lipschitz functions, we leverage the multi-armed bandit literature to show that interpretability produces tighter (regret) bounds on the maximum deviation. We present case studies, including one on mortgage approval, to illustrate our methods and the insights about models that may be obtained from deviation maximization.
$D^2$SLAM: Decentralized and Distributed Collaborative Visual-inertial SLAM System for Aerial Swarm
Authors: Authors: Hao Xu, Peize Liu, Xinyi Chen, Shaojie Shen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2211.01538
Pdf link: https://arxiv.org/pdf/2211.01538
Abstract In recent years, aerial swarm technology has developed rapidly. In order to accomplish a fully autonomous aerial swarm, a key technology is decentralized and distributed collaborative SLAM (CSLAM) for aerial swarms, which estimates the relative pose and the consistent global trajectories. In this paper, we propose $D^2$SLAM: a decentralized and distributed ($D^2$) collaborative SLAM algorithm. This algorithm has high local accuracy and global consistency, and the distributed architecture allows it to scale up. $D^2$SLAM covers swarm state estimation in two scenarios: near-field state estimation for high real-time accuracy at close range and far-field state estimation for globally consistent trajectories estimation at the long-range between UAVs. Distributed optimization algorithms are adopted as the backend to achieve the $D^2$ goal. $D^2$SLAM is robust to transient loss of communication, network delays, and other factors. Thanks to the flexible architecture, $D^2$SLAM has the potential of applying in various scenarios.
nerf2nerf: Pairwise Registration of Neural Radiance Fields
Authors: Authors: Lily Goli, Daniel Rebain, Sara Sabour, Animesh Garg, Andrea Tagliasacchi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2211.01600
Pdf link: https://arxiv.org/pdf/2211.01600
Abstract We introduce a technique for pairwise registration of neural fields that extends classical optimization-based local registration (i.e. ICP) to operate on Neural Radiance Fields (NeRF) -- neural 3D scene representations trained from collections of calibrated images. NeRF does not decompose illumination and color, so to make registration invariant to illumination, we introduce the concept of a ''surface field'' -- a field distilled from a pre-trained NeRF model that measures the likelihood of a point being on the surface of an object. We then cast nerf2nerf registration as a robust optimization that iteratively seeks a rigid transformation that aligns the surface fields of the two scenes. We evaluate the effectiveness of our technique by introducing a dataset of pre-trained NeRF scenes -- our synthetic scenes enable quantitative evaluations and comparisons to classical registration techniques, while our real scenes demonstrate the validity of our technique in real-world scenarios. Additional results available at: https://nerf2nerf.github.io
Pairing optimization via statistics: Algebraic structure in pairing problems and its application to performance enhancement
Authors: Authors: Naoki Fujita, André Röhm, Takatomo Mihana, Ryoichi Horisaki, Aohan Li, Mikio Hasegawa, Makoto Naruse
Subjects: Data Structures and Algorithms (cs.DS); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2211.01661
Pdf link: https://arxiv.org/pdf/2211.01661
Abstract Fully pairing all elements of a set while attempting to maximize the total benefit is a combinatorically difficult problem. Such pairing problems naturally appear in various situations in science, technology, economics, and other fields. In our previous study, we proposed an efficient method to infer the underlying compatibilities among the entities, under the constraint that only the total compatibility is observable. Furthermore, by transforming the pairing problem into a traveling salesman problem with a multi-layer architecture, a pairing optimization algorithm was successfully demonstrated to derive a high-total-compatibility pairing. However, there is substantial room for further performance enhancement by further exploiting the underlying mathematical properties. In this study, we prove the existence of algebraic structures in the pairing problem. We transform the initially estimated compatibility information into an equivalent form where the variance of the individual compatibilities is minimized. We then demonstrate that the total compatibility obtained when using the heuristic pairing algorithm on the transformed problem is significantly higher compared to the previous method. With this improved perspective on the pairing problem using fundamental mathematical properties, we can contribute to practical applications such as wireless communications beyond 5G, where efficient pairing is of critical importance.
A Round and Bipartize Approximation Algorithm for Vertex Cover
Authors: Authors: Danish Kashaev, Guido Schäfer
Subjects: Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2211.01699
Pdf link: https://arxiv.org/pdf/2211.01699
Abstract The vertex cover problem is a fundamental and widely studied combinatorial optimization problem. It is known that its standard linear programming relaxation is integral for bipartite graphs and half-integral for general graphs. As a consequence, the natural rounding algorithm based on this relaxation computes an optimal solution for bipartite graphs and a $2$-approximation for general graphs. This raises the question of whether one can obtain improved bounds on the approximation ratio, depending on how close the graph is to being bipartite. In this paper, we consider a round-and-bipartize algorithm that exploits the knowledge of an induced bipartite subgraph to attain improved approximation ratios. Equivalently, we suppose that we have access to a subset of vertices $S$ whose removal bipartizes the graph. If $S$ is an independent set, we prove an approximation ratio of $1 + 1/\rho$, where $2\rho -1$ denotes the odd girth of the contracted graph $\tilde{\mathcal{G}} := \mathcal{G} /S$ and thus satisfies $\rho \geq 2$. We show that this is tight for any graph and independent set by providing a family of weight functions for which this bound is attained. In addition, we give tight upper bounds for the fractional chromatic number and the integrality gap of such graphs, both of which also depend on the odd girth. If $S$ is an arbitrary set, we prove a tight approximation ratio of $\left(1+1/\rho \right) (1 - \alpha) + 2 \alpha$, where $\alpha \in [0,1]$ denotes the total normalized dual sum of the edges lying inside of the set $S$. As an algorithmic application, we show that for any efficiently $k$-colorable graph with $k \geq 4$ we can find a bipartizing set satisfying $\alpha \leq 1 - 4/k$. This provides an approximation algorithm recovering the bound of $2 - 2/k$ in the worst case (i.e., when $\rho = 2$), which is best possible for this setting when using the standard relaxation.
Towards Discovering Neural Architectures from Scratch
Authors: Authors: Simon Schrodi, Danny Stoll, Binxin Ru, Rhea Sukthanker, Thomas Brox, Frank Hutter
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2211.01842
Pdf link: https://arxiv.org/pdf/2211.01842
Abstract The discovery of neural architectures from scratch is the long-standing goal of Neural Architecture Search (NAS). Searching over a wide spectrum of neural architectures can facilitate the discovery of previously unconsidered but well-performing architectures. In this work, we take a large step towards discovering neural architectures from scratch by expressing architectures algebraically. This algebraic view leads to a more general method for designing search spaces, which allows us to compactly represent search spaces that are 100s of orders of magnitude larger than common spaces from the literature. Further, we propose a Bayesian Optimization strategy to efficiently search over such huge spaces, and demonstrate empirically that both our search space design and our search strategy can be superior to existing baselines. We open source our algebraic NAS approach and provide APIs for PyTorch and TensorFlow.
Circling Back to Recurrent Models of Language
Authors: Authors: Gábor Melis
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2211.01848
Pdf link: https://arxiv.org/pdf/2211.01848
Abstract Just because some purely recurrent models suffer from being hard to optimize and inefficient on today's hardware, they are not necessarily bad models of language. We demonstrate this by the extent to which these models can still be improved by a combination of a slightly better recurrent cell, architecture, objective, as well as optimization. In the process, we establish a new state of the art for language modelling on small datasets and on enwik8 with dynamic evaluation.
Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization
Authors: Authors: Feihu Huang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2211.01883
Pdf link: https://arxiv.org/pdf/2211.01883
Abstract Composition optimization recently appears in many machine learning applications such as meta learning and reinforcement learning. Recently many composition optimization algorithms have been proposed and studied, however, few adaptive algorithm considers the composition optimization under the distributed setting. Meanwhile, the existing distributed composition optimization methods still suffer from high sample and communication complexities. In the paper, thus, we develop a class of faster momentum-based federated compositional gradient descent algorithms (i.e., MFCGD and AdaMFCGD) to solve the nonconvex distributed composition problems, which builds on the momentum-based variance reduced and local-SGD techniques. In particular, our adaptive algorithm (i.e., AdaMFCGD) uses a unified adaptive matrix to flexibly incorporate various adaptive learning rates. Moreover, we provide a solid theoretical analysis for our algorithms under non-i.i.d. setting, and prove our algorithms obtain a lower sample and communication complexities simultaneously than the existing federated compositional algorithms. Specifically, our algorithms obtain lower sample complexity of $\tilde{O}(\epsilon^{-3})$ with lower communication complexity of $\tilde{O}(\epsilon^{-2})$ in finding an $\epsilon$-stationary point. We conduct the experiments on robust federated learning and distributed meta learning tasks to demonstrate efficiency of our algorithms.
Sub-network Multi-objective Evolutionary Algorithm for Filter Pruning
Authors: Authors: Xuhua Li, Weize Sun, Lei Huang, Shaowu Chen
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2211.01957
Pdf link: https://arxiv.org/pdf/2211.01957
Abstract Filter pruning is a common method to achieve model compression and acceleration in deep neural networks (DNNs).Some research regarded filter pruning as a combinatorial optimization problem and thus used evolutionary algorithms (EA) to prune filters of DNNs. However, it is difficult to find a satisfactory compromise solution in a reasonable time due to the complexity of solution space searching. To solve this problem, we first formulate a multi-objective optimization problem based on a sub-network of the full model and propose a Sub-network Multiobjective Evolutionary Algorithm (SMOEA) for filter pruning. By progressively pruning the convolutional layers in groups, SMOEA can obtain a lightweight pruned result with better performance.Experiments on VGG-14 model for CIFAR-10 verify the effectiveness of the proposed SMOEA. Specifically, the accuracy of the pruned model with 16.56% parameters decreases by 0.28% only, which is better than the widely used popular filter pruning criteria.
Analyzing Performance Issues of Virtual Reality Applications
Authors: Authors: Jason Hogan, Aaron Salo, Dhia Elhaq Rzig, Foyzul Hassan, Bruce Maxim
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2211.02013
Pdf link: https://arxiv.org/pdf/2211.02013
Abstract Extended Reality (XR) includes Virtual Reality (VR), Augmented Reality (AR) and Mixed Reality (MR). XR is an emerging technology that simulates a realistic environment for users. XR techniques have provided revolutionary user experiences in various application scenarios (e.g., training, education, product/architecture design, gaming, remote conference/tour, etc.). Due to the high computational cost of rendering real-time animation in limited-resource devices and constant interaction with user activity, XR applications often face performance bottlenecks, and these bottlenecks create a negative impact on the user experience of XR software. Thus, performance optimization plays an essential role in many industry-standard XR applications. Even though identifying performance bottlenecks in traditional software (e.g., desktop applications) is a widely explored topic, those approaches cannot be directly applied within XR software due to the different nature of XR applications. Moreover, XR applications developed in different frameworks such as Unity and Unreal Engine show different performance bottleneck patterns and thus, bottleneck patterns of Unity projects can't be applied for Unreal Engine (UE)-based XR projects. To fill the knowledge gap for XR performance optimizations of Unreal Engine-based XR projects, we present the first empirical study on performance optimizations from seven UE XR projects, 78 UE XR discussion issues and three sources of UE documentation. Our analysis identified 14 types of performance bugs, including 12 types of bugs related to UE settings issues and two types of CPP source code-related issues. To further assist developers in detecting performance bugs based on the identified bug patterns, we also developed a static analyzer, UEPerfAnalyzer, that can detect performance bugs in both configuration files and source code.
Keyword: adam

Adaptive sparse grid discontinuous Galerkin method: review and software implementation
Authors: Authors: Juntao Huang, Wei Guo, Yingda Cheng
Subjects: Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Arxiv link: https://arxiv.org/abs/2211.01531
Pdf link: https://arxiv.org/pdf/2211.01531
Abstract This paper reviews the adaptive sparse grid discontinuous Galerkin (aSG-DG) method for computing high dimensional partial differential equations (PDEs) and its software implementation. The C\texttt{++} software package called AdaM-DG, implementing the aSG-DG method, is available on Github at \url{https://github.com/JuntaoHuang/adaptive-multiresolution-DG}. The package is capable of treating a large class of high dimensional linear and nonlinear PDEs. We review the essential components of the algorithm and the functionality of the software, including the multiwavelets used, assembling of bilinear operators, fast matrix-vector product for data with hierarchical structures. We further demonstrate the performance of the package by reporting numerical error and CPU cost for several benchmark test, including linear transport equations, wave equations and Hamilton-Jacobi equations.
Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization
Authors: Authors: Feihu Huang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2211.01883
Pdf link: https://arxiv.org/pdf/2211.01883
Abstract Composition optimization recently appears in many machine learning applications such as meta learning and reinforcement learning. Recently many composition optimization algorithms have been proposed and studied, however, few adaptive algorithm considers the composition optimization under the distributed setting. Meanwhile, the existing distributed composition optimization methods still suffer from high sample and communication complexities. In the paper, thus, we develop a class of faster momentum-based federated compositional gradient descent algorithms (i.e., MFCGD and AdaMFCGD) to solve the nonconvex distributed composition problems, which builds on the momentum-based variance reduced and local-SGD techniques. In particular, our adaptive algorithm (i.e., AdaMFCGD) uses a unified adaptive matrix to flexibly incorporate various adaptive learning rates. Moreover, we provide a solid theoretical analysis for our algorithms under non-i.i.d. setting, and prove our algorithms obtain a lower sample and communication complexities simultaneously than the existing federated compositional algorithms. Specifically, our algorithms obtain lower sample complexity of $\tilde{O}(\epsilon^{-3})$ with lower communication complexity of $\tilde{O}(\epsilon^{-2})$ in finding an $\epsilon$-stationary point. We conduct the experiments on robust federated learning and distributed meta learning tasks to demonstrate efficiency of our algorithms.
Keyword: gradient

A Convergence Theory for Federated Average: Beyond Smoothness
Authors: Authors: Xiaoxiao Li, Zhao Song, Runzhou Tao, Guangyi Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2211.01588
Pdf link: https://arxiv.org/pdf/2211.01588
Abstract Federated learning enables a large amount of edge computing devices to learn a model without data sharing jointly. As a leading algorithm in this setting, Federated Average FedAvg, which runs Stochastic Gradient Descent (SGD) in parallel on local devices and averages the sequences only once in a while, have been widely used due to their simplicity and low communication cost. However, despite recent research efforts, it lacks theoretical analysis under assumptions beyond smoothness. In this paper, we analyze the convergence of FedAvg. Different from the existing work, we relax the assumption of strong smoothness. More specifically, we assume the semi-smoothness and semi-Lipschitz properties for the loss function, which have an additional first-order term in assumption definitions. In addition, we also assume bound on the gradient, which is weaker than the commonly used bounded gradient assumption in the convergence analysis scheme. As a solution, this paper provides a theoretical convergence study on Federated Learning.
Meta-PDE: Learning to Solve PDEs Quickly Without a Mesh
Authors: Authors: Tian Qin, Alex Beatson, Deniz Oktay, Nick McGreivy, Ryan P. Adams
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2211.01604
Pdf link: https://arxiv.org/pdf/2211.01604
Abstract Partial differential equations (PDEs) are often computationally challenging to solve, and in many settings many related PDEs must be be solved either at every timestep or for a variety of candidate boundary conditions, parameters, or geometric domains. We present a meta-learning based method which learns to rapidly solve problems from a distribution of related PDEs. We use meta-learning (MAML and LEAP) to identify initializations for a neural network representation of the PDE solution such that a residual of the PDE can be quickly minimized on a novel task. We apply our meta-solving approach to a nonlinear Poisson's equation, 1D Burgers' equation, and hyperelasticity equations with varying parameters, geometries, and boundary conditions. The resulting Meta-PDE method finds qualitatively accurate solutions to most problems within a few gradient steps; for the nonlinear Poisson and hyper-elasticity equation this results in an intermediate accuracy approximation up to an order of magnitude faster than a baseline finite element analysis (FEA) solver with equivalent accuracy. In comparison to other learned solvers and surrogate models, this meta-learning approach can be trained without supervision from expensive ground-truth data, does not require a mesh, and can even be used when the geometry and topology varies between tasks.
Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively
Authors: Authors: Haojie Zhang, Ge Li, Jia Li, Zhongjin Zhang, Yuqi Zhu, Zhi Jin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2211.01642
Pdf link: https://arxiv.org/pdf/2211.01642
Abstract Large-scale pre-trained language models have achieved impressive results on a wide range of downstream tasks recently. However, fine-tuning an extremely large-scale pre-trained language model on limited target datasets is often plagued by overfitting and representation degradation. In this paper, we propose a Dynamic Parameter Selection (DPS) algorithm for the large-scale pre-trained models during fine-tuning, which adaptively selects a more promising subnetwork to perform staging updates based on gradients of back-propagation. Experiments on the GLUE benchmark show that DPS outperforms previous fine-tuning methods in terms of overall performance and stability, and consistently achieves better results with variable pre-trained language models. In addition, DPS brings a large magnitude of improvement in out-of-domain transferring experiments and low-resource scenarios, which shows that it can maintain stable general contextual features and reduce the representation collapse. We release our code at https://github.com/ZhangHaojie077/DPS
Jacobians and Gradients for Cartesian Differential Categories
Authors: Authors: Jean-Simon Pacaud Lemay (Mount Allison University)
Subjects: Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2211.01835
Pdf link: https://arxiv.org/pdf/2211.01835
Abstract Cartesian differential categories come equipped with a differential combinator that formalizes the directional derivative from multivariable calculus. Cartesian differential categories provide a categorical semantics of the differential lambda-calculus and have also found applications in causal computation, incremental computation, game theory, differentiable programming, and machine learning. There has recently been a desire to provide a (coordinate-free) characterization of Jacobians and gradients in Cartesian differential categories. One's first attempt might be to consider Cartesian differential categories which are Cartesian closed, such as models of the differential lambda-calculus, and then take the curry of the derivative. Unfortunately, this approach excludes numerous important examples of Cartesian differential categories such as the category of real smooth functions. In this paper, we introduce linearly closed Cartesian differential categories, which are Cartesian differential categories that have an internal hom of linear maps, a bilinear evaluation map, and the ability to curry maps which are linear in their second argument. As such, the Jacobian of a map is defined as the curry of its derivative. Many well-known examples of Cartesian differential categories are linearly closed, such as, in particular, the category of real smooth functions. We also explain how a Cartesian closed differential category is linearly closed if and only if a certain linear idempotent on the internal hom splits. To define the gradient of a map, one must be able to define the transpose of the Jacobian, which can be done in a Cartesian reverse differential category. Thus, we define the gradient of a map to be the curry of its reverse derivative and show this equals the transpose of its Jacobian. We also explain how a linearly closed Cartesian reverse differential category is precisely a linearly closed Cartesian differential category with an appropriate notion of transpose.
Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization
Authors: Authors: Feihu Huang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2211.01883
Pdf link: https://arxiv.org/pdf/2211.01883
Abstract Composition optimization recently appears in many machine learning applications such as meta learning and reinforcement learning. Recently many composition optimization algorithms have been proposed and studied, however, few adaptive algorithm considers the composition optimization under the distributed setting. Meanwhile, the existing distributed composition optimization methods still suffer from high sample and communication complexities. In the paper, thus, we develop a class of faster momentum-based federated compositional gradient descent algorithms (i.e., MFCGD and AdaMFCGD) to solve the nonconvex distributed composition problems, which builds on the momentum-based variance reduced and local-SGD techniques. In particular, our adaptive algorithm (i.e., AdaMFCGD) uses a unified adaptive matrix to flexibly incorporate various adaptive learning rates. Moreover, we provide a solid theoretical analysis for our algorithms under non-i.i.d. setting, and prove our algorithms obtain a lower sample and communication complexities simultaneously than the existing federated compositional algorithms. Specifically, our algorithms obtain lower sample complexity of $\tilde{O}(\epsilon^{-3})$ with lower communication complexity of $\tilde{O}(\epsilon^{-2})$ in finding an $\epsilon$-stationary point. We conduct the experiments on robust federated learning and distributed meta learning tasks to demonstrate efficiency of our algorithms.
On a Calderón preconditioner for the symmetric formulation of the electroencephalography forward problem without barycentric refinements
Authors: Authors: Viviana Giunzioni, John E. Ortiz G., Adrien Merlini, Simon B. Adrian, Francesco P. Andriulli
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2211.01967
Pdf link: https://arxiv.org/pdf/2211.01967
Abstract We present a Calder\'on preconditioning scheme for the symmetric formulation of the forward electroencephalographic (EEG) problem that cures both the dense discretization and the high-contrast breakdown. Unlike existing Calder\'on schemes presented for the EEG problem, it is refinement-free, that is, the electrostatic integral operators are not discretized with basis functions defined on the barycentrically-refined dual mesh. In fact, in the preconditioner, we reuse the original system matrix thus reducing computational burden. Moreover, the proposed formulation gives rise to a symmetric, positive-definite system of linear equations, which allows the application of the conjugate gradient method, an iterative method that exhibits a smaller computational cost compared to other Krylov subspace methods applicable to non-symmetric problems. Numerical results corroborate the theoretical analysis and attest of the efficacy of the proposed preconditioning technique on both canonical and realistic scenarios.
Single SMPC Invocation DPHelmet: Differentially Private Distributed Learning on a Large Scale
Authors: Authors: Moritz Kirschte, Sebastian Meiser, Saman Ardalan, Esfandiar Mohammadi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2211.02003
Pdf link: https://arxiv.org/pdf/2211.02003
Abstract Distributing machine learning predictors enables the collection of large-scale datasets while leaving sensitive raw data at trustworthy sites. We show that locally training support vector machines (SVMs) and computing their averages leads to a learning technique that is scalable to a large number of users, satisfies differential privacy, and is applicable to non-trivial tasks, such as CIFAR-10. For a large number of participants, communication cost is one of the main challenges. We achieve a low communication cost by requiring only a single invocation of an efficient secure multiparty summation protocol. By relying on state-of-the-art feature extractors (SimCLR), we are able to utilize differentially private convex learners for non-trivial tasks such as CIFAR-10. Our experimental results illustrate that for $1{,}000$ users with $50$ data points each, our scheme outperforms state-of-the-art scalable distributed learning methods (differentially private federated learning, short DP-FL) while requiring around $500$ times fewer communication costs: For CIFAR-10, we achieve a classification accuracy of $79.7\,\%$ for an $\varepsilon = 0.59$ while DP-FL achieves $57.6\,\%$. More generally, we prove learnability properties for the average of such locally trained models: convergence and uniform stability. By only requiring strongly convex, smooth, and Lipschitz-continuous objective functions, locally trained via stochastic gradient descent (SGD), we achieve a strong utility-privacy tradeoff.
Keyword: super-resolution

Temporal Consistency Learning of inter-frames for Video Super-Resolution
Authors: Authors: Meiqin Liu, Shuo Jin, Chao Yao, Chunyu Lin, Yao Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2211.01639
Pdf link: https://arxiv.org/pdf/2211.01639
Abstract Video super-resolution (VSR) is a task that aims to reconstruct high-resolution (HR) frames from the low-resolution (LR) reference frame and multiple neighboring frames. The vital operation is to utilize the relative misaligned frames for the current frame reconstruction and preserve the consistency of the results. Existing methods generally explore information propagation and frame alignment to improve the performance of VSR. However, few studies focus on the temporal consistency of inter-frames. In this paper, we propose a Temporal Consistency learning Network (TCNet) for VSR in an end-to-end manner, to enhance the consistency of the reconstructed videos. A spatio-temporal stability module is designed to learn the self-alignment from inter-frames. Especially, the correlative matching is employed to exploit the spatial dependency from each frame to maintain structural stability. Moreover, a self-attention mechanism is utilized to learn the temporal correspondence to implement an adaptive warping operation for temporal consistency among multi-frames. Besides, a hybrid recurrent architecture is designed to leverage short-term and long-term information. We further present a progressive fusion module to perform a multistage fusion of spatio-temporal features. And the final reconstructed frames are refined by these fused features. Objective and subjective results of various experiments demonstrate that TCNet has superior performance on different benchmark datasets, compared to several state-of-the-art methods.
HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks
Authors: Authors: Filip Szatkowski, Karol J. Piczak, Przemysław Spurek, Jacek Tabor, Tomasz Trzciński
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2211.01839
Pdf link: https://arxiv.org/pdf/2211.01839
Abstract Implicit neural representations (INRs) are a rapidly growing research field, which provides alternative ways to represent multimedia signals. Recent applications of INRs include image super-resolution, compression of high-dimensional signals, or 3D rendering. However, these solutions usually focus on visual data, and adapting them to the audio domain is not trivial. Moreover, it requires a separately trained model for every data sample. To address this limitation, we propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time. We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.

zoq / arxiv-updates

New submissions for Fri, 4 Nov 22 #379

Keyword: sgd

A Convergence Theory for Federated Average: Beyond Smoothness

Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization

Single SMPC Invocation DPHelmet: Differentially Private Distributed Learning on a Large Scale

Keyword: optimization

Speeding up NAS with Adaptive Subset Selection

PI is back! Switching Acquisition Functions in Bayesian Optimization

Regression Compatible Listwise Objectives for Calibrated Ranking

Max Markov Chain

On the Safety of Interpretable Machine Learning: A Maximum Deviation Approach

$D^2$SLAM: Decentralized and Distributed Collaborative Visual-inertial SLAM System for Aerial Swarm

nerf2nerf: Pairwise Registration of Neural Radiance Fields

Pairing optimization via statistics: Algebraic structure in pairing problems and its application to performance enhancement

A Round and Bipartize Approximation Algorithm for Vertex Cover

Towards Discovering Neural Architectures from Scratch

Circling Back to Recurrent Models of Language

Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization

Sub-network Multi-objective Evolutionary Algorithm for Filter Pruning

Analyzing Performance Issues of Virtual Reality Applications

Keyword: adam

Adaptive sparse grid discontinuous Galerkin method: review and software implementation

Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization

Keyword: gradient

A Convergence Theory for Federated Average: Beyond Smoothness

Meta-PDE: Learning to Solve PDEs Quickly Without a Mesh

Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively

Jacobians and Gradients for Cartesian Differential Categories

Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization

On a Calderón preconditioner for the symmetric formulation of the electroencephalography forward problem without barycentric refinements

Single SMPC Invocation DPHelmet: Differentially Private Distributed Learning on a Large Scale

Keyword: super-resolution

Temporal Consistency Learning of inter-frames for Video Super-Resolution

HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks