Beyond the foregoing justifications for the implementation of the specific networks presented in the paper, there is a broader question as to whether the supervised learning approach is an ideal solution for all or even most cases. A variety of unsupervised, self-supervised, and semi-supervised approaches are available. There are far too many to compare them all, but some discussion of these alternatives at the end of the paper is warranted. In particular, DeepSqueak [Coffey et al., 2019] offers a similar CNN-driven front end but with unsupervised clustering of features from USVs (see also [Goffinet et al., bioRxiv]). Other recent work [Sainburg et al., 2020] applies fully unsupervised approaches to find song syllables and directly compares automated clusters to hand labels for several songbird species (but not canaries). It seems these unsupervised approaches would be better suited to highly variable songs (e.g., babbling, budgerigar warble). If this is the case, then the authors should say so explicitly, so that readers know which conditions are suited to TN and which are not.
[ ] discuss relationship to DeepSqueak, Goffinet et al. <-- object detection vs. event detection networks
[ ] discuss relationship to MUPET
[ ] discuss relationship to Sainburg et al. unsupervised approach