Fix the sinusoidal encoders

wfondrie commented 1 year ago

In https://github.com/Noble-Lab/casanovo/issues/145, @qiyueliuhuo23 thankfully noticed that the depthcharge implementation of the sinusoidal encoders did not match the paper! 😬

This PR tries to resolve these issues (Issue #27 here). Unfortunately in doing so, I found that both our ICML paper's equation and the depthcharge implementation to not work as intended; hence, I have updated the formula to the following, where $\lambda{\text{max}}$ is the maximum wavelength, $\lambda{\text{min}}$ is the minimum wavelength, $i$ is the index of the feature (zero-based), and $d$ is one less than the number of features---also referred to as the dimensionality of the model. [^1]

$$
f{i}= \left. \begin{cases} \sin(m{j} / (\frac{\lambda{\text{min}}}{2\pi}(\frac{\lambda{\text{max}}}{\lambda{min}})^{2i/d})), & \text{for } i \leq d/2 \ \cos(m{j} / (\frac{\lambda{\text{min}}}{2\pi}(\frac{\lambda{\text{max}}}{\lambda_{min}})^{2i/d - 1})), & \text{for } i > d/2 \ \end{cases} \right. $$

As an example, if we use $\lambda{\text{min}} = 0.1$, $\lambda{\text{max}} = 10$, and $d = 4$, then the first sinusoidal feature, $i = 0$ will evaluate to a wavelength of 0.1:

$$ sin(m{j} / (\frac{0.1}{2\pi}(\frac{10}{0.1})^{20/4} ) = sin(2\pi m{j} / 0.1) $$

The last sinusoidal feature, $i = 4$ will evaluate to a wavelength of 10:

$$ cos(m{j} / (\frac{0.1}{2\pi}(\frac{10}{0.1})^{24/4 - 1} ) = cos(2\pi m{j} / 10) $$

In fixing and writing tests for this bug, notably I also noticed a problem with our standard positional encoder that suffered a similar issue: the max_wavelength argument in the former iteration did not actually specify the maximum wavelength.

Notably, for all of these it's unlikely that these changes will have a great affect on future models---my guess is that using a reasonably diverse series of sinusoidal waveforms to encode a feature is all that is required. However, the fixes in this PR allow precise control over the wavelengths of the features used, as we thought we were achieving before.

@bittremieux please review this PR carefully and verify that I didn't make a dumb mistake again. 🙏

[^1]: Why one less? This is because of the 0-based index that allows our minimum and maximum wavelength to indeed be met. Otherwise, the geometric progression would exclude one or the other extrema.

codecov-commenter commented 1 year ago

Codecov Report

Merging #28 (26276d5) into main (c1626d2) will increase coverage by 0.27%. The diff coverage is 100.00%.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main      #28      +/-   ##
==========================================
+ Coverage   85.21%   85.49%   +0.27%     
==========================================
  Files          16       15       -1     
  Lines         751      703      -48     
==========================================
- Hits          640      601      -39     
+ Misses        111      102       -9

Impacted Files	Coverage Δ
depthcharge/data/__init__.py	`100.00% <ø> (ø)`
depthcharge/components/encoders.py	`100.00% <100.00%> (ø)`
depthcharge/components/transformers.py	`88.07% <100.00%> (ø)`
depthcharge/version.py	`100.00% <100.00%> (+46.15%)`	:arrow_up:

... and 1 file with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

wfondrie commented 1 year ago

@bittremieux, in making the above edits, I've noticed that my equations as written fail for odd numbers of features. Instead, I propose:

Let $\lambda{\text{max}}$ is the maximum wavelength, $\lambda{\text{min}}$ is the minimum wavelength, $i$ is the index of the feature (zero-based), and $d{\text{model}}$ is the number of features---also referred to as the dimensionality of the model. We begin by defining the number of features that are to be represented by sine and cosine waveforms as $d{\text{sin}}$ and $d_{\text{cos}}$, respectively:

$$ d{\text{sin}} = \lceil \frac{ d{\text{model}} } {2} \rceil $$

$$ d{\text{cos}} = d{\text{model}} - d_{\text{sin}} $$

The encoded features are then calculated as:

$$
f_{i}(mj)= \left. \begin{cases} \sin(m{j} / (\frac{\lambda{\text{min}}}{2\pi}(\frac{\lambda{\text{max}}}{\lambda{min}})^{i/(d{\text{sin}} - 1)})), & \text{for } i \leq d/2 \ \cos(m{j} / (\frac{\lambda{\text{min}}}{2\pi}(\frac{\lambda{\text{max}}}{\lambda{min}})^{(i-d{\text{sin}})/(d{\text{cos}} - 1)})), & \text{for } i > d/2 \ \end{cases} \right. $$

I think that this is likely the final, most accurate version of what we're trying to do in the code.

melihyilmaz commented 1 year ago

As a sanity check, I propose unit testing the positional encoding based on whether it conforms to the criteria mentioned in Vaswani et al.:

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, P Epos+k can be represented as a linear function of P Epos.

Accordingly, a tensor T must exist as described here:

I wrote a simple test showing this how such this would work in the setup described above, let me know if this is something we'd want to develop and formally implement as a unit test:

d = 2
i = 1
m_j  = 92 # Arbitrary number to encode
m_step = 5 # Step size

lamb = 1/(10000 ** (2*i/d))

t_sin = np.sin(lamb*m_step)
t_cos = np.cos(lamb*m_step)

t_m_step = [[t_cos, t_sin],[-t_sin, t_cos]]

# Two embedding vectors encoding indices separated by m_step
v1 = np.array([np.sin(lamb * m_j), np.cos(lamb * m_j)])
v2 = np.array([np.sin(lamb * (m_j+m_step)), np.cos(lamb * (m_j+m_step))])

print(np.matmul(t, v1) - v2)

wfondrie commented 1 year ago

Thanks @melihyilmaz for the suggestion! Quick question though: isn't the above criteria true for any sinusoidal encoding function with a constant wavelength for each feature?

wfondrie / depthcharge

Fix the sinusoidal encoders #28

Codecov Report