fix(tsx): Extract positions based on utf-16 encoding

Princesseuh commented 3 months ago

Changes

Previously, when extracting script and style tags, we tried to handle multibyte characters by counting them and skipping ahead. Not only was this cumbersome, it also didn't quite work, because our loop would run multiple times over more or less the same characters. Anyway, it was annoying.

In this PR, I changed it so that we use the lineoffsets table from the sourcemapping logic to get the offsets. This is somewhat slower, especially in some extreme cases, but in most cases there's no difference, and at least it's now correct.
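A minimal sketch of the idea, not the code in this PR: build a table of line starts once, then resolve any byte offset to a UTF-16 offset by looking up its line and only walking the rest of that line. The names `lineOffset`, `buildLineOffsets`, `utf16Len`, and `utf16Offset` are hypothetical, and the real sourcemapping table may be shaped differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// lineOffset records where a line starts, in bytes and in UTF-16 code units.
// Hypothetical type for illustration; not the compiler's actual structure.
type lineOffset struct {
	byteStart  int
	utf16Start int
}

// utf16Len returns how many UTF-16 code units a rune occupies:
// 2 for runes outside the Basic Multilingual Plane (surrogate pair), else 1.
func utf16Len(r rune) int {
	if r >= 0x10000 {
		return 2
	}
	return 1
}

// buildLineOffsets scans the source once and records the start of every line.
func buildLineOffsets(src string) []lineOffset {
	table := []lineOffset{{0, 0}}
	units := 0
	for i, r := range src {
		units += utf16Len(r)
		if r == '\n' {
			// '\n' is one byte, so the next line starts at i+1.
			table = append(table, lineOffset{i + 1, units})
		}
	}
	return table
}

// utf16Offset converts a byte offset (on a rune boundary, within src)
// into a UTF-16 code unit offset.
func utf16Offset(src string, table []lineOffset, byteOffset int) int {
	// Binary search for the last line that starts at or before byteOffset.
	line := sort.Search(len(table), func(i int) bool {
		return table[i].byteStart > byteOffset
	}) - 1
	units := table[line].utf16Start
	// Only the part of the line before the offset needs to be walked.
	for _, r := range src[table[line].byteStart:byteOffset] {
		units += utf16Len(r)
	}
	return units
}

func main() {
	src := "---\nconst a = \"😀\";\n---\n<script>x</script>"
	table := buildLineOffsets(src)
	byteOffset := strings.Index(src, "<script>")
	// Prints the byte offset followed by the UTF-16 offset.
	fmt.Println(byteOffset, utf16Offset(src, table, byteOffset))
}
```

With a table like this, the cost per lookup depends on the length of one line rather than the whole file, which is consistent with the slowdown only showing up in extreme cases.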

I also updated the frontmatter and body range extraction to use this method, as they suffered from the same problem.

In theory, this is a breaking change, but in practice the numbers it used to spit out were unusable in JS unless you did a lot of conversion yourself. Now the numbers can be used as-is. Also, I doubt anyone other than me is using them...
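To illustrate why non-UTF-16 positions are awkward on the JS side, here is a small sketch that mimics JavaScript's `String.prototype.slice(start)` in Go. It assumes, for the sake of the example, that the old positions behaved like byte offsets; `sliceLikeJS` is a hypothetical helper, not part of the compiler.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
)

// sliceLikeJS mimics JavaScript's str.slice(start): JS strings are indexed
// by UTF-16 code unit, so the source is encoded before slicing.
// start must be between 0 and the number of code units in src.
func sliceLikeJS(src string, start int) string {
	units := utf16.Encode([]rune(src))
	return string(utf16.Decode(units[start:]))
}

func main() {
	src := "😀<style>"

	// Go's strings.Index returns a byte offset: the emoji takes 4 bytes.
	byteOffset := strings.Index(src, "<style>")

	// Using the byte offset as a JS index overshoots the tag.
	fmt.Println(sliceLikeJS(src, byteOffset)) // tyle>

	// The emoji is 2 UTF-16 code units, so the tag starts at index 2.
	fmt.Println(sliceLikeJS(src, 2)) // <style>
}
```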

Fixes https://github.com/withastro/language-tools/issues/921

Testing

Tests should pass + updated some + added more

Docs

N/A. Though I did add a JSDoc comment on the type to say that it's UTF-16 based.

changeset-bot[bot] commented 3 months ago

🦋 Changeset detected

Latest commit: 2b2a8da47ee49ffbb4cd6bfd688a4ec36b6a394b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

| Name              | Type  |
| ----------------- | ----- |
| @astrojs/compiler | Patch |

Princesseuh commented 3 months ago

Yeah, I was surprised not to find much of a perf difference. To be clear, there is one, but I could only get real differences in files with 10k+ characters, multiple script and style tags late in the file, and an obscene amount of emojis.

I'll take a quick look to see if there's a way to re-use the line offset table from the sourcemapping logic, as it'd speed things up a bunch. Otherwise I'm not too bothered: this still ends up being faster for the language server, because the logic it previously used to get script and style tags was expensive.

bluwy commented 3 months ago

The power of Go and native binaries 😄

Princesseuh commented 3 months ago

Refactored to use the sourcemapping line offsets instead, it's much faster! Sourcemapping is still the ultimate bottleneck, though.

withastro / compiler