whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Potentially standardize window.find() #3539

Open annevk opened 6 years ago

annevk commented 6 years ago

See:

Related #2858.

js-choi commented 6 years ago

Good news.

Is the scope of this issue simply to standardize window.find’s existing behavior in Firefox, WebKit, and Chromium? How does its matching work? Are Unicode code points matched with any normalization? Does case folding occur; if so, how? Can paragraph breaks and line breaks be matched, and by what characters? Is there any fuzzy search (e.g. between straight quotes and curly quotes as some browsers do on some systems)?

The answer to all of these probably will be: “What do current browsers do? Let’s stick with that,” but it’d be good to be explicit about the goal. And there may be some platform inconsistencies, especially in fuzzy matching.
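To make the normalization, case-folding, and fuzzy-quote questions concrete, here is a small sketch of how plain code-unit matching behaves in JavaScript. It is illustrative only and is not taken from any browser's window.find() implementation; the variable names are made up for this example.

```javascript
// Illustrative only: plain string matching, not any browser's find algorithm.

const composed = "caf\u00E9";    // "café" with precomposed é (U+00E9)
const decomposed = "cafe\u0301"; // "café" as e + combining acute accent (U+0301)

// Code-unit matching treats the two spellings as different text.
const naiveMatch = composed.includes(decomposed);

// Normalizing both sides to NFC makes the two spellings match.
const nfcMatch = composed
  .normalize("NFC")
  .includes(decomposed.normalize("NFC"));

// toLowerCase() only approximates Unicode case folding: the German sharp s
// lowercases to itself, whereas full case folding maps it to "ss".
const lowerMatch = "STRASSE".toLowerCase() === "stra\u00DFe".toLowerCase();

// Straight and curly apostrophes are different code points, so a match here
// would need an explicit fuzzy rule.
const quoteMatch = "don't".includes("don\u2019t");

console.log({ naiveMatch, nfcMatch, lowerMatch, quoteMatch });
```

Each `false` result is a place where a specification would have to say whether a match is expected.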

See also Charmod Norm, w3c/selection-api#37, whatwg/dom#431, tc39/proposal-intl-segmenter#17, whatwg/html#2424, the inactive String Search API, the inactive FindText API, and the inactive RangeFinder API.

fred-wang commented 6 years ago

For the record some browsers also implement a document.execCommand("FindString", ...) command. https://w3c.github.io/editing/execCommand.html

grantcv1 commented 6 years ago

I always find it distressing that counting the usage of public-facing websites is used to make decisions. In my experience, there are far more complex web applications with big companies and big governments that would not (should not) be included within these statistics.

window.find() is very much needed in editing style applications and there is a need for this feature (or a better alternative). Support for case folding, regular expressions, and other things that would help with a fuzzy search are really needed.

It seems that one possible effort to standardize this capability, the FindText API (http://www.w3.org/TR/findtext/), has been discontinued :-(

tilgovi commented 6 years ago

One way to accommodate some of the goals of FindText without requiring standardization to take a stance on algorithm would be to specify how window.find interacts with Symbol.search (or other relevant, well-known symbols).
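For readers unfamiliar with the protocol, this is how Symbol.search already works with String.prototype.search today. The `FoldingMatcher` class and its folding strategy are hypothetical, and nothing here shows how window.find would consult such an object; that part is exactly what would need specifying.

```javascript
// Hypothetical matcher: the class name and folding rules are assumptions
// made for illustration, not part of any proposal.
class FoldingMatcher {
  constructor(query) {
    this.query = FoldingMatcher.fold(query);
  }

  static fold(text) {
    // NFC plus lowercasing. This can change string length for some inputs,
    // so the returned index refers to the folded text; a real matcher would
    // need to map offsets back to the original string.
    return text.normalize("NFC").toLowerCase();
  }

  // String.prototype.search(matcher) calls this method with the string.
  [Symbol.search](text) {
    return FoldingMatcher.fold(text).indexOf(this.query);
  }
}

const index = "Hello WORLD".search(new FoldingMatcher("world"));
const missing = "Hello WORLD".search(new FoldingMatcher("planet"));
console.log(index, missing);
```

The appeal of this design is that the matching algorithm lives in the object supplied by the caller, so the standard would only need to define the call, not the comparison rules.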

vmpstr commented 4 years ago

I'm not sure if this should be a separate issue, but my proposal is to start the process of standardizing window.find by first standardizing some of the commonly used aspects of find-in-page. For instance,

  1. Define terms like active match(?) vs potential match(?), meaning the thing that was found and highlighted vs the thing that could be found if the user or script continues searching for the same string
  2. Perhaps also define how find-in-page interacts with things like clipped out content, and opacity 0 content, etc.

By starting with definitions, I think we can start thinking about how to define the algorithm. However for some features, it might already be useful to reference definitions of find-in-page (e.g. https://drafts.csswg.org/css-scroll-anchoring/#anchor-priority-candidates 2nd candidate is "an element containing the current active selected match of the find-in-page user-agent algorithm" which could reference this)

As an aside, I put together a brief overview of behaviors of find-in-page in different browsers (Chrome, Firefox, Safari) to see the commonalities and differences in behaviors. The doc uses find-in-page dialog, not window.find though.

Does this seem like a good approach?

domenic commented 4 years ago

my proposal to start the process of standardizing window.find by standardizing some of the aspects of find-in-page commonly used first.

Interesting.

This falls into a gray area of web specs, of specifying UI. Generally we try to shy away from that, and only specify things which are observable from JavaScript. I believe nothing about find-in-page is observable, so we normally wouldn't specify it.

However, sometimes we bend this rule, when it's especially beneficial, and all the browsers are interested.

I guess I would ask what is the goal here, and for who. Are you trying to make things more predictable for web page authors? In what way, since find-in-page is not observable? Are you trying to make things easier for implementers?

If the goal is purely to work on a better spec for window.find, then I would probably treat that orthogonally to find-in-page...

vmpstr commented 4 years ago

I guess I would ask what is the goal here, and for who

Good question. The immediate benefit from having the definitions is for spec writers and implementers so that they can agree what is meant by terms like 'active match' (e.g. the scroll anchor spec I linked, and beforematch proposal; the latter would benefit from the algorithm specified as well since the timing of the event and timing of find-in-page scroll are dependent on each other).

I think the ultimate benefit of at least partially speccing the algorithm is for users to have a consistent experience across browsers (although I'm not sure how valuable it is, since I imagine users don't typically switch browsers very often). That is, you can see in the compat doc I linked that browsers tend to do different things in a number of situations. In some cases, none of the browsers seems to do "the right thing". For instance, content clipped by overflow hidden can be found in all three browsers I tested. It is conceivable that the spec here would dictate what should and should not be found, if that makes sense.

As an aside, I assume that window.find essentially hooks into the find-in-page algorithm (maybe this is a wrong assumption), so any kind of specification for it is likely to be very similar. To put it differently, I think if window.find is specified and browsers update their implementations to match the spec, I suspect that they will also have to change the find-in-page behavior to simplify the code.

domenic commented 4 years ago

The immediate benefit from having the definitions is for spec writers and implementers so that they can agree what is meant by terms like 'active match' (e.g. the scroll anchor spec I linked, and beforematch proposal; the latter would benefit from the algorithm specified as well since the timing of the event and timing of find-in-page scroll are dependent on each other).

I definitely see the benefit there. That could probably be accomplished with a fairly minimal spec, that just hand-waves at how the feature works but builds around a skeleton of some <dfn>s like "active match" that other, more observable features can reference. I'm happy to support that much, at least.

I think the ultimate benefit of at least partially speccing the algorithm is for users to have a consistent experience across browsers (although I'm not sure how valuable it is, since I imagine users don't typically switch browsers very often). That is, you can see in the compat doc I linked that browsers tend to do different things in a number of situations. In some cases, none of the browser seem to do "the right thing". For instance, content clipped by overflow hidden can be found on the three browsers I tested. It is conceivable that the spec here would dictate what should and should not be found, if that makes sense.

I think you're right that this would be valuable for users, in that it would guide browsers toward doing "the right thing", where "the right thing" is what domain experts (HTML spec editors, CSS WG, i18n folks, and browser engineers) can collectively get together and agree upon. Maybe we wouldn't get total agreement, e.g. maybe one browser representative has a very different philosophical stance on what a "word" means, but that's fine. Any discussion at all would likely be an opportunity to improve things in this way.

In other words, since this isn't JS-developer-observable, the goal isn't to get total interop, but instead to get the other values that the standards process brings. And I suspect that even if not all browser engines want to spend time on this, you'd be able to get good discussion from the rest of the web standards community, and from any interested web developers and users.

So, I'm sold that this is worth trying to specify.

As an aside, I assume that window.find essentially hooks into the find-in-page algorithm (maybe this is a wrong assumption), so any kind of specification for it is likely to be very similar. To put it differently, I think if window.find is specified and browsers update their implementations to match the spec, I suspect that they will also have to change the find-in-page behavior to simplify the code.

Well, but as long as the result of window.find is not observable from JS, it seems like the specification could just be "calling this function does something with the user interface generally related to finding things". Although, maybe it's observable from scroll offsets? I'm not sure.

aphillips commented 4 years ago

Text search is a complex topic for reasons such as those called out in @js-choi's comment. Past attempts to write a spec at W3C failed to consider I18N basics early on and have foundered on that. The I18N WG (perhaps wisely?) shelved any attempt to work on it directly as part of Charmod-Norm by creating a separate document. Any group starting to work on this might want to have a look at string-search and at the issues we filed against FindText.

I think this is worth taking a stab at. It is possible to overcomplicate the problem, but as long as judicious choices are made (and well documented), I think it is possible to have a successful result in a finite amount of time.

annevk commented 4 years ago

@domenic it's pretty observable, no?

console.log(window.getSelection());
window.find("test");
console.log(window.getSelection());

domenic commented 4 years ago

Hmm, that appears to be a Firefox quirk where window.find() (and Ctrl+F!) actually affect window.getSelection(). That's not the case in other browsers.

domenic commented 4 years ago

For the record, I was testing something wrong; window.getSelection() is impacted by window.find() in Chrome too. http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=8206
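The observation above can be wrapped in a small helper that records the selected text before and after the call. This is a sketch: `observeFind` is a made-up name, and it takes the window object as a parameter only so the logic can also be exercised outside a browser.

```javascript
// Sketch: capture what script can observe around a window.find() call.
// `win` is expected to provide getSelection() and the non-standard find().
function observeFind(win, query) {
  const before = win.getSelection().toString();
  const found = win.find(query); // true if a match was found
  const after = win.getSelection().toString();
  return { found, before, after };
}

// In a browser console:
//   observeFind(window, "test");
// If `after` differs from `before`, the match is visible to script.
```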

domenic commented 4 years ago

For folks watching this thread, @vmpstr has put together an initial pull request describing find-in-page in https://github.com/whatwg/html/pull/5770 (direct preview link). It's pretty basic and, I think, should be uncontroversial. But it might provide a good place to collect some of the notes or open issues here, e.g. we could expand it to link to https://w3c.github.io/string-search/#searching, and eventually try to define window.find() as triggering that feature.

xfq commented 4 years ago

FWIW, there's a CSS issue about controlling whether an element is findable/searchable: https://github.com/w3c/csswg-drafts/issues/3460

petelomax commented 3 years ago

My gut instinct on this is that "find" is too generic and meaningless. Adding e.g. openFindWindow() or findTextOnPage() or highlight/selectTextOnPage() would be intuitively more distinct from querySelector() and friends, which "find" just isn't.

domenic commented 3 years ago

We don't get to choose the name; it's already in all browsers. This issue is just about writing a spec for it.

mantou132 commented 2 years ago

find just needs to return some Range that contains the specified text. Other processing, such as highlighting, should be left to the web developer, e.g. by using the CSS Custom Highlight API.
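A rough sketch of what that split could look like with today's APIs: a pure helper finds match offsets, and a browser-only function turns them into Range objects and registers them with the CSS Custom Highlight API. The function names and the highlight name `find-sketch` are assumptions, matching is exact and case-sensitive, and matches that span more than one text node are not found.

```javascript
// Pure helper: start offsets of every non-overlapping, exact match.
function findMatchOffsets(text, query) {
  const offsets = [];
  if (query === "") return offsets;
  let from = 0;
  while (true) {
    const at = text.indexOf(query, from);
    if (at === -1) break;
    offsets.push(at);
    from = at + query.length;
  }
  return offsets;
}

// Browser-only: build one Range per match inside each text node under
// `root`, then register them under a named highlight.
function highlightMatches(root, query, name = "find-sketch") {
  const ranges = [];
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    for (const start of findMatchOffsets(node.data, query)) {
      const range = new Range();
      range.setStart(node, start);
      range.setEnd(node, start + query.length);
      ranges.push(range);
    }
  }
  CSS.highlights.set(name, new Highlight(...ranges));
  return ranges;
}
```

In a page, the registered ranges can then be styled with a rule such as `::highlight(find-sketch) { background-color: yellow; }`.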

hsivonen commented 11 months ago

@domenic it's pretty observable, no?

console.log(window.getSelection())
window.find("test");
console.log(window.getSelection())

It's rather unfortunate that what window.find() finds is Web-exposed when Gecko implements the search technically in a very different way from WebKit (forked to Blink), and the WebKit/Blink behavior depends on the UI language of the browser.

Specifically, Firefox operates on the Unicode Database level (in a language-independent way), while WebKit and Blink use collator-based search (with primary-level matching only) such that the collation data that is used is the CLDR search collation for the browser UI language.

As a collator implementor, I'm very skeptical of the technical merit of collator-based search compared to search implemented directly over the Unicode Database layer (possibly with hard-coded exceptions to try to reproduce the main effects of collator-based search). (When operating on the Unicode Database, you transform characters to other characters and match on the transformed stream of characters. When operating on collations, you perform a complex mapping from characters to collation units and then ignore everything but the primary weight in the collation unit and match on the primary weights. Even with fast computers of today, you can experience a performance difference by using cmd/ctrl-f on the HTML spec in Firefox and Chrome.) I also don't want to bring collator-based search into scope for Gecko or ICU4X. See a URL text fragment issue.
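The two approaches can be contrasted in script with standard JavaScript APIs. This is only an analogy: the folding rules in `foldWithUnicodeData` are an assumption chosen for illustration and are not Gecko's actual rules, and `Intl.Collator` stands in for the collator that WebKit and Blink use internally. Collator results depend on the locale data available in the runtime.

```javascript
// Approach A (Unicode Database style): transform characters, then compare
// the transformed strings. Decomposing, dropping combining marks, and
// lowercasing are illustrative choices only.
function foldWithUnicodeData(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
}

function unicodeEquals(a, b) {
  return foldWithUnicodeData(a) === foldWithUnicodeData(b);
}

// Approach B (collator style): compare at primary strength only, using the
// search collation for the given locale.
function collatorEquals(locale, a, b) {
  const collator = new Intl.Collator(locale, {
    usage: "search",
    sensitivity: "base",
  });
  return collator.compare(a, b) === 0;
}

const unicodeResult = unicodeEquals("R\u00E9sum\u00E9", "resume");
const collatorResult = collatorEquals("en", "R\u00E9sum\u00E9", "resume");

// The collator answer may differ by locale, because some languages treat
// an accented letter as a separate base letter.
const swedish = collatorEquals("sv", "\u00E4", "a");
const german = collatorEquals("de", "\u00E4", "a");

console.log({ unicodeResult, collatorResult, swedish, german });
```

Approach A gives the same answer regardless of language, while approach B takes a locale argument, which is the dependency on the browser UI language described above.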

sideshowbarker commented 6 months ago

Given that — along with the core “highlight the active match and scroll into view” behavior — browser UIs also expose a count of the total matches for the current query, it’s imaginable that it might be useful to developers (and for testing scenarios too) to have an API which programmatically exposes that total match count to JavaScript code.