whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

redact location.ancestorOrigins according to Referrer Policy #1918

Open hillbrad opened 7 years ago

hillbrad commented 7 years ago

@bzbarsky @dakami and I had a hallway discussion at the end of TPAC about the possibility of adding location.ancestorOrigins to Firefox. bz has had longstanding concerns about the information this leaks to child frames. We arrived at a local consensus that any leakage is roughly equivalent to what happens already with referrer, so it would make sense to redact ancestorOrigins according to referrer policy. (and this could resolve that objection to a Mozilla implementation of ancestorOrigins)

/cc @smaug---- @annevk

domenic commented 7 years ago

One big question, which I asked in the PR, is what does "redact" mean. Since it's an origin instead of a URL, several of the referrer policies don't really apply (e.g. maybe they're no-ops). If it gets censored completely (e.g. if the referrer policy is "no-referrer"), then does the resulting array contain null? The empty string? Or is that entry just missing, so that the number of entries in the array is less than the number of ancestor browsing contexts? We'll need a comprehensive spec for (origin, referrer policy) -> censored origin.

Otherwise, I think we'd need to get a sense of what other user agents besides Firefox would be interested in this spec change. I guess only Chrome implements both referrer policy and ancestorOrigins, so... @mikewest, perhaps?

As for WebKit and Edge, which don't implement referrer policy but do implement ancestorOrigins: does this sound reasonable to you, as something you would do if/when you eventually implemented referrer policy? Leaving aside any commitments to implementing referrer policy. Tagging the usual suspects... @cdumez @travisleithead. Please route to more appropriate people as necessary.

bzbarsky commented 7 years ago

The idea is that if the referrer policy allows the origin to leak out via the referrer (which I believe all policies except "no-referrer" do) then we should just go ahead and return the origin in ancestorOrigins. So this is really about the "no-referrer" case, plus any browser configuration that has equivalent effects.

As for what value should be used in the "no-referrer" case, I don't have a strong opinion. Obvious options are "", null, "null" (this last as if the actual origin were a unique origin). Using "null" feels somewhat nice to me in that it's a situation that could arise even without the referrer policy business, so pages should be ready for it anyway. Using null would worry me in terms of pages getting exceptions when trying to string-manipulate the array entries.
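
A minimal sketch of the rule described above, assuming (as the comment does) that "no-referrer" is the only policy that hides the origin, and picking the string "null" from the options listed. The helper name `redactAncestorOrigin` is hypothetical and not part of any spec or browser API.

```typescript
// Hypothetical helper, not part of any spec: maps one ancestor's origin to the
// value a descendant frame would see in location.ancestorOrigins.
function redactAncestorOrigin(origin: string, referrerPolicy: string): string {
  // Assumption taken from the comment above: only "no-referrer" hides the origin.
  // "null" is one of the candidate values under discussion, chosen here because
  // it matches how a unique (opaque) origin already serializes.
  return referrerPolicy === "no-referrer" ? "null" : origin;
}

const visible = redactAncestorOrigin("https://publisher.example", "origin");
const hidden = redactAncestorOrigin("https://publisher.example", "no-referrer");
```

Because the redacted value is still a string, code that string-manipulates the entries keeps working, which is the concern raised about using a real null.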

hillbrad commented 7 years ago

I should write some test cases, but isn't the null case already possible today with GUID URL schemes? (data:, file:, etc.) And implicitly handled, as with CORS, by serializing to the string literal "null" according to RFC6454?

domenic commented 7 years ago

"null" sounds pretty good. (And it's according to the Unicode serialization of an origin, not some RFC ;).) But yeah, the PR as written just asks for the origin of the URL no-referrer, so we gotta straighten that out.

hillbrad commented 7 years ago

Well, this could be defined as basically a switch on the referrer policy states (which might be the most logical internal implementation choice), but I thought that calling out to the algorithm to produce a referrer and then extracting the origin via URL parsing would be more future compatible with new policy states that might be defined. I can revisit if that seems preferable.

domenic commented 7 years ago

IMO a switch makes the most sense, but adding it to the Referrer Policy spec would be best, since that ensures that whenever they add new policies they'll see that they need to update that algorithm as well.

bzbarsky commented 7 years ago

The referrer may or may not be related to the origin in general (e.g. for a sandboxed iframe the referrer is based on its URL but the origin is a unique origin). So going via some sort of "extract the referrer" algorithm to get a value to use in ancestorOrigins, as is done in this PR, isn't right.

hillbrad commented 7 years ago

Take a look at: https://github.com/w3c/webappsec-referrer-policy/pull/77 ?

bzbarsky commented 7 years ago

One thing that I'd like to check on, actually. What should happen if a page at origin A loads a subframe from origin A which then loads a page from origin B, if the original page is sending full referrers but the subframe is using the no-referrer policy?

hillbrad commented 7 years ago

I haven't spec'd it as a barrier or ratchet, but an individual query from a Location, to each ancestor, independent of any intermediate contexts and their policy states.

bzbarsky commented 7 years ago

OK, but that will leak the origin of the topmost page in this case, when it should be able to have a reasonable expectation of no such leakage occurring, right?
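
The scenario can be sketched as follows, assuming each ancestor is queried independently and redacted only by its own policy. The `Ancestor` shape, the function name, and the example origins are all hypothetical; "unsafe-url" stands in for "sending full referrers".

```typescript
// Hypothetical model of one ancestor browsing context.
interface Ancestor {
  origin: string;
  referrerPolicy: string;
}

// Independent per-ancestor query: each entry depends only on that ancestor's
// own policy, ignoring any intermediate frames. Ancestors are listed
// nearest-first, in the same order as location.ancestorOrigins.
function ancestorOriginsPerAncestor(chain: Ancestor[]): string[] {
  return chain.map((a) => (a.referrerPolicy === "no-referrer" ? "null" : a.origin));
}

// Page at origin A (full referrers) -> subframe at origin A (no-referrer) -> page at origin B.
const chainSeenFromB: Ancestor[] = [
  { origin: "https://a.example", referrerPolicy: "no-referrer" }, // the subframe
  { origin: "https://a.example", referrerPolicy: "unsafe-url" }, // the topmost page
];

const seenFromB = ancestorOriginsPerAncestor(chainSeenFromB);
// seenFromB is ["null", "https://a.example"]: the topmost page's origin is exposed.
```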

hillbrad commented 7 years ago

Is that a reasonable expectation? Or should it set its own policy if it is concerned? Specifying a ratchet is much more difficult, btw, as the referrer policy options don't have a strict ordering.

bzbarsky commented 7 years ago

Is that a reasonable expectation?

As long as it's only loading things it controls, I think it is, yes. This way the decision as to whether to allow the origin to escape only has to be made in the page that actually loads cross-site things.

Specifying a ratchet is much more difficult, btw

I'm not sure what you mean by "ratchet" here, but two simple things to specify would be that once you hit no-referrer you either insert a single "null" and terminate or insert "null" for everything else up the frame chain. This isn't as nice as doing more complicated checks about same-originness, I agree.
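
Both simple options can be sketched side by side. This is an illustration under assumptions, not spec text: the `Ancestor` shape and both function names are hypothetical, and the chain is listed nearest ancestor first.

```typescript
// Hypothetical model of one ancestor browsing context.
interface Ancestor {
  origin: string;
  referrerPolicy: string;
}

// Option 1: on the first "no-referrer" ancestor, insert a single "null" and stop.
// The number of ancestors above that point is not revealed by the list.
function truncateAtNoReferrer(chain: Ancestor[]): string[] {
  const result: string[] = [];
  for (const a of chain) {
    if (a.referrerPolicy === "no-referrer") {
      result.push("null");
      return result;
    }
    result.push(a.origin);
  }
  return result;
}

// Option 2: from the first "no-referrer" ancestor upward, insert "null" for
// every entry, so the list length still matches the number of ancestors.
function maskFromNoReferrer(chain: Ancestor[]): string[] {
  let masked = false;
  return chain.map((a) => {
    if (a.referrerPolicy === "no-referrer") masked = true;
    return masked ? "null" : a.origin;
  });
}

const exampleChain: Ancestor[] = [
  { origin: "https://a.example", referrerPolicy: "" },
  { origin: "https://b.example", referrerPolicy: "no-referrer" },
  { origin: "https://c.example", referrerPolicy: "" },
];

const truncated = truncateAtNoReferrer(exampleChain); // ["https://a.example", "null"]
const maskedAll = maskFromNoReferrer(exampleChain); // ["https://a.example", "null", "null"]
```

Neither variant checks same-originness between frames, which is the more complicated refinement mentioned above.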

bzbarsky commented 7 years ago

Note the more clearly articulated proposal I made for this in https://github.com/w3c/webappsec-referrer-policy/pull/77#issuecomment-255429675. I thought @hillbrad was going to convert that to an HTML spec issue, but that didn't seem to happen...

Anyway, I would love feedback from Blink and WebKit on whether the change I propose is something they would implement, and feedback from Edge on whether they're interested in implementing this at all, and if so under what conditions.

annevk commented 7 years ago

Copying @rbyers, @cdumez, @travisleithead to get input from Blink, WebKit, and Edge. Would be nice to make some progress here.

foolip commented 7 years ago

For Blink, perhaps @dominiccooney or @mikewest could comment?

mikewest commented 7 years ago

@jeisinger and @estark37 are Blink's referrer policy folks, and will likely have opinions.

jeisinger commented 7 years ago

What I like about @bzbarsky's proposal is that it only indirectly uses referrer policy - referrer policy ideally should only affect the referrer. Of course using the referrer afterwards for whatever is fine.

I think we'd implement this if that means that Firefox will ship ancestorOrigins, and the API is still good enough to achieve the kind of protection @hillbrad et al need

othermaciej commented 7 years ago

Could someone provide or link to a clear summary of Mozilla's concerns with ancestorOrigins? There is a lot of explanation here (and in the WebAppSec issue) of what the proposed change is, but I'm not fully clear on the problem we are trying to solve.

WebKit may be open to change if there is a serious enough problem, and if there isn't undue compatibility risk to changing.

bzbarsky commented 7 years ago

The summary is that just because I embed a youtube video on my page doesn't mean that Google should know my domain name, and by extension which sites my users are browsing. Similar for other cases of embedding cross-site content. Right now ancestorOrigins leaks that information. The proposal is to give pages a way to consistently opt out of such tracking of their users by actors like Google, Facebook, etc, etc.

othermaciej commented 7 years ago

OK, so the threat model is that cross-site embedded content (particularly in cases where it may have some way to track the user's identity) may be able to determine what sites it's embedded in.

The proposed mitigation is that embedded content can still do this by default, but the site can opt out of enabling this (specifically for content that gets embedded as an iframe instead of an inline script).

bzbarsky commented 7 years ago

That's correct. And the specific shape of the proposed solution is based on the observation that if you don't redact referrers then you've leaked your hostname anyway. So the only case in which the embedding site has a hope of not leaking this information is when it's already preventing sending of referrers.

othermaciej commented 7 years ago

On the face of it, that seems reasonable. I'm curious whether the Blink team specifically objects to this proposed restriction, and if so what their argument is. Tagging @johnwilander for WebKit privacy opinions.

annevk commented 7 years ago

Per https://github.com/whatwg/html/issues/1918#issuecomment-284714225 Blink is not objecting. I can work on a change to the HTML Standard.

domenic commented 7 years ago

I think someone already submitted a PR (maybe hillbrad?). It might be a bit stale or have outstanding review, but you'll save yourself some time by working from that :)

annevk commented 7 years ago

Thanks, but #1917 looks overly complicated given that we store the referrer policy on the document.

annevk commented 7 years ago

The model I went with (and how I addressed https://github.com/whatwg/html/issues/1918#issuecomment-254681987) is that the first time you hit no-referrer you append "null" and then return the list. So any ancestors of the first ancestor that uses "no-referrer" are not revealed, and the number of them is not revealed either.

I'll write a couple basic tests too, might not get to those until later though.

annevk commented 7 years ago

Update: the model I went with was wrong. I've now adjusted it (only in a comment on the PR thus far) to what @bzbarsky proposed. I was wondering if anyone had any opinions on whether we want to reveal all ancestors or not. I guess you can already tell how many parents you have anyway through parent and top, so we probably shouldn't worry about that at all.

johnwilander commented 7 years ago

Sorry for the delay. WebKit will obviously have to implement the Referrer Policy to support this opt-out but I think it takes us in the right direction.

Did we ever consider an off by default model instead? Are we saying too much relies on ancestorOrigins today? If Mozilla hasn't implemented yet the web can't be completely solidified on existing behavior. If we went off by default we wouldn't have to add a side effect to the existing no-referrer policy. We could for instance add an "; ancestorOrigins" attribute to referrer policies.

annevk commented 7 years ago

I don't think that's been considered, but by basing it on the referrer actually being transmitted, we only leak as much as the network does, although for more contrived scenarios it might leak a little more by default I suppose.

bzbarsky commented 7 years ago

Did we ever consider an off by default model instead?

Yes. But given Blink and WebKit's refusal to even discuss this topic for years, we had to, without their input, come up with something that we felt they would be most likely to implement, hence a minimal change from what they are doing right now.

I'm happy to consider an opt-in if it still solves the use cases this property is trying to solve. If I understood correctly, doing this as an opt-in would require changes to pretty much every site that embeds Google and Facebook ads to opt in or something.

jeffreytgilbert commented 5 years ago

For what it's worth, the idea to respect a referrer policy set by the domains in the ancestry chain is great, but neither ancestorOrigins nor the requested change goes far enough in either direction. A full URL should be available in ancestorOrigins: a domain on its own is no more or less secure than a full URL, because information about a person can be grokked from the domain plus some number of other data points, so truncating it doesn't make much sense for user privacy if we're being strict here. Conversely, a domain (cnn.com) may be considered ok, but a page on that domain (cnn.com/vegas-shooting-kills-dozens-etc) may be considered not ok given a specific context.

On the other hand, the user also has not and cannot indicate via referrer policy set by the middle men that it doesn't want to leak information about the ancestor chain, and that begs the question, should there be user level controls for turning this information flow on or off.

In my opinion, this requires a multi-part solution where the user has the ability to turn off a behavior, as do sites(content providers) who manage relationships between one another, but the location.href chain should be opened up fully where no restrictions are explicitly called for. The primary case FOR doing this from a supply chain perspective is being assured the message and markup you're delivering is not being framed in an inappropriate context. Advertisers, for instance, may have strict policies against placing their brand next to content related to pornography or extreme violence for instance. This information, when locked away through cross origin chains of iframes, becomes unknowable.

On the other hand, if a user jumps into "in private" mode and disables this information from leaking to chains of iframes, a disabled chain of unknowable origins should be enough information for an advertiser to use as an indicator that maybe the risk isn't worth the buy opportunity, and the end user's experience and privacy are preserved.

dliebner commented 5 years ago

The current webkit implementation is helpful to ad tech as it helps determine the validity of the embed. It's possible for an advertisement to be chained from the original site through multiple intermediary iframes before finally rendering the bottom level ad content - this is normal, if an ad request is going through multiple ad networks before finally arriving on a served ad. What ad tech wants to detect is when an ad is being served on an unwanted domain, or if something else is generally amiss in the chain of ancestors. Failure to make this information available makes it easier for bad actors to commit ad fraud.

bzbarsky commented 5 years ago

Sure, and ad tech could just treat "no available ancestorOrigins" as "bad actor" for its purposes. Then sites can decide whether they want to leak their origin to their subframes (and allow ad tech in there) or not, right?
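
One way an embedded ad might apply that rule, as a sketch only. The function name, the allow-list, and the decision to check the last (topmost) entry are assumptions for illustration; the input is a plain string array such as one copied from location.ancestorOrigins in browsers that expose it.

```typescript
// Hypothetical check run inside an embedded ad frame.
// ancestorOrigins is ordered nearest ancestor first, topmost page last.
function isPlacementTrusted(
  ancestorOrigins: readonly string[],
  allowedTopOrigins: ReadonlySet<string>,
): boolean {
  // Assumption: an empty list means the ad is not framed at all, which is
  // also treated as an unexpected placement here.
  if (ancestorOrigins.length === 0) return false;
  // Any redacted ("null") entry counts as "no available ancestorOrigins".
  if (ancestorOrigins.some((origin) => origin === "null")) return false;
  // The topmost page must be a site the ad was meant to run on.
  const top = ancestorOrigins[ancestorOrigins.length - 1];
  return allowedTopOrigins.has(top);
}

const allowed = new Set(["https://publisher.example"]);

const okPlacement = isPlacementTrusted(
  ["https://adnetwork.example", "https://publisher.example"],
  allowed,
);
const redactedPlacement = isPlacementTrusted(["https://adnetwork.example", "null"], allowed);
const wrongSite = isPlacementTrusted(["https://other.example"], allowed);
```

Under this sketch a site that opts out of exposing its origin simply fails the check, which matches the trade-off described above: the site chooses between hiding its origin and hosting ads that insist on seeing it.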

dliebner commented 5 years ago

I'm a little confused by the attitude that a parent frame should remain anonymous to its subframes. If a site is being embedded by another site, don't they deserve to know by who? In what legitimate scenario does a site embed an iframe (or a chain of iframes) and need to be anonymous?

annevk commented 5 years ago

As a reminder, there's a HTML PR for this at https://github.com/whatwg/html/pull/2480 and a WPT PR at https://github.com/web-platform-tests/wpt/pull/5402.

@othermaciej @johnwilander I suspect Safari picking this up would make it more likely for Firefox to ship this too (it currently does not expose this attribute at all).

bzbarsky commented 5 years ago

If a site is being embedded by another site, don't they deserve to know by who?

Imo, no. If it doesn't want to be framed, it has ways to avoid being framed, yes?

My usual go-to example here is that imo a site should be able to embed a video from a video hosting site without exposing information about itself to a video hosting site. Under the assumption that the video hosting site allows such framing, of course.

opyh commented 5 years ago

What ad tech wants to detect is when an ad is being served on an unwanted domain, or if something else is generally amiss in the chain of ancestors. Failure to make this information available makes it easier for bad actors to commit ad fraud.

A person who visits a political or health blog doesn't want these URLs to be shared with giphy, facebook, and every adtech company on the planet.

While it's understandable that adtech companies want to know my political views and if I have cancer or not (and as a side effect, can prevent ad fraud more easily), as a user I'd like to have a choice if my browser sends this very personal information. Embed providers are not entitled to it. They should be able to choose who can embed them (possible with frame-ancestors), and users should be able to choose whom they want to share information with.

dliebner commented 5 years ago

My counter point is that blocking-by-default will effectively block the majority of ancestor data to ad tech because you can't expect developers to go out of their way to add/enable allow-policies. From the ad tech point of view, if you can't reliably see the ancestors, you can't reliably detect fraud.

With regard to your privacy concerns, 1) Not all ad tech companies are interested in invading your privacy (although sure probably most are) and 2) If that's something you're worried about, ad block is fairly effective and 3) If the sites you're visiting are of a sensitive nature and are embedding advertisements and you're concerned about your privacy, perhaps you should be evaluating those sites and their choice of ad partners.

I am someone who is building an ad tech company who is not interested in tracking individual users, and I need tools to detect, prevent and deter ad fraud.

opyh commented 5 years ago

I have worked in adtech myself, on several sides of the ecosystem – adtech developers are used to much more painful things than adding allow policies to websites ;) So you can expect developers to do this.

You can’t demand from a normal person using a browser to know what's going on behind the scenes. If I, as a software developer, have no means to see which health site tracks me and which doesn’t, how is a non-IT person supposed to understand this?

It's the standard’s job to help create browsers that protect me from bad actors. No matter if I have an ad blocker or not.

If ad fraud can't be detected without complete surveillance, so be it? The ad industry is free to adopt business models that don’t simplify privacy fraud. If a user explicitly wants to be tracked in exchange for freebies, they'd still be free to configure their browser accordingly.

Thanks for your counter arguments – I'm out of this discussion, and I hope that this issue can be solved in a way that doesn't hand my browser history over to random companies as a default.

michael-oneill commented 5 years ago

Browsers can determine if the user is a bot or not, at least as well as any external service. If this is communicated in a privacy preserving way then fraud could be detected more effectively without having to rely on surveillance. https://github.com/w3c/web-advertising/blob/master/admetrics.md

dliebner commented 5 years ago

Browsers can determine if the user is a bot or not, at least as well as any external service. If this is communicated in a privacy preserving way then fraud could be detected more effectively without having to rely on surveillance. https://github.com/w3c/web-advertising/blob/master/admetrics.md

That is useful, but the issue I'm talking about is running ads that are supposed to only be served on one site and running them on another site. The people seeing the ads will be legitimate users, but how will the ad tech know if the ads are being served on the intended site without the ancestor list?

michael-oneill commented 5 years ago

In this proposal the browser will determine if they are being shown on the intended site; the ad tech only gets metrics from the Metrics Server, e.g. Nielsen or similar. Anything invalid gets ignored.

SamB commented 5 years ago

Browsers can determine if the user is a bot or not

... but wouldn't bots just use lying browsers?

dliebner commented 5 years ago

It's not so much about detecting bots as it is about preventing malicious publishers from sending spoofed data via real users.

jeffreytgilbert commented 4 years ago

Problem statement: A full URL and the chain of domains can be read from within an iframe in a cross domain context via javascript. When used in conjunction with long-lived stable identifiers, behavioral information can be inferred and associated with the user identifier and deep behavioral profiles can be stored and resold over private data marketplaces unbeknownst to the user.

This is the default case. The top level site does not have the ability to control this behavior.

Proposed solution: Allow control of ancestorOrigins and referrer data by applying the Referrer Policy header/attribute to the ancestorOrigins API. The default behavior, if no referrer policy is specified, is the same as historical ancestorOrigins behavior, which has an ordered list of domains.

Regarding privacy, here’s my best semi-complete list for @dliebner and @opyh.

A user should be able to opt into advertising and tracking for ad supported publisher content. For example:

A site (aka publisher) should expect to be able to restrict page content, including cross domain content such as ads, to appropriate usage. For example:

It's probable that I missed some things here.

Good news, bad news is… OpenRTB 3.0 has a possible solution using blockchain-like signed ledgers to show the chain of changes to a bid request. The problem is that the adoption rate for OpenRTB is not fast. It's a big change, and it makes some big assumptions about publishers', exchanges', and networks' willingness to adopt the new complexity and cost associated with implementing it. The biggest benefits are for adopters of header bidding. The biggest losers in this are probably ad networks, which is likely why there is a real reluctance to adopt this version. They married a good tasting thing with a bad tasting thing.

You can read more on the certificate chain here: What is ads.cert?

@opyh has some valid points related to not leaking the full browsing history of the user to advertisers. @dliebner also has valid points related to a trustworthy supply chain free of fraudulent publisher and exchange practices. My earlier comment is probably closer to an additional feature request for user level controls since this ticket addresses publisher level controls.