Privacy Beyond Blocking Cookies: Bringing Awareness to Third-Party Content

Previous posts have covered trustworthy principles in general and some product specifics as well. Privacy is an important part of trustworthy computing. This post discusses one aspect of privacy on the web: third-party content.

When most people browse the web, they think what they see in the address bar and the site they are visiting are the same thing. However, web sites today typically incorporate content from many different web sites. For the sake of clear terminology, the site the user browses to directly (seen in the address bar) is the first-party site; the other sites that the first-party site incorporates in its site experience (but that the user hasn’t navigated to directly) are third-party sites.

When you browse to a first-party site, you know that it can collect information about how you use the site.  What many users don’t realize is that technically, third-party sites can collect information about users as well. Users aren’t typically well-informed about which third-party sites are collecting what information, how the sites use this information today, or how the sites could use the information in the future.

Identifying Third-party Sites

Most websites today are actually mosaics, or mash-ups, of several different sites. To see this, you can bring up the Privacy Report in Internet Explorer (from IE7’s Page menu or IE6’s View menu, choose the Web Page Privacy Policy menu item) for any site you visit. Here’s part of the report for a news site, and another from a credit card site:

[Screenshots: Example Privacy Report (news site) and Example Privacy Report (credit card site)]

While the address bar shows the address of the current, first-party, site, this dialog shows the addresses of all the different web sites (including third-party sites) that the current web page includes content from. The browser visits every one of these sites in order to show the current web page’s content. 
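As a rough illustration of what the Privacy Report is listing, the sketch below scans a page's markup and collects the hosts it pulls content from. It is a simplified sketch, not how Internet Explorer builds its report: the helper names (ContentHostCollector, third_party_hosts) and the .example domains are hypothetical, only src attributes and stylesheet links are examined, and a host counts as third-party simply when its name differs from the page's host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ContentHostCollector(HTMLParser):
    """Collects the host of every URL a page embeds via src or a stylesheet link."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = attrs.get("src")
        if tag == "link" and attrs.get("rel") == "stylesheet":
            url = attrs.get("href")
        if url:
            # Resolve relative addresses against the page's own address.
            host = urlparse(urljoin(self.page_url, url)).hostname
            if host:
                self.hosts.add(host)


def third_party_hosts(page_url, html):
    """Return embedded hosts whose name differs from the page's host (assumption:
    an exact host-name comparison is a good enough definition of third-party)."""
    first_party = urlparse(page_url).hostname
    collector = ContentHostCollector(page_url)
    collector.feed(html)
    return sorted(h for h in collector.hosts if h != first_party)


page = """
<img src="/logo.png">
<img src="http://mysyndicatedphotos.example/photo1.jpg">
<script src="http://analytics.example/track.js"></script>
"""
print(third_party_hosts("http://news.example/story", page))
# ['analytics.example', 'mysyndicatedphotos.example']
```

The address bar would show only news.example here, yet the browser contacts two other hosts to render the page.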

The way that sites can pull content in from other sites is useful, powerful, and typical on the web today. It’s part of the underlying design and structure of the web, and enables functionality (like an interactive map in the middle of a restaurant’s website, or a “share this” link in the middle of a news article) that people value.

Third-Party Sites and Privacy

At the same time, bringing information together from different websites has privacy implications. A good example of this issue that most people have experienced involves email. Many email systems treat email messages that come from unknown senders in a special way, blocking images in them and displaying a warning like this one:

[Screenshot: Blocked Images Warning Message]

The message body typically has some missing images (“red X’s”) with text nearby, like “Right-click here to download pictures. To help protect your privacy, Outlook prevented automatic download of this picture from the Internet.”

Why do email systems block these external images? The sender may have programmed some information in the external image that is unique to the recipient – for example, having the image’s file name or location include the recipient’s email address. When the sender sees that a particular image was downloaded, then the sender knows which email message arrived in a valid account and was opened. By not downloading the content, the email recipient prevents his email system from disclosing information and protects his privacy from the unknown sender. Potentially, the recipient protects himself from more unsolicited email.
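To make the email case concrete, here is a minimal sketch of how an image address can carry the recipient's identity. The host sender.example, the query parameter r, and both function names are hypothetical, chosen only for illustration; real senders may encode the identifier in the file name or path instead.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def image_url_for(recipient):
    """Sender side: build an image address that is unique to one recipient."""
    return "http://sender.example/banner.gif?" + urlencode({"r": recipient})


def recipient_from_request(requested_url):
    """Sender side, later: recover the recipient from the address that was requested."""
    query = parse_qs(urlparse(requested_url).query)
    return query.get("r", [None])[0]


url = image_url_for("alice@example.com")
print(url)                          # http://sender.example/banner.gif?r=alice%40example.com
print(recipient_from_request(url))  # alice@example.com
```

If the mail program never requests the image, the sender's server never sees this address and learns nothing about whether the message was opened.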

In general, every piece of web content that a computer requests from a website discloses information to that website. This basic technique enables a third-party site to track visitors across different first-party websites that include content from the same third-party. When several websites show content (like a syndicated photo or article) from the same third-party website, that third-party site can determine which of the websites a particular visitor has browsed to.

For example, say two totally unrelated sites, Site1.com and Site2.com, both include images from MySyndicatedPhotos.com. The user browses to both Site1.com and Site2.com, and the user’s browser calls MySyndicatedPhotos.com in order to get the images these sites include. MySyndicatedPhotos.com can figure out (by various means) that the same machine visited these two different sites.

As the user visits additional sites that show content from this same third-party site, this third-party site is in position to build a profile of the user’s activity across the different sites that include its content.
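The profile-building step can be sketched from the third-party site's point of view. The sketch assumes the third party already has some way to recognize a returning visitor (represented here by an opaque, hypothetical visitor_key) and that each request for its content arrives with a Referer header naming the page that included it. The function name and log format are illustrative only.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_profiles(request_log):
    """Group the first-party sites named in Referer headers by visitor key."""
    profiles = defaultdict(set)
    for visitor_key, referer in request_log:
        site = urlparse(referer).hostname
        if site:
            profiles[visitor_key].add(site)
    return {key: sorted(sites) for key, sites in profiles.items()}


# Requests received by the third-party photo site: (visitor key, Referer header).
log = [
    ("visitor-a", "http://site1.com/home"),
    ("visitor-b", "http://site1.com/home"),
    ("visitor-a", "http://site2.com/article"),
]
print(build_profiles(log))
# {'visitor-a': ['site1.com', 'site2.com'], 'visitor-b': ['site1.com']}
```

Neither Site1.com nor Site2.com shares anything with the other; the overlap becomes visible only at the third party that both of them include.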

While cookies can definitely contribute here, and there’s been long-standing concern and confusion about “tracking cookies,” the fact is that any content coming from a third-party site can function like a tracking cookie. The intent of the content (a photo, article, logo, or site-specific analytics; image, text, or script) is technologically irrelevant to its potential use as a tracking mechanism. Note that even if the user had blocked all cookies, other content on third-party websites could still be used to build a profile. Third-party content isn’t inherently good or bad; it’s just technically possible to use it this way.

Actually Happening or Just Technically Possible, and Other Questions

To be clear, this post is about what a website can do when several other websites use content from it. It is not a claim about what all third-party sites actually do when other sites refer to content on them. What is actually done with the available information is up to the third-party site, and in some ways very hard for anyone else to figure out. The third-party site could have a clear, well-written, and prominently posted privacy policy that guides its operations. It might not. The site could have an employee who loses a laptop with the data collected, or has malware on his machine and discloses collected information against policy. The site could have business arrangements with other sites that involve pooling data.

Also, this blog post isn’t meant as a technical deep-dive on the techniques sites can use to track users, or on the different counter-measures technically savvy users might take to avoid being tracked. The common technical theme (in the email case above and in the web case here) is that first-party sites enable information that can uniquely identify site visitors to flow to third-party sites. For example, many of the web addresses you’ll find in the Web Page Privacy Policy dialog are quite long and contain unique identifiers. There are better discussions of this topic elsewhere. For example, a recent IRC discussion about developing new standards for rich websites covered aspects of this topic. While it’s quite long, some parts are very relevant, like this one (that people “are being tracked whether they send cookies or not”) and this one (“anyone who wants to track people across the web can trivially do so at this point, even without cookies…. you can pretty easily ‘fingerprint’ people through things like their user-agent string, ip address, screen size, other js- and http- accessible prefs, etc and then with a simple set of analysis scripts you can easily work out who is who just look at the ‘anonymised’ search query string data aol released”).
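The “fingerprint” idea in the quote above can be sketched in a few lines. This is only an illustration of the principle, assuming a site combines three of the attributes the quote names; the function name, the choice of SHA-256, and the sample values are hypothetical, and real fingerprinting would likely draw on many more signals.

```python
import hashlib


def fingerprint(user_agent, ip_address, screen_size):
    """Combine a few browser-visible attributes into one stable identifier."""
    raw = "|".join([user_agent, ip_address, screen_size])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


# The same attributes always produce the same identifier, with no cookie involved.
first_visit = fingerprint("ExampleBrowser/1.0", "203.0.113.7", "1280x1024")
later_visit = fingerprint("ExampleBrowser/1.0", "203.0.113.7", "1280x1024")
other_user = fingerprint("ExampleBrowser/1.0", "203.0.113.7", "1024x768")

print(first_visit == later_visit)  # True
print(first_visit == other_user)   # False
```

The fewer visitors who share a given combination of attributes, the closer such an identifier comes to singling out one machine.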

Web browsing isn’t anonymous or perfectly private even without third-party sites. For example, the provider of Internet access (to a person’s home, hotel room, café table, or desk at school or work) can observe where the computer goes on the Internet. These providers typically provide terms of use, so users have clear notice and can choose to accept or decline connectivity under the stated terms. Any software running on the user’s machine can determine the websites the machine has visited; this is the basis of features like History, or toolbars that copy a user’s browser history up to the web so users can get at it from different machines. Again, terms of use and privacy policies are important tools here for users. The websites a user visits can determine information about the user (for example, the user’s likely location). Also, users give the sites they visit information directly in terms of what they click on and choose to do.

Third-Party Sites and Trust Issues

Given that web browsing isn’t anonymous and in some ways this is “how things work” on the web, what exactly is the trust issue? For many people, trust begins with security. The security risk here is plain: visiting one website exposes the user to potentially malicious content from other websites. The user visits one site and sees content on it that seems trustworthy (it’s on the site!) but actually comes from a different source. Finding examples of this problem on the web isn’t hard; it’s happened to visitors of several top-tier websites.

Trust includes privacy as well. The privacy concern involves users having a choice, and being able to exercise control over what information they share. Today, users are not in control of which websites can get information about their browsing activities. As a result, websites that users aren’t aware they’ve visited, and with which they have no well-defined relationship, are in position to build a profile of the users’ browsing patterns.

A guiding principle for Internet Explorer (and Microsoft overall, as part of Trustworthy Computing) is that the user should be in control. Consumers have come to expect security protections from their browsers, and are starting to have higher expectations about privacy protections as well. Control here means that users have clear notice and can tell what sites they may be disclosing information to and under what terms. Control also means that users can exercise choice about what information they disclose to whom. Preventing information disclosure means blocking content, and blocking content can affect the appearance and functionality of the page.

Beyond these issues, accountability is a question here as well. When a user visits one site after another, and each one includes some third-party content, who is accountable and who takes responsibility for the information collected about the user? On today’s web, that’s not at all clear.

The privacy and trust issues around third-party content are complex and important. As discussed in this blog before, trustworthy browsing involves many industry challenges, and, like many other efforts (e.g. interoperability), requires cooperation and trade-offs. Web privacy involves more than just blocking cookies. Enabling users to be in control starts with making users aware of the issues. In another post, we’ll cover IE8 functionality that helps users stay in control of their information.

Dean Hachamovitch
General Manager