Preserving the Freedom to Learn in AI

Balancing openness and control in AI data access

Matt Perault

In Brief
  • The Web’s architecture embraced the freedom to learn: people should have the liberty to learn without censorship or gatekeeping. If you can access information legally, then you can read it, analyze it, and build on it. 
  • AI is threatening to chip away at that freedom, as some proposals seek to fence off public information through contracts, technical barriers, or expansive interpretations of copyright.
  • If adopted, these approaches would constrain lawful learning, disadvantage startups, and further concentrate AI markets.
  • Advancing the freedom to learn with AI means balancing openness for readers with control for publishers, just as the Web did.
  • We propose a three-part framework to carry this balance of access and control into the AI era:
    • Market evolution: enable licensing and monetization where it adds value, without requiring permission for lawful learning.
    • Voluntary technical standards: give publishers scalable, machine-readable ways to express AI-related preferences.
    • Public policy: five recommendations that reaffirm fair use, limit the use of contracts to evade copyright, prohibit unlawful access and the misuse of AI, and foster access to data in the public’s interest.

The Web thrived because it found a simple equilibrium to balance openness and control: anyone could access information openly, while publishers had tools to control how they made their works available. Its architecture embraces people’s freedom to learn–the principle that people should have the liberty to inquire and experiment, without censorship or gatekeeping, and should have the access to knowledge and tools that makes that liberty real. In practice, it means that if information is lawfully accessible, people are free to read it, analyze it, and build on it. This foundation enabled the internet to generate immense economic and social value, distributing its benefits to creators, developers, and the public.

But today, this equilibrium is being threatened. Some stakeholders are questioning whether it can adequately balance the tensions of an AI-enabled world. As AI reshapes how people gather, engage with, and learn from information, some publishers are pursuing a new approach to online information, even when that information is publicly available: huge swaths of the internet would be off limits to learning, with information sitting behind paywalls or barred from access by provisions buried in terms of service or by court-ordered prohibitions. If these emerging approaches become the standards for AI data access, they would not only slow AI model development and limit how people use AI, they would redefine the freedom to learn itself, transforming it from an open public good into a fragmented privilege bounded by contracts, code, and court orders.

The costs of redefining the freedom to learn in an age of AI are tangible and significant. As has been well documented, severe limits on the freedom to analyze data would erect barriers to AI training. But they would also limit how people can use AI models to analyze and read works, such as by limiting a person’s ability to ask an AI model to summarize an article that they are legally permitted to access. Moreover, the costs will not be distributed equitably: large, well-resourced companies will be better positioned to navigate complex access rules than startups and entrepreneurs, whom we call Little Tech, so these rules will disproportionately harm new entrants. AI markets will become even more concentrated, and without real competitive threat from startups, consumers will see higher prices for less innovative, lower-quality products.

The freedom to learn, like all freedoms, is not infinite. Developers using AI should not have unlimited ability to access and use data, such as by circumventing paywalls to reach content. Copyright law should continue to protect rightsholders, while also leaving room for fair use. As policymakers and courts scramble to establish rules for lawful AI data access, what principles should guide them?

To carve out a path that balances open access and choice, policymakers, publishers, and developers should pursue a mix of tailored market, technical, and policy solutions.

This post examines these issues in more detail:

  • First, it reviews how the Web embraced the freedom to learn, empowering readers while also giving publishers choices about the information they share.
  • Second, it examines current proposals for AI data access that threaten to undermine this freedom.
  • Third, it describes a three-part blueprint for adapting the Web’s learning and reading paradigm to AI technologies: market evolution, technical collaboration, and public policy guidance.

By adapting the Web’s core structures for an AI-empowered world, this three-part framework can affirm people’s freedom to learn, while offering protections to publishers against unlawful access or use of their works.

How the Web empowered readers and publishers

The Web’s success is rooted in its openness. The Web’s design allowed anyone to build a webpage, to create a browser to fetch a given URL, and to deliver information to a user. That’s because the Web is built on open standards. It’s those open standards that allowed a young Marc Andreessen to develop Mosaic and then Netscape, and allowed millions of subsequent entrepreneurs to build Web-based tools that transformed our lives.

This structure benefits consumers, publishers, and builders. Internet users enjoy radically lower barriers to both publishing and accessing information. Publishers can decide whether and how to make their content accessible online, such as by deciding not to post certain content or by placing content behind a paywall. But once publishers decide to make information publicly available, website readers can choose how to consume the information that is lawfully available to them. They are free to read that information and learn from it, and they can use what they learn to create new content and new technology.

Consider the development of search engines. To build a search engine, developers “crawl” the web: through automated processes, they go from link to link, webpage to webpage, and make a copy of those pages. Developers then analyze those copies in order to create a search index and deliver results in response to queries. This acquisition and analysis of public data benefits consumers, since it enables people to find relevant information, and it benefits publishers by driving traffic to their websites. In fact, an entire industry—the Search Engine Optimization (SEO) market—grew up around the value proposition of helping publishers to secure traffic.
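
To make those mechanics concrete, the short Python sketch below walks link to link from a seed page, copies each page, and builds a toy keyword index. The seed URL, the simple regular expressions, and the index structure are illustrative assumptions, not a description of how any production search engine works.

    # A minimal crawl-and-index loop, using only Python's standard library.
    from collections import defaultdict, deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl_and_index(seed_url, max_pages=10):
        """Follow links from a seed page and build a toy keyword-to-URL index."""
        index = defaultdict(set)      # word -> set of URLs containing it
        queue = deque([seed_url])
        seen = set()

        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except (OSError, ValueError):
                continue              # skip pages that fail to load

            # "Make a copy" and analyze it: record which words appear on the page.
            for word in re.findall(r"[a-z]{3,}", html.lower()):
                index[word].add(url)

            # Follow the page's outbound links (the link-to-link traversal).
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urljoin(url, href)
                if link.startswith("http"):
                    queue.append(link)

        return index

    # Example: crawl_and_index("https://example.com")["information"] lists the
    # crawled pages that contain the word "information".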

Critically, the Web’s open architecture meant that search engines were not required to negotiate the ability to crawl and link with every single publisher, which would have been impossible given the scale of the Web. Had individual agreements been required, the Web’s ability to facilitate learning for people and to connect publishers to their audience would have been dramatically curtailed.

At the same time, starting from the early days of the Web, publishers worked with the technical community—including developers working on search engines that crawled the Web at scale—to develop standards to give publishers some control over how information on their websites could be accessed. The concept they developed was the robots.txt standard, formally known as the Robots Exclusion Protocol, which site owners implement by creating a robots.txt file. It enables websites to include plain-text rules in their root directories, providing guidance to crawlers on whether and how they want their site crawled. Crawlers can then use that information to respect these preferences.
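
For example, a site’s robots.txt might contain plain-text rules like the ones shown in the comments below, and a crawler written in Python can consult them using the standard library’s robotparser module before fetching a page. The crawler name and URLs are placeholders.

    # A typical robots.txt, served from a site's root directory, might read:
    #
    #   User-agent: *
    #   Disallow: /private/
    #
    #   User-agent: ExampleBot
    #   Disallow: /
    #
    # A well-behaved crawler checks those stated preferences before fetching.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    if rp.can_fetch("ExampleBot", "https://example.com/articles/today.html"):
        print("The site permits ExampleBot to crawl this URL.")
    else:
        print("The site has asked ExampleBot not to crawl this URL.")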

Website preferences varied: some sites wanted to prohibit crawling for privacy reasons, out of concern about search monetization, or to avoid crawling that imposed undue, expensive traffic loads. Other websites had the opposite preference: they wanted to be crawled, whether to promote the visibility of their content, to enable users to learn from the information they chose to publish, or to facilitate monetization.

The staying power of robots.txt stems from an alignment between the interests of publishers and developers. Many publishers wanted to be found in search engines and other directories, but did not want the traffic from bots to overwhelm their sites. By the same token, search engines and others did not want to send so much traffic as to make publishers’ sites inoperable. The widespread implementation of robots.txt does not mean that publishers and developers agree on every aspect of data access and monetization on the web. There are ongoing disputes about the right way to compensate publishers and developers for content and distribution, but despite those disagreements, robots.txt continues to be used by many search engines and crawlers today, even though it is a voluntary standard rather than a formal legal requirement. The fact that a voluntary standard has played such an enduring role in managing access and control demonstrates that restrictive regulatory mandates are not necessarily the sole, or best, way forward.

Over time, publishers have introduced other tools to control access to their sites by crawlers and individuals. Some publishers use paywalls to limit public access, and in many cases, websites can enforce terms of service that people affirmatively agree to.

That said, there have also always been limits on the enforcement of these types of access controls, and the boundaries between a publisher’s ability to control information and the public’s ability to access it have been actively contested. For instance, in 2003, a court found that Ticketmaster could not prevent a competitor, Tickets.com, from “deep-linking”—that is, linking to pages other than the Ticketmaster home page. The court found that Tickets.com had not agreed to a contract merely by browsing the website, and that its linking did not constitute trespass or other misuse of the site. Similarly, in 2001, the consulting firm KPMG sent cease-and-desist letters to several websites for linking to its website without signing a formal “Web Link Agreement.” Users lampooned this effort, first by linking indiscriminately to the KPMG website, and then by showing up near KPMG’s physical offices with signs that said “KPMG this way.”

Similarly, book publishers cannot enforce blanket “no summarizing or quoting” notices in a book, and they also can’t prevent used book stores from reselling a book or libraries from lending it out. No doubt, some book publishers might want to have those rights. In fact, in the early 20th century, the Supreme Court ruled against a book publisher that attempted to limit book purchasers’ ability to resell a work by appending a notice prohibiting the sale of the book at a price below $1. This case provided the foundation for the “first-sale” doctrine, which allows selling and lending of copies of works.

These cases are rooted in broader principles related to the ability of a publisher to achieve in contract what it cannot otherwise achieve in copyright. Whether a contract can be enforced depends on factors like notice and assent and whether the terms are unconscionable or violative of broader public policy goals. To give an absurd example to make the point: a website can require a visitor to comply with the terms of an otherwise lawful paywall, but it cannot include in its terms of service a requirement to read its content while hopping on one foot, and then enforce this “contract” in court against a person who read the website’s content while sitting.

Allowing publishers to enforce broad restrictions on the use of public data “risks the possible creation of information monopolies that would disserve the public interest,” as one appeals court put it. This rationale has been subsequently cited by other courts to preempt contracts that conflict with federal copyright law in the context of re-use of public Web data. In X v. Bright Data, the court applied that rationale to hold that X’s terms of use were unenforceable in the context of Bright Data’s data collection, because such terms threatened to give publishers like X “free rein to decide, on any basis, who can collect and use data—data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use.” Along with creating “information monopolies,” allowing publishers to enforce such contracts would also result in the state-by-state patchwork of rules that uniform federal copyright standards are designed to preclude.

While evolving laws and industry practices have helped manage these boundaries online, they have remained contested, and they are once again coming to the fore in the age of AI.

Web publishing in the age of AI

Like the rise of the Web, the rise of AI is raising questions about publishers’ rights to control information and the public’s freedom to learn.

On the one hand, allowing AI developers to crawl sites for training data or otherwise access public content on behalf of users—such as by fulfilling a user request to summarize a website or highlight key information—can provide benefits for publishers. Like search engines before them, AI crawlers may also provide traffic and, in turn, monetizable value: being included in AI training corpora helps with discoverability, and discoverability then helps with monetization, brand value, and customer engagement. Publishers are moving from a world of “search engine optimization” to “AI optimization.” Because of these benefits, many website publishers want their content to be available to AI tools.

On the other hand, because generative AI and other emerging tools are nascent, so too are the opportunities. Many publishers worry that their slice of the new pie of AI value will be too small. Some publishers already claim that their traffic is dropping due to AI-related uses of their works, and that providing public access to their content doesn’t result in commensurate value. Absent alternative choices, publishers may lock more and more of the Web’s content behind paywalls out of fear that their content will be used to create value for others with limited returns for themselves.

Some publishers are also going further, arguing that they should not only be able to restrict access to their sites, but they should be able to restrict the use of content that is otherwise accessible. Several copyright lawsuits against AI developers claim that training generative AI models on publicly available copyrighted works constitutes infringement. In effect, they argue that even when an AI model can lawfully access content on a website, it should not be permitted to learn from it. If this concept is enshrined into law, the long-standing principle underlying the freedom to learn—that you can learn from things you can lawfully access—will not apply to machine learning.

What’s more, publishers are also seeking to control inference—that is, using an AI model to analyze a work found on the Web and produce an output for a user. It is already common for people to ask for an AI tool to summarize information found on a website, and new AI search tools will expand and improve these types of use cases. For instance, a tool could help a real estate firm synthesize data to understand the state of the housing market. Similarly, financial analysts can use AI to draw on myriad data sources and analyze trends. In these cases, the information is not being used to train the model; rather, the model acts on the user’s request to review and analyze the information, and provide a new, transformative output.

Governing the freedom to learn in AI

Creating value in an age of AI should not be a zero-sum game. Rather, just as the Web has thrived by supporting a broad ecosystem of publishers, readers, and developers building and using the Web, so too do we need to find ways for a broad ecosystem to flourish in the age of AI. AI’s generative capacity—the ability to spark new information, new value, and new product and business models—should create value for this entire ecosystem.

The need for a sustainable resolution to these tensions is acute for Little Tech. Unilaterally blocking website crawling for AI development disproportionately hurts Little Tech because it creates an additional barrier to entry for startups and entrepreneurs, who are ill equipped to reach expensive bespoke licensing agreements with powerful publishers, navigate complex access provisions in terms of service, or gain access to the large pools of private training data that would make it unnecessary to use publicly available internet data for training.

Striking the right balance is important for other actors beyond Little Tech who lack the ability to purchase and license datasets. If publishers are permitted to assert unilateral control over the freedom of people to learn from information that they are otherwise allowed to access, the damage will be felt far beyond AI developers: archives like Common Crawl, researchers, and civil society organizations all derive value from lawful access to data. These organizations and their users are collateral damage; for instance, it was recently reported that The New York Times is removing its content from digital libraries like the Internet Archive, undermining the ability of researchers and others to study the historical record.

Creating this value requires a framework to govern AI that includes three components: markets, technical standards, and public policy.

Market evolution

The public’s freedom to learn is compatible with a vibrant commercial market for accessing and using content. The right for people to use publicly available works to train AI or to use AI to read and analyze those works does not preclude many other ways to make money, as is evidenced by innovation that is already occurring in the market.

For example, even as AI developers are using publicly available data for training, some AI developers are already paying for access to certain content that is not publicly accessible, particularly when that private data has specific value for the tool the developer is building.

In other cases, companies are licensing data for uses that go beyond what is allowed under copyright law. For instance, even though the law permits an AI developer to use publicly available work for AI training, copyright prohibits the production of infringing outputs. Because of this restriction, companies are already working to license content for inclusion in outputs, such as displaying substantial text from a news article, rather than just a summary of facts, or displaying full resolution photos from third parties. Such licensing may also enable new types of outputs, such as allowing fans to interact with famous superheroes and villains from movies or to create remixes and mash-ups of songs. To support these and other uses, various companies are exploring marketplaces and other means to facilitate these sorts of arrangements.

Technical collaboration

Technical standards have a role to play, too, just as they have for decades in the development of robots.txt and other Web standards. Collaboration could create voluntary standards to enable publishers to express AI-related preferences. Developers can then choose to respect these preferences, just as they have with robots.txt. Preferences might include requests to restrict particular uses, but also guidance to AI developers to help them learn from a website more effectively. Voluntary standards—rather than legally mandated ones—are also potentially helpful insofar as they allow for flexible application of a stated preference; for instance, a library might crawl a government website in service of its public-service mission, even where the site otherwise restricts crawling.

Such preference signals could take many forms. They could be attached to particular websites through something like robots.txt, attached to individual works (e.g., an image posted online) through metadata or other labels, or recorded in registries that prospective data users can easily reference. The Internet Engineering Task Force (IETF), which is responsible for the robots.txt standard, has convened an “AI preferences” working group to work through the details of what these signals might look like and how they might be implemented.
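
As a purely hypothetical sketch, and not the syntax the IETF working group will ultimately adopt, such a signal could resemble a robots.txt-style file with usage-specific directives, which a developer might parse in a few lines of Python. The directive names here are invented for illustration.

    # Hypothetical, illustrative preference directives (not an adopted standard):
    #
    #   User-agent: *
    #   AI-Training: disallow
    #   AI-Summarization: allow

    def parse_ai_preferences(text):
        """Map hypothetical 'AI-*' directives to their stated values."""
        prefs = {}
        for line in text.splitlines():
            line = line.strip()
            if line.lower().startswith("ai-") and ":" in line:
                key, value = line.split(":", 1)
                prefs[key.strip()] = value.strip().lower()
        return prefs

    example = """User-agent: *
    AI-Training: disallow
    AI-Summarization: allow"""

    print(parse_ai_preferences(example))
    # {'AI-Training': 'disallow', 'AI-Summarization': 'allow'}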

Whatever the technical standards might be, they should work for Little Tech. That means they must not impose compliance costs or administrative hurdles that put startups at a disadvantage relative to competitors with deeper pockets and larger teams.

With that in mind, technical approaches should be clear and machine-readable. Developers need an objective, standard, scalable way to identify a given preference. They shouldn’t be expected to dig deep into a website to identify a preference or to parse non-standardized language to understand it.

What’s more, standards should be carefully designed to empower publishers and users, not gatekeepers. Individual users should have controls over how their data is used, but intermediaries, such as large internet platforms or publishers hosting user-generated content, should not have the power to serve as gatekeepers to content posted by their users. Intermediaries should not be able to make choices on behalf of their users by setting default opt-outs.

This user-focused approach is critical. As Mike Masnick has argued, “blocking legitimate individual use of AI tools to access and analyze web content” is “not protecting creator rights—that’s breaking the fundamental promise of the web that if you publish something publicly, people should be able to access and use it.” To this end, collaboration around standards for Application Programming Interfaces (APIs) to access site data could also be a helpful approach to empowering users. In other contexts, like open banking, APIs have allowed service providers to act on behalf of a user in order to access and use their data. A user might, for instance, ask a budgeting application to access their bank account to track expenses. Similar sorts of approaches can help provide ways for AI tool developers and websites to develop mutually beneficial approaches.
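
As a rough sketch of that user-delegated pattern, the Python snippet below calls a hypothetical publisher API with a token the user has granted to an AI tool; the endpoint, token, and response fields are invented for illustration. The point is simply that a site can expose a narrow, permissioned interface that an AI tool uses on a user’s behalf instead of scraping pages.

    import json
    from urllib.request import Request, urlopen

    def fetch_article_for_user(article_id, user_access_token):
        """Call a hypothetical publisher API with a token the user granted to the AI tool."""
        req = Request(
            f"https://api.example-publisher.com/v1/articles/{article_id}",
            headers={"Authorization": f"Bearer {user_access_token}"},
        )
        with urlopen(req, timeout=10) as resp:
            return json.loads(resp.read().decode("utf-8"))

    # The AI assistant then analyzes only what it was permitted to retrieve:
    # article = fetch_article_for_user("housing-market-2025", token_from_user)
    # summary = summarize(article["body"])  # summarize() is the AI tool's own function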

Public policy guidance

Public policy also has an important role to play in ensuring that using AI does not deprive people of their freedom to learn, while also allowing for certain restrictions on use and protections against unlawful access:

  1. Policy should affirm that AI does not deprive people of their freedom to learn. People should not lose their existing freedom to learn simply because they are using AI, and policymakers should reject proposals to expand copyright in ways that would have that effect. To that end, policy should ensure fair, clear limitations on copyright. This matters not only in the context of AI training, but also when people use AI to read or engage with works in otherwise lawful ways.
  2. Policy should not permit rightsholders to use contracts to negate the public’s rights under copyright. Sites can make different choices about how their sites are accessed and enforce certain restrictions in their terms of service. However, contractual powers are not boundless, and sites generally shouldn’t be able to do through contract law what copyright would limit them from doing. Where copyright law protects the public’s rights to engage in certain uses, such as copying lawfully accessed data for the purposes of AI training, a rightsholder should not be able to override that fair use.
  3. Policy should empower publishers to prevent unlawful access to their sites. Existing laws already prohibit people from doing the digital equivalent of breaking and entering, whether that’s breaking security controls in order to access a satellite TV signal or breaking into a password-protected website. In AI, this principle means that an AI developer should not be permitted to circumvent a paywall simply because the aim is to “learn” from a website’s data in order to build an AI model. But policymakers should ensure that the term “unlawful” is carefully cabined and that, as noted above, terms of service and other legal mechanisms are not used inappropriately to interfere with the right to access and use information. For instance, if a user can visit a website and copy public data, merely using an automated process to do the same thing should not be considered a violation of a publisher’s rights. More generally, lawful access should include works found publicly available online, purchased media, or content obtained through contractual arrangements such as subscriptions.
  4. Policy should prohibit the misuse of AI to create outputs that infringe a publisher’s copyright. If someone uses an AI model to create an infringing output—such as by creating a news article that is substantially similar to the copyrightable expression in a news publication—existing copyright law gives rightsholders the ability to seek redress. That said, a person should be permitted to use AI to make lawful uses of a work, such as summarizing an article or converting website text into audio.
  5. Policy should foster access to data for use in AI. Governments already invest in datasets and the creation of other material that can be useful for AI. For instance, federal agencies from the National Aeronautics and Space Administration (NASA) to the National Institutes of Health (NIH) generate extensive datasets that could drive AI innovation, yet these resources often sit behind paywalls or in formats that are difficult to use, hindering effective utilization. The government could play a critical role in addressing this imbalance, such as by creating an “Open Data Commons” of data pools that are managed in the public’s interest and by creating a National AI Competitiveness Institute (NAICI) to house this data and manage access to it. By providing access to data for Little Tech, this approach could help ensure that startups can readily access the resources they need to compete.

Building an equilibrium for AI

The advent of AI raises a fundamental question: should the freedom to learn cease to apply if humans rely on machines as instruments of learning? We argue that it should not. Upholding the freedom to learn—irrespective of whether learning occurs by a human or a machine—is vital to sustaining an open, innovative, and competitive technology ecosystem in the AI era, just as it has been throughout the history of the internet.

Realizing this vision will require coordinated action: markets must evolve, technologists must collaborate on voluntary standards, and policymakers must continue to affirm the freedom to learn. If we get this right, AI can strengthen, not shrink, the Web’s founding promise: an open network where lawful access to knowledge serves as the foundation for learning, innovation, and empowerment.