Pausing the Internet

From The Practice — March/April 2019

How Perma.cc is trying to fix legal citations

A2J Future of law Law schools Students

In March 2014 the Harvard Law Review published “Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations,” by Jonathan Zittrain, Kendra Albert, and Lawrence Lessig. The paper examines how legal academic journal citation and even URLs (website links) provided in U.S. Supreme Court opinions are impacted by link rot or reference rot—instances where cited URLs no longer lead to the content that was originally cited. Within their sample of academic journals, the authors found more than 70 percent of all URLs no longer produced the information originally cited. In surveying all published Supreme Court opinions, they found that 50 percent of referenced URLs likewise suffered from link or reference rot. That is a problem worth restating: of all the citations of online sources found in all U.S. Supreme Court opinions, half of them no longer contained the evidence cited by the opinions’ authors. This means that we now know less about how Supreme Court justices reached their opinions, less about the arguments they found persuasive, and less about the judicial reasoning playing a foundational role in the highest levels of the U.S. legal system.

Link rot refers to instances where a cited URL no longer produces any content.

The authors propose a solution: Perma.cc, a webpage-preservation tool for academics, jurists, and potentially many others to prevent link and reference rot in their work. Perma.cc offers a way to ensure that internet citations are always linked to their original source material by freezing webpages as they exist at a particular moment and creating a new URL where they can be viewed at any time in the future. The project, created by the Harvard Law School Library Innovation Lab (LIL) and supported by a partnership of law libraries from around the world, is deeply rooted in a core function of libraries: preserving scholarship and works for the benefit of people who need that information. Zittrain, Albert, and Lessig’s article concludes:

The rise of the Web has enabled the creation and exchange of scholarly knowledge and the sources on which it is based. It has also bypassed the libraries that previously vouchsafed the long-term preservation of those sources. Unless action is taken to archive this type of information, future readers will be unable to obtain the sources relied upon by the authors whose work they read. The integrity of scholarship will suffer. The distributed Perma system seeks to unite journals, libraries, and authors to restore that integrity by ensuring that those sources are appropriately preserved for posterity.

In this article, we take a closer look at Perma.cc. Why is this solution coming from a law school library? Who is using it, and what does this mean for legal research? In the end, we are left with a portrait of Perma.cc—the service, the technology, and the people behind it—that offers a new vision for libraries, legal and otherwise, in the age of the internet.

Link and reference rot

Readers will notice two similar terms in this article: link rot and reference rot. The difference between the two is important to bear in mind. Link rot refers to instances where a cited URL no longer produces any content. Either the page is simply blank or users encounter the common “404 error” page that often reads “Page not found.” Reference rot, on the other hand, refers to instances where a cited URL no longer points to the same information that the URL contained when it was originally cited.

Consider the following example. Suppose you wanted to cite Zittrain, Albert, and Lessig’s paper in the Harvard Law Review, which you accessed on the publication’s website. You would inevitably include the article’s URL in your citation to indicate exactly where others could find that same work. Link rot would occur if you later entered that URL and saw no content or a 404 error page. How could this happen? The page might have been taken down. There might have been a technical issue with the website. The URL for the page might have been intentionally altered—perhaps the authors decided to change the title of the article and requested the URL be adjusted to match it.

Reference rot would occur if the URL continues to lead somewhere, but the information that was cited no longer exists at that location. Perhaps an infographic in the article was the object of your citation, but that infographic was later removed. Perhaps the title of the article was altered as suggested above and HLR published a new and different article with that exact same title and corresponding URL. In that case, your citation would lead to an entirely different article. More plausibly, reference rot might occur where the cited page is a landing page that once contained the cited information but no longer does. Perhaps you were merely citing that Zittrain, Albert, and Lessig published the article on HLR and its appearance on HLR’s homepage was your evidence.

Consider another example—one that lawyers and legal scholars are quite likely to experience. Suppose you are a lawyer who wants to cite a government report you accessed online. We will call this “Report X.” Your client came to you with some problem, and, through your firm’s access to a reliable online database, you find that the recently published Report X has an answer you need. You read the text carefully, understand what it says, and then cite the URL in your legal opinion when you draw on that report. You have closed the loop insofar as it is clear to any who care to check where you drew those ideas. But what if the online text of Report X, being newly published, was then amended to correct errors or perhaps even to fundamentally change a key conclusion? The URL remains the same—it is, after all, still Report X—but this is no longer the same document from which you drew your ideas. You reviewed and cited Report X₁, but your citation now leads to Report X₂. This is reference rot, and the consequences could range from incidental to severe.

Both link rot and reference rot present serious problems for scholarship going forward in the digital age. Unlike physical books and journals, webpages can be changed or deleted without notice. Websites will continue to change—and often. According to some studies, the average lifespan of a webpage is between 44 and 100 days. Another study, which explores the extent of link rot in science, technology, and medicine scholarship, shows that the vast majority of articles referencing “web at large” sources (generally, nonscholarly online content like wikis or other websites relevant to the research subject) contain some reference rot. Even in the most recent years of the study, one in five of all science, technology, and medicine articles contained reference rot. The takeaway: the internet is always in motion. The old methods of citation have already proven woefully inadequate to account for this shifting digital landscape that is increasingly housing the information we cite in our scholarship. Without an intervention, this problem will not just persist but continue to expand.

What is Perma.cc?

Perma.cc (“Perma”) was created in 2013 as a solution to link and reference rot by the Library Innovation Lab (LIL) housed in the Harvard Law School Library. These two influences, innovation and libraries, are core to Perma’s mission of preserving the integrity of scholarship in the digital age. To understand Perma from both angles, we spoke with two members of the LIL team who work closely with the project: Rebecca Cremona and Clare Stanton. Cremona, a developer, came to LIL with what she describes as a “scattershot background” typical of many at the Lab. Armed with degrees in physics and theology, she was drawn to the intersection of open access and academic scholarship and began programming for the Harvard Library Office for Scholarly Communication. Not long after, she joined LIL as a developer. Stanton, a communications and outreach specialist, has a background in art history and museum collections and is currently studying for her Master of Library and Information Science degree. Just before joining LIL, Stanton worked with Harvard Business School’s Clayton Christensen (author of, among many publications, The Innovator’s Dilemma—for more, see “Adaptive Innovation”) and found his research around the market disruption of large organizations to be increasingly relevant to the world of libraries. “After all, the whole point of a library is to stand in its place forever and have systems that both have lasted for a long time and continue to work,” she explains. “But there are undeniably these forces that are changing how we deal with information, how we access information, how we want to read books, and so on, and that really sparked my interest in how libraries can adapt to industry shifts.”

Perma is simultaneously an evolving web-archiving technology and a practical service that scholars and others can use.

At LIL, programming and librarianship have converged in the form of a web-archiving platform: Perma. Perma provides permanent links to online sources as a means of combating link and reference rot—snapshots of a sort that display the content that was viewable on a given webpage at the time the link was preserved. “If you are writing a paper for a journal, or if you’re writing a brief for court, you want to make sure that the exact thing you were referencing from the internet can be seen in that exact same iteration when someone’s reading your citation,” says Stanton. “The ‘Perma Link’ is both short and easy to click on, and it exists just like any other URL, only it’ll bring you to a preserved page.”

As for the actual storage of preserved links, LIL follows the central tenet of the LOCKSS philosophy of digital preservation—“lots of copies keep stuff safe.” “Some of the storage is physically at Harvard, but we also have a lot of redundant storage,” explains Cremona. “We have some in Germany, we have some in Canada, and we are developing a network of other institutions of higher education that agree to keep copies of Perma Links on their premises. And we’re always looking for new partners in that venture.”

Perma is simultaneously an evolving web-archiving technology and a practical service that scholars and others can use. The actual products of Perma’s technology are what Cremona calls “high-fidelity web captures.” As a user, you just need to enter the URL of the webpage you want to preserve, and in seconds, you are given a new URL, or Perma Link. That new URL brings you to the page as it existed at the moment you captured it. Across the top of the preserved page is a banner that reminds you it is a record and includes the exact time and date of the capture, additional details like page title and description, and even an option to view the live page as it exists today rather than the preserved one. In the end, users get URLs they can use in citations that will direct future audiences to the correct material.

Perma’s function, Stanton argues, is an extension of what libraries have always done. “In the past, if you were a scholar who wanted to cite to something, you would go to a library and you would find a physical book that was under the stewardship of that library,” she explains. “That’s what you would point to, and it would be reasonable to expect that to be there in three years.” However, the game has changed with the internet. Perma is LIL’s attempt to achieve a form of stability through the chaos. Stanton continues:

The internet is a great place. It’s totally a democratizer of information, which is now way easier to access, but those kinds of fail-safes of citation that are needed for the long term—that landscape has completely changed. Perma has a role to play in that new landscape that connects to the traditional work of libraries. Even though it’s not librarians who are actually choosing the materials and hosting them, we’re allowing that same kind of assurance to the scholar that, in the long term, they’ll have their citations available to the people who are reading them, which in many ways is what libraries do.

There are two distinct aspects to what Perma is doing when it allows users to preserve webpages: the capture and the playback.

Perma’s web-preservation service is available for free to academic institutions and courts—the initial user base LIL had envisioned. “Most academic law journals now use Perma.cc, and it’s a very natural fit into their workflow,” notes Stanton. “There are also Perma Links in the opinions of many state and federal courts, and Perma is used extensively by the Library of Congress,” adds Cremona. Increasingly, however, interest in Perma has swelled outside of these realms; new users can now set up paid monthly accounts based on usage (access to created links never expires). And people are using Perma in significant numbers. Currently Perma counts approximately 25,000 users and nearly 1 million links preserved.

This infographic shows the four steps to make and use a Perma Link: Step 1. Find the page you want to preserve. Copy its URL and open Perma.cc. Step 2. Add your URL to Perma.cc. Select Create Perma Link and Perma.cc does the rest. Step 3. Get your Perma Link. Perma.cc builds your new record in a matter of moments. When it’s finished, you have a chance to delete or annotate it. Step 4. Use the Perma Link in your citations.

How it works

There are two distinct aspects to what Perma is doing when it allows users to preserve webpages: the capture and the playback. The capture is taking in the webpage’s information, and the playback is rendering that information in the approximate form it took when it was originally referenced. The reason these functions are not one and the same is the medium. In other words, a library preserving a physical book and Perma preserving a webpage containing that same text is not an apples-to-apples comparison. “For a researcher, experiencing the physical book is a one-step process: take the book off a shelf,” explains Stanton. “But you can’t just read a .WARC file—an industry standard of web-archiving file formats—by opening it up on your computer.”

Perma has a multistep process to preserve and present that same text when it comes in the form of a webpage. It takes in the data of the webpage (the code), preserves “lots of copies” of that information (which are stored through Perma’s network under the LOCKSS philosophy), and then reassembles it in a way that gives the user an authentic view of what it looked like when it was captured. Thus, on the one end, there is the capture, and on the other, the playback. Both are necessary steps due to the complexity of web content.

A library preserving a physical book and Perma preserving a webpage containing that same text is not an apples-to-apples comparison.

“We call it a webpage, which makes it sound like a page in a book, but viewing a URL is actually more like being present for a live performance just for you,” explains Cremona. “And there are a ton of actors: it’s HTML, it’s styling, it’s code, it’s a variety of source files—there is a lot going on.” Of course, there are limitations to how closely Perma can reenact that live performance. For example, because preserved webpages are not connected to the live internet in the same way the original webpages were at the time they were captured, secondary links (links appearing within a webpage) are not accessible through Perma Links. But even absent the preserved webpage’s interconnectivity to the live internet, recreating that webpage can be complex work. Cremona offers the example of cnn.com:

If you go to cnn.com, more than 500 things are loaded into your browser to show you that webpage. What you are seeing is an interaction between many hundreds of things that some developer somewhere has written, copy that some copy editor somewhere has written, and it’s all displayed for you in a performance interpreted by your browser. That’s what you see instantaneously when you go to cnn.com. For us, it’s similar to thinking about how you would capture a live performance. You could use a video recorder, so then you could see what it looked like from a particular part of the room. You could try to get the audio to get what that performance sounded like from a particular part of the room. The extent to which the fidelity of your capture recreates the actual experience is a sliding scale depending on your resources, your ability, the complexity of what’s going on, and how much effort you want to put into it.

As Stanton describes, keeping up with the internet to produce high-fidelity captures is an ongoing battle. Playback is likewise a continuing effort because, just as the way we produce digital content is frequently changing, the way we consume digital content is in constant flux. Thus, LIL is constantly finding ways to resolve the various anachronisms inherent in digital preservation, whether that is allowing users to view the live version of the preserved webpage for comparison or capturing versions of the same page for different devices. Other webpage-preservation tools, Cremona adds, are even working on ways to recreate Flash media for modern audiences (since Flash has largely gone out of use after Steve Jobs declared Apple products would no longer display it). Permanence, in short, requires a certain vigilance of web content in both directions—looking back, what old forms of web content have become obsolete and why, and moving forward, how are people going to be accessing that information?

How to preserve a book filled with arsenic

True to its library roots, Perma is intent on preserving all aspects of each webpage—including those parts that were at one time (and may continue to be) harmful to users. That means faithfully preserving all the code that was part of the experience when the webpage was captured, even if its sole purpose was to facilitate, for example, targeted advertising. Cremona and Stanton liken the task to how libraries have handled the preservation of one particular physical book that causes harm to users, first brought to their attention by LIL summer fellow A. Kendra Greene while working on a book of dangerous library collections.

“The book is called Shadows from the Walls of Death, which may sound like a ridiculous name, but it is a literal book that has arsenic wallpaper in it,” explains Stanton. The book was written and compiled by Robert Clark Kedzie while he was serving on Michigan’s Board of Health and published in 1874 for the somewhat-ironic purpose of warning the public against the dangers of arsenic poisoning from wallpaper. The physical book contains dozens of real samples of said poisonous wallpaper. “Essentially, the way to get a vibrant color green in wallpaper during that time was to use actual arsenic,” says Stanton. “So he shipped all these arsenic-filled books to libraries all across the United States, which, for the people who were handling these books, was very dangerous for them.”

At the same time, there is now value in preserving some of these copies. After all, the physical book holds clues to the past that are of interest to scholars today—say, scholars looking into how the United States grappled with its growing understanding of arsenic poisoning. There is value in having this unique source of information, which is why a small handful of copies have been preserved with additional precautions for safety.

Perma, Cremona explains, approaches the preservation of potentially harmful digital content in much the same way. “I’m sure there are plenty of readers who want to see the wallpaper but don’t need that arsenic,” she says. For Perma, instead of arsenic, it is, for example, JavaScript that destroys the webpage if it is not allowed to report the user’s information back to the live internet. Thus, for Perma, the challenge is working around that type of issue to give an authentic playback while somehow still preserving those harmful parts. Cremona explains:

You could imagine creating a really high-fidelity copy of that book that just kept the poison out so then you could review the wallpaper and not get sick. That’s how I think of blocking this evil JavaScript from the playback. It lets you get at what you wanted—the content—and then if you are actually a scholar of evil JavaScript in the future, which there will be, then sure, you can have that too. We still have the copy. You’re just going to need to suit up.

The future of permanent webpages

Perma’s mission to preserve webpages indefinitely is bold for precisely the same reason it is needed. As Stanton and Cremona emphasize over and again, the internet is ephemeral. “In a lot of ways, Perma is an answer to an issue that is always going to be evolving,” says Stanton. “We want people to know that the internet is fragile but that it’s not an unsolvable problem. Perma can help as part of the ecosystem of how people interact with the internet.”

Permanence requires a certain vigilance of web content in both directions—looking back and moving forward.

There is also plenty of room in that ecosystem for other forms of webpage preservation, affirm Stanton and Cremona. They point to other players like the Internet Archive, which started archiving the internet in 1996 and now contains nearly 350 billion webpages, and Webrecorder, which specializes not just in storing web resources but in developing alternative ways to view that content (like the Flash Player noted above). “Technically, you could create a web archive from your very own web browser,” adds Cremona. “And then you’d have that archive on your computer—you just save the webpage. But how would you then get that out into the world? And how would you get that out into the world with authority? That’s where Perma comes in.”

And yet, Stanton and Cremona agree, LIL need not be at the center of that digital preservation. “Part of the mission of LIL, perhaps the mission of LIL, is to explore systemic interventions in library science and law, and to find where libraries fit in,” says Cremona. “But our focus is always on what’s next. We could easily see total success as transitioning Perma.cc to someone else suited to run it for all time.” According to Stanton, the important thing for LIL is that this resource is up and running—and that it continues to run. “Libraries are always going to be looking for the next way to help their patrons solve a problem, and now Perma is one answer to that,” adds Stanton. “We also want to continue to look for new projects and see what other systemic interventions are needed out there.”