Making the Law Computable

From The Practice March/April 2019
The Caselaw Access Project

Imagine you are a lawyer in the United States in the 1820s trying to keep abreast of all that is going on in the law. One of the chief requirements of doing this is having a mastery of the constantly evolving body of precedent and case law. In 1826, that meant reading roughly six cases per day. While that would entail a significant amount of reading, it was certainly within the realm of possibility.

Fast-forward to 2002. On the single highest-yielding day for case law in 2002, more case law was published throughout the United States (that is, federal and state combined) than was produced in the entire year of 1826. The methods of the 1826 lawyer are no longer sufficient to fulfill the 2002 lawyer’s responsibilities of maintaining a current understanding of the law. How are lawyers supposed to keep up?

Traditionally, case law was compiled and published (thereby creating a business cost) in large volumes and sold to law libraries. These libraries would add them to their stacks for lawyers or law students to find, and each year these stacks would grow as more case law was made. The dawn of the digital age in the 1970s and 1980s offered new solutions that brought both greater ease of access via technology and new conditions and limitations. As vendors moved to digitize these volumes of precedential court decisions, public case law increasingly slipped behind complex paywalls. Services like Westlaw and LexisNexis, for example, began charging significant fees for access to their databases, access that was now possible through personal computers. This fundamentally challenged the need for the bound case law books of the past and forced lawyers to interact with case law through a new proprietary set of interfaces, search engines, headnotes, citation indexes, and other tightly controlled parameters. Because these services had the raw data (that is to say, the case law), they were able to control access. And, to cover the business cost of producing and maintaining these authentic digital databases, that access was controlled through significant fees.

It was against this backdrop that the Harvard Law School Library Innovation Lab (LIL) launched the Caselaw Access Project (CAP) in 2013, an initiative to digitize every precedential case ever published by a U.S. court throughout its entire history and make it all free on the internet. The scope of the project is as big as it sounds. CAP’s first order of business was to compile a list of all precedential cases—a list that ultimately included more than 40,000 volumes of case law comprising some 40 million pages of text. Contained in those 40 million pages were cases published as early as 1658 and as recently as 2018, including case law from all state, federal, and territorial courts. The task was then to digitally capture all these pages in a machine-readable format, which amounted to roughly 200 terabytes of high-resolution scans. But the purpose of CAP is larger than merely compiling the law. CAP is about unlocking case law to allow technology and innovation to take hold. In short, it is about making the law computable.

To learn more about CAP, its mission, and where it’s going, The Practice sat down with Jack Cushman, a senior developer at LIL and the lead developer for CAP. Cushman, who is also a lawyer, has a unique perspective on CAP’s story. “I was a developer first,” he says. Cushman began working as a computer programmer before his undergraduate education—starting at a web design firm at age 16—and he continued as a web developer before deciding to pursue a career as a lawyer. After graduating from Northeastern University School of Law in 2008, Cushman clerked at the Massachusetts Supreme Judicial Court for Associate Justice Margot Botsford and then did appellate litigation work at the law firm Stern Shapiro Weissberg & Garin. “I found some of that work very rewarding, particularly pro bono and impact litigation, but there was also a lot of other legal work that wasn’t as rewarding for me,” he recalls. “Then I found the Library Innovation Lab.” After discovering LIL at a Berkman Klein Center open house event and asking them a flood of questions about their then-newly-launched project (see “Pausing the Internet”), Cushman followed up with a brief eight-point email summarizing his thoughts. LIL invited him to attend a meeting, which led to him writing code for the project, which led to him working for LIL one day a week, which eventually led to his current full-time position as a senior developer. Now, in addition to his work on CAP and across LIL’s various other projects, he teaches a seminar at Harvard Law School with LIL director Adam Ziegler on programming for lawyers (see “Computer Programming for Lawyers” below).

This past October CAP officially launched its public-facing data service in partnership with Ravel Law (now part of LexisNexis), posting 360 years of U.S. case law free and available on the internet. This release only scratches the surface of the potential in store for lawyers, courts, academics, and the public at large. This is just the beginning of CAP’s blueprint to change how we interact with the law. Below, we explore how it all came together, what CAP’s database is designed to do, and where the project is going. In the end, we begin to ask new questions—namely, what becomes possible when the law is made computable?

The landscape of case law access

As any lawyer knows, judges are constantly creating new case law in the form of precedential court decisions. To understand what CAP is trying to accomplish, Cushman explains, it is important to understand how that case law transforms from an opinion voiced by a judge to an object of text that lawyers and others can find and read. In the modern ecosystem of case law copy, one could imagine three main players—the courts, traditional publishers, and new providers—each facing its own challenges, to one degree or another, around three main issues: dissemination, authoritativeness, and accessibility.

The courts, somewhat ironically, have not traditionally played a large role in the direct dissemination of case law, instead leaving it up to publishers to create hard- and soft-copy case law. That, however, is beginning to change as courts are increasingly publishing their own case law on their websites for free download. In theory, this court-published case law ought to be inherently authoritative because it is coming directly from the source. Illinois is an example of a state that publishes its official cases online as digitally signed PDFs. In reality, however, many states publish only unofficial versions of case law on court websites, which could vary from the official version of that law. Gauging accessibility is perhaps even more complicated. In one sense, case law published by the courts on their websites is highly accessible insofar as there are typically no fees for accessing this public-facing material. In another sense, given that there are hundreds of different courts in the United States, access is also limited because there is no centralized method of searching across the various courts’ case law decisions published in this way. PACER, a fee-based public access service provided by the federal judiciary, currently provides access to federal court opinions only. Moreover, and crucially, some courts still do not publish all their cases online.

If all this seems a bit haphazard, that may be because this is not the traditional way case law copy has been produced in the United States. For a long time, the courts have relied on private publishers—such as Westlaw, Bloomberg, and LexisNexis—to disseminate authoritative case law. Cushman describes how this works:

Each court has a relationship with a major publisher to collect their cases as they come out, fill in a book, and then, when the book is full, to put it on the shelf and announce that it’s for sale and sell it to whoever wants it. That was the traditional way that case law made it out into the world. There was the court and there was the private publisher, the publishers sold the books to law firms and law schools, and that became the precedent that lawyers used.

While courts often share the original drafts of cases widely, each court typically works with only one publisher to produce the authoritative volume containing the official copy with additional—often substantive—edits. This process has evolved with the emergence of cyberspace such that these publishers now digitize the case law that they receive from the courts, add in new editorial content (such as headnotes) intended to assist lawyers, and build sophisticated query systems capable of searching the text. On one level, this consolidation and searchability makes the incredible volume of case law highly accessible to lawyers who pay for this service. On another level, however, because entrance into and navigation of the publisher’s database is governed by fees, it also inherently reduces access insofar as there is now a wall around those copies of case law.

Even with these publishers, however, authoritativeness of the copy is not guaranteed. Because a court has a relationship with a particular publisher, and because the other publishers also want to create their own complete databases, there are risks of alternative versions of similar case law entering the ecosystem. Cushman explains:

While each court still has a relationship with one publisher, and that publisher will get up-to-date copies of their cases, other publishers will often extract the data. They may get the one that the court puts on the website right away. They may get one that the court publishes a month later when it’s finished making some changes. They may get the one that’s put in the book that gets extracted back out. Every database can end up with a different copy of the thing.

Finally, there is a growing number of alternatives to the big publishers that, each in their own way, are attempting to deal with access issues by reducing costs or creating search tools with new capabilities. For instance, there is a startup tier that includes services like Fastcase and Casetext that often bill themselves as a lower-cost alternative to the incumbent commercial players. There are also nonprofits such as CourtListener, sponsored by the Free Law Project, which provides access to its incomplete database for free. Leaving aside whether or not these alternatives are somehow “better” than traditional publishers, their proliferation arguably exacerbates the problems already facing many of the large publishers—the questionable authoritativeness of the data sets underlying them.

Simply put, this complex ecosystem creates a problem in which there are potentially differing databases of case law—many of which can be quite expensive. How does this system actually anticipate lawyers accessing the correct version of the law? “If you’re working as a lawyer, and you’re doing legal research, you want to do responsible research and make sure that you’re finding the law that your clients need you to find, so you’re going to use a commercial database,” says Cushman. But, for those using the big commercial databases, searches can be expensive. “Everyone will have the story of the summer associate who goes and doesn’t realize what they’re doing, and spends $1,000 on legal research that they’re not supposed to spend in just a few minutes because it could cost tens of dollars per search in some databases,” Cushman notes. At the same time, lawyers cannot afford to be wrong about what the law says, so they are left with little choice. The current landscape for case law copy leaves room for little alternative. He explains:

Because it’s very expensive to get the law from courts and it’s important for the database to be correct and complete, there’s a strong incentive to build high walls around your database. If you’ve invested all this money in building a complete database, you have to charge tens of dollars per search. But what that means is that lawyers are only going to have the tools that services like Westlaw or Lexis build for them—or choose to build for them.

Enter CAP

CAP was created to challenge this system from a programming angle. LIL asked: How can we create a database that is as big and complete as those of the big commercial players, but free and accessible, not just in making the text available but in such a way that programmers can create new means of analyzing the vast wealth of data contained in case law? As we see below, CAP’s goal is to create the database of knowledge that others, including private companies, can develop into applications. To do that, you need a different organizational model. “Here at LIL, we have nonprofit funding, grant funding, or partnership funding that lets us do things on a project basis that we think are important for the world,” says Cushman. “And so the funding model for that is more or less we convince someone that something is a good idea, and then we get a certain amount of money to do it.”

For CAP, that took the form of a partnership with Ravel Law, a legal research startup at the time the project began. In return for funding support, Ravel got a copy of the database produced by the project and limited commercial exclusivity. Otherwise, the deal was structured as follows: All the data becomes free in any quantity to anyone in the world by 2024. (For research scholars around the world, unlimited access is already available.) If, in the meantime, a jurisdiction officially starts publishing its case law online—prospectively, Cushman clarifies—CAP is then free to share unlimited quantities of everything in that jurisdiction’s history before the 2024 mark. Thus far, that has only included Arkansas and Illinois, but others are likely to follow, says Cushman. Until 2024, with the exception of case law from Arkansas and Illinois and any other states that follow their example, users are limited to accessing 500 cases per day. “So anyone who shows up with an email address, we’ll give them a key and they can start taking 500 cases,” Cushman explains. “And again, those 500 cases they get are free and clear. As far as we’re concerned, we have no ongoing interest in what’s done with them.”

Users will quickly see, however, that this does not mean they have a new Westlaw or LexisNexis that just happens to be free. CAP is something different altogether. It is simultaneously more and less than what lawyers and law students are likely to find in other common legal research databases. CAP’s distinct user experience is rooted in its mission—to make case law not just accessible but computable. Accomplishing that goal, however, was anything but simple.

Cutting off the bindings

It’s no accident that CAP was conceived of and carried out at Harvard Law School. Harvard boasts one of the largest collections of U.S. case law in the world—second only to the Library of Congress. And yet, given the proliferation of large commercial databases over the last quarter century, these books were not filling the stacks at Harvard Law School’s Langdell Library. (They are not there now, either, but more on that below.) Virtually all of them were packed away 25 miles off site at the Harvard Depository—the subject of a short documentary produced by metaLAB (at) Harvard. So when LIL’s team compiled their list of all precedential U.S. case law, which tracked those volumes that were officially designated case law and thus authoritative, they then had to go find each volume in storage. With the help of the Harvard Depository, which stores more than nine million items sorted and organized by physical size—not by, say, author name or subject—they were eventually able to locate their sources.

LIL also needed a clear and feasible strategy for how they would transform all this physical data into digital form. The plan they ultimately went with was, for a library, unexpected—though perhaps not as strange as one might assume. LIL proposed slicing the bindings off all the case law books so that they could then be scanned in a high-speed scanner. “The idea that we were cutting the books was always the most provocative part of the whole thing,” Cushman says with a laugh. Their argument was that high-speed scanners could work at a pace of 8,000 pages per hour, whereas typical scanners designed to handle fully intact books, such as cradle scanners, could work at a pace of only 800 pages per hour. In other words, without slicing off the bindings, the project would take 20 years instead of two. Ultimately, the library allowed them to cut the bindings off the books to facilitate the project. Indeed, as Cushman explains, the books were hardly accessible to anyone sitting in the depository. What LIL was proposing was a way to not only preserve the content of those books in a new form but make them significantly more accessible to people actively looking for the information they contained. Cushman continues:

If you’re running a lending library, you have books that come in and, while there are ones that are wanted, the ones that are least wanted have to be destroyed. And I think for individuals who have a romantic idea of a library, like me, that can be very unsettling. But what we’re trying to do is create information and provide the books people need, and we know that some books are going to reach the end of their life. So maybe the real-life librarians were wiser about this than the librarians of our imaginations.
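The scanning-rate argument above comes down to simple arithmetic. Here is a rough sketch in Python using only the figures quoted in this article: some 40 million pages, 8,000 pages per hour for a high-speed scanner, and 800 pages per hour for a cradle scanner. It counts pure scanner time and ignores retrieval, cutting, and handling, so it illustrates the tenfold ratio rather than the project's actual schedule.

```python
# Figures quoted in the article; the calculation itself is only illustrative.
TOTAL_PAGES = 40_000_000
HIGH_SPEED_PAGES_PER_HOUR = 8_000
CRADLE_PAGES_PER_HOUR = 800


def scanner_hours(pages: int, pages_per_hour: int) -> float:
    """Return the hours of pure scanning time needed for `pages`."""
    return pages / pages_per_hour


high_speed = scanner_hours(TOTAL_PAGES, HIGH_SPEED_PAGES_PER_HOUR)
cradle = scanner_hours(TOTAL_PAGES, CRADLE_PAGES_PER_HOUR)

print(f"High-speed scanner: {high_speed:,.0f} hours")   # 5,000 hours
print(f"Cradle scanner:     {cradle:,.0f} hours")       # 50,000 hours
print(f"Slowdown factor:    {cradle / high_speed:.0f}x")  # 10x
```

That factor of ten is the difference between a two-year project and a twenty-year one.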

After they had identified all the books they needed, the physical aspect of CAP’s digitization process could begin. LIL would order a set of case law books from the depository, which would then be transported to their offices in Langdell Library. The books would face the “guillotine,” a tool that would cut off the book’s spine, rendering its pages unbound and thus more easily scannable. Next, the books hit the high-speed scanner (the fastest scanner they found equal to the task was actually designed for reading bank checks), which captured digital images of each page. After an entire book had gone through, the resulting pile was consolidated using a card-shuffling machine, which brought the pages all back together into a perfectly neat stack. Each book pile, having regained a close approximation of its original shape, was then shrink-wrapped and organized to be sent to long-term storage in a facility hundreds of miles away. “So at the other end of our floor there would just be shelf after shelf after shelf of stacked, shrink-wrapped books waiting to go,” recalls Cushman. Another developer at LIL, Andy Silva, created custom software that helped manage the logistics of this process from start to finish, tracking the whereabouts of each book on their list, a benefit of having talented developers on staff.

In the end, it took a few years to compile digital files of all the case law. “That process of cutting and scanning gave us all case law in the form of images, but we still had to turn that into structured data because the images alone are worth very little relative to a database like Westlaw,” says Cushman. In fact, it was the next phase that demanded the bulk of the effort and, likewise, the outside funding. LIL worked with Innodata, a digital services company, to prepare the data using optical character recognition (OCR) with human checks for accuracy. This is the process that produced digitized text of each case while redacting all headnotes that were under copyright (such as those Westlaw added to the text themselves). While CAP’s metadata, which includes pieces like case name and date, was manually checked for accuracy, Cushman cautions that the precision of OCR technology is not at a level where it can be absolutely guaranteed without that kind of human oversight. To that end, it is worth quoting the caveat on CAP’s website that “the text of each case was left as raw OCR output.” What this accomplished, however, was law in computable form.

If you build it, developers will come

As mentioned above, CAP’s database went live to users in October 2018. Overnight, all U.S. case law that ever existed up to 2018 was made available online at no cost. It is important, however, to clarify what this did not mean. This did not mean that lawyers gained a free and comparable alternative to what had been otherwise paid services. It did not mean that CAP produced a search engine for its database geared toward computer novices. CAP was never about replacing any existing player; it is about creating a single, authoritative, and computable database of U.S. case law.

CAP’s October 2018 launch was more an invitation to programmers than it was to lawyers—though there is a movement to cultivate a broader overlap between the two (see “Computer Programming for Lawyers” below). The rollout took inspiration from an unlikely source: the Massachusetts Bay Transportation Authority (MBTA). “I was very influenced by the MBTA and how they approached real-time tracking for subway and bus arrivals,” says Cushman. He explains:

When they were taking on that problem, they realized that it was very cheap to collect the data and it was very expensive to share it with riders—to make apps and signs that could tell you when the next train was coming. So the first pass of their project was just to gather all the data and make an API (application programming interface) for developers. Because that was something they could get done in a relatively short period of time, they just said, “OK, here’s the data. If you’re a programmer, you can know when the next train is coming. Now let’s see what happens.” And then what they saw was a bunch of people made these different apps and signs to the point where you’d find private signs in shop windows that said when the next bus was coming. It was because the city had made that API available that someone could then make the sign for it.

This was essentially LIL’s strategy for rolling out CAP. The October launch was about exposing the data to developers—a data set that previously was either hidden behind paywalls or incomplete and unreliable. “The first things that we made were an API that will let people write programs that search case law and show results and bulk data that, if someone wants to just have all of the data on their computer, they can download it and use it,” explains Cushman. “Those are really targeted at that first sort of corps of people who make things for other people to make things.” The question now is what will developers make? LIL, for their part, are indeed making a front-facing search interface of their own—another of Silva’s projects. But, as Cushman emphasizes, there is nothing inherently special about what they will make apart from what anyone else could produce. It will all come from the same API to which everyone now has access.

Achievement unlocked

Once developers move the data past this initial layer with functioning programs and applications, the potential is limitless. To illustrate just some of what is possible with the data, Cushman and his team made a few applications of their own. They produced a limerick generator, which uses an index of case law according to rhythm, meter, and end rhyme to produce limericks composed of actual case law text. Here are just a couple of examples:

The earlier rules were repealed.
Her body was left in the field.
W. Knowles.
special controls.
The board of assessors appealed.

The city of Rochester had.
In Bunyan v. Mortimer, Madd.
Hudson and Buck.
storing the truck.
It is, therefore, probably bad.

“With the limerick generator, I’m trying to show that the case law is computable,” says Cushman. “And that means that you can search it in any way you can imagine, instead of only the way that the database vendor could imagine for you.” Indeed, what incentive was there for existing commercial databases to invest any effort into generating limericks? More important, what else can we do with this data if we overcome the limitations of these existing databases? CAP features other examples that get at this point. LIL has created GAVELFURY, which randomly draws on all instances where an exclamation mark (“!”) was used in case law, and Witchcraft in Law, which maps every appearance of the word “witchcraft” in case law. And there are others.
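To make that idea of searching "in any way you can imagine" concrete, here is a minimal Python sketch of the kind of query behind projects like GAVELFURY and Witchcraft in Law. It is not LIL's code. The sample records, the field names, and the crude sentence splitter are all assumptions made for illustration; real case text would come from CAP's API or bulk data.

```python
import re

# Hypothetical sample records standing in for real case text.
SAMPLE_CASES = [
    {"name": "Example v. Sample", "year": 1891,
     "text": "The motion is denied. What a remarkable claim! We affirm."},
    {"name": "In re Placeholder", "year": 1703,
     "text": "She was accused of witchcraft. The charge was dismissed."},
]

# A crude sentence pattern: a run of non-terminator characters plus one
# terminator. Abbreviations such as "v." would split real opinions wrongly.
SENTENCE = re.compile(r"[^.!?]+[.!?]")


def sentences_matching(cases, predicate):
    """Yield (case name, sentence) pairs whose sentence satisfies `predicate`."""
    for case in cases:
        for match in SENTENCE.finditer(case["text"]):
            sentence = match.group().strip()
            if predicate(sentence):
                yield case["name"], sentence


# Every sentence that ends in an exclamation mark.
exclamations = list(sentences_matching(SAMPLE_CASES, lambda s: s.endswith("!")))
# Every sentence that mentions the word "witchcraft".
witchcraft = list(sentences_matching(SAMPLE_CASES, lambda s: "witchcraft" in s.lower()))

print(exclamations)
print(witchcraft)
```

Because the search condition is an ordinary function, swapping in a different question is a one-line change, which a fixed vendor search box does not allow.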

The point is that the aggregation of all this data in computable form unlocks our ability to find answers to questions that might have once been inconceivable—or at least unconceived. For lawyers, that could mean a search functionality tailored to their specific needs. One example Cushman offers centers on contracts. If a lawyer wanted to compare their contract language to what has been used before, they might search all instances where case law quotes contract language. But this type of innovation, Cushman says, is not the most interesting possibility that CAP unlocks. “Someone’s going to figure out that saving lawyers money can eventually make them money, and when they crack that, there will be programmers who will build it.”

Cushman sees this comprehensive historical database as a means of exploring new questions that are not inherently legal in nature. In his contract example, one could imagine a researcher asking a less contextual question: How has contract language changed over time? This ability to measure the frequency with which certain terms were used in case law offers significant insights. For instance, LIL discovered something interesting about the term “victim”: it does not appear to have been used much at all in case law before the 1960s. But why? This question itself could domino into an entire research project that might never have otherwise existed. The scenario of the 1826 lawyer versus the 2002 lawyer that opened this article is itself something Cushman discovered through CAP’s database. Another revelation arose out of searching for the words most common to a particular era. In his case, it was the 1930s. Cushman explains:

I love to look at the 1930s, in particular, because my elementary-school-level understanding of the 1930s United States was “Dust Bowl” and “Great Depression” and “foreclosures.” That’s what I expected to see—a lot of foreclosure law in, say, California, where we looked. But that’s not at all what you find. It’s automobile accidents. It’s “trucks” and “highway” and “accident” and “collision,” and what we think is that the reason they’re talking about that in the 1930s is that’s when cars start to kill a lot of people; that, all of a sudden, you have this new source of accidental death where there’s two people involved, one of them died, someone has to pay for it, and the government needs to somehow allocate the loss. And it hadn’t been allocated yet, so the courts had to sort it out. And that’s a story that I didn’t know about California history, but that pops out when you start to look at the data in that way. And once you see that story, you can start to ask, “Well, how did different states navigate that? And what does that tell us about upcoming navigation of the problem with self-driving cars killing people?”
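The word counting by era described above can be sketched in a few lines of Python. This is an illustrative toy and not the analysis LIL ran. The sample records are invented, and a real study would also need to normalize by the total number of words per decade and filter out common words such as "the."

```python
import re
from collections import Counter, defaultdict

# Hypothetical records: (decision year, opinion text).
SAMPLE_CASES = [
    (1931, "The truck struck the plaintiff on the highway."),
    (1936, "The collision on the highway injured the driver."),
    (1968, "The victim identified the defendant."),
]

WORD = re.compile(r"[a-z]+")


def words_by_decade(cases):
    """Map each decade (e.g. 1930) to a Counter of lowercase words."""
    counts = defaultdict(Counter)
    for year, text in cases:
        decade = year - year % 10
        counts[decade].update(WORD.findall(text.lower()))
    return counts


counts = words_by_decade(SAMPLE_CASES)
print(counts[1930]["highway"])  # 2
print(counts[1960]["victim"])   # 1
print(counts[1930]["victim"])   # 0: a Counter returns 0 for unseen words
```

Calling `counts[1930].most_common(10)` would list the ten most frequent words of the decade, which is roughly the kind of view that surfaces "trucks" and "highway" once common words are filtered out.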

Here we start to see the application of CAP’s database provide real value in its potential to help solve problems now. The significance of the October 2018 launch is that now all the raw data is released from the confines of a paywall. Now, we are all only a good question (and a good developer) away from entirely new realms of insight.

Computer Programming for Lawyers

While the Caselaw Access Project is not overtly pitched to lawyers, the Library Innovation Lab takes an active interest in exposing law students to coding. During Harvard Law School’s winter term, Cushman and LIL’s director, Adam Ziegler, teach Computer Programming for Lawyers, a course intended for students with no prior programming experience. However, as Cushman explains, while the course may teach coding skills, it is less about the actual coding than it is about helping students become better lawyers with computational thinking. “We want students to see behind the curtain of technology, getting them ready for all of the change that’s coming to legal practice and to the world in general,” Cushman adds. The course description mirrors this emphasis:

Modern legal practice requires deep understanding of technology. Advocates must understand what it means at a technical level to “speak” online, to “sign” a digital contract, to “search” a computer, or to “delete” evidence. And law firms must understand what tasks can be most efficiently done by custom software and what are best left to human beings. This course teaches students to be effective computer programmers, and therefore to deconstruct and understand the technologies they might encounter throughout their careers. Students will learn basic computer programming skills using the programming language Python. We will then apply those skills to real-life legal scenarios drawn from the instructors’ own legal and programming experience, such as data-driven lobbying and statutory analysis, mass litigation automation, and electronic discovery.

For a course like this, CAP’s database comes in handy. For example, last year students practiced their programming skills by searching Illinois case law for all instances of dollar signs (“$”). Using only the API available, students were able to gather new information about the history of Illinois. In short, through this simple instance of programming, they were able to compile every mention of a dollar figure in Illinois case law, average these figures per year, and then graph how that amount changed over time. This exercise created an exponential curve that roughly corresponds to inflation—up until 1930, at which point the line goes flat. What this teaches students is that programming allows them to ask new, interesting, and complex questions that were perhaps previously unavailable to them. “That’s not fundamentally different from the limerick generator,” Cushman adds. “It’s a way of turning the case law into data.”
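A simplified version of that classroom exercise might look like the following Python sketch. It is not the course's actual code. The sample records and the dollar-figure pattern are assumptions, and no call to CAP's API is shown. The sketch only covers the steps described above: find every dollar figure, group by year, and average.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical records: (decision year, opinion text).
SAMPLE_CASES = [
    (1900, "Judgment for $500 and costs of $25.50."),
    (1900, "The note was for $1,000."),
    (1925, "Damages were assessed at $12,000."),
]

# A dollar sign, then digits with optional commas, then optional cents.
DOLLARS = re.compile(r"\$(\d[\d,]*(?:\.\d{2})?)")


def average_dollars_by_year(cases):
    """Return {year: mean of all dollar figures mentioned that year}."""
    amounts = defaultdict(list)
    for year, text in cases:
        for raw in DOLLARS.findall(text):
            amounts[year].append(float(raw.replace(",", "")))
    return {year: mean(values) for year, values in sorted(amounts.items())}


averages = average_dollars_by_year(SAMPLE_CASES)
print(averages)  # {1900: 508.5, 1925: 12000.0}
```

The resulting dictionary is all a plotting library needs to draw the year-by-year curve the students graphed.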

What is the endgame?

Part of CAP’s long-term mission is actually out of its control. “On the supply side, my dream would be that courts finish the transition to digital-first publishing that is more authoritative,” says Cushman. He elaborates:

It’s clear that the current thing that courts are doing—which is to put the case in some book that no one buys or reads and then let each database find their own different copy of it—is not fair to anyone. It’s not fair to lawyers or the litigants or the courts, and courts need to be publishing digital versions that are authoritative and signed and versioned. That way, we can make sure that everyone has the up-to-date thing that the court says is precedent. So that should happen for its own sake. And a nice side effect of that happening will be that a project like CAP won’t be relevant to that postauthoritative digital era, because the court itself will be saying, “This is our precedent in an authoritative digital form.” And CAP will just be one interpretation of that.

Insofar as CAP wants to see all U.S. case law digitized and computable—both retrospectively and prospectively—Cushman envisions a sort of “ragged edge” of case law that will need to be transformed in the future. CAP includes case law only up through June 2018, but different jurisdictions will commit to prospectively digitizing case law at different times. Once that happens, CAP will likely go back and fill in the gap. “We probably won’t run the project year by year to fill in case law as it happens,” says Cushman. “It’ll be more practical to go back after we think that we’re pretty much done. I think there will be a lot of value to that final round in having that one complete, authoritative data set.”

On the demand side, CAP’s mission to help all stakeholders—lawyers and others alike—see the law in new ways will continue well after they have completed the data set. “I imagine all this as one live pipeline—that if we could get this part working and get that part working, the whole thing will have value flying through it,” enthuses Cushman. “We’ve built this middle part. Now we just have to get the rest of it flowing, too.”