Waving the Predictive Coding Magic Wand
07-19-2012, Avansic - Corporate

There has been much discussion about predictive coding in e-discovery circles in the past year. The Wall Street Journal recently published an article on this very subject titled “Why Hire a Lawyer? Computers are Cheaper.”

Overview of the Process
First, what exactly is predictive coding? Given a large set of documents, attorneys review and code a small sample set according to their own specifications, and the pattern of their coding decisions is then applied to the larger document set using a “content clustering” or “find similar” calculation. This makes it possible to review a large volume of documents without attorneys spending thousands of hours putting eyes on each page. Predictive coding is used as a supplement to traditional e-discovery filtering methods such as keyword searching.
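To make the idea concrete, here is a minimal sketch of one way a reviewed sample could be applied to the rest of a document set. It is only an illustration under stated assumptions: it uses bag-of-words cosine similarity and copies the code of the single closest reviewed document, which is not necessarily what any commercial predictive coding engine does. The function names, the sample documents, and the “responsive” / “not responsive” codes are all hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a document into a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def propagate_codes(reviewed, unreviewed):
    """Give each unreviewed document the code of its most similar reviewed one.

    reviewed:   list of (text, code) pairs coded by attorneys (hypothetical data).
    unreviewed: list of document texts.
    Returns a list of (text, predicted_code, similarity) tuples.
    """
    sample = [(vectorize(text), code) for text, code in reviewed]
    results = []
    for text in unreviewed:
        vec = vectorize(text)
        best_code, best_score = None, -1.0
        for sample_vec, code in sample:
            score = cosine(vec, sample_vec)
            if score > best_score:
                best_code, best_score = code, score
        results.append((text, best_code, best_score))
    return results

# Hypothetical attorney-coded sample set.
reviewed = [
    ("merger agreement draft pricing terms", "responsive"),
    ("office holiday party lunch menu", "not responsive"),
]
# Hypothetical remainder of the document set.
unreviewed = [
    "revised pricing terms for the merger agreement",
    "lunch menu for the holiday party",
]

predictions = propagate_codes(reviewed, unreviewed)
for text, code, score in predictions:
    print(f"{code:15s} {score:.2f}  {text}")
```

Note that the similarity score is kept alongside each predicted code. In practice a low score would be a reasonable signal to route a document back to an attorney rather than trust the propagated code.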

The variables in this process include the selection of documents for the sample set, user response to the sample set, and the computer algorithms applied. For instance, poor user input will result in poor sample set results, which would then be propagated to the remainder of the document set. The algorithms themselves may learn “on the fly” or may be pre-calculated, depending on the software and process.

There are several different ways that a user can be led through a set of documents in predictive coding: “choose your own adventure,” spokes on a wheel, purely random, or secret black box. “Choose your own adventure” is a style where the computer algorithm adjusts to your responses on the fly, changing the next document that might appear based on your previous input. Spokes on a wheel is a pre-defined set that covers all document clusters, from which users can dive deep into whatever interests them. Purely random selection, while statistically sound, does not generate the most useful set. Secret black box is where only the developer of the computer algorithm knows what functions are being performed.
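The difference between two of these styles can be sketched in a few lines. The example below contrasts a purely random sample with a “spokes on a wheel” style sample that takes one document from every cluster. It assumes the clusters have already been computed by some other step; the cluster names, document IDs, and function names are hypothetical, and the point about small clusters is only one possible reason a random sample may be less useful.

```python
import random

def random_sample(doc_ids, k, seed=0):
    """Purely random selection of k documents (seeded so runs are repeatable)."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, k)

def cluster_coverage_sample(clusters):
    """'Spokes on a wheel' style: take one document from every non-empty cluster."""
    return [docs[0] for docs in clusters.values() if docs]

def clusters_covered(sample, clusters):
    """Return the names of the clusters that have at least one document in the sample."""
    chosen = set(sample)
    return {name for name, docs in clusters.items() if chosen & set(docs)}

# Hypothetical pre-computed clusters of document IDs.
clusters = {
    "contracts": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "invoices": ["i1", "i2", "i3"],
    "hr": ["h1"],
}
all_ids = [doc for docs in clusters.values() for doc in docs]

spokes = cluster_coverage_sample(clusters)
rand = random_sample(all_ids, k=len(spokes))

print("spokes sample:", spokes, "covers", sorted(clusters_covered(spokes, clusters)))
print("random sample:", rand, "covers", sorted(clusters_covered(rand, clusters)))
```

The cluster-based sample touches every cluster by construction. A random sample of the same size carries no such guarantee, so a small cluster such as the single-document “hr” group can easily go unrepresented.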

Product vs. Process
The WSJ article highlights one of the main problems that arise when discussing predictive coding – that it is presented as a product. In truth, predictive coding isn't just a piece of software that can be layered over other e-discovery processes. It is a process that combines attorneys and technology. At its core, predictive coding is about workflow.

In fact, keyword searches, sampling, and early culling could be considered a type of predictive coding since they help pare down the document set. This is particularly true in the early stages of e-discovery.

Because predictive coding is a methodology and not just a piece of software, it can be integrated into any review tool – either modern software like Concordance, Summation or LAW, or legacy products. There are a number of different predictive coding engines and algorithms available, and many existed before the term “predictive coding” came into use. In fact, near-dupe technology, otherwise known as “find similar,” is heavily used in predictive coding to calculate the logical clusters and groups in the remainder of the set.
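As a rough illustration of what a “find similar” calculation can look like, the sketch below compares documents by their overlapping three-word sequences (shingles) and scores the overlap with Jaccard similarity. This is one common textbook approach to near-duplicate detection, not the method of any particular product; the function names, the shingle size, the 0.5 threshold, and the sample text are assumptions chosen for the example.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences ('shingles') in a document."""
    words = text.lower().split()
    if len(words) < size:
        # Very short documents become a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def find_similar(query, documents, threshold=0.5):
    """Return the documents whose shingle overlap with the query meets the threshold."""
    query_shingles = shingles(query)
    return [
        doc for doc in documents
        if jaccard(query_shingles, shingles(doc)) >= threshold
    ]

# Hypothetical documents: the first differs from the query by one word.
query = "please review the attached merger agreement before friday"
documents = [
    "please review the attached merger agreement before monday",
    "the quarterly sales report is attached",
]

near_dupes = find_similar(query, documents)
print(near_dupes)
```

Here the first document shares five of seven distinct shingles with the query and is returned as a near-duplicate, while the second shares none. Raising or lowering the threshold trades off how aggressively documents are grouped together.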

Predictive coding can be an excellent way to reduce the number of documents that attorneys need to review. This is increasingly necessary as document sets grow while clients become less willing to pay for review.

One of the attractions of predictive coding is the ability to streamline first pass review. The processes listed above, in combination with the efficient use of technology, give results similar to those attributed to predictive coding. Attorneys are a critical part of this equation, along with use of the right technology.

In a project where predictive coding is used, processing costs will increase but there will be substantial savings in the review phase. This assumes the project was well-planned and managed from the outset and that the process and technology were used appropriately.

Of course, predictive coding can become more expensive than necessary if the sample sets are poorly developed, if the data set includes documents that do not contain natural language (such as graphics), or if the pricing model includes per-month gigabyte charges for hosting. In a predictive coding project, there will always be more data hosted in a review platform than in a keyword-based review. Project management of predictive coding is critical, since it is a process and not just an “easy button.” There may be several run-throughs of the sample set in order to achieve the desired results, so it is important to set appropriate expectations.

There is a substantial cost difference based on whether the data is processed in advance or on-the-fly. Processing and clustering in advance results in lower costs and a fixed error rate but requires multiple passes at the sample set. Allowing the adjustment of the error rate or heuristic in real time requires a large amount of processing power and a more complex algorithm; this results in higher costs and requires “middleware” between review technology and the predictive coding engine.

Technology and Efficiency
The thought behind predictive coding – that technology can help reduce the cost of e-discovery – is a great one. Figuring out what pieces of technology to apply at what point in the workflow may not be as easy. Having consultants or other attorneys seasoned in e-discovery analyze your workflow or case (particularly for large projects) can save enormous amounts of time and money. Ultimately, predictive coding is not just about using technology, but about using it well and using it at the right time.