Avansic Whitepaper: Don't Be Scared of Analytics and Forensics in E-Discovery
07-28-2015, Avansic - Corporate
By now, legal professionals have been exposed to the concept of e-discovery and the possibilities that modern tools offer for review. These include advanced processes, methods, and tools such as analytics, concept clustering, predictive coding, and de-duplication, all of which make more targeted review possible. They can be daunting because of the terminology and the pace of technological change, but overcoming that fear means moving forward confidently into the wider world of e-discovery.

Analytics, the computational analysis of data to determine facts or patterns, is an incredibly powerful tool in e-discovery. It has become more affordable and easier to use, despite some lingering challenges around adoption and usability. Indeed, analytics can be used to find what you are looking for rather than merely to eliminate irrelevant items. Understanding concept clustering and predictive coding (and knowing the difference between the two) can help you determine the best circumstances in which to apply these sophisticated technologies.

The realm of forensics offers e-discovery help through collection protocols and metadata handling. Knowledge of forensics hashing and de-duplication methods can be a big advantage for legal professionals during e-discovery.

How to Use Analytics
There are several ways to use analytics to assist in e-discovery: “more like this” document location, organizing and prioritizing a review, and identification of exact or near duplicates for de-duplication purposes.

Finding additional hot documents that are “more like this” is akin to an online shopping tool recommending other products of interest. Although useful in any situation, this is a particularly valuable time-saver in larger cases. Once a document of interest is located, it can be used to find other documents that are similar in content. This can be extremely useful if you are not quite sure what you seek, or even whether it exists.

Analytics enables a user to better organize e-discovery review through grouping documents by ideas and concepts. Linear review becomes more efficient since the reviewer stays within the same topic and is not constantly shifting gears. Additionally, it allows for the use of “specialist” reviewers who may have specific knowledge of a certain document subject (for example, a drug interaction).

Analytics can also be used for text-based de-duplication, which includes both exact and near-dupe. This is not the same as forensics de-dupe, which is focused on exact duplication of the entire contents of a document. Text-based de-duplication uses processes that identify noise words or phrases, repeating text, white space, headers, and other textual data or formatting. Once this material is removed, the remaining text can be compared to determine whether there is duplication. If the text is the same and a human reviewer would see the same content, there is usually no reason to review duplicate versions, so the number of documents to be reviewed decreases. The exception would be context-based review for date, custodian, privilege, and so on.
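The exact-text de-dupe described above can be sketched in a few lines of Python. This is a minimal illustration, not a production process: the noise-phrase list, document IDs, and normalization steps are all invented for the example, and real platforms apply far more sophisticated text cleanup before comparing.

```python
import hashlib
import re

def normalize_text(text):
    """Strip formatting that should not count toward duplication:
    lowercase, drop a few example noise phrases (illustrative list,
    not exhaustive), and collapse white space."""
    noise = ["confidential", "attorney work product"]
    text = text.lower()
    for phrase in noise:
        text = text.replace(phrase, "")
    return re.sub(r"\s+", " ", text).strip()

def text_dedupe(documents):
    """Group documents whose normalized text is identical.
    `documents` maps a document ID to its extracted text; the return
    value maps a hash of the normalized text to the IDs sharing it."""
    groups = {}
    for doc_id, text in documents.items():
        key = hashlib.md5(normalize_text(text).encode("utf-8")).hexdigest()
        groups.setdefault(key, []).append(doc_id)
    return groups

docs = {
    "DOC001": "Quarterly report.\n\nCONFIDENTIAL",
    "DOC002": "Quarterly   report.",  # same text once formatting is removed
    "DOC003": "Annual report.",
}
groups = text_dedupe(docs)
```

Here DOC001 and DOC002 collapse into one group despite differing in case, spacing, and a stamped noise phrase, so only one of them needs review.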

For near-dupe, analytics leverages the results of exact de-dupe but allows for variance in word placement and length to identify similar documents that contain material differences. The software provides a “grade” for the similarity of the documents using complex machine learning and word proximity. While the removal of common artifacts is handled similarly across almost all analytics platforms, the grading system for determining near-dupes varies (and may be configurable). Unlike exact text de-dupe, near-dupe may require multiple iterations to produce the desired result. When used properly, this can be an effective tool to decrease the number of documents to review.
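A rough sense of near-dupe “grading” can be shown with Python's standard-library `difflib`. Commercial platforms use their own proprietary algorithms; the sequence-matching ratio below is only a stand-in to illustrate the idea of a configurable similarity score on a 0-100 scale.

```python
import difflib

def similarity_grade(text_a, text_b):
    """Grade the similarity of two documents on a 0-100 scale by
    comparing their word sequences. A toy stand-in for the proprietary
    grading a real analytics platform performs."""
    matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split())
    return round(100 * matcher.ratio())

# Two drafts of the same sentence with a material difference in dates.
a = "The merger closes on March 1 pending board approval."
b = "The merger closes on April 15 pending board approval."
grade = similarity_grade(a, b)
```

A threshold on this grade (say, 80 or above) would decide which documents are flagged as near-duplicates, and tuning that threshold is the kind of iteration the text describes.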

On a final note, analytics can be used to reduce the review burden, but it should be combined with a regimented, tested workflow that includes quality control and a strong emphasis on sampling.

How to Use Concept Clustering
Concept clustering leverages analytics to group similar documents together based on word proximity and similar phrases – it follows the “more like this” idea discussed above. It utilizes mathematical formulas to determine similarity based on the proximity of words, the similarity of phrases, and their placement within the document. The majority of concept clustering uses unsupervised machine learning. The result is a measure (on an arbitrary scale) of how similar documents are to each other, which is otherwise a difficult measurement to take.

Different tools may assign different levels of similarity to the same documents. However, once concepts have been noted, groups of similar documents can be created, which feeds directly into a more efficient review.
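The grouping idea can be sketched with a simple unsupervised approach: represent each document as a word-count vector, measure similarity with cosine distance, and greedily cluster. This is a deliberately minimal sketch; the documents, threshold, and clustering strategy are all invented for the example, and real concept-clustering engines use far richer models.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(documents, threshold=0.5):
    """Greedy single-pass clustering: each document joins the first
    cluster whose seed document it resembles, otherwise it starts a
    new cluster. A toy stand-in for real concept clustering."""
    vectors = {d: Counter(t.lower().split()) for d, t in documents.items()}
    clusters = []
    for doc_id, vec in vectors.items():
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

docs = {
    "A": "patent license agreement royalty terms",
    "B": "license agreement royalty payment terms",
    "C": "quarterly sales forecast spreadsheet",
}
clusters = cluster(docs)
```

Documents A and B share most of their vocabulary and land in one cluster; C, about an unrelated topic, starts its own. A reviewer could then work through each cluster without constantly shifting topics.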

How to Use Predictive Coding
Predictive coding is a workflow that leverages analytics, sampling, and sometimes concept clustering. Predictive coding uses different variants of supervised machine learning. Most common tools have workflows built into a review platform, and therefore this technology has become more accessible - it is no longer necessary to hire mathematicians to build models and sampling plans, because precision is monitored throughout the workflow.

The most common predictive coding workflow is as follows: select a sample of data, code that set, and use that set to train the algorithm to code the remaining documents. This is immediately followed by random sampling of the trained results to determine accuracy, and then repeating the process if the precision is insufficient.
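The seed-train-sample loop above can be sketched in Python. To keep the example self-contained, the “learner” here is a deliberately trivial keyword model; every function name, document, and coding decision below is invented for illustration, and a real platform's supervised classifier is far more capable.

```python
from collections import Counter

def train(seed_coding):
    """'Train' on a coded seed set by collecting words that appear in
    responsive documents but never in non-responsive ones. A toy
    stand-in for the supervised learner a real platform uses."""
    responsive, non_responsive = Counter(), Counter()
    for text, is_responsive in seed_coding:
        (responsive if is_responsive else non_responsive).update(text.lower().split())
    return {w for w in responsive if w not in non_responsive}

def predict(model, text):
    """Code a document as responsive if it contains any model word."""
    return any(w in model for w in text.lower().split())

def sample_precision(model, reviewed_sample):
    """Estimate precision from a random sample of predictions that a
    human reviewer has checked: of the documents the model coded
    responsive, what fraction truly are?"""
    predicted = [(t, truth) for t, truth in reviewed_sample if predict(model, t)]
    if not predicted:
        return 0.0
    return sum(truth for _, truth in predicted) / len(predicted)

# Step 1-2: select and code a seed set.
seed = [("merger negotiation timeline", True),
        ("cafeteria lunch menu", False)]
# Step 3: train the model on the coded seed set.
model = train(seed)
# Step 4: randomly sample the trained results and check precision;
# if precision is insufficient, the seed set is expanded and the
# process repeats.
reviewed_sample = [("merger closing date", True),
                   ("new lunch options", False)]
precision = sample_precision(model, reviewed_sample)
```

The loop terminates when the sampled precision meets the target; otherwise more documents are coded and the model is retrained, exactly as the workflow describes.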

Forensics Collection
Although it may seem that forensics is not necessary in a run-of-the-mill e-discovery project, forensics collection and a proper chain of custody can help during document review.

A true forensics, bit-by-bit copy of native documents is the best way to preserve the metadata in those documents for retrieval at any time. In fact, better tools may exist in the future to extract metadata from native documents. However, certain types of metadata that do not exist within the native file content must be preserved at the time of collection, such as the date the file was created, where the file was located, what computer it was located on, and the original file name. If collection is not performed in a forensics manner, this data may be lost or altered without detection, and that information can be useful during document review.

Forensics De-dupe
An MD5 hash is very useful for determining whether two pieces of data, such as files, are exactly the same. From a computer science perspective, MD5 is a one-way mathematical function that takes a string of data as input and yields an ostensibly unique value.

In general, MD5 provides a value that acts as a “fingerprint” for the data. If two MD5 hash values are the same, then for all practical purposes the data that produced them is the same. In the forensics context, MD5 hash generally means the hash of the content of the data, excluding operating-system-level metadata such as created date, filename, and file path. In e-discovery processing, MD5 hash has many different meanings; in some cases a program may choose to hash only a portion of the data, such as the values from an email header.
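The fingerprint behavior is easy to demonstrate with Python's standard-library `hashlib`. The file contents below are made up for the example; the point is that the hash depends only on the content, not on the filename, path, or dates.

```python
import hashlib

def md5_of_content(data: bytes) -> str:
    """Forensics-style MD5: hash the file's content only.
    OS-level metadata (filename, path, created date) plays no part
    in the value, so renamed or relocated copies still match."""
    return hashlib.md5(data).hexdigest()

# Two "files" with different names but identical content hash the same.
report_v1 = b"Q3 revenue summary"
copy_in_other_folder = b"Q3 revenue summary"
same = md5_of_content(report_v1) == md5_of_content(copy_in_other_folder)

# A single-character edit yields a completely different fingerprint.
edited = b"Q4 revenue summary"
changed = md5_of_content(report_v1) != md5_of_content(edited)
```

This is why the question “an MD5 hash of what?” matters: hashing the full content, a text extraction, or only an email header each produces a different, non-comparable fingerprint.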

MD5 hash is a very powerful tool in review. For instance, use of MD5 hash doesn't just allow the identification of duplicates of user-created documents; it also allows the automatic exclusion of well-known files, such as operating system files and application executables.

When producing data, an MD5 hash is absolutely necessary in order to prove that what was produced is the same as what was received. Caution is needed to determine exactly what data the hash was computed over; when receiving a hash in a loadfile, always ask, “an MD5 hash of what?” since the hash might cover only a portion of the data.

E-discovery review can be helped tremendously by analytics and forensics processes. Specifically, there is rarely a case where analytics and concept clustering are not useful in e-discovery review. Predictive coding is typically helpful where there are unique requirements of time and volume. Hybrid models combine supervised and unsupervised learning and automate the processes, so little configuration or knowledge is required of the user. Knowledge of these tools gives the user the confidence to consider their usefulness for future projects.