January 25, 2021

The Machine Learning/Predictive Coding Silver Bullet

It is my great hope that the title of this blog oozes sarcasm, but my fear is that it does not. The e-discovery market is definitely not immune to the overzealous efforts of marketing departments, and that dynamic is on full display with the technology of predictive coding. Every marketing piece I have ever read and every webinar I have ever seen on the subject has been full of fallacies and misrepresentations. Not only has this led the market to a state of pure confusion, but it has also wildly oversold the value proposition. To be clear, predictive coding is a powerful tool, but users need to understand it before it can be of any real value, and it seems clear that the majority of e-discovery vendors are intent on hiding behind confusion rather than shedding light on both the strengths and weaknesses of the technology.

The first and most important thing to understand about predictive coding, or “technology assisted review” (TAR), is that it’s simply the application of a mathematical algorithm to a set of data. That algorithm is written by people who don’t understand the case at hand, don’t understand the nuances of the subject matter, and are working under constraints so that the algorithm can run on standard hardware in a reasonable amount of time. The importance of those simple facts cannot be overstated. Try to put yourself in the mind of someone attempting to write such an algorithm. How would you go about it? How would you distinguish between content? Well, the answer that everyone has effectively come up with is the obvious one: keywords. In fact, these algorithms do little more than attempt to distinguish between documents by determining the frequency of specific words and their context within the content under analysis. Two of the most common algorithms used are:

  1. Latent Semantic Indexing (LSI) or decision trees: LSI algorithms take a set of pre-coded documents and construct what amounts to a decision tree that will quickly sort documents into responsive and non-responsive groups based on keywords within the document. For example, if we have a wrongful termination case and have ten documents, five of which are responsive and five of which aren’t, the algorithm will look at all ten docs and determine whether there is a keyword that distinguishes the two groups. If such a keyword exists, the algorithm will stop there and the entire decision tree will end up being a single word. However, if no single word exists, it will pick the word that best divides them and then look at the two divided groups separately to determine what words best divide them. The process continues until the decision tree is complete. LSI can also create clusters without a pre-coded document set, but these may be completely irrelevant to the matter under review.
  2. Naive Bayes/Probabilistic Latent Semantic Analysis algorithms: Another common algorithm type is one that measures the frequency of words within documents and then categorizes them based on those frequencies. For example, if the word “terminate” occurs in all of the responsive documents, it receives a high weighting. The documents are then effectively given scores based on the words they contain, and if the score is high enough the document is considered responsive. This approach doesn’t exactly equate to a keyword search, but in practice it ends up being very close. These types of algorithms are also able to group or cluster documents based on content in the absence of taxonomies or other category information, but again the resulting clusters may have no value to the case under review.
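
To make the first item concrete, here is a minimal sketch in Python of the keyword-splitting procedure described above. It is an illustration under simplifying assumptions, not any vendor's implementation: the function names, the four-document seed set, and the "count the documents sorted correctly" scoring rule are all hypothetical choices made for this example.

```python
from collections import Counter

def words(text):
    """Lowercase a document and return its set of distinct words."""
    return set(text.lower().split())

def best_split_word(docs):
    """Pick the word whose presence best separates the two coded groups.

    docs is a list of (text, is_responsive) pairs. Words found in every
    document are skipped because they cannot divide anything.
    """
    doc_words = [words(text) for text, _ in docs]
    vocab = sorted(w for w in set().union(*doc_words)
                   if not all(w in dw for dw in doc_words))

    def correct(word):
        # Documents sorted correctly by "has word => responsive" or its inverse.
        hits = sum((word in words(text)) == label for text, label in docs)
        return max(hits, len(docs) - hits)

    return max(vocab, key=correct, default=None)

def build_tree(docs):
    """Split recursively until each group holds a single coding decision."""
    labels = [label for _, label in docs]
    if len(set(labels)) == 1:
        return labels[0]                      # leaf: everyone agrees
    word = best_split_word(docs)
    if word is None:                          # nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    has = [d for d in docs if word in words(d[0])]
    lacks = [d for d in docs if word not in words(d[0])]
    return (word, build_tree(has), build_tree(lacks))

def classify(tree, text):
    """Walk the tree: at each node, branch on whether the keyword is present."""
    while isinstance(tree, tuple):
        word, if_present, if_absent = tree
        tree = if_present if word in words(text) else if_absent
    return tree

# Hypothetical seed set for a wrongful termination matter.
seed = [
    ("please terminate his contract", True),
    ("we will terminate employment friday", True),
    ("lunch menu for friday", False),
    ("quarterly sales report", False),
]
tree = build_tree(seed)
```

On this seed set a single keyword separates the two groups, so the whole tree collapses to one word, exactly the situation described in the first item. Reading the tree back out as nested "contains X / does not contain X" tests is what makes it equivalent to a Boolean search.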

It’s important to remember that these algorithms effectively attempt to approximate a standard search query. In fact, in the case of LSI the result is nothing more than a complex Boolean search. The challenge with that approach is evident in that the algorithms require a human to code some number of documents to meet a statistical probability of accuracy within a desired confidence interval. The algorithm attempts to mathematically decipher why the human made the coding decisions for the document set under analysis. Proponents of these technologies use fancy phrases like “machine learning” to suggest the computer (algorithm) can see some intricate, unexposed and complex relationship between documents that humans aren’t capable of recognizing. But it was a human who coded the documents in the first place, so any relationship is already known by the human, and if it is not, that relationship is just as likely to be a coincidence as it is a meaningful insight.

The other key point to understand about these technologies is that, despite marketing claims, they don’t possess an increasing level of insight given an increasing volume of seed documents. Seed documents are those documents a human has coded and fed to the algorithm to process. The algorithms are designed to find documents that look like documents they have already seen. The more unique documents fed to the system, the more categories will be identified, but the algorithm can’t uncover something new; it simply makes more associations. At no point will any of these technologies be able to code documents that are responsive yet unlike any documents they have already seen. As an example, take a wrongful termination suit in which a human has coded documents relating to the terminated employee’s performance. The predictive coding algorithm will not find the distinct email where the employee threatens a coworker. The algorithm won’t have seen any coded documents that relate to a death threat, so it won’t have any idea that “kill” is perhaps the single most important keyword in the case.
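
The blind spot is easy to reproduce. Below is a rough sketch of a word-frequency scorer in the spirit of Naive Bayes, written in Python with add-one smoothing. The seed documents, function names, and the decision to ignore class priors are assumptions made for illustration; real products are more elaborate, but the limitation shown here is the same in kind.

```python
import math
from collections import Counter

def train(seed):
    """Learn a log-odds weight per word from (text, is_responsive) pairs.

    Positive weights lean responsive, negative weights lean non-responsive.
    Add-one smoothing keeps a word seen in only one group from producing
    an infinite weight.
    """
    counts = {True: Counter(), False: Counter()}
    for text, label in seed:
        counts[label].update(text.lower().split())
    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(c.values()) + len(vocab) for label, c in counts.items()}
    return {
        w: math.log((counts[True][w] + 1) / totals[True])
           - math.log((counts[False][w] + 1) / totals[False])
        for w in vocab
    }

def score(weights, text):
    """Sum the weights of known words; unseen words contribute nothing."""
    return sum(weights.get(w, 0.0) for w in text.lower().split())

# Hypothetical seed set: the reviewer only coded performance documents.
seed = [
    ("poor performance review for smith", True),
    ("smith missed performance targets again", True),
    ("cafeteria menu for monday", False),
    ("parking garage closed monday", False),
]
weights = train(seed)

similar = score(weights, "smith performance is slipping")  # looks like the seed
threat = score(weights, "i will kill you")                 # looks like nothing seen
```

The document that resembles the seed set gets a positive score, while the threat scores exactly zero: none of its words were ever coded, so the model has no opinion about it at all. Adding more performance documents to the seed set would not change that.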

When tested objectively, these algorithms produce mixed results at best, which explains why many users ultimately become frustrated that the technology has failed to live up to the marketing hype. Many vendors continued to loudly beat their own drums with sometimes wildly disparate solutions at ILTA this year. Most attendees were frankly a bit glazed over by the marketing message and bored with the topic in general. Although recent surveys have shown more adoption of predictive coding technologies, just as many subject experts have become exceedingly skeptical.

The limitations of machine learning or predictive coding notwithstanding, the technology is in fact very powerful when applied to e-discovery and should be an important part of any large-scale legal review. The key to effective use is to employ it in such a way as to take full advantage of the strengths and not fall victim to the inherent limitations. If you go through and code ten documents that are responsive to a specific topic, predictive coding is very good at constructing a complex (not magical) search that can find those ten documents and other similarly responsive documents. That might not seem all that powerful, but if you are faced with coding 300,000 documents across 10 reviewers, it can become very compelling as a way to shorten review times and reduce e-discovery costs. While rarely discussed, the technology is equally powerful at weeding out unresponsive, non-privileged or non-confidential documents. Simply apply the workflow to code for non-responsiveness and large numbers of irrelevant documents can potentially be eliminated from the review set. Applying both of these use cases can greatly magnify the efficiency of a review, but neither represents a silver bullet for finding a smoking gun. In fact, no technology can help you find a smoking gun unless you have already coded a document that resembles a smoking gun.

The other important element of a predictive coding algorithm, and of its ultimate wide-scale adoption, is that it must be clear why and how coding decisions are being made. The single largest issue I see with the predominant products is that few allow practitioners to peek inside the magical black box that contains the technology. Not only do users need a way to peek inside, but it should be very clear to them what the technology is doing and why the technology is doing it. If a product uses a decision tree to code documents, it should, at a minimum, provide the specific search for each responsive document. This not only increases the transparency of the tool but also leads to greater confidence in it from the user, opposing counsel and the court system. If, after coding many documents, the algorithm determines, incorrectly, that the word “Tuesday” is important, it needs to convey that information in such a way that the reviewer can either reject all documents coded because they contain “Tuesday” or choose to accept them. Either way, the technology must provide the users with enough information to assess the results it has provided.
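
What that kind of transparency could look like is not complicated. The Python sketch below assumes a scorer that assigns a weight to each word and sums them; the weights, the sample document, and both function names are hypothetical, chosen only to mirror the “Tuesday” scenario above.

```python
from collections import Counter

def explain(weights, text, top=3):
    """List the words that drove a document's score, largest effect first."""
    occurrences = Counter(text.lower().split())
    contributions = {w: weights[w] * n
                     for w, n in occurrences.items() if w in weights}
    ranked = sorted(contributions.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:top]

def score(weights, text, rejected=frozenset()):
    """Sum word weights, skipping any words the reviewer has rejected."""
    return sum(weights.get(w, 0.0)
               for w in text.lower().split() if w not in rejected)

# Hypothetical weights: the tool has wrongly decided "tuesday" matters.
weights = {"tuesday": 2.0, "terminate": 1.5, "lunch": -1.0}
doc = "team lunch moved to tuesday"

reasons = explain(weights, doc)                      # why was this coded responsive?
before = score(weights, doc)                         # score as the tool sees it
after = score(weights, doc, rejected={"tuesday"})    # score once the reviewer objects
```

Here the explanation puts “tuesday” at the top of the list, and rejecting that one word flips the document from a positive score to a negative one. The point is not this particular scoring scheme; it is that the reviewer can see the reason and act on it.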

Unfortunately, too many vendors in this space see predictive coding as the key feature towards booking business for the next four quarters.  They aren’t about to pull the shiny orb off the pedestal they have placed it on. Despite all my ranting, I predict the next piece of marketing I read on predictive coding will have a title closer to “Predictive Coding is the Silver Bullet to Vanquish Away E-Discovery Woes” than “Predictive Coding:  A Powerful and Beneficial Tool When Properly Employed.”

Tim Leehealey

Tim Leehealey is Chairman and CEO of AccessData. Prior to joining AccessData he was VP of Corporate Development at Guidance Software. Prior to that he was an investment banking analyst covering the security market at Wedbush Morgan.



  1. Justin Scranton says:


    Good article and some excellent points. In fact, I just told a colleague that unless a “smoking gun” document is part of your seed set, it won’t likely be found because of how LS indexes are built (and weighted).

    As a side project, I have been looking into the “magical black box” which for me is not the predictive algorithm but the index itself. I have been building multiple indexes over the same corpus using different–and all accepted–transformation techniques. My results suggest that how one reduces the dimensions of a particular array has a significant impact on later algorithmic clusters.

    In another industry (web indexing), search engines re-index their corpus (the entire world wide web) on a daily basis in order to ensure that their index is producing appropriate recall and precision. The transformation techniques (and thus the root LSI) are constantly being tweaked to improve both recall and precision (which are perhaps more critical to a search engine). In contrast, in e-discovery once the index is built, only the algorithms are being tweaked (someone correct me if I am wrong).

    That is not to say that LSI and PC are “bad” but just to reiterate that if the process is to be defensible, then e-discovery professionals (read: counsel) are going to have to learn that the “magical black box” cannot always be trusted.

