November 25, 2020

5 Things You Should Know About Predictive Coding

1. What It Is

Predictive coding is another tool in your e-Discovery arsenal. More specifically, it is software or a service that takes a large set of documents and, with relatively little human input, codes or ranks them for you. Commonly, a predictive coding engine works by having a top-level reviewer or attorney look over a seed selection of documents and determine whether each one is relevant or applies to a certain issue. (Usually a human review of somewhere between 1,800 and 2,500 documents is enough to teach the system to auto-rank an unlimited number of remaining documents.) The predictive coding system then analyzes the seed examples and identifies references in the text, such as people, concepts, places, products and materials, to generate rules that will find further concepts of the same type. The system then uses mathematical algorithms to apply these rules across the entire universe of documents (again, this number is unlimited) and ranks or codes them correspondingly. A firm may then use the auto-coded document set as-is for production (meaning no more human eyes view the documents) or treat the machine rankings as a guideline, still performing a human review on the ranked documents but in a much more targeted way. For example, a firm could place its best reviewers on the most highly ranked documents and lower-level reviewers on less relevant or applicable documents. Either way, the system saves firms time and money.
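To make the seed-then-rank workflow concrete, here is a deliberately tiny Python sketch. It is not how any particular vendor's engine works: the tokenizer, the smoothed log-odds term weights, and the function names (`train`, `score`, `rank`) are all assumptions chosen for illustration, and the four "seed" documents stand in for the thousands a real reviewer would code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a crude, illustrative tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def train(seed_docs):
    """Learn one weight per term from human-coded seed documents.

    seed_docs is a list of (text, is_relevant) pairs. The weight is a smoothed
    log-odds: positive if the term appears mostly in relevant seed documents,
    negative if it appears mostly in non-relevant ones. (Hypothetical scoring
    rule, used only to illustrate the idea.)
    """
    rel, non = Counter(), Counter()
    for text, is_relevant in seed_docs:
        (rel if is_relevant else non).update(set(tokenize(text)))
    n_rel = sum(1 for _, is_relevant in seed_docs if is_relevant)
    n_non = len(seed_docs) - n_rel
    return {
        term: math.log((rel[term] + 1) / (n_rel + 2))
        - math.log((non[term] + 1) / (n_non + 2))
        for term in set(rel) | set(non)
    }


def score(weights, text):
    """Sum the weights of the distinct known terms in a document."""
    return sum(weights.get(term, 0.0) for term in set(tokenize(text)))


def rank(weights, docs):
    """Return every document, most likely relevant first. Nothing is removed."""
    return sorted(docs, key=lambda doc: score(weights, doc), reverse=True)


# Made-up seed set coded by a reviewer: True means relevant.
seed = [
    ("merger pricing agreement with Acme", True),
    ("pricing memo for Acme merger", True),
    ("lunch menu for Friday", False),
    ("office party on Friday", False),
]
unreviewed = ["Friday lunch plans", "Acme merger pricing draft", "parking notice"]

weights = train(seed)
ranked = rank(weights, unreviewed)
for doc in ranked:
    print(f"{score(weights, doc):6.2f}  {doc}")
```

Note that `rank` returns the whole set in a new order. A firm could produce from the top of that list or simply use it to decide which reviewers see which documents first.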

2. What It’s Not

Because predictive coding technology is new and not well understood, there is some confusion in the marketplace about what it actually is. Many firms claim to offer auto-coding or predictive coding services that actually consist of clustering, culling, categorizing, or threading. While these processes are a valuable part of e-Discovery, they do not constitute a true predictive coding solution. Clustering or categorizing, which consists of grouping information or pointing the reviewer to similar content, is probably the process most commonly confused with predictive coding. However, the reviewer must still look at every document once it’s been clustered or categorized, whereas a predictive coding solution removes the need to place human eyes on each document. Culling, as an e-Discovery tool, actually removes documents from a set. Rather than being synonymous with predictive coding (which does not remove documents but merely ranks unresponsive documents low), culling should be used in conjunction with it. Threading, which presents emails as conversations rather than as random individual documents, is another tool that can be used in concert with (but should not be mistaken for) predictive coding. Email threading simply helps top-level reviewers make the best possible analysis of documents ranked highly by the predictive coding system.

3. What It’s Called (Hint: It’s Not Actually “Predictive Coding” Yet)

Closely related to the above-described confusion in the marketplace is the fact that “Predictive Coding” currently goes by many names. When you are researching these types of solutions, look for the following terms, as they also represent the concepts described in this post.

  • Prioritizing Software
  • Suggestive Coding
  • Automated Review
  • Auto-Categorization
  • Automated Relevance Designation
  • Relevance Ranking
  • Intelligent Case Assessment

Presumably the market will settle on one of these terms eventually, but until that happens all of the above should be considered when researching this product set.

4. That It’s Defensible

A recent e-Discovery Journal poll showed what many critics of predictive coding have been asserting: that lawyers will fail to adopt the technology for fear of inadequate defensibility. However, these fears are outdated and can be assuaged with some explanation. First and foremost, the current standard of defensibility in document review consists of human review and keyword searching, both of which have been shown to be highly unreliable. A 2008 TREC study analyzing the success of keyword searching indicated that, on average, “Boolean keyword search found only 24% of the total number of responsive documents in the target data set.” Since this is the current court-accepted standard, it’s the only one you have to beat when defending a predictive coding solution. Because most predictive coding products offer ample ability to sample, as well as transparency and full auditing capability, this shouldn’t be hard to do. Moreover, more formal studies of predictive coding are currently under way. The ongoing 2010 TREC Legal Track study, which aims to measure the effectiveness of predictive coding tools, showed numbers well above the 24% level even with the least effective products, and concluded that “the assumption that manual review is more effective than technology-assisted review is not necessarily valid.” Previous information retrieval studies have shown again and again how inconsistent and flawed human review is. Since predictive coding only has to beat that standard to come out ahead, the bar is not actually that high. To paraphrase Judge Paul Grimm in Chris Dale’s December 23, 2010 article, technology will always outpace us, and someone just needs to have the courage to go first.
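The 24% figure is a recall number: of all the responsive documents in the set, what fraction did the method find? The short Python sketch below shows the arithmetic, plus one common way sampling can be used to estimate recall when nobody has reviewed the whole set. The function names and every count in the example are hypothetical, chosen only to illustrate the calculation. They are not taken from the TREC studies or from any product.

```python
def recall(found_responsive, total_responsive):
    """Fraction of all responsive documents that the method actually found."""
    if total_responsive <= 0:
        raise ValueError("total_responsive must be positive")
    return found_responsive / total_responsive


def estimate_recall(found_responsive, unretrieved_count, sample_size, sample_responsive):
    """Estimate recall by sampling the documents the method did NOT retrieve.

    A reviewer codes a random sample of the unretrieved documents. The rate of
    responsive documents in that sample is projected onto the whole unretrieved
    pile to estimate how many responsive documents were missed.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    estimated_missed = unretrieved_count * sample_responsive / sample_size
    return found_responsive / (found_responsive + estimated_missed)


# Hypothetical keyword search: 240 of 1,000 responsive documents found.
keyword_recall = recall(240, 1000)

# Hypothetical ranked review: 900 responsive documents found; a 500-document
# sample of the 50,000 unretrieved documents turns up 1 responsive document.
ranked_recall = estimate_recall(900, 50000, 500, 1)

print(f"keyword search recall: {keyword_recall:.0%}")
print(f"ranked review recall (estimated): {ranked_recall:.0%}")
```

A real defensibility argument would also report a confidence interval around the sampled estimate. This sketch leaves that out to keep the arithmetic visible.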

5. That It’s the Next Big Thing

Predictive Coding topped most of the 2011 e-Discovery “next big trend” prediction lists, and with just cause. The rising tide of electronic discovery is in no way abating, and cost and time savings are becoming of the utmost importance to litigants. With promptings from the bench to use technology above and beyond the keyword search, the natural next step is to consider a type of automated review. Once someone ‘goes first’ and has the technology judicially sanctioned, the mad rush will be on to play catch-up. In the meantime, the march forward to automated review seems pretty well inexorable.

Caitlin Murphy

Caitlin Murphy is Director of Marketing for the AccessData Group, where she manages all aspects of legal marketing and consults on product design for the AD Summation line. She is a product and industry expert as well as an attorney and member of the California State Bar. Before joining AccessData, Caitlin spent five years working for CT Summation as a product evangelist in both San Francisco and London. Caitlin raised Summation’s brand profile by making numerous presentations to all levels of American and European legal professionals and by conducting over 50 thought leadership seminars in 30 states. Prior to entering the e-Discovery field, Caitlin practiced civil litigation with the San Francisco Bay Area law firms Kazan, McClain, Lyons, Greenwood & Harley and Lieff, Cabraser, Heimann & Bernstein. She received her J.D. from the University of California Hastings College of the Law and holds a B.A. in United States History from the University of California at Davis.
