July 12, 2014

Cluster Analysis 101 – Or, “This is Not Your Mom’s Summation”

If you haven’t noticed yet, I tend to address most of my blog posts to the IBlaze crowd: the tireless paralegals and lit-support staffers working in the law firm trenches. That’s because I feel most comfortable talking to that group, since I used to be an internal lit support manager myself and this is basically what I know best. Also, I just want to share what I know and hopefully make someone’s life a little easier! For this post I want to step outside the comfort zone and cover one of the next-generation features in Summation: Cluster Analysis. I’m really excited about it and hope to break it down into some simple, practical Summation know-how that you can also get excited about and take back to your desk (so to speak) to put to use on your next case. In short, I want to give you a primer for performing cluster analysis, describe how it works, show you how to view results, and talk about why it’s so beneficial. I think this is something that most old-school Summation users haven’t touched yet and, as a result, are missing out on. It’s also for those people out there who still think Summation is a dinosaur of the early lit support software period. You couldn’t be further from the truth!

Anyway, without further ado:

IsPivot = True (Using Cluster Analysis in SummationPro)

Who here has heard of “cluster analysis”, or its more common name, “Near-Duplicate Analysis”? This is most definitely a new technology if you are coming from the early lit support era, and it is a truly powerful new feature in Summation. The benefit of cluster analysis is quite simply to speed up the review process. At AccessData our cluster analysis is commonly referred to as “content clustering”. It’s called “content” clustering because that’s exactly what it’s doing: it’s creating clusters (or groups) of documents based on the content, i.e. the words on the page.

First let’s look at where the option is to enable Cluster Analysis.

When creating a new case under “Processing Options” you will see the cluster options.  Click “Perform Cluster Analysis” if you want to perform clustering every time you add evidence to your case.  Personally I’m a bigger fan of clustering on-demand when I want to, which is usually after I’ve completed several data loads.   To perform clustering on demand, click on the add evidence button next to the case you want clustered.

Next click on “Cluster Analysis”, set your threshold and click “Start”. You will now see a cluster analysis processing job in your work list and can monitor its progress. The time it takes to complete the analysis will depend on the number of items in your case. Typically this goes pretty quickly. To give you a benchmark, I have a demo case with approximately 6,000 items in it and clustering takes about 5 minutes to complete.

Analyzing Results

So what does this thing do??? Cluster analysis will identify groups, or “clusters”, of documents with similar content. For every cluster there will be a “Pivot” document, which is like the root of the cluster, and each similar item in the cluster will be given a Percent Similarity score. This “Similarity to Pivot” score tells you how similar the item is in relation to the pivot document. Another key feature of cluster analysis is that it will also identify email threads or conversations. After cluster analysis is performed the results can be viewed in the “Similar” and “Conversation” panels in the review interface. Also note that cluster analysis will work on files loaded through the .dii load file import or files processed as evidence through the “add evidence wizard”. That is a huge advantage over competing solutions. Even if you’ve loaded TIFF images with OCR, if the OCR text is similar enough cluster analysis will still identify clusters for you. Let’s stop and think about that: you can have a very unfriendly single-page TIFF production with a .dii load file with only a couple of tokens and still produce clusters. This will help you review production documents faster even when you had very little metadata to begin with. And again, since we are using “content clustering” technology, even the file types can be different and you will still get a cluster. Let me give you an extreme example of this, and why I think this is so powerful.

Someone creates a word doc, then prints it to PDF. Another person opens the PDF, cuts and pastes the text of the document into an email and emails that to a third person. That person then prints the email, and a fourth person scans the email to TIFF. Our cluster analysis could possibly put all of the files together in a cluster. You’d have four types of files in the cluster (DOC, PDF, MSG, and TIFF), all because the content is similar. Here’s an example of such a cluster:

This was a test I performed where I took the same document and converted it into 6 different file types: MSG, TXT, DOC, PDF, RTF, & XPS. The cluster analysis engine detected all 6 files and grouped them regardless of the fundamentally different file types. Again, our technology for clustering is all based on the content. Notice the different file type icons in the similar panel.

Summation Content-Clustering 101

So how does cluster analysis work under the hood?  Our cluster analysis is custom-built entirely by AccessData and consists of a 5 step process:

1. Text extraction

2. Omit “stop words”

3. Stemming

4. Calculation of similarity scores

5. Cluster formation

First we read the extracted text of a document, whether that’s OCR, metadata, email body, etc. The next step is to get rid of the “stop words”, using a list of frequently occurring, language-specific noise words. The remaining words are stemmed to remove suffixes and plurality. A similarity score is then generated by computing a “term frequency-inverse document frequency” weight for each word pair in relation to all word pairs in the entire case. A similarity score between two objects is calculated using these weighted word pair values. The resultant score can range from 0 (no similarity) through 100 (nearly identical). Lastly we form the cluster, which is a two-part process of breaking the case into smaller groups of word pairs and similarity scores, then choosing the Pivot object and locating other objects whose similarity meets or exceeds the cluster threshold.
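To make those five steps a little more concrete, here is a minimal Python sketch of the same general idea. To be clear, this is not AccessData’s engine. The tiny stop-word list, the crude `stem` function, the smoothed IDF formula, the use of cosine similarity, and the “first unclustered document becomes the pivot” rule are all my own simplifying assumptions, and the sketch skips the part where the case is broken into smaller groups.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real one is language specific and far longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "by"}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_pairs(text):
    """Steps 1-3: tokenize the extracted text, drop stop words, stem, pair adjacent words."""
    tokens = [stem(t) for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    return Counter(zip(tokens, tokens[1:]))


def tfidf_vectors(docs):
    """Step 4a: weight each word pair by term frequency times (smoothed) inverse document frequency."""
    counts = {doc_id: word_pairs(text) for doc_id, text in docs.items()}
    doc_freq = Counter(pair for c in counts.values() for pair in c)
    n = len(docs)
    return {
        doc_id: {p: tf * (math.log((1 + n) / (1 + doc_freq[p])) + 1) for p, tf in c.items()}
        for doc_id, c in counts.items()
    }


def similarity(a, b):
    """Step 4b: cosine similarity of two weighted vectors, scaled to 0-100."""
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return round(100 * dot / norm) if norm else 0


def cluster(docs, threshold=80):
    """Step 5: pick a pivot, then attach every remaining document at or above the threshold."""
    vectors = tfidf_vectors(docs)
    unassigned = list(docs)
    clusters = []
    while unassigned:
        pivot = unassigned.pop(0)  # assumption: first unclustered document is the pivot
        scores = {d: similarity(vectors[pivot], vectors[d]) for d in unassigned}
        similar = {d: s for d, s in scores.items() if s >= threshold}
        unassigned = [d for d in unassigned if d not in similar]
        clusters.append({"pivot": pivot, "similar": similar})
    return clusters


# Made-up sample "case": two copies of the same content plus an unrelated item.
docs = {
    "doc": "The contract was signed by the parties on Monday.",
    "pdf": "The contract was signed by the parties on Monday",
    "memo": "Lunch menu for the cafeteria next week.",
}
clusters = cluster(docs, threshold=80)
for c in clusters:
    print(c["pivot"], c["similar"])
```

The thing to notice is that file type never enters the picture. Only the words do, which is why a DOC and a PDF with the same text can land in the same cluster.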

So … how is that for “next-generation”?? I told you that this is not your mother’s Summation anymore! I hope this explanation will help you take advantage of this extremely useful feature set baked right in to your Summation Express or Summation Pro product. It’s pretty simple to use and shouldn’t be intimidating – even if it is cutting edge!



Document Clusters

Here is an example of a document cluster. I’ve clicked on a word doc and brought up my “Similar Panel”. Note there are 12 other documents that are similar to the one I’ve clicked on. Four of them have similarity scores of 100% and the rest have 97%. So here you can see duplicates as well as the near-duplicates. As you can imagine, this information is quite valuable. Not only can you see a group of similar docs, you can use the actions button to bulk label, bulk code, or even compare any two documents side by side in a special comparison viewer to see exactly where these documents differ. You can also open up any one of the other 12 cluster documents by simply clicking on the ObjectID link. So let’s take a look at the text comparison view and compare object 4222 with 4847.

Using the Comparison Tool

At the top of the screen you will see some of the statistics between the two documents we are comparing, such as the number of total changed lines and how many lines are the same.

The comparison tool will show you a side by side comparison of the two documents.  Areas where you see black text on white background are identical rows of text; shaded in green are the rows where you will see changes.  Within the green sections text in blue is also identical, however what’s key is the text in red which shows the differences.
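If you’re curious what a side-by-side comparison like this boils down to, here is a rough Python sketch using the standard library’s `difflib`. This is only an illustration of the concept, not how Summation’s viewer is built; the function names `compare_lines` and `changed_words` and the sample text are made up for the example.

```python
import difflib


def compare_lines(left_text, right_text):
    """Line-level comparison: count identical and changed rows and pair them side by side."""
    left = left_text.splitlines()
    right = right_text.splitlines()
    matcher = difflib.SequenceMatcher(a=left, b=right, autojunk=False)
    same = changed = 0
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            same += i2 - i1
            rows.extend(("same", line, line) for line in left[i1:i2])
        else:
            # replace / insert / delete: pad the shorter side with empty strings
            span = max(i2 - i1, j2 - j1)
            changed += span
            for k in range(span):
                l_line = left[i1 + k] if i1 + k < i2 else ""
                r_line = right[j1 + k] if j1 + k < j2 else ""
                rows.append(("changed", l_line, r_line))
    return {"same": same, "changed": changed, "rows": rows}


def changed_words(left_line, right_line):
    """Word-level comparison inside one changed row: return only the differing word runs."""
    a, b = left_line.split(), right_line.split()
    ops = difflib.SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return [(a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


# Made-up sample documents that differ only in the date.
left_doc = "Agreement\nDated July 1, 2014\nSigned by both parties"
right_doc = "Agreement\nDated July 8, 2014\nSigned by both parties"

result = compare_lines(left_doc, right_doc)
print(result["same"], "identical lines,", result["changed"], "changed lines")
for status, l_line, r_line in result["rows"]:
    if status == "changed":
        print(changed_words(l_line, r_line))
```

In terms of the viewer’s colors, the "same" rows correspond to the black-on-white text, the "changed" rows to the green sections, and the word runs returned by `changed_words` to the red text.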

So this is where “IsPivot = true” comes into play…

“IsPivot = true” is a simple search on the “IsPivot” column for any objects that are marked as Pivot docs. I like to think of this search as a way to boil any list of records right down to the meat and potatoes. In this example I originally had 385 word docs. After running “IsPivot = true” I’ve identified the 33 pivot records. Now every time I click on a doc in this list I’m guaranteed to see more than 1 doc. So arguably you will be able to review documents at a much faster rate now, since each pivot stands in for its whole cluster. I’d also argue that most of the time minor changes to a contract, like the date or slight verbiage changes, will not affect the broad issue coding and tagging/labeling the reviewers are doing. In addition to reviewing clusters, there are other action items where clusters come into play:

1) Bulk Coding – During bulk coding users can select “Include Similar” items, automatically extending the bulk coding information to the similar records.

2) Bulk Labeling – During bulk labeling users can select “Similar” items, automatically extending the bulk labeling information to the similar records.

3) Search expansion option – Just like “include families”, users can choose to expand their search results list to always “include similar” items.

4) Creating Review Sets – Also has the option to “include similar” docs when creating review sets for reviewers.
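For the programmers in the audience, here is a small Python sketch of what the pivot search and the “include similar” options amount to conceptually. Everything in it is hypothetical: the `Record` fields, the `pivot_id` link between a record and its cluster’s pivot, and the function names are my own stand-ins, not Summation’s actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    object_id: int
    is_pivot: bool
    pivot_id: int  # hypothetical link to the cluster's pivot (its own ID for a pivot)
    labels: set = field(default_factory=set)


def pivots_only(records):
    """Equivalent in spirit to searching IsPivot = true: keep one record per cluster."""
    return [r for r in records if r.is_pivot]


def bulk_label(records, selected_ids, label, include_similar=False):
    """Label the selected records; optionally extend the label to their whole clusters."""
    selected = set(selected_ids)
    if include_similar:
        pivot_ids = {r.pivot_id for r in records if r.object_id in selected}
        selected |= {r.object_id for r in records if r.pivot_id in pivot_ids}
    for r in records:
        if r.object_id in selected:
            r.labels.add(label)
    return sorted(selected)


# Made-up records: one two-document cluster and one singleton pivot.
records = [
    Record(object_id=101, is_pivot=True, pivot_id=101),
    Record(object_id=102, is_pivot=False, pivot_id=101),
    Record(object_id=201, is_pivot=True, pivot_id=201),
]

print([r.object_id for r in pivots_only(records)])
labeled = bulk_label(records, [101], "Contract", include_similar=True)
print(labeled)
```

The design point is that the reviewer only touches the pivot, and the label travels to the near-duplicates in the same cluster without any extra clicks.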

Email Threading

Performing cluster analysis will also identify email threads or groups of similar email messages.

In this example I’ve run a search for the term “Charge Methodology”. I then brought up the “Conversation” panel. This shows me that the message I’ve clicked on is actually just one message in a chain of emails. Here we sort the thread in chronological order and you can see the hierarchy of messages, forwards and replies within the thread. We show you the total number of emails in the cluster, the total number of attachments, the time-frame, and the list of participants. If you want to bring up any other message in the thread, simply click on it and it will open in a new browser window. You can also bulk label, bulk code, or perform comparisons just like we did earlier with the documents.

Finding Threads

There’s a specific button on the item list called the “Discussion List” that is quite handy when it comes to reviewing threads. It’s almost the exact same thing as running “IsPivot = true” for documents. In this example, I’m using the filter facets to show me the top email sender domains (Enron.com and attglobal.net). I’ve clicked apply and this gives me a list of only the messages originating from these 2 domains. After clicking the “discussion list” button I can now see a list of threads that gives me a summary of the email threads that occur for these specific messages. You will see in the discussion list the title of the thread, originator, time started, and number of emails in the chain. As you go through the discussion list, the viewer will open to the first email in the thread.
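Here is a short Python sketch of the kind of roll-up the discussion list gives you. It assumes each message has already been assigned to a thread (the `thread_id` field), which in Summation is the job of cluster analysis; how that assignment is actually made is not covered here. The field names and sample messages are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def discussion_list(messages):
    """Roll messages up into one summary row per thread, earliest thread first."""
    threads = defaultdict(list)
    for msg in messages:
        threads[msg["thread_id"]].append(msg)

    summary = []
    for thread_id, msgs in threads.items():
        msgs.sort(key=lambda m: m["sent"])  # chronological order within the thread
        first = msgs[0]
        summary.append({
            "thread_id": thread_id,
            "title": first["subject"],
            "originator": first["sender"],
            "started": first["sent"],
            "emails": len(msgs),
            "participants": sorted({m["sender"] for m in msgs}),
        })
    return sorted(summary, key=lambda t: t["started"])


# Made-up messages; thread_id is assumed to come from an earlier threading step.
messages = [
    {"thread_id": "t1", "subject": "RE: Charge Methodology",
     "sender": "bob@example.com", "sent": datetime(2001, 5, 2, 9, 0)},
    {"thread_id": "t1", "subject": "Charge Methodology",
     "sender": "alice@example.com", "sent": datetime(2001, 5, 1, 8, 30)},
    {"thread_id": "t2", "subject": "Lunch",
     "sender": "carol@example.com", "sent": datetime(2001, 5, 3, 12, 0)},
]

threads = discussion_list(messages)
for row in threads:
    print(row["title"], row["originator"], row["started"], row["emails"])
```

Each row carries the same four things the discussion list shows (title, originator, time started, and number of emails), plus the participant list from the Conversation panel.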

To conclude, our cluster analysis is actually pretty simple and has some serious benefits. For me, it’s all about speed, as the features are geared to help review go by much faster. By looking at groups of documents or email conversations you can get through more items in less time. Now, to hopefully get you really excited, this is just phase 1 of a larger goal to bring predictive coding capabilities into Summation. I’d encourage you to take a look at our CEO’s recent blog post on predictive coding.

Scott Lefton

Scott began to work in the legal industry as an in-house IT Administrator and Litigation Support Specialist for Epstein Turner and Song in Los Angeles, CA (www.epsteinturnerweiss.com). At this firm, Scott gained extensive experience working directly with veteran trial attorneys and learning the litigation process. In 2005, his technical background and litigation support expertise led him to become a Trial Technician and Department Manager at Merrill Legal Solutions (www.merrillcorp.com). In 2008, Scott moved home to St. Louis and began working with Midwest Trial Services (www.midwesttrial.com) as a Trial Consultant. In 2010 Scott began to work for AccessData as a Sales Engineer specializing in AD's Legal Products: Ediscovery, Summation, & AD ECA. Scott is a certified TrialDirector and Summation trainer, and has supported more than 100 trials. During his trial consulting career he worked on many high-profile trials and cases with nationally recognized law firms such as Morrison Foerster, Jones Day, Manatt Phelps, and Munger Tolles & Olson. Scott has worked with notable attorneys such as: James Bennett of Morrison Foerster on the JDS Uniphase Securities Litigation, Robert Zeavin of Manatt Phelps on the ICO vs. Boeing Satellite Systems trial, Michael Olecki of Grodsky Olecki on the landmark case Ramirez vs. Los Angeles Co. Sheriffs Dept., and Michael Zellers of Tucker Ellis & West on an anti-trust trial, RLH Industries vs. SBC Global Communications. He has helped clients avoid over $20 Billion in damages and has helped others to win over $2 Billion in awards.
