Legal Discovery, Symantec Enterprise Vault, and Microsoft SharePoint Server Integration

I'm hoping this will be a useful topic, because I guarantee very few people have been through integrating these products.

I have been working on integrating the two products at a client for the past 5 months so hopefully this will help someone out. 

Legal discovery has very different requirements from enterprise search. MOSS is a relevance-based search engine, which means that a user's search is matched against content in the MOSS index using various ranking signals: link depth from authoritative sources, document type (e.g. Word is ranked higher than Excel), word repetitions within a document, the number of times a document is linked to, and so on. Only a certain portion of a document is indexed (e.g. the first 15 MB), after which content is not considered relevant. Documents over a certain size are also not indexed out of the box. All of this can be adjusted through configuration, but my point is that MOSS is purely focused on making the most relevant content available to a searcher. 95% of the time the average user never goes beyond the first page of results. This is why relevancy is so important for your average user and why the system is tuned for it.

Legal discovery is focused on a very different kind of search. The legal user often wants to view all search results and to search the entire content of every document indexed by MOSS. The legal user also wants to put those results into a workflow so they can be saved on a case-by-case basis. MOSS is not tuned to these needs. I will detail the biggest risks that MOSS introduces for a legal discovery search.
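The difference between the two search styles can be sketched in code. This is only an illustration in Python: fetch_page is a hypothetical stand-in for whatever search API is in use (it is not a MOSS call), and the in-memory backend exists purely to exercise the sketch. The enterprise search user reads one page; the discovery user has to walk every page before the exact total is known.

```python
from typing import Callable, Iterator, List

# Hypothetical signature: fetch_page(query, start, page_size) -> list of result IDs.
# It stands in for a real search API and is an assumption of this sketch.
FetchPage = Callable[[str, int, int], List[str]]


def first_page(fetch_page: FetchPage, query: str, page_size: int = 10) -> List[str]:
    """What a typical enterprise search user sees: only the top-ranked page."""
    return fetch_page(query, 0, page_size)


def all_results(fetch_page: FetchPage, query: str, page_size: int = 10) -> Iterator[str]:
    """What a discovery user needs: every hit, paging until a page comes back empty."""
    start = 0
    while True:
        page = fetch_page(query, start, page_size)
        if not page:
            return
        yield from page
        start += len(page)


def make_fake_backend(hits: List[str]) -> FetchPage:
    """In-memory stand-in for a search backend (ignores the query text)."""
    def fetch_page(query: str, start: int, page_size: int) -> List[str]:
        return hits[start:start + page_size]
    return fetch_page


if __name__ == "__main__":
    backend = make_fake_backend([f"doc-{i}" for i in range(37)])
    print(len(first_page(backend, "contract")))
    print(sum(1 for _ in all_results(backend, "contract")))
```

The exact count only falls out of the second function after the last page has been fetched, which is the cost a discovery search has to accept.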

Gaps and Mitigations

Gap: Documents over N MB are not indexed
Comments: The initial limit is 16 MB.
Mitigation: Increase the maximum file size. References: Office Online, Someone's Blog.

Gap: Optical Character Recognition technologies are weak
Comments: Legal could possibly scan and save a large number of documents. There is very little IFilter support for OCR-based documents. There is currently no IFilter for TIFF, which I'm guessing is the format files are being scanned in as.
Mitigation: Capatris is developing a TIFF IFilter (Capatris IFilter).

Gap: Search result count is an estimate
Comments: MOSS produces an estimate of the number of search results returned. Legal will not know the total number until all results are paged through.
Mitigation: None listed.

Gap: Not every type of document can be indexed
Comments: IFilters are not available for all document types. Add-ons can be purchased, but this will have to be dealt with on a case-by-case basis.
Mitigation: Research partners for formats that need to be included. Develop an inclusive list of the file formats we support.

Gap: Old versions of documents are not indexed
Comments: SharePoint does not index old versions of a document. This refers to the version control ability within a SharePoint document library.
Mitigation: None listed.

Gap: Tested index limit is 50 million documents
Comments: This is a limit per farm. After we hit this limit we will need to create a new farm or try to mitigate performance issues.
Mitigation: Create a new legal discovery farm for every 30 million documents.

Gap: No legal discovery workflow capability to deal with document holds
Comments: There is no legal discovery workflow built into SharePoint Search to track cases or searches by case.
Mitigation: This could be custom built.

Gap: Searching Exchange
Comments: Exchange content attracts a lot of interest in terms of legal discovery. In my opinion a large organization (>2000 mailboxes) will have way too many emails to be searched through MOSS Search. The index limit is way too small for this.
Mitigation:
1. Use FAST; its index scales a lot higher.
2. Look to SharePoint Works for the Exchange Connector, although you might be pushing up against the MOSS index document limit.
3. Use Exchange Journaling and an eDiscovery product like Mimosa. I would look for one that is tightly integrated with MOSS.
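The farm-sizing mitigation and the Exchange concern both come down to simple arithmetic, sketched below in Python. The 50 million and 30 million figures come from the Gaps and Mitigations section; the function names and the items-per-mailbox figure are assumptions made up for illustration, not measurements from the client.

```python
TESTED_INDEX_LIMIT = 50_000_000  # tested limit per farm, from the gaps above
DOCS_PER_FARM = 30_000_000       # planning threshold from the mitigation above


def farms_needed(total_documents: int, docs_per_farm: int = DOCS_PER_FARM) -> int:
    """Number of discovery farms if a new farm is added every docs_per_farm documents."""
    if total_documents < 0:
        raise ValueError("total_documents must be non-negative")
    # Integer ceiling division; at least one farm is always needed.
    return max(1, (total_documents + docs_per_farm - 1) // docs_per_farm)


def estimated_items(mailboxes: int, items_per_mailbox: int) -> int:
    """Rough mail volume; items_per_mailbox is a hypothetical planning input."""
    return mailboxes * items_per_mailbox


if __name__ == "__main__":
    # Illustrative only: 2,000 mailboxes at an assumed 25,000 items each.
    items = estimated_items(2_000, 25_000)
    print(items, items >= TESTED_INDEX_LIMIT, farms_needed(items))
```

Even with a modest assumed mailbox size, mail alone can reach the tested limit of a single farm, which is why the Exchange row points away from MOSS Search.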

Symantec Integration Issues - MOSS Search Collisions with File Shares

Symantec Enterprise Vault is capable of archiving file shares.  This means files on file shares will be replaced with placeholders.  The SharePoint crawler does not recognize the difference between a placeholder and a file on the file system. 

The question becomes: do you want to crawl files that have been archived?

Our answer was no. Since Symantec Enterprise Vault builds its own index, why have two? Skipping archived files also lightens the load on the MOSS index. The behavior we witnessed, however, is that MOSS would continue to pull archived items back out of the vault. The only way to prevent this is to put Symantec Enterprise Vault in backup mode and add the crawl account to the backup group. This, however, will flood the crawl log with errors, so it is not the best option either if you want search to remain manageable.

The best option would be if the Crawl account could ignore documents with the offline attribute.  This is not currently available in MOSS but hopefully one day it will be.   

Update: This has been fixed in the MOSS August 2008 hotfix. It's not very clear from the notes, but here it is: https://support.microsoft.com/kb/956056/. “When you try to crawl a content source, the offline files in the content source are indexed unexpectedly. The offline files are the files that have the PR_FILE_ATTRIBUTE_OFFLINE attribute set.
Note: After you apply the hotfix, offline files are not indexed any longer.”
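To make the offline-attribute idea concrete, here is a minimal Python sketch of how a script could tell placeholders apart from real files on a Windows file share. It is not how the MOSS crawler or the hotfix works internally; is_offline and crawlable are hypothetical helper names, and the check relies on the standard Windows FILE_ATTRIBUTE_OFFLINE bit, which Python exposes through os.stat on Windows.

```python
import os
import stat
from typing import Iterable, List


def is_offline(attributes: int) -> bool:
    """True if the Windows FILE_ATTRIBUTE_OFFLINE bit is set in an attribute bitmask."""
    return bool(attributes & stat.FILE_ATTRIBUTE_OFFLINE)


def crawlable(paths: Iterable[str]) -> List[str]:
    """Return the paths that are not marked offline.

    st_file_attributes only exists on Windows; on other platforms it is
    treated as 0 here, so every path counts as online.
    """
    keep = []
    for path in paths:
        attributes = getattr(os.stat(path), "st_file_attributes", 0)
        if not is_offline(attributes):
            keep.append(path)
    return keep
```

A check like this could be used to audit a share before a crawl and see how many placeholders the crawler would otherwise try to open.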

Search Integration with Symantec Enterprise Vault

If you want to search MOSS and Enterprise Vault from a single interface today, your only option is custom code calling the EV COM interfaces. This is very complicated code to write. Luckily, around the end of the year Symantec will release a new version that supports Search Federation, so a single interface can be used.
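The idea behind a single search interface can be sketched in a few lines of Python. This is a conceptual illustration only: it does not use the EV COM interfaces or Symantec's federation feature, and every name in it (Hit, SearchFn, federated_search, the fake sources) is hypothetical. Each backend is modelled as a plain function from a query to a list of titles.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: a search source is any function mapping query text to titles.
SearchFn = Callable[[str], List[str]]


@dataclass(frozen=True)
class Hit:
    source: str  # name of the system that returned the hit
    title: str


def federated_search(query: str, sources: Dict[str, SearchFn]) -> List[Hit]:
    """Send one query to every source and return the hits labelled by origin."""
    hits: List[Hit] = []
    for name, search in sources.items():
        for title in search(query):
            hits.append(Hit(source=name, title=title))
    return hits


def fake_source(titles: List[str]) -> SearchFn:
    """In-memory stand-in for a real search backend: simple substring match."""
    def search(query: str) -> List[str]:
        return [t for t in titles if query.lower() in t.lower()]
    return search


if __name__ == "__main__":
    sources = {
        "live": fake_source(["Contract draft", "Team notes"]),
        "archive": fake_source(["Contract 2006 (archived)"]),
    }
    for hit in federated_search("contract", sources):
        print(hit.source, hit.title)
```

Keeping the source label on each hit matters for discovery work, because the reviewer needs to know whether a document came from the live system or from the archive.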

Search Federation still does not cover the Discovery Accelerator (DA) product. It's the one product I feel should be integrated into MOSS. It would be beneficial if there were workflows that could launch out of MOSS search into DA. I have heard, however, that DA will be able to federate queries back to MOSS, so when a legal user is within the DA interface, MOSS search results can appear. This is due in the next release of Symantec EV.

I hope my thoughts on legal discovery helped someone. Please feel free to comment.