Deduplicator in mailclient

1/8/2024

At last week’s ILTACON in Washington, D.C., Beth Patterson, Chief Legal & Technology Services Officer for Allens in Sydney, asked a panel why e-discovery service providers couldn’t standardize hash values so as to support identification and deduplication across products and collections. If they did, you could use work from one matter in another. If an e-mail is privileged in one case, there’s a good chance it’s privileged in another; so, wouldn’t it be splendid to be able to flag its counterparts to ensure it doesn’t slip through without review?

Beth asked a great question, and one regrettably characterized by the panel as “a big technical challenge.” One panelist got off on the right foot: He said, “I’ve created artificial hashes in the past where what I had to do was aggregate and normalize metadata across different data sets to create a custom fingerprint to do that.” But, he added, “that’s probably not defensible, and it’s also really cumbersome.” Beth’s request was ultimately dismissed as “not an easy challenge” and one that would be confounded by “people, process and technology” and “the MD5 hash stuff.” The panel questioned whether MD5 hashes were the appropriate standard or whether SHA-1 would be required, positing that cross-matter deduplication is “something that requires significant buy-in across a broad spectrum of people.” “It’s because artificial hashes are kind of complicated,” one panelist offered, and not “a trivial technical problem.”

ILTACON is the rare venue where reasonably well-adjusted and -socialized people engage in lively discussions of such things. It’s not just that ILTA folks understand the technology issues (“GEEKS!”), we’re passionate about them (“NERDS!”) and debate them respectfully as peers (“WUSSIES!”).

Beth’s idea deserved more credit than it got. It really is a trivial technical problem, and one that could be resolved without much programming or politics. Then, why don’t we have a proven means to uniquely identify messages across vendors? I suspect it’s due to a lack of leadership and validation. Insofar as I’m aware, no one has published a standard methodology for cross-vendor identification or established that it works. Certainly, no one has managed to get something accepted as a de facto industry standard, in the nature of, say, the Concordance load file format or EDRM XML. Instead, we invent reasons why it’s just too darn hard.

To be clear, any e-discovery tool worth its salt employs a method to hash and deduplicate messages; unfortunately, they don’t employ the same method. Each tool approaches the task in a slightly different way and, when it comes to comparisons based on hash values, even the most minute variation in the data hashed generates a markedly different hash value.

This article looks at how to get everybody on the same page when it comes to generating consistent, hash-based message identifiers across vendors and matters. Let’s start by recounting a few facts about hashing, then examining how these facts relate to e-mail message and loose file identification and deduplication in e-discovery.

HASH FACT 1: Same data, same algorithm, same hash value.

HASH FACT 2: Different data, different hash value.

HASH FACT 3: No hash value is “close” to another hash value in a manner tied to similarity of the messages represented by the digest values of same.

HASH FACT 4: Hashing is a one-way process; i.e., the message digest cannot be reverse engineered to learn the content of the message.

If you’re reading this, I trust you already know what “hashing” is and that the most common hash algorithm (i.e., “mathematical formula”) employed in e-discovery is called the MD5. But, you may not know that the “MD” in MD5 stands for “Message Digest,” a synonym for “hash” (as a noun, not a verb).
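To make the hash facts and the panelist’s “aggregate and normalize” idea concrete, here is a minimal Python sketch. It is an illustration only, not a published standard or any vendor’s actual method. The choice of fields (sender, recipients, sent time, subject, body), the normalization rules, and the names `md5_hex`, `message_fingerprint`, `RENDITION_A` and `RENDITION_B` are all assumptions made up for this example, and it handles only simple, single-part text messages whose Date header carries an explicit UTC offset.

```python
import hashlib
from datetime import timezone
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def md5_hex(data: bytes) -> str:
    """Return the MD5 message digest of data as 32 hex characters."""
    return hashlib.md5(data).hexdigest()


def message_fingerprint(raw: str) -> str:
    """Hash a normalized rendition of a message (hypothetical field choices)."""
    msg = message_from_string(raw)

    def addresses(*headers: str) -> str:
        # Keep only the bare addresses, lowercased and sorted, so display
        # names, letter case and recipient order do not change the hash.
        values = []
        for header in headers:
            values.extend(msg.get_all(header, []))
        return ",".join(sorted(addr.lower() for _, addr in getaddresses(values) if addr))

    # Express the sent time in UTC so the same instant always hashes alike.
    # Assumes the Date header is present and carries an explicit offset.
    sent = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
    subject = " ".join((msg.get("Subject") or "").split())
    # Assumes a single-part text message; collapse runs of whitespace.
    body = " ".join(msg.get_payload().split())

    canonical = "\n".join([
        addresses("From"),
        addresses("To", "Cc"),
        sent.strftime("%Y-%m-%dT%H:%M:%SZ"),
        subject,
        body,
    ])
    return md5_hex(canonical.encode("utf-8"))


# Two renditions of one hypothetical message, as two tools might export it.
RENDITION_A = (
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>\n"
    "Date: Mon, 08 Jan 2024 10:00:00 -0500\n"
    "Subject: Hash  standards\n"
    "\n"
    "Please standardize the hashes.\n"
)
RENDITION_B = (
    "Subject: Hash standards\n"
    "Date: Mon, 08 Jan 2024 15:00:00 +0000\n"
    "To: BOB@EXAMPLE.COM\n"
    "From: ALICE@example.com\n"
    "\n"
    "Please standardize  the hashes.\n\n"
)

# Facts 1 to 3: identical input matches; a one-character change does not,
# and the two digests show no resemblance to each other.
print(md5_hex(b"same data"), md5_hex(b"same data"))
print(md5_hex(b"same data"), md5_hex(b"same data."))

# Raw renditions hash differently; normalized fingerprints agree.
print(md5_hex(RENDITION_A.encode()) == md5_hex(RENDITION_B.encode()))
print(message_fingerprint(RENDITION_A) == message_fingerprint(RENDITION_B))
```

The point of the sketch is the design choice, not the particular rules: if every tool hashed the same fields, normalized the same way, in the same order, the resulting identifiers would match across vendors and matters. Any real scheme would also have to settle attachments, multipart bodies, character encodings and missing headers, none of which this example addresses.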
