De-duplication (part 1)
Son Le
Data deduplication is a technique for eliminating duplicate copies of repeating data. This post gives a brief overview of the techniques and tools I have used to implement it.
The Problem
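In a computing system, to ensure that we don’t lose data, we have to make sure that messages are ‘delivered’. A simple tactic is “retry, retry, retry”. The downside is that a retried message can arrive more than once, so the consumer has to detect and drop the duplicates.
De-duplicating our messages
For this problem, the high-level API looks like: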
    def deduplicate(stream):
        for message in stream:
            if has_seen(hash(message)):
                discard(message)
            else:
                forward(message)
The implementation relies on a has_seen function that checks whether we have seen a message before. Each message is keyed by its id, which is the hash of the message, and that key is looked up in a seen set. If the message has been seen before, discard it. If it’s new, forward it.
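To make the seen set concrete, here is a minimal sketch, assuming messages arrive as bytes and the set lives in process memory. The names message_key and the seen parameter are hypothetical, and a real deployment would likely need a persistent, bounded store instead. It uses a SHA-256 digest rather than Python's built-in hash(), because hash() of str and bytes is randomized per process by default and so is not stable across restarts.

```python
import hashlib


def message_key(message: bytes) -> str:
    # Stable content digest, used as the message id.
    return hashlib.sha256(message).hexdigest()


def has_seen(key: str, seen: set) -> bool:
    # Return True if the key was recorded before; otherwise record it.
    if key in seen:
        return True
    seen.add(key)
    return False


def deduplicate(stream, seen=None):
    # Yield only the first occurrence of each message (assumption: forwarding
    # is modelled as yielding, discarding as skipping).
    if seen is None:
        seen = set()
    for message in stream:
        if not has_seen(message_key(message), seen):
            yield message
```

For example, list(deduplicate([b"a", b"b", b"a"])) keeps the first b"a" and b"b" and drops the retried b"a".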
The meaningless fields
So it’s easy, right? Actually, it’s not. Our data format contains some fields that change frequently but are not useful for computing. For example, the updated_date field changes every time the message is published, so two otherwise identical messages would hash differently. So we updated our function:
    def deduplicate(stream):
        for message in stream:
            if has_seen(hash(remove_useless_frequently_changed(message))):
                discard(message)
            else:
                forward(message)
Good
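A possible shape for remove_useless_frequently_changed, as a sketch only: it assumes each message has already been parsed into a dict and that the volatile fields sit at the top level. The VOLATILE_FIELDS constant is a hypothetical name, and updated_date is the only field the post actually mentions.

```python
# Hypothetical list of fields to ignore when comparing messages.
VOLATILE_FIELDS = frozenset({"updated_date"})


def remove_useless_frequently_changed(message: dict) -> dict:
    # Return a copy without the volatile top-level fields; the input is untouched.
    return {k: v for k, v in message.items() if k not in VOLATILE_FIELDS}
```

With this, two publications of the same message that differ only in updated_date reduce to the same dict, so they produce the same key.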
The hash of the message
Another problem that came up is the hash itself. We already know that the hash of abc is different from the hash of cba.
We use the JSON format to send messages, and our producer doesn’t guarantee the order of message fields, so the same message can serialize to different strings.
Luckily for me, I can use a canonical JSON format to keep the hash stable:
    def deduplicate(stream):
        for message in stream:
            # Strip the volatile fields first, then serialize canonically, then hash.
            if has_seen(hash(canonical_json(remove_useless_frequently_changed(message)))):
                discard(message)
            else:
                forward(message)
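One way to approximate canonical JSON with only the Python standard library is to sort the keys and fix the separators before hashing. This is a sketch under that assumption, not a full canonicalization scheme such as RFC 8785 (number formatting, for instance, is left to json.dumps). The message_hash helper is a hypothetical name.

```python
import hashlib
import json


def canonical_json(message: dict) -> str:
    # Sorted keys and fixed separators give one serialization per logical message.
    return json.dumps(message, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def message_hash(message: dict) -> str:
    # Digest of the canonical form, so field order no longer affects the key.
    return hashlib.sha256(canonical_json(message).encode("utf-8")).hexdigest()
```

Here message_hash({"a": 1, "b": 2}) and message_hash({"b": 2, "a": 1}) are equal, because both inputs serialize to the same string. sort_keys also applies to nested objects, but list order is preserved, which is usually what we want since order in a JSON array is meaningful.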
Summary
In this post, I have implemented a data deduplication algorithm that can be used in production. Stay tuned for the next part of this topic.