De-duplication (part 1)

2022-01-03 20:30 | Son Le

Data deduplication is a technique for eliminating duplicate copies of repeating data. This post gives a brief overview of the techniques and tools I have used to implement it.

The Problem

In a computing system, to ensure that we don’t lose data, we have to make sure that messages are ‘delivered’. The simple tactic is “retry, retry, retry”, but every retry means the same message can arrive more than once.

De-duplicating our messages

For this problem, the high-level API looks like this:

def deduplicate(stream):
    for message in stream:
        if has_seen(hash(message)):
            discard(message)
        else:
            forward(message)

The implementation relies on a has_seen function that checks whether we have seen a message before. Each message is keyed by its id, which is the hash of the message, and the function looks that key up in a set of seen keys. If the message has been seen before, we discard it. If it’s new, we forward it.

The meaningless fields

So it’s easy, right? Actually, it’s not. Our data format contains some fields that change frequently but are not useful for computing. For example, the updated_date field changes every time the message is published, so two copies of the same message would hash differently. So we updated our function:

def deduplicate(stream):
    for message in stream:
        if has_seen(hash(remove_useless_frequently_changed(message))):
            discard(message)
        else:
            forward(message)

Good
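One possible shape for remove_useless_frequently_changed, assuming the message has already been parsed into a dict and that the noisy fields sit at the top level. The VOLATILE_FIELDS set is a hypothetical name; updated_date is the only field taken from the example above.

```python
# Hypothetical list of fields that change on every publish and carry no meaning.
VOLATILE_FIELDS = frozenset({"updated_date"})


def remove_useless_frequently_changed(message: dict) -> dict:
    """Return a copy of the message without the volatile top-level fields."""
    return {k: v for k, v in message.items() if k not in VOLATILE_FIELDS}
```

The function returns a new dict, so the original message, including its updated_date, can still be forwarded unchanged.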

The hash of the message

Another problem that came up is the hash. We already know that the hash of abc is different from the hash of cba. We send messages as JSON, and our producer doesn’t guarantee the order of the message fields, so the same message can be serialized in different ways and hash differently. Luckily, I can use a canonical JSON format to keep the hash stable:

def deduplicate(stream):
    for message in stream:
        if has_seen(hash(canonical_json(remove_useless_frequently_changed(message)))):
            discard(message)
        else:
            forward(message)
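A minimal sketch of what canonical_json might look like with the standard library, assuming the message is a dict of JSON-compatible values. Sorting keys and fixing the separators is an approximation of a canonical form, not a full implementation of a canonical JSON standard, and message_digest is a hypothetical helper name.

```python
import hashlib
import json


def canonical_json(message: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, so field order does not matter."""
    return json.dumps(message, sort_keys=True, separators=(",", ":"))


def message_digest(message: dict) -> str:
    """Hash the canonical form so equal messages always get the same digest."""
    return hashlib.sha256(canonical_json(message).encode("utf-8")).hexdigest()
```

With this, {"a": 1, "b": 2} and {"b": 2, "a": 1} serialize to the same string and therefore produce the same digest. Keys in nested objects are sorted as well.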

Summary

In this post, I have described a data deduplication algorithm that can be used in production. Stay tuned for the next part of this topic.
