Hi! A friend and I have been working on a project called SemHash, which I wanted to share. We found that text deduplication is more complex than it appears, so we built this package to simplify the process.

Duplicate samples can skew model training, return redundant results in RAG workflows, reduce generalization, and cause train-test leakage, all of which lead to unreliable results. Techniques like MinHash handle exact or near-exact duplicates, but semantic deduplication also catches samples that are semantically redundant, which we believe is an important part of deduplication. It is also not trivial to see why something was removed with MinHash, so we've added explainability features that let you inspect why a sample was flagged. We already found some interesting results on well-known datasets in our benchmarks, which are included in the repo.
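
To give an intuition of what semantic deduplication does under the hood, here is a minimal standalone sketch (not the actual SemHash implementation; the embed callable and the 0.9 threshold are placeholders): greedily keep a sample only if its cosine similarity to every already-kept sample stays below a threshold, and log which kept sample triggered each removal so the decision stays inspectable.

import numpy as np

def semantic_deduplicate(texts, embed, threshold=0.9):
    # embed: any callable mapping a list of strings to a 2D numpy array
    # of embeddings (placeholder for whatever embedding model you use).
    vectors = embed(texts)
    # Normalize rows so dot products equal cosine similarities.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    kept_indices = []
    removal_log = []  # (removed_text, kept_text_it_duplicates, similarity)
    for i, vec in enumerate(vectors):
        if kept_indices:
            sims = vectors[kept_indices] @ vec
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                # Record which kept sample caused the removal, so it can
                # be inspected later.
                removal_log.append((texts[i], texts[kept_indices[best]], float(sims[best])))
                continue
        kept_indices.append(i)

    deduplicated = [texts[i] for i in kept_indices]
    return deduplicated, removal_log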

The package can be installed with pip install semhash, and basic usage looks like this (the example assumes you have the datasets library installed):

from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
train = load_dataset("ag_news", split="train")["text"]
test = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=train)

# Deduplicate the train set
deduplicated_train = semhash.self_deduplicate().deduplicated

# Or deduplicate the test set against the train set
deduplicated_test = semhash.deduplicate(records=test).deduplicated
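
If you want to see why something was flagged (the explainability mentioned above), you can inspect the result object instead of only taking .deduplicated. Here is a rough sketch of what that looks like; the attribute names (duplicates, record, exact) are illustrative rather than the exact API, so check the repo's README for the precise interface:

# Inspect which records were flagged and what they duplicated
result = semhash.self_deduplicate()
for dup in result.duplicates:
    print("Removed:", dup.record)
    if dup.exact:
        print("  reason: exact duplicate")
    else:
        print("  near-duplicates in the kept set:", dup.duplicates)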

I’m very interested in hearing your thoughts on this! Is deduplication a part of your current ML workflows, and if so, what techniques do you use?