Validate, deduplicate, and classify your data before it enters your systems. The Data Processing category provides three APIs that catch quality issues early: duplicate content, malformed payloads, and misidentified file uploads.
Integrate these into your ingest pipeline to reject bad data at the boundary. The deduplication service catches near-duplicates that exact matching would miss, the JSON validator enforces schema contracts on untrusted input, and the file type detector prevents extension spoofing.
Detect duplicate or near-duplicate content across a batch of documents using SimHash fingerprinting and Jaccard similarity. Unlike exact matching, these algorithms catch records that differ only in whitespace, word order, or minor edits.
Set your own similarity threshold to trade precision against recall: a higher threshold reports fewer false matches but misses more near-duplicates. Returns pairs of matching documents with their similarity score and the method used.
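To make the two techniques concrete, here is a minimal local sketch of SimHash fingerprinting and Jaccard similarity over word tokens. It is not the service's implementation: the tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint size, and every function name are assumptions chosen for illustration.

```python
import hashlib
import re
from itertools import combinations

BITS = 64  # fingerprint width (assumed; the service's width is not documented here)


def tokens(text: str) -> list[str]:
    """Lowercase word tokens, so whitespace and punctuation differences vanish."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two documents' token sets (order-insensitive)."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def simhash(text: str) -> int:
    """SimHash: every token votes +1 or -1 on each bit of the fingerprint."""
    votes = [0] * BITS
    for tok in tokens(text):
        # First 8 bytes of MD5 give a stable 64-bit hash per token (illustrative choice).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def simhash_similarity(a: str, b: str) -> float:
    """1.0 minus the normalized Hamming distance between the two fingerprints."""
    distance = bin(simhash(a) ^ simhash(b)).count("1")
    return 1.0 - distance / BITS


def find_duplicates(docs: list[str], threshold: float = 0.8) -> list[dict]:
    """Return index pairs whose Jaccard score meets the caller's threshold."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(docs), 2):
        score = jaccard(a, b)
        if score >= threshold:
            matches.append({"pair": (i, j), "score": score, "method": "jaccard"})
    return matches


docs = [
    "The quick brown fox",
    "the  quick   brown fox!",   # differs only in case, whitespace, punctuation
    "brown quick the fox",       # differs only in word order
    "something else entirely",
]
matches = find_duplicates(docs, threshold=0.8)
```

In this sketch the first three documents all pair up with each other, while an exact string comparison would treat them as distinct. Comparing all pairs is quadratic in the batch size, which is acceptable for a small illustration only.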
Validate JSON documents against JSON Schema specifications from Draft 4 through 2020-12. Every validation error includes the exact JSON path, the violated rule, and a human-readable message, making debugging straightforward.
Supports custom validation rules and business logic constraints beyond what JSON Schema alone can express. Use it as a gateway validator for incoming webhook payloads or API requests.
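The error shape described above (JSON path, violated rule, readable message) can be illustrated with a tiny hand-rolled validator. This is only a sketch under stated assumptions: it handles just the `type`, `required`, `properties`, and `items` keywords, the `$.field[0]` path notation is assumed, and the function and field names are hypothetical rather than the API's actual response format.

```python
import json

# Partial mapping from JSON Schema type names to Python types (illustrative subset).
TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "null": type(None),
}


def validate(instance, schema: dict, path: str = "$") -> list[dict]:
    """Collect errors for the type/required/properties/items keywords only."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(instance, bool):
            ok = False
        if not ok:
            errors.append({
                "path": path,
                "rule": "type",
                "message": f"expected {expected}, got {type(instance).__name__}",
            })
            return errors  # deeper checks are meaningless on the wrong type
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append({
                    "path": f"{path}.{key}",
                    "rule": "required",
                    "message": f"missing required property '{key}'",
                })
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], sub, f"{path}.{key}"))
    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


# Hypothetical webhook contract and an untrusted payload that violates it.
schema = {
    "type": "object",
    "required": ["id", "email"],
    "properties": {
        "id": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
payload = json.loads('{"id": "42", "tags": ["a", 7]}')
errors = validate(payload, schema)
```

Here `errors` holds three entries, one each for the missing `email`, the string-valued `id`, and the non-string element at `tags[1]`. A gateway could reject the request whenever the list is non-empty and return the entries to the caller.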
Detect the true file type from content bytes using magic number analysis, regardless of what the file extension claims. Returns the MIME type, correct extension, content category (document, image, archive, etc.), and a confidence score.
Essential for upload validation: it prevents users from disguising executables as images or bypassing file type restrictions by changing a file's extension.
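As a rough illustration of magic number analysis, the sketch below compares the leading bytes of a file against a handful of well-known signatures and flags a mismatch with the claimed extension. The signature table is a small assumed subset, the helper names are hypothetical, and the confidence score the API returns is not modeled here.

```python
import os

# (leading bytes, MIME type, canonical extension, category) for a few common formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png", "png", "image"),
    (b"\xff\xd8\xff", "image/jpeg", "jpg", "image"),
    (b"GIF87a", "image/gif", "gif", "image"),
    (b"GIF89a", "image/gif", "gif", "image"),
    (b"%PDF-", "application/pdf", "pdf", "document"),
    (b"PK\x03\x04", "application/zip", "zip", "archive"),
    (b"\x1f\x8b", "application/gzip", "gz", "archive"),
    (b"MZ", "application/vnd.microsoft.portable-executable", "exe", "executable"),
]

# Extensions treated as equivalent to the canonical one (illustrative).
ALIASES = {"jpeg": "jpg"}


def detect(data: bytes):
    """Return type info for the first matching signature, or None if unrecognized."""
    for magic, mime, ext, category in SIGNATURES:
        if data.startswith(magic):
            return {"mime": mime, "extension": ext, "category": category}
    return None


def is_spoofed(filename: str, data: bytes) -> bool:
    """True when the content is recognized and disagrees with the filename's extension."""
    found = detect(data)
    if found is None:
        return False  # unknown content: this sketch makes no claim either way
    claimed = os.path.splitext(filename)[1].lower().lstrip(".")
    return ALIASES.get(claimed, claimed) != found["extension"]


# A Windows executable header uploaded under an image name.
suspicious = is_spoofed("holiday.png", b"MZ\x90\x00\x03\x00\x00\x00")
# A genuine JPEG header with the alternate .jpeg extension.
legitimate = is_spoofed("photo.jpeg", b"\xff\xd8\xff\xe0\x00\x10JFIF")
```

In this example `suspicious` is `True` and `legitimate` is `False`. Only the first few bytes are inspected, so an upload handler can run a check like this before reading or storing the whole file.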