📊

Data Processing

Validate, deduplicate, and classify your data before it enters your systems. The Data Processing category provides three APIs that catch quality issues early: duplicate content, malformed payloads, and misidentified file uploads.

Integrate these into your ingest pipeline to reject bad data at the boundary. The deduplication service catches near-duplicates that exact matching would miss, the JSON validator enforces schema contracts on untrusted input, and the file type detector prevents extension spoofing.

Deduplication

Detect duplicate or near-duplicate content across a batch of documents using SimHash fingerprinting and Jaccard similarity. Unlike exact matching, these algorithms catch records that differ only in whitespace, word order, or minor edits.
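To make the fingerprinting idea concrete, here is a minimal local sketch of SimHash in Python. It is an illustration of the general technique, not the service's implementation: the tokenizer, the 64-bit width, the BLAKE2 token hash, and the function names `simhash` and `simhash_similarity` are all assumptions chosen for the example.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption for this sketch)


def tokens(text: str) -> list:
    """Lowercase the text and split it into word tokens, dropping whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())


def simhash(text: str) -> int:
    """Compute a SimHash fingerprint: every token votes on every bit position."""
    votes = [0] * BITS
    for tok in tokens(text):
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when more tokens voted 1 than 0 at that position.
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def simhash_similarity(a: str, b: str) -> float:
    """Fraction of fingerprint bits that agree (1.0 means identical fingerprints)."""
    differing = bin(simhash(a) ^ simhash(b)).count("1")
    return 1.0 - differing / BITS
```

Because each token votes independently of its position, two texts that contain the same words in a different order, or with different spacing, produce the same fingerprint in this sketch. Small edits flip only a few bits, so the similarity stays close to 1.0.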

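Set your own similarity threshold to control precision versus recall: a higher threshold reports fewer, more certain matches, while a lower one catches looser near-duplicates at the cost of more false positives. The API returns each pair of matching documents with its similarity score and the method that detected the match.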
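The effect of the threshold can be sketched locally with Jaccard similarity over word sets. This is a simplified illustration under stated assumptions, not the service's behavior: the function names and the result keys (`a`, `b`, `similarity`, `method`) are hypothetical and are not taken from the API's response format.

```python
import re
from itertools import combinations


def word_set(text: str) -> set:
    """Lowercase the text and reduce it to its set of distinct words."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    sa, sb = word_set(a), word_set(b)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def find_duplicates(docs: dict, threshold: float = 0.8) -> list:
    """Compare every pair in the batch and keep those at or above the threshold."""
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(docs.items(), 2):
        score = jaccard(text_a, text_b)
        if score >= threshold:
            # Result keys are hypothetical, chosen for this sketch only.
            pairs.append({"a": id_a, "b": id_b,
                          "similarity": round(score, 3), "method": "jaccard"})
    return pairs


docs = {
    "d1": "Order 1042 shipped to Berlin on Monday",
    "d2": "order 1042  shipped on Monday to Berlin",
    "d3": "Order 1042 shipped to Berlin on Tuesday",
    "d4": "Invoice overdue",
}
strict = find_duplicates(docs, threshold=0.8)  # only the reordered pair d1/d2
loose = find_duplicates(docs, threshold=0.7)   # also pairs that differ by one word
```

With the sample batch, the stricter threshold keeps only the pair that differs in word order and spacing, while the looser one also reports the documents that differ by a single word. Comparing all pairs is quadratic in the batch size, which is acceptable for a sketch but is one reason fingerprinting methods are used at scale.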

SimHash, Jaccard similarity, configurable threshold, batch analysis

POST /v1/content/dedup

JSON Validator

Validate JSON documents against JSON Schema specifications from Draft 4 through 2020-12. Every validation error includes the exact JSON path, the violated rule, and a human-readable message, making debugging straightforward.

The validator also supports custom validation rules and business logic constraints beyond what JSON Schema alone can express. Use it as a gateway validator for incoming webhook payloads or API requests.
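The shape of a path-based error report can be sketched with a small hand-rolled validator. This is not a JSON Schema implementation and not the service's code: it handles only single-string `type`, `required`, `properties`, `items`, and `minimum`, and the error keys (`path`, `rule`, `message`) and the `$.field[0]` path notation are assumptions made for illustration.

```python
from typing import Any

# Maps JSON Schema type names to Python types (small subset only).
_TYPES = {
    "object": dict, "array": list, "string": str, "boolean": bool,
    "number": (int, float), "integer": int, "null": type(None),
}


def _error(path: str, rule: str, message: str) -> dict:
    return {"path": path, "rule": rule, "message": message}


def validate(instance: Any, schema: dict, path: str = "$") -> list:
    """Return a list of errors, each carrying the path of the offending value."""
    errors = []
    is_number = isinstance(instance, (int, float)) and not isinstance(instance, bool)

    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and not is_number:
            ok = False
        if not ok:
            got = type(instance).__name__
            # Stop here: other keywords are meaningless on a value of the wrong type.
            return [_error(path, "type", f"expected {expected}, got {got}")]

    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(_error(f"{path}.{key}", "required",
                                     f"missing required property '{key}'"))
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], sub, f"{path}.{key}"))

    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))

    if "minimum" in schema and is_number and instance < schema["minimum"]:
        errors.append(_error(path, "minimum",
                             f"{instance} is less than minimum {schema['minimum']}"))
    return errors


schema = {
    "type": "object",
    "required": ["event", "amount"],
    "properties": {
        "event": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "items": {"type": "array", "items": {"type": "string"}},
    },
}
errors = validate({"event": "refund", "amount": -5, "items": ["a", 3]}, schema)
```

For the sample payload, the sketch reports one error at `$.amount` for the `minimum` rule and one at `$.items[1]` for the `type` rule. Collecting every error with its path, rather than failing on the first one, is what lets a caller fix a rejected webhook payload in a single pass.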

JSON Schema, Draft-07, error paths, custom rules, business rules

POST /v1/json/validate

File Type

Detect the true file type from content bytes using magic number analysis, regardless of what the file extension claims. Returns the MIME type, correct extension, content category (document, image, archive, etc.), and a confidence score.

Essential for upload validation: it prevents users from disguising executables as images or bypassing file type restrictions by renaming a file's extension.
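Magic number analysis can be sketched as a lookup of well-known leading byte signatures. The example below is a simplified local illustration, not the service's detector: the signature table covers only a handful of formats, the category labels and the helper names `detect` and `is_spoofed` are assumptions, and it produces no confidence score.

```python
import os
from typing import Optional

# (leading bytes, MIME type, canonical extension, category); a tiny illustrative subset.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png", "png", "image"),
    (b"\xff\xd8\xff", "image/jpeg", "jpg", "image"),
    (b"GIF87a", "image/gif", "gif", "image"),
    (b"GIF89a", "image/gif", "gif", "image"),
    (b"%PDF-", "application/pdf", "pdf", "document"),
    (b"PK\x03\x04", "application/zip", "zip", "archive"),
    (b"MZ", "application/vnd.microsoft.portable-executable", "exe", "executable"),
]

# Extensions that name the same format as a canonical one.
ALIASES = {"jpeg": "jpg"}


def detect(data: bytes) -> Optional[dict]:
    """Identify the content by its leading bytes; return None if no signature matches."""
    for magic, mime, ext, category in SIGNATURES:
        if data.startswith(magic):
            return {"mime": mime, "extension": ext, "category": category}
    return None


def is_spoofed(filename: str, data: bytes) -> bool:
    """True when the claimed extension disagrees with the detected content type."""
    detected = detect(data)
    if detected is None:
        return False  # unknown content: this sketch cannot decide either way
    claimed = os.path.splitext(filename)[1].lstrip(".").lower()
    return ALIASES.get(claimed, claimed) != detected["extension"]
```

A file named `photo.png` whose bytes begin with `MZ` is flagged by `is_spoofed`, because the content matches the Windows executable signature rather than PNG. Note that several formats share a container signature (for example, Office documents are ZIP archives), so a real detector has to look past the first few bytes.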

magic number analysis, MIME detection, extension mapping, content verification

POST /v1/file/detect