Metadata Structuring
Metadata in the tokenized datasets framework falls into two categories, each handled differently:

Public Metadata
Basic descriptive information such as schema definitions, creation timestamps, and general categorization is stored in plaintext on-chain, in Patricia tries, to enable discovery and validation. This information is hashed with SHA-512 so that its integrity can be checked; the hash is stored alongside the plaintext data and does not conceal it.

Privacy-Sensitive Metadata
Statistical summaries, detailed provenance information, and other potentially sensitive metadata can be encrypted and stored off-chain through off-chain workers, with only hash references maintained on-chain. For this sensitive metadata, zero-knowledge proofs can be used to verify selected properties without revealing the underlying information.
This dual approach balances the need for discoverability with privacy protection, acknowledging that hashing alone does not provide confidentiality for on-chain information.
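The split between the two kinds of metadata can be sketched as follows. This is a minimal illustration, not the marketplace's implementation: the function names and record layouts are hypothetical, the ciphertext is treated as opaque bytes produced by whatever encryption scheme is used off-chain, and the canonical JSON serialization is an assumption.

```python
import hashlib
import json


def sha512_hex(data: bytes) -> str:
    """Return the SHA-512 digest of data as a hex string."""
    return hashlib.sha512(data).hexdigest()


def public_record(metadata: dict) -> dict:
    """Hypothetical on-chain record: plaintext metadata plus its digest.

    The digest supports integrity checks only; the plaintext next to it
    remains readable by anyone.
    """
    payload = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return {"plaintext": metadata, "sha512": sha512_hex(payload.encode("utf-8"))}


def private_record(ciphertext: bytes) -> dict:
    """Hypothetical on-chain record for sensitive metadata.

    Only a hash reference to the encrypted blob is kept; the blob itself
    is assumed to be stored off-chain.
    """
    return {"sha512": sha512_hex(ciphertext)}


# Illustrative values only.
public = public_record({"title": "Weather records", "category": "climate"})
ciphertext = b"opaque bytes produced by an off-chain encryption step"
private = private_record(ciphertext)
```

The public record exposes both the data and its digest, while the private record exposes nothing beyond a fixed-length reference that can later be matched against the off-chain blob.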
The marketplace implements a standardized metadata schema that balances comprehensiveness with efficiency. The schema includes:
Dataset identification: basic information like title, description, version, and creation timestamp
Technical specifications: format, size, encoding, compression method, and schema definition
Quality indicators: completeness, consistency metrics, update frequency, and last verification date
Domain-specific attributes: field-relevant indicators like resolution for images, sampling rate for audio, or collection methodology for surveys
Usage terms: license type, attribution requirements, and permitted use categories
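One way to picture the schema is as a set of typed records, one per category. The sketch below is an assumption about how such a schema might be modeled: the class names, field names, types, and all sample values are hypothetical and are not taken from the marketplace's actual schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Identification:
    title: str
    description: str
    version: str
    created_at: str  # ISO 8601 timestamp (assumed format)


@dataclass
class TechnicalSpec:
    format: str
    size_bytes: int
    encoding: str
    compression: str
    schema: Dict[str, str]  # column name -> type


@dataclass
class QualityIndicators:
    completeness: float  # fraction of non-missing values, 0.0 to 1.0 (assumed scale)
    consistency: float
    update_frequency: str
    last_verified: str


@dataclass
class UsageTerms:
    license: str
    attribution_required: bool
    permitted_uses: List[str]


@dataclass
class DatasetMetadata:
    identification: Identification
    technical: TechnicalSpec
    quality: QualityIndicators
    usage: UsageTerms
    # Free-form, since domain-specific attributes differ per field.
    domain_attributes: Dict[str, str] = field(default_factory=dict)


# Illustrative record; every value here is made up for the example.
record = DatasetMetadata(
    identification=Identification(
        "Weather records", "Hourly station readings", "1.0.0", "2024-01-01T00:00:00Z"
    ),
    technical=TechnicalSpec(
        "csv", 500_000_000, "utf-8", "gzip",
        {"temperature": "float", "humidity": "float", "wind_speed": "float"},
    ),
    quality=QualityIndicators(0.98, 0.95, "daily", "2024-01-15"),
    usage=UsageTerms("CC-BY-4.0", True, ["research", "commercial"]),
    domain_attributes={"collection_methodology": "automated weather stations"},
)

as_plain_dict = asdict(record)  # nested dict, ready for serialization
```

Keeping the domain-specific attributes as a free-form mapping lets the four fixed categories stay uniform across datasets while still accommodating fields such as image resolution or audio sampling rate.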
Functions of Metadata in the Tokenization Process
This structured approach enables efficient discovery and evaluation of datasets while ensuring that critical information is consistently available across the marketplace. The metadata serves several crucial functions in the tokenization process:
It enables efficient dataset discovery without requiring access to the full data
It provides the basis for quality assessment and validation before purchase
It facilitates provenance tracking and attribution for regulatory compliance
It documents the technical requirements for utilizing the dataset effectively
Consider a 500 MB dataset of weather records: its metadata might include the source, schema (e.g., columns for temperature, humidity, wind speed), statistical summaries (e.g., average temperature of 15°C), and a timestamp.
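The metadata is serialized, hashed with SHA-512 to produce a fixed-length 512-bit digest, and linked to the dataset's content identifier (CID). Consumers can then check that the metadata they review matches the recorded digest, and so assess the dataset's authenticity and relevance before purchase.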
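The serialize, hash, and link steps for the weather example can be sketched as below. The canonical JSON encoding, the function names, the record layout, and the CID string are all assumptions made for illustration; the CID shown is a placeholder, not a real identifier.

```python
import hashlib
import hmac
import json


def canonical_bytes(metadata: dict) -> bytes:
    """Serialize deterministically so the same metadata always hashes the same.

    Sorted keys and fixed separators are an assumed canonical form.
    """
    text = json.dumps(metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def metadata_digest(metadata: dict) -> str:
    """SHA-512 digest of the serialized metadata, as 128 hex characters."""
    return hashlib.sha512(canonical_bytes(metadata)).hexdigest()


def link_to_cid(metadata: dict, cid: str) -> dict:
    """Hypothetical record tying the metadata digest to the dataset's CID."""
    return {"cid": cid, "metadata_sha512": metadata_digest(metadata)}


def verify(metadata: dict, record: dict) -> bool:
    """Check that the metadata a consumer sees matches the recorded digest."""
    return hmac.compare_digest(metadata_digest(metadata), record["metadata_sha512"])


# Illustrative metadata for the weather dataset; values are made up.
weather = {
    "source": "example-station-network",
    "schema": ["temperature", "humidity", "wind_speed"],
    "statistics": {"avg_temperature_c": 15},
    "timestamp": "2024-01-01T00:00:00Z",
}
record = link_to_cid(weather, "placeholder-cid-not-a-real-identifier")

tampered = dict(weather, statistics={"avg_temperature_c": 25})
ok = verify(weather, record)
bad = verify(tampered, record)
```

Any change to the metadata, such as the altered average temperature in `tampered`, yields a different digest, so the check fails for the modified copy and succeeds for the original.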
