Data Ingestion Process
The tokenization process begins with data ingestion, where datasets are uploaded to IPFS, a content-addressable storage network that enables decentralized data retrieval, coordinated through off-chain workers.

This generates a Content Identifier (CID)—a cryptographic hash serving as an address and integrity check. For redundancy and fault tolerance, the system leverages erasure coding as described in the base layer specifications, allowing data reconstruction even when some shards are unavailable.
PoSp enhances this by requiring nodes to provide storage proofs through custom pallets, ensuring data persistence in a manner akin to a distributed vault with redundancy.
IPFS Protocol Enhancements for Data Management
The IPFS implementation in the Data Marketplace leverages several key features of the protocol to enhance security and efficiency:
Content addressing through cryptographic hashing ensures that data retrieval requests specify exactly what is being requested, not where it is located. This removes the need to trust specific storage providers and enables content verification upon receipt.
Deduplication through content addressing automatically eliminates redundant storage of identical data, improving efficiency across the network. If two datasets contain overlapping information, only the unique portions consume additional storage resources.
The Merkle Directed Acyclic Graph (DAG) structure of IPFS divides data into blocks linked by cryptographic hashes, enabling efficient verification, transfer, and caching of dataset components. This structure allows incremental verification and transfer of large datasets, enhancing performance for multi-gigabyte AI training datasets.
To ensure long-term storage reliability, PoSp requires nodes to periodically demonstrate possession of the dataset through cryptographic challenges managed by pallets.
Unlike Proof of Work, which relies on computational effort, PoSp leverages disk space, aligning with the marketplace's goal of resource efficiency. This mechanism discourages data loss or tampering, as nodes are incentivized through rewards to maintain data integrity.
