DW-marking: Data Watermarking - The missing link to on-/off-chain implementation of distributed data marketplaces

ABOUT THE PROJECT

Data Marketplaces (DMs), in which data sellers make datasets available for purchase by data buyers, are emerging fast in the big data market as a way to monetise personal or aggregate, often anonymised, datasets. Monolithic DMs operating under a single authority require participants to place full trust in a single company/organisation. They may also end up creating additional monopolies/oligopolies on the Internet. Therefore, several attempts are ongoing to develop distributed marketplaces, often on top of Distributed Ledger Technologies (DLTs).

Fully on-chain approaches run into scalability problems when faced with large datasets, such as those traded over DMs. To facilitate trustworthy off-chain handling of datasets in distributed DMs, DW-marking will develop a new breed of digital watermarking techniques for protecting ownership and establishing accountability when datasets are handled off-chain.

Existing digital watermarking techniques for media such as video, images, and software are not well suited to DMs, since they were developed for large binary files with a particular encoding that can be manipulated without affecting the contained information (e.g., slightly changing the colour tone of a few pixels). Such operations cannot be applied to datasets that carry structured and loosely structured information in the form of strings, integers, and floating point numbers. Any change to such information can render it useless (e.g., changing a character of a string) or inaccurate (e.g., altering an integer or floating point value). Therefore, we will develop a new breed of watermarking techniques suited to the nature of datasets traded in contemporary DMs.

In addition to developing core watermarking techniques, we will also develop protocols for using them to power dataset provenance primitives. We will also develop interfaces for connecting off-chain watermarking techniques with on-chain primitives for the same purpose. DW-marking will thus make concrete and clear contributions to all the protocol layers of ONTOCHAIN as described in further detail later.

Motivation for the project:

Various datasets are being traded in centralised and distributed Data Marketplaces (DMs). Given the large volume of these datasets, most of them are handled off-chain. This creates huge challenges in establishing dataset ownership and in preventing illegal copying and reselling, along with other threats to the growth and sustainability of current DM business models.

Generic use case description:

DW-marking will develop highly tailored watermarking techniques that embed the identity of the original owner in the actual data being traded. Such a watermark will serve as the root for building other ownership-related functionalities.

Essential functionalities:

1) Frequency Watermarking for datasets,
2) Recursive Watermarking as an off-chain provenance primitive,
3) Oracle for importing off-chain dataset transactions.

How these functionalities can be integrated within the software ecosystem:

Frequency Watermarking will be implemented as a standalone primitive for off-chain handling of ownership issues in DMs and other distributed systems. Recursive Watermarking will be implemented as an Oracle that allows off-chain DMs to upload past transactions to a blockchain.
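As an illustration of the kind of record such an Oracle might submit on-chain, the Python sketch below builds a tamper-evident summary of one off-chain dataset transaction. The project description does not specify a record format, so the field names (seller, buyer, dataset_sha256, timestamp, commitment) and the functions make_entry and verify_entry are hypothetical, and no blockchain client is involved.

```python
import hashlib
import hmac
import json


def dataset_digest(records):
    """SHA-256 over length-prefixed records, so record boundaries are unambiguous."""
    h = hashlib.sha256()
    for record in records:
        data = record.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def _canonical(body):
    # Sorted keys and fixed separators yield identical bytes for identical content.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_entry(seller, buyer, records, timestamp):
    """Hypothetical oracle payload: transaction metadata plus a hash commitment."""
    body = {
        "seller": seller,
        "buyer": buyer,
        "dataset_sha256": dataset_digest(records),
        "timestamp": timestamp,
    }
    entry = dict(body)
    entry["commitment"] = hashlib.sha256(_canonical(body)).hexdigest()
    return entry


def verify_entry(entry):
    """Recompute the commitment from the other fields and compare."""
    body = {k: v for k, v in entry.items() if k != "commitment"}
    expected = hashlib.sha256(_canonical(body)).hexdigest()
    return hmac.compare_digest(expected, entry.get("commitment", ""))
```

Only the fixed-size entry would need to go on-chain in such a design, while the dataset itself stays off-chain; anyone holding the dataset can recompute dataset_sha256 and check it against the stored entry.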

Gap being addressed:

Currently, data traded in the growing number of DMs has no ownership protection other than a legal one based on the terms of each transaction. Making use of such terms, however, is very difficult, since it first requires being able to prove ownership of the data. This can be done with a blockchain, but the processing of the actual data, due to its volume, most often happens off-chain. Watermarking will protect ownership even during off-chain operations.

Expected benefits achieved with the novel technology building blocks:

Having a watermark embedded in a dataset means that if ownership of the data is contested, the rightful owner will be able to prove their ownership. Also, if a data owner applies a different watermark to each copy of the data sold, a leaked copy can be traced back to the buyer who leaked it.
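The per-copy idea can be sketched in Python under strong simplifying assumptions. This is not the project's algorithm: it is a toy scheme in which the mark lives in how often each value appears, embedded by appending duplicate records so that no individual value is altered. The modulus Z, the key derivation in buyer_key, and all function names are hypothetical. The toy scheme is also not robust to added noise, which a practical design would have to tolerate.

```python
import hashlib
import hmac
from collections import Counter

Z = 7  # hypothetical modulus; larger values make accidental matches rarer


def buyer_key(owner_secret, buyer_id):
    """Derive a per-buyer key from the owner's secret (assumed derivation)."""
    return hmac.new(owner_secret, buyer_id.encode("utf-8"), hashlib.sha256).digest()


def _residue(key, token):
    # Keyed pseudo-random target in [0, Z) for this token.
    digest = hmac.new(key, token.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % Z


def embed(tokens, key):
    """Append duplicates until each token's count is congruent to its keyed residue mod Z."""
    marked = list(tokens)
    for token, count in Counter(tokens).items():
        extra = (_residue(key, token) - count) % Z  # between 0 and Z - 1 duplicates
        marked.extend([token] * extra)
    return marked  # a real scheme would also shuffle the records


def score(tokens, key):
    """Fraction of distinct tokens whose count matches the keyed residue."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hits = sum(1 for t, c in counts.items() if c % Z == _residue(key, t))
    return hits / len(counts)


def identify_leaker(leaked, owner_secret, buyer_ids):
    """Return the buyer whose key best explains the leaked copy's frequencies."""
    return max(buyer_ids, key=lambda b: score(leaked, buyer_key(owner_secret, b)))
```

A copy marked for one buyer scores 1.0 under that buyer's key, while an unrelated key is expected to match only about 1/Z of the tokens, so the owner can tell the copies apart without storing each one.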

Potential demonstration scenario:

We will demonstrate a prototype implementation of frequency watermarking and show, using real data, its resistance to Guess and Destroy attacks. In a Guess attack, the attacker attempts to guess where the watermark is and thereby hijack it. In a Destroy attack, the attacker attempts to add noise to the data in order to destroy the watermark. We will show that destroying the watermark requires so much added noise that the original data is also destroyed, rendering the attack useless.