#28 From Data Integration to Dataless
A data marketplace is an online store or platform that facilitates the buying and selling of data. As many businesses seek to augment or enrich internal data sets with external data, cloud-based data marketplaces are appearing at a growing rate to match data consumers with the right data sellers.
If you go and look at the marketplaces today, you’ll soon notice one feature that 99 percent of them share: they are selling datasets. We all saw these CSVs and Excel files during the big boom in open data, and they were, and still are, the default format for distributing open data. For example, Bloomberg advertises over 5,000 datasets in its marketplace.
Some draw a line between data marketplaces and data streaming platforms, and for good reason: the first is a market for doing business, while the second is a technical data platform. Data streaming platforms are great for data buyers who need access to on-demand, real-time data feeds. In my thinking, a data marketplace can contain both worlds: datasets and streaming data. Both are data products, and in some cases data as a service.
What we are witnessing is two different paradigms for data consumption. The first is data integration and the second is dataless. Let me explain.
Data Integration paradigm
In the data integration paradigm, the company buys a dataset from the marketplace, receives a copy of its content, and then “integrates” that data into its big pool of data.
The data team then uses SQL to query the data, visualize it, and so on. Stereotypically, the approach can be drawn as in the figure above.
One variation of the above is that the data is pushed or pulled with the help of APIs, but a huge amount of data is still transferred into one data pool.
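The steps above can be sketched in a few lines. This is a minimal, hypothetical example (the city data and table names are made up, and a real setup would use a warehouse rather than SQLite): a purchased dataset copy is loaded into the local data pool, after which the data team queries the copy with SQL.

```python
import csv
import io
import sqlite3

# Hypothetical dataset copy received from a marketplace (normally a CSV download).
PURCHASED_CSV = """city,population
Helsinki,658864
Espoo,297132
"""

# 1. "Integrate": copy the purchased data into the local data pool.
pool = sqlite3.connect(":memory:")
pool.execute("CREATE TABLE cities (city TEXT, population INTEGER)")
rows = [(r["city"], int(r["population"])) for r in csv.DictReader(io.StringIO(PURCHASED_CSV))]
pool.executemany("INSERT INTO cities VALUES (?, ?)", rows)

# 2. The data team queries the local copy with SQL.
total = pool.execute("SELECT SUM(population) FROM cities").fetchone()[0]
print(total)
```

Note that the marketplace is out of the picture after step 1: from then on the company owns, stores, and maintains the copy, whether or not it is ever queried.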
The risk here is that much of the data is never even used, yet we still pay for it, and for its storage, maintenance, and monitoring. In short, data is gathered just because we might need it someday, without knowing yet what we are going to do with it.
Dataless paradigm
At the other end of the rainbow is something I call the dataless paradigm. Of course, there is data as well, but it’s not stored in large quantities on the data platform.
Instead, what is added and indexed in the local data platform are the schemas of the data streams, and possibly some sample data.
The data can still be queried with SQL or by other means, such as SPARQL or even GraphQL. But instead of executing the query against local data, it is executed, when needed, as multiple queries against multiple data source APIs.
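A minimal sketch of that idea, with all names and data hypothetical: only the schemas are indexed locally, and a single logical query is answered at run time by calling each remote source API and joining the responses locally. The fetch functions below stand in for HTTP calls to marketplace APIs.

```python
# Local index: schemas only, no data is stored on the platform.
SCHEMA_INDEX = {
    "weather": ["city", "temp_c"],
    "population": ["city", "population"],
}

# Stand-ins for remote data source APIs; a real system would issue HTTP requests.
def fetch_weather():
    return [{"city": "Helsinki", "temp_c": -3}, {"city": "Espoo", "temp_c": -4}]

def fetch_population():
    return [{"city": "Helsinki", "population": 658864}, {"city": "Espoo", "population": 297132}]

SOURCES = {"weather": fetch_weather, "population": fetch_population}

def federated_join(left, right, key):
    """Answer one logical query by issuing one call per source and joining locally."""
    right_rows = {row[key]: row for row in SOURCES[right]()}
    return [
        {**row, **right_rows[row[key]]}
        for row in SOURCES[left]()
        if row[key] in right_rows
    ]

result = federated_join("weather", "population", key="city")
print(result)
```

Even this toy version makes the trade-off visible: one logical query becomes two API calls plus a local join, and every added source multiplies the network round trips.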
This increases complexity and, most likely, query execution time significantly. Accepting that, we might not want to use this consumption paradigm in solutions that require low latency and high-speed data.
Of course, as with most things in life, reality is somewhere in between: a hybrid of the two extremes.