For a while, I've been wondering why data as a service seems to feel like a concept from outer space for data engineers, data teams, and analysts. Then I had a pressure-free, hour-long discussion with Antti Loukiala, an experienced data engineering professional from Solita.
Data Integration paradigm
We discussed data usage from various angles and suddenly realized that we live in a data integration world, just like we did in the 1990s with code. Back then we reused code by integrating chunks of code together, provided that the attached open source license allowed it. Then Web APIs appeared and changed the game completely. Now the same is happening with data. If you look at the data markets, you’ll find chunks of data packaged for sale. Those are datasets.
A company buys a dataset from the marketplace, receives a copy of the content, and then “integrates” that data into its BIG pool of data. The data team then uses SQL to query the data and visualize it, or something along those lines. Stereotypically, that is how the approach can be drawn. One variation is that data is pushed or pulled with the help of APIs, but a huge amount of data is still transferred into one data pool. The risk is that much of the data is never even used, yet we still pay for it, and for its storage, maintenance, and monitoring. In short, data is gathered just because we might need it, without knowing yet what we are going to do with it.
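The dataset-integration pattern can be sketched in a few lines. This is a minimal illustration, not a real pipeline: the purchased dataset here is a made-up list of city temperatures, and an in-memory SQLite database stands in for the company's data pool.

```python
import sqlite3

# Hypothetical purchased dataset: a copy of the vendor's content,
# transferred wholesale into the local data pool.
purchased_dataset = [
    ("Helsinki", 2024, 5.9),
    ("Oslo", 2024, 6.4),
    ("Stockholm", 2024, 7.1),
]

# The in-memory database stands in for the company's BIG pool of data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_temps (city TEXT, year INTEGER, avg_c REAL)")
conn.executemany("INSERT INTO city_temps VALUES (?, ?, ?)", purchased_dataset)

# The data team later queries the local copy with SQL,
# whether or not all of the stored data is ever used.
rows = conn.execute(
    "SELECT city, avg_c FROM city_temps WHERE avg_c > 6 ORDER BY city"
).fetchall()
print(rows)  # [('Oslo', 6.4), ('Stockholm', 7.1)]
```

The point of the sketch is that the data is copied first and questions are asked later; the storage, maintenance, and monitoring costs apply to the whole copy, not just the rows the query touches.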
On the other end of the rainbow is something I call dataless lakes. Of course, there is data as well, but it’s not stored in large quantities on the data platform. Instead, what is added and indexed to the local data platform are the schemas of the data streams and possibly some sample data.
The data can still be queried with SQL or by other means such as SPARQL or even GraphQL. But instead of executing the query against local data, it is executed, when necessary, as multiple queries against multiple data source APIs. This increases complexity and, most likely, query execution time significantly. Accepting that, we might not want to use this kind of data consumption paradigm in solutions requiring low latency and high data throughput.
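The fan-out described above can be sketched roughly as follows. Everything here is hypothetical: the two `fetch_*` functions stand in for real data source APIs (in practice they would be HTTP calls), and the `query` function plays the role of the platform's query layer joining the results locally.

```python
def fetch_weather(city):
    # Hypothetical stand-in for a weather data API call.
    data = {"Helsinki": {"humidity": 78}, "Oslo": {"humidity": 65}}
    return data.get(city)

def fetch_population(city):
    # Hypothetical stand-in for a statistics data API call.
    data = {"Helsinki": 658_000, "Oslo": 709_000}
    return data.get(city)

def query(cities):
    # One logical query becomes several API calls joined locally --
    # this fan-out is where the extra complexity and latency come from.
    results = []
    for city in cities:
        weather = fetch_weather(city)
        population = fetch_population(city)
        if weather is not None and population is not None:
            results.append({
                "city": city,
                "humidity": weather["humidity"],
                "population": population,
            })
    return results

print(query(["Helsinki", "Oslo"]))
```

Each logical query here costs one round trip per source per city, which is why the dataless approach suits occasional lookups better than high-speed workloads.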
The future will be hybrid
Of course, just like with all things in life, reality is somewhere in between. The reality is a hybrid of the two extremes. Experienced data engineering and machine learning professional Jouni Miikki (nowadays Chief Architect at Vastuu Group) summarized well the need for both approaches:
“For example, location-based weather information, such as how humid it is somewhere right now or how humid it was yesterday, would be well suited for the dataless approach (via API / service).
On the other hand, you might have DNA information from millions of people, or tens of millions of anonymized call records, from which you extract summaries or other information. In those cases it may not be so practical to ask for that information from outside, especially if you are operating with an exploratory process.”
Nevertheless, realizing that we still live in the data integration period is fundamentally important for succeeding with the current offering. As long as we are dealing with the dataset-driven data integration paradigm, it is fine to talk in data-as-a-product terms.