#15 What is Toxic data and why should I care?

We have all seen the word "toxic" used in discussions over the past years. Most of us have probably come across it in relation to gender politics and misogyny. Toxicity is something we should be aware of in the data economy as well.


Invasion of personal data



One viewpoint is that toxic data is any advertising data collected or used without the explicit permission of the consumer. By that definition, safe data is any data collected or used with freely given and informed consent from the consumer, in compliance with all data privacy laws. But that is a limited approach. The example is, of course, about personal data, and it often comes from marketing people. A clearer example is my health records, which by default make my health care easier and safer, provided that the information is accessed only by people who have the right to do so. The same information becomes toxic data if it is accessed by malicious people.





Opening new cybersecurity attack vectors


Another viewpoint is more cybersecurity-related. The highly regarded Forbes gives the following example: "Imagine that you had a schedule for a senior executive available either explicitly or implicitly online. And now add to this that the senior executive was known to use a certain type of private jet service and the trips on that service could be determined publicly. Finally, because the jet service also made it possible to see information on the flight crew that was going to be on that private jet, if you were a nefarious hacker, you could get to that senior executive in a number of ways. For instance, you could target the electronics of the flight crew member and they might have far less secure profiles than the senior executive or staff surrounding him or her. This could open the door to a new type of cyber attack that wouldn’t have been possible in the past."


Wrecking trustworthiness and credibility


The third example was raised in a discussion with long-term cybersecurity professional Tessa Viitanen from Techie Stories Ltd. This third variation of toxic data relates to misinformation, disinformation, and poorly documented data, where there is no way of knowing whether the information has been tampered with or is biased.



Some organizations and individuals want to wreck the credibility of certain professions and institutions to push their own agenda, which might be, for example, political or financial. As an example, in Finland anti-vaxxers targeted age groups with misleading information. This is expected to give their message (mis/disinformation) a better chance of sinking into the minds of the general public. When it comes to attacks on healthcare, Gartner has already made a couple of predictions. We have also already seen WannaCry- and Petya-type attacks on hospitals and ATMs. The risk of toxic data is real, whether it arises through social manipulation, data tampering, or the weaponization of data and technology.


Gartner has predicted that by 2025 cyber attackers will be weaponizing operational technology, with three motives:

  1. actual harm

  2. commercial vandalism (reduced output)

  3. reputational vandalism (making a manufacturer untrusted or unreliable)


Gartner has also predicted that 75% of CEOs will be personally liable for Cyber-Physical Security Incidents by 2024, but how can they be responsible if there is not enough understanding of the subject matter and related risks?


However, these aren't the only concerns when it comes to toxic data. Stanford University's Human-Centered Artificial Intelligence institute has written about poorly documented AI systems in healthcare, and about a research group that was able to tamper with data and remove evidence of cancer from CT scans.


The questions we should be asking are:

  1. Which bodies are validating the development of AI and data annotation in healthcare systems?

  2. Which bodies are verifying that no tampering with data annotations, code, patterns, or algorithms takes place during the development process?

  3. What happens when the data is tampered with?

  4. What impact could a leak of tampered data have on professionals such as researchers, lawyers, doctors, and surgeons, and who has the overall responsibility?

  5. Why are we focusing only on individual privacy and old-school platform standardization and regulations instead of looking at the big picture of cybersecurity and cryptography in constantly evolving ecosystems?


What could be done?

Tackling the toxic data problem in the data economy is neither a small nor an easy task. One option is to add traceability to the data. If we knew what data was used to produce the results given to us, we could better judge whether they can be trusted. Academia has this as a built-in solution in its articles: the list of sources at the end is the foundation you build your claims on. Traceability would also tell me whether my personal data has been used in an analysis. In short, bringing transparency to the black box of data sources would be one way to tackle this growing problem.
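
To make the traceability idea a bit more concrete, here is a minimal Python sketch of a "reference list" for data. It is only an illustration under assumptions: the `SourceRecord` schema, the function names, and the sample file names and contents are all hypothetical and do not come from any standard or existing product. Each source used in an analysis is recorded with a SHA-256 fingerprint of its exact bytes, so a reader can later see which sources were used and whether any of them has changed since.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One entry in a result's source list (hypothetical schema)."""
    name: str    # human-readable label of the data source
    sha256: str  # fingerprint of the exact bytes that were used


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(sources: dict[str, bytes]) -> list[SourceRecord]:
    """List every source used, sorted by name, with its fingerprint."""
    return [SourceRecord(name, fingerprint(data))
            for name, data in sorted(sources.items())]


def changed_sources(manifest: list[SourceRecord],
                    current: dict[str, bytes]) -> list[str]:
    """Return names of recorded sources that are missing or no longer match."""
    return [rec.name for rec in manifest
            if rec.name not in current
            or fingerprint(current[rec.name]) != rec.sha256]


# Made-up sample data standing in for the real sources of an analysis.
sources = {
    "clinic_visits.csv": b"patient_id,visit\n17,2021-03-02\n",
    "lab_results.csv": b"patient_id,value\n17,4.2\n",
}
manifest = build_manifest(sources)
print(json.dumps([asdict(rec) for rec in manifest], indent=2))

# Simulate someone altering one source after the result was published.
tampered = dict(sources)
tampered["lab_results.csv"] = b"patient_id,value\n17,9.9\n"

print(changed_sources(manifest, sources))   # []
print(changed_sources(manifest, tampered))  # ['lab_results.csv']
```

A plain fingerprint like this only helps if the manifest itself can be trusted, so in practice it would need to be signed or stored somewhere the data holder cannot quietly rewrite. It also says nothing about whether the original data was biased or collected with consent; it only makes the sources visible and later changes detectable.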