Achieving Data Security and Analytics with AI


By Glenn Schmitz, CISO, Virginia Department of Behavioral Health and Developmental Services

In 2020 the world changed when COVID-19 spread into a global pandemic, the likes of which we had not seen in nearly one hundred years. This global crisis quickly became an opportunity for cybercriminals. Having already deployed ransomware at a steady average of around 200 million reported cases per year between 2017 and 2019, these malicious actors stepped up their attacks to over 300 million in 2020, then more than doubled that figure to over 623 million just one year later, according to Statista.com. This is a staggering number, and it calls for immediate action to combat the risks ransomware poses to the data our organizations process every day.

At the Department of Behavioral Health and Developmental Services (DBHDS), we process hundreds of thousands of medical records for patients and recipients of mental health and disability services from the Commonwealth of Virginia. Cyber villains see our data as a prime target for extortion and ransom. Because of this, we took a proactive approach to the threat and began looking for cutting-edge solutions that could protect our most critical data and systems while allowing us to share our data freely with partners and researchers without the threat of a breach.

When I first stepped into the role of CISO at DBHDS, I met with many of the business units in my organization and quickly found a common theme: many of the applications that are vital to our operations depended on production data in the lower dev and test environments. This posed a significant challenge because, due to cost and resource constraints, these lower environments are difficult to secure to the same level as our production environments.

Additionally, because the agency is a state healthcare organization, there is a large demand for our data from partners, academia, and researchers. Due to the highly sensitive nature of our data, this demand comes with complex and costly privacy protection regulations and lengthy data sharing agreements. These restrictions hinder our ability to be agile and responsive in providing best-in-class services to our patients. When we need to share our production data, we usually have to go through lengthy negotiations over data sharing agreements, which can stall the business and delay it from accomplishing its objectives. As a result, security is no longer seen as a value-added business enabler but rather as a hindrance and, at best, a nuisance.

To meet this demand for our data promptly while also protecting our systems, we turned to Artificial Intelligence (AI) to generate synthetic datasets for use in our lower environments and for sharing with our partners. Synthetic data is generated or simulated artificially using techniques such as generative models, statistical algorithms, and simulation methods. The goal is to create a dataset that is similar in distribution and structure to real-world data, but with modifications or adjustments that improve data quality, privacy protection, and scalability. But what exactly is synthetic data?

Synthetic data, as defined by Gartner, is generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real world. Synthetic data is designed to protect sensitive data, improve AI models, and mitigate bias.
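The sampling idea in this definition can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the commercial solution described in this article: the field names and records are made up, each column is assumed to be numeric and roughly normal, and columns are sampled independently, so correlations between fields are lost and no privacy guarantee is implied.

```python
import random
import statistics


def fit_column(values):
    """Summarize a numeric column by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def synthesize(real_rows, n, seed=0):
    """Draw n synthetic rows from an independent normal fit of each column."""
    rng = random.Random(seed)
    columns = list(real_rows[0].keys())
    fits = {c: fit_column([row[c] for row in real_rows]) for c in columns}
    return [
        {c: rng.gauss(mu, sigma) for c, (mu, sigma) in fits.items()}
        for _ in range(n)
    ]


# Hypothetical, made-up records standing in for real data.
real_rows = [
    {"age": 34, "visits": 3},
    {"age": 51, "visits": 7},
    {"age": 45, "visits": 5},
    {"age": 29, "visits": 2},
]

synthetic_rows = synthesize(real_rows, n=1000)
print(synthetic_rows[0])
```

No synthetic row is copied from a real one, yet the averages of the synthetic columns stay close to the originals. Production-grade generators go much further, for example by modeling relationships between fields.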

By harnessing the power of AI, we can synthesize complete “production-like” datasets at scale to meet the needs of the business while ensuring security in the lower environments. Additionally, we can share synthetic data with our partners for analytical research without needing time-consuming data use agreements.

The AI solution we chose generates synthetic datasets from real data, and the result is a true representation of the production data. The AI treats each data field as private and creates an entirely new persona based on the production data, producing a unique dataset that is comparable to the real data and has no impact on data analytics outcomes. Because each data field is considered private and is changed, the original dataset cannot be derived from the synthetic data, and therefore absolute privacy of the production data is achieved.

When evaluating AI and synthetic data solutions, the CISO needs to keep in mind the following:

  1. Is the AI ethical and explainable?

Although we should be able to trust our solutions, we need to verify how and, more importantly, why the AI answers the way it does. We must ensure the machine’s solution is in line with our human ethics and be able to make adjustments when it is not.

  2. Does the solution solve multiple problems for the organization, and possibly for multiple business units?

AI and synthetic data solutions are costly. The best solutions can be applied across multiple business units and bring value and innovation to each of them.

  3. Does the solution answer the security concerns of the organization?

Most importantly, the CISO needs to evaluate whether the solution meets or exceeds the security standards and the security framework that your organization has implemented.

Once the solutions have been evaluated, the CISO has to convince executive leadership and, in many cases, the board of directors that AI and synthetic data are the best way to ensure success and eliminate our problems:

  1. Production data in the lower, less secure dev and test environments.
  2. The need to create quality test data at volume.
  3. Adherence to complex and costly privacy protection regulations.
  4. Lengthy delays caused by data sharing agreements.

It is of the utmost importance to approach any security solution, in this case AI and synthetic data, from the perspective of how the solution will benefit the business. How will it enhance business operations, save money, and save time?

It is imperative to understand your overall security and operational objectives and tie your objectives to the business. To do this, the security professional needs to align the desired outcomes of the solution with the business and get top executive buy-in on its implementation. 

Ensure that all stakeholders in your organization understand that the needs of business operations align with the requirements for security. Be clear on what success looks like, not just from a security perspective but from a business perspective. Be clear on how implementing an AI solution will enhance both the security and the operations of the business and, ultimately, how security will become a business multiplier rather than a business expense.

Be realistic with your timelines and objectives. It is far easier to convince your board and executive leadership to invest in an AI solution when you show that the short- and long-term goals of security and the business are aligned for the best possible outcome.

For instance, in the short term, the implementation of AI will be used for the generation of synthetic data as quality test data in the lower environments to reduce the risk and probability of a data breach if these lower environments become compromised. In the long term, the business can use AI to conduct predictive analytics on the data. In my case, that means fully leveraging and understanding the mountains of healthcare data my organization maintains and evaluating the probability of success for a specific treatment plan. To do this, we developed six objectives to meet both the business and security goals of AI and synthetic data:

  1. Create synthetic data for use in the dev and test environments, eliminating production data from the lower environments.
  2. Create synthetic datasets and analyze them with AI to identify data quality issues.
  3. Create synthetic data for release to researchers and other interested parties.
  4. Analyze synthetic data to identify the best treatment outcomes.
  5. Enable predictive analytics to predict treatment plans and the likelihood of successful outcomes.
  6. Enable predictive analytics to predict demands on resources and services by ingesting population datasets such as census data.

By tying the business objectives to the security objectives, I gained support from business leaders throughout my organization. This ultimately led to the full support of our executive leadership team.

Finally, you should clearly communicate the Return on Investment (ROI) of AI to the business. Remember that Risk = Likelihood of Occurrence x Severity. When you factor in the cost of an incident and the cost to fix it, you can provide your executive leadership with risk data that is quantitative rather than qualitative, which places you in a far better position to communicate what an ROI would look like for your organization.
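The risk formula above can be turned into a simple worked calculation. The sketch below is one common way to frame it, not a prescribed method: every dollar figure and probability is a hypothetical placeholder, severity is expressed as cost, and ROI is assumed to be the annual risk reduction net of the solution cost, divided by the solution cost.

```python
def expected_loss(likelihood, severity):
    """Risk = likelihood of occurrence x severity (severity expressed as cost)."""
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be a probability between 0 and 1")
    return likelihood * severity


# All figures below are hypothetical placeholders, not DBHDS data.
incident_cost = 2_000_000   # cost of an incident plus the cost to fix it
likelihood_before = 0.20    # assumed annual likelihood with production data in dev/test
likelihood_after = 0.05     # assumed annual likelihood after moving to synthetic data
solution_cost = 150_000     # assumed annual cost of the AI solution

risk_before = expected_loss(likelihood_before, incident_cost)
risk_after = expected_loss(likelihood_after, incident_cost)
risk_reduction = risk_before - risk_after

# Assumed ROI definition: net benefit relative to what the solution costs.
roi = (risk_reduction - solution_cost) / solution_cost

print(f"Risk before:    ${risk_before:,.0f}")
print(f"Risk after:     ${risk_after:,.0f}")
print(f"Risk reduction: ${risk_reduction:,.0f}")
print(f"ROI:            {roi:.0%}")
```

With these placeholder numbers, annual risk falls from $400,000 to $100,000, and the $300,000 reduction against a $150,000 cost gives an ROI of 100%. Substituting your own incident costs and likelihood estimates is what makes the conversation with leadership quantitative.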

Once you have buy-in and executive support for a solution, consider the type of synthetic data you want to generate based on your use cases. Fully synthesized data, not based on real-world production data, should be used for your testing and dev environments, whereas synthetic data generated from production data is best used for analytics, data sharing, and monetization. Another consideration is the use of ethical AI.

We wanted an ethical AI that would explain “Why” it arrived at the conclusion or solution it presents back to the human. As we evolve along with AI, it is important to understand how the machine comes to its conclusions. We humans must trust but verify the end solution to ensure it aligns with our humanity.

For example, what if we ask the AI to formulate the best course of treatment to save time and money for a terminal patient? It is a fair question: we want to conserve resources and the doctor’s time so she can treat other patients, and of course save money for both the hospital and the patient. But what if the AI concludes that euthanizing the patient is the best treatment plan? As another example, what if we use AI in national defense to recommend the best course of action in a conflict, and the AI concludes that defending a densely populated city from attack is not in the best interest of the long-term strategy to win the war, and that sacrificing 8 million lives to safeguard 300 million is the best option? AI must remain ethical for humans to trust using it. We should not follow it blindly but understand “How” and “Why” it reached its decision, allowing the human to be “In or On the Loop” and to correct the AI so that it meets our human ethics.

As continuous and evolving threats like ransomware target our most sensitive and critical data, organizations need to embrace the latest technology to help defend against, deter, and mitigate these threats and risks. By employing an AI solution to generate synthetic data, and only using synthetic data in systems for analytics, the CISO can lead the organization to eliminate the threat of ransomware to these systems. If these systems do fall to ransomware, the organization would not need to pay the ransom, and if the data is released to the public, there is no impact because the data is synthetic and cannot be reverse-engineered back to any real data. The risk and liability are reduced to zero. The impact on the business is reduced to the time it takes to build a new environment and generate a new synthetic dataset. The CISO can become a true business enabler and help provide datasets that can be easily shared with partners without the need for lengthy legal agreements and without privacy or regulatory concerns.