Challenges in Validating Machine Learning and Artificial Intelligence Models

By Maria Matkovska, Director of Model Validation, Model Risk Management, First Citizens Bank

Machine-learning (ML) techniques are not new; some of the underlying concepts were proposed in the 1940s (Hardesty, 2017). Many institutions across industries have implemented ML methods, or at least prototypes of artificial intelligence (AI) algorithms, and the use of both continues to grow rapidly. In some industries, however, including the financial sector, such techniques have been adopted only recently, despite their ability to deliver value to data-rich institutions, especially in fraud detection, credit risk management and anti-money laundering. ML modeling was once viewed as a luxury, but over time many institutions have turned to these complex techniques not only as benchmarks or supporting models but as champion models as well. The accuracy of the results, the dynamic nature of the models and the amount of data they can process offer enormous potential benefits to the financial sector. However, the opaque nature of ML models also makes reviewing and validating them a significant challenge.

In the banking industry, each statistical model must be reviewed and validated by an independent assurance unit within the second line of defense, typically a model validation team. These teams review a model and decide whether to accept it for use by the bank, following the U.S. Federal Reserve’s guidance on how a model is defined and reviewed. SR 11-7: Guidance on Model Risk Management is regulatory guidance issued by the Federal Reserve that sets out best practices for model review and challenge by model risk management teams. It defines a model as “a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates” (Board of Governors, 2011). During a model validation, the reviewing team examines various aspects of a model. However, such a review can be limited and difficult for opaque models, which is the case for ML models. The challenges arise in areas that include, but are not limited to, conceptual soundness, outcomes analysis testing and ongoing monitoring plans. These points are developed below.

  • Conceptual Soundness

Assessing conceptual soundness means evaluating the mathematical construct and design of a model, its empirical evidence, model drivers and statistical assumptions, with the goal of understanding the model’s foundation. Because of their sophisticated architecture, ML models are typically more complex and opaque than traditional statistical algorithms. This makes it considerably harder to understand, verify and validate the variable selection process, the estimation process and the explainability of the results. Even if an ML model performs better than a traditional model, a lack of explainability may cause its use to be restricted.
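One way a reviewer might probe the drivers of an opaque model is permutation importance: shuffle one input at a time and measure how much the prediction error grows. The sketch below is illustrative only and is not drawn from SR 11-7. The `black_box` function, the synthetic data and all names are hypothetical stand-ins for a real model and portfolio.

```python
import random


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, rows, actual, n_features, seed=0):
    """Return the increase in MSE when each feature column is shuffled in turn.

    A larger increase suggests the model relies more heavily on that feature.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(actual, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        # Shuffle column j only, leaving all other columns untouched.
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        shuffled_error = mean_squared_error(
            actual, [predict(r) for r in shuffled_rows]
        )
        importances.append(shuffled_error - baseline)
    return importances


# Hypothetical opaque model: depends on feature 0 and ignores feature 1.
def black_box(row):
    return 3.0 * row[0] ** 2


# Synthetic data, assumed purely for illustration.
data_rng = random.Random(42)
rows = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
actual = [black_box(r) for r in rows]

importances = permutation_importance(black_box, rows, actual, n_features=2)
for index, value in enumerate(importances):
    print(f"feature {index}: increase in MSE = {value:.4f}")
```

In this toy setup the ignored feature shows no increase in error, while the feature the model depends on shows a clear one. A result like this indicates which inputs matter to the model, but it does not by itself explain why they matter.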

  • Outcomes Analysis Testing

Outcomes analysis testing compares the model’s predictions with actual values to assess the accuracy and reliability of the model’s forecasts. While calculating the error between realized and predicted values may not be difficult, understanding the reasons behind a deviation produced by a complex ML model is not as simple as it is for a linear or logistic regression model. Standard sensitivity testing may therefore be difficult to interpret. ML models also run the risk of being overly specific or, alternatively, too general in predicting results. These problems are known as overfitting (high variance) and underfitting (high bias), respectively, and balancing them is the bias-variance tradeoff.
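The comparison of realized and predicted values, and a rough screen for overfitting or underfitting, can be sketched as follows. This is a minimal illustration rather than a prescribed test. The `max_error` and `max_gap_ratio` thresholds and the sample figures are assumptions chosen for the example, and a real validation would set them according to the model’s purpose and materiality.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between realized and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def diagnose_fit(train_error, holdout_error, max_error, max_gap_ratio=1.5):
    """Rough screen using hypothetical thresholds.

    - High error on the training data itself points to underfitting (high bias).
    - Holdout error far above training error points to overfitting (high variance).
    """
    if train_error > max_error:
        return "possible underfitting"
    if holdout_error > max_gap_ratio * train_error:
        return "possible overfitting"
    return "no obvious fit problem"


# Illustrative figures only: the model tracks the training sample closely
# but misses by a wider margin on data it has not seen.
train_actual = [1.0, 2.0, 3.0, 4.0]
train_predicted = [1.1, 1.9, 3.1, 3.9]
holdout_actual = [5.0, 6.0]
holdout_predicted = [5.5, 6.6]

train_error = rmse(train_actual, train_predicted)
holdout_error = rmse(holdout_actual, holdout_predicted)
verdict = diagnose_fit(train_error, holdout_error, max_error=0.3)
print(f"train RMSE={train_error:.3f}, holdout RMSE={holdout_error:.3f}: {verdict}")
```

A screen like this flags the symptom only. Explaining why the error arises in a complex ML model remains the harder part of outcomes analysis.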

  • Ongoing Monitoring

When reviewing an ongoing monitoring process, the validator must confirm that the model’s performance is monitored after implementation and during use in production so that any deterioration over time is identified. Various metrics, with supporting thresholds, are used to monitor performance on a routine basis and at a frequency commensurate with model use. Most ML systems are dynamic in nature, as they re-estimate the models on a regular basis. As a result, it can be challenging to monitor the model’s performance, as the metrics and the models themselves may change as the ML system adapts without human intervention. The complex and opaque nature of ML models therefore presents obstacles to a reviewer that traditional statistical methodologies do not.
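As one illustration of a monitoring metric with supporting thresholds, the sketch below computes a population stability index (PSI) that compares the score distribution at development with the distribution observed in production. PSI is not named in this article and is used here only as a commonly cited example. The amber and red cut-offs of 0.10 and 0.25 are rule-of-thumb assumptions, and the bucket counts are invented.

```python
import math


def population_stability_index(expected_counts, actual_counts, floor=1e-6):
    """PSI across matching score buckets.

    expected_counts: observations per bucket at model development.
    actual_counts:   observations per bucket in the monitoring period.
    floor:           minimum share, to avoid taking the log of zero.
    """
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        expected_share = max(expected / expected_total, floor)
        actual_share = max(actual / actual_total, floor)
        psi += (actual_share - expected_share) * math.log(
            actual_share / expected_share
        )
    return psi


def monitoring_status(psi, amber=0.10, red=0.25):
    """Map a PSI value to a traffic-light status (thresholds are assumptions)."""
    if psi >= red:
        return "red"
    if psi >= amber:
        return "amber"
    return "green"


# Invented bucket counts for illustration.
development = [100, 200, 400, 200, 100]
stable_period = [100, 200, 400, 200, 100]
shifted_period = [300, 300, 200, 100, 100]

stable_status = monitoring_status(
    population_stability_index(development, stable_period)
)
shifted_status = monitoring_status(
    population_stability_index(development, shifted_period)
)
print(f"stable period: {stable_status}, shifted period: {shifted_status}")
```

For a model that re-estimates itself, the development baseline may itself move, so a reviewer would also need to decide which version of the model the comparison is anchored to.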


The challenges discussed in this article concern the assessment of model methodology, outcomes analysis and ongoing monitoring. The ‘black-box’ nature of ML models makes it difficult to understand the specific variables and design used by the model. Because the drivers of the model and the effects of changes in them are unknown, model results are hard to interpret and explain. Monitoring the stability and potential deterioration of such a dynamic model can also be challenging because the model is constantly evolving. While ML models are still relatively new to the financial sector and present challenges during assessment and validation, they also offer enormous potential for improvements to the banking industry.

Board of Governors of the Federal Reserve System. (2011, April 4). SR 11-7: Guidance on Model Risk Management. Board of Governors of the Federal Reserve System. Retrieved August 2, 2022, from
Hardesty, L. (2017, April 14). Explained: Neural networks. MIT News on Campus and Around the World. Retrieved August 2, 2022, from