WHAT IS IT ABOUT?
Diffusion models have mainly been used for image and video generation. Recently, their use has been extended to new domains, such as chemistry, where they generate new molecules. In our analysis, we aimed for generality and set out to explain diffusion models for the design of molecular linkers, which have a range of applications.
WHAT IS A “LINKER”?
A linker is a substructure of a molecule that connects two or more disconnected fragments of atoms. Linker design is an important task in drug development, as it plays a central role in the design of effective molecules with specific properties.
HOW DO DIFFUSION MODELS WORK IN PRINCIPLE?
Diffusion models learn a data distribution and generate new data by sampling from that distribution. The diffusion model itself is an advanced AI model. We try to understand its generative process.
HOW DOES “NOISE” COME INTO PLAY?
Adding and removing noise is the hallmark of diffusion models. Starting from a sample in the dataset (an image or, in our case, a molecule), they add “noise” until the original sample is “destroyed,” like the transition from a detailed image to a “TV static effect.” The model then learns how the added noise must be removed to recover a valid sample, and in doing so generates a new image (or molecule).
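The noising step can be illustrated with a minimal sketch. It assumes a standard Gaussian (DDPM-style) forward process with a linear noise schedule, which may differ from the model used in the study. The function name `forward_noise`, the schedule values, and the three-atom toy coordinates are all hypothetical and chosen only for illustration.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Sample a noised version x_t of x0 at step t (assumed Gaussian forward process)."""
    eps = rng.standard_normal(x0.shape)  # the "noise" that is added
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)

# Hypothetical toy "molecule": 3D coordinates of three atoms.
coords = np.array([[0.0, 0.0, 0.0],
                   [1.5, 0.0, 0.0],
                   [1.5, 1.5, 0.0]])

# Assumed linear noise schedule over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of original signal variance kept

x_early, _ = forward_noise(coords, 10, alpha_bar, rng)   # still close to the molecule
x_late, _ = forward_noise(coords, 999, alpha_bar, rng)   # essentially pure noise

print("signal kept at step 10: ", np.sqrt(alpha_bar[10]))
print("signal kept at step 999:", np.sqrt(alpha_bar[999]))
```

In a trained model, a neural network would be fitted to predict `eps` from `x_t` and `t`, and generation would run the process in reverse, starting from pure noise. That part is omitted here.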
HOW DID YOU PROCEED?
For our study, we selected a state-of-the-art diffusion model for linker design and developed a novel explainability strategy that extends a well-known concept in the field of explainable artificial intelligence: Shapley values. For our method, DiffSHAPer, we adapted the Shapley value formalism, which is widely used for explaining machine learning predictions, to diffusion models. Our goal was to find which fragment atoms were the most influential for linker generation.
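To make the Shapley value idea concrete, here is a minimal sketch of the classical exact computation on a toy example. It is not the DiffSHAPer implementation: the atom labels and the value function `toy_value` are hypothetical stand-ins for whatever quantity is actually derived from the diffusion model's output. Each "player" is a fragment atom, and its Shapley value is its average marginal contribution over all coalitions of the other atoms.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (feasible only for few players)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when it joins coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi

def toy_value(coalition):
    """Hypothetical score for a set of fragment atoms (illustration only)."""
    score = 0.0
    if "a1" in coalition:
        score += 1.0
    if "a2" in coalition:
        score += 0.5
    if "a1" in coalition and "a2" in coalition:
        score += 0.5  # interaction term, shared equally by a1 and a2
    return score

atoms = ["a1", "a2", "a3"]
phi = shapley_values(atoms, toy_value)
print(phi)
```

Here atom `a3` never changes the score and therefore receives a value of zero, while the attributions of all atoms sum to the score of the full set. Exact enumeration grows exponentially with the number of atoms, so practical applications typically rely on sampling-based approximations.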
WHAT IS THE MOST IMPORTANT FINDING?
We found that, to generate chemically valid linkers, diffusion models do not learn or exploit chemical principles but instead rely mostly on distance constraints between atoms. They therefore capture recurring statistical patterns in the data without learning generalizable chemical rules.
WHAT WAS THE BIGGEST CHALLENGE?
From a computational perspective, running inference with diffusion models and explaining what they generate are time-consuming tasks. From a methodological perspective, our approach is new, so we had to find the best way to present our results effectively.
IS THERE AN APPLICATION?
Our methodology can be used to understand what molecular diffusion models learn. In the specific case of linker design, it’s useful to determine what drives the generation of the linker. Linkers are important in drug design, as they can improve critical molecular properties (such as potency and stability). Consequently, a linker generated solely based on distance and geometric constraints does not guarantee optimization of properties or practical chemical utility.
WHAT ARE THE NEXT STEPS?
The first step would be to apply DiffSHAPer to molecular diffusion models tailored to different tasks. Future research will focus on developing models that can include more chemical context in their internal reasoning.