The U.S. Environmental Protection Agency tracks the release of nearly 800 hazardous substances, chemicals that companies would quickly phase out if greener, better-performing alternatives could be found. AI now has the potential to give scientists powerful new tools to do just that, by discovering new materials that may be safer for humans and the environment.
Using foundation models pre-trained on vast molecular databases, researchers can screen millions of molecules at once for desirable properties while weeding out molecules with dangerous side effects. The same models can also generate entirely new molecules not found in nature, bypassing the traditional, time-consuming trial-and-error discovery process.
In recent months, IBM Research has released a new family of open-source foundation models on GitHub and Hugging Face. Anyone with a little data can customize the models for their own applications, such as exploring better battery materials for storing electricity from the sun and wind, or alternatives to toxic "forever chemicals" like PFAS, found in everything from nonstick frying pans to the chips inside your laptop or cell phone. In addition to the models themselves, IBM has devised several methods for fusing different molecular representations.
The models can be used alone or in combination, and they have been downloaded more than 100,000 times in a matter of months. "We are encouraged by the strong interest we have seen so far," said Seiji Takeda, the principal investigator who co-leads the Foundation Models for Materials (FM4M) project at IBM Research. "We look forward to seeing what the community comes up with."
Machine-readable molecules
Unlike the words that most large language models deal in, molecules exist in three dimensions, and their physical structure strongly influences their behavior. One of the big challenges in applying AI to chemistry is representing molecular structures in a way that computers can effectively analyze and manipulate.
Various representation styles have emerged over the years. Molecular structures can be summarized as textual SMILES and SELFIES strings; as molecular graphs made of atomic "nodes" and bond "edges"; as numerical values representing the relative strength of a physical property; or as spectra that show how a target molecule interacts with light.
Each format has its strengths and limitations for AI classification and prediction tasks. SMILES datasets are the world's largest, with approximately 1.1 billion molecules represented as text strings. But because SMILES strings reduce 3D molecules to lines of text, valuable structural information is lost, and AI models trained on them can produce invalid molecules. A related textual format, SELFIES, provides a more robust and flexible grammar for representing valid molecules, but like other text-based representations it also lacks 3D information.
Molecular graphs, in contrast, capture the spatial arrangement of atoms and their bonds, but that level of detail is computationally expensive to obtain. Data collected through experiments and simulations can also be very useful, but it has drawbacks of its own: experimental data used to train AI models in chemistry can be incomplete or contain errors. For example, measurements of how molecules interact with the electromagnetic spectrum may cover only visible light and exclude infrared and ultraviolet, distorting the AI model's view of the world.
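To make these formats concrete, the same molecule, ethanol, can be written as a SMILES string, a SELFIES string, or an atom-and-bond graph. The sketch below is a minimal, self-contained illustration; the graph encoding and the tokenizer are simplified for clarity and are not the encodings used by IBM's models:

```python
import re

# The same molecule (ethanol) in three machine-readable formats.
smiles = "CCO"         # linear text: carbon-carbon-oxygen chain
selfies = "[C][C][O]"  # SELFIES: every symbol is bracketed; the grammar guarantees validity

# A simplified molecular graph: atoms as nodes, bonds as (i, j, bond_order) edges.
graph = {
    "nodes": ["C", "C", "O"],
    "edges": [(0, 1, 1), (1, 2, 1)],  # single bonds C-C and C-O
}

def selfies_tokens(s):
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\]]*\]", s)

print(selfies_tokens(selfies))  # ['[C]', '[C]', '[O]']
```

Because SELFIES decomposes cleanly into bracketed symbols like these, a generative model that emits any sequence of valid symbols still produces a decodable molecule, which is the robustness advantage mentioned above.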
IBM researchers weighed the strengths and weaknesses of each representation as they planned their materials foundation models. Ultimately, they pre-trained a separate model on each modality.
SMILES-TED and SELFIES-TED (short for "transformer encoder-decoder") were pre-trained on 91 million SMILES strings and 1 billion SELFIES strings, respectively, drawn from validated samples in the PubChem and ZINC-22 databases. MHG-GED (which stands for "molecular hypergraph grammar with graph-based encoder-decoder") was pre-trained on 1.4 million SMILES-based graphs containing atomic numbers and charges.
Mixture of experts (MoE) is a common AI architecture that uses a router to selectively activate subsets of a model's weights for different tasks, allowing large models to be served more efficiently. An MoE takes an incoming query from the user and passes it to a routing algorithm that decides which "expert" is best suited for the job.
IBM researchers used the MoE concept to combine the complementary strengths of the SMILES-, SELFIES-, and molecular-graph-based models. In recent work presented at the 2024 NeurIPS conference in Vancouver, they showed that combining the embeddings of the three data modalities in a "multi-view" MoE architecture can outperform other leading molecular models built on a single modality.
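A minimal sketch of the "multi-view" idea: embeddings of the same molecule from the three modality models are blended by a gating function that assigns each view a weight per input. The vectors, dimensions, and gate scores below are invented for illustration; this is not the FM4M architecture itself, just the gated-fusion pattern it builds on:

```python
import math

def softmax(xs):
    """Convert raw gate scores into mixture weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_views(views, gate_scores):
    """Weighted sum of per-modality embeddings (all the same length),
    with the mixture weights given by a softmax over the gate scores."""
    weights = softmax(gate_scores)
    dim = len(views[0])
    fused = [sum(w * v[i] for w, v in zip(weights, views)) for i in range(dim)]
    return fused, weights

# Toy 4-dimensional embeddings of one molecule from the three "experts".
smiles_emb  = [0.2, 0.1, 0.0, 0.5]
selfies_emb = [0.3, 0.0, 0.1, 0.4]
graph_emb   = [0.0, 0.6, 0.2, 0.1]

# In a real MoE the gate scores come from a small learned network;
# here they are fixed so the SMILES view dominates.
fused, weights = fuse_views([smiles_emb, selfies_emb, graph_emb],
                            gate_scores=[2.0, 1.0, 0.5])
print(weights)  # largest weight on the SMILES view
```

A downstream prediction head then operates on the fused embedding, so the router can lean on whichever modality carries the most signal for a given task.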
They tested the MoE on MoleculeNet, a benchmark created at Stanford that reflects tasks commonly used in drug and materials discovery. The benchmark includes a wide variety of classification tasks, such as predicting whether a given molecule is toxic, and regression tasks, such as predicting how soluble a given molecule is in water. The researchers found that the multi-view MoE outperformed other leading models on both styles of task.
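To make the two task styles concrete, here is a toy evaluation in that spirit: accuracy for a toxicity classification task and root-mean-square error for a solubility regression task. The labels and predictions are invented, and MoleculeNet itself typically reports ROC-AUC for classification; this sketch only shows the shape of the two evaluations:

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of classification labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error for a regression task."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Classification: is each molecule toxic (1) or not (0)?
tox_true = [1, 0, 0, 1]
tox_pred = [1, 0, 1, 1]
print(accuracy(tox_true, tox_pred))  # 0.75

# Regression: log-solubility in water (made-up values).
sol_true = [-0.77, -3.30, -2.06]
sol_pred = [-0.90, -3.10, -2.00]
print(round(rmse(sol_true, sol_pred), 3))  # 0.142
```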
The MoE approach also offers insight into which data representations pair best with which types of tasks. The researchers found, for example, that the MoE favors the SMILES- and SELFIES-based models for some types of tasks while drawing on all three modalities equally for others, showing that the graph-based model adds predictive value for specific problems.
"This pattern of expert activation suggests that the MoE can effectively adapt to specific tasks and improve performance," said Emilio Vital Brasil, a senior scientist at IBM Research who co-leads the project.
IBM researchers will demonstrate the foundation models and their capabilities at the Association for the Advancement of Artificial Intelligence (AAAI) conference in February. Over the next year, they plan to release new fusion techniques and models built on additional data modalities, such as the positions of atoms in 3D space.
Through the AI Alliance, IBM is also collaborating with other researchers in academia and industry to accelerate the discovery of safer, more sustainable materials. This spring, IBM and the Japanese materials company JSR launched the alliance's working group on materials (WG4M), which has drawn approximately 20 corporate and academic partners to date.
The group focuses on developing new foundation models, datasets, and benchmarks that can be applied to a variety of problems, from reusable plastics to the materials needed to support renewable energy. "There's no time to waste," says Dave Braines, CTO of emerging technology at IBM Research UK. "New, more sustainable materials are needed in nearly every industry, from semiconductor manufacturing to clean energy. AI is giving us the power to double down on our creativity."