MolSearchGPT - A chemical similarity search using AI-generated embeddings

Through my work at Frontier Medicines, an early-stage covalent drug discovery company, I developed a similarity search tool for a space of 15 billion synthesizable molecules.

To do this, I represented molecules as vector embeddings using a proprietary transfomer model fine-tuned according to Frontier’s use case. In addition, I featurized every molecule in the search space to allow multiple search fields and relevant filters by running computational workloads on distributed clusters, and stored this rich feature information using a data warehouse on AWS.

The end product was offered to chemists using a UI I built with Streamlit that kickstarted an AWS SageMaker pipeline job. The pipeline job ran each component of the workflow, including the parsing of query molecules, the vector similarity search, metadata retrieval from the data warehouse, and appropriate clustering and filtering.

More details about how I leveraged vector databases for semantic search can be found on the Pinecone website, detailing Frontier’s adoption of their serverless product in an early-access program.

A bite-sized version of the project intended as a proof-of-concept can be found on my GitHub.