Comprehensive Guide to Molecular Representation Language Models: From Proteins to Small Molecules
molecular-representation language-models protein-modeling small-molecules transformers deep-learning ai-drug-discovery

Molecular representation learning has become a core technology in computational chemistry and bioinformatics. Following the success of the Transformer architecture in natural language processing, researchers have applied it to representation learning on molecular data with remarkable progress. This guide surveys language models ranging from proteins and peptides to small molecules, providing a complete technical stack with practical code.

Environment Setup

Installing Base Dependencies

# Install PyTorch (adjust the index URL for your CUDA version)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# HuggingFace Transformers
pip install transformers

# Check GPU availability
python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'GPU Count: {torch.cuda.device_count()}')"

Optional: Set Model Cache Paths

import os
os.environ['TORCH_HOME'] = '/your/path/to/model'
os.environ['HF_HOME'] = '/your/path/to/hf_model'

1. Protein Language Models

1.1 The ESM-2 Family

Model Overview

ESM-2 (Evolutionary Scale Modeling) is a large-scale protein language model developed by Meta [1]. Pretrained on protein sequences at evolutionary scale, it captures evolutionary and structural information about proteins.

Available Model Sizes

Model name  Layers  Parameters  Model size
esm2_t48_15B_UR50D 48 15B ~60GB
esm2_t36_3B_UR50D 36 3B ~12GB
esm2_t33_650M_UR50D 33 650M ~2.5GB
esm2_t30_150M_UR50D 30 150M ~600MB
esm2_t12_35M_UR50D 12 35M ~140MB
esm2_t6_8M_UR50D 6 8M ~32MB

Installation and Usage

pip install fair-esm
import torch
import esm

# Check GPUs
print("Number of GPUs:", torch.cuda.device_count())

# Load the model (pick a size that fits your hardware)
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disable dropout for deterministic results

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Prepare sequence data
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]

# Batch conversion
batch_labels, batch_strs, batch_tokens = batch_converter(data)
batch_tokens = batch_tokens.to(device)
batch_lens = (batch_tokens != alphabet.padding_idx).sum(1)

# Extract representations (layer 33 is the final layer; the "t33" in the model name gives the layer count)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)

# Token-level representations (one embedding per amino acid)
token_representations = results["representations"][33]

# Sequence-level representations (one embedding per protein)
sequence_representations = []
for i, tokens_len in enumerate(batch_lens):
    # Strip the special tokens at the start and end
    seq_repr = token_representations[i, 1 : tokens_len - 1].mean(0)
    sequence_representations.append(seq_repr)

print(f"Token representation shape: {token_representations.shape}")
print(f"Sequence representation shape: {sequence_representations[0].shape}")

Advanced Usage: Attention Weights and Contact Prediction

# Get attention weights and contact predictions
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)

# Contact predictions (useful for protein structure prediction)
contacts = results["contacts"]
print(f"Contacts shape: {contacts.shape}")

# Attention weights
attentions = results["attentions"]
print(f"Attention shape: {attentions.shape}")

1.2 ESM-C (ESM Cambrian)

Model Overview

ESM-C is the companion model in the ESM3 family that focuses on representation learning [2]. Compared with ESM-2 at the same parameter count, it delivers stronger performance with lower memory consumption, and it is designed as a drop-in replacement for ESM-2 with significant efficiency gains.

Performance Comparison

ESM-C parameters  Comparable ESM-2  ESM-C advantage
300M 650M Lower memory use, faster inference
600M 3B Matches or exceeds the performance of much larger ESM-2 models at lower cost
6B - Far exceeds the best ESM-2 models

Installation and Usage

pip install esm

Method 1: Local Inference with the ESM SDK (Recommended)

from esm.sdk.api import ESMProtein, LogitsConfig
from esm.models.esmc import ESMC

# Create a protein object
protein = ESMProtein(sequence="MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")

# Load the model (if you hit a tokenizer error, see the fix below or use Method 2)
try:
    client = ESMC.from_pretrained("esmc_600m").to("cuda")  # or "cpu"
    
    # Encode the protein
    protein_tensor = client.encode(protein)
    
    # Get logits and embeddings
    logits_output = client.logits(
        protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
    )
    
    print(f"Logits shape: {logits_output.logits.sequence.shape}")
    print(f"Embeddings shape: {logits_output.embeddings.shape}")
    
    # Extract a sequence-level representation
    sequence_embedding = logits_output.embeddings.mean(dim=1)  # mean pooling
    print(f"Sequence embedding shape: {sequence_embedding.shape}")
    
except AttributeError as e:
    print(f"ESM-C error: {e}")
    print("Pin the esm version as described below, or use Method 2")

If you see the following error:

ESM-C error: property 'cls_token' of 'EsmSequenceTokenizer' object has no setter

pin the esm version, as suggested in https://github.com/evolutionaryscale/esm/issues/214:

pip install esm==3.1.1

The expected output looks like:

Logits shape: torch.Size([1, 67, 64])
Embeddings shape: torch.Size([1, 67, 1152])
Sequence embedding shape: torch.Size([1, 1152])

Method 2: Remote API (Registration Required)

from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, LogitsConfig

# Register at https://forge.evolutionaryscale.ai first to obtain an access token
forge_client = ESM3ForgeInferenceClient(
    model="esmc-6b-2024-12", 
    url="https://forge.evolutionaryscale.ai", 
    token="<your_forge_token>"
)

protein = ESMProtein(sequence="MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")
protein_tensor = forge_client.encode(protein)
logits_output = forge_client.logits(
    protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
)

print(f"Remote embeddings shape: {logits_output.embeddings.shape}")

1.3 CARP

Model Overview

CARP (Convolutional Autoencoding Representations of Proteins) is a protein language model developed by Microsoft Research [3]. It uses a dilated convolutional encoder trained with a masked language modeling objective and is competitive with Transformer-based protein models on sequence modeling tasks.

Installation and Usage

Online installation

pip install git+https://github.com/microsoft/protein-sequence-models.git

Offline installation

  1. Download the repository: https://github.com/microsoft/protein-sequence-models
  2. Unpack and install:
    cd /path/to/protein-sequence-models
    pip install .
    

Code Example

import torch
from sequence_models.pretrained import load_model_and_alphabet

# Load the model and the sequence collater
model, collater = load_model_and_alphabet('carp_640M')

# Prepare sequences (note: a nested list is required)
seqs = [['MDREQ'], ['MGTRRLLP'], ['MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG']]

# Convert the sequences into model input
x = collater(seqs)[0]  # (n, max_len)

# Get representations from layer 56
with torch.no_grad():
    rep = model(x)['representations'][56]  # (n, max_len, d_model)

print(f"Input shape: {x.shape}")
print(f"Representation shape: {rep.shape}")

# Sequence-level representation (mean pooling)
sequence_repr = rep.mean(dim=1)
print(f"Sequence representation shape: {sequence_repr.shape}")

1.4 ProtT5

Model Overview

ProtT5 is a protein language model based on the T5 architecture [4]. It uses an encoder-decoder design, is pretrained on large-scale protein data, and supports a wide range of downstream tasks.

Loading the Model from a Local Path

import torch
import re
from transformers import T5Tokenizer, T5EncoderModel

# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Local model path (if already downloaded)
tokenizer_path = '/your/path/to/prot_t5_xl_half_uniref50-enc/'

# Load the tokenizer
try:
    tokenizer = T5Tokenizer.from_pretrained(tokenizer_path, do_lower_case=False)
    print(f"Tokenizer loaded from local path: {tokenizer_path}")
except OSError:
    # Fall back to downloading from HuggingFace if the local path does not exist
    tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)
    print("Tokenizer loaded from HuggingFace")

# Load the encoder model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# Example protein sequences
sequence_examples = ["PRTEINO", "SEQWENCE"]

# Preprocess: map rare amino acids to X and insert spaces between residues
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# Tokenization
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="longest", return_tensors="pt")
input_ids = ids['input_ids'].to(device)
attention_mask = ids['attention_mask'].to(device)

# Generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# Extract per-residue embeddings for each sequence
emb_0 = embedding_repr.last_hidden_state[0, :7]  # first sequence (7 residues)
emb_1 = embedding_repr.last_hidden_state[1, :8]  # second sequence (8 residues)

print("Shape of embedding for sequence 1:", emb_0.shape)
print("Shape of embedding for sequence 2:", emb_1.shape)
print("Protein embeddings generated successfully!")

1.5 Ankh

Model Overview

Ankh is a general-purpose protein language model based on the T5 encoder-decoder architecture [5]. It is pretrained on large-scale protein sequence data (UniRef50) and supports a variety of protein representation tasks.

Code Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Local model path
local_model_path = "/your/path/to/ankh-large/"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(local_model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(local_model_path)

# Example sequence
sequence_examples = ["MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"]
inputs = tokenizer(sequence_examples, return_tensors="pt", padding=True)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate encoder embeddings
with torch.no_grad():
    encoder_outputs = model.encoder(**inputs)
    embeddings = encoder_outputs.last_hidden_state

# Keep only the non-padding positions of the first sequence
emb_0 = embeddings[0, :inputs['attention_mask'][0].sum()]

print("Shape of encoder embeddings for sequence 1:", emb_0.shape)
print("Model loaded successfully from:", local_model_path)

2. Peptide Language Models

2.1 PepBERT

Model Overview

PepBERT is a BERT-style language model designed specifically for peptide sequences [6]. It is optimized for short peptides and performs well on tasks such as peptide-protein interaction prediction.

Model Highlights

  • Designed for peptide sequences, which are typically short
  • BERT-style architecture trained with masked language modeling
  • Pretrained on large-scale peptide sequences from the UniParc database
  • Output dimension: 320

Installation and Usage

import os
import torch
import importlib.util
from tokenizers import Tokenizer

# Set model cache environment variables
os.environ['TORCH_HOME'] = '/home/gxf1212/data/local-programs/model'
os.environ['HF_HOME'] = '/home/gxf1212/data/local-programs/hf_model'

# Local snapshot path of the downloaded PepBERT model
snapshot_path = "/home/gxf1212/data/local-programs/hf_model/hub/models--dzjxzyd--PepBERT-large-UniParc/snapshots/7b0cbb2f925d05c9fca42c63c1712f94200fdb41"

def load_module_from_local(file_path):
    """Load a Python module from a local file."""
    module_name = os.path.splitext(os.path.basename(file_path))[0]
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# 1) Dynamically load the model and config modules
model_module = load_module_from_local(os.path.join(snapshot_path, "model.py"))
config_module = load_module_from_local(os.path.join(snapshot_path, "config.py"))
build_transformer = model_module.build_transformer
get_config = config_module.get_config

# 2) Load the tokenizer
tokenizer_path = os.path.join(snapshot_path, "tokenizer.json")
tokenizer = Tokenizer.from_file(tokenizer_path)

# 3) Path to the pretrained weights
weights_path = os.path.join(snapshot_path, "tmodel_17.pt")

# 4) Build the model
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_built() else "cpu"
config = get_config()
model = build_transformer(
    src_vocab_size=tokenizer.get_vocab_size(),
    src_seq_len=config["seq_len"],
    d_model=config["d_model"]
)

# Load the pretrained weights
state = torch.load(weights_path, map_location=torch.device(device))
model.load_state_dict(state["model_state_dict"])
model.eval()

# 5) Generate embeddings
def get_peptide_embedding(sequence):
    """Generate an embedding for a peptide sequence."""
    # Add the special [SOS] and [EOS] tokens
    encoded_ids = (
        [tokenizer.token_to_id("[SOS]")]
        + tokenizer.encode(sequence).ids
        + [tokenizer.token_to_id("[EOS]")]
    )
    
    input_ids = torch.tensor([encoded_ids], dtype=torch.int64)
    
    with torch.no_grad():
        # Build the attention mask
        encoder_mask = torch.ones((1, 1, 1, input_ids.size(1)), dtype=torch.int64)
        
        # Forward pass to get token embeddings
        emb = model.encode(input_ids, encoder_mask)
        
        # Drop the special-token embeddings
        emb_no_special = emb[:, 1:-1, :]
        
        # Mean-pool to get a sequence-level representation
        emb_avg = emb_no_special.mean(dim=1)
    
    return emb_avg

# Example usage
sequence = "KRKGFLGI"
embedding = get_peptide_embedding(sequence)
print("Shape of peptide embedding:", embedding.shape)  # (1, 320)
print("Peptide embedding generated successfully!")

3. Small-Molecule Language Models

3.1 The ChemBERTa Family

Model Overview

ChemBERTa was the first large-scale BERT-style model for molecules [7]. Pretrained on 77 million PubChem SMILES with a masked language modeling objective, it provides strong pretrained representations for molecular property prediction.

Main Versions

  • ChemBERTa-77M-MLM: pretrained with masked language modeling on 77M molecules
  • ChemBERTa-2: improved version with multi-task pretraining support
  • Parameter count: on the order of tens of millions (the "77M" in the checkpoint name refers to the number of training molecules, not parameters)

Installation and Usage

# Install dependencies
pip install transformers torch rdkit
from transformers import AutoTokenizer, AutoModel
import torch
from rdkit import Chem

# Load the pretrained model
model_name = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

def get_molecular_embedding(smiles_list):
    """Compute ChemBERTa embeddings for a list of SMILES."""
    # Tokenization
    inputs = tokenizer(smiles_list, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    
    # Forward pass
    with torch.no_grad():
        outputs = model(**inputs)
        # Use the [CLS] token representation as the molecule-level embedding
        molecular_embeddings = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        # Alternatively, use mean pooling:
        # molecular_embeddings = outputs.last_hidden_state.mean(dim=1)
    
    return molecular_embeddings

# Example usage
smiles_examples = [
    "CCO",  # ethanol
    "CC(=O)O",  # acetic acid
    "c1ccccc1",  # benzene
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # caffeine
]

# Check SMILES validity
valid_smiles = []
for smi in smiles_examples:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        valid_smiles.append(smi)
    else:
        print(f"Invalid SMILES: {smi}")

# Generate embeddings
embeddings = get_molecular_embedding(valid_smiles)
print(f"Generated embeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")

# Embedding of a single molecule
single_embedding = get_molecular_embedding(["CCO"])
print(f"Single molecule embedding shape: {single_embedding.shape}")

Advanced Usage: Fine-tuning ChemBERTa

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch.nn as nn

# Load a model head for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    "DeepChem/ChemBERTa-77M-MLM",
    num_labels=2  # binary classification
)

# Prepare the dataset and training arguments
class MolecularDataset(torch.utils.data.Dataset):
    def __init__(self, smiles_list, labels, tokenizer, max_length=512):
        self.smiles_list = smiles_list
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.smiles_list)
    
    def __getitem__(self, idx):
        smiles = self.smiles_list[idx]
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            smiles,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Fine-tuning example (requires your own training data)
# training_args = TrainingArguments(
#     output_dir='./results',
#     num_train_epochs=3,
#     per_device_train_batch_size=16,
#     per_device_eval_batch_size=64,
#     warmup_steps=500,
#     weight_decay=0.01,
#     logging_dir='./logs',
# )
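
To complete the loop, the dataset class above can be handed to the HuggingFace Trainer. A minimal sketch, kept commented out like the block above; train_smiles, train_labels, eval_smiles and eval_labels are placeholder variables you would prepare yourself:

# train_dataset = MolecularDataset(train_smiles, train_labels, tokenizer)
# eval_dataset = MolecularDataset(eval_smiles, eval_labels, tokenizer)
#
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
# )
# trainer.train()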

3.2 The MolFormer Family

Model Overview

MolFormer is a large-scale chemical language model developed by IBM [8]. Pretrained on 1.1 billion molecules, it uses linear attention and rotary positional embeddings and reaches state-of-the-art performance on several molecular property prediction benchmarks.

Model Highlights

  • Pretraining data: 1.1 billion molecules (PubChem + ZINC)
  • Architecture: linear-attention Transformer with rotary positional embeddings
  • Efficiency: linear time complexity, supports long sequences
  • Performance: outperforms GNN models on several benchmark datasets

Installation and Usage

# Install the MolFormer code from source (optional; the pretrained checkpoint below is loaded from the HuggingFace Hub)
git clone https://github.com/IBM/molformer.git
cd molformer
pip install -e .

import torch
from transformers import AutoModel, AutoTokenizer

# Load the pretrained model and tokenizer
# (the checkpoint ships custom modeling code, so trust_remote_code=True is required)
model_path = "ibm/MoLFormer-XL-both-10pct"  # HuggingFace model id
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, deterministic_eval=True, trust_remote_code=True)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()

def get_molformer_embedding(smiles_list, max_length=512):
    """Compute MolFormer embeddings for a list of SMILES."""
    # Tokenization
    encoded = tokenizer(
        smiles_list,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )
    
    # Move tensors to the device
    input_ids = encoded['input_ids'].to(device)
    attention_mask = encoded['attention_mask'].to(device)
    
    # Forward pass
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Use the last hidden states
        hidden_states = outputs.last_hidden_state
        
        # Molecule-level representation (masked mean pooling)
        mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_embeddings = torch.sum(hidden_states * mask_expanded, 1)
        sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)
        molecular_embeddings = sum_embeddings / sum_mask
    
    return molecular_embeddings

# Example usage
smiles_examples = [
    "CCO",
    "CC(=O)O", 
    "c1ccccc1",
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
]

embeddings = get_molformer_embedding(smiles_examples)
print(f"MolFormer embeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")

MolFormer-XL (Extra-Large Variant)

# MolFormer-XL (requires more memory)
model_xl_path = "ibm/MoLFormer-XL-both-10pct"
tokenizer_xl = AutoTokenizer.from_pretrained(model_xl_path, trust_remote_code=True)
model_xl = AutoModel.from_pretrained(model_xl_path, trust_remote_code=True)

# Use half precision to save memory
model_xl = model_xl.half().to(device)

# For large numbers of molecules, process in batches
def batch_process_molecules(smiles_list, batch_size=32):
    """Process a large list of molecules in batches."""
    all_embeddings = []
    
    for i in range(0, len(smiles_list), batch_size):
        batch = smiles_list[i:i+batch_size]
        embeddings = get_molformer_embedding(batch)
        all_embeddings.append(embeddings.cpu())
        
        # Free cached GPU memory
        torch.cuda.empty_cache()
    
    return torch.cat(all_embeddings, dim=0)

3.3 SMILES Transformer

Model Overview

SMILES Transformer is one of the first Transformer models designed specifically for SMILES sequences [9]. It is pretrained with an autoencoding objective to learn latent molecular representations and targets low-data drug discovery settings.

Highlights

  • Pretraining task: autoencoding (denoising autoencoder)
  • Data: 1.7 million ChEMBL molecules (up to 100 characters each)
  • SMILES augmentation: SMILES enumeration to increase data diversity
  • Application: low-data drug discovery

Installation and Usage

git clone https://github.com/DSPsleeporg/smiles-transformer.git
cd smiles-transformer
pip install -r requirements.txt
import torch
import torch.nn as nn
from torch.nn import Transformer
import numpy as np
from rdkit import Chem

class SMILESTransformer(nn.Module):
    """A small encoder-decoder Transformer over SMILES characters."""
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, max_seq_len=100):
        super(SMILESTransformer, self).__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, max_seq_len)
        self.transformer = Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=2048,
            dropout=0.1,
            batch_first=True  # inputs are (batch, seq_len, d_model)
        )
        self.fc_out = nn.Linear(d_model, vocab_size)
        
    def forward(self, src, tgt=None, src_mask=None, tgt_mask=None):
        # Encoder input
        src_emb = self.pos_encoder(self.embedding(src) * np.sqrt(self.d_model))
        
        if tgt is not None:
            # Training mode (encoder-decoder)
            tgt_emb = self.pos_encoder(self.embedding(tgt) * np.sqrt(self.d_model))
            output = self.transformer(src_emb, tgt_emb, src_mask=src_mask, tgt_mask=tgt_mask)
            return self.fc_out(output)
        else:
            # Inference mode (encoder only)
            memory = self.transformer.encoder(src_emb, src_mask)
            return memory

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding (batch-first)."""
    def __init__(self, d_model, max_len=100):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                           (-np.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1), :]

class SMILESTokenizer:
    """A minimal character-level SMILES tokenizer."""
    def __init__(self):
        # Basic SMILES character vocabulary (plus special tokens)
        self.chars = ['<PAD>', '<SOS>', '<EOS>', '<UNK>'] + list("()[]1234567890=+-#@CNOSPFIBrClcnos")
        self.char_to_idx = {char: idx for idx, char in enumerate(self.chars)}
        self.idx_to_char = {idx: char for char, idx in self.char_to_idx.items()}
        self.vocab_size = len(self.chars)
    
    def encode(self, smiles, max_length=100):
        """Encode a SMILES string into token indices."""
        tokens = ['<SOS>'] + list(smiles) + ['<EOS>']
        indices = [self.char_to_idx.get(token, self.char_to_idx['<UNK>']) for token in tokens]
        
        # Pad or truncate to max_length
        if len(indices) < max_length:
            indices += [self.char_to_idx['<PAD>']] * (max_length - len(indices))
        else:
            indices = indices[:max_length]
            
        return torch.tensor(indices, dtype=torch.long)
    
    def decode(self, indices):
        """Decode token indices back into a SMILES string."""
        chars = [self.idx_to_char[idx.item()] for idx in indices]
        # Drop special tokens
        chars = [c for c in chars if c not in ['<PAD>', '<SOS>', '<EOS>', '<UNK>']]
        return ''.join(chars)

def get_smiles_embedding(smiles_list, model, tokenizer, device):
    """Compute molecule embeddings from the SMILES Transformer encoder."""
    model.eval()
    embeddings = []
    
    with torch.no_grad():
        for smiles in smiles_list:
            # Encode the SMILES string
            encoded = tokenizer.encode(smiles).unsqueeze(0).to(device)
            
            # Get the encoder output
            encoder_output = model(encoded)
            
            # Mean pooling over non-padding positions
            # gives a molecule-level representation
            mask = (encoded != tokenizer.char_to_idx['<PAD>']).float()
            pooled = (encoder_output * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
            
            embeddings.append(pooled)
    
    return torch.cat(embeddings, dim=0)

# Example usage
def demo_smiles_transformer():
    """Demonstrate the SMILES Transformer."""
    # Initialize the model and tokenizer
    tokenizer = SMILESTokenizer()
    model = SMILESTransformer(vocab_size=tokenizer.vocab_size)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    
    # Example SMILES
    smiles_examples = [
        "CCO",
        "CC(=O)O",
        "c1ccccc1",
        "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    ]
    
    # Validate the SMILES
    valid_smiles = []
    for smi in smiles_examples:
        if Chem.MolFromSmiles(smi) is not None:
            valid_smiles.append(smi)
    
    # Compute embeddings (note: the model is untrained here, for demonstration only)
    embeddings = get_smiles_embedding(valid_smiles, model, tokenizer, device)
    print(f"SMILES embeddings shape: {embeddings.shape}")
    
    return embeddings

# Run the demo
# embeddings = demo_smiles_transformer()

3.4 SMILES-BERT

Model Overview

SMILES-BERT is a BERT-based molecular language model developed by Wang et al. [10]. Designed for SMILES sequences, it is pretrained at scale in an unsupervised way with a masked SMILES recovery task. Its attention-based Transformer layers capture long-range dependencies within molecular sequences.

Model Highlights

  • Semi-supervised learning: large-scale unlabeled pretraining followed by fine-tuning on downstream tasks
  • Attention mechanism: Transformer attention captures relationships between atoms within a molecule
  • Transferability: the pretrained model transfers readily to different molecular property prediction tasks

Usage Example

# SMILES-BERT itself usually has to be installed from source; a similar BERT-style implementation is used here
from transformers import AutoTokenizer, AutoModel
import torch
from rdkit import Chem

def create_smiles_bert_embedding(smiles_list, model_name="DeepChem/ChemBERTa-77M-MLM"):
    """
    Compute SMILES embeddings with a BERT-like model.
    Note: ChemBERTa is used here as a stand-in implementation for SMILES-BERT.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    
    # Validate the SMILES
    valid_smiles = [smi for smi in smiles_list if Chem.MolFromSmiles(smi) is not None]
    
    # Tokenization and encoding
    inputs = tokenizer(valid_smiles, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    
    # Generate embeddings
    with torch.no_grad():
        outputs = model(**inputs)
        # Use the [CLS] token representation or mean pooling
        embeddings = outputs.last_hidden_state.mean(dim=1)  # mean pooling
    
    return embeddings

# Example usage
smiles_examples = ["CCO", "CC(=O)O", "c1ccccc1", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]
embeddings = create_smiles_bert_embedding(smiles_examples)
print(f"SMILES-BERT embeddings shape: {embeddings.shape}")

3.5 Smile-to-Bert

Model Overview

Smile-to-Bert is a recently released BERT architecture [11] pretrained to predict 113 molecular descriptors from SMILES, folding both molecular structure and physicochemical information into its embeddings. It was evaluated on 22 molecular property prediction datasets with strong results.

Model Highlights

  • Multi-task pretraining: simultaneously predicts 113 RDKit-computed molecular descriptors
  • Physicochemistry-aware: embeddings encode both molecular structure and physicochemical properties
  • Recent work: released in 2024, representing the current state of molecular BERT models

Usage Example

# Conceptual implementation of Smile-to-Bert
from transformers import BertModel, BertTokenizer
import torch
from rdkit import Chem

class SmileToBert:
    """Conceptual implementation of the Smile-to-Bert model."""
    
    def __init__(self, model_path="smile-to-bert"):
        """
        Initialize the Smile-to-Bert model.
        Note: for real use, obtain the pretrained weights from the official repository.
        """
        # A generic BERT is used here as an example; real use should load pretrained Smile-to-Bert weights
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.model = BertModel.from_pretrained('bert-base-uncased')
        
        # Add molecule-specific special tokens
        special_tokens = ['[MOL]', '[BOND]', '[RING]']
        self.tokenizer.add_tokens(special_tokens)
        self.model.resize_token_embeddings(len(self.tokenizer))
        
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
    
    def preprocess_smiles(self, smiles):
        """Preprocess a SMILES string."""
        # Insert spaces between characters so the tokenizer splits them
        processed = ' '.join(list(smiles))
        return processed
    
    def get_molecular_embedding(self, smiles_list):
        """Compute molecule embeddings."""
        # Preprocess the SMILES
        processed_smiles = [self.preprocess_smiles(smi) for smi in smiles_list]
        
        # Tokenization
        inputs = self.tokenizer(
            processed_smiles,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        )
        inputs = {key: value.to(self.device) for key, value in inputs.items()}
        
        # Generate embeddings
        with torch.no_grad():
            outputs = self.model(**inputs)
            # Use the [CLS] token representation or mean pooling
            embeddings = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        
        return embeddings

# Example usage
def demo_smile_to_bert():
    """Demonstrate Smile-to-Bert."""
    # Initialize the model
    smile_bert = SmileToBert()
    
    # Example SMILES
    smiles_examples = [
        "CCO",  # ethanol
        "CC(=O)O",  # acetic acid
        "c1ccccc1",  # benzene
        "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # caffeine
    ]
    
    # Validate the SMILES
    valid_smiles = []
    for smi in smiles_examples:
        if Chem.MolFromSmiles(smi) is not None:
            valid_smiles.append(smi)
    
    # Generate embeddings
    embeddings = smile_bert.get_molecular_embedding(valid_smiles)
    print(f"Smile-to-Bert embeddings shape: {embeddings.shape}")
    print("Note: this is a conceptual implementation; real use requires the official pretrained weights")
    
    return embeddings

# Run the demo
# embeddings = demo_smile_to_bert()

3.6 MolBERT

Model Overview

MolBERT is a BERT model tailored to the chemistry domain [12]. Optimized for SMILES strings, it extracts rich contextual molecular representations. Pretrained on a large chemical corpus, it is particularly well suited to molecular similarity search and drug discovery tasks.

Model Highlights

  • Chemistry-specific: tailored to chemical SMILES data
  • Bidirectional context: leverages BERT's bidirectional attention
  • Transfer learning: performs well on small datasets

Usage Example

import os
import torch
import yaml
from typing import Any, Dict, Sequence, Tuple, Union
import numpy as np

# The class below is adapted from the MolBERT repository's featurizer; placeholder stubs are included so the snippet is self-contained
class MolBertFeaturizer:
    def __init__(
        self,
        checkpoint_path: str,
        device: str = None,
        embedding_type: str = 'pooled',
        max_seq_len: int = None,
        permute: bool = False,
    ) -> None:
        super().__init__()
        self.checkpoint_path = checkpoint_path
        self.model_dir = os.path.dirname(os.path.dirname(checkpoint_path))
        self.hparams_path = os.path.join(self.model_dir, 'hparams.yaml')
        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
        self.embedding_type = embedding_type
        self.output_all = False if self.embedding_type in ['pooled'] else True
        self.max_seq_len = max_seq_len
        self.permute = permute

        # load config
        with open(self.hparams_path) as yaml_file:
            config_dict = yaml.load(yaml_file, Loader=yaml.FullLoader)

        # A simple stand-in logger; real code should use the logging module
        class SimpleLogger:
            def debug(self, msg):
                print(msg)
        logger = SimpleLogger()
        logger.debug('loaded model trained with hparams:')
        logger.debug(config_dict)

        # Placeholder for SmilesIndexFeaturizer; the real implementation lives in the MolBERT repository
        class SmilesIndexFeaturizer:
            @staticmethod
            def bert_smiles_index_featurizer(max_seq_len, permute):
                return None

        # load smiles index featurizer
        self.featurizer = self.load_featurizer(config_dict)

        # Placeholder for SmilesMolbertModel; the real implementation lives in the MolBERT repository
        class SmilesMolbertModel:
            def __init__(self, config):
                self.config = config
            def load_from_checkpoint(self, checkpoint_path, hparam_overrides):
                pass
            def load_state_dict(self, state_dict):
                pass
            def eval(self):
                pass
            def freeze(self):
                pass
            def to(self, device):
                return self

        # load model
        from types import SimpleNamespace
        self.config = SimpleNamespace(**config_dict)
        self.model = SmilesMolbertModel(self.config)
        self.model.load_from_checkpoint(self.checkpoint_path, hparam_overrides=self.model.__dict__)

        # HACK: manually load model weights since they don't seem to load from checkpoint (PL v.0.8.5)
        checkpoint = torch.load(self.checkpoint_path, map_location=lambda storage, loc: storage)
        self.model.load_state_dict(checkpoint['state_dict'])

        self.model.eval()
        self.model.freeze()

        self.model = self.model.to(self.device)

        if self.output_all:
            self.model.model.config.output_hidden_states = True

    def load_featurizer(self, config_dict):
        # load smiles index featurizer
        if self.max_seq_len is None:
            max_seq_len = config_dict.get('max_seq_length')
            # A simple stand-in logger; real code should use the logging module
            class SimpleLogger:
                def debug(self, msg):
                    print(msg)
            logger = SimpleLogger()
            logger.debug(f'getting smiles index featurizer of length: {max_seq_len}')
        else:
            max_seq_len = self.max_seq_len
        return SmilesIndexFeaturizer.bert_smiles_index_featurizer(max_seq_len, permute=self.permute)

    @staticmethod
    def trim_batch(input_ids, valid):
        # trim input horizontally if there is at least 1 valid data point
        if any(valid):
            _, cols = np.where(input_ids[valid] != 0)
        # else trim input down to 1 column (avoids empty batch error)
        else:
            cols = np.array([0])

        max_idx: int = int(cols.max().item() + 1)

        input_ids = input_ids[:, :max_idx]

        return input_ids

    def transform(self, molecules: Sequence[Any]) -> Tuple[Union[Dict, np.ndarray], np.ndarray]:
        # Assumes self.featurizer.transform is implemented (see the MolBERT repository)
        input_ids, valid = self.featurizer.transform(molecules)

        input_ids = self.trim_batch(input_ids, valid)

        token_type_ids = np.zeros_like(input_ids, dtype=np.int64)
        attention_mask = np.zeros_like(input_ids, dtype=np.int64)

        attention_mask[input_ids != 0] = 1

        input_ids = torch.tensor(input_ids, dtype=torch.long, device=self.device)
        token_type_ids = torch.tensor(token_type_ids, dtype=torch.long, device=self.device)
        attention_mask = torch.tensor(attention_mask, dtype=torch.long, device=self.device)

        with torch.no_grad():
            # Assumes self.model.model.bert is the underlying BERT module
            outputs = self.model.model.bert(
                input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask
            )

        if self.output_all:
            sequence_output, pooled_output, hidden = outputs
        else:
            sequence_output, pooled_output = outputs

        # set invalid outputs to 0s
        valid_tensor = torch.tensor(
            valid, dtype=sequence_output.dtype, device=sequence_output.device, requires_grad=False
        )

        pooled_output = pooled_output * valid_tensor[:, None]

        # concatenate and sum last 4 layers
        if self.embedding_type == 'average-sum-4':
            sequence_out = torch.sum(torch.stack(hidden[-4:]), dim=0)  # B x L x H
        # concatenate and sum last 2 layers
        elif self.embedding_type == 'average-sum-2':
            sequence_out = torch.sum(torch.stack(hidden[-2:]), dim=0)  # B x L x H
        # concatenate last four hidden layer
        elif self.embedding_type == 'average-cat-4':
            sequence_out = torch.cat(hidden[-4:], dim=-1)  # B x L x 4*H
        # concatenate last two hidden layer
        elif self.embedding_type == 'average-cat-2':
            sequence_out = torch.cat(hidden[-2:], dim=-1)  # B x L x 2*H
        # only last layer - same as default sequence output
        elif self.embedding_type == 'average-1':
            sequence_out = hidden[-1]  # B x L x H
        # only penultimate layer
        elif self.embedding_type == 'average-2':
            sequence_out = hidden[-2]  # B x L x H
        # only 3rd to last layer
        elif self.embedding_type == 'average-3':
            sequence_out = hidden[-3]  # B x L x H
        # only 4th to last layer
        elif self.embedding_type == 'average-4':
            sequence_out = hidden[-4]  # B x L x H
        # defaults to last hidden layer
        else:
            sequence_out = sequence_output  # B x L x H

        sequence_out = sequence_out * valid_tensor[:, None, None]

        sequence_out = sequence_out.detach().cpu().numpy()
        pooled_output = pooled_output.detach().cpu().numpy()

        if self.embedding_type == 'pooled':
            out = pooled_output
        elif self.embedding_type == 'average-1-cat-pooled':
            sequence_out = np.mean(sequence_out, axis=1)
            out = np.concatenate([sequence_out, pooled_output], axis=-1)
        elif self.embedding_type.startswith('average'):
            out = np.mean(sequence_out, axis=1)
        else:
            out = dict(sequence_output=sequence_out, pooled_output=pooled_output)

        return out, valid

# Example usage
if __name__ == "__main__":
    # Download a pretrained checkpoint via the links in the MolBERT README
    checkpoint_path = 'path/to/your/downloaded/checkpoint.ckpt'
    featurizer = MolBertFeaturizer(checkpoint_path=checkpoint_path)

    # SMILES strings of example molecules
    smiles_list = ['CCO', 'CCN']
    features, valid = featurizer.transform(smiles_list)
    print("Features:", features)
    print("Valid:", valid)

3.7 General-Purpose LLMs Applied to Molecular Data

LLaMA and GPT on SMILES

Recent studies show that general-purpose large language models such as LLaMA and GPT handle SMILES strings surprisingly well [13]. Although not designed for chemistry, their strong language understanding lets them produce useful molecular representations.

Performance Comparison

  • LLaMA: outperforms GPT on molecular property prediction and drug-drug interaction prediction
  • GPT: somewhat behind LLaMA, but still yields meaningful molecular representations
  • Versus specialized models: on some tasks LLaMA is competitive with dedicated molecular pretrained models

Usage Example

# Call general-purpose LLMs through the HuggingFace interface
from transformers import LlamaTokenizer, LlamaModel, GPT2Tokenizer, GPT2Model
import torch
from rdkit import Chem

class UniversalLLMForMolecules:
    """Use a general-purpose LLM for molecular representation learning."""
    
    def __init__(self, model_type='llama', model_name=None):
        """
        Initialize a general-purpose LLM.
        
        Args:
            model_type: 'llama' or 'gpt2'
            model_name: specific model name
        """
        if model_type == 'llama':
            # Note: LLaMA weights require requesting access
            model_name = model_name or "meta-llama/Llama-2-7b-hf"
            self.tokenizer = LlamaTokenizer.from_pretrained(model_name)
            self.model = LlamaModel.from_pretrained(model_name)
        elif model_type == 'gpt2':
            model_name = model_name or "gpt2"
            self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
            self.model = GPT2Model.from_pretrained(model_name)
            # GPT-2 has no pad token by default
            self.tokenizer.pad_token = self.tokenizer.eos_token
        else:
            raise ValueError(f"Unsupported model type: {model_type}")
        
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.model.eval()
    
    def get_molecular_embeddings(self, smiles_list):
        """Compute molecule embeddings with a general-purpose LLM."""
        # Validate the SMILES
        valid_smiles = []
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            if mol is not None:
                valid_smiles.append(smi)
        
        # Add a descriptive prompt prefix to help the model
        prompted_smiles = [f"Molecule with SMILES: {smi}" for smi in valid_smiles]
        
        # Tokenization
        inputs = self.tokenizer(
            prompted_smiles,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        )
        inputs = {key: value.to(self.device) for key, value in inputs.items()}
        
        # Generate embeddings
        with torch.no_grad():
            outputs = self.model(**inputs)
            hidden_states = outputs.last_hidden_state
            
            # Mean pooling for a sequence-level representation
            attention_mask = inputs['attention_mask'].unsqueeze(-1)
            masked_embeddings = hidden_states * attention_mask
            embeddings = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1)
        
        return embeddings

# Example usage (requires access to the corresponding model weights)
def demo_universal_llm():
    """Demonstrate a general-purpose LLM on molecular data."""
    try:
        # Use GPT-2 (easier to obtain)
        llm = UniversalLLMForMolecules(model_type='gpt2', model_name='gpt2')
        
        smiles_examples = ["CCO", "CC(=O)O", "c1ccccc1"]
        embeddings = llm.get_molecular_embeddings(smiles_examples)
        
        print(f"Universal LLM embeddings shape: {embeddings.shape}")
        print("Note: general-purpose LLMs may need more prompt engineering for best performance")
        
    except Exception as e:
        print(f"Error loading universal LLM: {e}")
        print("Make sure the required models and access permissions are in place")

# demo_universal_llm()

4. Model Comparison and Selection Guide

4.1 Comparison of the Main Models

Category | Model | Parameters | Output dim | Pretraining data | Key strengths | Typical applications
Protein | ESM-2 | 8M-15B | 320-5120 | 250M sequences | Rich evolutionary information, many sizes | Protein structure prediction, function annotation
Protein | ESM-C | 300M-6B | 1152 | >1B sequences | Higher efficiency, stronger performance | Large-scale protein analysis
Protein | CARP | 640M | 1280 | UniRef50 sequences | Convolutional masked language modeling | Protein generation and design
Protein | ProtT5 | ~3B | 1024 | 45M sequences | T5 encoder-decoder architecture | Multi-task protein prediction
Protein | Ankh | ~3B | 1536 | UniRef50 sequences | Efficient general-purpose protein LM | Broad protein prediction tasks
Peptide | PepBERT | ~300M | 320 | UniParc peptides | Optimized for short peptides | Peptide-protein interaction
Small molecule | ChemBERTa | 12M-77M | 384-768 | 77M molecules | First molecular BERT, mature ecosystem | Molecular property prediction
Small molecule | MolFormer | 47M | 512-768 | 1.1B molecules | Linear attention, handles long sequences | Large-scale molecular screening
Small molecule | SMILES Transformer | ~10M | 512 | 1.7M molecules | Autoencoding, optimized for low data | Small-dataset drug discovery
Small molecule | SMILES-BERT | ~12M | 768 | Large-scale SMILES | Masked language modeling, semi-supervised | Molecular property prediction
Small molecule | Smile-to-Bert | ~110M | 768 | PubChem + 113 descriptors | Multi-task pretraining, physicochemistry-aware | Comprehensive property prediction
Small molecule | MolBERT | ~12M | 768 | Chemical corpora | Chemistry-specific, bidirectional context | Molecular similarity search
Small molecule | LLaMA (molecules) | 7B+ | 4096+ | General + SMILES | Strong language understanding and generalization | Complex molecular reasoning
Small molecule | GPT (molecules) | 175B+ | 12288+ | General + SMILES | Strong generation, conversational interaction | Molecule generation and explanation

4.2 Performance and Efficiency

Compute resource requirements

Model | Memory | Inference speed | Training cost | GPU requirement
ESM-2 (650M) | ~3GB | Medium | — | V100/A100 recommended
ESM-C (600M) | ~2.5GB | Medium | — | GTX 1080Ti sufficient
ChemBERTa | ~500MB | — | — | GTX 1060 sufficient
MolFormer | ~1GB | Medium | — | RTX 2080 sufficient
SMILES-BERT | ~500MB | Medium | — | GTX 1060 sufficient
Smile-to-Bert | ~1GB | Medium | Medium | RTX 2080 sufficient
MolBERT | ~500MB | — | — | GTX 1060 sufficient
LLaMA (7B) | ~14GB | — | Very high | A100 recommended
GPT (175B) | >350GB | Very slow | Very high | Multiple A100s
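
Before loading one of the larger models, it can be worth checking whether it is likely to fit on the available GPU. A minimal sketch using PyTorch's memory query; the footprints are the rough estimates from the table above, not measured values:

import torch

# Rough model footprints in GB (estimates from the table above)
approx_footprint_gb = {"esm2_t33_650M": 3.0, "esmc_600m": 2.5, "molformer": 1.0, "llama_7b": 14.0}

def fits_on_gpu(model_key, safety_margin_gb=1.0):
    """Return True if the model's rough footprint fits in the currently free GPU memory."""
    if not torch.cuda.is_available():
        return False
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return approx_footprint_gb[model_key] + safety_margin_gb <= free_bytes / 1e9

print("esm2_t33_650M fits:", fits_on_gpu("esm2_t33_650M"))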

Accuracy

  1. Protein tasks
    • Structure prediction: ESM-2 > ESM-C > ProtT5
    • Function prediction: ESM-C ≥ ESM-2 > CARP
    • Peptide interactions: PepBERT > general-purpose protein models
  2. Molecular property prediction
    • Overall: MolFormer > Smile-to-Bert > ChemBERTa-2 > ChemBERTa
    • Small datasets: SMILES Transformer > SMILES-BERT > large models
    • Multi-task learning: Smile-to-Bert > MolBERT > ChemBERTa
    • Physicochemical properties: Smile-to-Bert > traditional descriptor methods
    • General reasoning: LLaMA > GPT > specialized models (on some complex tasks)

4.3 Selection Recommendations

By application

Protein research

  • Structural biology: ESM-2 (t33 or larger)
  • Large-scale analysis: ESM-C (600M)
  • Protein design: CARP
  • Multi-task prediction: ProtT5

Small-molecule research

  • Drug discovery: MolFormer or Smile-to-Bert
  • Drug development: ChemBERTa-2 or MolBERT
  • Molecule generation: GPT/LLaMA-based approaches
  • Proof of concept: ChemBERTa or SMILES Transformer
  • Physicochemical property prediction: Smile-to-Bert (specifically optimized for this)

Peptide research

  • Peptide-protein interactions: PepBERT
  • Antimicrobial peptide design: PepBERT + fine-tuning

By available resources

High-performance computing environments

  • Recommended: large ESM-2 models, MolFormer-XL, LLaMA/GPT molecular applications
  • Advantage: best performance, supports complex reasoning

Standard workstations

  • Recommended: ESM-C, ChemBERTa, standard MolFormer, Smile-to-Bert
  • Balances performance and resource requirements

Resource-constrained environments

  • Recommended: small ESM-2 models, SMILES Transformer, SMILES-BERT
  • Covers the essential functionality

By data characteristics

Large-scale data

  • Use large pretrained models: MolFormer, ESM-C, LLaMA/GPT
  • Take advantage of scale

Small-scale data

  • Use models optimized for low data: SMILES Transformer, PepBERT, SMILES-BERT
  • Or pretrain and fine-tune

Specialized domains

  • Physicochemical property prediction: Smile-to-Bert
  • Short peptides: PepBERT
  • Molecule generation: GPT/LLaMA approaches
  • Chemical reasoning: general-purpose LLMs

5. Best Practices and Tips

5.1 Model Selection Strategy

  1. Prototyping: validate ideas quickly with a small model (see the sketch below)
  2. Performance tuning: scale up to larger models step by step
  3. Production deployment: balance performance against resource requirements
  4. Special requirements: choose a model optimized for the task
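
Because the ESM-2 checkpoints expose the same interface, moving from a prototype to a production model is essentially a one-line change. A minimal sketch, assuming the fair-esm package from Section 1.1; the layer index to request matches the "tNN" part of each model name:

import esm

# Prototype with the smallest checkpoint, then swap in a larger one later
PROTOTYPE = ("esm2_t6_8M_UR50D", 6)      # 8M parameters, final layer 6
PRODUCTION = ("esm2_t33_650M_UR50D", 33)  # 650M parameters, final layer 33

model_name, final_layer = PROTOTYPE  # switch to PRODUCTION once the idea is validated
model, alphabet = getattr(esm.pretrained, model_name)()
model.eval()

# The rest of the pipeline stays identical; only repr_layers needs the matching index
# results = model(batch_tokens, repr_layers=[final_layer], return_contacts=True)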

5.2 Optimization Tips

Memory optimization

# Use half precision (FP16)
model = model.half()

# Gradient checkpointing
model.gradient_checkpointing_enable()

# Batched inference
def batch_inference(data, model, batch_size=32):
    results = []
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        with torch.no_grad():
            result = model(batch)
        results.append(result.cpu())
        torch.cuda.empty_cache()
    return torch.cat(results)

Speed optimization

# Compile the model (PyTorch 2.0+)
model = torch.compile(model)

# TensorRT optimization (NVIDIA GPUs); torch_tensorrt.compile usually also needs example inputs
import torch_tensorrt
optimized_model = torch_tensorrt.compile(model)

5.3 Utility Functions

def standardize_molecular_input(smiles_list):
    """Canonicalize and validate a list of SMILES."""
    from rdkit import Chem
    standardized = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            # Canonicalize the SMILES
            canonical_smi = Chem.MolToSmiles(mol, canonical=True)
            standardized.append(canonical_smi)
        else:
            print(f"Invalid SMILES: {smi}")
    return standardized

def validate_protein_sequence(sequence):
    """Check that a sequence contains only the 20 standard amino acids."""
    valid_amino_acids = set('ACDEFGHIKLMNPQRSTVWY')
    return all(aa in valid_amino_acids for aa in sequence.upper())

def estimate_memory_usage(model_name, batch_size, sequence_length):
    """Roughly estimate memory usage (in GB) for a given model and input size."""
    memory_map = {
        'esm2_t33_650M': lambda b, l: b * l * 1280 * 4 * 1e-9 + 2.5,
        'chemberta': lambda b, l: b * l * 768 * 4 * 1e-9 + 0.5,
        'molformer': lambda b, l: b * l * 768 * 4 * 1e-9 + 1.0,
    }
    
    if model_name in memory_map:
        estimated_gb = memory_map[model_name](batch_size, sequence_length)
        return f"Estimated memory usage: {estimated_gb:.2f} GB"
    else:
        return "Memory estimation not available for this model"

References

[1] Lin Z, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123-1130.

[2] EvolutionaryScale. ESM Cambrian: Focused on creating representations of proteins. 2024. Available: https://github.com/evolutionaryscale/esm

[3] Rao R, et al. MSA Transformer. In: International Conference on Machine Learning. 2021:8844-8856.

[4] Elnaggar A, et al. ProtTrans: towards cracking the language of Life’s code through self-supervised deep learning and high performance computing. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021;44(10):7112-7127.

[5] ElNaggar A, et al. Ankh: Optimized protein language model unlocks general-purpose modelling. 2023. Available: https://huggingface.co/ElnaggarLab/ankh-large

[6] Zhang H, et al. PepBERT: A BERT-based model for peptide representation learning. 2023. Available: https://github.com/dzjxzyd/PepBERT-large

[7] Chithrananda S, Grand G, Ramsundar B. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885. 2020.

[8] Ross J, et al. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence. 2022;4(12):1256-1264.

[9] Honda S, Shi S, Ueda HR. SMILES transformer: Pre-trained molecular fingerprint for low data drug discovery. 2019. Available: https://github.com/DSPsleeporg/smiles-transformer

[10] Wang S, Guo Y, Wang Y, Sun H, Huang J. SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction. Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. 2019:429-436.

[11] Barranco-Altirriba M, Würf V, Manzini E, Pauling JK, Perera-Lluna A. Smile-to-Bert: A BERT architecture trained for physicochemical properties prediction and SMILES embeddings generation. bioRxiv. 2024. doi:10.1101/2024.10.31.621293.

[12] MolBERT: A BERT-based model for molecular representation learning. GitHub. Available: https://github.com/BenevolentAI/MolBERT

[13] Al-Ghamdi A, et al. Can large language models understand molecules? BMC Bioinformatics. 2024;25:347.

[14] Schwaller P, et al. Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Central Science. 2019;5(9):1572-1583.

[15] Li S, et al. Stepping back to SMILES transformers for fast molecular representation inference (ST-KD). 2021. Available: https://openreview.net/forum?id=CyKQiiCPBEv