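  • Corresponding authors: Chen Qi, female, Ph.D., associate professor; research interests: computational biology and biocatalysis; E-mail: chenq@ecust.edu.cn
    Xu Jianhe, male, Ph.D., professor; research interests: biocatalysis and biochemical engineering; E-mail: jianhexu@ecust.edu.cn
  • About the author: Wang Muqiang, male, master's student; research interest: biochemical engineering; E-mail: y30210500@mail.ecust.edu.cn
  • Funding: National Key Research and Development Program of China (2019YFA0905000, 2021YFC2102300); National Natural Science Foundation of China (21871085, 31971380, 31971381)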

    Abstract:

    Directed evolution mimics natural evolution to accelerate the improvement of enzymes and has become a key technology in enzyme engineering. It plays an important role in biocatalysis and drug design; however, because mutations are random, the resulting mutant libraries are enormous and pose a great challenge to experimental screening capacity. In recent years, emerging technologies such as artificial intelligence and big-data processing have also become important research tools in biocatalysis. Among them, machine learning is a statistical learning approach that obtains a mapping from sequence or structure to enzyme function in a data-driven manner, which helps improve the efficiency of enzyme engineering. This paper reviews the data processing, descriptors, and algorithms involved in machine learning models, and focuses on the research and application progress of machine learning methods in enzyme engineering. As machine learning algorithms and their application technologies advance, more accurate and effective models are expected to be proposed, facilitating the screening of new enzymes and the precise design and engineering of biocatalysts.

    Key words: directed evolution, machine learning, protein engineering, biocatalysis

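    Table 1 Descriptors commonly used in machine learning-guided enzyme design

    | Type | Encoding method/Information | Descriptor | Feature information | Reference |
    |---|---|---|---|---|
    | Sequence-based descriptors | One-hot encoding | Identity | Represents residue positions | [27] |
    | | Homology information | PSSM (position-specific scoring matrix) | Homology information of the sequence | [32] |
    | | Physicochemical properties | DeepSF (one-dimensional deep convolutional neural network) | Sequence descriptors, sequence position, secondary structure, and solvent accessibility | [33] |
    | | | Geometric descriptors | Histograms of torsion-angle density and residue-distance density | [34] |
    | | | Conjoint triad descriptors | K-spaced residue pairs and conjoint triads | [35] |
    | | | Spatial and chemical features | Three-dimensional network of protein-ligand structural information | [36] |
    | | | Amino acid composition descriptors in the Protr package | Amino acid content | [37] |
    | | | Spectrum kernel | Homology between distant residues | [38-39] |
    | | | zScales | Physicochemical properties of amino acids | [40] |
    | | | VHSE (principal-component vectors of hydrophobic, steric, and electronic properties) | Information obtained by PCA dimensionality reduction of hydrophobic, steric, and electronic properties | [41] |
    | | | sScales | Physicochemical features based on AAindex | [42] |
    | | | ProFET | AAindex physicochemical features selected by literature search | [43] |
    | | | T-scale (topological scale) | Topological structure features | [44] |
    | | | ST-scale (structural topology scale) | Topological structure features | [45] |
    | | | ProtFP (protein fingerprint) | Physicochemical properties of amino acids | [46] |
    | | | AAindex | Protein properties | [47] |
    | | Hidden information | UniRep | Sequence features extracted automatically by a neural network model | [48] |
    | Structure-based descriptors | Physicochemical properties | sPairs | Two-dimensional descriptors based on amino acid pair contact maps and AAindex | [49] |
    | | One-hot encoding | Residue-residue contact map | Structural distance between two proteins of the same family | [20] |
    | Embedding descriptors | Hidden information | ProtVec | Mutation information generated from amino acid triplets | [50] |
    | Mutation descriptors | Represents the mutation pattern | MutInd | 0 or 1 indicates whether the corresponding mutation occurs in the variant sequence | [27] |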
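
To make two of the simplest entries in Table 1 concrete, the following is a minimal sketch of a one-hot (Identity) encoding and a MutInd-style indicator vector. The residue ordering, the function names, and the example sequences and mutation list are illustrative assumptions, not taken from the cited works.

```python
# Assumed ordering of the 20 standard residues (alphabetical by one-letter code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode each residue as a 20-element 0/1 vector (one row per position)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    return rows


def mutation_indicator(variant, mutation_list):
    """MutInd-style vector: 1 if a listed mutation is present in the variant, else 0.

    `mutation_list` holds (position, new_residue) pairs with 0-based positions.
    """
    return [1 if variant[pos] == new else 0 for pos, new in mutation_list]


variant = "MKSAF"                            # hypothetical variant sequence
mutations = [(2, "S"), (2, "A"), (4, "F")]   # hypothetical library mutations

encoded = one_hot(variant)
flags = mutation_indicator(variant, mutations)
print(len(encoded), len(encoded[0]))  # number of positions, number of residue types
print(flags)
```

The one-hot matrix grows with sequence length, whereas the indicator vector has one entry per mutation in the library, which is why the latter is compact when only a few positions are varied.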

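
Two of the sequence-based entries in Table 1, amino acid composition and the spectrum kernel, can be sketched in a few lines. This is an illustrative version written with the Python standard library only; it does not reproduce the Protr package, and the function names and example sequences are assumptions.

```python
from collections import Counter

# Assumed ordering of the 20 standard residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Fraction of each of the 20 standard residues in the sequence."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


def kmer_counts(sequence, k):
    """Count every contiguous k-mer (the k-spectrum of the sequence)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * b[kmer] for kmer, count in a.items())


# Hypothetical sequences, used only to exercise the functions.
composition = aa_composition("MKTAYMKT")
similarity = spectrum_kernel("MKTAYMKT", "AYMKTA", k=3)
print(composition[AMINO_ACIDS.index("M")], similarity)
```

Both representations ignore where in the sequence a residue or k-mer occurs, so they can compare sequences of different lengths without an alignment.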
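    Table 2 Commonly used machine learning algorithms

    | Type | Algorithm/Method | Feature | Reference |
    |---|---|---|---|
    | Classical machine learning | Bayesian algorithm | Models multiple kinds of relationships among variables | [51] |
    | | Gaussian process (GP) | Computes the probability of all possible model outputs from the data and outputs a prediction according to the probability distribution; covariance determines the relationships between data points | [5, 27, 52-55] |
    | | K-nearest neighbors (KNN) | Based on data-label relationships; compares feature distances between new and existing data and retrieves the data with the most similar features | [56] |
    | | Support vector machine (SVM) | Kernel-based algorithm; mapping to a higher dimension can turn a linearly inseparable relationship into a linearly separable one | [57] |
    | | Decision trees (DTs) | Classify or predict from input data | [58] |
    | | Random forest (RF) | A bagging method; every classifier draws the same number of samples and features, chosen at random | [59] |
    | | AdaBoost | A boosting method; combines weak classifiers into a strong classifier | [60] |
    | | Stacking | Trains multiple classifiers in a first round and feeds their outputs into a second round, in which one selected classifier is trained to produce the final output | [61] |
    | | Principal component analysis (PCA) | Unsupervised learning that maps raw data to a new feature space to extract features | [62] |
    | Deep learning | Deep neural network (DNN) | ANN with multiple hidden layers | [63] |
    | | Feedforward neural network (FNN) | Network architecture contains no cycles | [27] |
    | | Recurrent neural network (RNN) | Identifies and models contextual relationships within sequences | [64] |
    | | Convolutional neural network (CNN) | Input data are images or image-like | [65] |
    | | Generative adversarial network (GAN) | Trains two independent, competing networks simultaneously | [66] |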
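
The K-nearest-neighbors entry in Table 2 (compare feature distances between new and existing data, then take the most similar ones) can be illustrated with a short sketch. Hamming distance between equal-length sequences is an assumed choice of feature distance, and the labelled variants are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Majority label among the k training sequences closest to `query`."""
    neighbours = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled variants: (sequence, activity class).
train = [
    ("MKTAY", "active"),
    ("MKSAY", "active"),
    ("MKTAF", "active"),
    ("GKTPW", "inactive"),
    ("GRTPW", "inactive"),
]
print(knn_predict(train, "MKSAF"))
```

No model is fitted in advance; all of the work happens at query time, so the cost of a prediction grows with the size of the training set.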

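
The Gaussian process row of Table 2 describes prediction from a probability distribution whose covariance links data points. A minimal sketch of zero-mean GP regression follows, assuming an RBF kernel on one-hot encoded sequences and a small noise term; the kernel choice, length scale, sequences, and activity values are all hypothetical and are not the settings of the cited studies.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def kernel(a, b, length_scale=2.0):
    """RBF kernel on one-hot encodings.

    The squared Euclidean distance between two one-hot encoded sequences is
    twice their Hamming distance, so exp(-||x - y||^2 / (2 l^2)) reduces to
    exp(-hamming / l^2).
    """
    return math.exp(-hamming(a, b) / length_scale ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def gp_predict(train_seqs, train_y, query, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at `query`."""
    cov = [[kernel(a, b) + (noise if i == j else 0.0)
            for j, b in enumerate(train_seqs)]
           for i, a in enumerate(train_seqs)]
    k_star = [kernel(a, query) for a in train_seqs]
    alpha = solve(cov, train_y)    # (K + noise * I)^-1 y
    weights = solve(cov, k_star)   # (K + noise * I)^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(query, query) - sum(k * w for k, w in zip(k_star, weights))
    return mean, var


# Hypothetical variants and measured activities.
seqs = ["MKTAY", "MKSAY", "MKTAF", "GKTPW"]
activity = [1.0, 1.4, 0.7, 0.1]
print(gp_predict(seqs, activity, "MKSAF"))
```

The predicted variance is close to zero at sequences that were measured and grows for sequences far from the training set, which is the property that makes GP models convenient for deciding which variants to test next.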
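    Table 3 Applications of machine learning in enzyme engineering

    | Model | Task | Machine learning algorithm | Input type | Application | Reference |
    |---|---|---|---|---|---|
    | - | Enzyme function | RF | Molecule | Discriminating acetylcholinesterase inhibitors from non-inhibitors | [70] |
    | CWLy-SVM | Enzyme classification | SVM | Sequence | Identifying cell wall lytic enzymes | [71] |
    | SVR | Enzyme function | SVM | Sequence | Improving enzyme activity and solubility | [72] |
    | GPR/GNB | Enzyme function | GP | Structure | Improving the activity of fatty acyl reductase | [5] |
    | - | Enzyme classification | HMM/RF/LR/KNN/SVM/RF | Sequence | Classifying CBH and EG in glycoside hydrolase family 7 | [9] |
    | Innov'SAR | Enzyme function | PLSR | Sequence | Finding the best combination of activity-enhancing mutations | [10, 73-74] |
    | - | Enzyme function | LR | Structure | Determining the reactivity of substrate-enzyme pairs | [75] |
    | SoluProt | Enzyme function | RF | Sequence | Predicting enzyme solubility in the Escherichia coli expression system | [76] |
    | ProSAR | Enzyme function | PLSR | Sequence | Improving the activity of halohydrin dehalogenase | [3] |
    | TOME | Enzyme function | RF | Sequence | Predicting optimal temperature | [13, 77] |
    | PREvaIL | Catalytic residues | RF | Sequence and structure | Method for predicting the catalytic residues of enzymes | [78] |
    | - | Enzyme function | GP | Sequence | Engineering the fluorescence of green fluorescent protein | [79] |
    | - | Enzyme function | GP | Structure | Engineering the stability of cytochrome P450 | [20] |
    | - | Catalytic residues | CNN | Structure | Framework for predicting the catalytic residues of enzymes | [80] |
    | SolventNet | Enzyme function | CNN | Structure | Effects of acid, catalyst, and solvent on hydrolysis rate | [81] |
    | DeepSol | Enzyme solubility | ANN | Sequence | Predicting protein solubility | [82] |
    | - | Enzyme function | CNN | Structure | Optimizing the catalytic performance and tolerance of PET hydrolase | [83] |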

    [1] Chen K, Arnold FH. Tuning the activity of an enzyme for unusual environments: sequential random mutagenesis of subtilisin E for catalysis in dimethylformamide[J]. Proc Natl Acad Sci USA, 1993, 90(12): 5618-5622. pmid: 8516309
    [2] Tang CD, Zhang ZH, Shi HL, et al. Directed evolution of formate dehydrogenase and its application in the biosynthesis of L-phenylglycine from phenylglyoxylic acid[J]. Mol Catal, 2021, 513: 111666.
    [3] Fox RJ, Davis SC, Mundorff EC, et al. Improving catalytic function by ProSAR-driven enzyme evolution[J]. Nat Biotechnol, 2007, 25(3): 338-344. pmid: 17322872
    [4] Reetz MT. The importance of additive and non-additive mutational effects in protein engineering[J]. Angewandte Chemie Int Ed, 2013, 52(10): 2658-2666. doi: 10.1002/anie.201207842
    [5] Greenhalgh JC, Fahlberg SA, Pfleger BF, et al. Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production[J]. Nat Commun, 2021, 12(1): 5825. doi: 10.1038/s41467-021-25831-w; pmid: 34611172
    [6] Miton CM, Tokuriki N. How mutational epistasis impairs predictability in protein evolution and design[J]. Protein Sci, 2016, 25(7): 1260-1272. doi: 10.1002/pro.2876; pmid: 26757214
    [7] Romero PA, Arnold FH. Exploring protein fitness landscapes by directed evolution[J]. Nat Rev Mol Cell Biol, 2009, 10(12): 866-876. doi: 10.1038/nrm2805
    [8] Ma EJ, Siirola E, Moore C, et al. Machine-directed evolution of an imine reductase for activity and stereoselectivity[J]. ACS Catal, 2021, 11(20): 12433-12445. doi: 10.1021/acscatal.1c02786
    [9] Gado JE, Harrison BE, Sandgren M, et al. Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases[J]. J Biol Chem, 2021, 297(2): 100931. doi: 10.1016/j.jbc.2021.100931
    [10] Ostafe R, Fontaine N, Frank D, et al. One-shot optimization of multiple enzyme parameters: Tailoring glucose oxidase for pH and electron mediators[J]. Biotechnol Bioeng, 2020, 117(1): 17-29. doi: 10.1002/bit.27169; pmid: 31520472
    [11] Peng M, de Vries RP. Machine learning prediction of novel pectinolytic enzymes in Aspergillus niger through integrating heterogeneous (post-)genomics data[J]. Microb Genom, 2021, 7(12): 000674.
    [12] Wu Z, Kan SBJ, Lewis RD, et al. Machine learning-assisted directed protein evolution with combinatorial libraries[J]. Proc Natl Acad Sci USA, 2019, 116(18): 8852-8858. doi: 10.1073/pnas.1901979116; pmid: 30979809
    [13] Li GY, Dong YJ, Reetz MT. Can machine learning revolutionize directed evolution of selective enzymes?[J]. Adv Synth Catal, 2019, 361(11): 2377-2386.
    [14] Baştanlar Y, Özuysal M. Introduction to machine learning[M]// miRNomics: microRNA biology and computational analysis. Totowa, NJ: Humana Press, 2013: 105-128.
    [15] Jiang YY, Qu G, Sun ZT. Machine learning assists the directed evolution of enzymes[J]. 生物学杂志, 2020, 37(4): 1-11. (in Chinese)
    [16] Sikander R, Wang YP, Ghulam A, et al. Identification of enzymes-specific protein domain based on DDE, and convolutional neural network[J]. Front Genet, 2021, 12: 759384. doi: 10.3389/fgene.2021.759384
    [17] Jing XY, Li FM. Predicting cell wall lytic enzymes using combined features[J]. Front Bioeng Biotechnol, 2021, 8: 627335. doi: 10.3389/fbioe.2020.627335
    [18] Wan ZY, Wang QD, Liu DC, et al. Accelerating the optimization of enzyme-catalyzed synthesis conditions via machine learning and reactivity descriptors[J]. Org Biomol Chem, 2021, 19(28): 6267-6273. doi: 10.1039/D1OB01066B
    [19] Kirk PDW, Stumpf MPH. Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data[J]. Bioinformatics, 2009, 25(10): 1300-1306. doi: 10.1093/bioinformatics/btp139; pmid: 19289448
    [20] Romero PA, Krause A, Arnold FH. Navigating the protein fitness landscape with Gaussian processes[J]. Proc Natl Acad Sci USA, 2013, 110(3): E193-E201.
    [21] Rasmussen CE, Williams CKI. Gaussian processes for machine learning[M]. Cambridge: The MIT Press, 2005.
    [22] Zhang ZH, Schott JA, Liu MM, et al. Prediction of carbon dioxide adsorption via deep learning[J]. Angew Chem Int Ed Engl, 2019, 58(1): 259-263. doi: 10.1002/anie.201812363
    [23] Luo HZ, Gao L, Liu Z, et al. Prediction of phenolic compounds and glucose content from dilute inorganic acid pretreatment of lignocellulosic biomass using artificial neural network modeling[J]. Bioresour Bioprocess, 2021, 8: 134. doi: 10.1186/s40643-021-00488-x
    [24] Saito Y, Oikawa M, Sato T, et al. Machine-learning-guided library design cycle for directed evolution of enzymes: the effects of training data composition on sequence space exploration[J]. ACS Catal, 2021, 11(23): 14615-14624. doi: 10.1021/acscatal.1c03753
    [25] del Rio-Chanona EA, Fiorelli F, Zhang DD, et al. An efficient model construction strategy to simulate microalgal lutein photo-production dynamic process[J]. Biotechnol Bioeng, 2017, 114(11): 2518-2527. doi: 10.1002/bit.26373; pmid: 28671262
    [26] Bian JH, Yang GY. Artificial intelligence-assisted protein engineering[J]. 合成生物学, 2022, 3(3): 429-444. doi: 10.12211/2096-8280.2021-032 (in Chinese)
    [27] Xu YT, Verma D, Sheridan RP, et al. Deep dive into machine learning models for protein engineering[J]. J Chem Inf Model, 2020, 60(6): 2773-2790. doi: 10.1021/acs.jcim.0c00073; pmid: 32250622
    [28] Yang KK, Wu Z, Arnold FH. Machine-learning-guided directed evolution for protein engineering[J]. Nat Methods, 2019, 16(8): 687-694. doi: 10.1038/s41592-019-0496-6; pmid: 31308553
    [29] Yang KK, Wu Z, Bedbrook CN, et al. Learned protein embeddings for machine learning[J]. Bioinformatics, 2018, 34(15): 2642-2648. doi: 10.1093/bioinformatics/bty178; pmid: 29584811
    [30] Roy S, Martinez D, Platero H, et al. Exploiting amino acid composition for predicting protein-protein interactions[J]. PLoS One, 2009, 4(11): e7813. doi: 10.1371/journal.pone.0007813
    [31] Wolpert DH. The lack of a priori distinctions between learning algorithms[J]. Neural Comput, 1996, 8(7): 1341-1390. doi: 10.1162/neco.1996.8.7.1341
    [32] van Westen GJ, Swier RF, Wegner JK, et al. Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): comparative study of 13 amino acid descriptor sets[J]. J Cheminform, 2013, 5(1): 41. doi: 10.1186/1758-2946-5-41
    [33] Hou J, Adhikari B, Cheng JL. DeepSF: deep convolutional neural network for mapping protein sequences to folds[J]. Bioinformatics, 2018, 34(8): 1295-1303. doi: 10.1093/bioinformatics/btx780; pmid: 29228193
    [34] Zacharaki EI. Prediction of protein function using a deep convolutional neural network ensemble[J]. Peerj Comput Sci, 2017, 3: e124. doi: 10.7717/peerj-cs.124
    [35] White C, Ismail HD, Saigo H, et al. CNN-BLPred: a convolutional neural network based predictor for β-lactamases (BL) and their classes[J]. BMC Bioinformatics, 2017, 18(Suppl 16): 577. doi: 10.1186/s12859-017-1972-6
    [36] Ragoza M, Hochuli J, Idrobo E, et al. Protein-ligand scoring with convolutional neural networks[J]. J Chem Inf Model, 2017, 57(4): 942-957. doi: 10.1021/acs.jcim.6b00740; pmid: 28368587
    [37] Xiao N, Cao DS, Zhu MF, et al. Protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences[J]. Bioinformatics, 2015, 31(11): 1857-1859. doi: 10.1093/bioinformatics/btv042; pmid: 25619996
    [38] Ismail HD, Saigo H, Kc DB. RF-NR: random forest based approach for improved classification of nuclear receptors[J]. IEEE/ACM Trans Comput Biol Bioinform, 2018, 15(6): 1844-1852. doi: 10.1109/TCBB.2017.2773063; pmid: 29990125
    [39] Leslie C, Eskin E, Noble WS. The spectrum kernel: a string kernel for SVM protein classification[J]. Pac Symp Biocomput, 2002: 564-575.
    [40] Sandberg M, Eriksson L, Jonsson J, et al. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids[J]. J Med Chem, 1998, 41(14): 2481-2491. pmid: 9651153
    [41] Mei H, Liao ZH, Zhou Y, et al. A new set of amino acid descriptors and its application in peptide QSARs[J]. Biopolymers, 2005, 80(6): 775-786. pmid: 15895431
    [42] Biou V, Gibrat JF, Levin JM, et al. Secondary structure prediction: combination of three different methods[J]. Protein Eng, 1988, 2(3): 185-191. pmid: 3237683
    [43] Ofer D, Linial M. ProFET: Feature engineering captures high-level protein functions[J]. Bioinformatics, 2015, 31(21): 3429-3436. doi: 10.1093/bioinformatics/btv345; pmid: 26130574
    [44] Tian FF, Zhou P, Li ZL. T-scale as a novel vector of topological descriptors for amino acids and its application in QSARs of peptides[J]. J Mol Struct, 2007, 830(1/2/3): 106-115. doi: 10.1016/j.molstruc.2006.07.004
    [45] Yang L, Shu M, Ma KW, et al. ST-scale as a novel amino acid descriptor and its application in QSAM of peptides and analogues[J]. Amino Acids, 2010, 38(3): 805-816. doi: 10.1007/s00726-009-0287-y; pmid: 19373543
    [46] van Westen GJ, Swier RF, Wegner JK, et al. Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): comparative study of 13 amino acid descriptor sets[J]. J Cheminform, 2013, 5(1): 41. doi: 10.1186/1758-2946-5-41
    [47] Kawashima S, Pokarowski P, Pokarowska M, et al. AAindex: amino acid index database, progress report 2008[J]. Nucleic Acids Res, 2008, 36(Database issue): D202-D205. doi: 10.1093/nar/gkm998; pmid: 17998252
    [48] Alley EC, Khimulya G, Biswas S, et al. Unified rational protein engineering with sequence-based deep representation learning[J]. Nat Methods, 2019, 16(12): 1315-1322. doi: 10.1038/s41592-019-0598-1; pmid: 31636460
    [49] Tanaka S, Scheraga HA. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins[J]. Macromolecules, 1976, 9(6): 945-950. pmid: 1004017
    [50] Asgari E, Mofrad MRK. Continuous distributed representation of biological sequences for deep proteomics and genomics[J]. PLoS One, 2015, 10(11): e0141287. doi: 10.1371/journal.pone.0141287
    [51] Jensen FV. An introduction to Bayesian networks[M]. London: UCL press, 1996.
    [52] Lim S, Lu Y, Cho CY, et al. A review on compound-protein interaction prediction methods: Data, format, representation and model[J]. Comput Struct Biotechnol J, 2021, 19: 1541-1556.
    [53] del Rio-Chanona EA, Cong XY, Bradford E, et al. Review of advanced physical and data-driven models for dynamic bioprocess simulation: case study of algae-bacteria consortium wastewater treatment[J]. Biotechnol Bioeng, 2019, 116(2): 342-353. doi: 10.1002/bit.26881; pmid: 30475404
    [54] Natarajan P, Moghadam R, Jagannathan S. Online deep neural network-based feedback control of a Lutein bioprocess[J]. J Process Control, 2021, 98: 41-51. doi: 10.1016/j.jprocont.2020.11.011
    [55] Kim GB, Kim WJ, Kim HU, et al. Machine learning applications in systems metabolic engineering[J]. Curr Opin Biotechnol, 2020, 64: 1-9. doi: 10.1016/j.copbio.2019.08.010
    [56] Wettschereck D, Aha DW, Mohri T. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms[M]// Lazy learning. Dordrecht: Springer Netherlands, 1997: 273-314.
    [57] Drucker H, Burges CJC, Kaufman L, et al. Support vector regression machines[J]. Adv Neural Inf Process Syst, 1997: 155-161.
    [58] Quinlan JR. Induction of decision trees[J]. Mach Learn, 1986, 1(1): 81-106.
    [59] Li Y, Song K, Zhang J, et al. A computational method to predict effects of residue mutations on the catalytic efficiency of hydrolases[J]. Catalysts, 2021, 11(2): 286. doi: 10.3390/catal11020286
    [60] Schapire RE. Explaining adaboost[M]// Schölkopf B, Luo ZY, Vovk V. Empirical inference. Verlag: Springer, 2013: 37-52.
    [61] Wolpert DH. Stacked generalization[J]. Neural Netw, 1992, 5(2): 241-259. doi: 10.1016/S0893-6080(05)80023-1
    [62] Abdi H, Williams LJ. Principal component analysis[J]. WIREs Comp Stat, 2010, 2(4): 433-459. doi: 10.1002/wics.101
    [63] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553): 436-444. doi: 10.1038/nature14539
    [64] Lohmann R, Schneider G, Behrens D, et al. A neural network model for the prediction of membrane-spanning amino acid sequences[J]. Protein Sci, 1994, 3(9): 1597-1601. pmid: 7833818
    [65] Rawat W, Wang ZH. Deep convolutional neural networks for image classification: a comprehensive review[J]. Neural Comput, 2017, 29(9): 2352-2449. doi: 10.1162/NECO_a_00990; pmid: 28599112
    [66] Creswell A, White T, Dumoulin V, et al. Generative adversarial networks: an overview[J]. IEEE Signal Process Mag, 2018, 35(1): 53-65. doi: 10.1109/MSP.2017.2765202
    [67] Auer P. Using confidence bounds for exploitation-exploration trade-offs[J]. J Machine Learning Res, 2002, 3(Nov): 397-422.
    [68] International Conference on Machine Learning. Proceedings of the Twenty-Ninth International Conference on Machine Learning[C]. Madison, Wis: International Machine Learning Society, 2012.
    [69] Endelman JB, Silberg JJ, Wang ZG, et al. Site-directed protein recombination as a shortest-path problem[J]. Protein Eng Des Sel, 2004, 17(7): 589-594. pmid: 15331774
    [70] Sandhu H, Kumar RN, Garg P. Machine learning-based modeling to predict inhibitors of acetylcholinesterase[J]. Mol Divers, 2022, 26(1): 331-340. doi: 10.1007/s11030-021-10223-5
    [71] Meng CL, Guo F, Zou Q. CWLy-SVM: a support vector machine-based tool for identifying cell wall lytic enzymes[J]. Comput Biol Chem, 2020, 87: 107304. doi: 10.1016/j.compbiolchem.2020.107304
    [72] Han X, Ning WB, Ma XQ, et al. Improve protein solubility and activity based on machine learning models[J]. bioRxiv, 2019. doi: 10.1101/817890
    [73] Cadet F, Fontaine N, Li GY, et al. A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes[J]. Sci Rep, 2018, 8(1): 16757. doi: 10.1038/s41598-018-35033-y; pmid: 30425279
    [74] Cadet F, Fontaine N, Vetrivel I, et al. Application of fourier transform and proteochemometrics principles to protein engineering[J]. BMC Bioinformatics, 2018, 19(1): 382. doi: 10.1186/s12859-018-2407-8; pmid: 30326841
    [75] Bonk BM, Weis JW, Tidor B. Machine learning identifies chemical characteristics that promote enzyme catalysis[J]. J Am Chem Soc, 2019, 141(9): 4108-4118. doi: 10.1021/jacs.8b13879; pmid: 30761897
    [76] Hon J, Borko S, Stourac J, et al. EnzymeMiner: automated mining of soluble enzymes with diverse structures, catalytic properties and stabilities[J]. Nucleic Acids Res, 2020, 48(W1): W104-W109. doi: 10.1093/nar/gkaa372
    [77] Li G, Rabe KS, Nielsen J, et al. Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima[J]. ACS Synth Biol, 2019, 8(6): 1411-1420. doi: 10.1021/acssynbio.9b00099; pmid: 31117361
    [78] Song JN, Li FY, Takemoto K, et al. PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework[J]. J Theor Biol, 2018, 443: 125-137. doi: S0022-5193(18)30039-0; pmid: 29408627
    [79] Saito Y, Oikawa M, Nakazawa H, et al. Machine-learning-guided mutagenesis for directed evolution of fluorescent proteins[J]. ACS Synth Biol, 2018, 7(9): 2014-2022. doi: 10.1021/acssynbio.8b00155; pmid: 30103599
    [80] Torng W, Altman RB. High precision protein functional site detection using 3D convolutional neural networks[J]. Bioinformatics, 2019, 35(9): 1503-1512. doi: 10.1093/bioinformatics/bty813; pmid: 31051039
    [81] Chew A, Jiang SL, Zhang WQ, et al. Fast predictions of liquid-phase acid-catalyzed reaction rates using molecular dynamics simulations and convolutional neural networks[J]. Chem Sci, 2019, 11: 12464-12476. doi: 10.1039/D0SC03261A
    [82] Khurana S, Rawi R, Kunji K, et al. DeepSol: a deep learning framework for sequence-based protein solubility prediction[J]. Bioinformatics, 2018, 34(15): 2605-2613. doi: 10.1093/bioinformatics/bty166; pmid: 29554211
    [83] Lu HY, Diaz DJ, Czarnecki NJ, et al. Machine learning-aided engineering of hydrolases for PET depolymerization[J]. Nature, 2022, 604(7907): 662-667. doi: 10.1038/s41586-022-04599-z
    [84] Dubey A, Realff MJ, Lee JH, et al. Support vector machines for learning to identify the critical positions of a protein[J]. J Theor Biol, 2005, 234(3): 351-361. pmid: 15784270
    [85] Cai YC, Yang HB, Li WH, et al. Computational prediction of site of metabolism for UGT-catalyzed reactions[J]. J Chem Inf Model, 2019, 59(3): 1085-1095. doi: 10.1021/acs.jcim.8b00851; pmid: 30586295
    [86] Silberg JJ, Endelman JB, Arnold FH. SCHEMA-guided protein recombination[J]. Methods Enzymol, 2004, 388: 35-42.
    [87] Srinivas N, Krause A, Kakade SM, et al. Gaussian process optimization in the bandit setting: no regret and experimental design[EB/OL]. 2009: arXiv: 0912.3995[cs.LG]. https://arxiv.org/abs/0912.3995.
    [88] Voigt CA, Martinez C, Wang ZG, et al. Protein building blocks preserved by recombination[J]. Nat Struct Biol, 2002, 9(7): 553-558. pmid: 12042875
    [89] Shroff R, Cole AW, Diaz DJ, et al. Discovery of novel gain-of-function mutations guided by structure-based deep learning[J]. ACS Synth Biol, 2020, 9(11): 2927-2935. doi: 10.1021/acssynbio.0c00345; pmid: 33064458
    [90] Paik I, Ngo PHT, Shroff R, et al. Improved bst DNA polymerase variants derived via a machine learning approach[J]. Biochemistry, 2021. https://doi.org/10.1021/acs.biochem.1c00451.
    [91] Kulikova AV, Diaz DJ, Loy JM, et al. Learning the local landscape of protein structures with convolutional neural networks[J]. J Biol Phys, 2021, 47(4): 435-454. doi: 10.1007/s10867-021-09593-6; pmid: 34751854
    [92] Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold[J]. Nature, 2021, 596(7873): 583-589. doi: 10.1038/s41586-021-03819-2
    [93] Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network[J]. Science, 2021, 373(6557): 871-876. doi: 10.1126/science.abj8754; pmid: 34282049
    [94] Riesselman AJ, Ingraham JB, Marks DS. Deep generative models of genetic variation capture the effects of mutations[J]. Nat Methods, 2018, 15(10): 816-822. doi: 10.1038/s41592-018-0138-4; pmid: 30250057