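About the author:
Wang Muqiang, male, master's student, research interest: biochemical engineering; E-mail: y30210500@mail.ecust.edu.cn
Funding:
National Key R&D Program of China (2019YFA0905000); National Key R&D Program of China (2021YFC2102300); National Natural Science Foundation of China (21871085); National Natural Science Foundation of China (31971380); National Natural Science Foundation of China (31971381)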
Abstract:
Directed evolution mimics natural evolution to accelerate the improvement of enzymes and has become a key technology in enzyme engineering. It plays an important role in biocatalysis and drug design, but the randomness of mutation produces very large numbers of variants, which poses a major challenge to experimental screening capacity. In recent years, emerging technologies such as artificial intelligence and big data processing have also become important research tools in biocatalysis. Among them, machine learning is a statistical learning approach that obtains a mapping from sequence or structure to enzyme function in a data-driven manner, helping to improve the efficiency of enzyme engineering. This paper reviews the data processing, descriptors and algorithms involved in machine learning models, with a focus on the research and application progress of machine learning methods in enzyme engineering. As machine learning algorithms and their applications advance, more accurate and effective models are expected to emerge, supporting the screening of new enzymes and the precise design and engineering of biocatalysts.
Key words: directed evolution, machine learning, protein engineering, biocatalysis
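Table 1 Descriptors commonly used in machine learning-guided enzyme design

| Type | Encoding methods/Information | Descriptors | Feature information | Reference |
|---|---|---|---|---|
| Sequence-based descriptors | One-hot encoding | Identity | Represents the residue at each position | [27] |
| Sequence-based descriptors | Homology information | Position-specific scoring matrix (PSSM) | Homology information of the sequence | [32] |
| Sequence-based descriptors | Physicochemical properties | One-dimensional deep convolutional neural network (DeepSF) | Sequence descriptors, sequence position, secondary structure and solvent accessibility | [33] |
| Sequence-based descriptors | Physicochemical properties | Geometric descriptors | Histograms of torsion-angle density and residue-distance density | [34] |
| Sequence-based descriptors | Physicochemical properties | Conjoint triad descriptors | K-spaced residue pairs and conjoint triads | [35] |
| Sequence-based descriptors | Physicochemical properties | Spatial and chemical features | Protein-ligand structural information on a three-dimensional grid | [36] |
| Sequence-based descriptors | Physicochemical properties | Amino acid composition descriptors in the Protr package | Amino acid content | [37] |
| Sequence-based descriptors | Physicochemical properties | Spectrum kernel | Homology between distant residues | [38-39] |
| Sequence-based descriptors | Physicochemical properties | zScales | Physicochemical properties of amino acids | [40] |
| Sequence-based descriptors | Physicochemical properties | VHSE (principal component vectors of hydrophobic, steric and electronic properties) | Information obtained by PCA dimensionality reduction of hydrophobic, steric and electronic properties | [41] |
| Sequence-based descriptors | Physicochemical properties | sScales | Physicochemical features based on AAindex | [42] |
| Sequence-based descriptors | Physicochemical properties | ProFET | AAindex physicochemical features selected by literature search | [43] |
| Sequence-based descriptors | Physicochemical properties | T-scale (topological scale) | Topological structure features | [44] |
| Sequence-based descriptors | Physicochemical properties | ST-scale (structural topology scale) | Topological structure features | [45] |
| Sequence-based descriptors | Physicochemical properties | ProtFP (protein fingerprint) | Physicochemical properties of amino acids | [46] |
| Sequence-based descriptors | Physicochemical properties | AAindex | Protein properties | [47] |
| Sequence-based descriptors | Hidden information | UniRep | Sequence features extracted automatically by a neural network model | [48] |
| Structure-based descriptors | Physicochemical properties | sPairs | Two-dimensional descriptors based on amino acid pair contact maps and AAindex | [49] |
| Structure-based descriptors | One-hot encoding | Residue-residue contact map | Structural distance between two proteins of the same family | [20] |
| Embedded descriptors | Hidden information | ProtVec | Mutation information derived from amino acid triplets | [50] |
| Mutation descriptors | Indicates the mutation pattern | MutInd | Uses 0 or 1 to indicate whether the corresponding mutation occurs in the variant sequence | [27] |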
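Two of the descriptors listed in Table 1, one-hot (Identity) encoding and the MutInd mutation indicator, can be illustrated with a minimal sketch. The function names, the 20-letter residue alphabet ordering and the "A2G" mutation notation are assumptions made for illustration; they are not taken from the cited works.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (ordering is an assumption)


def one_hot_encode(sequence):
    """Return one 20-element 0/1 vector per residue (Identity descriptor)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # raises KeyError for non-standard residues
        encoded.append(row)
    return encoded


def mutation_indicator(parent, variant, known_mutations):
    """Return a 0/1 vector marking which known mutations occur in the variant.

    Mutations are written like "A2G": parent residue, 1-based position,
    new residue. This notation is a hypothetical choice for the sketch.
    """
    if len(parent) != len(variant):
        raise ValueError("parent and variant must have the same length")
    present = {
        f"{p}{i}{v}"
        for i, (p, v) in enumerate(zip(parent, variant), start=1)
        if p != v
    }
    return [1 if m in present else 0 for m in known_mutations]


parent = "MKTA"
variant = "MRTG"
print(one_hot_encode("AC")[1][:3])  # [0, 1, 0]: residue C sits at index 1
print(mutation_indicator(parent, variant, ["K2R", "T3S", "A4G"]))  # [1, 0, 1]
```

One-hot vectors grow with sequence length, whereas the mutation indicator has one entry per mutation under study, which is why the latter suits small combinatorial libraries.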
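Table 2 Common machine learning algorithms

| Type | Algorithm/Method | Feature | Reference |
|---|---|---|---|
| Classical machine learning | Bayesian algorithm | Models multiple kinds of relationships among variables | [51] |
| Classical machine learning | Gaussian process (GP) | Estimates from the data a probability distribution over all possible model outputs and predicts from that distribution; a covariance function defines the relationship between data points | [5, 27, 52-55] |
| Classical machine learning | K-nearest neighbors (KNN) | Based on data-label relationships; compares the feature distance between new and existing data and retrieves the most similar data | [56] |
| Classical machine learning | Support vector machine (SVM) | Kernel-based algorithm; mapping to a higher-dimensional space can turn a linearly inseparable relationship into a linearly separable one | [57] |
| Classical machine learning | Decision trees (DTs) | Classify or predict from the input data | [58] |
| Classical machine learning | Random forest (RF) | A bagging method; each classifier is trained on a random sample of the data and a random subset of features, with the same number of each | [59] |
| Classical machine learning | AdaBoost | A boosting method that combines weak classifiers into a strong classifier | [60] |
| Classical machine learning | Stacking | Trains several classifiers in a first round and feeds their outputs to a second-round classifier that produces the final output | [61] |
| Classical machine learning | Principal component analysis (PCA) | Unsupervised learning that maps the original data into a new feature space to extract features | [62] |
| Deep learning | Deep neural network (DNN) | An ANN with multiple hidden layers | [63] |
| Deep learning | Feedforward neural network (FNN) | The network structure contains no recurrent connections | [27] |
| Deep learning | Recurrent neural network (RNN) | Recognizes and models contextual relationships within a sequence | [64] |
| Deep learning | Convolutional neural network (CNN) | Input data are images or image-like | [65] |
| Deep learning | Generative adversarial network (GAN) | Trains two separate, competing networks simultaneously | [66] |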
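The K-nearest neighbors entry in Table 2 (compare the feature distance between new and existing data, then use the most similar data) can be sketched as follows. This is a minimal illustration under stated assumptions: Hamming distance on equal-length sequences stands in for the feature distance, the prediction is the mean label of the neighbors, and the toy sequences and activity values are made up.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_predict(query, training_set, k=3):
    """Average the labels of the k training sequences closest to the query."""
    ranked = sorted(training_set, key=lambda item: hamming_distance(query, item[0]))
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Hypothetical (sequence, measured activity) pairs.
training_set = [
    ("MKTA", 1.0),
    ("MRTA", 2.0),
    ("MRTG", 3.0),
    ("LLLL", 10.0),
]
print(knn_predict("MRTA", training_set, k=3))  # 2.0: the distant LLLL is ignored
```

In practice the distance would usually be computed on one of the descriptor vectors rather than on raw residues, and k would be chosen by cross-validation.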
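Table 3 Applications of machine learning in enzyme engineering

| Model | Task | Machine learning algorithm | Input type | Application | Reference |
|---|---|---|---|---|---|
| - | Enzyme function | RF | Molecule | Distinguishing acetylcholinesterase inhibitors from non-inhibitors | [70] |
| CWLy-SVM | Enzyme classification | SVM | Sequence | Identifying cell wall lytic enzymes | [71] |
| SVR | Enzyme function | SVM | Sequence | Improving enzyme activity and solubility | [72] |
| GPR/GNB | Enzyme function | GP | Structure | Improving the activity of fatty acyl reductase | [5] |
| - | Enzyme classification | HMM/RF/LR/KNN/SVM | Sequence | Classifying CBH and EG in glycoside hydrolase family 7 | [9] |
| Innov'SAR | Enzyme function | PLSR | Sequence | Finding the best combination of mutations for improved activity | [10, 73-74] |
| - | Enzyme function | LR | Structure | Determining the reactivity of substrate-enzyme pairs | [75] |
| SoluProt | Enzyme function | RF | Sequence | Predicting enzyme solubility in the Escherichia coli expression system | [76] |
| ProSAR | Enzyme function | PLSR | Sequence | Improving the activity of halohydrin dehalogenase | [3] |
| TOME | Enzyme function | RF | Sequence | Predicting optimum temperature | [13, 77] |
| PREvaIL | Catalytic residues | RF | Sequence and structure | Method for predicting catalytic residues of enzymes | [78] |
| - | Enzyme function | GP | Sequence | Engineering the fluorescence of green fluorescent protein | [79] |
| - | Enzyme function | GP | Structure | Engineering the stability of cytochrome P450 | [20] |
| - | Catalytic residues | CNN | Structure | Framework for predicting catalytic residues of enzymes | [80] |
| SolventNet | Enzyme function | CNN | Structure | Effects of acid, catalyst and solvent on hydrolysis rate | [81] |
| DeepSol | Enzyme solubility | ANN | Sequence | Predicting protein solubility | [82] |
| - | Enzyme function | CNN | Structure | Optimizing the catalytic capacity and tolerance of PET hydrolase | [83] |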
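Several entries in Table 3 (ProSAR, Innov'SAR) relate sequence to activity through a regression model and then search for the best combination of mutations. The sketch below illustrates only that general idea. It deliberately swaps in ordinary least-squares linear regression fitted by gradient descent, not the PLSR used by those tools, and the mutation indicators and activity values are hypothetical toy data.

```python
from itertools import product


def fit_linear(features, targets, lr=0.5, epochs=5000):
    """Fit y ~ bias + sum(w_j * x_j) by batch gradient descent on squared error."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            error = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


def predict(bias, weights, x):
    """Predicted activity for a 0/1 mutation indicator vector."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical data: indicators for three mutations and made-up activities.
X = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
y = [1.0, 1.5, 0.8, 1.3, 1.3]

bias, weights = fit_linear(X, y)
# Enumerate all 2^3 combinations and keep the one with the highest prediction.
best = max(product((0, 1), repeat=3), key=lambda x: predict(bias, weights, x))
print(best)  # (1, 0, 1): the two beneficial mutations, without the harmful one
```

A purely additive model like this one cannot capture epistasis between mutations, which is one reason the tools in the table use richer encodings and regression methods.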
References:
[1] Chen K, Arnold FH. Tuning the activity of an enzyme for unusual environments: sequential random mutagenesis of subtilisin E for catalysis in dimethylformamide[J]. Proc Natl Acad Sci USA, 1993, 90(12): 5618-5622. pmid: 8516309
[2] Tang CD, Zhang ZH, Shi HL, et al. Directed evolution of formate dehydrogenase and its application in the biosynthesis of L-phenylglycine from phenylglyoxylic acid[J]. Mol Catal, 2021, 513: 111666.
[3] Fox RJ, Davis SC, Mundorff EC, et al. Improving catalytic function by ProSAR-driven enzyme evolution[J]. Nat Biotechnol, 2007, 25(3): 338-344. pmid: 17322872
[4] Reetz MT. The importance of additive and non-additive mutational effects in protein engineering[J]. Angewandte Chemie Int Ed, 2013, 52(10): 2658-2666. doi: 10.1002/anie.201207842
[5] Greenhalgh JC, Fahlberg SA, Pfleger BF, et al. Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production[J]. Nat Commun, 2021, 12(1): 5825. doi: 10.1038/s41467-021-25831-w; pmid: 34611172
[6] Miton CM, Tokuriki N. How mutational epistasis impairs predictability in protein evolution and design[J]. Protein Sci, 2016, 25(7): 1260-1272. doi: 10.1002/pro.2876; pmid: 26757214
[7] Romero PA, Arnold FH. Exploring protein fitness landscapes by directed evolution[J]. Nat Rev Mol Cell Biol, 2009, 10(12): 866-876. doi: 10.1038/nrm2805
[8] Ma EJ, Siirola E, Moore C, et al. Machine-directed evolution of an imine reductase for activity and stereoselectivity[J]. ACS Catal, 2021, 11(20): 12433-12445. doi: 10.1021/acscatal.1c02786
[9] Gado JE, Harrison BE, Sandgren M, et al. Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases[J]. J Biol Chem, 2021, 297(2): 100931. doi: 10.1016/j.jbc.2021.100931
[10] Ostafe R, Fontaine N, Frank D, et al. One-shot optimization of multiple enzyme parameters: Tailoring glucose oxidase for pH and electron mediators[J]. Biotechnol Bioeng, 2020, 117(1): 17-29. doi: 10.1002/bit.27169; pmid: 31520472
[11] Peng M, de Vries RP. Machine learning prediction of novel pectinolytic enzymes in Aspergillus niger through integrating heterogeneous (post-)genomics data[J]. Microb Genom, 2021, 7(12): 000674.
[12] Wu Z, Kan SBJ, Lewis RD, et al. Machine learning-assisted directed protein evolution with combinatorial libraries[J]. Proc Natl Acad Sci USA, 2019, 116(18): 8852-8858. doi: 10.1073/pnas.1901979116; pmid: 30979809
[13] Li GY, Dong YJ, Reetz MT. Can machine learning revolutionize directed evolution of selective enzymes?[J]. Adv Synth Catal, 2019, 361(11): 2377-2386.
[14] Baştanlar Y, Özuysal M. Introduction to machine learning[M]//miRNomics: microRNA biology and computational analysis. Totowa, NJ: Humana Press, 2013: 105-128.
[15] Jiang YY, Qu G, Sun ZT. Machine learning-assisted enzyme directed evolution[J]. Journal of Biology (in Chinese), 2020, 37(4): 1-11.
[16] Sikander R, Wang YP, Ghulam A, et al. Identification of enzymes-specific protein domain based on DDE, and convolutional neural network[J]. Front Genet, 2021, 12: 759384. doi: 10.3389/fgene.2021.759384
[17] Jing XY, Li FM. Predicting cell wall lytic enzymes using combined features[J]. Front Bioeng Biotechnol, 2021, 8: 627335. doi: 10.3389/fbioe.2020.627335
[18] Wan ZY, Wang QD, Liu DC, et al. Accelerating the optimization of enzyme-catalyzed synthesis conditions via machine learning and reactivity descriptors[J]. Org Biomol Chem, 2021, 19(28): 6267-6273. doi: 10.1039/D1OB01066B
[19] Kirk PDW, Stumpf MPH. Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data[J]. Bioinformatics, 2009, 25(10): 1300-1306. doi: 10.1093/bioinformatics/btp139; pmid: 19289448
[20] Romero PA, Krause A, Arnold FH. Navigating the protein fitness landscape with Gaussian processes[J]. Proc Natl Acad Sci USA, 2013, 110(3): E193-E201.
[21] Rasmussen CE, Williams CKI. Gaussian processes for machine learning[M]. Cambridge: The MIT Press, 2005.
[22] Zhang ZH, Schott JA, Liu MM, et al. Prediction of carbon dioxide adsorption via deep learning[J]. Angew Chem Int Ed Engl, 2019, 58(1): 259-263. doi: 10.1002/anie.201812363
[23] Luo HZ, Gao L, Liu Z, et al. Prediction of phenolic compounds and glucose content from dilute inorganic acid pretreatment of lignocellulosic biomass using artificial neural network modeling[J]. Bioresour Bioprocess, 2021, 8: 134. doi: 10.1186/s40643-021-00488-x
[24] Saito Y, Oikawa M, Sato T, et al. Machine-learning-guided library design cycle for directed evolution of enzymes: the effects of training data composition on sequence space exploration[J]. ACS Catal, 2021, 11(23): 14615-14624. doi: 10.1021/acscatal.1c03753
[25] del Rio-Chanona EA, Fiorelli F, Zhang DD, et al. An efficient model construction strategy to simulate microalgal lutein photo-production dynamic process[J]. Biotechnol Bioeng, 2017, 114(11): 2518-2527. doi: 10.1002/bit.26373; pmid: 28671262
[26] Bian JH, Yang GY. Artificial intelligence-assisted protein engineering[J]. Synthetic Biology Journal (in Chinese), 2022, 3(3): 429-444. doi: 10.12211/2096-8280.2021-032
[27] Xu YT, Verma D, Sheridan RP, et al. Deep dive into machine learning models for protein engineering[J]. J Chem Inf Model, 2020, 60(6): 2773-2790. doi: 10.1021/acs.jcim.0c00073; pmid: 32250622
[28] Yang KK, Wu Z, Arnold FH. Machine-learning-guided directed evolution for protein engineering[J]. Nat Methods, 2019, 16(8): 687-694. doi: 10.1038/s41592-019-0496-6; pmid: 31308553
[29] Yang KK, Wu Z, Bedbrook CN, et al. Learned protein embeddings for machine learning[J]. Bioinformatics, 2018, 34(15): 2642-2648. doi: 10.1093/bioinformatics/bty178; pmid: 29584811
[30] Roy S, Martinez D, Platero H, et al. Exploiting amino acid composition for predicting protein-protein interactions[J]. PLoS One, 2009, 4(11): e7813. doi: 10.1371/journal.pone.0007813
[31] Wolpert DH. The lack of a priori distinctions between learning algorithms[J]. Neural Comput, 1996, 8(7): 1341-1390. doi: 10.1162/neco.1996.8.7.1341
[32] van Westen GJ, Swier RF, Wegner JK, et al. Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): comparative study of 13 amino acid descriptor sets[J]. J Cheminform, 2013, 5(1): 41. doi: 10.1186/1758-2946-5-41
[33] Hou J, Adhikari B, Cheng JL. DeepSF: deep convolutional neural network for mapping protein sequences to folds[J]. Bioinformatics, 2018, 34(8): 1295-1303. doi: 10.1093/bioinformatics/btx780; pmid: 29228193
[34] Zacharaki EI. Prediction of protein function using a deep convolutional neural network ensemble[J]. PeerJ Comput Sci, 2017, 3: e124. doi: 10.7717/peerj-cs.124
[35] White C, Ismail HD, Saigo H, et al. CNN-BLPred: a convolutional neural network based predictor for β-lactamases (BL) and their classes[J]. BMC Bioinformatics, 2017, 18(Suppl 16): 577. doi: 10.1186/s12859-017-1972-6
[36] Ragoza M, Hochuli J, Idrobo E, et al. Protein-ligand scoring with convolutional neural networks[J]. J Chem Inf Model, 2017, 57(4): 942-957. doi: 10.1021/acs.jcim.6b00740; pmid: 28368587
[37] Xiao N, Cao DS, Zhu MF, et al. Protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences[J]. Bioinformatics, 2015, 31(11): 1857-1859. doi: 10.1093/bioinformatics/btv042; pmid: 25619996
[38] Ismail HD, Saigo H, Kc DB. RF-NR: random forest based approach for improved classification of nuclear receptors[J]. IEEE/ACM Trans Comput Biol Bioinform, 2018, 15(6): 1844-1852. doi: 10.1109/TCBB.2017.2773063; pmid: 29990125
[39] Leslie C, Eskin E, Noble WS. The spectrum kernel: a string kernel for SVM protein classification[J]. Pac Symp Biocomput, 2002: 564-575.
[40] Sandberg M, Eriksson L, Jonsson J, et al. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids[J]. J Med Chem, 1998, 41(14): 2481-2491. pmid: 9651153
[41] Mei H, Liao ZH, Zhou Y, et al. A new set of amino acid descriptors and its application in peptide QSARs[J]. Biopolymers, 2005, 80(6): 775-786. pmid: 15895431
[42] Biou V, Gibrat JF, Levin JM, et al. Secondary structure prediction: combination of three different methods[J]. Protein Eng, 1988, 2(3): 185-191. pmid: 3237683
[43] Ofer D, Linial M. ProFET: Feature engineering captures high-level protein functions[J]. Bioinformatics, 2015, 31(21): 3429-3436. doi: 10.1093/bioinformatics/btv345; pmid: 26130574
[44] Tian FF, Zhou P, Li ZL. T-scale as a novel vector of topological descriptors for amino acids and its application in QSARs of peptides[J]. J Mol Struct, 2007, 830(1/2/3): 106-115. doi: 10.1016/j.molstruc.2006.07.004
[45] Yang L, Shu M, Ma KW, et al. ST-scale as a novel amino acid descriptor and its application in QSAM of peptides and analogues[J]. Amino Acids, 2010, 38(3): 805-816. doi: 10.1007/s00726-009-0287-y; pmid: 19373543
[46] van Westen GJ, Swier RF, Wegner JK, et al. Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): comparative study of 13 amino acid descriptor sets[J]. J Cheminform, 2013, 5(1): 41. doi: 10.1186/1758-2946-5-41
[47] Kawashima S, Pokarowski P, Pokarowska M, et al. AAindex: amino acid index database, progress report 2008[J]. Nucleic Acids Res, 2008, 36(Database issue): D202-D205. doi: 10.1093/nar/gkm998; pmid: 17998252
[48] Alley EC, Khimulya G, Biswas S, et al. Unified rational protein engineering with sequence-based deep representation learning[J]. Nat Methods, 2019, 16(12): 1315-1322. doi: 10.1038/s41592-019-0598-1; pmid: 31636460
[49] Tanaka S, Scheraga HA. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins[J]. Macromolecules, 1976, 9(6): 945-950. pmid: 1004017
[50] Asgari E, Mofrad MRK. Continuous distributed representation of biological sequences for deep proteomics and genomics[J]. PLoS One, 2015, 10(11): e0141287. doi: 10.1371/journal.pone.0141287
[51] Jensen FV. An introduction to Bayesian networks[M]. London: UCL Press, 1996.
[52] Lim S, Lu Y, Cho CY, et al. A review on compound-protein interaction prediction methods: Data, format, representation and model[J]. Comput Struct Biotechnol J, 2021, 19: 1541-1556.
[53] del Rio-Chanona EA, Cong XY, Bradford E, et al. Review of advanced physical and data-driven models for dynamic bioprocess simulation: case study of algae-bacteria consortium wastewater treatment[J]. Biotechnol Bioeng, 2019, 116(2): 342-353. doi: 10.1002/bit.26881; pmid: 30475404
[54] Natarajan P, Moghadam R, Jagannathan S. Online deep neural network-based feedback control of a Lutein bioprocess[J]. J Process Control, 2021, 98: 41-51. doi: 10.1016/j.jprocont.2020.11.011
[55] Kim GB, Kim WJ, Kim HU, et al. Machine learning applications in systems metabolic engineering[J]. Curr Opin Biotechnol, 2020, 64: 1-9. doi: 10.1016/j.copbio.2019.08.010
[56] Wettschereck D, Aha DW, Mohri T. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms[M]//Lazy learning. Dordrecht: Springer Netherlands, 1997: 273-314.
[57] Drucker H, Burges CJC, Kaufman L, et al. Support vector regression machines[J]. Adv Neural Inf Process Syst, 1997: 155-161.
[58] Quinlan JR. Induction of decision trees[J]. Mach Learn, 1986, 1(1): 81-106.
[59] Li Y, Song K, Zhang J, et al. A computational method to predict effects of residue mutations on the catalytic efficiency of hydrolases[J]. Catalysts, 2021, 11(2): 286. doi: 10.3390/catal11020286
[60] Schapire RE. Explaining AdaBoost[M]//Schölkopf B, Luo ZY, Vovk V. Empirical inference. Springer, 2013: 37-52.
[61] Wolpert DH. Stacked generalization[J]. Neural Netw, 1992, 5(2): 241-259. doi: 10.1016/S0893-6080(05)80023-1
[62] Abdi H, Williams LJ. Principal component analysis[J]. WIREs Comp Stat, 2010, 2(4): 433-459. doi: 10.1002/wics.101
[63] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553): 436-444. doi: 10.1038/nature14539
[64] Lohmann R, Schneider G, Behrens D, et al. A neural network model for the prediction of membrane-spanning amino acid sequences[J]. Protein Sci, 1994, 3(9): 1597-1601. pmid: 7833818
[65] Rawat W, Wang ZH. Deep convolutional neural networks for image classification: a comprehensive review[J]. Neural Comput, 2017, 29(9): 2352-2449. doi: 10.1162/NECO_a_00990; pmid: 28599112
[66] Creswell A, White T, Dumoulin V, et al. Generative adversarial networks: an overview[J]. IEEE Signal Process Mag, 2018, 35(1): 53-65. doi: 10.1109/MSP.2017.2765202
[67] Auer P. Using confidence bounds for exploitation-exploration trade-offs[J]. J Machine Learning Res, 2002, 3(Nov): 397-422.
[68] International Conference on Machine Learning. Proceedings of the Twenty-Ninth International Conference on Machine Learning[C]. Madison, Wis: International Machine Learning Society, 2012.
[69] Endelman JB, Silberg JJ, Wang ZG, et al. Site-directed protein recombination as a shortest-path problem[J]. Protein Eng Des Sel, 2004, 17(7): 589-594. pmid: 15331774
[70] Sandhu H, Kumar RN, Garg P. Machine learning-based modeling to predict inhibitors of acetylcholinesterase[J]. Mol Divers, 2022, 26(1): 331-340. doi: 10.1007/s11030-021-10223-5
[71] Meng CL, Guo F, Zou Q. CWLy-SVM: a support vector machine-based tool for identifying cell wall lytic enzymes[J]. Comput Biol Chem, 2020, 87: 107304. doi: 10.1016/j.compbiolchem.2020.107304
[72] Han X, Ning WB, Ma XQ, et al. Improve protein solubility and activity based on machine learning models[J]. bioRxiv, 2019. doi: 10.1101/817890
[73] Cadet F, Fontaine N, Li GY, et al. A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes[J]. Sci Rep, 2018, 8(1): 16757. doi: 10.1038/s41598-018-35033-y; pmid: 30425279
[74] Cadet F, Fontaine N, Vetrivel I, et al. Application of Fourier transform and proteochemometrics principles to protein engineering[J]. BMC Bioinformatics, 2018, 19(1): 382. doi: 10.1186/s12859-018-2407-8; pmid: 30326841
[75] Bonk BM, Weis JW, Tidor B. Machine learning identifies chemical characteristics that promote enzyme catalysis[J]. J Am Chem Soc, 2019, 141(9): 4108-4118. doi: 10.1021/jacs.8b13879; pmid: 30761897
[76] Hon J, Borko S, Stourac J, et al. EnzymeMiner: automated mining of soluble enzymes with diverse structures, catalytic properties and stabilities[J]. Nucleic Acids Res, 2020, 48(W1): W104-W109. doi: 10.1093/nar/gkaa372
[77] Li G, Rabe KS, Nielsen J, et al. Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima[J]. ACS Synth Biol, 2019, 8(6): 1411-1420. doi: 10.1021/acssynbio.9b00099; pmid: 31117361
[78] Song JN, Li FY, Takemoto K, et al. PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework[J]. J Theor Biol, 2018, 443: 125-137. pii: S0022-5193(18)30039-0; pmid: 29408627
[79] Saito Y, Oikawa M, Nakazawa H, et al. Machine-learning-guided mutagenesis for directed evolution of fluorescent proteins[J]. ACS Synth Biol, 2018, 7(9): 2014-2022. doi: 10.1021/acssynbio.8b00155; pmid: 30103599
[80] Torng W, Altman RB. High precision protein functional site detection using 3D convolutional neural networks[J]. Bioinformatics, 2019, 35(9): 1503-1512. doi: 10.1093/bioinformatics/bty813; pmid: 31051039
[81] Chew A, Jiang SL, Zhang WQ, et al. Fast predictions of liquid-phase acid-catalyzed reaction rates using molecular dynamics simulations and convolutional neural networks[J]. Chem Sci, 2020, 11: 12464-12476. doi: 10.1039/D0SC03261A
[82] Khurana S, Rawi R, Kunji K, et al. DeepSol: a deep learning framework for sequence-based protein solubility prediction[J]. Bioinformatics, 2018, 34(15): 2605-2613. doi: 10.1093/bioinformatics/bty166; pmid: 29554211
[83] Lu HY, Diaz DJ, Czarnecki NJ, et al. Machine learning-aided engineering of hydrolases for PET depolymerization[J]. Nature, 2022, 604(7907): 662-667. doi: 10.1038/s41586-022-04599-z
[84] Dubey A, Realff MJ, Lee JH, et al. Support vector machines for learning to identify the critical positions of a protein[J]. J Theor Biol, 2005, 234(3): 351-361. pmid: 15784270
[85] Cai YC, Yang HB, Li WH, et al. Computational prediction of site of metabolism for UGT-catalyzed reactions[J]. J Chem Inf Model, 2019, 59(3): 1085-1095. doi: 10.1021/acs.jcim.8b00851; pmid: 30586295
[86] Silberg JJ, Endelman JB, Arnold FH. SCHEMA-guided protein recombination[J]. Methods Enzymol, 2004, 388: 35-42.
[87] Srinivas N, Krause A, Kakade SM, et al. Gaussian process optimization in the bandit setting: no regret and experimental design[EB/OL]. arXiv: 0912.3995 [cs.LG], 2009. https://arxiv.org/abs/0912.3995
[88] Voigt CA, Martinez C, Wang ZG, et al. Protein building blocks preserved by recombination[J]. Nat Struct Biol, 2002, 9(7): 553-558. pmid: 12042875
[89] Shroff R, Cole AW, Diaz DJ, et al. Discovery of novel gain-of-function mutations guided by structure-based deep learning[J]. ACS Synth Biol, 2020, 9(11): 2927-2935. doi: 10.1021/acssynbio.0c00345; pmid: 33064458
[90] Paik I, Ngo PHT, Shroff R, et al. Improved bst DNA polymerase variants derived via a machine learning approach[J]. Biochemistry, 2021. doi: 10.1021/acs.biochem.1c00451
[91] Kulikova AV, Diaz DJ, Loy JM, et al. Learning the local landscape of protein structures with convolutional neural networks[J]. J Biol Phys, 2021, 47(4): 435-454. doi: 10.1007/s10867-021-09593-6; pmid: 34751854
[92] Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold[J]. Nature, 2021, 596(7873): 583-589. doi: 10.1038/s41586-021-03819-2
[93] Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network[J]. Science, 2021, 373(6557): 871-876. doi: 10.1126/science.abj8754; pmid: 34282049
[94] Riesselman AJ, Ingraham JB, Marks DS. Deep generative models of genetic variation capture the effects of mutations[J]. Nat Methods, 2018, 15(10): 816-822. doi: 10.1038/s41592-018-0138-4; pmid: 30250057