PyPEF – An Integrated Framework for Data-Driven Protein Engineering

  Niklas Siedhoff (l.) and Alexander Illig (r.). Copyright: © Bio VI

Congratulations to Niklas Siedhoff and Alexander Illig on their recent publication!

In their new publication, Niklas and Alexander describe a software framework termed “PyPEF” (Pythonic Protein Engineering Framework) that assists protein evolution campaigns by screening variants in silico. Protein engineering is critical for improving the desired function of proteins outside their in vivo conditions, but the exhaustive identification of potentially advantageous substitution positions, and the combination of several identified positions, still limits its efficiency. Alongside established random or semi-rational protein engineering approaches, predictive in silico methods can reduce the experimental screening effort needed to find improved variants.

PyPEF was therefore developed as a selection tool for recombining experimentally identified substitutions or for identifying beneficial variants in unexplored sequence space. The framework is written in Python 3 and comprises five major steps: data input, encoding, parameter tuning, model validation, and optional training on all data (see Figure). In the study, PyPEF trained models efficiently thanks to a parallelized model validation routine and screened predefined, recombinant, or randomly sampled sequence space trajectories with system-dependent throughputs of about one million variants within several minutes on a personal computer. Further, PyPEF achieved high accuracies on four public protein engineering datasets when models were trained on variants with fewer substitutions and their performance was assessed on variants with more substitutions, mimicking a data-driven directed evolution approach. With this, PyPEF proved its capability to efficiently assist protein evolution campaigns and to identify and tailor beneficial variants in silico.
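The evaluation scheme described above, learning on variants with few substitutions and testing on variants with more, can be illustrated with a minimal sketch. This is not PyPEF's code: the variant naming convention, the fitness values, and all function names are hypothetical, and a simple additive baseline stands in for the trained models purely to keep the example short.

```python
# Hypothetical toy data: variant name -> measured fitness (made-up numbers).
# Substitutions are written like "A12G" and joined with "/" (an assumed naming
# convention, not necessarily PyPEF's input format).
WILD_TYPE_FITNESS = 1.0  # assumed reference value
VARIANT_FITNESS = {
    "A12G": 1.2,
    "T45S": 0.8,
    "L78M": 1.5,
    "A12G/T45S": 1.1,
    "A12G/L78M": 1.6,
    "T45S/L78M": 1.3,
}


def count_substitutions(variant):
    """Number of substitutions encoded in a variant name."""
    return len(variant.split("/"))


def split_low_high(data, max_learning_substitutions=1):
    """Put low-substituted variants in the learning set, the rest in the test set."""
    learning = {v: f for v, f in data.items()
                if count_substitutions(v) <= max_learning_substitutions}
    testing = {v: f for v, f in data.items()
               if count_substitutions(v) > max_learning_substitutions}
    return learning, testing


def fit_additive_effects(learning, wild_type_fitness):
    """Estimate one fitness effect per substitution from single-substituted variants."""
    return {v: f - wild_type_fitness for v, f in learning.items()
            if count_substitutions(v) == 1}


def predict_additive(variant, effects, wild_type_fitness):
    """Baseline prediction: wild-type fitness plus the sum of known single effects."""
    return wild_type_fitness + sum(effects.get(s, 0.0) for s in variant.split("/"))


learning, testing = split_low_high(VARIANT_FITNESS)
effects = fit_additive_effects(learning, WILD_TYPE_FITNESS)
predictions = {v: predict_additive(v, effects, WILD_TYPE_FITNESS) for v in testing}
mean_abs_error = sum(abs(predictions[v] - testing[v]) for v in testing) / len(testing)

for variant in testing:
    print(f"{variant}: predicted {predictions[variant]:.2f}, "
          f"measured {testing[variant]:.2f}")
print(f"Mean absolute error: {mean_abs_error:.2f}")
```

In a data-driven campaign, the additive baseline would be replaced by a model trained on encoded sequences; the split itself is the part this sketch is meant to show.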

This work was carried out in the Computational Biology division and was supported by computing resources granted by JARA-HPC from RWTH Aachen University (JARA0169). It was funded by a Bundesministerium für Bildung und Forschung (BMBF) project (FKZ: 01DJ20014).

PyPEF is written in Python 3 and designed for command-line use. The source code is maintained on GitHub and is freely available under the CC BY-NC-SA 4.0 license.

To learn more, please access the full paper via the publications and patents page.

Niklas E. Siedhoff, Alexander-Maurice Illig, Ulrich Schwaneberg, and Mehdi D. Davari. PyPEF – An Integrated Framework for Data-Driven Protein Engineering. J. Chem. Inf. Model. 2021, 61 (7), 3463-3476.

  Workflow of PyPEF. Copyright: © Bio VI/JCIM

Figure: Workflow of PyPEF. All input data (grey; sequences and corresponding fitness values) are split into learning data (orange) and validation data (blue). For model training and parameter tuning, the learning data are reduced by one sample (leave-one-out cross-validation, LOOCV) or by one fold (k-fold CV; light orange).
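The way the learning data are reduced during parameter tuning can be sketched as follows. This is an illustrative helper under stated assumptions, not PyPEF's implementation: the function name `k_fold_splits` is hypothetical, folds are taken as contiguous index blocks without shuffling, and LOOCV is treated as the special case where the number of folds equals the number of samples.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, held_out_indices) pairs for k-fold cross-validation.

    Setting k equal to n_samples holds out exactly one sample per split,
    which corresponds to leave-one-out cross-validation (LOOCV).
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Distribute any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


# Hypothetical learning set of six variants (indices 0..5).
n_learning = 6
loocv_splits = list(k_fold_splits(n_learning, k=n_learning))  # reduced by one sample
three_fold_splits = list(k_fold_splits(n_learning, k=3))      # reduced by one fold

for train, held_out in three_fold_splits:
    print(f"train on {train}, evaluate on {held_out}")
```

Each candidate parameter setting would be scored on the held-out indices of every split, and the setting with the best average score would then be carried forward to the separate validation data.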