Designer Enzymes for Bioremediation
Using RFDiffusion3 to engineer haloalkane dehalogenases
Enzymes are a class of proteins that speed up chemical reactions. They are made up of amino acids: diverse chemical building blocks with various properties. The specific sequence of amino acids determines the shape of the enzyme, and that shape determines how it speeds up the reaction. Since an enzyme’s function is ultimately encoded in its sequence, a grand goal of synthetic biology is to be able to design an enzyme for any reaction.
In the past few years, several steps have been made towards realizing this goal. Many groups, including the Baker Lab and DeepMind, have built models that generate and predict protein structures. This enables pipelines that take us closer to being able to design enzymes for any reaction. Although we aren’t quite there yet, these models are rapidly improving.
In this post, I’ll walk through how I used some of these models to engineer new versions of LinB: a haloalkane dehalogenase. The reaction I focused on is the conversion of 1,2-dibromoethane (ethylene dibromide, EDB) into 2-bromoethanol and bromide. EDB is a highly toxic chemical that was used as a pesticide and gasoline additive. Since it still persists in many areas, I thought it would be an interesting chemical to design an enzyme for.
I started my design process with a motif of LinB’s active site. After this, I generated potential scaffolds for the site with RFDiffusion3. Following sequence redesign with LigandMPNN and a designability check with Boltz-2, I checked if the active site architecture was maintained. This left me with a final set of sequences that I could test in the lab for activity.
How does LinB work?
As previously mentioned, LinB converts EDB to 2-bromoethanol. The mechanism for this has been studied through many approaches, both in the lab and computationally. Experiments have shown that there are two half-reactions that LinB catalyzes.
First, the EDB is anchored within the active site. This starts an SN2 nucleophilic substitution reaction. The key residue for this is Asp108. Asp108’s oxygen performs a backside attack on EDB’s carbon, breaking its bond with the bromine atom. This forms a covalent bond between the enzyme and EDB and releases a bromide ion. Therefore, the products of this half-reaction are an alkyl-enzyme intermediate (AEI) and a bromide ion.
To reset the enzyme for more reactions, the AEI must be removed. This happens through a hydrolysis reaction. Water enters through a tunnel in the enzyme and reaches the active site. There, His272 abstracts a proton from the water molecule, converting it to OH-. His272 is able to do this because it is positioned and stabilized by Glu132; these residues, along with Asp108, form LinB’s catalytic triad. Once OH- is formed, it attacks the AEI and forms an oxyanion intermediate. This intermediate then collapses, breaking the AEI-enzyme bond and producing 2-bromoethanol. The reaction is now complete.
Designing the Active Site Motif
To get new backbones for an enzyme, it is necessary to feed in some context about the active site. This leads to a core question in enzyme design:
How do you represent an enzyme with the fewest number of atoms?
To get the most diverse enzymes, it is important to not constrain the model with too many fixed atoms. However, it is also important to include atoms that are critical for catalysis. Otherwise, the enzyme simply won’t work. To get a better intuition for selecting the right atoms, I looked to the Baker Lab’s work on RFDiffusion3. In this paper, they used the model to design a cysteine hydrolase. Although cysteine hydrolases are distinct from LinB, they share some similarities: their reaction occurs in two steps, they have a catalytic triad, and their second step is a hydrolysis reaction. Thus, I took a lot of inspiration from this paper on how to design the active site motif.
First, I had to get a structure of EDB’s first transition state within the active site. Since this is the middle-ground between the substrate and AEI, if the enzyme could fit around it, there is a high probability that it could work with the other two states as well. This is what the Baker Lab did in several of their enzyme design papers. As there are no similar structures in the PDB, I had to generate this state myself. I first generated a conformer for EDB using RDKit. This conformer had the atoms Br1, C2, C3, and Br4. Now, I had to align the molecule to the active site in a way that mimicked the transition state. I did this using a script that optimized the angles and distances of EDB’s atoms. Here are the constraints that I placed:
The Br1, C2, Asp108(OD2) angle had to be close to 180°
Br1 had to be close to 3.6 and 3.7 angstroms away from Trp109(NE1) and Asn38(ND2), respectively
The Br1, C2, C3, Br4 dihedral angle had to be close to -90°
These values were determined from a molecular dynamics study. In this paper, the researchers showed how LinB catalyzes the dehalogenation of EDB. The computational model was able to describe the relative positions of atoms in the transition state, so I used these values to create my constraints. After writing the script for this process, I ended up with a structure of EDB’s transition state in the active site.
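As a rough illustration of how constraints like these can be scored, here is a minimal Python sketch. It is not the script from the repository: the function names, the dictionary keys for the atoms, and the equal weighting of the terms are all assumptions made for this example. Only the target values (180°, 3.6 and 3.7 angstroms, -90°) come from the list above.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def angle_deg(a, b, c):
    """Angle a-b-c (vertex at b) in degrees."""
    u, v = _sub(a, b), _sub(c, b)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral p0-p1-p2-p3 in degrees, between -180 and 180."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Target values quoted in the text; the names are hypothetical labels.
TARGETS = {"attack_angle": 180.0, "d_trp": 3.6, "d_asn": 3.7, "dihedral": -90.0}

def ts_penalty(xyz, weights=None):
    """Weighted sum of squared deviations from the transition-state targets."""
    w = weights or {k: 1.0 for k in TARGETS}  # equal weights: an arbitrary choice
    measured = {
        "attack_angle": angle_deg(xyz["BR1"], xyz["C2"], xyz["ASP108_OD2"]),
        "d_trp": math.dist(xyz["BR1"], xyz["TRP109_NE1"]),
        "d_asn": math.dist(xyz["BR1"], xyz["ASN38_ND2"]),
        "dihedral": dihedral_deg(xyz["BR1"], xyz["C2"], xyz["C3"], xyz["BR4"]),
    }
    total = 0.0
    for name, target in TARGETS.items():
        dev = measured[name] - target
        if name == "dihedral":
            dev = (dev + 180.0) % 360.0 - 180.0  # wrap the periodic difference
        total += w[name] * dev ** 2
    return total
```

A placement routine could then rotate and translate the EDB conformer to minimise `ts_penalty`. Because degrees and angstroms sit on different scales, the weights would likely need tuning in practice.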
Next, I had to choose which atoms to keep in the motif. Unlike the placement of the EDB transition state, this process cannot be reliably quantified. Thus, I relied upon previous papers and my own intuition to choose which atoms to keep. I ended up with the following selections:
Full Asp108: this amino acid plays the most critical role in the enzyme
Full Trp109: the side chain of this amino acid stabilizes the leaving bromide, while the backbone amide stabilizes the oxyanion in the second reaction
Full Asn38: similar to Trp109, this residue also stabilizes the bromide and oxyanion intermediate
His272 imidazole ring: the side chain of this residue is important for abstracting a proton from water, but the backbone position can be changed
Glu132 side chain: the carboxyl group helps position the imidazole ring of His272, but the backbone can be altered
After making my selection, I ended up with 45 atoms that I could pass into RFDiffusion3.
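To make the selection concrete, here is a minimal sketch of pulling a motif out of PDB-format lines. The residue choices follow the list above, but the specific atom names used for the His272 ring and the Glu132 side chain are my assumption based on standard PDB naming, so this sketch is not guaranteed to reproduce the exact 45-atom motif (the ligand atoms are also not handled here). `MOTIF_SPEC` and `select_motif_atoms` are hypothetical names.

```python
# (residue name, residue number) -> atom names to keep; None keeps the whole residue.
# The partial selections below are assumptions, not the author's exact atom lists.
MOTIF_SPEC = {
    ("ASP", 108): None,
    ("TRP", 109): None,
    ("ASN", 38): None,
    ("HIS", 272): {"CG", "ND1", "CD2", "CE1", "NE2"},  # imidazole ring
    ("GLU", 132): {"CB", "CG", "CD", "OE1", "OE2"},    # side chain
}

def select_motif_atoms(pdb_lines, spec=MOTIF_SPEC):
    """Return the ATOM/HETATM lines that belong to the motif."""
    kept = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Fixed-width PDB columns: atom name 13-16, residue name 18-20, number 23-26.
        atom_name = line[12:16].strip()
        key = (line[17:20].strip(), int(line[22:26]))
        if key not in spec:
            continue
        wanted = spec[key]
        if wanted is None or atom_name in wanted:
            kept.append(line)
    return kept
```

Keeping the specification as data makes it easy to try a looser or tighter motif without touching the parsing logic.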
You might be wondering why designing around the same active site would lead to any improvements in the enzymes. Well, the atoms that are generated around the active site can still influence it. They can decrease or increase its activity in many ways. While we can’t reliably predict which designs will have better activity, we can still test them in the lab until better models are developed.
After understanding this, you might have a different question: why not just make random mutations to see how activity is changed? This is the premise of directed evolution, where mutations are combined to get better variants. There are two reasons why ML-centric enzyme design is worth developing further. The first is that natural enzymes are trapped in a specific evolutionary trajectory. Enzymes don’t evolve to optimize their chemistry; they evolve to be “just good enough”. This means that the sequence they currently have could be nowhere close to their optimal architecture. Directed evolution is unlikely to find these structures, as it does not move far from the natural sequence. Secondly, these models could eventually be able to design enzymes for reactions that don’t exist in nature. This would be a major breakthrough impacting many fields, so improving these models is very important.
Generating Samples with RFDiffusion3
There are generally two methods for designing new enzymes: generating a sequence or generating a structure. Protein language models, which I worked with previously, do the former, while RFDiffusion3 does the latter. While protein language models are interesting, I currently think that models which incorporate protein geometry are better for enzyme design. This is for three reasons:
Structure = function: while sequence can be helpful for understanding what a protein does, its structure is what actually makes these hypotheses concrete
Non-protein elements: protein language models are primarily trained on amino acid sequences, so exposure to ligands and other chemicals is limited
Evolutionary history: for enzymes to catalyze reactions not found in nature, they will most likely need to be disconnected from any evolutionary pathway. Since sequences from protein language models are representative of natural proteins, this is not helpful for de novo enzyme design
Now that we know why RFDiffusion3 is better for this task, let’s take a look at its capabilities. This model is able to build proteins around the motifs you provide it. It has been trained on structures from the PDB, and thus, understands which scaffolds are stable and which ones should be avoided. Furthermore, what separates RFDiffusion3 from its predecessors is that it is an all-atom model. It sees amino acids as their constituent atoms, rather than abstract objects. This allows it to take specific atoms (like I did with His272 and Glu132) and understand ligands such as EDB.

Using RFDiffusion3, I generated 8096 potential backbones. At first, I received many errors. This was expected because: a) EDB is not a part of the chemical component dictionary (CCD), and b) I am modelling a transition state. To get past this, I had to modify some of RFDiffusion3’s configurations. The main one was disabling the generation of a reference conformer for my ligand. Normally, this is done to make sure that the ligand structure is chemically viable with the protein. However, since we are enforcing the active site geometry, this step isn’t necessary. Thus, I removed this feature and generated my samples.
Redesign with LigandMPNN and Validation
Along with the backbone, RFDiffusion3 also generates sequences for each sample. However, these sequences are rough, as the model’s main objective is to diffuse a backbone. This is why we need to use LigandMPNN: a model which is able to design optimized sequences for a given structure. It does this by representing each residue as a node and their interactions as edges. The model then passes information between neighbouring residues, so that the prediction for each residue takes its local environment into account. Finally, the model autoregressively predicts the sequence for the backbone. This sequence should fold into the same backbone as the input.
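To give a feel for the node-and-edge idea, here is a toy Python sketch. It is not LigandMPNN’s actual architecture: the nearest-neighbour graph on alpha-carbon coordinates and the simple averaging update are stand-ins I chose for illustration, and `knn_graph` and `message_pass` are hypothetical names.

```python
import math

def knn_graph(ca_coords, k):
    """For each residue, list the indices of its k nearest other residues."""
    neighbours = []
    for i, ci in enumerate(ca_coords):
        dists = sorted((math.dist(ci, cj), j)
                       for j, cj in enumerate(ca_coords) if j != i)
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

def message_pass(features, neighbours):
    """One toy update: each residue's features become the mean over
    itself and its neighbours (a real model uses learned functions)."""
    updated = []
    for i, nbrs in enumerate(neighbours):
        group = [features[i]] + [features[j] for j in nbrs]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated

coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0)]
graph = knn_graph(coords, k=2)
feats = message_pass([[0.0], [3.0], [6.0], [9.0]], graph)
```

After one round, each residue’s features already reflect its neighbours; stacking several rounds lets information travel further across the structure.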
Using this model, I generated a new sequence for each of my backbones. To confirm that the sequences generated by LigandMPNN reconstructed the backbone, they were folded using Boltz-2. This enabled a comparison between the RFDiffusion3 backbone and the refolded backbone, ensuring structural fidelity is maintained. This was done by aligning the two structures and calculating the RMSD between backbone atoms. A design was considered successful if its RMSD was below 2 angstroms. Since structure prediction is computationally expensive, I decided to stop folding at 3614 samples. After filtering, 1554 structures (43%) moved on to the next stage.
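For reference, here is a minimal sketch of an alignment-then-RMSD check using the Kabsch algorithm with NumPy. The Kabsch algorithm is a standard choice that I am assuming here; the repository may align structures differently. It also assumes both inputs list the same backbone atoms in the same order. Only the 2 angstrom cutoff comes from the text.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centring both sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip one axis if needed so the result is a rotation, not a reflection.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P @ R.T - Q
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def passes_refold_filter(design_xyz, refolded_xyz, cutoff=2.0):
    """True if the refolded backbone is within the cutoff (in angstroms)."""
    return kabsch_rmsd(design_xyz, refolded_xyz) < cutoff
```

Centring removes any translation and the SVD step removes any rotation, so the remaining RMSD reflects only genuine differences in shape.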

Final Filtering
To check if the new sequences still maintained the active site motif, I ran a modified version of the script from the active site design stage. This script essentially checks a structure to see if it can support the first transition state. It then repeats this for each structure in my dataset to see which ones maintain the overall structure. For a sample to pass through this filter, it had to fit several constraints:
The Br1, C2, Asp108(OD2) angle had to be at most 5° away from the target (180°)
All distances, which were described previously in the motif design protocol, had to be at most 0.2 angstroms away from their targets
Of the 1554 structures that passed the refolding filter, 364 maintained active site viability. As seen in the graph below, this represents a 23.4% success rate. When this filter is limited to sequences which had an RMSD of less than 1 angstrom, the success rate rises to 37.3%. Given the strict nature of the constraints, this passing rate is quite good.
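The thresholds and pass rates above can be written out as a short Python sketch. The function names and dictionary keys are hypothetical, and the angle and distances are assumed to have been measured already; the tolerances, targets, and counts are the ones reported in this post.

```python
# Distance targets from the motif design stage, in angstroms.
DISTANCE_TARGETS = {"Br1-Trp109(NE1)": 3.6, "Br1-Asn38(ND2)": 3.7}

def passes_active_site_filter(angle, distances, angle_target=180.0,
                              angle_tol=5.0, dist_tol=0.2):
    """Apply the angle and distance tolerances to one refolded design."""
    if abs(angle - angle_target) > angle_tol:
        return False
    return all(abs(distances[name] - target) <= dist_tol
               for name, target in DISTANCE_TARGETS.items())

def pass_rate(passed, total):
    """Percentage of designs that passed, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)

print(pass_rate(1554, 3614))  # refolding filter: 43.0
print(pass_rate(364, 1554))   # active site filter: 23.4
```

Taken over all 3614 folded samples, the two filters together leave about 10% of designs (364 of 3614).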
Limitations
If you’ve read the Baker Lab’s paper, you will have seen that I’ve skipped a large portion of the filtering process. This includes a redesign-refolding loop: generating several sequences using LigandMPNN, folding with AlphaFold3, conditioning LigandMPNN on those structures, and repeating. Each step of the loop has some filtering involved.

The reason I’ve avoided doing this is quite obvious: this takes a lot of compute. I think this is one of the major problems in enzyme design right now, as it limits the accessibility of these powerful models. Thus, exploring better filtering objectives and mechanistic interpretability for these models is essential for maximizing their benefit.
Finally, note that these structures mean nothing without actually testing them in the lab. Even with their strict filtering, the Baker Lab had an ~18% success rate on their enzymes. This is really impressive compared to previous studies, but it can still be improved. As mentioned, this could be a product of our objectives and training data. These models often treat proteins as static structures, but real proteins are dynamic. Furthermore, most structures in the PDB come from crystals, which capture a single conformation at a limited resolution. This could be causing us to miss important aspects of the active site in our models.
Discussion
I hope you’ve gained an appreciation for the importance of enzyme design through this report! As a summary of this project, I’ve designed new versions of LinB: a protein which turns 1,2-dibromoethane (EDB) to 2-bromoethanol. This reaction is important as EDB is a pollutant and carcinogen that was previously used in agriculture. Due to its persistence, it is critical that remediation efforts target its removal from the environment.
I started out by modelling the active site of the enzyme. This was done by generating an EDB transition state conformer at the active site. The conformer fit several constraints taken from a published molecular dynamics study. From here, I selected key atoms to include as part of my input motif. This resulted in a set of 45 atoms that were passed into RFDiffusion3.
RFDiffusion3 is a model that enables the generation of diverse backbones which scaffold a motif. It does this through iterative denoising of the atoms, keeping the active site fixed during the process. After generating 8000+ samples using this model, redesign was performed using LigandMPNN. This enabled optimization of the sequence for each backbone. To ensure these sequences remained viable, refolding was done with Boltz-2. Finally, the refolded structures that remained were checked against the EDB transition state constraints to ensure the active site remained intact.
The final result was 364 potential LinB redesigns. Further filtering and in vitro testing would have to be done to ensure these designs work properly, but this result is a positive indication that enzyme design is becoming more scalable. Future directions include better objectives for these models, new filters that bypass expensive folding loops, and greater consideration of protein dynamics.
Code Availability
Check https://github.com/divyanbavan/LinB-Redesign for the scripts/pipeline
Final Notes
Thanks to 1517 and Lambda Labs for providing the compute credits I needed for this project. Thanks as well to the Baker Lab for their detailed supplementary documentation.
Finally, thank you for reading! If you would like to chat more about this topic, feel free to email me: divyanb at proton dot me.






