Using Bioinformatics Tools to Screen, Analyse and Shortlist Prospective Candidates
Enzymes play an important role in our daily lives and are used in a variety of industries and sectors such as food, detergents and medicine. The demand for certain enzymes, such as lipases, proteases, hydrolases and polymerases, has increased exponentially. Research laboratories and industries are working extensively to find newer and better candidates, and major enzyme manufacturers regularly introduce new enzymes to the market. In the past two decades, several patents on enzymes have been filed and issued. Apart from this, there are ongoing efforts to replace chemical reaction processes in industry with enzymatic processes, as these are greener, more environmentally friendly alternatives.
It is widely accepted that cleaner chemical synthesis processes should be adopted to prevent pollution and avoid the generation of toxic wastes. Enzymatic synthesis of chemical compounds has emerged as a simple, better and competitive route in comparison to chemical methods. In addition, high substrate specificity and better conversion rates, with little or no by-product formation, make enzymes a robust and efficient choice. Recently, Merck and Codexis developed a greener process for the synthesis of Sitagliptin, a drug used in diabetes treatment.
In recent years, advances in recombinant DNA technology have resulted in successful approaches to overexpress an enzyme in a variety of host cells, which helps in producing the biocatalyst in large amounts. To obtain an efficient enzyme candidate, stringent selection criteria are required to achieve high activity, specificity and stability. In an industrial process, the substrate, solvent and reaction conditions are important, and the chosen enzyme should be able to withstand these components and conditions. It is difficult to find a natural enzyme with all the properties desired in an industrial process. To fulfil the massive enzyme demand, various approaches are used to constantly explore different resources for new and better enzymes. Among these, in-silico bioprospecting has emerged as an efficient, cost- and time-effective approach to discover new enzyme candidates. Although this approach has been practiced at various laboratories, it has not been reviewed or discussed.
New enzyme discovery can be accomplished using various conventional and contemporary methods as mentioned in Fig. Common screening methods identify novel enzymes by exploring natural sources such as industrial waste or soil, but they require an established protocol for a screening assay or selection method based on the desired properties of the enzyme. This process involves biochemical screening and isolation of the organism on selective media, which is usually time- and resource-consuming and may or may not result in a novel candidate. After these screening assays, the selected organism needs to be identified, followed by identification of the gene sequence that codes for the desired enzyme and function.
One approach is to perform random mutagenesis to create enzyme mutants, and then sequence the DNA region. Another is to perform targeted or whole genome sequencing to identify the desired enzyme gene sequence. As an alternative, the target gene can be amplified using degenerate primers, although primer design involves challenges that affect the success rate. This process is followed by PCR library cloning and screening for prospective candidates with the desired properties, which again demands a well-established protocol for identifying positive candidates. After the desired clone has been selected, the responsible gene can be sequenced, cloned and expressed.
The direct screening and identification methods are preferred where molecular biology resources are inadequate. These experimental approaches are commonly used, but they are time- and resource-consuming and have a low success rate. In contrast, in-silico bioprospecting is a simple, straightforward and promising approach to identify novel enzyme candidates with better enzymatic properties. A compilation of recent reports, where the in-silico bioprospecting approach has been used to find novel enzymes, is given in Table. The current fast-paced, high-throughput whole genome/metagenome sequencing has tremendously expanded biological databases and thus the known enzyme diversity. This diversity has in turn increased the complexity and difficulty of finding a novel candidate. The in-silico bioprospecting process can be broadly divided into two steps: (i) searching databases and (ii) using bioinformatics tools to screen, analyse and shortlist prospective candidates.
Hypothetical Proteins of wheat’ in NCBI database followed by manual screening to get unique protein candidates. After removing redundant entries, unique candidates were further subjected to physicochemical, localization, function and domain analysis. In another database search, keywords such as hydroxybutyrate, hydroxyalkanoate, hydroxyalkanoic, PHA and PHB were used as input. Another common approach practiced by researchers is to search biological databases using a known candidate enzyme sequence. While choosing a potential enzyme gene sequence, it is of utmost importance to select a full-length protein sequence with conserved domains, as many incomplete sequences annotated in databases do not code for a functional protein when checked experimentally.
Also, in the search result, the selected candidate’s sequence similarity to the known sequence should not be very high. This ensures that a novel candidate is shortlisted and not a close homologue of a known sequence. In the similarity search result, hits with >90% identity are very closely related to the query; they typically come from sources such as different species of the same family and are likely to be very similar. Hits with ~80% identity or lower, in contrast, are candidates that differ from the query and are not closely related to it, but still share conserved sequences with known candidates. Selecting from this range ensures that the chosen candidates are novel: they are predicted to retain the enzyme activity but differ from the search query. There have been reports where researchers selected candidates with sequence similarity as low as 40%. Sharma et al.
searched for novel sources of nitrilases in microbial genomes by adopting a homology-based approach and selected sequences that exhibited >30% and <80% identity. The shortlisted search results need to be confirmed as complete coding sequences. For example, shortlisted nitrilase candidates were checked with the GeneMarkS tool to verify that the coding sequences were complete. Because the length of the input protein is known, the search results should be restricted to sequences of similar length. In the case of nitrilases, sequences with fewer than 100 amino acids were considered false positives and were discarded. In another instance, sequences shorter than 250 amino acids were excluded in a search for novel BVMO (Baeyer-Villiger monooxygenase) enzymes. For PHA synthase, sequences with ~120 to 260 bp were considered as prospective candidates in a database search. These search filters, along with others such as e-value, can aid in gathering positive sequences that could code for a functional enzyme of appropriate length, and reduce the chance of false discovery or of random or irrelevant search results.
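The filters described above (an identity window, a minimum length and an e-value cut-off) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in any of the cited studies: the `Hit` record, its field names, the accession labels and the example values are all hypothetical, and the e-value threshold of 1e-5 is an assumed default. The identity window of 30 to 80% and the 100-residue minimum mirror the nitrilase example in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One similarity-search hit (hypothetical record, not a BLAST schema)."""
    accession: str
    percent_identity: float  # identity to the query sequence, in percent
    length: int              # length of the hit protein, in amino acids
    evalue: float            # expectation value reported by the search


def shortlist(hits, min_identity=30.0, max_identity=80.0,
              min_length=100, max_evalue=1e-5):
    """Keep hits that are related to the query but not near-copies of it.

    max_evalue is an assumed threshold; the identity window and the
    minimum length follow the nitrilase example in the text.
    """
    kept = []
    for hit in hits:
        if not (min_identity < hit.percent_identity < max_identity):
            continue  # either too divergent or too close to the known enzyme
        if hit.length < min_length:
            continue  # probably a fragment or a false positive
        if hit.evalue > max_evalue:
            continue  # the match could have arisen by chance
        kept.append(hit)
    return kept


# Made-up hits, for illustration only.
hits = [
    Hit("cand_A", 95.2, 340, 1e-150),  # near-identical to the query
    Hit("cand_B", 62.0, 335, 1e-90),   # related, but distinct from the query
    Hit("cand_C", 45.5, 80, 1e-12),    # too short
    Hit("cand_D", 28.0, 310, 1e-3),    # below the identity window
]
print([h.accession for h in shortlist(hits)])  # prints ['cand_B']
```

In practice the thresholds would be tuned per enzyme family, as the different length cut-offs reported for nitrilases and BVMOs suggest.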
Once the primary list has been generated using various database search approaches, the next step is to analyse the physicochemical, phylogenetic and functional properties of the candidates using different bioinformatics tools. The ProtParam tool on the ExPASy server is widely used to assess the physicochemical properties of putative candidates, such as the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, and grand average of hydropathicity (GRAVY).
The predicted values of all parameters for the putative enzyme(s) are compared with those of well-characterized enzymes, and the degree of agreement affects the confidence with which the putative enzyme(s) can be taken forward for experimental study. For example, ProtParam predicted physicochemical properties for 138 putative nitrilases that fell within the range of well-characterized nitrilases. All of these parameters are derived from the protein sequence, i.e. the analysis is sequence-dependent; therefore, a complete or nearly complete sequence is needed for accurate analysis and prediction of the various physicochemical properties.
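Two of the sequence-derived parameters listed above are simple enough to compute directly, which makes the comparison against characterized enzymes concrete. The sketch below is not ProtParam itself. It assumes the standard Kyte-Doolittle hydropathy scale for GRAVY and the usual aliphatic index formula (mole percent of Ala, plus 2.9 times that of Val, plus 3.9 times that of Ile and Leu). The sequences are invented toy strings, and the helper `within_reference_range` is a hypothetical name for the range check described in the text.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _clean(seq):
    """Normalise a protein sequence and reject non-standard residues."""
    seq = seq.strip().upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("expected a non-empty sequence of standard residues")
    return seq


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = _clean(seq)
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    seq = _clean(seq)

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def within_reference_range(value, reference_values):
    """True if value lies between the lowest and highest reference value."""
    return min(reference_values) <= value <= max(reference_values)


# Toy sequences, invented for illustration; real input would be full-length
# sequences of characterized enzymes and of the putative candidate.
reference = ["MKVLAAGIVALLLA", "MSDEKRTNQGHPLA"]
candidate = "MAVLKDEGIALSTA"

reference_gravy = [gravy(s) for s in reference]
print(within_reference_range(gravy(candidate), reference_gravy))
```

Because both values are averages over the whole sequence, a truncated sequence can shift them noticeably, which is why the text asks for complete or nearly complete sequences.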
Phylogenetic analysis can be performed using tools like Molecular Evolutionary Genetics Analysis (MEGA). For example, phylogenetic analysis of selected putative candidates belonging to the CalB family grouped the putative lipases into different clusters of known lipases according to their evolutionary closeness, thus helping to decide on novel and unique candidates. Structural modelling of putative candidates can be performed using the SWISS-MODEL server or MODELLER v9.15 software. Vaquoro et al. used CalB as a template to model PlicB, which exhibits 30% sequence identity and 44% similarity. Predicting structure and residue conservation is possible only if structural data for protein homologues are available as crystal structures.
Hence, persistent exploration and enrichment of databases are necessary for in-silico bioprospecting of novel enzymes. Other tools can predict structural features such as signal peptides (e.g. SignalP) or disulphide linkages (e.g. DiANNA). The DiANNA 1.1 web server predicted two disulphide bonds in PlicB, whereas CalB and Uml2 lack disulphide bonds. Protein functional domains and families are studied by comparing the list of putative enzyme(s) against databases like Pfam, CATH, SVM-Prot, CDART and SMART. In one study, hypothetical proteins (HPs) were explored using tools based on domain architecture and profiles. Out of 124 HPs, sequences were annotated with high confidence by using Pfam, CATH, SVM-Prot, CDART, SMART and ProtoNet, and among them, were predicted as enzymes. A functional protein network, which can be generated with the STRING database, provides information about the association of hypothetical/putative protein(s) with known functional proteins. In the study conducted by Gupta et al., it was found that the predicted HPs such as HAV22 (Q7XAP6) and F-box protein (D0QEJ9) were interacting
Authors: Asmita Kamble, Sumana Srinivasan, Harinder Singh