Publications by year
In Press
Yang ZR (In Press). Machine Learning Approaches to Bioinformatics. New Jersey, World Scientific.
2019
Ireland PM, Bullifent HL, Senior NJ, Southern SJ, Yang ZR, Ireland RE, Nelson M, Atkins HS, Titball RW, Scott AE, et al (2019). Global Analysis of Genes Essential for Francisella tularensis Schu S4 Growth in Vitro and for Fitness during Competitive Infection of Fischer 344 Rats.
J Bacteriol,
201(7).
Abstract:
The highly virulent intracellular pathogen Francisella tularensis is a Gram-negative bacterium that has a wide host range, including humans, and is the causative agent of tularemia. To identify new therapeutic drug targets and vaccine candidates and investigate the genetic basis of Francisella virulence in the Fischer 344 rat, we have constructed an F. tularensis Schu S4 transposon library. This library consists of more than 300,000 unique transposon mutants and represents a transposon insertion for every 6 bp of the genome. A transposon-directed insertion site sequencing (TraDIS) approach was used to identify 453 genes essential for growth in vitro. Many of these essential genes were mapped to key metabolic pathways, including glycolysis/gluconeogenesis, peptidoglycan synthesis, fatty acid biosynthesis, and the tricarboxylic acid (TCA) cycle. Additionally, 163 genes were identified as required for fitness during colonization of the Fischer 344 rat spleen. This in vivo selection screen was validated through the generation of marked deletion mutants that were individually assessed within a competitive index study against the wild-type F. tularensis Schu S4 strain. IMPORTANCE: The intracellular bacterial pathogen Francisella tularensis causes a disease in humans characterized by the rapid onset of nonspecific symptoms such as swollen lymph glands, fever, and headaches. F. tularensis is one of the most infectious bacteria known and following pulmonary exposure can have a mortality rate exceeding 50% if left untreated. The low infectious dose of this organism and concerns surrounding its potential as a biological weapon have heightened the need for effective and safe therapies. To expand the repertoire of targets for therapeutic development, we initiated a genome-wide analysis. This study has identified genes that are important for F. tularensis under in vitro and in vivo conditions, providing candidates that can be evaluated for vaccine or antibacterial development.
2017
Yang ZR, Bullifent HL, Moore K, Paszkiewicz K, Saint RJ, Southern SJ, Champion OL, Senior NJ, Sarkar-Tyson M, Oyston PCF, et al (2017). A Noise Trimming and Positional Significance of Transposon Insertion System to Identify Essential Genes in Yersinia pestis.
Sci Rep,
7.
Abstract:
Massively parallel sequencing technology coupled with saturation mutagenesis has provided new and global insights into gene functions and roles. At a simplistic level, the frequency of mutations within genes can indicate the degree of essentiality. However, this approach neglects to take account of the positional significance of mutations - the function of a gene is less likely to be disrupted by a mutation close to the distal ends. Therefore, a systematic bioinformatics approach to improve the reliability of essential gene identification is desirable. We report here a parametric model which introduces a novel mutation feature together with a noise trimming approach to predict the biological significance of Tn5 mutations. We show improved performance of essential gene prediction in the bacterium Yersinia pestis, the causative agent of plague. This method would have broad applicability to other organisms and to the identification of genes which are essential for competitiveness or survival under a broad range of stresses.
Senior NJ, Sasidharan K, Saint RJ, Scott AE, Sarkar-Tyson M, Ireland PM, Bullifent HL, Rong Yang Z, Moore K, Oyston PCF, et al (2017). An integrated computational-experimental approach reveals Yersinia pestis genes essential across a narrow or a broad range of environmental conditions.
BMC Microbiology,
17(1).
Abstract:
© 2017 The Author(s). Background: The World Health Organization has categorized plague as a re-emerging disease and the potential for Yersinia pestis to also be used as a bioweapon makes the identification of new drug targets against this pathogen a priority. Environmental temperature is a key signal which regulates virulence of the bacterium. The bacterium normally grows outside the human host at 28 °C. Therefore, understanding the mechanisms that the bacterium used to adapt to a mammalian host at 37 °C is central to the development of vaccines or drugs for the prevention or treatment of human disease. Results: Using a library of over 1 million Y. pestis CO92 random mutants and transposon-directed insertion site sequencing, we identified 530 essential genes when the bacteria were cultured at 28 °C. When the library of mutants was subsequently cultured at 37 °C we identified 19 genes that were essential at 37 °C but not at 28 °C, including genes which encode proteins that play a role in enabling functioning of the type III secretion and in DNA replication and maintenance. Using genome-scale metabolic network reconstruction we showed that growth conditions profoundly influence the physiology of the bacterium, and by combining computational and experimental approaches we were able to identify 54 genes that are essential under a broad range of conditions. Conclusions: Using an integrated computational-experimental approach we identify genes which are required for growth at 37 °C and under a broad range of environments may be the best targets for the development of new interventions to prevent or treat plague in humans.
2016
Wappett M, Dulak A, Yang ZR, Al-Watban A, Bradford JR, Dry JR (2016). Multi-omic measurement of mutually exclusive loss-of-function enriches for candidate synthetic lethal gene pairs.
BMC Genomics,
17.
Abstract:
BACKGROUND: Identification of synthetic lethal interactions in cancer cells could offer promising new therapeutic targets. Large-scale functional genomic screening presents an opportunity to test large numbers of cancer synthetic lethal hypotheses. Methods enriching for candidate synthetic lethal targets in molecularly defined cancer cell lines can steer effective design of screening efforts. Loss of one partner of a synthetic lethal gene pair creates a dependency on the other, thus synthetic lethal gene pairs should never show simultaneous loss-of-function. We have developed a computational approach to mine large multi-omic cancer data sets and identify gene pairs with mutually exclusive loss-of-function. Since loss-of-function may not always be genetic, we look for deleterious mutations, gene deletion and/or loss of mRNA expression by bimodality defined with a novel algorithm BiSEp. RESULTS: Applying this toolkit to both tumour cell line and patient data, we achieve statistically significant enrichment for experimentally validated tumour suppressor genes and synthetic lethal gene pairings. Notably non-reliance on genetic loss reveals a number of known synthetic lethal relationships otherwise missed, resulting in marked improvement over genetic-only predictions. We go on to establish biological rationale surrounding a number of novel candidate synthetic lethal gene pairs with demonstrated dependencies in published cancer cell line shRNA screens. CONCLUSIONS: This work introduces a multi-omic approach to define gene loss-of-function, and enrich for candidate synthetic lethal gene pairs in cell lines testable through functional screens. In doing so, we offer an additional resource to generate new cancer drug target and combination hypotheses. Algorithms discussed are freely available in the BiSEp CRAN package at http://cran.r-project.org/web/packages/BiSEp/index.html.
de Torres Zabala M, Zhai B, Jayaraman S, Eleftheriadou G, Winsbury R, Yang R, Truman W, Tang S, Smirnoff N, Grant M, et al (2016). Novel JAZ co-operativity and unexpected JA dynamics underpin Arabidopsis defence responses to Pseudomonas syringae infection.
New Phytologist,
209(3), 1120-1134.
Abstract:
© 2016 New Phytologist Trust. Pathogens target phytohormone signalling pathways to promote disease. Plants deploy salicylic acid (SA)-mediated defences against biotrophs. Pathogens antagonize SA immunity by activating jasmonate signalling, for example Pseudomonas syringae pv. tomato DC3000 produces coronatine (COR), a jasmonic acid (JA) mimic. This study found unexpected dynamics between SA, JA and COR and co-operation between JAZ jasmonate repressor proteins during DC3000 infection. We used a systems-based approach involving targeted hormone profiling, high-temporal-resolution micro-array analysis, reverse genetics and mRNA-seq. Unexpectedly, foliar JA did not accumulate until late in the infection process and was higher in leaves challenged with COR-deficient P. syringae or in the more resistant JA receptor mutant coi1. JAZ regulation was complex and COR alone was insufficient to sustainably induce JAZs. JAZs contribute to early basal and subsequent secondary plant defence responses. We showed that JAZ5 and JAZ10 specifically co-operate to restrict COR cytotoxicity and pathogen growth through a complex transcriptional reprogramming that does not involve the basic helix-loop-helix transcription factors MYC2 and related MYC3 and MYC4 previously shown to restrict pathogen growth. mRNA-seq predicts compromised SA signalling in a jaz5/10 mutant and rapid suppression of JA-related components on bacterial infection.
2015
Wu Y, Shi B, Ding X, Liu T, Hu X, Yip KY, Yang ZR, Mathews DH, Lu ZJ (2015). Improved prediction of RNA secondary structure by integrating the free energy model with restraints derived from experimental probing data.
Nucleic Acids Res,
43(15), 7247-7259.
Abstract:
Recently, several experimental techniques have emerged for probing RNA structures based on high-throughput sequencing. However, most secondary structure prediction tools that incorporate probing data are designed and optimized for particular types of experiments. For example, RNAstructure-Fold is optimized for SHAPE data, while SeqFold is optimized for PARS data. Here, we report a new RNA secondary structure prediction method, restrained MaxExpect (RME), which can incorporate multiple types of experimental probing data and is based on a free energy model and an MEA (maximizing expected accuracy) algorithm. We first demonstrated that RME substantially improved secondary structure prediction with perfect restraints (base pair information of known structures). Next, we collected structure-probing data from diverse experiments (e.g. SHAPE, PARS and DMS-seq) and transformed them into a unified set of pairing probabilities with a posterior probabilistic model. By using the probability scores as restraints in RME, we compared its secondary structure prediction performance with two other well-known tools, RNAstructure-Fold (based on a free energy minimization algorithm) and SeqFold (based on a sampling algorithm). For SHAPE data, RME and RNAstructure-Fold performed better than SeqFold, because they markedly altered the energy model with the experimental restraints. For high-throughput data (e.g. PARS and DMS-seq) with lower probing efficiency, the secondary structure prediction performances of the tested tools were comparable, with performance improvements for only a portion of the tested RNAs. However, when the effects of tertiary structure and protein interactions were removed, RME showed the highest prediction accuracy in the DMS-accessible regions by incorporating in vivo DMS-seq data.
2014
Mohammed S, Akman OE, Yang ZR (2014). A consensus approach to predict regulatory interactions.
Yang Z, Alwatban A, Yang ZR (2014). A mean pattern model for integrative study - Integrative self-organizing map.
Yang ZR, Yang Z (2014). Artificial Neural Networks. In:
Comprehensive Biomedical Physics, 1-17.
Yang Z, Yang ZR (2014). Detection of non-structural outliers for microarray experiments.
Dry JR, Wappett M, Yang R (2014). Integrating pan-molecular data sets by bimodality to nominate synthetic lethal gene pairs and biomarkers of drug response.
Yang ZH, Alwatban A, Everson R, Yang ZR (2014). Multi-scale Gaussian mixtures for cross-species study.
King A, Yang Z, Yang ZR (2014). Multivariate multi-scale Gaussian for microarray unsupervised classification.
Knocker L, Yang ZR (2014). SLC- and NDUF-genes expression dynamics in pre-implantation embryonic development between bovine and mouse - a bioinformatics study.
2013
Yang Z, Yang Z (2013). Prediction of heterogeneous differential genes by detecting outliers to a Gaussian tight cluster.
BMC Bioinformatics,
14(1).
Abstract:
Background: Heterogeneously and differentially expressed genes (hDEG) are a common phenomenon due to biological diversity. A hDEG is often observed in gene expression experiments (with two experimental conditions) where it is highly expressed in a few experimental samples, or in drug trial experiments for cancer studies with drug resistance heterogeneity among the disease group. These highly expressed samples are called outliers. Accurate detection of outliers among hDEGs is then desirable for disease diagnosis and effective drug design. The standard approach for detecting hDEGs is to choose the appropriate subset of outliers to represent the experimental group. However, existing methods typically overlook hDEGs with very few outliers. Results: We present in this paper a simple algorithm for detecting hDEGs by sequentially testing for potential outliers with respect to a tight cluster of non-outliers, among an ordered subset of the experimental samples. This avoids making any restrictive assumptions about how the outliers are distributed. We use simulated and real data to illustrate that the proposed algorithm achieves a good separation between the tight cluster of low expressions and the outliers for hDEGs. Conclusions: The proposed algorithm assesses each potential outlier in relation to the cluster of potential outliers without making explicit assumptions about the outlier distribution. Simulated examples and breast cancer data sets are used to illustrate the suitability of the proposed algorithm for identifying hDEGs with small numbers of outliers. © 2013 Yang and Yang; licensee BioMed Central Ltd.
2012
Lau SK, Winlove P, Moger J, Champion OL, Titball RW, Yang ZH, Yang ZR (2012). A Bayesian Whittaker-Henderson smoother for general-purpose and sample-based spectral baseline estimation and peak extraction.
Journal of Raman Spectroscopy,
43(9), 1299-1305.
Lau SK, Champion OL, Titball RW, Yang ZR, Winlove P, Moger J, Yang ZH (2012). A Bayesian Whittaker-Henderson smoother for general-purpose and sample-based spectral baseline estimation and peak extraction.
Journal of Raman Spectroscopy.
Abstract:
Raman spectroscopy is a well-established technique that allows both chemical and structural analysis of materials. Raman spectra are often complex and extracting meaningful information is easily hindered by spectral interferences; one of the most significant sources being variations in background. Raman spectra have diverse sources of background making it hard to eliminate them or theoretically to predict the form of the baseline, which frequently varies between samples. Although many different methods for baseline removal have been proposed, most require some form of user input. User input is also subjective and consequently less reproducible than automated methods and variations in baseline subtraction can distort peak heights leading to erroneous results. We present a Bayesian Whittaker-Henderson smoother for spectral baseline estimation and peak extraction. It is a generalisation of the Whittaker-Henderson smoother, a regularised regression algorithm. We introduce hierarchical priors for model parameters of the smoother and propose a global aligner for consistent peak extraction across multiple spectra. We show that this novel smoother significantly outperforms several existing smoothers. © 2012 John Wiley & Sons, Ltd.
Yang Z, Yang Z, Eftestøl T, Steen PA, Lu W, Harrison RG (2012). A Mixture model classifier and its application on the biomedical time series.
Applied Artificial Intelligence,
26(6), 588-597.
Abstract:
This article presents a methodology based on the mixture model to classify the real biomedical time series. The mixture model is shown to be an efficient probabilistic density estimation scheme aimed at approximating the posterior probability distribution of a certain class of data. The approximation is conducted by employing a weighted mixture of a finite number of Gaussian kernels whose parameters and mixing coefficients are estimated iteratively through a maximum likelihood method. A database of the real electrocardiogram (ECG) time series of out-of-hospital cardiac arrest patients suffering ventricular fibrillation (VF) with known defibrillation outcomes was adopted to evaluate the performance of this model and confirm its efficiency compared with other classification methods. Copyright © 2012 Taylor and Francis Group, LLC.
Al-Watban A, Yang ZH, Everson R, Yang ZR (2012). A novel data mining approach for differential genes identification in small cancer expression data.
2012 7th International Symposium on Health Informatics and Bioinformatics, HIBIT 2012, 1-6.
Abstract:
The simple t test is the standard approach for differential gene identification but is not suited to data with low replication. Here, we propose using a multi-scale Gaussian (MSG) to improve the detection accuracy of differential cancerous genes in low replicate microarray experiment. By modelling the gene expression densities as Gaussian scale mixtures, the differential genes are then identified using the estimated density function. We use simulated data and data from GEO to demonstrate that the new algorithm compares favourably to four benchmark algorithms for cancer gene expression data with low replicate. © 2012 IEEE.
Perera V, de Torres Zabala M, Florance H, Smirnoff N, Grant M, Yang ZR (2012). Aligning extracted LC-MS peak lists via density maximization.
Metabolomics,
8, 175-185.
Abstract:
Rapid improvements in mass spectrometry sensitivity and mass accuracy combined with improved liquid chromatography separation technologies allow acquisition of high throughput metabolomics data, providing an excellent opportunity to understand biological processes. While spectral deconvolution software can identify discrete masses and their associated isotopes and adducts, the utility of metabolomic approaches for many statistical analyses such as identifying differentially abundant ions depends heavily on data quality and robustness, especially, the accuracy of aligning features across multiple biological replicates. We have developed a novel algorithm for feature alignment using density maximization. Instead of a greedy iterative, hence local, merging strategy, which has been widely used in the literature and in commercial applications, we apply a global merging strategy to improve alignment quality. Using both simulated and real data, we demonstrate that our new algorithm provides high map (e.g. chromatogram) coverage, which is critically important for non-targeted comparative metabolite profiling of highly replicated biological datasets. © 2011 Springer Science+Business Media, LLC.
Yang ZR, Grant M (2012). An ultra-fast metabolite prediction algorithm.
PLoS One,
7(6).
Abstract:
Small molecules are central to all biological processes and metabolomics is becoming an increasingly important discovery tool. Robust, accurate and efficient experimental approaches are critical to supporting and validating predictions from post-genomic studies. To accurately predict metabolic changes and dynamics, experimental design requires multiple biological replicates and usually multiple treatments. Mass spectra from each run are processed and metabolite features are extracted. Because of machine resolution and variation in replicates, one metabolite may have different implementations (values) of retention time and mass in different spectra. A major impediment to effectively utilizing untargeted metabolomics data is ensuring accurate spectral alignment, enabling precise recognition of features (metabolites) across spectra. Existing alignment algorithms use either a global merge strategy or a local merge strategy. The former delivers an accurate alignment, but lacks efficiency. The latter is fast, but often inaccurate. Here we document a new algorithm employing a technique known as quicksort. The results on both simulated data and real data show that this algorithm provides a dramatic increase in alignment speed and also improves alignment accuracy.
2011
Perera V, De Torres Zabala M, Florance H, Smirnoff N, Grant M, Yang ZR (2011). Aligning extracted LC-MS peak lists via density maximization. Metabolomics, 1-11.
2010
Yang ZR (2010). Neural networks.
Methods Mol Biol,
609, 197-222.
Abstract:
Neural networks are a class of intelligent learning machines establishing the relationships between descriptors of real-world objects. As optimisation tools they are also a class of computational algorithms implemented using statistical/numerical techniques for parameter estimate, model selection, and generalisation enhancement. In bioinformatics applications, neural networks have played an important role for classification, function approximation, knowledge discovery, and data visualisation. This chapter will focus on supervised neural networks and discuss their applications to bioinformatics.
2009
Felgner PL, Kayala MA, Vigil A, Burk C, Nakajima-Sasaki R, Pablo J, Molina DM, Hirst S, Chew JSW, Wang D, et al (2009). A Burkholderia pseudomallei protein microarray reveals serodiagnostic and cross-reactive antigens.
Proc Natl Acad Sci U S A,
106(32), 13499-13504.
Abstract:
Understanding the way in which the immune system responds to infection is central to the development of vaccines and many diagnostics. To provide insight into this area, we fabricated a protein microarray containing 1,205 Burkholderia pseudomallei proteins, probed it with 88 melioidosis patient sera, and identified 170 reactive antigens. This subset of antigens was printed on a smaller array and probed with a collection of 747 individual sera derived from 10 patient groups including melioidosis patients from Northeast Thailand and Singapore, patients with different infections, healthy individuals from the USA, and from endemic and nonendemic regions of Thailand. We identified 49 antigens that are significantly more reactive in melioidosis patients than healthy people and patients with other types of bacterial infections. We also identified 59 cross-reactive antigens that are equally reactive among all groups, including healthy controls from the USA. Using these results we were able to devise a test that can classify melioidosis positive and negative individuals with sensitivity and specificity of 95% and 83%, respectively, a significant improvement over currently available diagnostic assays. Half of the reactive antigens contained a predicted signal peptide sequence and were classified as outer membrane, surface structures or secreted molecules, and an additional 20% were associated with pathogenicity, adaptation or chaperones. These results show that microarrays allow a more comprehensive analysis of the immune response on an antigen-specific, patient-specific, and population-specific basis, can identify serodiagnostic antigens, and contribute to a more detailed understanding of immunogenicity to this pathogen.
Yang ZR, Lertmemongkolchai G, Tan G, Felgner PL, Titball R (2009). A genetic programming approach for Burkholderia pseudomallei diagnostic pattern discovery.
Bioinformatics,
25(17), 2256-2262.
Abstract:
MOTIVATION: Finding diagnostic patterns for fighting diseases like Burkholderia pseudomallei using biomarkers involves two key issues. First, exhausting all subsets of testable biomarkers (antigens in this context) to find a best one is computationally infeasible. Therefore, a proper optimization approach like evolutionary computation should be investigated. Second, a properly selected function of the antigens as the diagnostic pattern which is commonly unknown is a key to the diagnostic accuracy and the diagnostic effectiveness in clinical use. RESULTS: A conversion function is proposed to convert serum tests of antigens on patients to binary values based on which Boolean functions as the diagnostic patterns are developed. A genetic programming approach is designed for optimizing the diagnostic patterns in terms of their accuracy and effectiveness. During optimization, it is aimed to maximize the coverage (the rate of positive response to antigens) in the infected patients and minimize the coverage in the non-infected patients while maintaining the fewest number of testable antigens used in the Boolean functions as possible. The final coverage in the infected patients is 96.55% using 17 of 215 (7.4%) antigens with zero coverage in the non-infected patients. Among these 17 antigens, BPSL2697 is the most frequently selected one for the diagnosis of Burkholderia pseudomallei. The approach has been evaluated using both the cross-validation and the Jack-knife simulation methods with the prediction accuracy as 93% and 92%, respectively. A novel approach is also proposed in this study to evaluate a model with binary data using ROC analysis.
Yang ZR (2009). Peptide bioinformatics: peptide classification using peptide machines.
Methods Mol Biol,
458, 155-179.
Abstract:
Peptides scanned from whole protein sequences are the core information for many peptide bioinformatics research subjects, such as functional site prediction, protein structure identification, and protein function recognition. In these applications, we normally need to assign a peptide to one of the given categories using a computer model. They are therefore referred to as peptide classification applications. Among various machine learning approaches, including neural networks, peptide machines have demonstrated excellent performance compared with various conventional machine learning approaches in many applications. This chapter discusses the basic concepts of peptide classification, commonly used feature extraction methods, three peptide machines, and some important issues in peptide classification.
Yang ZR (2009). Predict collagen hydroxyproline sites using support vector machines.
J Comput Biol,
16(5), 691-702.
Abstract:
Collagen hydroxyproline is an important posttranslational modification activity because of its close relationship with various diseases and signaling activities. However, there is no study to date for constructing models for predicting collagen hydroxyproline sites. Support vector machines with two kernel functions (the identity kernel function and the bio-kernel function) have been used for constructing models for predicting collagen hydroxyproline sites in this study. The models are constructed based on 37 sequences collected from NCBI. Peptide data are generated using a sliding window with various sizes to scan the sequences. Fivefold cross-validation is used for model evaluation. The best model has specificity of 70% and sensitivity of 90%.
Yang ZR (2009). Predict prokaryotic proteins through detecting N-formylmethionine residues in protein sequences using support vector machine.
Biosystems,
97(3), 141-145.
Abstract:
Identifying prokaryotes in silico is commonly based on DNA sequences. In experiments where DNA sequences may not be immediately available, we need to have a different approach to detect prokaryotes based on RNA or protein sequences. N-formylmethionine (fMet) is known as a typical characteristic of prokaryotes. A web tool has been implemented here for predicting prokaryotes through detecting the N-formylmethionine residues in protein sequences. The predictor is constructed using support vector machine. An online predictor has been implemented using Python. The implemented predictor is able to achieve the total prediction accuracy 80% with the specificity 80% and the sensitivity 81%.
Yang ZR (2009). Predicting sulfotyrosine sites using the random forest algorithm with significantly improved prediction accuracy.
BMC Bioinformatics,
10.
Abstract:
BACKGROUND: Tyrosine sulfation is one of the most important posttranslational modifications. Due to its relevance to various disease developments, tyrosine sulfation has become the target for drug design. In order to facilitate efficient drug design, accurate prediction of sulfotyrosine sites is desirable. A predictor published seven years ago has been very successful with claimed prediction accuracy of 98%. However, it has a particularly low sensitivity when predicting sulfotyrosine sites in some newly sequenced proteins. RESULTS: A new approach has been developed for predicting sulfotyrosine sites using the random forest algorithm after a careful evaluation of seven machine learning algorithms. Peptides are formed by consecutive residues symmetrically flanking tyrosine sites. They are then encoded using an amino acid hydrophobicity scale. This new approach has increased the sensitivity by 22%, the specificity by 3%, and the total prediction accuracy by 10% compared with the previous predictor using the same blind data. Meanwhile, both negative and positive predictive powers have been increased by 9%. In addition, the random forest model has an excellent feature for ranking the residues flanking tyrosine sites, hence providing more information for further investigating the tyrosine sulfation mechanism. A web tool has been implemented at http://ecsb.ex.ac.uk/sulfotyrosine for public use. CONCLUSION: The random forest algorithm is able to deliver a better model compared with the Hidden Markov Model, the support vector machine, artificial neural networks, and others for predicting sulfotyrosine sites. The success shows that the random forest algorithm together with an amino acid hydrophobicity scale encoding can be a good candidate for peptide classification.
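As an illustration of the peptide-forming and encoding steps described in the abstract above, the following minimal Python sketch extracts the peptide symmetrically flanking each tyrosine and maps its residues to hydrophobicity values. The abstract does not name the scale, so the Kyte-Doolittle hydropathy values used here are an assumption, as are the function names, the flank width, the toy sequence, and the choice to skip tyrosines that lie too close to either end of the sequence.

```python
# Kyte-Doolittle hydropathy values (ASSUMED scale; the abstract does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def tyrosine_peptides(sequence, flank):
    """Return (position, peptide) pairs for every tyrosine that has
    `flank` residues available on both sides (hypothetical helper)."""
    peptides = []
    for i, residue in enumerate(sequence):
        if residue == "Y" and i - flank >= 0 and i + flank < len(sequence):
            peptides.append((i, sequence[i - flank:i + flank + 1]))
    return peptides


def encode(peptide):
    """Map each residue of a peptide to its hydropathy value."""
    return [KYTE_DOOLITTLE[residue] for residue in peptide]


if __name__ == "__main__":
    toy_sequence = "MKDEYGALYPW"  # made-up sequence, for illustration only
    for position, peptide in tyrosine_peptides(toy_sequence, flank=3):
        print(position, peptide, encode(peptide))
```

The resulting fixed-length numeric vectors are the kind of input a random forest classifier could be trained on; the training step itself is not shown here.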
2008
Yang ZR (2008). Crosstalk and signalling pathway complexity - a case study on synthetic models.
Yang ZR (2008). Explore residue significance in peptide classification.
Yin H, Tino P, Magdon-Ismail M, Yang ZR, Corchado E (2008). Introduction.
International Journal of Neural Systems,
18(6), V-V.
Yang ZR (2008). Peptide bioinformatics: peptide classification using peptide machines.
Methods Mol Biol,
458, 159-183.
Abstract:
Peptide bioinformatics: peptide classification using peptide machines.
Peptides scanned from whole protein sequences are the core information for much peptide bioinformatics research, such as functional site prediction, protein structure identification, and protein function recognition. In these applications, we normally need to assign a peptide to one of the given categories using a computer model. They are therefore referred to as peptide classification applications. Among various machine learning approaches, including neural networks, peptide machines have demonstrated excellent performance in many applications. This chapter discusses the basic concepts of peptide classification, commonly used feature extraction methods, three peptide machines, and some important issues in peptide classification.
Yang ZR (2008). Single-layer neural net competes with multi-layer neural net.
2007
Yang ZR (2007). A probabilistic peptide machine for predicting hepatitis C virus protease cleavage sites.
IEEE Trans Inf Technol Biomed,
11(5), 593-595.
Abstract:
A probabilistic peptide machine for predicting hepatitis C virus protease cleavage sites.
Although various machine learning approaches have been used for predicting protease cleavage sites, constructing a probabilistic model for these tasks is still challenging. This paper proposes a novel algorithm termed as a probabilistic peptide machine where estimating probability density functions and constructing a classifier for predicting protease cleavage sites are combined into one process. The simulation based on experimentally determined Hepatitis C virus (HCV) protease cleavage data has demonstrated the success of this new algorithm.
Trudgian DC, Yang ZR (2007). A sparse bayesian position weighted bio-kernel network.
Yang ZR, Hamer R (2007). Bio-basis function neural networks in protein data mining.
Curr Pharm Des,
13(14), 1403-1413.
Abstract:
Bio-basis function neural networks in protein data mining.
Accurately identifying functional sites in proteins is one of the most important topics in bioinformatics and systems biology. In bioinformatics, identifying protease cleavage sites in protein sequences can aid drug/inhibitor design. In systems biology, post-translational protein-protein interaction activity is one of the major components for analyzing signaling pathway activities. Determining functional sites using laboratory experiments is normally time-consuming and expensive. Computer programs have therefore been widely used for this kind of task. Mining protein sequence data using computer programs covers two major issues: 1) discovering how amino acid specificity affects functional sites and 2) discovering what amino acid specificity is. Both need a proper coding mechanism prior to using a proper machine learning algorithm. The development of the bio-basis function neural network (BBFNN) has opened a new way for protein sequence data mining. The bio-basis function used in BBFNN is biologically sound in that it codes the biological information in protein sequences well, i.e. it measures the similarity between protein sequences well. BBFNN has therefore outperformed conventional neural networks in many areas of protein sequence data mining, from protease cleavage site prediction to disordered protein identification. This review focuses on the variants of BBFNN and their applications in mining protein sequence data.
Yang ZR, IAENG (2007). Peptide classification with genetic programming ensemble of generalised indicator models.
Yang ZR (2007). Peptide machines for data mining protein peptides.
Amino Acids,
33(3), XII-XII.
Yang ZR (2007). Predicting palmitoylation sites using a regularised bio-basis function neural network.
Yang ZR, Young N (2007). Regressional inhibitive crosstalk models.
Trudgian DC, Yang ZR (2007). Substitution matrix optimisation for peptide classification.
2006
Yang ZR, Dry J, Thomson R, Charles Hodgman T (2006). A bio-basis function neural network for protein peptide cleavage activity characterisation.
Neural Netw,
19(4), 401-407.
Abstract:
A bio-basis function neural network for protein peptide cleavage activity characterisation.
This paper presents a novel neural learning algorithm for analysing protein peptides which comprise amino acids as non-numerical attributes. The algorithm is derived from the radial basis function neural networks (RBFNNs) and is referred to as a bio-basis function neural network (BBFNN). The basic principle is to replace the radial basis function used by RBFNNs with a bio-basis function. Each basis in BBFNN is supported by a peptide. The bases collectively form a feature space, in which each basis represents a feature dimension. A linear classifier is constructed in the feature space for characterising a protein peptide in terms of functional status. The theoretical basis of BBFNN is that peptides, which perform the same function will have similar compositions of amino acids. Because of this, the similarity between peptides can have statistical significance for modelling while the proposed bio-basis function can well code this information from data. The application to two real cases shows that BBFNN outperformed multi-layer perceptrons and support vector machines.
Yang ZR (2006). A fast algorithm for relevance vector machine.
Yang ZR (2006). A novel radial basis function neural network for discriminant analysis.
IEEE Trans Neural Netw,
17(3), 604-612.
Abstract:
A novel radial basis function neural network for discriminant analysis.
A novel radial basis function neural network for discriminant analysis is presented in this paper. In contrast to many other researches, this work focuses on the exploitation of the weight structure of radial basis function neural networks using the Bayesian method. It is expected that the performance of a radial basis function neural network with a well-explored weight structure can be improved. As the weight structure of a radial basis function neural network is commonly unknown, the Bayesian method is, therefore, used in this paper to study this a priori structure. Two weight structures are investigated in this study, i.e. a single-Gaussian structure and a two-Gaussian structure. An expectation-maximization learning algorithm is used to estimate the weights. The simulation results showed that the proposed radial basis function neural network with a weight structure of two Gaussians outperformed the other algorithms.
Esnouf RM, Hamer R, Sussman JL, Silman I, Trudgian D, Yang ZR, Prilusky J (2006). Honing the in silico toolkit for detecting protein disorder.
Acta Crystallographica Section D: Biological Crystallography,
62(10), 1260-1266.
Abstract:
Honing the in silico toolkit for detecting protein disorder
Not all proteins form well defined three-dimensional structures in their native states. Some amino-acid sequences appear to strongly favour the disordered state, whereas some can apparently transition between disordered and ordered states under the influence of changes in the biological environment, thereby playing an important role in processes such as signalling. Although important biologically, for the structural biologist disordered regions of proteins can be disastrous even preventing successful structure determination. The accurate prediction of disorder is therefore important, not least for directing the design of expression constructs so as to maximize the chances of successful structure determination. Such design criteria have become integral to the construct-design strategies of laboratories within the Structural Proteomics in Europe (SPINE) consortium. This paper assesses the current state of the art in disorder prediction in terms of prediction reliability and considers how best to use these methods to guide construct design. Finally, it presents a brief discussion as to how methods of prediction might be improved in the future. © International Union of Crystallography, 2006.
Thomas AC, Zheng RY (2006). Improved prediction of HIV-1 protease genotypic resistance testing assays using a consensus technique.
Young N, Yang ZR (2006). Multivariate crosstalk models.
Trudgian DC, Charles-Johnson F, Yang ZR (2006). Predicting HIV-1 T cell epitopes using bio-basis function neural networks.
Yang ZR (2006). Predicting hepatitis C virus protease cleavage sites using generalized linear indicator regression models.
IEEE Trans Biomed Eng,
53(10), 2119-2123.
Abstract:
Predicting hepatitis C virus protease cleavage sites using generalized linear indicator regression models.
This paper discusses how to predict hepatitis C virus protease cleavage sites in proteins using generalized linear indicator regression models. The mutual information is used for model-size optimization. Two simulation strategies are adopted, i.e. building a model based on published peptides and building a model based on the published peptides plus newly collected sequences. It is found that the latter outperforms the former significantly. The simulation also shows that the generalized linear indicator regression model far outperforms the multilayer perceptron model.
2005
Yang ZR (2005). Bayesian radial basis function neural network.
Yang ZR, Thomson R (2005). Bio-basis function neural network for prediction of protease cleavage sites in proteins.
IEEE Trans Neural Netw,
16(1), 263-274.
Abstract:
Bio-basis function neural network for prediction of protease cleavage sites in proteins.
The prediction of protease cleavage sites in proteins is critical to effective drug design. One of the important issues in constructing an accurate and efficient predictor is how to present nonnumerical amino acids to a model effectively. As this issue has not yet been paid full attention and is closely related to model efficiency and accuracy, we present a novel neural learning algorithm aimed at improving the prediction accuracy and reducing the time involved in training. The algorithm is developed based on the conventional radial basis function neural networks (RBFNNs) and is referred to as a bio-basis function neural network (BBFNN). The basic principle is to replace the radial basis function used in RBFNNs by a novel bio-basis function. Each bio-basis is a feature dimension in a numerical feature space, to which a nonnumerical sequence space is mapped for analysis. The bio-basis function is designed using an amino acid mutation matrix verified in biology. Thus, the biological content in protein sequences can be maximally utilized for accurate modeling. Mutual information (MI) is used to select the most informative bio-bases and an ensemble method is used to enhance a decision-making process, hence, improving the prediction accuracy further. The algorithm has been successfully verified in two case studies, namely the prediction of Human Immunodeficiency Virus (HIV) protease cleavage sites and trypsin cleavage sites in proteins.
Yang ZR, Young N (2005). Bio-kernel self-organizing Map for HIV drug resistance classification.
Yang ZR, Dalby AR (2005). International Journal of Neural Systems: Introduction. International Journal of Neural Systems, 15(4).
Yang ZR (2005). Mining SARS-CoV protease cleavage data using non-orthogonal decision trees: a novel method for decisive template selection.
Bioinformatics,
21(11), 2644-2650.
Abstract:
Mining SARS-CoV protease cleavage data using non-orthogonal decision trees: a novel method for decisive template selection.
MOTIVATION: Although the outbreak of the severe acute respiratory syndrome (SARS) is currently over, it is expected that it will return to attack human beings. A critical challenge to scientists from various disciplines worldwide is to study the specificity of cleavage activity of SARS-related coronavirus (SARS-CoV) and use the knowledge obtained from the study for effective inhibitor design to fight the disease. The most commonly used inductive programming methods for knowledge discovery from data assume that the elements of input patterns are orthogonal to each other. Suppose a sub-sequence is denoted as P2-P1-P1'-P2', the conventional inductive programming method may result in a rule like 'if P1 = Q, then the sub-sequence is cleaved, otherwise non-cleaved'. If the site P1 is not orthogonal to the others (for instance, P2, P1' and P2'), the prediction power of this kind of rule may be limited. Therefore this study is aimed at developing a novel method for constructing non-orthogonal decision trees for mining protease data. RESULT: Eighteen sequences of coronavirus polyprotein were downloaded from NCBI (http://www.ncbi.nlm.nih.gov). Among these sequences, 252 cleavage sites were experimentally determined. These sequences were scanned using a sliding window with size k to generate about 50,000 k-mer sub-sequences (for short, k-mers). The value of k varies from 4 to 12 with a gap of two. The bio-basis function proposed by Thomson et al. is used to transform the k-mers to a high-dimensional numerical space on which an inductive programming method is applied for the purpose of deriving a decision tree for decision-making. The process of this transform is referred to as a bio-mapping. The constructed decision trees select about 10 out of 50,000 k-mers. This small set of selected k-mers is regarded as a set of decisive templates. By doing so, non-orthogonal decision trees are constructed using the selected templates and the prediction accuracy is significantly improved.
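The sliding-window scan described in the abstract above can be sketched in a few lines of Python. This is an illustration only, not the paper's implementation: the function name `kmers` and the toy sequence are hypothetical, and the sketch stops before the bio-basis transform and the decision-tree construction.

```python
def kmers(sequence, k):
    """Return every length-k sub-sequence obtained by sliding a window
    of size k along the sequence one residue at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    toy_sequence = "MESLVPGFNEKTHVQLSLPV"  # made-up 20-residue sequence
    # k varies from 4 to 12 with a gap of two, as stated in the abstract.
    for k in range(4, 13, 2):
        windows = kmers(toy_sequence, k)
        print(k, len(windows), windows[0])
```

A sequence of length n yields n - k + 1 windows for each k, which is how a modest number of polyprotein sequences can produce tens of thousands of k-mers.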
Yang ZR (2005). Orthogonal kernel machine for the prediction of functional sites in proteins.
IEEE Trans Syst Man Cybern B Cybern,
35(1), 100-106.
Abstract:
Orthogonal kernel machine for the prediction of functional sites in proteins.
A novel pattern recognition algorithm called an orthogonal kernel machine (OKM) is presented for the prediction of functional sites in proteins. Two novelties in OKM are that the kernel function is specially designed for measuring the similarity between a pair of protein sequences and the kernels are selected using the orthogonal method. Based on a set of well-recognized orthogonal kernels, this algorithm demonstrates its superior performance compared with other methods. An application of this algorithm to a real problem is presented.
Yang ZR, Wang L, Young N, Trudgian D, Chou K-C (2005). Pattern recognition methods for protein functional site prediction.
Curr Protein Pept Sci,
6(5), 479-491.
Abstract:
Pattern recognition methods for protein functional site prediction.
Protein functional site prediction is closely related to drug design, hence to public health. In order to save the cost and the time spent on identifying the functional sites in sequenced proteins in biology laboratory, computer programs have been widely used for decades. Many of them are implemented using the state-of-the-art pattern recognition algorithms, including decision trees, neural networks and support vector machines. Although the success of this effort has been obvious, advanced and new algorithms are still under development for addressing some difficult issues. This review will go through the major stages in developing pattern recognition algorithms for protein functional site prediction and outline the future research directions in this important area.
Senawongse P, Dalby AR, Yang ZR (2005). Predicting the phosphorylation sites using hidden Markov models and machine learning methods.
J Chem Inf Model,
45(4), 1147-1152.
Abstract:
Predicting the phosphorylation sites using hidden Markov models and machine learning methods.
Accurately predicting phosphorylation sites in proteins is an important issue in postgenomics, for which how to efficiently extract the most predictive features from amino acid sequences for modeling is still challenging. Although both the distributed encoding method and the bio-basis function method work well, they still have some limits in use. The distributed encoding method is unable to code the biological content in sequences efficiently, whereas the bio-basis function method is a nonparametric method, which is often computationally expensive. As hidden Markov models (HMMs) can be used to generate one model for one cluster of aligned protein sequences, the aim in this study is to use HMMs to extract features from amino acid sequences, where sequence clusters are determined using available biological knowledge. In this novel method, HMMs are first constructed using functional sequences only. Both functional and nonfunctional training sequences are then inputted into the trained HMMs to generate functional and nonfunctional feature vectors. From this, a machine learning algorithm is used to construct a classifier based on these feature vectors. It is found in this work that (1) this method provides much better prediction accuracy than the use of HMMs only for prediction, and (2) the support vector machines (SVMs) algorithm outperforms decision trees and neural network algorithms when they are constructed on the features extracted using the trained HMMs.
Yang ZR, Johnson FC (2005). Prediction of T-cell epitopes using biosupport vector machines.
J Chem Inf Model,
45(5), 1424-1428.
Abstract:
Prediction of T-cell epitopes using biosupport vector machines.
The immune system is concerned with the recognition and disposal of foreign or "non self" molecules or cells that enter the body of an immunologically competent individual. The generation of an immune response depends on the interaction of components, namely, the immunogen (nonself or foreign cell or molecule), antibody producing humoral immune system, and sensitized lymphocyte producing cellular immune system. An immunogen possesses surface structures referred to as epitopes; the precise pattern of each epitope enables an individual's immune system to recognize cells or molecules as self or immunogens. During the recognition process, the specific cells known as macrophages identify the epitope structures on the immunogen and save them in the form of short peptides 10-18 amino-acids-long known as immune dominant peptides (IDPs). IDPs are then bound with surface proteins on macrophages known as MHC protein complexes. The macrophages then present this IDP-MHC complex to a T cell that possesses a specific receptor that is specific for the foreign epitope on the IDP bound to MHC complex. This initiates an immune system cascade that results in the disposal of the immunogen. The study and accurate prediction of T-cell epitopes is, thus, very important for designing vaccines against pathogenic diseases. The present study applied the newly developed biosupport vector machine to the T-cell epitope data. This new algorithm introduces a biobasis function into the conventional support vector machines so that the nonnumerical attributes (amino acids) in protein sequences can be recognized without a feature extraction process, which often fails to properly code the biological content in protein sequences. The prediction accuracy of a 10-fold cross validation is 90.31%, compared with 87.86% using support vector machines reported as the best compared with other algorithms in an earlier study.
Yang ZR (2005). Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks. Bioinformatics, 21(9), 1831-1837.
Yang ZR (2005). Probabilistic Mercer kernel clusters.
Yang ZR, Esnouf R, McNeil P, Thomson R (2005). RONN: use of the bio-basis function neural network technique for the detection of natively disordered regions in proteins. Bioinformatics, 21(16), 3369-3376.
Yang ZR, Dalby AR (2005). Special issue on bioinformatics - Introduction.
International Journal of Neural Systems,
15(4), V-VI.
2004
Yang ZR, Chou KC (2004). Bio-Support Vector Machines for Computational Proteomics. Bioinformatics, 20(5), 735-741.
Yang ZR (2004). Biological applications of support vector machines.
Brief Bioinform,
5(4), 328-338.
Abstract:
Biological applications of support vector machines.
One of the major tasks in bioinformatics is the classification and prediction of biological data. With the rapid increase in size of the biological databanks, it is essential to use computer programs to automate the classification process. At present, the computer programs that give the best prediction performance are support vector machines (SVMs). This is because SVMs are designed to maximise the margin to separate two classes so that the trained model generalises well on unseen data. Most other computer programs implement a classifier through the minimisation of error occurred in training, which leads to poorer generalisation. Because of this, SVMs have been widely applied to many areas of bioinformatics including protein function prediction, protease functional site recognition, transcription initiation site prediction and gene expression data classification. This paper will discuss the principles of SVMs and the applications of SVMs to the analysis of biological data, mainly protein and DNA sequences.
Venkatraman V, Dalby AR, Yang ZR (2004). Evaluation of mutual information and genetic programming for feature selection in QSAR.
J Chem Inf Comput Sci,
44(5), 1686-1692.
Abstract:
Evaluation of mutual information and genetic programming for feature selection in QSAR.
Feature selection is a key step in Quantitative Structure Activity Relationship (QSAR) analysis. Chance correlations and multicollinearity are two major problems often encountered when attempting to find generalized QSAR models for use in drug design. Optimal QSAR models require an objective variable relevance analysis step for producing robust classifiers with low complexity and good predictive accuracy. Genetic algorithms coupled with information theoretic approaches such as mutual information have been used to find near-optimal solutions to such multicriteria optimization problems. In this paper, we describe a novel approach for analyzing QSAR data based on these methods. Our experiments with the Thrombin dataset, previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001 demonstrate the feasibility of this approach. It has been found that it is important to take into account the data distribution, the rule "interestingness", and the need to look at more invariant and monotonic measures of feature selection.
Yang ZR, Everson R, Yin H (2004). Intelligent data engineering and automated learning--IDEAL 2004. Springer-Verlag New York Inc.
Yang ZR, Dalby AR, Qiu J (2004). Mining HIV protease cleavage data using genetic programming with a sum-product function.
Bioinformatics,
20(18), 3398-3405.
Abstract:
Mining HIV protease cleavage data using genetic programming with a sum-product function.
MOTIVATION: In order to design effective HIV inhibitors, studying and understanding the mechanism of HIV protease cleavage specification is critical. Various methods have been developed to explore the specificity of HIV protease cleavage activity. However, success in both extracting discriminant rules and maintaining high prediction accuracy is still challenging. The earlier study had employed genetic programming with a min-max scoring function to extract discriminant rules with success. However, the decision finally degenerates to one residue, making further improvement of the prediction accuracy difficult. The challenge of revising the min-max scoring function so as to improve the prediction accuracy motivated this study. RESULTS: This paper has designed a new scoring function called a sum-product function for extracting HIV protease cleavage discriminant rules using genetic programming methods. The experiments show that the new scoring function is superior to the min-max scoring function. AVAILABILITY: The software package can be obtained by request to Dr Zheng Rong Yang.
Yang ZR (2004). Mining gene expression data using the template theory. Bioinformatics, 20(16), 2759-2766.
Yang ZR, Chou KC (2004). Predicting the O-linkage sites in glycoproteins using bio-basis function neural networks. Bioinformatics, 20(6), 903-908.
Yin H, Yang ZR, Everson R (2004). Preface.
Yang R, Everson RM, Yin H (2004). Proceedings of the Fifth International Conference on Intelligent Data Engineering and Automated Learning -- IDEAL2004. Springer.
Berry EA, Dalby AR, Yang ZR (2004). Reduced bio basis function neural network for identification of protein phosphorylation sites: comparison with pattern recognition algorithms.
Comput Biol Chem,
28(1), 75-85.
Abstract:
Reduced bio basis function neural network for identification of protein phosphorylation sites: comparison with pattern recognition algorithms.
Protein phosphorylation is a post-translational modification performed by a group of enzymes known as the protein kinases or phosphotransferases (Enzyme Commission classification 2.7). It is essential to the correct functioning of both proteins and cells, being involved with enzyme control, cell signalling and apoptosis. The major problem when attempting prediction of these sites is the broad substrate specificity of the enzymes. This study employs back-propagation neural networks (BPNNs), the decision tree algorithm C4.5 and the reduced bio-basis function neural network (rBBFNN) to predict phosphorylation sites. The aim is to compare prediction efficiency of the three algorithms for this problem, and examine knowledge extraction capability. All three algorithms are effective for phosphorylation site prediction. Results indicate that rBBFNN is the fastest and most sensitive of the algorithms. BPNN has the highest area under the ROC curve and is therefore the most robust, and C4.5 has the highest prediction accuracy. C4.5 also reveals that the amino acid 2 residues upstream from the phosphorylation site is important for serine/threonine phosphorylation, whilst the amino acid 3 residues upstream is important for tyrosine phosphorylation.
Yang ZR, Berry EA (2004). Reduced bio-basis function neural networks for protease cleavage site prediction.
J Bioinform Comput Biol,
2(3), 511-531.
Abstract:
Reduced bio-basis function neural networks for protease cleavage site prediction.
This paper presents a new neural learning algorithm for protease cleavage site prediction. The basic idea is to replace the radial basis function used in radial basis function neural networks by a so-called bio-basis function using amino acid similarity matrices. Mutual information is used to select bio-bases and a corresponding selection algorithm is developed. The algorithm has been applied to the prediction of HIV and Hepatitis C virus protease cleavage sites in proteins with success.
2003
Berry E, Yang ZR, Wu XK (2003). A Biology Inspired Neural Learning Algorithm for Analysing Protein Sequences.
Yang ZR, Harrison RG (2003). An unsupervised probabilistic net for health inequalities analysis.
IEEE Trans Neural Netw,
14(1), 46-57.
Abstract:
An unsupervised probabilistic net for health inequalities analysis.
An unsupervised probabilistic net (UPN) is introduced to identify health inequalities among countries according to their health status measured by the collected health indicators. By estimating the underlying probability density function of the health indicators using UPN, countries, which have similar health status, will be categorized into the same cluster. From this, the intercluster health inequalities are identified by the Mahalanobis distance, and the intracluster health inequalities are identified by the diversity within the clusters. To extract the typical health status, the concept of virtual objects is used in this study. Each virtual object in this study, therefore, represents a hypothetical country, which does not exist in a data set but can be found through learning. The identified virtual objects represent the hidden knowledge in a data set and can be valuable to social scientists in health promotion planning. Moreover, the investigation of the behavior of the virtual objects can help us to find the realistic and reasonable health promotion target for a country with a poor health status.
Yang ZR, Doyle AK, Hodgman C, Thomson R (2003). Characterising proteolytic cleavage site activity using bio-basis function neural network.
Bioinformatics,
19(14), 1741-1747.
Yang ZR, Chou K-C (2003). Mining biological data using self-organizing map.
J Chem Inf Comput Sci,
43(6), 1748-1753.
Abstract:
This paper presents a novel method of mining biological data using a self-organizing map (SOM). After partitioning a set of protein sequences using SOM, conventional homology alignment is applied to each cluster to determine the conserved local motif (biological pattern) for the cluster. These local motifs are then regarded as rules for prediction and classification. In the application to the prediction of HIV protease cleavage sites in proteins, we found that the rules derived from this method are much more robust than those derived from the decision tree method.
Yang ZR, Thomson R, Hodgman TC, Dry J, Doyle AK, Narayanan A, Wu X (2003). Searching for discrimination rules in protease proteolytic cleavage activity using genetic programming with a min-max scoring function.
Biosystems,
72(1-2), 159-176.
Abstract:
This paper presents an algorithm which is able to extract discriminant rules from oligopeptides for protease proteolytic cleavage activity prediction. The algorithm is developed using genetic programming. Three important components in the algorithm are a min-max scoring function, the reverse Polish notation (RPN) and the use of minimum description length. The min-max scoring function is developed using amino acid similarity matrices for measuring the similarity between an oligopeptide and a rule, which is a complex algebraic equation of amino acids rather than a simple pattern sequence. The Fisher ratio is then calculated on the scoring values using the class label associated with the oligopeptides. The discriminant ability of each rule can therefore be evaluated. The use of RPN makes the evolutionary operations simpler and therefore reduces the computational cost. To prevent overfitting, the concept of minimum description length is used to penalize over-complicated rules. A fitness function is therefore composed of the Fisher ratio and the use of minimum description length for an efficient evolutionary process. In the application to four protease datasets (Trypsin, Factor Xa, Hepatitis C Virus and HIV protease cleavage site prediction), our algorithm is superior to C5, a conventional method for deriving decision trees.
Yang ZR (2003). Support vector machines for company failure prediction.
2002
Thomson R, Yang ZR (2002). A novel basis function neural network.
Yang ZR, Harrison RG (2002). Analysing company performance using templates.
Intelligent Data Analysis,
6(1), 3-15.
Abstract:
Other than identifying whether a company may fail or not, explaining why a company may fail is essential. The most common way of explaining is to use a template like the standards used in commercial society. Because of the existence of heteroscedasticity, it is impossible to expect that there is only one standard within an industry. For instance, it is unrealistic to use one standard to evaluate the performance of both a newly founded company and a fifty-year-old company. This paper presents a method of searching for templates using probabilistic neural networks. Each template represents a number of companies, which have similar financial performance and therefore similar financial outcomes. A comparison between a company and a template can explain how badly a company performs and what the problem is if its financial situation is not sound. The method has so far been applied to a data set of 2408 UK construction companies. © 2002 IOS Press. All rights reserved.
Yang ZR (2002). Artificial neural networks in analysing health inequalities.
Narayanan A, Wu X, Yang ZR (2002). Mining viral protease data to extract cleavage knowledge.
Bioinformatics,
18 Suppl 1, S5-13.
Abstract:
MOTIVATION: The motivation is to identify, through machine learning techniques, specific patterns in HIV and HCV viral polyprotein amino acid residues where viral protease cleaves the polyprotein as it leaves the ribosome. An understanding of viral protease specificity may help the development of future anti-viral drugs involving protease inhibitors by identifying specific features of protease activity for further experimental investigation. While viral sequence information is growing at a fast rate, there is still comparatively little understanding of how viral polyproteins are cut into their functional unit lengths. The aim of the work reported here is to investigate whether it is possible to generalise from known cleavage sites to unknown cleavage sites for two specific viruses, HIV and HCV. An understanding of proteolytic activity for specific viruses will contribute to our understanding of viral protease function in general, thereby leading to a greater understanding of protease families and their substrate characteristics. RESULTS: Our results show that artificial neural networks and symbolic learning techniques (See5) capture some fundamental and new substrate attributes, but neural networks outperform their symbolic counterpart.
2001
Yang ZR (2001). A Binary Probabilistic Model and genetic algorithm for HIV protease cleavage sites prediction and search.
Yang ZR (2001). A binary probabilistic classification tree for company failure prediction.
Yang ZR (2001). A new method for company failure prediction using probabilistic neural networks.
Yang ZR (2001). Analysing health inequalities using SOM.
Yang ZR, Lu W, Harrison RG (2001). Evolving stacked time series predictors with multiple window scales and sampling gaps.
Neural Processing Letters,
13(3), 203-211.
Abstract:
We apply evolutionary programming to search for the optimal combination of stacked time series predictors with multiple window scales and sampling gaps. In this approach, the evolutionary process is ensured to proceed smoothly towards the optimal solution by using a control strategy based on the similarity level between the genotypes from two successive generations. Our experiments on both sunspots and S&P500 price index predictions demonstrate that this method significantly improves the prediction accuracy compared with the constrained least squared regression.
Yang ZR, Zwolinski M (2001). Mutual information theory for adaptive mixture models.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
23(4), 396-403.
2000
Yang ZR, Zwolinski M (2000). Applying mutual information to adaptive mixture models.
Zwolinski M, Yang ZR, Kazmierski TJ (2000). Using robust adaptive mixing for statistical fault macromodelling.
IEE Proceedings: Circuits, Devices and Systems,
147(5), 265-270.
Abstract:
The design and analysis of analogue circuits can be speeded up if accurate macromodels are used in place of full, transistor-level netlists. Similarly, testability analysis of analogue circuits at the transistor level is difficult because of the large CPU times needed for fault simulation. Macromodelling circuits under catastrophic fault conditions is difficult because the faulty behaviour is not easily predicted. Moreover, the variances in faulty behaviour, because of parametric tolerances, are not the same as the variance of the fault free behaviour. An algorithm is presented for statistical fault macromodelling of analogue circuits. The circuit macros are modelled using a robust adaptive mixing algorithm, which is based on mutual information theory and robust statistical methods. The experimental results show that the CPU time required for statistical fault macromodelling is very small and the model accuracy is very high.