DMRpred is a high-throughput predictor of disordered moonlighting regions (DMRs) in proteins. DMRs are disordered regions that carry out more than one disorder-related function. DMRpred is a Random Forest model that uses empirically designed features covering sequence conservation, relative solvent accessibility, intrinsic disorder and function-related amino acid indices. For each residue in the input sequence, it outputs a propensity score and a binary prediction of whether the residue is in a DMR. DMRpred takes less than one minute to process a protein of typical length (about 500 residues).
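The per-residue output described above (a propensity score plus a binary label) can be illustrated with a minimal sketch. This is not DMRpred's code: the function names and the 0.5 cutoff are hypothetical, chosen only to show how scores could be turned into labels and contiguous regions.

```python
from typing import List, Sequence, Tuple


def binarize(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a residue 1 when its propensity reaches the cutoff.

    The 0.5 default is an illustrative assumption, not DMRpred's threshold.
    """
    return [1 if score >= threshold else 0 for score in scores]


def regions(binary: Sequence[int]) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) spans of consecutive 1 labels."""
    found: List[Tuple[int, int]] = []
    start = None
    for pos, flag in enumerate(binary, start=1):
        if flag and start is None:
            start = pos  # a new region opens here
        elif not flag and start is not None:
            found.append((start, pos - 1))  # region closed on the previous residue
            start = None
    if start is not None:
        found.append((start, len(binary)))  # region runs to the end of the sequence
    return found
```

For example, `regions(binarize([0.1, 0.7, 0.8, 0.2, 0.9]))` groups the labelled residues into the spans (2, 3) and (5, 5).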
fDETECT is a web server that predicts whether a given protein sequence can pass the four steps of the crystallization pipeline: material production, purification, crystallization and diffraction-quality crystallization. It lets users run fast predictions for large datasets and slower but more accurate predictions for small datasets. The web server implements the fDETECT method (fast Determination of Eligibility of TargEts for CrysTallization; linear models) [Acta Cryst. (2014). D70, 2781-2793] and the PPCpred method (Predictor of protein Production, Purification and Crystallization; SVM models) [Bioinformatics. (2011). Volume 27, Issue 13, i24–i33].
DFLpred is a high-throughput predictor of disordered flexible linker (DFL) regions in proteins. It is a Logistic Regression model that uses selected features quantifying the propensities to form certain secondary structures, disordered regions and structured regions. For each residue in the input sequence, it outputs a propensity score and a binary prediction of whether the residue is in a DFL region. DFLpred is fast: it takes less than one hour to process an entire proteome.
I collected autophagy-related proteins from the Autophagy Database, and nuclear proteins with annotations of their intra-nuclear compartments from the Nsort/DB database. Using a majority-vote-based prediction of disorder, I analyzed similarities and differences in the intrinsic disorder distribution of nuclear and non-nuclear proteins related to autophagy.
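A majority-vote disorder call can be sketched as follows. This is only an illustration under stated assumptions: each predictor's output is taken to be a string of '1' (disordered) and '0' (ordered) of equal length, a residue is called disordered when more than half of the predictors agree, and the function names are hypothetical. The actual set of predictors and the tie handling used in the study are not specified here.

```python
from typing import List


def majority_vote_disorder(predictions: List[str]) -> str:
    """Combine per-residue binary disorder strings by strict majority vote."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")
    consensus = []
    for column in zip(*predictions):  # one column per residue
        votes = sum(1 for call in column if call == "1")
        # Strict majority: ties (possible with an even number of predictors) count as ordered.
        consensus.append("1" if votes * 2 > len(predictions) else "0")
    return "".join(consensus)


def disorder_content(binary: str) -> float:
    """Fraction of residues labelled disordered; 0.0 for an empty sequence."""
    return binary.count("1") / len(binary) if binary else 0.0
```

The disorder content computed from the consensus string is one simple quantity that could then be compared between protein sets.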
I analyzed the profile of intrinsic disorder for 3,005 mouse proteins localized in specific intra-nuclear organelles and compared it with that of 29,863 non-nuclear proteins. The nuclear proteins were collected from the Nsort/DB database and mapped onto the complete mouse proteome (collected from UniProt). By excluding the nuclear proteins from the complete mouse proteome, I created the set of 29,863 non-nuclear proteins. I also analyzed the differences in protein-protein interactions (PPIs) between nuclear proteins (including subsets localized in specific intra-nuclear compartments) and non-nuclear proteins, using a PPI network collected from the Mentha database.
I evaluated the extent of intrinsic disorder in the complete proteomes of genotypes of the four human dengue viruses, analyzed the peculiarities of the disorder distribution within individual dengue virus proteins, and examined potential roles of structural disorder in their functions. The dengue virus proteins were collected from UniProt, and I used a wide spectrum of bioinformatics techniques to generate 20 types of structural and functional annotations for these proteins.
bi-BPCA is a method for imputing missing values in microarray data. For each missing entry, it uses biclustering to find the genes (rows) and experimental conditions (columns) most correlated with that entry, and then imputes the resulting bicluster with an existing method, BPCA (Bayesian Principal Component Analysis). I also developed an automatic parameter learning scheme to obtain the optimal parameters for this method.
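The idea of selecting the genes most correlated with a row that contains a missing entry can be sketched as below. This is a simplified stand-in, not the bi-BPCA biclustering or BPCA itself: it ranks rows by absolute Pearson correlation over the columns observed in both rows, with missing values represented as `None`. The function names and the minimum of three shared columns are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def pearson_on_shared(a: Row, b: Row) -> Optional[float]:
    """Pearson correlation over columns observed in both rows, or None if undefined."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 3:  # too few shared observations (illustrative cutoff)
        return None
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:  # constant row: correlation is undefined
        return None
    return cov / math.sqrt(var_x * var_y)


def most_correlated_rows(matrix: Sequence[Row], target: int, k: int) -> List[int]:
    """Indices of the k rows with the highest |correlation| to the target row."""
    scored = []
    for idx, row in enumerate(matrix):
        if idx == target:
            continue
        r = pearson_on_shared(matrix[target], row)
        if r is not None:
            scored.append((abs(r), idx))
    # Sort by descending |r|; break ties by row index for a deterministic result.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]
```

The absolute value is used so that strongly anti-correlated genes are also selected, since they carry as much information about the missing entry as positively correlated ones. The same ranking could be applied to columns by transposing the matrix.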