
Curt M. Breneman Research Group


Modeling and Mining || Small Molecules || DNA & RNA || Proteins

Modeling and Mining
  • ROMS
    The RECCR Online Modeling System (ROMS) is a general-purpose, web-based machine learning system. Users upload a data set through the web client, generate a model with one of the available learning methods, and visualize its performance. Three learning methods are provided: Partial Least Squares (PLS), Kernel-PLS, and Support Vector Machines (SVM). In addition to basic modeling functionality, cross-validation methods such as Leave-One-Out (LOO) and Monte Carlo Cross-Validation (MCCV) are provided for model parameter selection.
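
    As a rough illustration of how LOO and MCCV can drive parameter selection, the sketch below generates both kinds of train/test splits and picks the parameter value with the lowest held-out error. It is not ROMS code: the one-feature ridge model is a deliberately simple stand-in for PLS, Kernel-PLS, or SVM, and every function name here is hypothetical.

```python
import random

def loo_splits(n):
    """Leave-One-Out: each sample is the test set exactly once."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, [i]

def mccv_splits(n, n_iter, test_fraction, seed=0):
    """Monte Carlo CV: repeated random train/test partitions."""
    rng = random.Random(seed)
    n_test = max(1, int(round(n * test_fraction)))
    for _ in range(n_iter):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]

def fit_ridge_1d(x, y, lam):
    """Stand-in model (assumption): one-feature ridge regression through the origin."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def cv_error(x, y, lam, splits):
    """Mean squared prediction error over all held-out samples."""
    sq_err, count = 0.0, 0
    for train, test in splits:
        slope = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            sq_err += (y[i] - slope * x[i]) ** 2
            count += 1
    return sq_err / count

def select_parameter(x, y, candidates, make_splits):
    """Return the candidate value with the lowest cross-validated error."""
    return min(candidates, key=lambda lam: cv_error(x, y, lam, make_splits()))
```

    For example, `select_parameter(x, y, [0.0, 1.0, 10.0], lambda: loo_splits(len(x)))` selects by LOO, and swapping in `mccv_splits` selects by MCCV. LOO is deterministic but needs one fit per sample; MCCV trades that for a tunable number of random partitions.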

  • DMTL
    The Data Mining Template Library (DMTL) supports the mining of increasingly complex and informative pattern types in structured and unstructured datasets, including Itemsets, Sequences, Trees and Graphs (See Fig. 1). DMTL is a C++ library of highly efficient algorithms and data structures. It takes a generic data mining approach in which all aspects of mining are controlled via a set of properties. Another novel feature of DMTL is that it provides transparent persistency and indexing support for effective computation over massive datasets. We have successfully mined datasets in the 60-100 GB range using a desktop PC! DMTL has been publicly released as open-source software on SourceForge, and it has already been downloaded by over 2000 researchers from all over the world.
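
    To make the simplest of these pattern types concrete, the sketch below mines frequent itemsets with a level-wise (Apriori-style) search. It is written in Python for brevity and does not use or imitate the DMTL C++ API; the function name and the choice of algorithm are assumptions made for illustration only.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = [s for s in (frozenset([i]) for i in items) if support(s) >= min_support]

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: unite pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        previous = set(current)
        current = [
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
            and support(c) >= min_support
        ]
    return result
```

    The prune step relies on the fact that no itemset can be more frequent than any of its subsets, which is what keeps the candidate set small at each level.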

Small Molecules
  • RECON
    RECON is an algorithm for the rapid reconstruction of molecular charge densities and charge density-based electronic properties of molecules, using atomic charge density fragments pre-computed from ab initio wavefunctions. These are known as Transferable Atom Equivalents, or "TAEs". The method is based on Bader's quantum theory of Atoms in Molecules.

  • PEST
    PEST Shape/Property hybrid descriptor technology, developed in DDASSL, allows better representation of the kinds of intermolecular interactions that are dependent on molecular shape. The inclusion of PEST descriptors has been found to significantly improve QSPR models where intermolecular interactions play an important role in the chemical effects being modeled. PEST descriptors are generated using TAE molecular surface representations to define property-encoded boundaries similar to the Zauhar "Shape Signature" ray-tracing approach to shape/property convolution.

DNA & RNA
  • DIXEL
    DIXEL is a web-based descriptor generator that provides a TAE-based representation of the electronic properties of the major or minor grooves of DNA. DIXEL represents electron density features such as electrostatic potential (EP) and local average ionization potential (PIP) on the accessible surfaces of the major or minor groove on a grid of rectangles -- the "Dixel" coordinate system. These features can be displayed graphically and/or employed as input to data mining algorithms.

  • Mfold
    The objective of the Mfold web server for nucleic acid folding and hybridization prediction is to provide easy access to RNA and DNA folding and hybridization software to the scientific community at large. By making use of universally available web GUIs (Graphical User Interfaces), the server circumvents the problem of portability of this software. Detailed output, in the form of structure plots with or without reliability information, single strand frequency plots and 'energy dot plots', is available for the folding of single sequences.
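
    To give a feel for what a folding prediction computes, the sketch below uses the classic Nussinov dynamic program, which simply maximizes the number of nested base pairs. This is a much simpler technique than the free-energy minimization that folding servers such as Mfold perform, and it is not Mfold code; the function name, the allowed pairs (Watson-Crick plus G-U), and the minimum hairpin loop of three bases are assumptions of this example.

```python
def nussinov(seq, min_loop=3):
    """Return (max number of nested base pairs, dot-bracket structure) for an RNA string."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0, ""

    # best[i][j] = maximum number of pairs within seq[i..j]; short spans stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])        # i or j unpaired
            if (seq[i], seq[j]) in pairs:
                score = max(score, best[i + 1][j - 1] + 1)     # i pairs with j
            for k in range(i + 1, j):
                score = max(score, best[i][k] + best[k + 1][j])  # two sub-structures
            best[i][j] = score

    # Traceback: recover one optimal structure from the filled table.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
        elif best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
        elif (seq[i], seq[j]) in pairs and best[i][j] == best[i + 1][j - 1] + 1:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if best[i][j] == best[i][k] + best[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return best[0][n - 1], "".join(structure)
```

    Calling `nussinov("GGGAAAUCC")` returns `(3, "(((...)))")`, a single hairpin with a three-base loop. Energy-based methods replace the pair count with stacking and loop energies, but the recursion over subsequences has the same shape.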

Proteins
  • Protein Recon
    Protein Recon is a version of the RECON/TAE program optimized for use with proteins, allowing users to rapidly produce a set of descriptors that can characterize protein behavior. It is an algorithm for the rapid reconstruction of molecular charge density-based electronic properties of proteins, using peptide fragments precomputed from ab initio wavefunctions. These properties can be displayed graphically and/or employed as input to data mining algorithms.

  • WebPDB
    WebPDB is a flexible, web-based workflow system for semi-automatic protein structure cleaning. The protein data may be provided by the user or downloaded directly from the PDB archive as part of the automated workflow. WebPDB prepares proteins for use in virtual screening and predictive modeling. It removes gaps (through self-homology with FASTA information), heteroatoms and ligands (for re-use). In its next generation, WebPDB will produce pH-sensitive protein surface descriptors that take into account appropriate protonation states and fractional protonation/deprotonation of basic and acidic side chain groups. Coupled with other modeling tools, WebPDB can be useful in probe development and the interpretation of secondary screening results through docking and scoring computations.
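
    One of the cleaning steps described above, separating heteroatoms and ligands from the protein, can be sketched in a few lines. This is not WebPDB code and covers none of its gap repair or protonation handling; it only relies on the fixed-column PDB format, where the record name occupies columns 1-6 and the residue name columns 18-20. The function names are hypothetical.

```python
def split_pdb_records(pdb_text):
    """Separate polymer ATOM records from HETATM records in PDB-format text."""
    protein_lines, hetero_lines = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # record name, columns 1-6
        if record == "ATOM":
            protein_lines.append(line)
        elif record == "HETATM":
            hetero_lines.append(line)
        # Other records (HEADER, REMARK, TER, END, ...) are dropped in this sketch.
    return protein_lines, hetero_lines

def hetero_residue_names(hetero_lines):
    """Sorted residue names (columns 18-20) of the separated heteroatoms."""
    return sorted({line[17:20].strip() for line in hetero_lines})
```

    Returning the HETATM lines instead of discarding them keeps ligands and waters available for later re-use, for example as reference poses in docking.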

    The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence-structure motifs and the HMMSTR model for local structure in proteins to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction.

  • SCALI
    Proteins of the same class often share a secondary structure packing arrangement but differ in how the secondary structure units are ordered in the sequence. We find that proteins that share a common core also share local sequence-structure similarities, and these can be exploited to align structures with different topologies. In this study, segments from a library of local sequence-structure alignments were assembled hierarchically, enforcing compactness and conserved inter-residue contacts but not sequential ordering. Previous structure-based alignment methods often ignore sequence similarity, local structural equivalence and compactness. SCALI (Structural Core ALIgnment) can efficiently find conserved packing arrangements, even if they are nonsequentially ordered in space. SCALI alignments conserve remote sequence similarity and contain fewer alignment errors. Clustering of our pairwise non-sequential alignments shows that recurrent packing arrangements exist in topologically different structures.

  • MASKER contacts & MASKER voids
    A fast algorithm for computing the solvent accessible molecular surface area (SAS) using Boolean masks (Le Grand, S. M. & Merz, K. M. J. (1993). J. Comp. Chem. 14, 349-52.) has been modified to estimate the solvent excluded molecular surface area (SES), including contact, toroidal and reentrant surface components. Numerical estimates of the arc lengths of intersecting atomic SAS are used to estimate the toroidal surface, and intersections between those arcs are used to estimate the reentrant surface area. The new method is compared to an exact analytical method. Boolean molecular surface areas are continuous and pairwise differentiable, and should be useful for molecular dynamics simulations, especially as the basis for an implicit solvent model. MASKER contacts finds the surface area burial by residue in a protein, while MASKER voids finds the locations of empty cavities in proteins (or any molecule).
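
    The idea of estimating accessible area from a fixed set of points on each atomic sphere can be sketched as follows. This is a plain point-sampling estimate in the Shrake-Rupley style, not the Boolean-mask lookup of Le Grand and Merz and not MASKER code, and it covers only the SAS, not the SES components. The function names, the 1.4 Å probe radius, and the point count are assumptions of this example.

```python
import math

def sphere_points(n):
    """n roughly uniform points on the unit sphere (golden-spiral lattice)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((r * math.cos(golden * k), r * math.sin(golden * k), z))
    return pts

def sas_areas(atoms, probe=1.4, n_points=500):
    """Per-atom solvent accessible area; atoms is a list of (x, y, z, radius)."""
    unit = sphere_points(n_points)
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe  # radius of the probe-expanded sphere
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            # A point is buried if it lies inside any other expanded sphere.
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms)
                if j != i
            )
            if not buried:
                exposed += 1
        # Exposed fraction of points times the full expanded-sphere area.
        areas.append(4.0 * math.pi * big_r ** 2 * exposed / n_points)
    return areas
```

    An isolated atom keeps its full expanded-sphere area, and the estimate for overlapping atoms converges as `n_points` grows. A mask-based method gains its speed by precomputing which sphere points a neighbor at a given direction and distance would bury, rather than testing every point against every neighbor as this sketch does.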

RECCR ©2005 Curt M. Breneman