Tuesday, April 15, 2014

Drugster: an emerging platform for structure-based drug discovery

I've recently come across a fairly new computing environment for structure-based design, called Drugster.  To some extent the service over-promises, because it calls itself "A comprehensive drug design, lead and structure optimization suite".  To me, 'lead optimization' promises the informational underpinnings for fully understanding why specific compounds have proven potent toward a given endpoint, and for then systematically acting on that knowledge to recommend new compounds for synthesis and testing.  In practical terms, that would require tools for computationally learning and validating SAR rules (i.e., QSAR and comparable data mining techniques) plus an objective framework for characterizing the pharmacophore.  Drugster does not have these.

However, just because Drugster doesn't necessarily live up to its full billing doesn't mean that it's not a potentially useful (perhaps even important) new piece of software.  The technical areas that it does support (ligand and receptor structural preparation, docking, pose refinement and interaction scoring) appear to be implemented in a very useful manner, and embedded in a pipeline environment that can accomplish a lot in terms of understanding the structural underpinnings of a given ligand interacting with a given target.  The components that it builds on (GROMACS, DOCK and LigBuilder) are all powerful and well-regarded tools.  Assembling the middleware necessary to provide seamless information flow between these utilities is definitely a valuable contribution.

What I cannot vouch for at this point is just how user-friendly it is.  I have not used it much because inertia has sustained my loyalty to other comparable tools for structure preparation, refinement and docking.  I would be most interested in hearing other people's comments, however.

Friday, September 27, 2013

Biological screening hit rate

Please file this in the "for what it's worth" department.  I was going to use this in a paper I'm writing, but I've decided to slant my argument in another direction.

Anyway, given a huge surge of activity in chemical biology screening in the last decade or so, what is the likelihood that a given tested compound will be a biological hit?  If you take PubChem at face value, the answer is about 1.1%: for ballpark purposes one might observe that as of Sept. 25, 2013, the PubChem database reported that 2,306,975 hits had been documented from among 207,698,183 bioassay data points.

I was then going to compare this with the old Yvonne Martin rule of thumb that a Tanimoto coefficient of molecular similarity greater than 0.85 suggests a 30% chance that a pair of molecules will have analogous bioactivity.  Can we conclude that the Martin rule gives a roughly 27-fold enhancement in hit detection over purely random pair selection?
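For what it's worth, the arithmetic (and the Tanimoto coefficient itself) can be sketched in a few lines of Python.  The fingerprints below are toy bit-sets, not real structural fingerprints:

```python
# Tanimoto similarity between two fingerprints, represented as sets of "on" bit indices.
def tanimoto(fp_a, fp_b):
    """|A & B| / |A | B| -- 1.0 for identical fingerprints, 0.0 for disjoint ones."""
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Toy fingerprints (bit indices are illustrative, not from a real fingerprint scheme).
mol1 = {1, 4, 7, 9, 12, 15, 21}
mol2 = {1, 4, 7, 9, 12, 15, 33}
print(round(tanimoto(mol1, mol2), 3))  # 6 shared of 8 total bits -> 0.75

# Back-of-the-envelope enrichment: Martin's ~30% hit-transfer rate against the
# ~1.1% PubChem base rate works out to roughly 27-fold.
base_rate = 2306975 / 207698183   # PubChem hits / bioassay data points, Sept. 2013
print(round(0.30 / base_rate, 1))
```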

Well, nah.  Maybe.  I dunno.  The fact of the matter is that much has changed in the 12 years since Martin made her assessment.  For starters, nearly the entire mammoth PubChem data collection has accumulated since then.  Couple that with revolutionary enhancement of screening collections, an ever-shifting definition of what constitutes a hit, and an emerging understanding of aggregation and reactivity as effectors of promiscuity, and it's clear that we need a whole new study.

We also could use a new consensus way of defining molecular similarity, but that's another story.


Monday, July 29, 2013

Smalter Hall / Lushington Mechanism of Action work selected top paper of BioMedChem 2012


It somehow escaped my attention earlier, but the paper that Aaron Smalter Hall and I wrote last year for the BioMedChem 2012 meeting in Kos, Greece, was selected as the top paper of the conference.  See:  http://www.wseas.org/wseas/cms.action?id=3924.

The paper introduced a biclustering-based protocol that we devised for mechanism of action elucidation in complex multimodal data sets.  We're currently assembling a followup paper in which we present practical applications of the method to finding subsets of high throughput phenotypic and ADME chemical screens, wherein each subset contains molecules that appear to have analogous structure-activity relationships (a key indicator of common mechanism of action).

The official citation of our introductory paper is:

Aaron Smalter Hall & Gerald H. Lushington.  Discovering Mechanisms of Action in Chemical Structure-Activity Data Sets.  Recent Advances in Biology and Biomedicine Series.  2012, 1: 47-53.

Tuesday, May 7, 2013

Arange:  a tool for mining docking poses

I frequently use this blog and others to promote interesting and useful chemical informatics / molecular modeling tools that I've come across in my consulting work.  Although I enjoy programming, I have to admit that the plethora of quality, free, open source tools ripe for the picking means that I rarely have to string code together anymore.  That said, though, I do feel the need to give back a little; not just advocating for the selfless developers whose programs I use every day, but maybe contributing a bit of my own methodology.

So here's the first step:  a Lushington in Silico SourceForge page on which I will be stashing some of my more robust tools as I find the time.  Gerald Lushington's first contribution to the repository is Arange, a tool for mining molecular docking poses.

Mining docking poses, you might ask?  Why?

The answer is that many people tend to approach molecular docking with a narrow and somewhat prejudiced mind.  Specifically, we (myself included) often scan through the docked poses looking for our preconceived notion of what the pharmacophore should be.  Perhaps we have prior NMR data suggesting that some key interactions should be conserved.  Perhaps we home right in on the subset of poses that are conserved across a family of inhibitors.  With these and related mindsets, we often skip over lots of poses on the grounds that they probably do not represent a stable bound conformation, but much of that conformational data is still potentially indicative of metastable or transient states that reflect aspects of the overall dynamic interaction, and that contribute to the entropic favorability of a ligand for the receptor.  Such transient interactions can be important because a ligand never jumps straight into its ultimate bound conformation:  it bounces from surface to surface, either gradually moving toward the best binding spot or sometimes being expelled, only to try again later.  Favorable transient interactions in the right places on the receptor can thus kinetically expedite a ligand's approach toward its final binding conformer.

These transient interactions can also be useful for assembling chimeric ligands that exploit more than one interaction surface.

So, what Arange does is to examine all docking poses made by known active compounds and contrast their spatial distribution with that of poses made by inactive compounds.  For each unique atom type, a weight is assigned to each point of a 3D grid according to the following formula:

[weighting formula rendered as an image in the original post]

where in the above, most terms are self-explanatory except for IACT, a simple factor equal to 1.0 if the compound is a known active and -1.0 if it is a known inactive.  This allows the method to identify regions of the binding site that discriminate between active and inactive compounds.  Specifically, if both actives and inactives bind to a given region with similar efficacy, that region will receive a score close to zero and will be considered pharmacophorically irrelevant.  If active ligands bind to a region favorably and inactive ones do not, then that region's interactions will be considered pharmacophorically favorable.  The reverse is true for regions that predominantly favor interaction with inactive compounds.
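Since the formula itself is an image, here is a rough Python sketch of the sign convention it describes.  This is a hypothetical, simplified re-implementation, not Arange's actual code: the Gaussian distance weighting, the 1 Å grid spacing, and the data layout are all assumptions made for illustration:

```python
from collections import defaultdict
from math import exp

def grid_weights(poses, spacing=1.0, sigma=1.0):
    """Accumulate signed, per-atom-type weights on a 3D grid.

    poses: list of (is_active, atoms), where atoms is a list of
           (atom_type, (x, y, z)) tuples taken from docking poses.
    Each atom deposits a Gaussian-weighted contribution on its nearest
    grid point, with sign IACT = +1.0 for actives and -1.0 for inactives.
    """
    weights = defaultdict(float)  # (atom_type, grid_point) -> signed weight
    for is_active, atoms in poses:
        iact = 1.0 if is_active else -1.0
        for atom_type, (x, y, z) in atoms:
            gp = tuple(round(c / spacing) * spacing for c in (x, y, z))
            dist2 = sum((c - g) ** 2 for c, g in zip((x, y, z), gp))
            weights[(atom_type, gp)] += iact * exp(-dist2 / (2 * sigma ** 2))
    return weights

# Two actives and one inactive placing a carbon near the same grid point:
poses = [
    (True,  [("C", (1.1, 0.0, 0.0))]),
    (True,  [("C", (0.9, 0.0, 0.0))]),
    (False, [("C", (1.0, 0.0, 0.0))]),
]
w = grid_weights(poses)
print(w[("C", (1.0, 0.0, 0.0))] > 0)  # True: net positive, i.e. favored by actives
```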

In its current form, Arange processes the output of a Surflex docking simulation and generates a spatial depiction of pharmacophorically favorable (green dots) and unfavorable (red dots) regions that can be loaded into PyMol for plotting next to a receptor model, as per:

[PyMol rendering shown as an image in the original post]

where in the above, the size of the spheres indicates the significance (i.e., weight) of the pharmacophoric interaction at that specific grid point.  The above graphic reflects carbon atom interactions (i.e., lipophiles), but analogous plots are provided for N, O, F, P, S, Cl and Br.

Please don't hesitate to take a look at the code and provide comments.  If people are interested, the code can be readily extended to interfacing with other docking software, and can be made a bit smarter (i.e., so as to differentiate between different valence states of the various atom types).





Tuesday, April 2, 2013

Protein structural alignment: a bit of control with LOVOAlign

The Different Aims of Structural Alignment

Several weeks ago I posted a series of pieces about protein structural alignment tools (see the Gerald Lushington Tumblr Blog for the last of the posts) that were geared toward perceiving pharmacophorically similar binding sites across different targets.  My underlying aim in promoting those tools was to emphasize that they represent a somewhat newer class of protein alignment tools, relative to the original batch, which was primarily focused on superimposing structures according to holistic properties such as conserved folds and domains.  In truth, there are still other reasons why one might want to do some sort of protein structural alignment.  I would encourage readers to suggest those that occur to them, but the one that I recently grappled with had to do with predicting protein-protein interaction complexes.

The most commonly used computational tools for predicting how multiple proteins associate with each other are:
  1. homology modeling:  if someone has crystallized a bound complex of proteins A and B, and you want to predict how A and C will bind, you could use any sequence homology between B and C to align C to the position / orientation / conformation of the complexed B structure
  2. protein docking:  if there isn't a viable template for a bound AB complex, you can systematically sample the translational and rotational coordinates of C within the region of A to see what orientation would produce the best steric and electrostatic complementarity between A and C.
While the above pair of methods might seem to provide reasonable coverage of different modeling scenarios, there is a distinct opportunity that neither is well suited to exploit:  there are numerous crystal structures that span more than one bound protein, but in which large fractions of one or more of those proteins have not been resolved.  Another related scenario is where you have a relatively complete structure of A bound to B, but B is much smaller than C, or only a portion of B near the interaction surface is structurally related to C.  In such cases, homology modeling may fail because the structures of B and C are too different to support accurate comparative analysis.  Docking simulations might work, but there is no guarantee they would behave any more reliably than average (which is, in general, fairly sketchy).  The answer, of course, is not too outrageous:  a computer program that seeks only to align a user-defined subset of C's structure to a subset of B.  This is one of the potentially very useful services that LOVOAlign can accomplish.  In addition to being fairly easy to use, the program is fast and capable of a broad range of other protein-protein alignment tasks (including the aforementioned goal of pharmacophore alignment).  And better yet, LOVOAlign is available for free and is provided with open source access.
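The subset-restricted superposition idea can be illustrated with the classic Kabsch algorithm:  feed it only the matched coordinates of the user-selected regions (e.g., interface CA atoms), rather than the full chains.  The sketch below is a generic NumPy illustration, not LOVOAlign's actual method (which uses Low Order Value Optimization rather than a plain least-squares fit):

```python
import numpy as np

def kabsch_superpose(mobile, target):
    """Least-squares rigid superposition of `mobile` onto `target`.

    mobile, target: (N, 3) arrays of matched coordinates -- e.g. only the
    CA atoms of a user-selected interface region, not the whole chain.
    Returns the rotated/translated copy of `mobile` and the RMSD.
    """
    mob_c = mobile - mobile.mean(axis=0)
    tgt_c = target - target.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix (Kabsch).
    u, _, vt = np.linalg.svd(mob_c.T @ tgt_c)
    sign = np.sign(np.linalg.det(u @ vt))  # guard against reflections
    moved = mob_c @ u @ np.diag([1.0, 1.0, sign]) @ vt + target.mean(axis=0)
    rmsd = float(np.sqrt(((moved - target) ** 2).sum() / len(target)))
    return moved, rmsd

# Toy check: a rotated and translated copy of four points superposes back exactly.
pts = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0],
                  [np.sin(theta),  np.cos(theta), 0],
                  [0, 0, 1.0]])
moved, rmsd = kabsch_superpose(pts @ rot_z + 5.0, pts)
print(rmsd < 1e-6)  # True
```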



----------------------------------------------------

Posts in this blog represent the honest opinions of Gerald Lushington (click here for CV) and have not been affected by commercial interests or other inducements.

Monday, March 4, 2013

Extended topological / topochemical descriptors? Wow!

Here's a big tip to other practitioners of chemical informatics methods, especially those of you (like me) who have a great affection for open source software:  do not be fooled by false modesty!  If you read the brief description accompanying the free, open source cross-platform molecular descriptor calculator PaDEL, you may not be overly impressed.  If you do not read past the statement:

"The descriptors and fingerprints are calculated using The Chemistry Development Kit with some additional descriptors and fingerprints."

 then you risk assuming that your old version of CDK (or Dragon, JOELib, or whatever you use for descriptor calculation) is probably adequate.  I nearly did.  I'm thrilled that I didn't!

The truth is that "some additional descriptors" includes atom type electrotopological state and extended topochemical atom (ETA) descriptors.  I had never heard of these descriptors before, and I was under the impression that they are not offered in any of the other widely distributed descriptor calculator tools, but as I went into a recent QSPR model development project I decided to include them just to see what they might contribute.  The results were stunning!

First, a little information on what I was trying to do:  I won't tell you exactly what I was trying to model because it hasn't been published yet, but suffice it to say that it's an assay of substantial relevance to drug delivery that has not been particularly well modeled previously.  Starting off with 800+ descriptors, I wanted to keep an open mind regarding specific descriptor types that might be relevant, so I started off with five-fold cross-validated feature selection and retained for further consideration only those descriptors that met the minimum significance threshold in at least two fold models.  This reduced the feature tally down to 31.  I then applied a fairly simple-minded classification algorithm that supported further discarding of features that failed to meet minimum weights in the resulting model.  This produced a 12 parameter model that achieved a predictivity level of R2 = 0.48 relative to a very diverse test set a third the size of the original training set.  Not bad for a fairly tough target.
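The fold-consensus selection step can be sketched as follows, with a simple Pearson-correlation filter standing in for the actual significance test (the thresholds and toy data are purely illustrative):

```python
import random
from statistics import mean

def cv_feature_selection(X, y, n_folds=5, r_min=0.3, min_folds=2, seed=0):
    """Keep features whose |Pearson r| with y exceeds r_min in >= min_folds folds.

    X: list of feature vectors (rows = samples); y: list of responses.
    """
    def pearson(a, b):
        ma, mb = mean(a), mean(b)
        cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
        var = (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b)) ** 0.5
        return cov / var if var else 0.0

    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    hits = [0] * len(X[0])  # per-feature count of folds where it was significant
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        for j in range(len(X[0])):
            col = [X[i][j] for i in train]
            if abs(pearson(col, [y[i] for i in train])) >= r_min:
                hits[j] += 1
    return [j for j, h in enumerate(hits) if h >= min_folds]

# Toy data: feature 0 tracks the response, feature 1 is pure noise.
rng = random.Random(1)
y = [rng.random() for _ in range(50)]
X = [[yi + 0.1 * rng.random(), rng.random()] for yi in y]
print(cv_feature_selection(X, y))  # feature 0 should survive the consensus filter
```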

But here's the really cool news:  of the 31 parameters chosen by the rigorous cross-validated feature selection, 18 were plucked from the extended topological/topochemical set!  Of the 12 parameters in the final model, fully NINE were from this set!  This parameter set is apparently capable of dominating the modeling of a complex pharmacological property assembled over structurally very diverse training and test sets.  I am accustomed to seeing substantial informational redundancy within a given descriptor set (even the so-called diversity metrics such as BCUT).  To put it mildly, I am very unaccustomed to seeing informational diversity within a given feature class that dwarfs the orthogonal information content of all of the other feature classes combined!




Tuesday, January 29, 2013

I recently posted a short review of the TechProse writing guidelines on my TechWriter blog, and am putting in a plug for it here.  While most of the comments I've posted here and on my other blogs have been oriented toward specific software tools or analysis methods, I plan to include more excellent links like the document in question.