Monday, March 4, 2013

Extended topological / topochemical descriptors? Wow!

Here's a big tip to other practitioners of chemical informatics methods, especially those of you (like me) who have a great affection for open source software:  do not be fooled by false modesty!  If you read the brief description accompanying the free, open source cross-platform molecular descriptor calculator PaDEL, you may not be overly impressed.  If you do not read past the statement:

"The descriptors and fingerprints are calculated using The Chemistry Development Kit with some additional descriptors and fingerprints."

then you risk assuming that your old version of CDK (or Dragon, JOELib, or whatever you use for descriptor calculation) is probably adequate.  I nearly did.  I'm thrilled that I didn't!

The truth is that "some additional descriptors" includes atom type electrotopological state descriptors and extended topochemical atom (ETA) descriptors.  I had never heard of these descriptors before, and I am under the impression that they are not offered in any of the other widely distributed descriptor calculator tools, but as I went into a recent QSPR model development project I decided to include them just to see what they might contribute.  The results were stunning!

First, a little information on what I was trying to do:  I won't tell you exactly what I was trying to model because it hasn't been published yet, but suffice it to say that it's an assay of substantial relevance to drug delivery that has not been particularly well modeled previously.  Starting off with 800+ descriptors, I wanted to keep an open mind regarding specific descriptor types that might be relevant, so I began with five-fold cross-validated feature selection and retained for further consideration only those descriptors that met the minimum significance threshold in at least two of the five fold models.  This reduced the feature tally to 31.  I then applied a fairly simple-minded classification algorithm that supported further discarding of features that failed to meet minimum weights in the resulting model.  This produced a 12-parameter model that achieved R² = 0.48 on a very diverse test set one third the size of the original training set.  Not bad for a fairly tough target.
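For readers who want to try the "keep it only if it survives in at least two folds" idea, here is a minimal sketch in plain Python.  It is not the procedure used for the model above:  I'm assuming, purely for illustration, that "significance" means the absolute Pearson correlation between a descriptor and the target meets a cutoff on each fold's training portion.  The names stable_features, min_abs_r and min_folds are hypothetical, and the default cutoff of 0.3 is arbitrary.

```python
import math
import random
from collections import Counter


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def stable_features(X, y, n_folds=5, min_abs_r=0.3, min_folds=2, seed=0):
    """Return column indices of X that pass the cutoff in at least min_folds folds.

    X is a list of rows (one row per compound), y is the list of target values.
    The correlation cutoff is an assumed stand-in for a significance test.
    """
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    # Split the shuffled compound indices into n_folds disjoint held-out sets.
    folds = [order[i::n_folds] for i in range(n_folds)]

    votes = Counter()
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        y_train = [y[i] for i in train]
        for j in range(len(X[0])):
            column = [X[i][j] for i in train]
            if abs(pearson_r(column, y_train)) >= min_abs_r:
                votes[j] += 1

    return sorted(j for j, count in votes.items() if count >= min_folds)
```

Calling stable_features(X, y) on a descriptor table returns the indices of the columns that would go forward to model building.  Raising min_folds makes the filter stricter, and swapping pearson_r for a proper statistical test would be the obvious next step.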

But here's the really cool news:  of the 31 parameters chosen by the rigorous cross-validated feature selection, 18 were plucked from the extended topological/topochemical set!  Of the 12 parameters in the final model, fully NINE were from this set!  This parameter set is apparently capable of dominating the modeling of a complex pharmacological property assembled over structurally very diverse training and test sets.  I am accustomed to seeing substantial informational redundancy within a given descriptor set (even the so-called diversity metrics such as BCUT).  To put it mildly, I am very unaccustomed to seeing informational diversity within a single feature class that dwarfs the orthogonal information content of all of the other feature classes combined!


----------------------------------------------------

Posts in this blog represent the honest opinions of Gerald Lushington and have not been affected by commercial interests or other inducements.