Published April 7, 2008 | Version v1 | https://doi.org/10.59350/f2ajz-tmr63

The CDK/Metabolomics/Chemometrics Unconference results

As announced earlier, Miguel, Velitchka, Christoph and I held a small CDK/Metabolomics/Chemometrics unconference. We started late, and did not have an evening program, resulting in not overly much results. However, we did do molecular chemometrics.

We used the R statistics software together with Rajarshi's rcdk package (an R wrapper around the CDK library) and Ron's (my PhD supervisor) PLS package (see this paper), to predict retention indices for a number of metabolites.

We ended up with this R script:

library("rJava")
library("rcdk")
library("pls")
mols = load.molecules("data_cdk.sdf")
selection = get.desc.names()
selection = selection[-which(selection=="org.openscience.cdk.qsar.descriptors.molecular.AminoAcidCountDescriptor")]
x = eval.desc(mols, selection, verbose=TRUE)
x2 = x[,apply(x, 2, function(a) {all(!is.na(a))})]
y = read.table("data_cdk_RI")
input = data.frame(x2, y)
pls.model = plsr(V1 ~ ., 50, data=input, validation="CV")
summary(pls.model)
plot(RMSEP(pls.model))
plot(pls.model, ncomp=20)
abline(0,1, col="red")
plot(pls.model, "loadings", comps=1:2)
savehistory("finalHistory.R")

The AminoAcidCountDescriptor threw us a NullPointerException and there were a few NAs in the resulting matrix. The CV results were not so good as Velitchka's best models, but still a good start:

No variable selection; 200 objects, 190 variables.

Questions:

  • Can we do this in Bioclipse2 too?
  • Can we improve the default CDK descriptor parameters to maximize the column count?
  • Rajarshi, what would be involved to write some wrapper code for atomic descriptors for rcdk?

Additional details

Description

As announced earlier, Miguel, Velitchka, Christoph and I held a small CDK/Metabolomics/Chemometrics unconference. We started late, and did not have an evening program, resulting in not overly much results.

Identifiers

UUID
483d0307-62b8-404c-a773-d0843a8f6754
GUID
https://doi.org/10.59350/f2ajz-tmr63
URL
https://chem-bla-ics.linkedchemistry.info/2008/04/07/cdkmetabolomicschemometrics.html

Dates

Issued
2008-04-07T00:10:00
Updated
2008-04-07T00:10:00

References

  1. Mevik, B.-H., & Wehrens, R. (2007). TheplsPackage: Principal Component and Partial Least Squares Regression inR. Journal of Statistical Software, 18(2). https://doi.org/10.18637/jss.v018.i02