Data for Data Fusion in Metabolomic Cancer Diagnostics


This dataset combines fluorescence spectroscopy, 1H-NMR spectroscopy (CPMG and NOESY-Presat), and biomarker measurements (TIMP-1 and CEA) on human plasma samples (sodium citrate anticoagulant). The samples come from a study of patients undergoing large bowel endoscopy due to symptoms that could be associated with colorectal cancer (CRC) (Lomholt et al. 2009; Nielsen et al. 2008). The original dataset contains case-control samples, with one case (verified colorectal cancer) and three controls for each case. The control group in this dataset consists of subjects in whom benign colorectal adenomas were found. The controls are matched by age, gender and location of tumors.


The fluorescence data are represented as PARAFAC scores (see Lawaetz et al. 2012 for details on the PARAFAC models). The NMR data are represented as PCA scores (1st component) of the integrated peaks (see Bro et al. for details). The biomarker data are log transformed (base 2). The biomarkers TIMP-1 and CEA are known to change with age and gender. This has been corrected for in the biomarker concentrations by subtracting the concentration of a matched sample from another control group (no findings). The correction was applied to both case and control samples.


The aim of our study was to show how the extended profile obtained by combining the data gives better options for discriminating between cancer and control samples.


The data are saved in one dataset with 94 samples and 476 variables. The first two variables are the biomarkers, the next 19 are the fluorescence data as PARAFAC scores, and the last 455 are the NMR peaks: the first 201 are from CPMG and the last 254 are from NOESY.
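
The column layout described above can be turned into index ranges for splitting the variables into their four blocks. The following is only a minimal sketch in Python: the block names, the helper functions and the placeholder row are assumptions made for illustration and are not part of the dataset. Note that Python slices are 0-based, whereas the column counts in the description (and the MATLAB-style dataset itself) are 1-based.

```python
# Block sizes taken from the description: 2 + 19 + 201 + 254 = 476 variables.
# The block names are hypothetical labels chosen for this sketch.
BLOCK_SIZES = [
    ("biomarkers", 2),             # TIMP-1 and CEA
    ("fluorescence_parafac", 19),  # PARAFAC scores
    ("nmr_cpmg", 201),             # NMR peaks, CPMG
    ("nmr_noesy", 254),            # NMR peaks, NOESY
]


def block_slices(block_sizes):
    """Return {name: slice} with consecutive 0-based column ranges."""
    slices = {}
    start = 0
    for name, size in block_sizes:
        slices[name] = slice(start, start + size)
        start += size
    return slices


def split_row(row, slices):
    """Split one sample (a sequence of 476 values) into its named blocks."""
    return {name: row[s] for name, s in slices.items()}


SLICES = block_slices(BLOCK_SIZES)
N_VARIABLES = sum(size for _, size in BLOCK_SIZES)

# Placeholder row, NOT real data: each value is its own 1-based column number,
# which makes it easy to see where every block starts and ends.
example_row = list(range(1, N_VARIABLES + 1))
blocks = split_row(example_row, SLICES)

for name, values in blocks.items():
    print(f"{name}: columns {values[0]}-{values[-1]} ({len(values)} variables)")
```

With real data, the same slices could be applied to each of the 94 sample rows (or to the columns of an array) once the data matrix has been exported from the dataset object.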


All class data (cancer/adenoma, case-control, age, gender) are stored in the dataset. For example, cancer/adenoma status is found in Data.class{1,1}, and the corresponding class labels in Data.classlookup{1,1}.


