GEMbench: Benchmark of Omics Data and Integration Methods

Data used for this Study


The data for our study were obtained from several sources:

EMTAB-37

Source: Cell Line
Data type: Microarray data
Obtained: March 2019
Number of Samples: 317
From: ebi.ac.uk/arrayexpress/experiments/E-MTAB-37/samples/

ArrayExpress is an archive that stores high-throughput functional genomics data and provides it to the research community for reuse. The transcriptomic profiles of various cancer cell lines were downloaded from ebi.ac.uk/arrayexpress. The dataset comprises gene expression profiles of 317 different cancer cell lines, categorized into 57 pathological states and 28 individual tissues. In total, the dataset contains 22 012 unique genes.

HPA

Source: Cell Line
Data type: RNA Seq data
Obtained: March 2019
Number of Samples: 32
From: proteinatlas.org/about/download

The Human Protein Atlas (HPA) is a freely available database that includes the Tissue Atlas, Cell Atlas and Pathology Atlas. The Cell Atlas contains mRNA expression profiles of 64 cell lines, characterized by deep RNA sequencing. After filtering for non-metastatic human cancer cell lines, 32 cell lines were selected and their data files were downloaded.

ProteomeNCI60

Source: Cell Line
Data type: Mass Spec Proteomics data
Obtained: March 2019
Number of Samples: 59
From: proteomicsdb.org/#projects/35/256

ProteomicsDB is an effort of the Technische Universität München (TUM) dedicated to expediting the identification of various proteomes and their use across the scientific community. We used the results of the Global Proteome Analysis study, in which Gholami et al. employed MS-based proteomics to provide quantitative proteome profiles of all 59 cell lines of the NCI-60 panel. The dataset contains 10 350 identified proteins.

GSE2109

Source: Patient Data
Data type: Microarray data
Obtained: March 2019
Number of Samples: 315
From: ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2109

The Gene Expression Omnibus (GEO) is a public functional genomics data repository that stores microarray- and sequence-based data. The DNA microarray results of the Expression Project for Oncology (expO), run by the International Genomics Consortium, were downloaded from ncbi.nlm.nih.gov. Of 2 158 patient samples, the metastatic and ambiguous ones were removed, leaving 1 895 samples across 315 different cancer types. In total, the dataset contains 24 442 unique genes.
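The sample filtering described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the study: the record fields `id`, `cancer_type` and `status` are hypothetical, and treating a missing or unrecognized annotation as "ambiguous" is an assumption made for the example.

```python
def select_primary(samples):
    """Keep samples explicitly annotated as primary.

    Metastatic samples are dropped, and so are samples whose status is
    missing or unrecognized (treated here as "ambiguous"; an assumption).
    """
    kept = []
    for sample in samples:
        # Normalize the hypothetical "status" field before comparing.
        status = (sample.get("status") or "").strip().lower()
        if status == "primary":
            kept.append(sample)
    return kept


# Toy metadata records; all field names and values are illustrative only.
samples = [
    {"id": "s1", "cancer_type": "breast", "status": "Primary"},
    {"id": "s2", "cancer_type": "colon", "status": "Metastatic"},
    {"id": "s3", "cancer_type": "kidney", "status": ""},
    {"id": "s4", "cancer_type": "breast", "status": "primary"},
    {"id": "s5", "cancer_type": "lung", "status": None},
]

kept = select_primary(samples)
cancer_types = sorted({s["cancer_type"] for s in kept})
print(len(kept), cancer_types)
```

Counting the distinct `cancer_type` values among the retained records gives the number of cancer types represented after filtering.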

TCGA

Source: Patient Data
Data type: RNA Seq data
Obtained: March 2019
Number of Samples: 202
From: portal.gdc.cancer.gov/legacy-archive/search/f

The Cancer Genome Atlas (TCGA) is a joint effort between the National Cancer Institute and the National Human Genome Research Institute that generated high-throughput data. The RNA-seq data were obtained through the GDC Data Portal. The portal was searched for all RNA-seq cases with the HTSeq-FPKM workflow type, which returned 11 571 files for 10 672 cases. The metadata file was downloaded and only files from primary tumour or primary blood-derived cancer (bone marrow or peripheral blood) samples were selected. Finally, 10 322 files were downloaded through the GDC API. We merged files of the same disease type by averaging their expression values, obtaining 202 samples of different cancer types for the current study.
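The averaging step can be illustrated with a short sketch. It assumes each downloaded file has already been parsed into a gene-to-FPKM mapping and that a separate mapping assigns each file a disease type; the function name, the data layout and the toy values are all hypothetical, and the exact disease grouping used in the study is not specified here.

```python
from collections import defaultdict


def average_by_disease(profiles, disease_of):
    """Average per-gene expression over all files sharing a disease type.

    profiles:   {file_id: {gene: fpkm}}
    disease_of: {file_id: disease_type}
    Returns     {disease_type: {gene: mean_fpkm}}
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for file_id, genes in profiles.items():
        disease = disease_of[file_id]
        for gene, value in genes.items():
            sums[disease][gene] += value
            counts[disease][gene] += 1
    # Divide each per-gene sum by the number of files that reported the gene.
    return {
        disease: {g: total / counts[disease][g] for g, total in gene_sums.items()}
        for disease, gene_sums in sums.items()
    }


# Toy input: three files covering two disease types (illustrative values).
profiles = {
    "file_a": {"TP53": 2.0, "EGFR": 4.0},
    "file_b": {"TP53": 4.0, "EGFR": 8.0},
    "file_c": {"TP53": 1.0, "EGFR": 1.0},
}
disease_of = {"file_a": "BRCA", "file_b": "BRCA", "file_c": "LAML"}

averaged = average_by_disease(profiles, disease_of)
print(averaged)
```

Each disease type ends up as a single averaged profile, which is how many files can collapse into a much smaller number of samples.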

ProteomePatients

Source: Patient Data
Data type: Mass Spec Proteomics data
Obtained: March 2019
Number of Samples: 10
From: pubmed.ncbi.nlm.nih.gov/27924013

The ProteomeXchange (PX) Consortium provides a standard portal for the submission and dissemination of mass spectrometry proteomics data. We extracted 10 data sets covering four different cancer types and containing 11 961 identified proteins.